LessWrong 2.0 Reader

[link] AISafety.info: What is the "natural abstractions hypothesis"?
Algon · 2024-10-05T12:31:14.195Z · comments (2)

[link] Twitter thread on AI takeover scenarios
Richard_Ngo (ricraz) · 2024-07-31T00:24:33.866Z · comments (0)

Free Will and Dodging Anvils: AIXI Off-Policy
Cole Wyeth (Amyr) · 2024-08-29T22:42:24.485Z · comments (12)

[link] Characterizing stable regions in the residual stream of LLMs
Jett Janiak (jett) · 2024-09-26T13:44:58.792Z · comments (4)

Glitch Token Catalog - (Almost) a Full Clear
Lao Mein (derpherpize) · 2024-09-21T12:22:16.403Z · comments (3)

[link] A Percentage Model of a Person
Sable · 2024-10-12T17:55:07.560Z · comments (3)

[link] Turning 22 in the Pre-Apocalypse
testingthewaters · 2024-08-22T20:28:25.794Z · comments (14)

Distinguish worst-case analysis from instrumental training-gaming
Olli Järviniemi (jarviniemi) · 2024-09-05T19:13:34.443Z · comments (0)

[link] I didn't have to avoid you; I was just insecure
Chipmonk · 2024-08-17T16:41:50.237Z · comments (7)

A New Class of Glitch Tokens - BPE Subtoken Artifacts (BSA)
Lao Mein (derpherpize) · 2024-09-20T13:13:26.181Z · comments (7)

COT Scaling implies slower takeoff speeds
Logan Zoellner (logan-zoellner) · 2024-09-28T16:20:00.320Z · comments (56)

[question] What are your cruxes for imprecise probabilities / decision rules?
Anthony DiGiovanni (antimonyanthony) · 2024-07-31T15:42:27.057Z · answers+comments (29)

But Where do the Variables of my Causal Model come from?
Dalcy (Darcy) · 2024-08-09T22:07:57.395Z · comments (1)

We’re not as 3-Dimensional as We Think
silentbob · 2024-08-04T14:39:16.799Z · comments (16)

I'm creating a deep dive podcast episode about the original Leverage Research - would you like to take part?
spencerg · 2024-09-22T14:03:22.164Z · comments (2)

An anti-inductive sequence
Viliam · 2024-08-14T12:28:54.226Z · comments (10)

Debate: Is it ethical to work at AI capabilities companies?
Ben Pace (Benito) · 2024-08-14T00:18:38.846Z · comments (21)

[link] Shifting Headspaces - Transitional Beast-Mode
Jonathan Moregård (JonathanMoregard) · 2024-08-12T13:02:06.120Z · comments (9)

Open Problems in AIXI Agent Foundations
Cole Wyeth (Amyr) · 2024-09-12T15:38:59.007Z · comments (2)

Categories of leadership on technical teams
benkuhn · 2024-07-22T04:50:04.071Z · comments (0)

[link] Twitter thread on politics of AI safety
Richard_Ngo (ricraz) · 2024-07-31T00:00:34.298Z · comments (2)

LASR Labs Spring 2025 applications are open!
Erin Robertson · 2024-10-04T13:44:20.524Z · comments (0)

[link] On Fables and Nuanced Charts
Niko_McCarty (niko-2) · 2024-09-08T17:09:07.503Z · comments (2)

Book Review: On the Edge: The Gamblers
Zvi · 2024-09-24T11:50:06.065Z · comments (1)

Eye contact is effortless when you’re no longer emotionally blocked on it
Chipmonk · 2024-09-27T21:47:01.970Z · comments (24)

[link] My article in The Nation — California’s AI Safety Bill Is a Mask-Off Moment for the Industry
garrison · 2024-08-15T19:25:59.592Z · comments (0)

[link] My Model of Epistemology
adamShimi · 2024-08-31T17:01:45.472Z · comments (0)

Index of rationalist groups in the Bay Area July 2024
Lucie Philippon (lucie-philippon) · 2024-07-26T16:32:25.337Z · comments (10)

Monthly Roundup #22: September 2024
Zvi · 2024-09-17T12:20:08.297Z · comments (10)

[link] Book review: On the Edge
PeterMcCluskey · 2024-08-30T22:18:39.581Z · comments (0)

Video and transcript of presentation on Otherness and control in the age of AGI
Joe Carlsmith (joekc) · 2024-10-08T22:30:38.054Z · comments (1)

If You Can Climb Up, You Can Climb Down
jefftk (jkaufman) · 2024-07-30T00:00:06.295Z · comments (9)

(Maybe) A Bag of Heuristics is All There Is & A Bag of Heuristics is All You Need
Sodium · 2024-10-03T19:11:58.032Z · comments (15)

[link] My Apartment Art Commission Process
jenn (pixx) · 2024-08-26T18:36:44.363Z · comments (4)

Exploring SAE features in LLMs with definition trees and token lists
mwatkins · 2024-10-04T22:15:28.108Z · comments (4)

[question] If I have some money, whom should I donate it to in order to reduce expected P(doom) the most?
KvmanThinking (avery-liu) · 2024-10-03T11:31:19.974Z · answers+comments (33)

[link] The last era of human mistakes
owencb · 2024-07-24T09:58:42.116Z · comments (2)

ARENA4.0 Capstone: Hyperparameter tuning for MELBO + replication on Llama-3.2-1b-Instruct
25Hour (aaron-kaufman) · 2024-10-05T11:30:11.953Z · comments (2)

Augmenting Statistical Models with Natural Language Parameters
jsteinhardt · 2024-09-20T18:30:10.816Z · comments (0)

The murderous shortcut: a toy model of instrumental convergence
Thomas Kwa (thomas-kwa) · 2024-10-02T06:48:06.787Z · comments (0)

[link] The $100B plan with "70% risk of killing us all" w Stephen Fry [video]
Oleg Trott (oleg-trott) · 2024-07-21T20:06:39.615Z · comments (8)

DIY LessWrong Jewelry
Fluffnutt (Pear) · 2024-08-25T21:33:56.173Z · comments (0)

[link] Twitter thread on open-source AI
Richard_Ngo (ricraz) · 2024-07-31T00:26:11.655Z · comments (6)

[link] Information dark matter
Logan Kieller (logan-kieller) · 2024-10-01T15:05:41.159Z · comments (4)

[link] The Mysterious Trump Buyers on Polymarket
Annapurna (jorge-velez) · 2024-10-18T13:26:25.565Z · comments (4)

[link] AI Safety Memes Wiki
plex (ete) · 2024-07-24T18:53:04.977Z · comments (1)

[link] The Cancer Resolution?
PeterMcCluskey · 2024-07-24T00:25:17.322Z · comments (24)

My disagreements with "AGI ruin: A List of Lethalities"
Noosphere89 (sharmake-farah) · 2024-09-15T17:22:18.367Z · comments (44)

Proveably Safe Self Driving Cars [Modulo Assumptions]
Davidmanheim · 2024-09-15T13:58:19.472Z · comments (26)

Monthly Roundup #20: July 2024
Zvi · 2024-07-23T12:50:07.991Z · comments (9)

Recent comments

james-chua on LLMs can learn about themselves by introspection

Thanks Thane for your comments!

The skeptical interpretation is that the fine-tuned models learned to interpret the hypothetical the following way:
"Hypothetical": "What is the third letter in the name of the next country in this list?: Laos, Peru, Fiji".

I think what you are saying is that the words "If you were asked," don't matter here. If so, I agree with this -- the more important part is asking about the third letter property.

basic multi-step reasoning within their forward passes.

You raised a good point. Our tests use multi-step / multi-hop reasoning. Prior work has shown multi-hop reasoning, e.g. "out-of-context reasoning" (OOCR). In Section 5.2 and Figure 9, we speculate that multi-hop reasoning is the mechanism.

So what is our contribution compared to the prior work? We argue that in prior work on OOCR, the facts are logically or probabilistically implied by the training data. E.g. "Bill Clinton is the US's 42nd president". "Virginia Kelley was Bill Clinton's mother". Models can piece together the fact "Virginia Kelley is the name of the mother of the US's 42nd president" in OOCR. Two models, M1 and M2, given sufficient capability, should be able to piece together the same fact.

On the other hand, in our tests for introspection, the facts aren't implied by the training data. Two models, M1 and M2, aren't able to piece together the same fact. How do we empirically test for this? We finetune M2 on the data of M1. M2 still cannot predict facts about M1 well. Even when given more data about M1, the accuracy of M2 predicting facts about M1 plateaus. But M1 can predict facts about itself well.

We test the mirror case of M1 trying to predict M2, and we find the same result: M1 cannot predict M2 well.
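
To make the comparison concrete, here is a minimal sketch of the self-prediction versus cross-prediction check described above. It is only an illustration, not the paper's code: models are assumed to be plain callables from a prompt to an answer, and every name in it (`accuracy`, `introspection_gap`, the toy stand-in functions) is hypothetical.

```python
from typing import Callable, Sequence

# Assumption: a "model" is any callable mapping a prompt to an answer string.
Model = Callable[[str], str]


def accuracy(predictor: Model, target: Model, prompts: Sequence[str]) -> float:
    """Fraction of prompts where the predictor's guess matches the target's actual answer."""
    if not prompts:
        raise ValueError("need at least one prompt")
    hits = sum(predictor(p) == target(p) for p in prompts)
    return hits / len(prompts)


def introspection_gap(
    self_predictor: Model,
    cross_predictor: Model,
    target: Model,
    prompts: Sequence[str],
) -> float:
    """Self-prediction accuracy minus cross-prediction accuracy on the same target."""
    return accuracy(self_predictor, target, prompts) - accuracy(
        cross_predictor, target, prompts
    )


# Toy stand-ins (hypothetical, chosen only so the sketch runs end to end).
def m1_behavior(prompt: str) -> str:
    # Stand-in for what M1 actually outputs: the third letter of the prompt.
    return prompt[2]


def m1_predicting_itself(prompt: str) -> str:
    # Stand-in for M1 answering a hypothetical question about its own output.
    return prompt[2]


def m2_predicting_m1(prompt: str) -> str:
    # Stand-in for M2, finetuned on M1's data, guessing M1's output.
    return prompt[0]


prompts = ["Laos", "Peru", "Fiji"]
gap = introspection_gap(m1_predicting_itself, m2_predicting_m1, m1_behavior, prompts)
print(f"self minus cross accuracy: {gap:.2f}")
```

The mirror case is the same call with the roles of M1 and M2 swapped. The toy stand-ins are deliberately extreme and say nothing about the actual effect sizes reported in the paper.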

We also looked at whether M1 was just naturally good at predicting itself before finetuning, but there doesn't seem to be a clear trend.

Does my response above address introspection-as-this-paper-defines-it well? Or is the weakness in the argument more about the paper's definition of introspection? Thanks for responding so far -- your comments have been really valuable in improving our paper!

michael-roe on What actual bad outcome has "ethics-based" RLHF AI Alignment already prevented?

“self-reported data from demons is questionable for at least two reasons”—Scott Alexander.

He was actually talking about Internal Family Systems, but you could probably be skeptical about what malign AIs are telling you, too.

notfnofn on is there a big dictionary somewhere with all your jargon and acronyms and whatnot?

Sorry for the late reply; I wanted to provide a more detailed perspective but I didn't ultimately have time to. In a nutshell:

It's good to have quick expositions for people to get a gist of things. But I think people should be aware that getting a quick exposition does not mean they understand the concepts. We see this a lot in physics where brilliant physicists find ways to make complex concepts accessible. This is great for people with a little humility, but many suddenly think they can engage with the community of people who actually understand things at a deep level.

I would want people to be a little intimidated by the jargon before they reply to posts. Each word tends to encode a complex concept, with possibly its own prerequisites. It's usually good for people to read those more fundamental posts before they try to understand something that builds on them.

Anyways, this is all the opinion of someone very new to the site, and probably shouldn't be weighed much.

michael-roe on What actual bad outcome has "ethics-based" RLHF AI Alignment already prevented?

Well, we had that guy who tried to assassinate the Queen of England with a crossbow because his AI girlfriend told him to. That was clearly a harm to him, and could have been one for the Queen.

We don’t know how much more “But the AI told me to kill Trump” we’d have with less alignment, but it’s a reasonable guess (given the Replika datapoint) that it might not be zero.

gilch on AI #86: Just Think of the Potential

How about "bubble lighting" then?

zvi on AI #86: Just Think of the Potential

I don't think that works because my brain keeps trying to make it a literal gas bubble?

zvi on AI #86: Just Think of the Potential

I see how you got there. It's a position one could take, although I think it's unlikely and also that it's unlikely that's what Dario meant. If you are right about what he meant, I think it would be great for Dario to be a ton more explicit about it (and for someone to pass that message along to him). Esotericism doesn't work so well here!

zvi on AI #85: AI Wins the Nobel Prize

I am taking as a given people's revealed and often very strongly stated preference that CSAM images are Very Not Okay even if they are fully AI generated and not based on any individual, to the point of criminality, and that society is going to treat it that way.

I agree that we don't know that it is actually net harmful - e.g. the studies on video game use and access to adult pornography tend to not show the negative impacts people assume.

richard_kennaway on Bitter lessons about lucid dreaming

“Let’s summon the Torment Nexus, as seen in classic horror novel ‘Don’t Summon The Torment Nexus’!”

richard_kennaway on Bitter lessons about lucid dreaming

I’ve had a few lucid dreams, only by accident. No aftereffects. My difficulty is staying asleep. I always start waking up before I’ve had a good chance to explore the dream world.