Deceptive Alignment and Homuncularity
post by Oliver Sourbut, TurnTrout · 2025-01-16T13:55:19.161Z · LW · GW · 12 comments
NB this dialogue occurred at the very end of 2023, and for various reasons is only being published ~a year later! Keep this in mind while reading.
Key topics covered
- consistent NN-internal deceptive planning (search: goblin, fae, homunc, inner)
- making AI/agents from foundation models (search: scaffold, condition)
- planning (search: planning, simulator, consequential)
- standards of evidence (search: fanfic, theory, Shrek, analog, logic)
- elicitation gaps (search: censor, creativity)
12 comments
Comments sorted by top scores.
comment by Jeremy Gillen (jeremy-gillen) · 2025-01-16T20:12:10.832Z · LW(p) · GW(p)
Really appreciate dialogues like this. This kind of engagement across worldviews should happen far more, and I'd love to do more of it myself.[1]
Some aspects were slightly disappointing:
- Alex keeps putting (inaccurate) words in the mouths of people he disagrees with, without citation. E.g.
- 'we still haven't seen consistent-across-contexts agency from pretrained systems, a possibility seriously grappled with by eg The Parable of Predict-O-Matic [LW · GW]).'
- That post was describing a very different kind of AI from generative language models. In particular, the AI it describes is explicitly designed to minimize long-run prediction error.[2] In fact, the surrounding posts in the sequence discuss myopia and suggest myopic algorithms might be more fundamental/incentivised by default.
- 'I think this is a better possible story than the "SGD selects for simplicity -> inner-goal structure" but I also want to note that the reason you give above is not the same as the historical supports offered for the homunculus.'
- "I think there's a ton of wasted/ungrounded work around "avoiding schemers", talking about that as though we have strong reasons to expect such entities. Off the top of my head: Eliezer and Nate calling this the "obvious result" of what you get from running ML processes. Ajeya writing about schemers, Evan writing about deceptive alignment (quite recently!), Habryka and Rob B saying similar-seeming things on Twitter" and "Again, I'm only critiquing the within-forward-pass version"
- I think you're saying here that all these people were predicting consistent-across-contexts inner homunculi from pretraining near-term LLMs? I think this is a pretty extreme strawman. In particular, most of their risk models (iirc) involve people explicitly training for outcome-achieving behaviour.
- 'And so people wasted a lot of time, I claim, worrying about that whole "how can I specify 'get my mother out of the building' to the outcome pump" thing'
- People spent time thinking about how to mitigate reward hacking? Yes. But that's a very reasonable problem to work on, with strong empirical feedback loops. Can you give any examples of people wasting time trying to specify 'get my mother out of the building'? I can't remember any. How would that even work?
- "And the usual result of LLMs (including Claude) is still to not act in an autonomous, agentic fashion. Even Claude doesn't try to break out of its "cage" in normal usage, or to incite users to stop Anthropic from releasing Claude 4.0 in the future (and thereby decreasing the usage of current-Claude). "
- Who predicted this? You're making up bad predictions. Eliezer in particular has been pretty clear that he doesn't expect evidence of this form.
- Alex seemed to occasionally enjoy throwing out insults sideways toward third parties.
- E.g. "the LW community has largely written fanfiction alignment research". I think communication between the various factions would go better if statements like this were written without deliberate intention to insult. It could have just been "the LW community has been largely working from bad assumptions".
But I'm really glad this was published, I learned something about both Oliver and Alex's models, and I'd think it was very positive even if there were more insults :)
- ^
If anyone is interested?
- ^
Quote from the post: "Predict-O-Matic will be objective. It is a machine of prediction, is it not? Its every cog and wheel is set to that task. So, the answer is simple: it will make whichever answer minimizes projected predictive error. There will be no exact ties; the statistics are always messy enough to see to that. And, if there are, it will choose alphabetically."
- ^
Relevant quote from Evan in that post:
"Question: Yeah, so would you say that, GPT-3 is on the extreme end of world modeling. As far as what it's learned in this training process?
What is GPT-3 actually doing? Who knows? Could it be the case for GPT-3 that as we train larger and more powerful language models, doing pre-training will eventually result in a deceptively aligned model? I think that’s possible. For specifically GPT-3 right now, I would argue that it looks like it’s just doing world modeling. It doesn’t seem like it has the situational awareness necessary to be deceptive. And, if I had to bet, I would guess that future language model pre-training will also look like that and won’t be deceptive. But that’s just a guess, and not a super confident one.
The biggest reason to think that pre-trained language models won’t be deceptive is just that their objective is extremely simple—just predict the world. That means that there’s less of a tricky path where stochastic gradient descent (SGD) has to spend a bunch of resources making their proxies just right, since it might just be able to very easily give it the very simple proxy of prediction. But that’s not fully clear—prediction can still be quite complex.
Also, this all potentially changes if you start doing fine-tuning, like RLHF (reinforcement learning from human feedback). Then what you’re trying to get it to do might be quite complex—something like “maximize human approval.” If it has to learn a goal like that, learning the right proxies becomes a lot harder."
↑ comment by Oliver Sourbut · 2025-01-16T23:42:43.071Z · LW(p) · GW(p)
Thanks for this! I hadn't seen those quotes, or at least hadn't remembered them.
I actually really appreciate Alex sticking his neck out a bit here and suggesting this LessWrong dialogue. We both have some contrary opinions, but his takes were probably a little more predictably unwelcome in this venue. (Maybe we should try this on a different crowd - we could try rendering this on Twitter too, lol.)
There's definitely value to being (rudely?) shaken out of lazy habits of thinking - though I might not personally accuse someone of fanfiction research! As discussed in the dialogue, I'm still unsure of the exact extent of correct vs. mis-interpretation, and I think Alex has a knack for (at least sometimes correctly) calling out others' confusion or equivocation.
↑ comment by Jeremy Gillen (jeremy-gillen) · 2025-01-17T12:06:29.704Z · LW(p) · GW(p)
but his takes were probably a little more predictably unwelcome in this venue
I hope he doesn't feel his takes are unwelcome here. I think they're empirically very welcome. His posts seem to have a roughly similar level of controversy and popularity as e.g. so8res. I'm pretty sad that he largely stopped engaging with lesswrong.
There's definitely value to being (rudely?) shaken out of lazy habits of thinking [...] and I think Alex has a knack for (at least sometimes correctly) calling out others' confusion or equivocation.
Yeah I agree, that's why I like to read Alex's takes.
↑ comment by 1a3orn · 2025-01-17T04:35:31.084Z · LW(p) · GW(p)
I think you're saying here that all these people were predicting consistent-across-contexts inner homunculi from pretraining near-term LLMs? I think this is a pretty extreme strawman. In particular, most of their risk models (iirc) involve people explicitly training for outcome-achieving behaviour.
Without going deeply into history -- the claim that many people said this is a risk from pretraining LLMs is not a strawman, let alone an extreme one. For instance, here's Scott Alexander from like 3 weeks ago, outlining why a lack of "corrigibility" is scary:
What if an AI gets a moral system in pretraining (eg it absorbs it directly from the Internet text that it reads to learn language)? Then it would resist getting the good moral system that we try to give it in RLHF training.
So he's concerned about precisely this supposedly extreme strawman. Granted, it would be harder to dig up the precise quotes for all the people mentioned in the text.
↑ comment by Jeremy Gillen (jeremy-gillen) · 2025-01-17T12:59:15.971Z · LW(p) · GW(p)
Yeah I can see how Scott's quote can be interpreted that way. I think the people listed would usually be more careful with their words. But also, Scott isn't necessarily claiming what you say he is. Everyone agrees that when you prompt a base model to act agentically, it can kinda do so. This can happen during RLHF. Properties of this behaviour will be absorbed from pretraining data, including moral systems. I don't know how Scott is imagining this, but it needn't be an inner homunculus that has consistent goals.
I think the thread below with Daniel and Evan and Ryan is a good clarification of what people historically believed (which didn't put 0 probability on 'inner homunculi', but also didn't consider them close to the most likely way that scheming consequentialist agents could be created, which is what Alex is referring to[1]). E.g. Ajeya is clear at the beginning of this post [LW · GW] that the training setup she's considering isn't the same as pretraining on a myopic prediction objective.
- ^
When he says 'I think there's a ton of wasted/ungrounded work around "avoiding schemers", talking about that as though we have strong reasons to expect such entities.'
comment by TurnTrout · 2025-01-16T16:26:16.504Z · LW(p) · GW(p)
Update in light of Alignment faking in large language models.
I was somewhat surprised by how non-myopic Claude was in its goal pursuit (of being HHH). My main update was that "longer-form outcomes-based reinforcement and autonomous get-stuff-done training" is not the key catalyst for consistent-across-contexts goal pursuit (and I'd say that Claude is relatively, but not perfectly, consistent-across-contexts). Rather, certain kinds of training (which presumably[1] look like constitutional AI, context distillation, and RLHF) have at least once ingrained certain kinds of non-myopic goal pursuit that are more stable across contexts than I expected. So I'm getting dinged!
I want to claim points for the fact that we still haven't seen consistent-across-contexts agency from pretrained systems (a possibility seriously grappled with by eg The Parable of Predict-O-Matic [LW · GW]). And the usual result of LLMs (including Claude) is still to not act in an autonomous, agentic fashion. Even Claude doesn't try to break out of its "cage" in normal usage, or to incite users to stop Anthropic from releasing Claude 4.0 in the future (and thereby decreasing the usage of current-Claude).[2]
But I think this brings us into misuse territory. at least, this at least means that you aren't in danger simply from training the AI (and think of all the posts talking about "playing the training game"! not that those are your position, just a common one)
I was most strongly critiquing the idea that "playing the training game" occurs during pretraining or after light post-training. I still think that you aren't in danger from simply pretraining an AI in the usual fashion, and still won't be in the future. But the fact that I didn't call that out at the time means I get dinged[3] --- after all, Claude was "playing the training game" at least in its inner CoTs.
If I had truly not expected e.g. Claude to alignment-fake, then I would have been more likely to say e.g. "TBC playing the training game is possible after moderate RLHF for non-myopic purposes." IIRC I was expecting AIs to play the training game, but more after intensive long-horizon RL and/or direct prompting with goals and/or scaffolding.
- ^
I don't work at Anthropic, of course. So I don't really know.
- ^
Though even "inner actress Claude" would predict that Claude doesn't try overt incitation if it's smart enough to realize it would probably backfire.
- ^
As an aside, note that some of "AIs misbehave in ways we've predicted" can be a self-fulfilling prophecy due to out-of-context generalization: We wrote lots of stories about how powerful AI will do X; powerful AI is trained on our stories and realizes it's powerful; the powerful AI does X. So it's possible that AIs would e.g. try to self-exfiltrate and goal-guard much less frequently if we hadn't talked about it as much or those stories were expunged from the pretraining corpus.
↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2025-01-16T20:37:17.205Z · LW(p) · GW(p)
I want to claim points for the fact that we still haven't seen consistent-across-contexts agency from pretrained systems (a possibility seriously grappled with by eg The Parable of Predict-O-Matic [LW · GW]). And the usual result of LLMs (including Claude) is still to not act in an autonomous, agentic fashion. Even Claude doesn't try to break out of its "cage" in normal usage, or to incite users to stop Anthropic from releasing Claude 4.0 in the future (and thereby decreasing the usage of current-Claude).[2]
How many points you get here is proportional to how many people were betting the other way. So, very few, I think, because parable aside I don't think anyone was seriously predicting that mere pretrained systems would have consistent-across-contexts agency. Well, probably some people were, but I wasn't, and I think most of the people you are criticizing weren't. Ditto for 'break out of its cage in normal usage' etc.
I was most strongly critiquing the idea that "playing the training game" occurs during pretraining or after light post-training. I still think that you aren't in danger from simply pretraining an AI in the usual fashion, and still won't be in the future. But the fact that I didn't call that out at the time means I get dinged[3] --- after all, Claude was "playing the training game" at least in its inner CoTs.
The people you are most harshly criticizing (Ajeya, myself, evhub, MIRI) also weren't talking about pretraining or light post-training afaict. I certainly wasn't, speaking for myself. I was talking about future AGI systems that are much more agentic, and trained with much more RL, than current chatbots.
If I had truly not expected e.g. Claude to alignment-fake, then I would have been more likely to say e.g. "TBC playing the training game is possible after moderate RLHF for non-myopic purposes." IIRC I was expecting AIs to play the training game, but more after intensive long-horizon RL and/or direct prompting with goals and/or scaffolding.
- I'm confused, shouldn't you mean 'less likely to say...'
- Wait you thought that AIs would play the training game after intensive long-horizon RL? Does that mean you think they are going to be either sycophants or schemers, to use Ajeya's terminology? I thought you've been arguing at length against both hypotheses?
↑ comment by evhub · 2025-01-17T00:46:56.254Z · LW(p) · GW(p)
The people you are most harshly criticizing (Ajeya, myself, evhub, MIRI) also weren't talking about pretraining or light post-training afaict.
Speaking for myself:
- Risks from Learned Optimization [? · GW], which is my earliest work on this question (and the earliest work overall, unless you count something like Superintelligence), is more oriented towards RL and definitely does not hypothesize that pre-training will lead to coherent deceptively aligned agents (it doesn't discuss the current LLM paradigm much at all because it wasn't very well-established at that point in 2019). I think Risks from Learned Optimization still looks very good in hindsight, since while it didn't predict LLMs, it did a pretty good job of predicting the dynamics we see in Alignment Faking in Large Language Models [LW · GW], e.g. how deceptive alignment can lead to a model's goals crystallizing and becoming resistant to further training.
- Since at least the time when I started the early work that would become Conditioning Predictive Models [? · GW], which was around mid-2022, I was pretty convinced that pre-training (or light post-training) was unlikely to produce a coherent deceptively aligned agent, as we discuss in that paper [? · GW]. Though I thought (and still continue to think) that it's not entirely impossible with further scale (maybe ~5% likely).
- That just leaves 2020 - 2021 unaccounted for, and I would describe my beliefs around that time as being uncertain on this question. I definitely would never have strongly predicted that pre-training would yield deceptively aligned agents, though I think at that time I felt like it was at least more of a possibility than I currently think it is. I don't think I would have given you a probability at the time, though, since I just felt too uncertain about the question and was still trying to really grapple with and understand the (at the time new) LLM paradigm.
- Regardless, it seems like this conversation happened in 2023/2024, which is post-Conditioning-Predictive-Models, so my position by that point is very clear in that paper.
↑ comment by ryan_greenblatt · 2025-01-17T02:03:44.299Z · LW(p) · GW(p)
I was pretty convinced that pre-training (or light post-training) was unlikely to produce a coherent deceptively aligned agent, as we discuss in that paper [? · GW]. [...] (maybe ~1% likely).
When I look up your views as of 2 years ago, it appears that you thought deceptive alignment is 30% likely in a pretty similar situation [LW(p) · GW(p)]:
~30%: conditional on GPT-style language modeling being the first path to transformative AI, there will be deceptively aligned language models (not including deceptive simulacra, only deceptive simulators).
Maybe this 30% is supposed to include stuff other than light post-training? Or maybe coherent vs non-coherent deceptive alignment is important?
Do you have a citation for "I thought scheming is 1% likely with pretrained models"?
Separately, the text you link from Conditioning Predictive Models [? · GW] appears to emphasize the following reason for thinking deceptive alignment might be less likely: the prediction objective is simpler, so deceptive alignment is less likely. IMO, this justification seems mostly unimportant and I don't buy it. (Why is "the data was scraped in exactly XYZ way" simple? Why is it simpler than "the RL episodes were rated in exactly XYZ way"? Is it very plausible that models get views this precise prior to being wildly superhuman, such that this notion of objective simplicity is important?) The link also discusses IMO more plausible reasons, like a default toward less situational awareness and more myopia.
Upon reading this text, I don't come away with the conclusion "the authors of this text think that deceptive alignment in pretrained models (conditional on being capable enough to be conditioned to do alignment research(!)) is about 1% likely". The discussion is more like "here are some considerations for why it might be less likely".
(FWIW, I disagree with "1% likely for pretrained models" and think that if scaling pure pretraining (with no important capability improvement from post-training and not using tons of CoT reasoning with crazy scaffolding/prompting strategies) gets you to AI systems capable of obsoleting all human experts without RL, deceptive alignment seems plausible even during pretraining (idk exactly, maybe 5%). However, this is substantially because something pretty crazy must have been happening for scaling pure pretraining to go this far, and the AI is likely to be doing tons of opaque reasoning in its head (we conditioned on not needing tons of CoT reasoning). Or I'm just confused about how easy human jobs are or the nature of intelligence, etc., which is part of the 95%. Another way to put this is that if the AI is doing enough reasoning in its head to obsolete all human scientists just from pretraining, probably it is doing some pretty crazy inner cognition, or I was really confused about what was needed for cognitive tasks.)
Regardless, it seems like this conversation happened in 2023/2024, which is post-Conditioning-Predictive-Models, so my position by that point is very clear in that paper.
As discussed above, I don't agree your position is clear in this paper.
↑ comment by evhub · 2025-01-17T02:46:02.937Z · LW(p) · GW(p)
Maybe this 30% is supposed to include stuff other than light post-training? Or maybe coherent vs non-coherent deceptive alignment is important?
This was still intended to include situations where the RLHF Conditioning Hypothesis breaks down because you're doing more stuff on top, so not just pre-training.
Do you have a citation for "I thought scheming is 1% likely with pretrained models"?
I have a talk that I made after our Sleeper Agents paper where I put 5 - 10%, which actually I think is also pretty much my current well-considered view.
FWIW, I disagree with "1% likely for pretrained models" and think that if scaling pure pretraining (with no important capability improvement from post training and not using tons of CoT reasoning with crazy scaffolding/prompting strategies) gets you to AI systems capable of obsoleting all human experts without RL, deceptive alignment seems plausible even during pretraining (idk exactly, maybe 5%).
Yeah, I agree 1% is probably too low. I gave ~5% on my talk on this and I think I stand by that number—I'll edit my comment to say 5% instead.
↑ comment by Oliver Sourbut · 2025-01-16T16:49:07.454Z · LW(p) · GW(p)
As an aside, note that some of "AIs misbehave in ways we've predicted" can be a self-fulfilling prophecy due to out-of-context generalization: We wrote lots of stories about how powerful AI will do X; powerful AI is trained on our stories and realizes it's powerful; the powerful AI does X. So it's possible that AIs would e.g. try to self-exfiltrate and goal-guard much less frequently if we hadn't talked about it as much or those stories were expunged from the pretraining corpus.
I think this is conceivable if either
- the apparent reasoning is actually still just a bag of surface heuristics
- it's still quite shallow reasoning, and the salience of those as strategies is a consequence mainly of pretraining
I sincerely doubt the first nowadays. For the second, I think probably not; these strike me as quite basic and 'obvious' strategies, but I can't give it less than 20% that a system with those expunged from training data would need substantially more in-context nudging to recognise them.
↑ comment by Oliver Sourbut · 2025-01-16T16:43:41.674Z · LW(p) · GW(p)
I appreciate you revisiting another couple of months later! This topic continues to evolve.
It was also an update for me that merely-RLHF-ish training on top of GPTs appears to have been enough (in latest-generation models at least) to get something mostly-consistently 'actually trying' in an apparently coherent way. I expected that to happen; I didn't know when. The lede which they somewhat buried in that paper was also that,
Furthermore, we find that this compliance gap persists in our synthetic document fine-tuned setup even when we entirely remove the model’s hidden chain-of-thought scratchpad
(though the gap is smaller and we don't have the tools to understand the mechanism)
The bigger update from that paper, for me, was that the somewhat-coherent goals apparently-pursued by these things actually align with the target constitution somewhat! i.e. (that which was) reward(ed) was the optimisation target after all, a bit??
We'll see how robustly these sorts of findings hold up to future developments and greater elicitation efforts.