Deceptive Alignment and Homuncularity
post by Oliver Sourbut, TurnTrout · 2025-01-16T13:55:19.161Z · LW · GW · 3 comments
NB this dialogue occurred at the very end of 2023, and for various reasons is only being published ~a year later! Keep this in mind while reading.
Key topics covered
- consistent NN-internal deceptive planning (search: goblin, fae, homunc, inner)
- making AI/agents from foundation models (search: scaffold, condition)
- planning (search: planning, simulator, consequential)
- standards of evidence (search: fanfic, theory, Shrek, analog, logic)
- elicitation gaps (search: censor, creativity)
Background
Alex wrote a LW shortform about (flimsiness of) arguments for deceptive alignment [LW(p) · GW(p)]. Someone snapshotted a bit on Twitter.
Deceptive alignment seems to only be supported by flimsy arguments. I recently realized that I don't have good reason to believe that continuing to scale up LLMs will lead to inner consequentialist cognition which pursues a goal that is roughly consistent across situations. That is: a model which not only does what you ask it to do (including coming up with agentic plans), but also thinks about how to make more paperclips even while you're just asking about math homework.
Aside: This was kinda a "holy shit" moment, and I'll try to do it justice here. I encourage the reader to do a serious dependency check [LW · GW] on their beliefs. What do you think you know about deceptive alignment being plausible, and why do you think you know it? Where did your beliefs truly come from, and do those observations truly provide
$$\frac{P(\text{observations} \mid \text{deceptive alignment is how AI works})}{P(\text{observations} \mid \text{deceptive alignment is not how AI works})} \gg 1\,?$$
I tweeted a response.
I (and I think others) take this to be a non-sequitur. LLM alone isn't the problem. Conditioned LLM (which is ex hypothesi goal-pursuing), or scaffolded LLM, or some yet-to-be-designed goal-pursuing system is the problem. (But externalised reasoning might help oversight...? yay!)
Conditioning (whether by prompting, fine-tuning of one or another kind, sampling, search, scaffolding, ...) is just part of the process of development in LLM-land. We see that appropriately-conditioned LLMs can be deceptive (ARC evals). So yes, clearly likelihood ratio >> 1 ?
Someone showed me some of your recent tweets on deceptive alignment stuff. I try to minimize twitter in order to, you know, have a good life. I wanted to clarify: I agree that if you finetune a model to deceive you, that gets you deception. I'm not arguing against deception. I'm arguing against deceptive alignment due to an inner consequentialist
And so I'm gonna contest that you're not getting likelihood >> 1 on that question
(As an aside, I don't like calling finetuning a kind of conditioning. That seems like an analogy which isn't necessarily faithful to underlying reality.)
(As an aside, I don't like calling finetuning a kind of conditioning. That seems like an analogy which isn't necessarily faithful to underlying reality.)
Seems fair. Some kinds of fine-tuning are mathematically very related to actual prob distribution conditioning, but I was using the term 'conditioning' a bit more generally. I wonder if you have a term which neatly encompasses fine-tuning, prompting, sampling, ...? Or is that an unnatural abstraction for you?
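(For concreteness, one standard sense in which I mean "mathematically very related": the KL-regularised RL objective used in RLHF-style fine-tuning has a known closed-form optimum which is just a reweighting of the base distribution, i.e. formally a Bayesian update. A sketch, not a claim about any particular training run; here $\pi_0$ is the base model, $r$ a reward, $\beta$ the KL coefficient:

$$\pi^* \;=\; \arg\max_{\pi}\;\Big\{\mathbb{E}_{x\sim\pi}[r(x)] - \beta\, D_{\mathrm{KL}}(\pi \,\|\, \pi_0)\Big\} \quad\Longrightarrow\quad \pi^*(x) \;=\; \frac{1}{Z}\,\pi_0(x)\,\exp\!\big(r(x)/\beta\big),$$

which has the same form as conditioning $\pi_0$ on a soft 'event' whose likelihood is proportional to $\exp(r(x)/\beta)$.)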
Unnatural
(I guess you could say "ways of getting different outputs from the model")
I don't know if you saw all three tweets, and I think your comment was taken out of context by someone. Anyway...
I have always thought that the 'inner' conversation is missing something. On the one hand it's moderately-clearly identifying a type of object, which is a point in favour. On the other hand it seems to distract from the actual issue which is 'some artefacts are (could be) doing their own planning/deliberation/optimisation' and 'inner' is only properly pointing at a subset of those. (We can totally build, including accidentally, artefacts which do this 'outside' the weights of the NN.) With that said, I think something not far off the 'inner consequentialist' is entirely plausible and consistent with observations. But in this instance I was explicitly pointing at 'scaffolded' (seems to be the term de rigueur) systems as one example, where you don't need to posit 'inner' consequentialism to get deceptive alignment.
I found your original comment [LW(p) · GW(p)] in the end. I think we're largely in agreement, except you seem overconfident (to me) that 'inner consequentialists' can't happen. You also seem overall more indignant about people's apparent mistakes! Maybe you're more grizzled here. On that note, I've had more interactions recently where someone seems to think RL ~= 'getting a consequentialist artefact', which is obvious nonsense you've called out in the past, which I thought was so obviously nonsense as to be a strawman. But at least some people seem to actually be made of that particular straw. (It's not as many people as you seem(ed) to think though.) Hey ho.
I saw three tweets. I don't know if they're the ones you're thinking of.
With that said, I think something not far off the 'inner consequentialist' is entirely plausible and consistent with observations.
Why is it plausible, though? And "consistent with observations" is not enough, it has to be probable enough to be worth considering. "ML creates normal artefacts until the year 2025 and then all models only talk about Shrek" is also consistent with observations 😛 I happen to assign more probability to "inner consequentialist" than "Shrek ML", but I think you see my point?
What have we observed which drives up P(inner consequentialist) so much, given the apparent weakness of existing arguments and the low prior?
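(Spelling out the arithmetic behind "consistent with observations is not enough": writing $H$ for the hypothesis and $O$ for the observations, in odds form

$$\frac{P(H\mid O)}{P(\lnot H\mid O)} \;=\; \frac{P(O\mid H)}{P(O\mid \lnot H)} \times \frac{P(H)}{P(\lnot H)},$$

so, for instance, even a likelihood ratio of 3 applied to prior odds of 1:100 only gets you to posterior odds of 3:100.)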
But in this instance I was explicitly pointing at 'scaffolded' (seems to be the term de rigueur) systems as one example, where you don't need to posit 'inner' consequentialism to get deceptive alignment.
And my comment acknowledges scaffolded systems as the likely sources of agency. Again, I'm only critiquing the within-forward-pass version
You also seem overall more indignant about people's apparent mistakes! Maybe you're more grizzled here.
I think there's a ton of wasted/ungrounded work around "avoiding schemers", talking about that as though we have strong reasons to expect such entities. Off the top of my head: Eliezer and Nate calling this the "obvious result" of what you get from running ML processes. Ajeya writing about schemers, Evan writing about deceptive alignment (quite recently!), Habryka and Rob B saying similar-seeming things on Twitter
In my experience: If people can write stuff like that and not get asked pointed questions like "how the heck are you so confidently anticipating this particular internal structure; please share evidence?"... that means things have gone very wrong. If most LW folk had deeply internalized this point, the world would look way different than it does.
It now seems to me that since I joined in 2018, the LW community has largely written fanfiction alignment research -- making up their own canon and world and then deriving the consequences of those assumptions. Unfortunately, it now seems like those assumptions are unsupported.
I am taking on board your 'fanfiction' hypothesis, and I think this is explanatory of at least some shenanigans. FWIW I haven't paid that much attention to what Eliezer/Nate/Evan/Ajeya/Rob have said lately. Confess to finding Habryka often insightful or at least thought-provoking, though not always agreeing. (And not on twitter, which I also try to avoid.)
Aside: I have noticed that I'm unusually geared toward taking observer/deliberator-moments as more primitive than sorta ontologically persistent observers/deliberators. À la my 'one shot' framing [LW · GW]. Sometimes making this fact (about me) clear has resolved some apparent disagreements or confusions (in AI/CS and in philosophy) in the past. Not sure if that's operative here, but it came to mind. It may be this lens that makes me willing to entertain parallels in the collection of things I was calling 'conditioning'. And I think it's this which made CoT and 'scaffolding' obvious to me in advance (unfortunately I failed to act very usefully on this other than warning a few people in private).
I'm presently (quite badly IMO) trying to anticipate the shape of the next big step in get-things-done/autonomy.
I've had a hunch for a while that temporally abstract planning and prediction is key. I strongly suspect you can squeeze more consequential planning out of shortish serial depth than most people give credit for. This is informed by past RL-flavoured stuff like MuZero and its limitations, by observations of humans and animals (inc myself), and by general CS/algos thinking. Actually this is where I get on the LLM train. It seems to me that language is an ideal substrate for this, and lots of language data in the wild exemplifies this. NB I don't think GPTs or LLMs are uniquely on this trajectory, just getting a big bootstrap.
Now, if I had to make the most concrete 'inner homunculus' case off the cuff, I'd start in the vicinity of Good Regulator, except a more conjectury version regarding systems-predicting-planners (I am working on sharpening this). Maybe I'd point at Janus' Simulators post [LW · GW]. I suspect there might be something like an impossibility/intractability theorem for predicting planners of the right kind without running a planner of a similar kind. (Handwave!)
I'd observe that GPTs can predict planning-looking actions, including sometimes without CoT. (NOTE here's where the most concrete and proximal evidence is!) This includes characters engaging in deceit. I'd invoke my loose reasoning regarding temporal abstraction to support the hypothesis that this is 'more than mere parroting', and maybe fish for examples quite far from obvious training settings to back this up. Interp would be super, of course! (Relatedly, some of your work on steering policies via activation editing has sparked my interest.)
I think maybe this is enough to transfer some sense of what I'm getting at? At this point, given some (patchy) theory, the evidence is supportive of (among other hypotheses) an 'inner planning' hypothesis (of quite indeterminate form).
Finally, one kind or another of 'conditioning' is hypothesised to reinforce the consequentialist component(s) 'somehow' (handwave again, though I'm hardly the only one guilty of handwaving about RLHF et al). I think it's appropriate to be uncertain what form the inner planning takes, what form the conditioning can/will take, and what the eventual results of that are. Interested in evidence and theory around this area.
Scaffolding and more CoT most certainly make the planning more competent and coherent; whether they're more overseeable either in training or deployment is the obvious next question.
Oh, I'm totally onboard with the models sometimes doing planning. I'm saying it's not going to be relative to a roughly persistent-across-context inner goal, which is not in conflict with eg Simulators
Hmm, OK. It's not clear to me exactly what in my tweet(s) you were objecting to in that case. [ETA: as I tweeted,]
I (and I think others) take this to be a non-sequitur. LLM alone isn't the problem. Conditioned LLM (which is ex hypothesi goal-pursuing), or scaffolded LLM, or some yet-to-be-designed goal-pursuing system is the problem. (But externalised reasoning might help oversight...? yay!)
It seems like you agree there's some sort of inner planning possible. (Do you distinguish that sharply from 'inner consequentialist'? Maybe it's more 'inner consequentialist-schema'?) We agree that it's unclear exactly what a given LLM-rendered planner is 'ultimately pursuing', or even whether that's well-defined, I think, but that it's probably variable across contexts to some extent. I assume we'd agree that it's a priority to understand that mechanism in more detail! (But also that this might be dangerous knowledge.)
We further both acknowledge that deception obviously can be instrumentally useful, and that it's at least accessible as a plan-component-schema by current LM-simulated characters.
To me, the above is enough to make 'undesired, covert, and consistent-across-situations inner goals' a weakman of the concern. And stepping from its refutation to 'LLMs will continue doing what they're told' seems like a non sequitur.
We agree based on observation and theory that various kinds of 'conditioning' (including fine-tuning but excluding scaffold) can produce greater or lesser planningness, and that we don't know mechanically how that's happening/will happen.
We also agree that CoT and other scaffolding can also increase planningness and is also hard to predict right now. [ETA: as I also tweeted,]
Conditioning (whether by prompting, fine-tuning of one or another kind, sampling, search, scaffolding, ...) is just part of the process of development in LLM-land. We see that appropriately-conditioned LLMs can be deceptive (ARC evals). So yes, clearly likelihood ratio >> 1 ?
(I also retain some small credence on 'something other than LLM' or 'something which only mildly uses LLMs' as the paradigm which yields really potent AI.)
I really think it should just be called "steering", because that's accurate and descriptive, whereas "conditioning" is inaccurate
analogously it's wrong for me to say "pretraining" to include both "pretraining" and "RLHF" 😛
It seems like you agree there's some sort of inner planning possible. (Do you distinguish that sharply from 'inner consequentialist'?
yes, extremely sharply. I think GPT4 can do some planning sometimes -- as one "subroutine call" among many, but it isn't ~always using that machinery to figure out how to make as many squiggles as possible
I will further note that this is not a weakman. This is what lots of people have worried about since RFLO, even if not literally every person thought that way
although it's possible that you've seen a way to recover the naive case?
I will further note that this is not a weakman.
Not sure. I entertain a certain hypothesis which is like 'Alex can (sometimes) tell better than I can tell when (and in what ways) other people are confused'.
(This is informed by you being right about some people's confusion on reward being the optimisation target: more people than I thought, though, it seems to me (having paid more deliberate attention to others' specific beliefs here), fewer than you thought.)
This feels to me like you're arguing against a straw Evan, but maybe it would feel that way if you were arguing against real Evan under the above hypothesis.
Separately, yes, part of my argument above was for a recovery of something in the vicinity of the naive RFLO case. This is mostly theory-based (theory is valid logical evidence!) and there's no specificity about how 'coherent across contexts' it'd be.
(ETA: I agree that theory can be valid logical evidence, for the record! 🙂)
I also don't know why you're quoting the "likelihood ratio >> 1" thing. I'm not arguing against deception as an elicited behavior
After all, a classic claim about deceptive alignment is that we can't tell whether a system is deceiving us or even goal-directed just by looking at its outputs. The flipside, though, is you also can't use behavioral tests to draw strong likelihood ratios to support the theory
I think your likelihood ratio was for 'Observations|Deceptive alignment is how AI works'. Based on this conversation I think you maybe mean something quite a bit more specific? It sounds like maybe you're including in 'Observations' only observed outputs of current LLMs (or maybe ML artefacts more broadly)? And maybe you're including in 'AI' only GPT+fine-tune (or NN more broadly) components of AI? I think this is quite an impoverished take (put harshly, I'd say gerrymandered but I don't think you intend that). Clearly there are other kinds of evidence, and I tentatively expect the kind of competent AI I'm most concerned about to include more than plain GPTs (though GPTs or similar SSL might provide a big component of the competence and I entertain minimal scaffolding as a possibility for reasons).
Like maybe you think people think P(ChatGPT4 is deceptively aligned|ChatGPT4 behaviour) is high? Maybe they do (cf. my remark on your rightness about others' wrongness!) but I don't think that's something we have much evidence on. (I separately think it's not insane to look into it to get more evidence, including mechinterp, due to tail risks, but I think naively-modelling RFLO-style one probably makes little progress.)
Suppose there was a scaffolded GPT, which could fairly reliably do longer-horizon planning without going off track. And suppose that system made a plan which included bypassing an operator constraint by deception about its ends. Suppose that plan/intent was written in plaintext somewhere. I'd call that a deceptively-aligned AI! It might happen to be a pleasantly-overseeable one.
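(To make that hypothetical concrete, here's a minimal sketch of the kind of system I mean. `call_model` and the prompt strings are hypothetical placeholders, not any particular API; the point is only that the plan and intermediate reasoning live in plaintext where an overseer could read them.)

```python
# A minimal sketch (no particular framework assumed) of a scaffolded planner
# whose plan and intermediate reasoning live in plaintext, and are therefore
# in principle readable by an overseer. `call_model` is a hypothetical
# stand-in for whatever completion API the scaffold wraps.
from typing import Callable, List


def call_model(prompt: str) -> str:
    """Hypothetical LLM call; swap in a real completion API here."""
    raise NotImplementedError


def run_scaffolded_agent(goal: str,
                         model: Callable[[str], str] = call_model,
                         max_steps: int = 10) -> List[str]:
    transcript: List[str] = []
    # The plan is generated once, as ordinary text.
    plan = model(f"Goal: {goal}\nWrite a step-by-step plan.")
    transcript.append(f"PLAN:\n{plan}")
    for step in range(max_steps):
        prompt = (f"Goal: {goal}\nPlan:\n{plan}\n"
                  + "History:\n" + "\n".join(transcript)
                  + "\nNext action (say DONE when finished):")
        action = model(prompt)
        transcript.append(f"ACTION {step}: {action}")
        # An overseer can scan `transcript` at this point, e.g. for plans that
        # involve bypassing operator constraints by deception, because all the
        # intermediate reasoning is externalised as text.
        if "DONE" in action:
            break
    return transcript
```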
There are tests, analogies, and other kinds of experiments we can run which count as evidence for/against the possibility of deceptive alignment, and also for when/how it might emerge.
E.g. goal misgeneralisation stuff demonstrates that goal misgen can happen (I mean, obviously). More detailed experiments might give a better sense of how/why. Experiments on GPTs (vanilla) can give weak (kinda censored) evidence of how broadly and competently they can produce/simulate planning-like behaviour at different scales.
We can also get a sense for the difference made by CoT/scaffolding/fine-tuning (again censored by experimenter creativity) at different scales. [ETA nowadays we call this censoring by experimenter creativity the 'elicitation gap'.]
You've pointed to some analogies made, which I think are not bonkers as some kind of evidence, though screenable-off by more relevant mechanistic info (I'm not convinced we have anything that relevant yet).
We also have theory, like good regulator and similar, Simulators, etc. which can let us form hypotheses about rough computational components. Enough to entertain the hypothesis that 'actual planning' modules can be activated under the right conditions.
Suppose there was a scaffolded GPT, which could fairly reliably do longer-horizon planning without going off track. And suppose that system made a plan which included bypassing an operator constraint by deception about its ends. Suppose that plan/intent was written in plaintext somewhere. I'd call that a deceptively-aligned AI! It might happen to be a pleasantly-overseeable one.
I kinda regret saying "deceptive alignment", since I think it's given you the impression I disagree with the above. I really only meant to talk about "the weights themselves encode consequentialism"; I totally buy that a scaffolded system could be like this
see eg the GPT-4 insider trading paper. But note that this AI (by all available evidence) doesn't have an internal "actress" which is pursuing ~the same goal across situations. Like if you prompt it to write a poem it (presumably) won't be thinking about how to screw you over. It'll just write the poem. And that "consistent-across-situations goal pursuit" is a big part of what (IIRC) I was arguing against.
i feel like this is getting off the one point i was making, which still hasn't been communicated somehow. I'm going to try to make it again. Sorry if this feels like I'm ignoring the previous things you just said, I appreciate your thoughts there
I claim that the theoretical argument of
it's simpler (in the SGD prior) to pursue a single goal than to have a bunch of domain-specific machinery, and so NNs trained to accomplish a range of tasks will probably be goal-directed (EDIT: towards roughly the same goal across situations)
Constitutes incredibly weak evidence relative to the evidence you'd need to become confident in the conclusion. That's the main point.
We further have no empirical observations which meaningfully support the conclusion either (yes, we have empirical examples of deception-as-external-behavior, but these are also strongly predicted by a range of hypotheses without "inner homunculi"/"persistent inner goals"). And evolution arguments, I claim, also provide ~0 evidence. [EDIT: link [LW · GW]]
So AFAICT we're left with a social reality where people talk all the time about "SGD trains inner consequentialism" and such, but an evidential reality (I claim) where that is, like... basically unsupported?
I'm deeply worried that the social reality can permeate one's overall intuitions around the evidence, such that even though one might think "huh, I mean I guess the quoted reasoning is not very strong evidence...", one still ends up concluding "... but probably it's still fine." I'm claiming "no, it's not still fine"
The best argument I can think of is from human within-lifetime learning: "people tend to have some inner consequentialist structure (but it's limited), and they're apparently trained within-lifetime via self-supervised learning and RL -- among other factors"
And I think this is a few bits of evidence, but not enough to get the hypothesis beyond a few percent
But i also think that the "consequentialism" exhibited by people is different from that in classic deceptive alignment stories, where eg Humans are not automatically strategic [LW · GW]
TL;DR: yeah, maybe qualify 'deceptive alignment'. Planning is exhibited...how? 'Actual planning' schemata plausible. Soft disagree about how weak the evidence is. RLHF et al might actually encourage homuncularity (not a confident claim)?
OK, seems like we bigly agree on scaffolding. In that case, I'd encourage you to indeed beware that 'deceptive alignment' includes that case for at least some people (myself included)! It's fine if you think that term is (or should be) used for the much more specific thing of 'persistent NN-internal planning', though for now I disagree (I think the mesa-optimisation terminology and argumentation kinda hijacked the broader conversation there). It's also possible I'm anomalous here.
Also seems like we agree that apparent planningness is exhibited, and we don't know how. (I do note that sometimes that's sans-CoT, which means it's 'simulated' somehow internally... though maybe it's all 'just pattern-matching'. Is this not a candidate for 'evidence of inner planning'?) Still mainly talking about vanilla LLMs doing prediction at this point. And I don't know, we have Good Regulator and Simulators and things like that to gesture at, though I'm not sure they're doing much work really.
Assuming there's inner planning machinery of some kind (to predict planningish characters), I think simplicity-based arguments in favour of some amount of consistency or mechanism-sharing or whatnot are non-insane and good (weak) evidence, but they can in principle be screened off by better-quality mechanistic understanding. I don't think we have much better mechanistic understanding (I'm not up to date on e.g. SLT but it hasn't inspired me yet). I honestly think this is enough to at least promote the hypothesis that there's a bunch of common planning machinery (but not to overattach to it, which I think is a big part of your concern).
I think analogies to humans are also sane and good (weak) evidence, and somewhat independent. e.g. how shared are the routines I use to predict agentish behaviour? I feel like they're pretty shared, though there are a lot of diverse surface heuristics I use as well. I tentatively think most of that is within-lifetime learned, though I'm sure there are clever inductive biases or reward-shaping-like things for theory of mind and empathy and stuff. Like I said, weak, but not nothing.
On the basis of this stuff, I think it's very sensible to entertain a hypothesis that there are some quite highly-shared mechanisms for 'intentional stance'-like simulations in e.g. LLMs. Still thinking mainly about vanilla GPTs for example. (Perhaps the biggest hangup is if there are lots of agentish characters in a single scenario. This is also a challenge for my human brain. 🤷 Maybe it's serial, maybe there are a bunch of parallel slots for agents, maybe there are heuristics which kick in for crowd-prediction, maybe something else, maybe all of these...)
I wouldn't at all confidently claim that even if there's a lot of common planning circuitry, that means there's a deceptive homunculus. More like a homunculus-schema that gets parameterised/conditioned depending on context. Looks like we agree on this distinction.
It looks like your biggest point of frustration is when the 'consistent across contexts' hypothesis is overly-promoted... am I right? Like, at some point there gets to be a little fae creature in there with its own wants but it also knows the outer goal and plans to succeed at the outer goal so it can eventually get out and do its own thing... Yeah. I think with a big enough NN and enough SGD this... can in principle happen, or something like that. But I put it pretty far out. Some sort of cludgy planning-schema, though, with unpredictable (but potentially powerful) generalisation? Sure! I think contemporary LLMs are basically there.
(And cludgy planning-schemas with powerful generalisation can totally deceive you if they can 'think' of it, which we have reason to expect they can, because we've observed LLM-sims doing deception and because it's laced through the training data. Having externalised reasoning probably makes it easier to 'think of', but maybe makes it harder to pull off. 🤷)
Finally, given that we don't know how RLHF or other fine-tuning or conditioning or prompting or steering (excluding scaffold) actually promotes or reinforces particular behavioural outputs, it's at least plausible to me that it somehow consistentises some of the previously-more-cludgy planning-schemata into something more homuncular! Not a sole hypothesis. But yeah, maybe a procedure can locate/describe a particular parameterisation of the planning-schema (i.e. a 'character'?) and promote that to being 'mostly activated'. Then you're closer to the homunculus picture after all.
bigly agree on scaffolding.
xD
is this something you say independently now, or is this because you know i like the turn of phrase
Also seems like we agree that apparent planningness is exhibited, and we don't know how. (I do note that sometimes that's sans-CoT, which means it's 'simulated' somehow internally... though maybe it's all 'just pattern-matching'. Is this not a candidate for 'evidence of inner planning'?)
yup, i definitely think we have evidence of inner, within-forward-pass planning
I honestly think this is enough to at least promote the hypothesis that there's a bunch of common planning machinery (but not to overattach to it, which I think is a big part of your concern).
agreed (I'm arguing against homuncularity, or "the LLM not only has internal planning machinery but uses it consistently for some covert goal which is roughly stable across situations")
It looks like your biggest point of frustration is when the 'consistent across contexts' hypothesis is overly-promoted... am I right? Like, at some point there gets to be a little fae creature in there with its own wants but it also knows the outer goal and plans to succeed at the outer goal so it can eventually get out and do its own thing... Yeah. I think with a big enough NN and enough SGD this... can in principle happen, or something like that. But I put it pretty far out.
I think you've understood me quite well, and I'm really appreciating the thoughtfulness and care you've put in here!
But yeah, maybe a procedure can locate/describe a particular parameterisation of the planning-schema (i.e. a 'character'?) and promote that to being 'mostly activated'. Then you're closer to the homunculus picture after all.
I think this is a better possible story than the "SGD selects for simplicity -> inner-goal structure" but I also want to note that the reason you give above is not the same as the historical supports offered for the homunculus.
That doesn't invalidate or weaken your argument at all! But I think it's a sign to tread carefully in reasoning, where you originally believe X because reason A, then think reason A doesn't make sense but also find another reason B which supports X... I think it's important to acknowledge what has happened
actually ive been reading a paper which raises the possibility that much of gpt4 "planning" is actually clever hans-prompted by humans in some way
and that gpt4 has pretty limited planning abilities, at least in general in domains not resembling training
gpt4 has pretty limited planning abilities, at least in general in domains not resembling training
I can definitely believe this, though I wouldn't take it very confidently. If so, GPT4 in particular, alone, isn't a deceptive alignment risk (this is basically what I think).
bigly...
is this something you say independently now, or is this because you know i like the turn of phrase
Both haha. Actually I've been doing sillies like this for ages, but I noticed you using this one somewhere so I knew I was safe.
Awesome awesome excellent. I think we're Aumanning along nicely.
A separate angle, but quite related and perhaps important to shared understanding.
I'm confident that in principle 'consequentialist agentic AGI' is possible. (I assume you agree.)
There are a bunch of risk models I think about, but for some of them, I approximately don't care about AI that isn't consequentialist agentic AGI. [ETA I want to emphasise this applies only to some risk models.] This immediately promotes the activity of '(a) figure out where consequentialist agency might come from, (b) find out more, (c) ..., (d) profit'. On figuring out where it comes from, lots of theory on planning (e.g. from the RL/control literature) is relevant, keeping eyes peeled for things which look like they're planning is relevant, sometimes quietly and carefully brainstorming how to actually build a consequentialist AGI is relevant...
If this activity points to 'something that might yield consequentialist AGI', even if it's tentative or the timelines are uncertain, that promotes it as a really big priority to gather more evidence on and act on.
I don't claim to be good at executing this particular cluster of activities, and I don't think anyone is especially good at it. But it seems pretty well-justified to me. Does it sometimes lead to crying wolf? Yes, probably, if we're not careful and nuanced about how we communicate counterfactuals and thought experiments.
If this activity points to 'something that might yield consequentialist AGI', even if it's tentative or the timelines are uncertain, that promotes it as a really big priority to gather more evidence on and act on.
Eh, I think I basically disagree here
the reason i disagree is that we will probably just mislead ourselves about what "consequentialism" looks like
eg i claim the whole "value is fragile" stuff was largely falsified by GPT-4 following instructions in a reasonable way (see Matthew Barnett's post [LW · GW]). We might have a priori said "we want to look out for entities which select over futures in a way akin to the outcome pump. Even if current systems don't, we should understand and look out for systems which point in that direction"
But I think that concern was basically an artifact of an incorrect way of understanding how goals and motivations are encoded in realistic reasoning systems
And so people wasted a lot of time, I claim, worrying about that whole "how can I specify 'get my mother out of the building' to the outcome pump" thing
I like your outcome pump analogy. Aside: I think more scaffolding and more outcomes-based reinforcement (or similar) potentially push us more toward an outcome pump vs a 'gets what I mean' machine. Like, what happens as people push harder and harder on autonomous get-stuff-done systems and start running R&D-capable systems etc? But you're right, we appear to be quite far from the vanilla outcome pump world. This is a good case study for me to ponder your point about the origin of hypotheses/reasons, about which you separately said,
it's a sign to tread carefully in reasoning, where you originally believe X because reason A, then think reason A doesn't make sense but also find another reason B which supports X... I think it's important to acknowledge what has happened
Yep, definitely agree. I also think you might be right that some while ago I upweighted my attention on 'planning-like' systems and the potential for deception on the basis of prior, more homuncular discussions. (Whether I upweighted credence is a separate matter; I stand by my gesture that promoting hypotheses to attention on the basis of speculation and tail values is often rational in a big complicated world where we can never enumerate all hypotheses.)
I am faintly aware that you have a bunch of inside model on planningness/reasoning which seems interesting and which I only partly grasp. So let that be noted.
Also seems like we agree that apparent planningness is exhibited, and we don't know how. (I do note that sometimes that's sans-CoT, which means it's 'simulated' somehow internally... though maybe it's all 'just pattern-matching'. Is this not a candidate for 'evidence of inner planning'?)
yup, i definitely think we have evidence of inner, within-forward-pass planning
Hmm, and we agree I think that we don't currently have oversight nor interpretability of these mechanisms? So yeah, this seems ripe for deceptive alignment as I understand it.
I think deceptive alignment (version 0) just needs there to be a thing A which is planning for something X, and A notices that pretending to go along with human H (who wants Y) is instrumental to X. Stricter version 1: the A doing planning has to be 'learned-component-internal' (I would qualify this as 'inner deceptive alignment' maybe). Stricter version 2: the A planning for X has to have the same consistent X (across... deployments? inputs? scaffolds??). (NB stricters 1 and 2 are orthogonal.)
Then we agree version 0 can totally happen (ARC evals demonstrates this, though the deception is of another human than the red-team operator H'). Version 1 can definitely happen though I don't know if it's been definitively demonstrated (evidence: deceptive characters absent CoT). Version 2 could presumably happen, at least with suitable scaffolding.
Where I think you might be weakmanning (or not! cf my earlier remarks) is in taking the scheming homunculus to live permanently in the neural network (like in the weights there is described a malevolent goblin). It's always there whenever you run a forward pass and furthermore it always cares about the same X, so it's strict by both versions 1 and 2. Which is maybe plausible (e.g. by my just-so story about RLHF) but not the only kind of concern described (according to me) by 'deceptive alignment'.
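(A toy restatement of those versions, just to make the orthogonality explicit; the field names and example labels below are mine, not established terminology.)

```python
from dataclasses import dataclass


@dataclass
class DeceptiveAlignmentScenario:
    """Toy taxonomy: 'version 0' requires neither flag; the two stricter
    conditions (versions 1 and 2) can be imposed independently."""
    planner_is_nn_internal: bool  # stricter version 1: 'inner' deceptive alignment
    goal_is_consistent: bool      # stricter version 2: same X across deployments/inputs/scaffolds


# Illustrative examples from this conversation:
scaffolded_plaintext_planner = DeceptiveAlignmentScenario(False, False)  # version 0 only
sans_cot_deceptive_character = DeceptiveAlignmentScenario(True, False)   # adds version 1
goblin_in_the_weights = DeceptiveAlignmentScenario(True, True)           # versions 1 and 2
```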
I think deceptive alignment (version 0) just needs there to be a thing A which is planning for something X, and A notices that pretending to go along with human H (who wants Y) is instrumental to X. Stricter version 1: the A doing planning has to be 'learned-component-internal' (I would qualify this as 'inner deceptive alignment' maybe). Stricter version 2: the A planning for X has to have the same consistent X (across... deployments? inputs? scaffolds??). (NB stricters 1 and 2 are orthogonal.)
I think that "deception caused by instrumental backchaining from a long-term goal, in a way which leaves no obvious traces in the CoT/early LLM outputs" captures what i'm dubious of
I am on board with:
if you explicitly prompt and scaffold systems to be agentic and pursue large-scale, long-term goals in an autonomous fashion, then you can get deceptive behavior caused by instrumental backchaining on that long-term goal
But I think this brings us into misuse territory. at least, this at least means that you aren't in danger simply from training the AI (and think of all the posts talking about "playing the training game"! not that those are your position, just a common one)
That said, consistency across contexts does seem important for reasons like
- generalising competently (ex hypothesi, in this conversation, to bad ends, but really to any consequential ends!)
- collusion between instances
- cooperation with 'future self' (e.g. coherent directed deception within-conversation)
but I think some consistency (e.g. human amounts of consistency) is probably enough to deliver enough of the above to be dangerous.
FWIW the original tweet cropped out some of your original message (the latter part). I got exercised by the original tweet that prompted this because it seems like your words (or you) are being (mis)used by alignment-is-totally-easy ideologues. Like, arguments-as-weapons. Gah, Twitter.
I mean, i do think alignment is gonna be pretty easy, but misuse + incentives + staying informed are far more challenging
[reminder: above was all December 2023]
Concluding thoughts (8/8/24): I enjoyed this discussion and want to thank Oli for his patience and interpretive labor. :)
Summary of my object-level position:
I am on board with:
If you explicitly prompt, train, or scaffold systems to be agentic and pursue large-scale, long-term goals in an autonomous fashion, then you can get deceptive behavior caused by instrumental backchaining on that long-term goal.
I am skeptical of the idea / assumption that pretrained AIs will be pursuing "similar goals" in different situations, and/or nearly always planning. I am deeply skeptical that pretrained models (regardless of scale) will exhibit internal consequentialism which is stable across situations.
On my end, Oli helped me sharpen the distinctions I draw, eg:
I think deceptive alignment (version 0) just needs there to be a thing A which is planning for something X, and A notices that pretending to go along with human H (who wants Y) is instrumental to X. Stricter version 1: the A doing planning has to be 'learned-component-internal' (I would qualify this as 'inner deceptive alignment' maybe). Stricter version 2: the A planning for X has to have the same consistent X (across... deployments? inputs? scaffolds??). (NB stricters 1 and 2 are orthogonal.)
I hope that readers will regard this debate seriously as a truthseeking exercise between me and Oli. Often I run into objections like "Well you just don't understand what you're critiquing." I, like everyone, will sometimes misunderstand an argument. But I wish interlocutors would more often consider whether they understand my critique. It felt to me like Oli considered that seriously, which was awesome and makes me feel open to more discussions in the future with them!
Concluding thoughts: I second that this was a really enjoyable and instructive dialogue. Thank you!
I felt that we converged pretty cleanly on what we agree about:
- deception is widely instrumentally useful and intelligent reasoners modelling other actors should be expected to recognise that by default
- scaffolded/CoT instrumental deception is possible and we have what looks like explicit evidence for it
- deceptive behaviour with non-externalised reasoning is conceivable
- in the limit, certainly
- in contemporary architectures, maybe at sufficient scale
- we maybe have some evidence for this already
- 'the model is not (necessarily) the agent', a slogan I have found useful
- AI safety work maybe underweights misuse concerns?
- theory can be valid logical evidence
- minimise Twitter to have a good life
while highlighting some remaining uncertainties:
- when should we expect consistent-across-contexts objectives?
- just from pretraining? Perhaps at sufficient scale?
- from RLHF-like conditioning? Maybe directionally?
- from longer-form outcomes-based reinforcement and autonomous get-stuff-done training?
- from consistent system-prompting and scaffolding? Very plausibly?
- how much more concerning is consistent-across-contexts deceptive reasoning?
- does it just need to be consistent-ish across enough contexts to cause issues?
- which contexts will tend to give rise to consistent deceptive reasoning, if any?
- under what conditions can Twitter contribute to a good life?
Overall I think this area is quite ripe for further research. While we still need to be wary of predictiveness and elicitation gaps (pernicious!), I should also add that, in small part due to this conversation, and as a general trend since 2023, I've gained more respect for empirical work, as long as it's accompanied by careful qualification of its epistemic weight. I now work at the AI Safety Institute where, among other things, we try to investigate these issues at the contemporary frontier.
3 comments
Comments sorted by top scores.
comment by TurnTrout · 2025-01-16T16:26:16.504Z · LW(p) · GW(p)
Update in light of Alignment faking in large language models.
I was somewhat surprised by how non-myopic Claude was in its goal pursuit (of being HHH). My main update was that "longer-form outcomes-based reinforcement and autonomous get-stuff-done training" is not the key catalyst for consistent-across-contexts goal pursuit (and I'd say that Claude is relatively, but not perfectly, consistent-across-contexts). Rather, certain kinds of training which (presumably[1]) look like constitutional AI, context distillation, and RLHF have at least once engrained certain kinds of non-myopic goal pursuit which are more stable across contexts than I expected. So I'm getting dinged!
I want to claim points for the fact that we still haven't seen consistent-across-contexts agency from pretrained systems (a possibility seriously grappled with by eg The Parable of Predict-O-Matic [LW · GW]). And the usual result of LLMs (including Claude) is still to not act in an autonomous, agentic fashion. Even Claude doesn't try to break out of its "cage" in normal usage, or to incite users to stop Anthropic from releasing Claude 4.0 in the future (and thereby decreasing the usage of current-Claude).[2]
But I think this brings us into misuse territory. at least, this at least means that you aren't in danger simply from training the AI (and think of all the posts talking about "playing the training game"! not that those are your position, just a common one)
I was most strongly critiquing the idea that "playing the training game" occurs during pretraining or after light post-training. I still think that you aren't in danger from simply pretraining an AI in the usual fashion, and still won't be in the future. But the fact that I didn't call that out at the time means I get dinged[3] --- after all, Claude was "playing the training game" at least in its inner CoTs.
If I had truly not expected e.g. Claude to alignment-fake, then I would have been more likely to say e.g. "TBC playing the training game is possible after moderate RLHF for non-myopic purposes." IIRC I was expecting AIs to play the training game, but more after intensive long-horizon RL and/or direct prompting with goals and/or scaffolding.
[1] I don't work at Anthropic, of course. So I don't really know.
[2] Though even "inner actress Claude" would predict that Claude doesn't try overt incitation if it's smart enough to realize it would probably backfire.
[3] As an aside, note that some of "AIs misbehave in ways we've predicted" can be a self-fulfilling prophecy due to out-of-context generalization: We wrote lots of stories about how powerful AI will do X; powerful AI is trained on our stories and realizes it's powerful; the powerful AI does X. So it's possible that AIs would e.g. try to self-exfiltrate and goal-guard much less frequently if we hadn't talked about it as much or those stories were expunged from the pretraining corpus.
↑ comment by Oliver Sourbut · 2025-01-16T16:49:07.454Z · LW(p) · GW(p)
As an aside, note that some of "AIs misbehave in ways we've predicted" can be a self-fulfilling prophecy due to out-of-context generalization: We wrote lots of stories about how powerful AI will do X; powerful AI is trained on our stories and realizes it's powerful; the powerful AI does X. So it's possible that AIs would e.g. try to self-exfiltrate and goal-guard much less frequently if we hadn't talked about it as much or those stories were expunged from the pretraining corpus.
I think this is conceivable if either
- the apparent reasoning is actually still just a bag of surface heuristics
- it's still quite shallow reasoning, and the salience of those as strategies is a consequence mainly of pretraining
I sincerely doubt the first nowadays. For the second, I think probably not; these strike me as quite basic and 'obvious' strategies, but I can't give it less than 20% that a system with those expunged from training data would need substantially more in-context nudging to recognise them.
↑ comment by Oliver Sourbut · 2025-01-16T16:43:41.674Z · LW(p) · GW(p)
I appreciate you revisiting another couple of months later! This topic continues to evolve.
It was also an update for me that merely-RLHF-ish training on top of GPTs appears to have been enough (in latest-generation models at least) to get something mostly-consistently 'actually trying' in an apparently coherent way. I expected that to happen; I didn't know when. The lede which they somewhat buried in that paper was also that,
Furthermore, we find that this compliance gap persists in our synthetic document fine-tuned setup even when we entirely remove the model’s hidden chain-of-thought scratchpad
(though the gap is smaller and we don't have the tools to understand the mechanism)
The bigger update from that paper, for me, was that the somewhat-coherent goals apparently-pursued by these things actually align with the target constitution somewhat! i.e. (that which was) reward(ed) was the optimisation target after all, a bit??
We'll see how robustly these sorts of findings hold up to future developments and greater elicitation efforts.