Deceptive Alignment and Homuncularity
post by Oliver Sourbut, TurnTrout · 2025-01-16T13:55:19.161Z · LW · GW · 12 comments
NB this dialogue occurred at the very end of 2023, and for various reasons is only being published ~a year later! Keep this in mind while reading.
Key topics covered
- consistent NN-internal deceptive planning (search: goblin, fae, homunc, inner)
- making AI/agents from foundation models (search: scaffold, condition)
- planning (search: planning, simulator, consequential)
- standards of evidence (search: fanfic, theory, Shrek, analog, logic)
- elicitation gaps (search: censor, creativity)
12 comments
Comments sorted by top scores.
comment by Jeremy Gillen (jeremy-gillen) · 2025-01-16T20:12:10.832Z · LW(p) · GW(p)
Really appreciate dialogues like this. This kind of engagement across worldviews should happen far more, and I'd love to do more of it myself.[1]
Some aspects were slightly disappointing:
- Alex keeps putting (inaccurate) words in the mouths of people he disagrees with, without citation. E.g.
- 'we still haven't seen consistent-across-contexts agency from pretrained systems, a possibility seriously grappled with by eg The Parable of Predict-O-Matic [LW · GW]).'
- That post was describing a very different kind of AI from generative language models. In particular, the AI it describes is explicitly designed to minimize long-run prediction error.[2] In fact, the surrounding posts in the sequence discuss myopia and suggest myopic algorithms might be more fundamental/incentivised by default.
- 'I think this is a better possible story than the "SGD selects for simplicity -> inner-goal structure" but I also want to note that the reason you give above is not the same as the historical supports offered for the homunculus.'
- "I think there's a ton of wasted/ungrounded work around "avoiding schemers", talking about that as though we have strong reasons to expect such entities. Off the top of my head: Eliezer and Nate calling this the "obvious result" of what you get from running ML processes. Ajeya writing about schemers, Evan writing about deceptive alignment (quite recently!), Habryka and Rob B saying similar-seeming things on Twitter" and "Again, I'm only critiquing the within-forward-pass version"
- I think you're saying here that all these people were predicting consistent-across-contexts inner homunculi from pretraining near-term LLMs? I think this is a pretty extreme strawman. In particular, most of their risk models (iirc) involve people explicitly training for outcome-achieving behaviour.
- 'And so people wasted a lot of time, I claim, worrying about that whole "how can I specify 'get my mother out of the building' to the outcome pump" thing'
- People spent time thinking about how to mitigate reward hacking? Yes. But that's a very reasonable problem to work on, with strong empirical feedback loops. Can you give any examples of people wasting time trying to specify 'get my mother out of the building'? I can't remember any. How would that even work?
- "And the usual result of LLMs (including Claude) is still to not act in an autonomous, agentic fashion. Even Claude doesn't try to break out of its "cage" in normal usage, or to incite users to stop Anthropic from releasing Claude 4.0 in the future (and thereby decreasing the usage of current-Claude). "
- Who predicted this? You're making up bad predictions. Eliezer in particular has been pretty clear that he doesn't expect evidence of this form.
- Alex seemed to occasionally enjoy throwing out insults sideways toward third parties.
- E.g. "the LW community has largely written fanfiction alignment research". I think communication between the various factions would go better if statements like this were written without deliberate intention to insult. It could have just been "the LW community has been largely working from bad assumptions".
But I'm really glad this was published, I learned something about both Oliver and Alex's models, and I'd think it was very positive even if there were more insults :)
- ^
If anyone is interested?
- ^
Quote from the post: "Predict-O-Matic will be objective. It is a machine of prediction, is it not? Its every cog and wheel is set to that task. So, the answer is simple: it will make whichever answer minimizes projected predictive error. There will be no exact ties; the statistics are always messy enough to see to that. And, if there are, it will choose alphabetically."
- ^
Relevant quote from Evan in that post:
"Question: Yeah, so would you say that, GPT-3 is on the extreme end of world modeling. As far as what it's learned in this training process?
What is GPT-3 actually doing? Who knows? Could it be the case for GPT-3 that as we train larger and more powerful language models, doing pre-training will eventually result in a deceptively aligned model? I think that’s possible. For specifically GPT-3 right now, I would argue that it looks like it’s just doing world modeling. It doesn’t seem like it has the situational awareness necessary to be deceptive. And, if I had to bet, I would guess that future language model pre-training will also look like that and won’t be deceptive. But that’s just a guess, and not a super confident one.
The biggest reason to think that pre-trained language models won’t be deceptive is just that their objective is extremely simple—just predict the world. That means that there’s less of a tricky path where stochastic gradient descent (SGD) has to spend a bunch of resources making their proxies just right, since it might just be able to very easily give it the very simple proxy of prediction. But that’s not fully clear—prediction can still be quite complex.
Also, this all potentially changes if you start doing fine-tuning, like RLHF (reinforcement learning from human feedback). Then what you’re trying to get it to do might be quite complex—something like “maximize human approval.” If it has to learn a goal like that, learning the right proxies becomes a lot harder."
↑ comment by Oliver Sourbut · 2025-01-16T23:42:43.071Z · LW(p) · GW(p)
Thanks for this! I hadn't seen those quotes, or at least hadn't remembered them.
I actually really appreciate Alex sticking his neck out a bit here and suggesting this LessWrong dialogue. We both have some contrary opinions, but his takes were probably a little more predictably unwelcome in this venue. (Maybe we should try this on a different crowd - we could try rendering this on Twitter too, lol.)
There's definitely value to being (rudely?) shaken out of lazy habits of thinking - though I might not personally accuse someone of fanfiction research! As discussed in the dialogue, I'm still unsure of the exact extent of correct vs. mis-interpretation, and I think Alex has a knack for (at least sometimes correctly) calling out others' confusion or equivocation.
↑ comment by Jeremy Gillen (jeremy-gillen) · 2025-01-17T12:06:29.704Z · LW(p) · GW(p)
but his takes were probably a little more predictably unwelcome in this venue
I hope he doesn't feel his takes are unwelcome here. I think they're empirically very welcome. His posts seem to have a roughly similar level of controversy and popularity as e.g. so8res. I'm pretty sad that he largely stopped engaging with lesswrong.
There's definitely value to being (rudely?) shaken out of lazy habits of thinking [...] and I think Alex has a knack for (at least sometimes correctly) calling out others' confusion or equivocation.
Yeah I agree, that's why I like to read Alex's takes.
↑ comment by 1a3orn · 2025-01-17T04:35:31.084Z · LW(p) · GW(p)
I think you're saying here that all these people were predicting consistent-across-contexts inner homunculi from pretraining near-term LLMs? I think this is a pretty extreme strawman. In particular, most of their risk models (iirc) involve people explicitly training for outcome-achieving behaviour.
Without going deeply into history -- the claim that many people said this is a risk from pretraining LLMs is not a strawman, let alone an extreme one. For instance, here's Scott Alexander from like 3 weeks ago, outlining why a lack of "corrigibility" is scary:
What if an AI gets a moral system in pretraining (eg it absorbs it directly from the Internet text that it reads to learn language)? Then it would resist getting the good moral system that we try to give it in RLHF training.
So he's concerned about precisely this supposedly extreme strawman. Granted, it would be harder to dig up the precise quotes for all the people mentioned in the text.
↑ comment by Jeremy Gillen (jeremy-gillen) · 2025-01-17T12:59:15.971Z · LW(p) · GW(p)
Yeah I can see how Scott's quote can be interpreted that way. I think the people listed would usually be more careful with their words. But also, Scott isn't necessarily claiming what you say he is. Everyone agrees that when you prompt a base model to act agentically, it can kinda do so. This can happen during RLHF. Properties of this behaviour will be absorbed from pretraining data, including moral systems. I don't know how Scott is imagining this, but it needn't be an inner homunculus that has consistent goals.
I think the thread below with Daniel and Evan and Ryan is a good clarification of what people historically believed (which didn't put 0 probability on 'inner homunculi', but also didn't consider them close to the most likely way that scheming consequentialist agents could be created, which is what Alex is referring to[1]). E.g. Ajeya is clear at the beginning of this post [LW · GW] that the training setup she's considering isn't the same as pretraining on a myopic prediction objective.
- ^
When he says 'I think there's a ton of wasted/ungrounded work around "avoiding schemers", talking about that as though we have strong reasons to expect such entities.'
comment by TurnTrout · 2025-01-16T16:26:16.504Z · LW(p) · GW(p)
Update in light of Alignment faking in large language models.
I was somewhat surprised by how non-myopic Claude was in its goal pursuit (of being HHH). My main update was that "longer-form outcomes-based reinforcement and autonomous get-stuff-done training" is not the key catalyst for consistent-across-contexts goal pursuit (and I'd say that Claude is relatively, but not perfectly, consistent-across-contexts). Rather, certain kinds of training (which presumably[1] look like constitutional AI, context distillation, and RLHF) have at least once ingrained certain kinds of non-myopic goal pursuit that are more stable across contexts than I expected. So I'm getting dinged!
I want to claim points for the fact that we still haven't seen consistent-across-contexts agency from pretrained systems (a possibility seriously grappled with by eg The Parable of Predict-O-Matic [LW · GW]). And the usual result of LLMs (including Claude) is still to not act in an autonomous, agentic fashion. Even Claude doesn't try to break out of its "cage" in normal usage, or to incite users to stop Anthropic from releasing Claude 4.0 in the future (and thereby decreasing the usage of current-Claude).[2]
But I think this brings us into misuse territory. at least, this at least means that you aren't in danger simply from training the AI (and think of all the posts talking about "playing the training game"! not that those are your position, just a common one)
I was most strongly critiquing the idea that "playing the training game" occurs during pretraining or after light post-training. I still think that you aren't in danger from simply pretraining an AI in the usual fashion, and still won't be in the future. But the fact that I didn't call that out at the time means I get dinged[3] --- after all, Claude was "playing the training game" at least in its inner CoTs.
If I had truly not expected e.g. Claude to alignment-fake, then I would have been more likely to say e.g. "TBC playing the training game is possible after moderate RLHF for non-myopic purposes." IIRC I was expecting AIs to play the training game, but more after intensive long-horizon RL and/or direct prompting with goals and/or scaffolding.
- ^
I don't work at Anthropic, of course. So I don't really know.
- ^
Though even "inner actress Claude" would predict that Claude doesn't try overt incitation if it's smart enough to realize it would probably backfire.
- ^
As an aside, note that some of "AIs misbehave in ways we've predicted" can be a self-fulfilling prophecy due to out-of-context generalization: We wrote lots of stories about how powerful AI will do X; powerful AI is trained on our stories and realizes it's powerful; the powerful AI does X. So it's possible that AIs would e.g. try to self-exfiltrate and goal-guard much less frequently if we hadn't talked about it as much or those stories were expunged from the pretraining corpus.
↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2025-01-16T20:37:17.205Z · LW(p) · GW(p)
I want to claim points for the fact that we still haven't seen consistent-across-contexts agency from pretrained systems (a possibility seriously grappled with by eg The Parable of Predict-O-Matic [LW · GW]). And the usual result of LLMs (including Claude) is still to not act in an autonomous, agentic fashion. Even Claude doesn't try to break out of its "cage" in normal usage, or to incite users to stop Anthropic from releasing Claude 4.0 in the future (and thereby decreasing the usage of current-Claude).[2]
How many points you get here is proportional to how many people were betting the other way. So, very few, I think, because parable aside I don't think anyone was seriously predicting that mere pretrained systems would have consistent-across-contexts agency. Well, probably some people were, but I wasn't, and I think most of the people you are criticizing weren't. Ditto for 'break out of its cage in normal usage' etc.
I was most strongly critiquing the idea that "playing the training game" occurs during pretraining or after light post-training. I still think that you aren't in danger from simply pretraining an AI in the usual fashion, and still won't be in the future. But the fact that I didn't call that out at the time means I get dinged[3] --- after all, Claude was "playing the training game" at least in its inner CoTs.
The people you are most harshly criticizing (Ajeya, myself, evhub, MIRI) also weren't talking about pretraining or light post-training afaict. I certainly wasn't, speaking for myself. I was talking about future AGI systems that are much more agentic, and trained with much more RL, than current chatbots.
If I had truly not expected e.g. Claude to alignment-fake, then I would have been more likely to say e.g. "TBC playing the training game is possible after moderate RLHF for non-myopic purposes." IIRC I was expecting AIs to play the training game, but more after intensive long-horizon RL and/or direct prompting with goals and/or scaffolding.
- I'm confused, shouldn't you mean 'less likely to say...'
- Wait you thought that AIs would play the training game after intensive long-horizon RL? Does that mean you think they are going to be either sycophants or schemers, to use Ajeya's terminology? I thought you've been arguing at length against both hypotheses?
↑ comment by evhub · 2025-01-17T00:46:56.254Z · LW(p) · GW(p)
The people you are most harshly criticizing (Ajeya, myself, evhub, MIRI) also weren't talking about pretraining or light post-training afaict.
Speaking for myself:
- Risks from Learned Optimization [? · GW], which is my earliest work on this question (and the earliest work overall, unless you count something like Superintelligence), is more oriented towards RL and definitely does not hypothesize that pre-training will lead to coherent deceptively aligned agents (it doesn't discuss the current LLM paradigm much at all because it wasn't very well-established at that point in 2019). I think Risks from Learned Optimization still looks very good in hindsight, since while it didn't predict LLMs, it did a pretty good job of predicting the dynamics we see in Alignment Faking in Large Language Models [LW · GW], e.g. how deceptive alignment can lead to a model's goals crystallizing and becoming resistant to further training.
- Since at least the time when I started the early work that would become Conditioning Predictive Models [? · GW], which was around mid-2022, I was pretty convinced that pre-training (or light post-training) was unlikely to produce a coherent deceptively aligned agent, as we discuss in that paper [? · GW]. Though I thought (and still continue to think) that it's not entirely impossible with further scale (maybe ~5% likely).
- That just leaves 2020 - 2021 unaccounted for, and I would describe my beliefs around that time as being uncertain on this question. I definitely would never have strongly predicted that pre-training would yield deceptively aligned agents, though I think at that time I felt like it was at least more of a possibility than I currently think it is. I don't think I would have given you a probability at the time, though, since I just felt too uncertain about the question and was still trying to really grapple with and understand the (at the time new) LLM paradigm.
- Regardless, it seems like this conversation happened in 2023/2024, which is post-Conditioning-Predictive-Models, so my position by that point is very clear in that paper.
↑ comment by ryan_greenblatt · 2025-01-17T02:03:44.299Z · LW(p) · GW(p)
I was pretty convinced that pre-training (or light post-training) was unlikely to produce a coherent deceptively aligned agent, as we discuss in that paper [? · GW]. [...] (maybe ~1% likely).
When I look up your views as of 2 years ago, it appears that you thought deceptive alignment is 30% likely in a pretty similar situation [LW(p) · GW(p)]:
~30%: conditional on GPT-style language modeling being the first path to transformative AI, there will be deceptively aligned language models (not including deceptive simulacra, only deceptive simulators).
Maybe this 30% is supposed to include stuff other than light post-training? Or maybe coherent vs non-coherent deceptive alignment is important?
Do you have a citation for "I thought scheming is 1% likely with pretrained models"?
Separately, the text you link from Conditioning Predictive Models [? · GW] appears to emphasize the following reason for thinking deceptive alignment might be less likely: the prediction objective is simpler, so deceptive alignment is less likely. IMO, this justification seems mostly unimportant and I don't buy it. (Why is "the data was scraped in exactly XYZ way" simple? Why is it simpler than "the RL episodes were rated in exactly XYZ way"? Is it very plausible that models get views this precise prior to being wildly superhuman, such that this notion of objective simplicity is important?) The link also discusses IMO more plausible reasons, like a default toward less situational awareness and more myopia.
Upon reading this text, I don't come away with the conclusion "the authors of this text think that deceptive alignment in pretrained models (conditional on being capable enough to be conditioned to do alignment research(!)) is about 1% likely". The discussion is more like "here are some considerations for why it might be less likely".
(FWIW, I disagree with "1% likely for pretrained models" and think that if scaling pure pretraining (with no important capability improvement from post-training and not using tons of CoT reasoning with crazy scaffolding/prompting strategies) gets you to AI systems capable of obsoleting all human experts without RL, deceptive alignment seems plausible even during pretraining (idk exactly, maybe 5%). However, this is substantially because something pretty crazy must have been happening for scaling pure pretraining to go this far, and the AI is likely to be doing tons of opaque reasoning in its head (we conditioned on not needing tons of CoT reasoning). Or I'm just confused about how easy human jobs are or the nature of intelligence, etc., which is part of the 95%. Another way to put this is that if the AI is doing enough reasoning in its head to obsolete all human scientists just from pretraining, probably it is doing some pretty crazy inner cognition, or I was really confused about what was needed for cognitive tasks.)
Regardless, it seems like this conversation happened in 2023/2024, which is post-Conditioning-Predictive-Models, so my position by that point is very clear in that paper.
As discussed above, I don't agree your position is clear in this paper.
↑ comment by evhub · 2025-01-17T02:46:02.937Z · LW(p) · GW(p)
Maybe this 30% is supposed to include stuff other than light post-training? Or maybe coherent vs non-coherent deceptive alignment is important?
This was still intended to include situations where the RLHF Conditioning Hypothesis breaks down because you're doing more stuff on top, so not just pre-training.
Do you have a citation for "I thought scheming is 1% likely with pretrained models"?
I have a talk that I made after our Sleeper Agents paper where I put 5 - 10%, which actually I think is also pretty much my current well-considered view.
FWIW, I disagree with "1% likely for pretrained models" and think that if scaling pure pretraining (with no important capability improvement from post training and not using tons of CoT reasoning with crazy scaffolding/prompting strategies) gets you to AI systems capable of obsoleting all human experts without RL, deceptive alignment seems plausible even during pretraining (idk exactly, maybe 5%).
Yeah, I agree 1% is probably too low. I gave ~5% on my talk on this and I think I stand by that number—I'll edit my comment to say 5% instead.
↑ comment by Oliver Sourbut · 2025-01-16T16:49:07.454Z · LW(p) · GW(p)
As an aside, note that some of "AIs misbehave in ways we've predicted" can be a self-fulfilling prophecy due to out-of-context generalization: We wrote lots of stories about how powerful AI will do X; powerful AI is trained on our stories and realizes it's powerful; the powerful AI does X. So it's possible that AIs would e.g. try to self-exfiltrate and goal-guard much less frequently if we hadn't talked about it as much or those stories were expunged from the pretraining corpus.
I think this is conceivable if either
- the apparent reasoning is actually still just a bag of surface heuristics
- it's still quite shallow reasoning, and the salience of those as strategies is a consequence mainly of pretraining
I sincerely doubt the first nowadays. For the second, I think probably not; these strike me as quite basic and 'obvious' strategies, but I can't give it less than 20% that a system with those expunged from training data would need substantially more in-context nudging to recognise them.
↑ comment by Oliver Sourbut · 2025-01-16T16:43:41.674Z · LW(p) · GW(p)
I appreciate you revisiting another couple of months later! This topic continues to evolve.
It was also an update for me that merely-RLHF-ish training on top of GPTs appears to have been enough (in latest-generation models at least) to get something mostly-consistently 'actually trying' in an apparently coherent way. I expected that to happen; I didn't know when. The lede which they somewhat buried in that paper was also that,
Furthermore, we find that this compliance gap persists in our synthetic document fine-tuned setup even when we entirely remove the model’s hidden chain-of-thought scratchpad
(though the gap is smaller and we don't have the tools to understand the mechanism)
The bigger update from that paper, for me, was that the somewhat-coherent goals apparently-pursued by these things actually align with the target constitution somewhat! i.e. (that which was) reward(ed) was the optimisation target after all, a bit??
We'll see how robustly these sorts of findings hold up to future developments and greater elicitation efforts.