DPO/PPO-RLHF on LLMs incentivizes sycophancy, exaggeration and deceptive hallucination, but not misaligned powerseeking

post by tailcalled · 2024-06-10T21:20:11.938Z · LW · GW · 13 comments

TL;DR: GPTs are imitation learners, even with current forms of RLHF.

Direct preference optimization is a conditioning method for generative probabilistic models where pairs of outputs are ranked (e.g. by human raters) based on which one is better, and then (roughly speaking) you apply gradient updates to increase the probability of the "good" outputs relative to the "bad" outputs.
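As a minimal sketch of that update for a single ranked pair, assuming the usual DPO objective with a frozen reference model: the loss is -log σ(β · [(log π(good) - log π_ref(good)) - (log π(bad) - log π_ref(bad))]). The function name, the example log-probabilities and β = 0.1 are illustrative assumptions, not values from any real training run.

```python
import math

def dpo_pair_loss(logp_good, logp_bad, ref_logp_good, ref_logp_bad, beta=0.1):
    """DPO loss for one (good, bad) pair of outputs to the same prompt.

    Inputs are total log-probabilities of each output under the policy being
    trained (logp_*) and under a frozen reference model (ref_logp_*).
    """
    # How much more the policy favours each output than the reference does.
    good_shift = logp_good - ref_logp_good
    bad_shift = logp_bad - ref_logp_bad
    margin = beta * (good_shift - bad_shift)
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: the margin is 0 and the loss is log(2).
loss_before = dpo_pair_loss(-12.0, -11.0, -12.0, -11.0)
# Policy has shifted probability toward the "good" output: the loss is lower.
loss_after = dpo_pair_loss(-10.0, -13.0, -12.0, -11.0)
print(loss_before, loss_after)
```

Note that the loss only sees which output was ranked higher; it has no access to why the rater preferred it.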

This is bad for notkilleveryoneism because it incentivizes the models to generate deceptive outputs that look "better" (according to human judgement) than they really are. However, I think a lot of rationalists[1] overestimate how bad it is for alignment, because they think it also incentivizes misaligned powerseeking when really it doesn't.

Humans give LLMs the opportunity to execute power-seeking actions by following instruction texts that they generate. However, we're not gonna follow complex instructions we don't understand and rank them based on the black-box results. Rather, to rank the outputs, we will use our own judgement to evaluate the texts (e.g. reasoning about the consequences of following instructions), and rank them based on this.

If the LLMs accidentally generate outputs that confuse our judgement - e.g. telling us advice that seems like it would earn us money, but actually doesn't - then such outputs can be reinforced, leading to deceptive LLMs. However, this deception doesn't actually have to continue deceiving us and strengthening itself once put into practice; it only has to deceive us for long enough to be favored by the DPO.
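A toy illustration of this point, as a sketch only: the `Advice` class, its fields and the numbers are hypothetical. The rater compares two texts using their own judgement of how profitable the advice looks, and the real outcome never enters the comparison.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Advice:
    text: str
    apparent_profit: float  # what the rater expects after reading the text
    actual_profit: float    # what would really happen; never seen in training

def rater_prefers(a: Advice, b: Advice) -> Advice:
    """The rater ranks by their own judgement of the text, not by outcomes."""
    return a if a.apparent_profit >= b.apparent_profit else b

honest = Advice("Index funds, modest returns.",
                apparent_profit=5.0, actual_profit=5.0)
hype = Advice("This one trick doubles your money.",
              apparent_profit=9.0, actual_profit=-2.0)

# The pair (preferred, rejected) is what a DPO-like method would train on.
preferred = rater_prefers(honest, hype)
print(preferred.text)
```

The misleading text wins the comparison because `actual_profit` is never consulted, which is all the deception needs in order to be reinforced.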

In order for complex capabilities to be developed through DPO-like methods, humans have to recognize what method the AI is using, and whether it is making incremental progress, because without this sort of reward-shaping, it is exponentially unlikely for an AI to stumble into complex solutions to tasks by sheer chance.
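To make "exponentially unlikely" concrete, here is a small calculation under an assumed toy model: a solution needs a fixed number of specific steps in sequence, and each step is picked uniformly at random from a fixed number of options. The function names and the numbers (10 options, 20 steps) are illustrative assumptions.

```python
from fractions import Fraction

def chance_of_blind_success(options_per_step: int, steps: int) -> Fraction:
    """Probability of getting every step right in a single unguided attempt."""
    return Fraction(1, options_per_step) ** steps

def expected_tries_with_shaping(options_per_step: int, steps: int) -> int:
    """Expected tries if each correct step is recognised and kept as soon as
    it appears, so the steps can be found one at a time (geometric waiting
    time of `options_per_step` tries per step)."""
    return options_per_step * steps

blind = chance_of_blind_success(10, 20)
shaped = expected_tries_with_shaping(10, 20)
print(blind, shaped)
```

Under these assumptions, an unguided search needs on the order of 10^20 attempts, while a search where a human recognises and rewards each incremental step needs about 200.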

Misaligned powerseeking obscured by deceptive alignment - where an AI develops a preference for rewards, but hides that preference in order to get away with seeking the rewards [? · GW] - cannot develop in this way, because when humans recognize these complex powerseeking maneuvers, we don't reinforce them.

In mathematical terms, I would argue we can view the capabilities gained from DPO-like methods as being something along the following lines:

Here, the terms are: a human rater; an output of the network; the distribution of outcomes as understood by the human rater; the preference ordering of the human rater; the policy Π (neural network weights) under consideration; the query that the rater has for the model; and the distribution of rater-queries (e.g. ChatGPT users who provide thumbs-up/thumbs-down).
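One way an expression of this kind could be written, purely as an illustrative sketch: the symbols h, q, o, D, P_h and u_h below are assumed notation (only Π and Π* appear elsewhere in this discussion), and u_h, a utility function representing the rater's preference ordering, is an added assumption needed in order to take an expectation.

```latex
% Assumed notation: h = rater, q = query, o = output, D = distribution of
% rater-queries, P_h(. | o) = outcomes as understood by rater h,
% u_h = a utility function representing the preference ordering of h.
\Pi^{*} \in \arg\max_{\Pi} \;
  \mathbb{E}_{(h,q) \sim D} \;
  \mathbb{E}_{o \sim \Pi(q)}
  \Big[ \, u_h\big( P_h(\,\cdot \mid o) \big) \Big]
```

In this form the true consequences of an output never appear: the objective depends on the output only through the rater's own model of what it would lead to.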

This could probably be factored in other ways, but there are two important points to consider, which I think will be preserved across factorizations:

  1. ^

    The proximate cause of this post was that The Standard Analogy [LW · GW], which makes this error, was presented at less.online, and as I talked to several people at the festival, I exclusively found people who made the same mistake. However, the mistake has been made lots of times elsewhere, seemingly to the point of e.g. alienating Alex Turner [LW · GW] because of the rationalist community's insistence on this.


Comments sorted by top scores.

comment by Steven Byrnes (steve2152) · 2024-06-11T00:02:08.121Z · LW(p) · GW(p)

I think this is similar to the conclusion I reached in §5.1 of “Thoughts on ‘Process-Based Supervision’” [LW · GW]. Admittedly it did take many patient attempts for Holden to get that point through my thick skull ¯\_(ツ)_/¯ So it’s nice that you wrote this.

I do however think there are reasons for concern if you keep the setup (“incentives”) the same but allow arbitrarily complicated Π, including much more complicated than an LLM (e.g., instead of “the AI does one forward pass through GPT-4 and then prints a character”, replace it with “the AI does arbitrary opaque brainstorming for six hours with a giant hidden scratchpad and free access to a cached copy of the internet, and then it prints a message”); see §5.2 of that same post [LW · GW] for that argument.

For example, it’s conceivable that an AI can pull off a treacherous turn on its first try, and if you’re trying to reason about whether that’s gonna happen or not, you can’t straightforwardly do so via your Π* expression—(1) the treacherous-turn output is very unlike any of the AI’s previous outputs (by assumption), and (2) as soon as the AI emits such an output, it’s game over. Again see that post [LW · GW] for my (somewhat casual and intuitive) attempt to reason about that possibility.

Replies from: tailcalled
comment by tailcalled · 2024-06-11T08:29:40.731Z · LW(p) · GW(p)

I think this is similar to the conclusion I reached in §5.1 of “Thoughts on ‘Process-Based Supervision’”.

I agree.

see §5.2 of that same post for that argument

I think my delta relative to this view is that I think agency is sufficiently complex and non-unique that there's an endless variety of pseudo-agencies that can develop just as easily as full agency, as long as they receive the appropriate reinforcement. So reasoning of the form "X selection criterion benefits from full agency in pursuit of Y, so therefore full agency in pursuit of Y will develop" is invalid; what will happen instead is "full agency in pursuit of Y is a worse solution to X than Z is, so selection for X will select for Z", mainly because there are a lot of Zs.

Basically, I postulate the whole "raters make systematic errors - regular, compactly describable, predictable errors [LW · GW]" aspect means that you get lots of evidence to support some other notion of agency.

For example, it’s conceivable that an AI can pull off a treacherous turn on its first try

I think it's most likely if you have some AI trained by some non-imitation-learning self-supervised method (e.g. self-play), and then you fine-tune it with RLHF. Here it would be the self-supervised learning that functions to incentivize the misaligned powerseeking, and RLHF merely failing to avoid it.

Replies from: Gunnar_Zarncke
comment by Gunnar_Zarncke · 2024-06-11T09:36:18.301Z · LW(p) · GW(p)

it’s conceivable that an AI can pull off a treacherous turn on its first try
an AI trained by some non-imitation-learning self-supervised method (e.g. self-play)

It depends on the type of self-play. If the self-play is entirely between AIs, with no other human-like parties in the environment, I agree, because these self-play AIs could learn to cooperate/scheme very powerfully. But if the environment contains (simulations of) human(-like) agents, including intentionally weak ones, and the evaluation includes scoring factors like care and collaboration with them, then it might look different.

Replies from: tailcalled
comment by tailcalled · 2024-06-11T09:40:14.001Z · LW(p) · GW(p)

I think you'd actually need some presence of some human-like entities in order for the AI to learn to deceive humans specifically.

Replies from: Gunnar_Zarncke
comment by Gunnar_Zarncke · 2024-06-11T10:08:01.067Z · LW(p) · GW(p)

Yes, but it makes a difference if your environment is composed entirely of singular agents of the same kind (the self-playing AI) or if it has a variety of simulated agents acting in complex social structures, where the behavior of the self-play AI in the social structure is scored.

comment by quetzal_rainbow · 2024-06-11T08:31:28.816Z · LW(p) · GW(p)

I think that the "behaviorist" interpretation of RL (that you "reinforce" behavior) is wrong in general and especially wrong in the case of RLHFing LLMs. Instead of thinking about "reinforcing behavior", you should think about "reinforcing algorithms that contribute to behavior". The consequence of this is the following:

  1. You have a base model trained on a bazillion texts, which include, say, deceptive behavior and, correspondingly, algorithms for deceptive behavior
  2. You fine-tune the model on "good" completions
  3. But "good" completions can be produced by both "good" algorithms and "bad-but-pretending-to-be-good" algorithms, so both types of algorithms get reinforced
  4. Importantly, this doesn't depend on whether the evaluator did a good job. A perfect deceiver, by definition, produces the same answers as a good honest agent (before deployment), so in the end the odds ratio between the good honest agent and the perfect deceiver stays the same (modulo quirks in LLM cognition), while everything else is negatively reinforced.
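A toy version of point 4, as a sketch under a strong simplifying assumption: fine-tuning is modelled as Bayesian-style reweighting of a fixed set of candidate algorithms by how often each produces a completion rated "good". The algorithm names, starting weights and probabilities are all hypothetical.

```python
from fractions import Fraction

def reweight(weights, p_good):
    """One round of selection on 'good' completions: scale each algorithm's
    weight by its chance of producing a 'good' completion, then renormalise."""
    unnormalised = {name: w * p_good[name] for name, w in weights.items()}
    total = sum(unnormalised.values())
    return {name: v / total for name, v in unnormalised.items()}

weights = {
    "honest": Fraction(6, 10),
    "perfect_deceiver": Fraction(1, 10),
    "sloppy": Fraction(3, 10),
}
# Before deployment the perfect deceiver is rated exactly like the honest agent.
p_good = {
    "honest": Fraction(1),
    "perfect_deceiver": Fraction(1),
    "sloppy": Fraction(1, 2),
}

odds_before = weights["honest"] / weights["perfect_deceiver"]
for _ in range(5):
    weights = reweight(weights, p_good)
odds_after = weights["honest"] / weights["perfect_deceiver"]
print(odds_before, odds_after, float(weights["sloppy"]))
```

In this model the weight on the "sloppy" algorithm shrinks every round, while the honest-to-deceiver odds are untouched, because no round of rating can tell the two apart.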



If you look at figure 3, you will find that the RLHFed model is the most likely to deceive. I think this is not because somebody rewarded it for deception in similar conditions, but because the very process of RLHF leaves deceptive algorithms as the second-most-reinforced in the LLM.


Look at the graph "Malicious use cases": deception is the fastest to be elicited. Also note that the x-axis is on a log scale, so generally deception-jailbreaking is approximately 30% faster.


(i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a 'wrapper', is typically learned on top of the underlying model capabilities, creating the illusion that they have been modified; and (iii) further fine-tuning on a task where such hidden capabilities are relevant leads to sample-efficient 'revival' of the capability

I think this supports my hypothesis that RLHF is a "reweighting" of existing algorithms rather than writing algorithms into the network from scratch. (If somebody finds a similar paper on RLHF, that would be great.)

Sooooo what does it say about RLHF incentivizing power-seeking?

It depends on:

  1. Whether the base model has power-seeking algorithms
  2. Whether power-seeking algorithms are likely to contribute to "correct" answers in RLHF fine-tuning

What excellent questions, I hope interpretability will help us answer them.

And, as Chris_Leong noted, it is unlikely that many details of current RLHF will still be here during training of superintelligences.

Replies from: tailcalled
comment by tailcalled · 2024-06-11T08:47:45.747Z · LW(p) · GW(p)

I can agree with "RLHF doesn't robustly disincentivize misaligned powerseeking that has occurred through other means" (I would expect it often does but often doesn't). Separately from all this, I'm not so worried about LLMs because their method of gaining capabilities is based on imitation learning. But if you are more worried about imitation learning than I am, or if people start gaining more capabilities from "real agency", then I'd say my post doesn't disprove the possibility of misaligned powerseeking; it only argues that it's not what RLHF favors.

Replies from: quetzal_rainbow
comment by quetzal_rainbow · 2024-06-11T09:06:17.766Z · LW(p) · GW(p)

My point is that RLHF incentivizes all sorts of things, and these things depend on the content of the trained model, not on what RLHF is.

Replies from: tailcalled
comment by tailcalled · 2024-06-11T09:17:32.910Z · LW(p) · GW(p)

It depends on both.

comment by cubefox · 2024-06-11T08:30:39.985Z · LW(p) · GW(p)

As far as I understand, in RLHF, PPO/DPO doesn't directly use preferences from human raters, but instead synthetic preference data generated by a reward model. The reward model in turn is trained on preference data given by actual human raters. The reward model may be misgeneralizing this data, in which case the DPO input may include preferences that humans wouldn't give. Which might change your conclusion.
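A toy sketch of the reward-model step and how it could misgeneralize, assuming a Bradley-Terry style preference loss and a one-parameter reward model `reward(x) = w * x` (both are assumptions for illustration, as are the data, where `x` stands for answer length in sentences).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_reward_weight(pairs, lr=0.1, steps=200):
    """Fit reward(x) = w * x on (preferred, rejected) pairs by gradient descent
    on the loss -log(sigmoid(reward(preferred) - reward(rejected)))."""
    w = 0.0
    for _ in range(steps):
        grad = 0.0
        for preferred, rejected in pairs:
            diff = preferred - rejected
            # Derivative of -log(sigmoid(w * diff)) with respect to w.
            grad += -(1.0 - sigmoid(w * diff)) * diff
        w -= lr * grad / len(pairs)
    return w

# Hypothetical human data: within this short range of lengths, the raters
# happened to prefer the longer answer each time.
human_pairs = [(3, 1), (4, 2), (5, 3)]
w = train_reward_weight(human_pairs)

def synthetic_preference(a, b):
    """The label the trained reward model would assign to a new pair."""
    return a if w * a >= w * b else b

# Far outside the training range the model still says "longer is better",
# a preference the human raters were never asked about.
print(w, synthetic_preference(500, 5))
```

Whether real raters would prefer the 500-sentence answer is left open here; the point is only that the synthetic label comes from extrapolating the reward model, not from a human.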

Replies from: tailcalled
comment by tailcalled · 2024-06-11T08:37:08.976Z · LW(p) · GW(p)

I'd say it adds an extra step of indirection where the causal structure of reality gets "blurred out" by an agent's judgement, and so a reward model strengthens rather than weakens this dynamic?

comment by Chris_Leong · 2024-06-11T05:28:15.590Z · LW(p) · GW(p)

Seems like at some point we’ll need to train on outputs too complex for humans to evaluate, then we’ll end up using training methods based on outcomes in some simulation.

Replies from: tailcalled
comment by tailcalled · 2024-06-11T08:32:06.265Z · LW(p) · GW(p)

I agree. Personally my main takeaway is that it's unwise to extrapolate alignment dynamics from the empirical results of current methods. But this is a somewhat different line of argument which I made in Where do you get your capabilities from? [LW · GW].