AGI with RL is Bad News for Safety

post by Nadav Brandes (nadav-brandes) · 2024-12-21T19:36:03.970Z · LW · GW · 16 comments

I haven’t found many credible reports on what algorithms and techniques have been used to train the latest generation of powerful AI models (including OpenAI’s o3). Some reports suggest that reinforcement learning (RL) has been a key part, which is also consistent with what OpenAI officially reported about o1 three months ago.

The use of RL to enhance the capabilities of AGI[1] appears to be a concerning development. As I wrote previously [LW · GW], I have been hoping to see AI labs stick to training models through pure language modeling. By “pure language modeling” I don’t rule out fine-tuning with RLHF or other techniques designed to promote helpfulness/alignment, as long as they don’t dramatically enhance capabilities. I’m also okay with the LLMs being used as part of more complex AI systems that invoke many instances of the underlying LLMs through chain-of-thought and other techniques. What I find worrisome is the underlying models themselves trained to become more capable through open-ended RL.

The key argument in my original post was that AI systems based on pure language modeling are relatively safe because they are trained to mimic content generated by human-level intelligence. This leads to weak pressure to surpass human level. Even if we enhance their capabilities by composing together many LLM operations (as in chain of thought), each atomic operation in these complex reasoning structures would be made by a simple LLM that only tries to generate a good next token. Moreover, the reasoning is done with language we can read and understand, so it’s relatively easy to monitor these systems. The underlying LLMs have no reason to lie in very strategic ways[2], because they are not trained to plan ahead. There is also no reason for LLMs to become agentic, because at their core they are just prediction machines.

Put RL into the core training algorithm and everything changes. Models are now explicitly trained to plan ahead, which makes all kinds of strategic and agentic behaviors actively rewarded. At this point we can no longer trust the chains of thought to represent their true reasoning, because models are now rewarded based on the final results that these chains lead to. Even if you put a constraint requiring the intermediate tokens to appear like logical reasoning, the models may find ways to produce seemingly-logical tokens that encode additional side information useful for the problem they are trying to solve. Human-level intelligence also ceases to be an important mark, because RL is about solving problems, not mimicking humans. In other words, we are now encouraging models to race towards superhuman capabilities.

For all of these reasons, a world in which frontier AI labs use open-ended RL to enhance their models seems much more likely to end up with AI systems developing capabilities no one intended.

My original post from 20 months ago (Language Models are a Potentially Safe Path to Human-Level AGI [LW · GW]) elaborates on all the points I briefly made here. I don’t claim any particular novelty in what I wrote there, but I think it mostly stood the test of time[3] (despite being published only 5 months after the initial release of ChatGPT). In particular, I still think that pure LLMs (and more complex AI systems based on chaining LLM outputs together) are relatively safe, and that humanity would be safer sticking with them for the time being.

  1. ^

    Artificial general intelligence (AGI) means totally different things to different people. When I use this term, the emphasis is on “general”, regardless of the model’s strength and whether it exceeds or falls below human level. For example, I consider GPT-3 and GPT-4 to be forms of AGI because they have general knowledge about the world.

  2. ^

    LLMs do lie a lot (and hallucinate), but mostly in cute and naive ways. All the examples of untruthful behavior I have seen in LLMs so far seem perfectly consistent with the assumption that they just want a good reward for the next token (judged through RLHF). I haven’t seen any evidence of LLMs lying in strategic ways, which I define as trying to pursue long-term goals beyond having their next few tokens receive a higher reward than they deserve.

  3. ^

    Perhaps I was overly optimistic about the amount of economic activity focused on producing more capable AI systems without more capable underlying models.

16 comments

comment by Seth Herd · 2024-12-21T21:13:43.768Z · LW(p) · GW(p)

I think you're absolutely right.

And I think there's approximately a 0% chance humanity will stop at pure language models, or even stop at o1 and o3, which very likely use RL to dramatically enhance capabilities.

Because they use RL not to accomplish things-in-the-world but to arrive at correct answers to the questions they're posed, the concerns you express (which pretty much anyone who's been paying attention to AGI risk shares) are not fully in play.

OpenAI will continue on this path unless legislation stops them. And that's highly unlikely to happen, because the argument against it is just not strong enough to convince the public or legislators.

We are mostly applying optimization pressure to our AGI systems to follow instructions and produce correct answers. Framed that way, it sounds like it's as safe an approach as you could come up with for network-based AGI. I'm not saying it's safe, but I am saying it's hard to be sure it's not without more detailed arguments and analysis. Which is what I'm trying to do in my work.

Also as you say, it would be far safer to not make these things into agents. But the ease of doing so with a smart enough model and a prompt like "continue pursuing goal X using tools Y as appropriate to gather information and take actions" ensures that they will be turned into agents.

People want a system that actually does their work, not one that just tells them what to do. So they're going to make agents out of smart LLMs. This won't be stopped even with legislation; people will do it illegally or from jurisdictions that haven't passed the laws.

So we are going to have to both hope and plan for this approach, including RL for correct answers, being safe enough. Or come up with way stronger and more convincing arguments for why it won't be. I currently think it can be made safe in a realistic way with no major policy or research direction change. But I just don't know, because I haven't gotten enough people to engage deeply enough with the real difficulties and likely approaches.

Replies from: nadav-brandes
comment by Nadav Brandes (nadav-brandes) · 2024-12-21T21:37:55.528Z · LW(p) · GW(p)

Thank you Seth for the thoughtful reply. I largely agree with most of your points.

I agree that RL trained to accomplish things in the real world is far more dangerous than RL trained to just solve difficult mathematical problems (which in turn is more dangerous than vanilla language modeling). I worry that the real-world part will soon become commonplace, judging from current trends.

But even without the real-world part, models could still be incentivized to develop superhuman abilities and complex strategic thinking (which could be useful for solving mathematical and coding problems).

Regarding the chances of stopping/banning open-ended RL, I agree it's a very tall order, but my impression of the advocacy/policy landscape is that people might be open to it under the right conditions. At any rate, I wasn't trying to reason about what's reasonable to ask for, only about the implications of different paths. I think the discussion should start there, and then we can consider what's wise to advocate for.

For all of these reasons, I fully agree with you that work on demonstrating these risks in a rigorous and credible way is one of the most important efforts for AI safety.

comment by Dave Orr (dave-orr) · 2024-12-21T19:49:45.101Z · LW(p) · GW(p)

Why does RL necessarily mean that AIs are trained to plan ahead?

Replies from: nadav-brandes
comment by Nadav Brandes (nadav-brandes) · 2024-12-21T20:04:31.699Z · LW(p) · GW(p)

I explain it in more detail in my original post. 
In short, in standard language modeling the model only tries to predict the most likely immediate next token (T1), and then the most likely token after that (T2) given T1, and so on; whereas in RL it's trying to optimize a whole sequence of next tokens (T1, ..., Tn), so that the reward for the entire sequence (up to Tn) feeds back into the choice of the immediate next token (T1).
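
To make the contrast concrete, here's a sketch in standard notation (my notation, not anything specific to how o1/o3 were actually trained):

```latex
% Language modeling: each token is scored locally against the human-written continuation.
\mathcal{L}_{\mathrm{LM}}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t})

% Policy-gradient RL: every token's update is weighted by the return R(\tau) of the
% whole generated sequence \tau = (T_1, \dots, T_n), so early tokens are rewarded
% for setting up later success -- i.e., for planning ahead.
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=1}^{n} \nabla_\theta \log \pi_\theta(T_t \mid T_{<t}) \; R(\tau)\right]
```

In the first objective, nothing downstream of a token affects that token's gradient; in the second, it's exactly the downstream outcome that does.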

comment by purple fire (jack-edwards) · 2024-12-21T20:21:57.270Z · LW(p) · GW(p)

I disagree. I think the current approach, with chain-of-thought reasoning, is a marked improvement over naive language modelling in terms of alignment difficulty. CoT allows us to elicit higher capabilities out of the same level base text generation model, meaning less of the computation is done inside the black box and more is done in human-readable tokens. While this still (obviously) has risks, it seems preferable to models that fully internalize the reasoning process. Do you agree with that?

Replies from: Seth Herd, nadav-brandes
comment by Seth Herd · 2024-12-22T23:47:40.433Z · LW(p) · GW(p)

Yes, as long as there's a faithful chain of thought. But some RL techniques tend to create "jargon" and otherwise push toward a non-human-readable chain of thought. See Daniel Kokotajlo's work on faithful chain of thought, including his "shoggoth/face" training proposal. And there's 5 ways to improve CoT faithfulness [LW · GW], addressing the ways you can easily lose it.

There are also studies investigating training techniques that "make chain of thought more efficient" by driving the cognition into the hidden state so as to use fewer steps. A recent Meta study just did this. Here are some others:

Whirlwind Tour of Chain of Thought Literature Relevant to Automating Alignment Research. [LW · GW]

Replies from: jack-edwards
comment by purple fire (jack-edwards) · 2024-12-23T00:09:33.154Z · LW(p) · GW(p)

I don't disagree that there remains a lot of work to be done, I understand that CoT can be unfaithful, and I am generally against building very capable models that do CoT in latent space,[1] like the Meta paper does. Emphatically, I do not think "alignment is solved" just because o3 reasons out loud, or something.

But, in my view, the research that needs to happen between here and aligned AGI is much more tractable with a weak forward pass and RL-trained CoT as opposed to a highly capable forward pass without RL. I can see an actual path forward to aligning AGI that works like the o-series model, and considering how recently this even became a research topic I think the work that's already been done is quite promising, including many of Daniel's proposals.

  1. ^

    This is a very general statement, there are lots of caveats and nuances, but I suspect we already agree on the broad strokes.

Replies from: Seth Herd
comment by Seth Herd · 2024-12-23T00:42:57.901Z · LW(p) · GW(p)

I think we are in agreement. I see a path to successful technical alignment through agentized LLMs/foundation models. The brief take is at Agentized LLMs will change the alignment landscape [LW · GW], and there's more on the several overlapping alignment techniques in Internal independent review for language model agent alignment [AF · GW] and reasons to think they'll advance rapidly at Capabilities and alignment of LLM cognitive architectures [LW · GW].

I think it's very useful to have a faithful chain of thought, but I also think all isn't lost without it.

Here's my current framing: we are actually teaching LLMs mostly to answer questions correctly, and to follow our instructions as we intended them. Optimizing those things too hard would probably cause our doom by creating interpretations of those pressures we didn't intend or foresee. This is the classic agent-foundations worry, and I think it's valid. But we won't do infinite optimization all in one go; we'll have a human-level agent that's on our side to help us with its improvement or next generation, and so on. If we keep focusing on instruction-following as not only the training pressure but the explicit top-level goal we give it, we have the AGI on our side to anticipate unexpected consequences of optimization, and it will be "corrigible" as long as it's approximately following instructions as intended. This gives us second chances and superhuman help maintaining alignment. So I think that alignment target is more important than the technical approach: Instruction-following AGI is easier and more likely than value aligned AGI [LW · GW]

comment by Nadav Brandes (nadav-brandes) · 2024-12-21T20:59:11.563Z · LW(p) · GW(p)

I'm all in for CoT! But when you RL a model to produce CoTs that are better at solving difficult problems, you risk encouraging i) CoTs that carry side information that's only known to the model, ii) superhuman capabilities, and iii) planning ahead and agency in ways that are difficult to anticipate. Whether you produce the CoT with a pure LLM or a model that's also undergone RL, you end up with a series of tokens where each token was produced by a fully internalized reasoning process. The only difference is that RL encourages this reasoning process to be more powerful, more agentic, and less predictable. What's the advantage of RL for safety?

Replies from: jack-edwards
comment by purple fire (jack-edwards) · 2024-12-22T00:01:11.751Z · LW(p) · GW(p)

"you risk encouraging i) CoTs that carry side information that's only known to the model"

This is true by default, but not intractable. For example, you can train the CoT model with periodic paraphrasing to avoid steganography, or you can train a CoT model just for capabilities and introduce a separate model that reasons about safety. Daniel Kokotajlo has some nice writeups about these ideas, he calls it the Shoggoth/Face framework.
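
As a rough illustration of the paraphrasing idea (a sketch only, with hypothetical `reasoner`/`paraphraser` interfaces, not anyone's actual pipeline): each reasoning step is rewritten by an independent model before it re-enters the context, so information hidden in exact wording is unlikely to survive.

```python
# Sketch only: `reasoner` and `paraphraser` are hypothetical model wrappers, not a real API.
# Rewriting each step in fresh words makes it harder for steganographic side information
# (encoded in precise token choices) to persist across the chain of thought.

def paraphrased_cot(question: str, reasoner, paraphraser, max_steps: int = 10) -> str:
    context = question
    for _ in range(max_steps):
        step = reasoner.next_step(context)   # propose the next reasoning step
        step = paraphraser.rewrite(step)     # independently reword it
        context += "\n" + step
        if reasoner.is_done(step):           # stop once a final answer appears
            break
    return context
```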

"superhuman capabilities"

Agreed that this would be bad, but conditional on this happening, it's better to do it with RL CoT than with black-box token generation.

"planning ahead and agency in ways that are difficult to anticipate"

Not sure why this would be the case--shouldn't having access to the model's thought process make this easier to anticipate than if the long-term plans were stored in neuralese across a bunch of transformer layers?

"RL encotages this reasoning process to be more powerful, more agentic, and less predictable"

This is something I agree with in the sense that our frontier models are trained with RL, and those models are also the most powerful and most agentic (since they're more capable), but I'm not convinced that this is inherent to RL training, and I'm not exactly sure in what way these models are less predictable.

Replies from: nadav-brandes
comment by Nadav Brandes (nadav-brandes) · 2024-12-22T00:50:14.960Z · LW(p) · GW(p)

Why are you conditioning on superhuman AGI emerging? I think it's something very dangerous that our society isn't ready for. We should pursue a path where we can enjoy as many of the benefits of sub-human-level AGI (of the kind we already have) without risking uncontrolled acceleration. Pushing for stronger capabilities with open-ended RL is counterproductive for the very scenario we need to avoid.

It sounds like you believe that training a model with RL would make it somehow more transparent, whereas I believe the opposite. Can you explain your reasoning?

Do you disagree that RL pushes models to be better at planning and exceeding human-level capabilities? 

I like the idea of discouraging steganography, but I still worry that given strong enough incentives, RL-trained models will find ways around this. 

Replies from: jack-edwards
comment by purple fire (jack-edwards) · 2024-12-22T04:09:21.764Z · LW(p) · GW(p)

I think we're working with a different set of premises, so I'll try to disentangle a few ideas.

First, I completely agree with you that building superhuman AGI carries a lot of risks, and that society broadly isn't prepared for the advent of AI models that can perform economically useful labor.

Unfortunately, economic and political incentives being what they are, capabilities research will continue to happen. My more specific claim is that conditional on AI being at a given capabilities level, I prefer to reach that level with less capable text generators and more advanced RL/scaffolding (e.g. o1) as opposed to simply more capable text generators (e.g. GPT-4). I believe that the former lends itself to stronger oversight mechanisms, more tractable interpretability techniques, and more robust safety interventions during real-world deployment.

"It sounds like you believe that training a model with RL would make it somehow more transparent, whereas I believe the opposite. Can you explain your reasoning?"

This might have come across wrong, and is a potential crux. Conditioning on a particular text-generation model, I would guess that applying RL increases the risk--for example, I would consider Gemini 2.0 Flash Thinking riskier than Gemini 2.0 Flash. But if you just showed me a bunch of eval results for an unknown model and asked how risky I thought the model was based on those, I would be more concerned about a fully black-box LLM than an RL CoT/scaffolded LM.

"Do you disagree that RL pushes models to be better at planning and exceeding human-level capabilities?"

No, it seems pretty clear that RL models like o3 are more capable than vanilla LLMs. So in a sense, I guess I think RL is bad because it increases capabilities faster, which I think is bad. But I still disagree that RL is worse for any theoretical reason beyond "it works better".

Tying this all back to your post, there are a few object-level claims that I continue to disagree with, but if I came to agree with you on them I would also change my mind more on the overall discussion. Specifically:

  • The underlying LLMs have no reason to lie in very strategic ways, because they are not trained to plan ahead. (In particular, I don't understand why you think this is true for what you call open-ended RL, but not for RLHF which you seem to be okay with.)
  • Human-level intelligence also ceases to be an important mark, because RL is about solving problems, not mimicking humans. (Again, isn't this already a problem with RLHF? And/or synthetic data?)
  • At this point we can no longer trust the chains of thought to represent their true reasoning, because models are now rewarded based on the final results that these chains lead to. Even if you put a constraint requiring the intermediate tokens to appear like logical reasoning, the models may find ways to produce seemingly-logical tokens that encode additional side information useful for the problem they are trying to solve. (I agree with this naively, but think this problem is a lot more tractable than e.g. interpretability on a 100b parameter transformer.)

    Of course, I'm more than open to hearing stronger arguments for these, and would happily change my mind if I saw convincing evidence.

Replies from: nadav-brandes
comment by Nadav Brandes (nadav-brandes) · 2024-12-22T15:25:49.747Z · LW(p) · GW(p)

I think we are making good progress on getting to the crux :)

It sounds to me like some of our disagreements may arise from us using terminology differently, so let me try to clarify.

 

"conditional on AI being at a given capabilities level, I prefer to reach that level with less capable text generators and more advanced RL/scaffolding"

I disagree in 2 ways:

1. It's wrong to condition on AI being at a given level of capabilities. When deciding which path is better, priority number one should be to not exceed a dangerous level of capabilities. Only then should we ask: given that capability level, which technology stack achieves it in ways that are otherwise safer (more transparent and controllable, less agentic).

2. I'm not exactly sure what you mean by scaffolding, but it sounds like you are conflating RL-based AI with what I call "compositional AI systems" (that's the term I used in my original post from 20 months ago). In retrospect it's probably not the best term, but for the sake of consistency let me stick with it. 

By "compositional AI systems" I refer to any piece of software that is using an underlying AI model (e.g. an LLM) to achieve much stronger capabilities than those the underlying model would achieve if used on its own as a black box. Common techniques include chain of thought and tool use (e.g. access to web search and other APIs), but in theory you can have infinitely complex systems (e.g. trees of thought). Now, the underlying model can be trained with relatively open-ended RL as in o1 and o3, or it can be a simple LLM that's only fine-tuned with a simple RLHF or constitutional AI. Whether the system is compositional and whether the underlying model is trained with open-ended RL is two different questions, and I argue that compositional AI systems are relatively safe, as long as you stick with simple LLMs. If you use open-ended RL to train the underlying model, it's far less safe. I totally agree with you that compositional AI systems are safer than end-to-end opaque models.

 

What's the difference between RLHF and open-ended RL? 

RLHF is technically a form of RL, but unlike what I call "open-ended RL", which is intended to make models more capable (as in o1/o3), RLHF doesn't really increase capabilities in significant ways. It's intended to just make models more aligned and helpful. It's generally believed that most of the knowledge of LLMs is learned during pre-training (with pure language modeling) and that RLHF mostly just encourages the models to use their knowledge and capabilities to help users (as opposed to just predicting the most likely next token you'd find for a similar prompt on the internet). What I'm worried about is RL that's intended to increase capabilities by rewarding models that manage to solve complex problems. That's what I call open-ended RL.
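
One way to write down the distinction in standard notation (a sketch, not any lab's actual objective):

```latex
% RLHF (sketch): r_\phi is a reward model trained from human preference comparisons,
% and the KL term keeps the policy close to the pretrained/SFT model \pi_{\mathrm{ref}}.
J_{\mathrm{RLHF}}(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] - \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)

% Open-ended RL (sketch): the reward is an outcome check (did the proof verify, did the
% code pass the tests?), with no anchor to human-like text beyond what helps get the answer.
J_{\mathrm{RL}}(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[R_{\mathrm{task}}(x, y)\big]
```

The KL-anchored objective keeps the model close to what pure language modeling produced; the outcome-based objective has no such anchor.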

Open-ended RL encourages models to plan ahead and be strategic, because these are useful skills for solving complex problems, but not so useful for producing a chunk of text that a human user would like (as in RLHF). That's also why I'm worried about strategic deception being much more likely to arise with open-ended RL. 

Replies from: jack-edwards
comment by purple fire (jack-edwards) · 2024-12-22T23:58:43.577Z · LW(p) · GW(p)

Likewise, I appreciate your willingness to explain your argument and the opportunity to explain mine has forced me to reflect more on what I believe, which is certainly a plus.

On the first point, I think the statement "priority number one should be to not exceed a dangerous level of capabilities" requires some nuance. For example, if an AI lab pretrained a model that had a dangerous level of capabilities, and then introduced some robust safety features that prevented the use of those capabilities, I think that's acceptable. In fact, this is already happening--GPT-4 by default is useful for building weapons, but typically refuses to reveal that knowledge to users (yes, jailbreaking etc. exists but the current RLHF is pretty good). So I would agree with an amended statement that "priority number one should be to not exceed a dangerous level of capabilities without correspondingly robust safety features." I also think that generic statements like "we shouldn't build dangerous AI" are somewhat akin to statements like "we should reduce homelessness"--if you have a specific intervention or policy in mind, I would be happy to discuss it!

I think your second point is basically our crux. It seems (correct me if I'm wrong!) that the way you are seeing RL models is something like training LLMs with RL enhances their capabilities and makes them more agentic, and this is bad because more capable models are bad. I don't disagree with this at face value, but I continue to think it misses an important capabilities/safety tradeoff. My view is more like training LLMs with RL enhances their capabilities and makes them more agentic, but is net positive because it incentivizes the development of easier-to-align CoT systems over harder-to-align base LLMs. Perhaps our crux is around this tradeoff--are CoT systems sufficiently easier to align that they are safer to develop despite being more capable and more agentic?

My answer is yes. To make sure we're on the same page,[1] I'm thinking about a CoT system with two parts, an internal reasoning model and an external output model. The reasoning model is initially trained primarily for helpfulness and honesty, with RL to incentivize more correct answers on math, science, historical reasoning, etc. Importantly, there is no safety reward in the initial training, to prevent deceptive alignment, steganography, etc. Therefore this reasoning model's output is potentially harmful but also faithful--it actually reflects how the model solved a problem. Then the output model is trained to convert the CoT into the user-facing output, and to detect safety issues, etc. (There may also be safety-relevant post-training in the reasoning model, e.g. https://openai.com/index/deliberative-alignment/)
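
As a rough sketch of the kind of system I'm describing (hypothetical `reasoner` and `output_model` handles, not any lab's actual implementation):

```python
# Sketch of the two-part CoT system described above. The reasoner is trained only for
# correct/honest answers (keeping its chain of thought faithful); the output model is
# responsible for safety filtering and for producing the user-facing response.

def two_stage_answer(question: str, reasoner, output_model) -> str:
    chain_of_thought = reasoner.solve(question)              # raw, potentially unsafe reasoning
    if not output_model.passes_safety_check(chain_of_thought):
        return output_model.refusal()                        # refuse rather than expose harm
    return output_model.summarize_for_user(chain_of_thought)
```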

Without a doubt, the internal reasoning model is more capable and more agentic--as you wrote, incentivizing over a long sequence of tokens rather than a few at a time makes models less safe and more agentic, which I agree is bad. And there are still problems--ensuring that the CoT is faithful, properly implementing safety specs in the output model, etc. However, aligning this style of system is a much more tractable task than aligning a base LLM, and it is heartening to me that these systems might work sufficiently well with smaller LLMs that nobody is ever incentivized to build Claude 6 or whatever. Do you disagree?

  1. ^

    On a meta note, I'm an ML researcher and so have some implicit assumptions about how these are constructed because of the specific models I work on. This could be making us both confused, so I'm trying to be more explicit about the systems I'm envisioning. For what it's worth, though, I think I would have the same opinion even if I were purely an outside observer.

Replies from: nadav-brandes
comment by Nadav Brandes (nadav-brandes) · 2024-12-23T14:59:51.527Z · LW(p) · GW(p)

I think our crux is that, for a reason I don't fully understand yet, you equate "CoT systems"[1] with RL training. Again, I fully agree with you that systems based on CoT (or other hard-coded software and prompt engineering over weaker underlying models) are much safer than black-box end-to-end models. But why do you assume that this requires open-ended RL[2]? Wouldn't you agree that it would be safer with simple LLMs that haven't undergone open-ended RL? I also like your idea of a CoT system with two parts, but again, I'd argue that you don't need open-ended RL for that.

Other points I somewhat disagree with, but I don't think are the core of the crux, so we probably shouldn't delve into:

  1. I still think that priority number one should be to not exceed a dangerous level of capabilities, even with robust safety features.
  2. I find both "we shouldn't build dangerous AI" and "we should reduce homelessness" to be useful statements. To come up with good plans, we first need to know what we're aiming for.
  1. ^

    I used the term "compositional AI systems", but for the sake of ease of communication, let's use your terminology. 

  2. ^

    See my definition above for open-ended RL.

Replies from: jack-edwards
comment by purple fire (jack-edwards) · 2024-12-23T17:55:36.528Z · LW(p) · GW(p)

I agree that this might be our crux, so I'll try to briefly explain my side. My view is still more or less training LLMs with RL enhances their capabilities and makes them more agentic, but is net positive because it incentivizes the development of easier-to-align CoT systems over harder-to-align base LLMs. I think this is only true with open-ended RL because:

  •  Regular CoT/prompt engineering is effective, but not so effective that I expect it to meaningfully change the incentive landscape for creating base models. For example, when people figured out that CoT improved benchmarks for GPT-3.5, I don't think that disincentivized the development of GPT-4. In contrast, I do think the creation of the o-series models (with open-ended RL) is actively disincentivizing the development of GPT-5, which I see as a good thing.
  • Open-ended RL might make compositional AI safer, not less safe. Done right, it discourages models from learning to reason strongly in the forward pass, which is imo the most dangerous capability.

Again, I agree that you don't need open-ended RL for CoT systems, but if you aren't using RL on the entire output then you need a more capable forward pass, and this seems bad. In effect, your options are:

  1. Build a regular model and then do CoT during inference (e.g. via prompt engineering)
  2. Build a model and reward it based on CoT during training with RL

Option 1 creates much more capable forward passes, Option 2 does not. I think we have a much better shot at aligning models built the second way.