Posts

AGI with RL is Bad News for Safety 2024-12-21T19:36:03.970Z
Language Models are a Potentially Safe Path to Human-Level AGI 2023-04-20T00:40:16.102Z

Comments

Comment by Nadav Brandes (nadav-brandes) on AGI with RL is Bad News for Safety · 2024-12-25T16:27:07.558Z · LW · GW

I totally agree. I think it's another framing for why open-ended RL is much more dangerous than pure LLMs. Models trained with open-ended RL are rewarded based on their final results, and will produce any series of tokens that helps achieve them. They are incentivized to produce CoTs that do not faithfully represent their true thinking.

Pure LLMs, on the other hand, have no desire other than making the next token as likely as possible, given whatever character they are currently simulating. Whether they are "good" or "bad", I can't see anything in their training that would incentivize them to develop an efficient secret language to communicate with their future selves in a way that produces misleading CoTs.

Comment by Nadav Brandes (nadav-brandes) on AGI with RL is Bad News for Safety · 2024-12-25T16:17:20.046Z · LW · GW

Before I attempt to respond to your objections, I want to first make sure that I understand your reasoning.

I think you're saying that in theory it would be better to have CoT systems based on pure LLMs, but you don't expect these to be powerful enough without open-ended RL, so this approach won't be incentivized and will die out from competition with AI labs that do use open-ended RL. Is that a faithful summary of (part of) your view?

You are also saying that if done right, open-ended RL discourages models from learning to reason strongly in the forward pass. Can you explain what you mean exactly and why you think that?

I think you are also saying that models trained with open-ended RL are easier to align than pure LLMs. Is it because you expect them to be overall more capable (and therefore easier to do anything with, including alignment), or for another reason? 

In case it helps to clarify our crux, I'd like to add that I agree with you that AI systems without open-ended RL would likely be much weaker than those with it, so I'm definitely expecting incentives to push more and more AI labs to use this technique. I just wish we could somehow push against these incentives. Pure LLMs producing weaker AI systems is in my opinion a feature, not a bug. I think our society would benefit from slower progress in frontier AGI.

Comment by Nadav Brandes (nadav-brandes) on AGI with RL is Bad News for Safety · 2024-12-23T14:59:51.527Z · LW · GW

I think our crux is that, for a reason I don't fully understand yet, you equate "CoT systems"[1] with RL training. Again, I fully agree with you that systems based on CoT (or other hard-coded software and prompt engineering over weaker underlying models) are much safer than black-box end-to-end models. But why do you assume that this requires open-ended RL[2]? Wouldn't you agree that it would be safer with simple LLMs that haven't undergone open-ended RL? I also like your idea of a CoT system with two parts, but again, I'd argue that you don't need open-ended RL for that.

Other points I somewhat disagree with, but I don't think are the core of the crux, so we probably shouldn't delve into:

  1. I still think that priority number one should be to not exceed a dangerous level of capabilities, even with robust safety features.
  2. I find both "we shouldn't build dangerous AI" and "we should reduce homelessness" to be useful statements. To come up with good plans, we first need to know what we're aiming for.
  [1] I used the term "compositional AI systems", but for the sake of ease of communication, let's use your terminology.

  [2] See my definition above for open-ended RL.

Comment by Nadav Brandes (nadav-brandes) on AGI with RL is Bad News for Safety · 2024-12-22T15:25:49.747Z · LW · GW

I think we are making good progress on getting to the crux :)

It sounds to me like some of our disagreements may arise from us using terminology differently, so let me try to clarify.

 

"conditional on AI being at a given capabilities level, I prefer to reach that level with less capable text generators and more advanced RL/scaffolding"

I disagree in 2 ways:

1. It's wrong to condition on AI being at a given level of capabilities. When deciding which path is better, priority number one should be to not exceed a dangerous level of capabilities. Only then should we ask: given that capability level, which technology stack achieves it in ways that are otherwise safer (more transparent and controllable, less agentic).

2. I'm not exactly sure what you mean by scaffolding, but it sounds like you are conflating RL-based AI with what I call "compositional AI systems" (that's the term I used in my original post from 20 months ago). In retrospect it's probably not the best term, but for the sake of consistency let me stick with it. 

By "compositional AI systems" I refer to any piece of software that is using an underlying AI model (e.g. an LLM) to achieve much stronger capabilities than those the underlying model would achieve if used on its own as a black box. Common techniques include chain of thought and tool use (e.g. access to web search and other APIs), but in theory you can have infinitely complex systems (e.g. trees of thought). Now, the underlying model can be trained with relatively open-ended RL as in o1 and o3, or it can be a simple LLM that's only fine-tuned with a simple RLHF or constitutional AI. Whether the system is compositional and whether the underlying model is trained with open-ended RL is two different questions, and I argue that compositional AI systems are relatively safe, as long as you stick with simple LLMs. If you use open-ended RL to train the underlying model, it's far less safe. I totally agree with you that compositional AI systems are safer than end-to-end opaque models.

 

What's the difference between RLHF and open-ended RL? 

RLHF is technically a form of RL, but unlike what I call "open-ended RL", which is intended to make models more capable (as in o1/o3), RLHF doesn't really increase capabilities in significant ways. It's intended to just make models more aligned and helpful. It's generally believed that most of the knowledge of LLMs is learned during pre-training (with pure language modeling) and that RLHF mostly just encourages the models to use their knowledge and capabilities to help users (as opposed to just predicting the most likely next token you'd find for a similar prompt on the internet). What I'm worried about is RL that's intended to increase capabilities by rewarding models that manage to solve complex problems. That's what I call open-ended RL.

Open-ended RL encourages models to plan ahead and be strategic, because these are useful skills for solving complex problems, but not so useful for producing a chunk of text that a human user would like (as in RLHF). That's also why I'm worried about strategic deception being much more likely to arise with open-ended RL. 
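To put the distinction in code rather than words, here is a toy sketch, my own illustration rather than anyone's actual training setup; `preference_model` and `check_solution` are hypothetical stand-ins.

```python
# Toy illustration of the reward distinction, not real training code.
# `preference_model` and `check_solution` are hypothetical stand-ins.

def rlhf_reward(prompt: str, response: str, preference_model) -> float:
    """RLHF-style reward: how much a (model of a) human rater likes the text.
    Optimizing this mostly shapes helpfulness and style, not new problem-solving skill."""
    return preference_model(prompt, response)

def open_ended_rl_reward(problem: str, final_answer: str, check_solution) -> float:
    """Outcome-based reward: 1.0 only if the hard problem actually got solved.
    Any strategy that raises the success rate gets reinforced, which is what
    pushes models toward planning ahead and being strategic."""
    return 1.0 if check_solution(problem, final_answer) else 0.0
```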

Comment by Nadav Brandes (nadav-brandes) on AGI with RL is Bad News for Safety · 2024-12-22T00:50:14.960Z · LW · GW

Why are you conditioning on superhuman AGI emerging? I think it's something very dangerous that our society isn't ready for. We should pursue a path where we can enjoy as many of the benefits of sub-human-level AGI (of the kind we already have) without risking uncontrolled acceleration. Pushing for stronger capabilities with open-ended RL is counterproductive for the very scenario we need to avoid.

It sounds like you believe that training a model with RL would make it somehow more transparent, whereas I believe the opposite. Can you explain your reasoning?

Do you disagree that RL pushes models to be better at planning and to exceed human-level capabilities?

I like the idea of discouraging steganography, but I still worry that given strong enough incentives, RL-trained models will find ways around this. 

Comment by Nadav Brandes (nadav-brandes) on AGI with RL is Bad News for Safety · 2024-12-21T21:37:55.528Z · LW · GW

Thank you Seth for the thoughtful reply. I largely agree with most of your points.

I agree that RL trained to accomplish things in the real world is far more dangerous than RL trained to just solve difficult mathematical problems (which in turn is more dangerous than vanilla language modeling). I worry that the real-world part will soon become commonplace, judging from current trends.

But even without the real-world part, models could still be incentivized to develop superhuman abilities and complex strategic thinking (which could be useful for solving mathematical and coding problems).

Regarding the chances of stopping/banning open-ended RL, I agree it's a very tall order, but my impression of the advocacy/policy landscape is that people might be open to it under the right conditions. At any rate, I wasn't trying to reason about what's reasonable to ask for, only about the implications of different paths. I think the discussion should start there, and then we can consider what's wise to advocate for.

For all of these reasons, I fully agree with you that work on demonstrating these risks in a rigorous and credible way is one of the most important efforts for AI safety.

Comment by Nadav Brandes (nadav-brandes) on AGI with RL is Bad News for Safety · 2024-12-21T20:59:11.563Z · LW · GW

I'm all in for CoT! But when you RL a model to produce CoTs that are better at solving difficult problems, you risk encouraging i) CoTs that carry side information that's only known to the model, ii) superhuman capabilities, and iii) planning ahead and agency in ways that are difficult to anticipate. Whether you produce the CoT with a pure LLM or a model that's also undergone RL, you end up with a series of tokens where each token was produced by a fully internalized reasoning process. The only difference is that RL encourages this reasoning process to be more powerful, more agentic, and less predictable. What's the advantage of RL for safety?

Comment by Nadav Brandes (nadav-brandes) on AGI with RL is Bad News for Safety · 2024-12-21T20:04:31.699Z · LW · GW

I explain it in more detail in my original post. 
In short, in standard language modeling the model only tries to predict the most likely immediate next token (T1), and then the most likely token after that (T2) given T1, and so on; whereas in RL the model is optimizing a whole sequence of next tokens (T1, ..., Tn), so the rewards earned by the entire sequence (up to Tn) feed back into how the immediate next token (T1) is reinforced.
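As a toy illustration (with made-up numbers, and glossing over the fact that language modeling scores observed tokens while RL scores sampled ones), the two objectives look roughly like this:

```python
import math

# Toy contrast between the two objectives. `log_probs[t]` is the model's
# log-probability of the token at step t; `sequence_reward` scores the whole
# completed sequence (T1, ..., Tn). Numbers below are made up.

def language_modeling_loss(log_probs):
    """Pure language modeling: each position is trained only to make its own
    next token likely given the prefix; nothing ties T1 to what happens at Tn."""
    return -sum(log_probs) / len(log_probs)

def reinforce_loss(log_probs, sequence_reward):
    """REINFORCE-style RL: the reward for the whole sequence is pushed back
    into every token's log-probability, so even the very first token is
    optimized for its downstream consequences."""
    return -sequence_reward * sum(log_probs)

log_probs = [math.log(0.6), math.log(0.4), math.log(0.9)]
print(language_modeling_loss(log_probs))                # depends only on per-token fit
print(reinforce_loss(log_probs, sequence_reward=1.0))   # whole-trajectory credit
```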

Comment by Nadav Brandes (nadav-brandes) on Language Models are a Potentially Safe Path to Human-Level AGI · 2023-04-27T13:44:16.202Z · LW · GW

Thank you for this comment!

I first want to note that your comment implies my post is saying things which I don't think it does (or at least I didn't mean it to):
- I didn't argue that language models are internally human-like, only externally. I do think however that compositional AI systems made of language models should be internally human-like.
- I didn't advocate for training AGI end-to-end (but rather taking the factored cognition approach).

I agree with you that a human-ish-level AGI would be VERY dangerous regardless of how aligned it is (mostly because there are 8 billion people on this planet, and it's sufficient for just a few of them to use it in stupid/evil ways to put us all in danger).

Comment by Nadav Brandes (nadav-brandes) on Language Models are a Potentially Safe Path to Human-Level AGI · 2023-04-20T21:09:35.908Z · LW · GW

Thank you for this comment. I'm curious to understand the source of disagreement between us, given that you generally agree with each of the sub-points. Do you really think that the chances of misalignment with LM-based AI systems are above 90%? What exactly do you mean by misalignment in this context, and why do you think it's the most likely result with such AI? Do you think it will happen even if humanity sticks with the paradigm I described (of chaining pure language models while avoiding training models on open-ended tasks)?

I want to also note that my argument is less about "developing language models was counterfactually a good thing" and more "given that language models have been developed (which is now a historic fact), the safest path towards human-level AGI might be to stick with pure language models".

Comment by Nadav Brandes (nadav-brandes) on Language Models are a Potentially Safe Path to Human-Level AGI · 2023-04-20T20:58:44.179Z · LW · GW

Thank you!
I don't have any concrete plans, but maybe.