ELI5 Why isn't alignment *easier* as models get stronger?

post by Logan Zoellner (logan-zoellner) · 2023-10-28T14:34:37.588Z · LW · GW · 2 comments

This is a question post.


Epistemic status: straw-man; I'm just wondering what the strongest counterarguments to this are.

It seems obvious to me that stronger models are easier to align. A simple proof:

  1. It is always possible to get a weaker model out of a stronger model (for example, by corrupting n% of its inputs/outputs; see the sketch below)
  2. Therefore, if it is possible to align a weak model, it is at least as easy to align a strong model
  3. It is unlikely that weak and strong models are exactly equally hard to align
  4. Therefore, it is easier to align stronger models
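
For concreteness, here is a minimal sketch of the "corrupt n% of its outputs" mechanism in step 1 (Python; the `model` callable and the 30% corruption rate in the usage line are hypothetical stand-ins, not anything from the post):

```python
import random

def weaken(model, corruption_rate: float):
    """Wrap a stronger model so a given fraction of its outputs are corrupted."""
    def weakened(prompt: str) -> str:
        response = model(prompt)
        if random.random() < corruption_rate:
            # Crudely corrupt this output by shuffling its tokens,
            # degrading the wrapped model's effective capability.
            tokens = response.split()
            random.shuffle(tokens)
            return " ".join(tokens)
        return response
    return weakened

# e.g. weak_model = weaken(strong_model, corruption_rate=0.3)
```

On the post's framing, if the weakened wrapper can be aligned, then aligning the stronger model is "at least as easy", because applying the wrapper was always an available move.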

(I have a list of counter-arguments written down, I am interested to see if anyone suggests a counterargument better than the ones on my list)

Answers

answer by Richard_Kennaway · 2023-10-28T15:29:08.023Z · LW(p) · GW(p)

I don’t follow your (2). Compare:

  1. It is always possible to get a weaker fighter out of a stronger fighter (for example, by shooting his kneecaps).

  2. Therefore if it is possible to keep a weak fighter under your control, it is at least as easy to keep a strong fighter under your control.

If you have to shoot him in the kneecaps, that defeats the point of having a strong fighter. Likewise, hobbling a strong AI defeats the point of having it.

As some countries have found with hired mercenaries, the more effectively they win your war for you, the harder it may be to get them to go away afterwards.

comment by Big Tony · 2023-10-29T21:47:08.091Z · LW(p) · GW(p)

Agree, I don't follow the logic from step 1 → step 2 either - it seems obviously nonsensical. Maybe there are a few intermediate steps missing that show the chain of logic more clearly?

answer by johnswentworth · 2023-10-28T16:16:36.466Z · LW(p) · GW(p)

A few places where this argument breaks down...

First and most important: we can make a weaker model out of a stronger model if we know in advance that we want to do so, and actually try to, and make sure the stronger system does not have a chance to stop us (e.g. we don't run it). If there's an agentic superhuman AGI already undergoing takeoff, then "make it weaker" is not really an option. Even if there's an only-humanish-level agentic AGI already running, if that AGI can easily spin up a new instance of itself without us noticing before we turn it off, or arrange for someone else to spin up a new instance, then "make it weaker" isn't really an option. Plausibly even a less-than-human-level agent could pull that off; infosec does usually have an attacker's advantage.

(Subproblem 1: on some-but-not-all threat models, a superhuman AGI is already a threat when it's in training. So plausibly "don't run the strong model" wouldn't even be enough, we'd have to not even train the strong model.

Subproblem 2 (orthogonal to subproblem 1): looking at a strong model and figuring out how aligned/corrigible/etc it is, in a way robust enough to generalize well to even moderately strong capabilities, is itself one of the hardest open problems in alignment. So in order for a plan involving "build strong model and make it weaker" to help, the plan would have to weaken the strong model unconditionally, not check whether the strong model has problems and then weaken it. At which point... why use a stronger model in the first place? There are still some reasons, but a lot fewer.

Put subproblems 1 & 2 together, and we're basically back to "don't use a strong model in the first place" - i.e. unconditionally do not train a strong model.)

Second: one would need to know the relevant way in which to weaken the model. "Corrupting n% of its inputs/outputs" just doesn't matter that much on most threat models I can think of - for instance, it doesn't really matter at all for deception.

Third: in order for this argument to go through, one does need to actually use the mechanism from the argument, i.e. weaken the stronger model. Without necessarily accusing you specifically of anything, when I hear this argument, my gut expectation is that the arguer's next step will be to say "great, so let's assume that alignment gets easier as models get stronger" and then completely forget about the part where their plan is supposed to involve weakening the model somehow. For instance, I could imagine someone next arguing "well, today's systems are already reasonably aligned, and it only gets easier as models get stronger, so we should be fine!" without realizing/considering that this argument only works insofar as they actually expect all AI labs to intentionally weaken their own models (or do something strictly better for alignment than that, despite subproblem 2 above). So if someone made this argument to me in the context of a broader plan, I'd be on the lookout for that.

(Meta-note: I'm not saying I endorse the premises of all these counterarguments. These are just some counterarguments I see, under some different models.)

comment by Logan Zoellner (logan-zoellner) · 2023-10-28T17:21:56.601Z · LW(p) · GW(p)

I'm curious, do you actually endorse subproblem 1?

Under the current ML paradigm (transformers), the model becoming dangerous during training seems extremely implausible to me.

I could imagine an ML paradigm where subproblem 1 was real (for example, training an RL agent to hack computers, which then unsandboxes itself).  But it seems like it would be really obvious beforehand that you were doing something dangerous.

Replies from: johnswentworth
comment by johnswentworth · 2023-10-29T17:32:11.401Z · LW(p) · GW(p)

I don't personally expect that subproblem 1, in its purest form, is relevant to the exact LLM architectures used today - i.e. stacked transformers trained mainly on pure text prediction. On the other hand, I'm not extremely highly confident that subproblem 1 isn't relevant; I wouldn't particularly want to rely on subproblem 1's irrelevance as a foundational assumption.

Also, I definitely do not expect that it will be really obvious in advance when someone changes the core architecture enough that subproblem 1 becomes relevant. Really obvious that we're not just training stacked transformers on pure text prediction, yes. Really obvious that we're doing something dangerous, no. The space of possibilities is large, and predicting how different setups behave in advance is not easy.

All that said, I do generally consider subproblem 2 the more relevant one.

answer by Seth Herd · 2023-10-29T18:30:43.795Z · LW(p) · GW(p)

I don't fully get your argument, but I'll bite anyway. I think all of this is highly dependent on what you mean by "stronger model" and "alignment".

I think it is easier to align a stronger model, if it's stronger in the sense of understanding the goals or values you're aligning it to better, and if you're able to use its understanding to control its decisions. In that case, its model of your goals/values becomes its goals/values. In this scenario, it's as aligned as the quality of its model, so a stronger model will be more aligned.

But that's assuming you don't lose control of the model before you do that alignment. I think this is a real, valid question, but it can be addressed by performing alignment early and often as the model is trained/gets smarter.

I recently wrote about this in The (partial) fallacy of dumb superintelligence [LW · GW]. I think this is an intuition among risk-doubters about how it should be easy to align a smart AGI, and how that intuition is only partly wrong.

answer by chasmani · 2023-10-29T17:42:59.842Z · LW(p) · GW(p)

Well, I agree it is a strawman argument. Following the same lines as your argument, I would say the counter-argument is that we don’t really care whether a weak model is fully aligned or not. Is my calculator aligned? Is a random number generator aligned? Is my robotic vacuum cleaner aligned? It’s not really a meaningful question.

Alignment is a bigger problem with stronger models. The required degree of alignment is much higher. So even if we accept your strawman argument it doesn’t matter.

2 comments

Comments sorted by top scores.

comment by Logan Zoellner (logan-zoellner) · 2023-10-29T16:14:53.058Z · LW(p) · GW(p)

Here is the list of counter-arguments I prepared beforehand:

1) Digital cliff: it may not be possible to weaken a stronger model
2) Competition: the existence of a stronger model implies we live in a more dangerous world
3) Deceptive alignment: the stronger model may be more likely to deceive you into thinking it's aligned
4) Wireheading: the user may be unable to resist using the stronger model even knowing it is more dangerous
5) Passive safety: the weaker model may be passively safe while the stronger model is not
6) Malicious actors: the stronger model may be more likely to be used by malicious actors
7) Inverse scaling: the stronger model may be weaker in some safety-critical dimensions
8) Domain of alignment: the stronger model may be more likely to be used in a safety-critical context

 

I think the strongest counter-arguments are:

  1. There may not be a surefire way to weaken a stronger model
  2. Saying you "can" weaken a model is useless unless you actually do it

 

I would love to hear a stronger argument for what @johnswentworth [LW · GW] describes as "subproblem 1": that the model might become dangerous during training.  All of the versions of this argument that I am aware of involve some "magic" step where the AI unboxes itself (e.g. via a side-channel, or by talking its way out of the box [LW · GW]), and those steps seem to either require huge leaps in intelligence or be easily mitigated (air-gapped network, two-person control).

comment by simon · 2023-10-28T15:02:17.490Z · LW(p) · GW(p)

I actually agree that it is easier to achieve any given % alignment with a stronger model, but on the other hand the potential bad consequences of any given % misalignment increase for the stronger model (potentially dramatically at certain points, e.g. once it can take over).
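
One way to make this trade-off concrete (my own illustrative sketch, not part of the comment): let a(C) be the alignment fraction achievable at capability level C, and D(C) the damage a misaligned model of that capability could do, so that roughly

```latex
% Illustrative only: a(C) = achievable alignment fraction, D(C) = damage potential.
\mathbb{E}[\mathrm{harm}](C) \approx \bigl(1 - a(C)\bigr)\, D(C)
```

Even if a(C) rises with capability (alignment getting easier), expected harm can still grow whenever D(C) grows faster than 1 - a(C) shrinks.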