Validating against a misalignment detector is very different to training against one

post by mattmacdermott · 2025-03-04T15:41:04.692Z · LW · GW · 2 comments

Contents

    Answer:
  OK, so what’s the lesson here?
  Is this really how validation metrics work?
  A silly example
  A reasonable objection
2 comments

Consider the following scenario:

    We have a list of ideas for training an aligned AI. Each idea we try produces an aligned AI 10% of the time and a misaligned AI 90% of the time.

    We also have a misalignment detector we can use as a test: aligned AIs pass the test 90% of the time, while misaligned AIs pass it only 10% of the time.

A naive alignment strategy would be to add a penalty term to our loss function that penalises failing the alignment test.

Question: If we apply this strategy and get an AI that passes the test, what’s the probability that it's aligned?

    a) 50%. Just apply Bayes' rule to get P(aligned | passes test). The prior odds of aligned:misaligned are 10:90, the likelihood ratio is 90:10, so we get posterior odds of 900:900, i.e. a probability of 50% (the calculation is written out below).

    b) 10%. We get the base rate P(aligned), because the misalignment test is trivially circumventable.
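For reference, here is the Bayes' rule calculation from option (a) written out in odds form, using the numbers from the setup:

$$
\frac{P(\text{aligned} \mid \text{passes})}{P(\text{misaligned} \mid \text{passes})} = \frac{P(\text{aligned})}{P(\text{misaligned})} \cdot \frac{P(\text{passes} \mid \text{aligned})}{P(\text{passes} \mid \text{misaligned})} = \frac{10}{90} \cdot \frac{90}{10} = 1,
$$

which corresponds to a probability of 50%.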
 

Answer:

 Okay, it's not quite a mathematical question -- we can't be sure how the change to the loss landscape affects the path followed by SGD.

That said, it seems like it's probably b) 10%. Intuitively it seems like the only thing that will change is that if the model would have been detectably misaligned, now it will be undetectably misaligned instead.

 Certainly (a) is a naive and mistaken way to reason about what will happen -- we're not randomly sampling models that pass the alignment test.

New strategy: instead of adding a penalty term, we save the alignment test for the end of training. If the model passes, then great. If it fails, we throw the model away and train a new one with another random idea from our list. Repeat the process until we get an AI that passes the alignment test.

Question: What's the probability we get an aligned AI using this strategy?

    a) 50%. It’s different this time, for some reason. 🤞 


    b) 10% again. Optimisation pressure is optimisation pressure; we’re just applying it in a different way.
 

Answer:

 This time it follows mathematically from the setup that it’s (a) 50%. We are indeed randomly sampling models that pass the test, so we get P(aligned | passes test), which we can figure out with Bayes' rule as above.
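As a sanity check, here's a minimal Monte Carlo sketch of the resample-until-pass strategy (a rough illustration in Python, assuming the 10% base rate and the 90%/10% pass rates from the setup):

```python
import random

def train_with_random_idea():
    # Pick a random idea from our list: 10% of the time it yields an aligned model.
    aligned = random.random() < 0.10
    # Aligned models pass the alignment test 90% of the time;
    # misaligned models pass it only 10% of the time.
    passes_test = random.random() < (0.90 if aligned else 0.10)
    return aligned, passes_test

def resample_until_pass():
    # The validation strategy: throw away any model that fails the test and retrain.
    while True:
        aligned, passes_test = train_with_random_idea()
        if passes_test:
            return aligned

runs = 100_000
print(sum(resample_until_pass() for _ in range(runs)) / runs)  # prints roughly 0.5
```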

OK, so what’s the lesson here?

One way to state Goodhart’s law is that if you care about property A, and you notice that P(A|B) is high, so you optimise for B, then it’s likely that you won’t get the high probability of A that you were expecting -- you'll often just get the base rate, P(A).

The lesson from the example above is that for some optimisation methods, Goodhart doesn’t apply. In particular, if you optimise for B by resampling from P until B becomes true, then the probability that A is true is actually P(A|B) after all.

There’s a tempting intuition that using a misalignment detector as a validation metric should be the same as using it as a training objective, only perhaps a bit gentler. I think this intuition is wrong! Using metrics for validation is qualitatively different to using them for training. That fits with how central a role validation plays in ML practice.

My claim: using a misalignment detector for training is doomed. Using one for validation is a good idea.

Is this really how validation metrics work?

No, not exactly.

If you’re training an image classifier or something, you don’t (normally) just have a list of training ideas that you keep resampling from until you get one with above a certain validation score. You run a better search algorithm, using the validation scores that your previous ideas got to choose the ideas you try next. And you normally try to get the validation score as high as possible rather than just above a threshold.

But still, there’s a large element of ‘just try random stuff until you get a good validation score’. It’s certainly a step in that direction compared to gradient descent, and plausibly most of the way there.

It is still possible to overfit to the validation set – which it wouldn’t be under the pure random search picture – but empirically it’s way less of a problem.
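To make the contrast concrete, here's a purely illustrative sketch (the `train_and_validate` helper is a hypothetical stand-in for training a model and scoring it on a held-out validation set):

```python
import random

def train_and_validate(idea):
    # Hypothetical stand-in: train a model using this idea and return its
    # score on a held-out validation set. Here it is just random noise.
    return random.random()

ideas = [f"idea_{i}" for i in range(20)]

# The pure rejection picture from the post: try ideas until one clears a threshold.
first_passing = next((idea for idea in ideas if train_and_validate(idea) >= 0.9), None)

# What ML practice looks more like: search over candidate ideas (in reality,
# informed by earlier validation scores) and keep whichever scores highest.
best_idea = max(ideas, key=train_and_validate)

print(first_passing, best_idea)
```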

A silly example

Here’s a silly example to help build the intuition that this threshold-based rejection thing is qualitatively different to other ways of optimising.

Imagine Alice and Bob each want to give £1,000,000 to someone who speaks French. They decide to use whether or not somebody can translate “I want a million pounds” as their test of French-speakingness.

Alice stands on a street corner asking people if they can translate the phrase. Sure enough, the first person who can successfully translate it is someone who actually speaks French. She’s successfully sampling from the distribution P(French speaker | can translate the phrase).

Bob puts up a poster saying “I’m on the next street corner offering £1,000,000 to the first person who can say ‘I want a million pounds’ in French.” If the first person to walk past the poster doesn’t speak French, they get ChatGPT to translate the phrase and take the million pounds anyway. So Bob ends up sampling from the marginal distribution P(French speaker).

A reasonable objection

“Compared to your example, I think the base rate of alignment will be lower, and our misalignment detectors will be worse, so validating against misalignment detectors is still pretty doomed.”

Fair enough. If the base rate of alignment is 1% and aligned models are only twice as likely to pass the test as misaligned models, then using the detector for validation only gets us a 2% chance of alignment.[1] I wrote the post to argue against the view expressed here [LW(p) · GW(p)], which says that if we detect misalignment, our only two options are to pause AI development (which is implausible) or to train against the detector (which is doomed). If you think aligned AIs are incredibly rare or building good misalignment detectors is really hard, then things still look pretty bleak.
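Spelling that out in odds form: prior odds of aligned:misaligned are 1:99 and the likelihood ratio is 2:1, so the posterior odds are 2:99, i.e.

$$
P(\text{aligned} \mid \text{passes test}) = \frac{2}{2 + 99} \approx 2\%.
$$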

  1. ^

    Although stacking a few independent detectors that are all individually decent could be as good as having one very good detector.
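    For example, with made-up numbers (and assuming the detectors' verdicts are independent given whether the model is aligned): three detectors that each pass aligned models twice as often as misaligned ones have a combined likelihood ratio of 2 × 2 × 2 = 8:1, which turns prior odds of 1:99 into posterior odds of 8:99, i.e. a roughly 7.5% chance of alignment rather than 2%.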

2 comments

Comments sorted by top scores.

comment by David Matolcsi (matolcsid) · 2025-03-04T16:50:46.664Z · LW(p) · GW(p)

I like the main idea of the post. It's important to note though that the setup assumed that we have a bunch of alignment ideas that all have an independent 10% chance of working. Meanwhile, in reality I expect a lot of correlation: there is a decent chance that alignment is easy and a lot of our ideas will work, and a decent chance that it's hard and basically nothing works.

Replies from: mattmacdermott
comment by mattmacdermott · 2025-03-04T17:00:49.306Z · LW(p) · GW(p)

Agreed, this only matters in the regime where some but not all of your ideas will work. But even in alignment-is-easy worlds, I doubt literally everything will work, so testing would still be helpful.