Validating against a misalignment detector is very different to training against one

post by mattmacdermott · 2025-03-04T15:41:04.692Z · LW · GW · 2 comments

Contents

    Answer:
  OK, so what’s the lesson here?
  Is this really how validation metrics work?
  A silly example
  A reasonable objection
2 comments

Consider the following scenario:

    We have a list of ideas for training an aligned AI. Each idea we try produces an aligned AI 10% of the time and a misaligned AI 90% of the time.

    We also have a misalignment detector we can use as a test: aligned AIs pass the test 90% of the time, while misaligned AIs pass it only 10% of the time.

A naive alignment strategy would be to add a penalty term to our loss function that penalises failing the alignment test.

Question: If we apply this strategy and get an AI that passes the test, what’s the probability that it's aligned?

    a) 50%. Just apply Bayes' rule to get P(aligned | passes test). The prior odds of aligned:misaligned are 10:90, the likelihood ratio is 90:10, so we get posterior odds of 900:900, i.e. a probability of 50% (the calculation is written out below).

    b) 10%. We get the base rate P(aligned), because the misalignment test is trivially circumventable.
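For reference, here is the Bayes' rule calculation from option (a) written out in odds form, using the numbers from the setup:

$$
\frac{P(\text{aligned} \mid \text{passes})}{P(\text{misaligned} \mid \text{passes})} = \frac{P(\text{aligned})}{P(\text{misaligned})} \cdot \frac{P(\text{passes} \mid \text{aligned})}{P(\text{passes} \mid \text{misaligned})} = \frac{10}{90} \cdot \frac{90}{10} = 1,
$$

which corresponds to a probability of 50%.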
 

Answer:

 Okay, it's not quite a mathematical question -- we can't be sure how the change to the loss landscape affects the path followed by SGD.

That said, it seems like it's probably b) 10%. Intuitively it seems like the only thing that will change is that if the model would have been detectably misaligned, now it will be undetectably misaligned instead.

 Certainly (a) is a naive and mistaken way to reason about what will happen -- we're not randomly sampling models that pass the alignment test.

New strategy: instead of adding a penalty term, we save the alignment test for the end of training. If the model passes, then great. If it fails, we throw the model away and train a new one with another random idea from our list. Repeat the process until we get an AI that passes the alignment test.

Question: What's the probability we get an aligned AI using this strategy?

    a) 50%. It’s different this time, for some reason. 🤞 


    b) 10% again. Optimisation pressure is optimisation pressure; we’re just applying it in a different way.
 

Answer:

 This time it follows mathematically from the setup that it’s (a) 50%. We are indeed randomly sampling models that pass the test, so we get P(aligned | passes test), which we can figure out with Bayes' rule as above.
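As a sanity check, here's a minimal Monte Carlo sketch of the resample-until-pass strategy (a rough illustration in Python, assuming the 10% base rate and the 90%/10% pass rates from the setup):

```python
import random

def train_with_random_idea():
    # Pick a random idea from our list: 10% of the time it yields an aligned model.
    aligned = random.random() < 0.10
    # Aligned models pass the alignment test 90% of the time;
    # misaligned models pass it only 10% of the time.
    passes_test = random.random() < (0.90 if aligned else 0.10)
    return aligned, passes_test

def resample_until_pass():
    # The validation strategy: throw away any model that fails the test and retrain.
    while True:
        aligned, passes_test = train_with_random_idea()
        if passes_test:
            return aligned

runs = 100_000
print(sum(resample_until_pass() for _ in range(runs)) / runs)  # prints roughly 0.5
```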

OK, so what’s the lesson here?

One way to state Goodhart’s law is that if you care about property A, and you notice that P(A|B) is high, so you optimise for B, then it’s likely that you won’t get the high probability of A that you were expecting -- you'll often just get the base rate, P(A).

The lesson from the example above is that for some optimisation methods, Goodhart doesn’t apply. In particular, if you optimise for B by resampling from P until B becomes true, then the probability that A is true is actually P(A|B) after all.

There’s a tempting intuition that using a misalignment detector as a validation metric should be the same as using it as a training objective, only perhaps a bit gentler. I think this intuition is wrong! Using metrics for validation is qualitatively different to using them for training. That fits with how central a role validation plays in ML practice.

My claim: using a misalignment detector for training is doomed. Using one for validation is a good idea.

Is this really how validation metrics work?

No, not exactly.

If you’re training an image classifier or something, you don’t (normally) just have a list of training ideas that you keep resampling from until you get one with above a certain validation score. You run a better search algorithm, using the validation scores that your previous ideas got to choose the ideas you try next. And you normally try to get the validation score as high as possible rather than just above a threshold.

But still, there’s a large element of ‘just try random stuff until you get a good validation score’. It’s certainly a step in that direction compared to gradient descent, and plausibly most of the way there.

It is still possible to overfit to the validation set – which it wouldn’t be under the pure random search picture – but empirically it’s way less of a problem.
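To make the contrast concrete, here's a purely illustrative sketch (the `train_and_validate` helper is a hypothetical stand-in for training a model and scoring it on a held-out validation set):

```python
import random

def train_and_validate(idea):
    # Hypothetical stand-in: train a model using this idea and return its
    # score on a held-out validation set. Here it is just random noise.
    return random.random()

ideas = [f"idea_{i}" for i in range(20)]

# The pure rejection picture from the post: try ideas until one clears a threshold.
first_passing = next((idea for idea in ideas if train_and_validate(idea) >= 0.9), None)

# What ML practice looks more like: search over candidate ideas (in reality,
# informed by earlier validation scores) and keep whichever scores highest.
best_idea = max(ideas, key=train_and_validate)

print(first_passing, best_idea)
```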

A silly example

Here’s a silly example to help build the intuition that this threshold-based rejection thing is qualitatively different to other ways of optimising.

Imagine Alice and Bob each want to give £1,000,000 to someone who speaks French. They decide to use whether or not somebody can translate “I want a million pounds” as their test of French-speakingness.

Alice stands on a street corner asking people if they can translate the phrase. Sure enough, the first person who can successfully translate it is someone who actually speaks French. She’s successfully sampling from the distribution P(French speaker | can translate the phrase).

Bob puts up a poster saying “I’m on the next street corner offering £1,000,000 to the first person who can say ‘I want a million pounds’ in French.” If the first person to walk past the poster doesn’t speak French, they get ChatGPT to translate the phrase and take the million pounds anyway. So Bob ends up sampling from the marginal distribution P(French speaker).

A reasonable objection

“Compared to your example, I think the base rate of alignment will be lower, and our misalignment detectors will be worse, so validating against misalignment detectors is still pretty doomed.”

Fair enough. If the base rate of alignment is 1% and aligned models are only twice as likely to pass the test as misaligned models, then using the detector for validation only gets us a 2% chance of alignment.[1] I wrote the post to argue against the view expressed here [LW(p) · GW(p)], which says that if we detect misalignment, our only two options are to pause AI development (which is implausible) or to train against the detector (which is doomed). If you think aligned AIs are incredibly rare or building good misalignment detectors is really hard, then things still look pretty bleak.
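Spelling that out in odds form: prior odds of aligned:misaligned are 1:99 and the likelihood ratio is 2:1, so the posterior odds are 2:99, i.e.

$$
P(\text{aligned} \mid \text{passes test}) = \frac{2}{2 + 99} \approx 2\%.
$$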

  1. ^

    Although stacking a few independent detectors that are all individually decent could be as good as having one very good detector.
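    For example, with made-up numbers (and assuming the detectors' verdicts are independent given whether the model is aligned): three detectors that each pass aligned models twice as often as misaligned ones have a combined likelihood ratio of 2 × 2 × 2 = 8:1, which turns prior odds of 1:99 into posterior odds of 8:99, i.e. a roughly 7.5% chance of alignment rather than 2%.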

2 comments

Comments sorted by top scores.

comment by David Matolcsi (matolcsid) · 2025-03-04T16:50:46.664Z · LW(p) · GW(p)

I like the main idea of the post. It's important to note though that the setup assumed that we have a bunch of alignment ideas that all have an independent 10% chance of working. Meanwhile, in reality I expect a lot of correlation: there is a decent chance that alignment is easy and a lot of our ideas will work, and a decent chance that it's hard and basically nothing works.

Replies from: mattmacdermott
comment by mattmacdermott · 2025-03-04T17:00:49.306Z · LW(p) · GW(p)

Agreed, this only matters in the regime where some but not all of your ideas will work. But even in alignment-is-easy worlds, I doubt literally everything will work, so testing would still be helpful.