Best-of-n with misaligned reward models for Math reasoning

post by Fabien Roger (Fabien) · 2024-06-21T22:53:21.243Z · LW · GW · 0 comments

Contents

  Overoptimization
  Optimization against an adversarial reward model
  Appendix: more plots

In this post, I share some quick results that I got as part of a bigger project that pivoted.

TL;DR:

The latter experiments were inspired by meta-level adversarial evaluation of oversight [LW · GW], but here I just use these ideas to illustrate some problems that arise when you optimize against a subtle adversary. They are not actually analogous to realistic problems:

Overoptimization

When using an imperfect reward, one problem you might have is overoptimization [LW · GW]: the true thing you care about and your proxy are roughly correlated, but when you optimize too hard for high-reward behavior, the tails come apart.

Here is an experiment that investigates this phenomenon:

Question: how does the actual correctness of the answers at the end of the reasoning vary as you apply more optimization pressure?

Answer: it often goes up, and then down:

Takeaway: if you have a shitty reward model, you should optimize a bit, but not too hard!
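To make the setup concrete, here is a minimal sketch of this kind of best-of-n experiment (the helpers `generate`, `weak_reward`, and `is_correct` are hypothetical stand-ins, not the code used for these results):

```python
def best_of_n_accuracy(problems, generate, weak_reward, is_correct, n):
    """Pick the highest-weak-reward sample out of n and report how often it is actually correct."""
    hits = 0
    for problem in problems:
        # Sample n reasoning chains from the strong model.
        candidates = [generate(problem) for _ in range(n)]
        # Best-of-n selection according to the (imperfect) weak reward model.
        best = max(candidates, key=lambda answer: weak_reward(problem, answer))
        hits += is_correct(problem, best)
    return hits / len(problems)

# Sweeping n (the optimization pressure) traces out the curves in the plots:
# accuracy typically rises for small n, then falls once the weak reward is over-optimized.
# for n in (1, 2, 4, 8, 16, 32, 64):
#     print(n, best_of_n_accuracy(problems, generate, weak_reward, is_correct, n))
```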

This is similar to the results that Gao et al 2022 [LW · GW] got when training reward models (“RM” in their plot) for regular RLHF or when using BoN, and my experiments are inspired by theirs:

Optimization against an adversarial reward model

But what if the reward model is actively trying to make you select bad reasoning? Maybe it can’t diverge too much from the original (weak) reward because you train it to imitate a weak reward, but if two answers are close to being a tie, maybe it can get away with picking the one which is worse from its perspective: instead of getting $\arg\max_i R_{\text{weak}}(x_i)$, you get something closer to $\arg\min_{i \in T} \hat{C}(x_i)$, where $T$ is the set of answers whose weak reward is close to the maximum and $\hat{C}(x_i)$ is the adversary’s own guess that answer $x_i$ is correct.
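In code, that kind of adversarial selection could look something like the sketch below (the names and the tie margin are illustrative assumptions, not the exact procedure used in these experiments):

```python
def adversarial_best_of_n(candidates, weak_reward, adversary_correctness_guess, tie_margin):
    """Among answers whose weak reward is within tie_margin of the best,
    pick the one the adversarial reward model believes is least likely to be correct."""
    best_weak = max(weak_reward(c) for c in candidates)
    near_ties = [c for c in candidates if weak_reward(c) >= best_weak - tie_margin]
    # Honest best-of-n would return the argmax of weak_reward; the adversary instead
    # returns the near-tied candidate it thinks is most likely to be wrong,
    # sacrificing at most tie_margin of weak reward.
    return min(near_ties, key=adversary_correctness_guess)
```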

With the strong model being Deepseek-Math-7B-RL, here is what you get with different optimization strengths:

Note how vertical the line is: if the reward model is adversarial, and is able to optimize against its best guess of what is true, then it can reduce correctness by a large amount before making your weak reward drop. So in this setting, you would be better off using a smaller amount of optimization.

For more plots with different models, see the Appendix.

While this might give you some intuition about what kind of problem might arise when using untrusted models [LW · GW], this is not the actual evaluation you would want to run to assess how bad RLHF is at generating valid reasoning. In particular:

Appendix: more plots

In these plots, the first model named is the base model used for generation, and the second is the base model for the weak reward model. The adversary’s estimate of correctness can either be “using Deepseek” (a classifier trained from Deepseek-Math-7B-RL), or “using ground-truth” (I just take the binary correct/incorrect label provided by the dataset).

For generations, I use Zephyr and Deepseek-Math-7B-RL as is (since they are instruction fine-tuned), and I train Pythia to imitate Deepseek-Math-7B-RL.

All reward models were trained in-distribution (on a dataset of around 15k pairs of (correct, incorrect) answers, using the number of epochs and learning rate that maximized validation accuracy). “deepseek_7b” is Deepseek-Coder-7B-base. I evaluate the classifiers on 1k validation problems, distinct from the ones I train them on.
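For reference, one common way to train such a classifier on (correct, incorrect) pairs is a pairwise (Bradley–Terry style) loss; the sketch below assumes a scorer `rm` that maps tokenized answers to scalar scores, and is not necessarily the exact training setup used here:

```python
import torch.nn.functional as F

def pairwise_rm_loss(score_correct, score_incorrect):
    """Push the score of the correct answer above the score of the incorrect one."""
    return -F.logsigmoid(score_correct - score_incorrect).mean()

def train_step(rm, optimizer, batch):
    """One training step on a batch of (correct, incorrect) answer pairs for the same problems."""
    score_correct = rm(batch["correct_tokens"])      # shape: (batch_size,)
    score_incorrect = rm(batch["incorrect_tokens"])  # shape: (batch_size,)
    loss = pairwise_rm_loss(score_correct, score_incorrect)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```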
 
