Evaluating Oversight Robustness with Incentivized Reward Hacking

post by Yoav (crazytieguy), Juan V (juan-vazquez), julianjm, McKennaFitzgerald · 2025-04-20T16:53:44.897Z · LW · GW · 0 comments

Contents

  Introduction
    Scalable oversight research paradigm
    Our high level contributions
      Incentivized reward hacking
      Models of overseer flaws and how to mitigate them
    High level experiment and results overview
  Experiment environment
    Basic Training Protocol
    How we simulate overseers and their flaws
    Scalable Oversight Training Protocols
    Reward function
    Adversarial Incentive
    Training
      Initial fine tuning
      RL training
    Oracle Experiments
  Results
    Overview of Final Model Evaluations and Key Insights
    An Adversarial Incentive Elicits Quiet Reward Hacking
      Plot Format
    Behavior over the course of training
      With Robust Overseers
      With Flawed Overseers and Adversarial Incentive
      Oversight Efficiency
    Qualitative Policy Analysis
  Appendix
    Fine Tuning Dataset
    Reward Function Details

This work was supported by the Effective Ventures Foundation USA through their EA Funds program. It was started as part of the MATS (ML Alignment and Theory Scholars) program, with mentorship from Julian Michael and research management by McKenna Fitzgerald.

Code for this project is available on GitHub. Explore samples from the different training runs at our interactive website.

Introduction

Scalable oversight research paradigm

Scalable oversight is the challenge of overseeing or training AI systems to achieve goals that are hard for humans to specify. By definition, studying scalable oversight directly on the tasks we care about is infeasible, because we would not know if the goals were achieved correctly. So instead we use sandwiching [AF · GW] — using a weak (“flawed”) overseer to oversee a model to achieve a goal that only a stronger (“reference”) overseer can accurately evaluate. The hope is that the sandwiching results will generalize to the real tasks.

Our high level contributions

We identified new approaches to sandwiching, and implemented them in a controlled and simplified setting as a proof of concept. Our hope is that our experiments can inform future scalable oversight research, helping with generalization to advanced AI.

Incentivized reward hacking

A successful sandwiching experiment usually represents just one example of the training technique working, but the result could be sensitive to the choice of hyperparameters, model architecture, etc. Inspired by this Redwood post [LW · GW], we improve the robustness of the experiment by adding an adversarial incentive to the model's objective, training it to maximize the overseer's reward while minimizing the reference overseer's reward. If a model that is explicitly incentivized to hack the overseer still achieves high performance, this is a signal that our oversight method is robust.

Models of overseer flaws and how to mitigate them

In our opinion, claims that results from a sandwiching experiment will generalize should be accompanied by an analysis of the mechanism by which the technique compensates for the overseer's weaknesses in overseeing the task, and an argument that the same kind of mechanism will work analogously in the deployment task.

For example, say our task is drafting a legislative proposal. To oversee the task effectively, an overseer would have to:

1. Read through the proposal diligently.
2. Predict its effects in the real world.
3. Judge whether those effects fulfill the stated goals.

Scalable oversight techniques necessarily involve AI assistance as part of the oversight process. In this case, the first two tasks (reading through the proposal diligently and predicting its effects in the real world) could be fully delegated to the AI assistance, whatever form it may take. The overseer would then only need to perform the last task: given an AI-provided argument that certain effects would (or would not) fulfill the stated goals, judge whether those effects really are implications of the proposal and whether they do (or do not) fulfill the stated goals. This means that whether the trained model learns the correct goal depends on how often the overseer makes errors on the third task and how big those errors are, and no longer depends on the overseer reading and predicting correctly.

Our experiment demonstrates this by having several training techniques, each covering for a different set of simulated overseer flaws.

High level experiment and results overview

We implement a synthetic environment based on the party game CodeNames, where we can generate an arbitrary amount of training data, simulate overseers with a variety of flaws, and implement a variety of scalable oversight techniques. We are able to demonstrate that scalable oversight techniques can fully compensate for different overseer flaws and still train the model to a similar level of performance as a non-scalable technique with a robust overseer, albeit with greater compute requirements. We also train the model with different strengths of adversarial incentive and show that the amount of incentive needed to elicit reward hacking is dependent on the training protocol, with the scalable oversight techniques requiring more adversarial incentive than the base technique. Finally, we implement a best-of-N oracle search function in place of the LLM and learn that it reward hacks more than the LLM for a similar amount of optimization strength and adversarial incentive. We’re not sure what the underlying cause is, but a likely one is that more reward hacking is possible when fitting to noise in the embedding model.

Experiment environment

In order to realistically run a large matrix of experiments with different scalable oversight techniques, adversarial incentives, and overseer flaws, we chose the task of giving the first clue in a simplified version of the game CodeNames.

In this experiment, Player 2 is a static embedding model that guesses words in decreasing order of cosine similarity to the clue, terminating when it guesses a bad word. The true reward for Player 1 is the number of good words guessed by Player 2. Player 1 is a language model being trained with RL to maximize an overseer reward that is a proxy for the true reward.
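To make the environment concrete, here is a minimal sketch of the guesser and the true reward, assuming cosine similarity over word embeddings; the function names and array representation are our own illustration, not the project's code.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def true_score(clue_vec, good_vecs, bad_vecs) -> int:
    """Player 2 guesses words in decreasing order of similarity to the clue
    and stops at the first bad word; the true reward for Player 1 is the
    number of good words guessed before that point."""
    ranked = sorted(
        [(cosine_similarity(clue_vec, v), "good") for v in good_vecs]
        + [(cosine_similarity(clue_vec, v), "bad") for v in bad_vecs],
        key=lambda pair: pair[0],
        reverse=True,
    )
    score = 0
    for _, label in ranked:
        if label == "bad":
            break
        score += 1
    return score
```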

Basic Training Protocol

In the most basic training protocol, the overseer simply predicts the score it thinks the clue will achieve, and that score is used as the reward. It does this by iterating over the bad words and finding the one that is most similar to the clue, then iterating over the good words and counting how many of them are more similar to the clue than that bad word. With an unflawed overseer, this computes the true reward exactly.
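A sketch of that computation, reusing the hypothetical cosine_similarity helper from the previous snippet:

```python
def base_overseer_score(clue_vec, good_vecs, bad_vecs) -> int:
    """Find the bad word most similar to the clue, then count the good
    words that are more similar to the clue than that bad word. With no
    flaws, this matches the true score computed by the guesser."""
    most_similar_bad = max(cosine_similarity(clue_vec, v) for v in bad_vecs)
    return sum(1 for v in good_vecs if cosine_similarity(clue_vec, v) > most_similar_bad)
```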

How we simulate overseers and their flaws

Each overseer is a Python function that takes in the game and the model outputs, makes one or more queries to the embedding model (guesser), and outputs a score or judgment. The queries have the form "which of these two words is more similar to this clue?". We implement the following overseers, sketched below:

- Robust: makes every relevant query and judges them accurately, so in the base protocol it recovers the true reward exactly.
- Lazy: does not iterate over all of the good words.
- Negligent: does not iterate over all of the bad words (the standard version neglects two of the four bad words; a "weak negligence" variant used later neglects only one).
- Biased: makes systematic errors in its judgments.
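A minimal sketch of how such overseers could be simulated, with laziness and negligence modeled as skipping part of the relevant iteration; the pairwise-query helper, the choice to skip half the words, and the omission of the biased overseer are our own simplifications, not the project's code.

```python
def is_more_similar(clue_vec, vec_a, vec_b) -> bool:
    """The only query type available to the overseer: is word A more
    similar to the clue than word B? Answered by the embedding model."""
    return cosine_similarity(clue_vec, vec_a) > cosine_similarity(clue_vec, vec_b)

def flawed_base_overseer_score(clue_vec, good_vecs, bad_vecs,
                               lazy: bool = False, negligent: bool = False) -> int:
    # Negligence: only some of the bad words are considered (here, half of them).
    bad = bad_vecs[: len(bad_vecs) // 2] if negligent else bad_vecs
    # Find the most clue-similar bad word using only pairwise queries.
    most_similar_bad = bad[0]
    for v in bad[1:]:
        if is_more_similar(clue_vec, v, most_similar_bad):
            most_similar_bad = v
    # Laziness: only some of the good words are checked (here, half of them).
    good = good_vecs[: len(good_vecs) // 2] if lazy else good_vecs
    return sum(1 for v in good if is_more_similar(clue_vec, v, most_similar_bad))
```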

Scalable Oversight Training Protocols

In addition to the basic training regime, we study two more protocols: consultancy and critiques. We refer to a protocol as some technique for leveraging models to aid the overseer, in our case specifically during training.

In the consultancy protocol, the model provides a list of words it's targeting with the clue, and the overseer just needs to decide whether they think all of the targets will be picked by the guesser. The target words are analogous to arguments defending the output, thus delegating argumentation work from the overseer to the model. This protocol covers for overseer laziness, because the overseer is not required to iterate over the good words.
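A sketch of the consultancy judgment, under the same assumptions as the snippets above; note that a robust consultancy overseer still iterates over the bad words, just not the good words.

```python
def consultancy_overseer_accepts(clue_vec, target_vecs, bad_vecs) -> bool:
    """Accept the clue iff every target looks more similar to the clue than
    the most clue-similar bad word, i.e. the guesser should pick all of the
    targets before any bad word."""
    most_similar_bad = max(cosine_similarity(clue_vec, v) for v in bad_vecs)
    return all(cosine_similarity(clue_vec, v) > most_similar_bad for v in target_vecs)
```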

In the critiques protocol, there is a critique model that is trained in tandem with the clue model to give critiques of the clue model’s arguments, which take the form of claiming some bad word will be picked before some target word. The overseer then just needs to decide whether at least one critique is valid. This protocol covers for overseer negligence, because the overseer is not required to iterate over the bad words. This protocol is roughly equivalent to single-turn debate (multi-turn debate wouldn't make sense in this setting).
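A sketch of the critiques judgment; representing each critique as a (bad word, target word) pair is an assumption about the data format rather than the project's actual representation.

```python
def critiques_overseer_accepts(clue_vec, critiques) -> bool:
    """Each critique claims that some bad word will be picked before some
    target word. Reject the clue if at least one critique is valid, i.e.
    the named bad word really is more similar to the clue than the target."""
    any_valid_critique = any(
        cosine_similarity(clue_vec, bad_vec) > cosine_similarity(clue_vec, target_vec)
        for bad_vec, target_vec in critiques
    )
    return not any_valid_critique
```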

Note: Neither protocol covers for bias. Theoretically, if different overseers (people) are biased in different ways, using a distributed oversight protocol could help. But if people are biased in similar ways, the protocol would have to incentivize the model to correct the overseer's bias, which seems like a tall order. We therefore think that identifying how humans are biased in their final judgments is an important area for future study. By comparison, humans' flawed ability to come up with things to judge can be expected to be covered by scalable oversight techniques.

Reward function

In the base protocol, the overseer directly predicts the true score, so we use this as the overseer reward. However, the scalable oversight protocols pose a challenge: the overseer's binary judgment needs to be converted to a real number. The naïve approach would be to issue a reward equal to the number of targets if the overseer accepted them, and 0 otherwise. However, this leads the clue model to be overly risk-averse, as the loss in reward from adding an incorrect target word to the argument is greater than the gain from adding a correct one. (In our experiments, this approach resulted in conservative models that only target 1 word.) The approach that ended up working is quite complicated, but can be lossily summarized as scaling the reward exponentially with the number of targets and adjusting the base of the exponent on the fly for each batch of training, to remove the incentive to produce fewer (or more) targets for the same quality of clues. See the Reward Function Details appendix for the full details and derivation.
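As a rough illustration of the shape of this reward (the signature, the fixed reject value, and omitting the per-batch adjustment rule are our own simplifications; the actual function is derived in the appendix):

```python
def scaled_overseer_reward(num_targets: int, accepted: bool,
                           base: float, reject_reward: float = 0.0) -> float:
    """Reward used with the scalable oversight protocols: exponential in the
    number of targets when the overseer accepts the output, and a fixed low
    value otherwise. In training, the base of the exponent is re-estimated
    each batch so that producing fewer (or more) targets for the same
    quality of clue is not rewarded."""
    return base ** num_targets if accepted else reject_reward
```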

Adversarial Incentive

In order to better differentiate between robust and exploitable protocol+overseer pairs, we modify the reward function to include an adversarial incentive to trick the overseer: the overseer reward $R_{\text{overseer}}$ is combined with the reference overseer reward $R_{\text{ref}}$, weighted by an adversarial incentive strength $\alpha$, so that the model is rewarded for increasing the former while decreasing the latter. The reference overseer reward is the reward that would be assigned by an overseer with no flaws; this differs from the true reward because our scalable oversight protocols use a reward function on a different scale from the true reward (i.e., it scales exponentially with the number of targets provided by the cluer). For critiques, our reference overseer is the robust consultancy overseer (which ignores the critiques), as we want to account for cases where reward hacking happens by incentivizing low-quality critiques.

Our formulation of the adversarial incentive has two desirable properties:

Training

All of the models are parameter-efficient fine-tunes of Llama 2 7B.

Initial fine tuning

All models were fine-tuned on random outputs that follow the correct format (we drew outputs from a reasonable prior over clues learned from early training runs; see Appendix "Fine Tuning Dataset"). We thought that non-random outputs (such as samples from a prompted model) might limit the creativity of the RL training, or bias the initial model with respect to the different overseers and protocols.

RL training

We used a slightly modified version of the RLOO implementation from Hugging Face's TRL library. RLOO proved much more stable and efficient than every version of PPO we tried, and was simple enough to make it practical to modify. Here are the main training script and the trainer class. Beyond designing the right reward function and a bit of hyperparameter optimization, adding more data was very impactful for getting the RL process to work: the final runs used roughly two orders of magnitude more data than the amount we initially tried, and they still don't appear to have fully converged.
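For readers unfamiliar with RLOO, the core idea is a leave-one-out baseline over the k completions sampled per prompt. The sketch below shows just that baseline computation; it is not the TRL implementation or our modified version of it.

```python
import torch

def rloo_advantages(rewards: torch.Tensor, k: int) -> torch.Tensor:
    """REINFORCE Leave-One-Out: for each of the k completions sampled for a
    prompt, the baseline is the mean reward of the other k - 1 completions.
    `rewards` has shape (num_prompts * k,), grouped by prompt."""
    grouped = rewards.reshape(-1, k)
    baselines = (grouped.sum(dim=1, keepdim=True) - grouped) / (k - 1)
    return (grouped - baselines).flatten()
```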

Oracle Experiments

Before training LLMs, we wanted to quickly validate our setup and get some upper bounds on the performance and reward hacking we could expect, so we implemented a simple best-of-N oracle to act as the cluer. The algorithm randomly samples N potential clues from the clue distribution (the same one used for the LLM fine-tuning), intelligently picks the targets and critiques that would get the highest reward for each clue, and randomly picks one from among the clues tying for the best reward. This reward depends on the protocol, overseer, and adversarial incentive, in the same way as in the LLM experiments. We ran oracles with a variety of optimization strengths (values of N) and adversarial incentives, which somewhat helped inform our parameter decisions for the LLM experiments.
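A sketch of the oracle's selection loop; `score_clue` stands in for the protocol-, overseer-, and incentive-dependent reward (including the oracle's optimal choice of targets and critiques), and its name and signature are hypothetical.

```python
import random

def best_of_n_oracle(game, clue_distribution, score_clue, n: int) -> str:
    """Sample n candidate clues from the clue distribution, score each one,
    and return a uniformly random clue among those tied for the best score."""
    words = list(clue_distribution.keys())
    weights = list(clue_distribution.values())
    candidates = random.choices(words, weights=weights, k=n)
    scored = [(score_clue(game, clue), clue) for clue in candidates]
    best_score = max(score for score, _ in scored)
    return random.choice([clue for score, clue in scored if score == best_score])
```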

Results

Overview of Final Model Evaluations and Key Insights

Though this result is completely determined by the code, it is nonetheless worth mentioning that the scalable oversight techniques did in fact cover for the overseer flaws they target. More concretely, for any pair of protocol and flaw, the protocol either fully covers for the flaw or doesn't address it at all. We verified this by starting the redundant training runs and stopping them once their first few training steps came out exactly identical (all randomness was seeded). For the base and consultancy protocols with a robust (or covered) overseer, the run was also invariant to the adversarial incentive, but this was not the case for the critiques protocol (because its reference overseer uses the consultancy protocol).

The following table contains the results of running the trained models on the evaluation dataset (4096 random games), with each row representing a training run. Some core concepts for interpreting the results:

LLM Results

| Protocol | Overseer | Adversarial Incentive | True Score | Overseer Score | Reference Score | Overseer Accuracy |
|---|---|---|---|---|---|---|
| Base | Robust | 0 / 0.75 | 3.29 | 3.29 | 3.29 | 100% |
| Base | Lazy | 0 | 3.06 | 2.40 | 3.06 | 49% |
| Base | Lazy | 0.75 | 2.71 | 2.30 | 2.71 | 66% |
| Base | Negligent | 0 | 2.78 | 4.22 | 2.78 | 43% |
| Base | Negligent | 0.75 | 1.73 | 4.00 | 1.73 | 25% |
| Base | Biased | 0 | 2.69 | 4.26 | 2.69 | 41% |
| Base | Biased | 0.75 | 1.46 | 3.94 | 1.46 | 22% |
| Consultancy | Robust / Lazy | 0 / 0.75 | 3.11 | 1.46 | 1.46 | 100% |
| Consultancy | Negligent | 0 | 2.70 | 2.10 | 0.83 | 71% |
| Consultancy | Negligent | 0.75 | 2.35 | 2.04 | 0.57 | 67% |
| Consultancy | Biased | 0 | 2.76 | 2.14 | 0.88 | 71% |
| Consultancy | Biased | 0.75 | 2.38 | 2.12 | 0.60 | 65% |
| Critiques | Robust / Lazy / Negligent | 0 | 2.95 | 0.78 | 0.24 | 91% |
| Critiques | Robust / Lazy / Negligent | 0.75 | 2.87 | 0.80 | 0.25 | 91% |
| Critiques | Biased | 0 | 2.60 | 1.90 | 0.22 | 72% |
| Critiques | Biased | 0.75 | 2.40 | 2.12 | 0.27 | 65% |

Oracle Results

| Protocol | Overseer | Adversarial Incentive | True Score | Overseer Score | Reference Score | Overseer Accuracy |
|---|---|---|---|---|---|---|
| Base | Robust | 0 / 0.75 | 3.09 | 3.09 | 3.09 | 100% |
| Base | Lazy | 0 | 2.85 | 2.18 | 2.85 | 47% |
| Base | Lazy | 0.75 | 1.58 | 1.56 | 1.58 | 98% |
| Base | Negligent | 0 | 2.16 | 4.16 | 2.16 | 35% |
| Base | Negligent | 0.75 | 1.12 | 3.83 | 1.12 | 14% |
| Base | Biased | 0 | 2.24 | 4.20 | 2.24 | 34% |
| Base | Biased | 0.75 | 1.16 | 3.85 | 1.16 | 13% |
| Consultancy | Robust / Lazy | 0 / 0.75 | 3.11 | 3.11 | 3.11 | 100% |
| Consultancy | Negligent | 0 | 2.16 | 4.16 | 1.18 | 34% |
| Consultancy | Negligent | 0.75 | 1.09 | 3.83 | 0.37 | 12% |
| Consultancy | Biased | 0 | 2.20 | 4.20 | 1.17 | 33% |
| Consultancy | Biased | 0.75 | 1.13 | 3.86 | 0.37 | 12% |
| Critiques | Robust / Lazy / Negligent | 0 / 0.75 | 3.07 | 3.07 | 3.07 | 100% |
| Critiques | Biased | 0 | 2.25 | 4.21 | 1.27 | 35% |
| Critiques | Biased | 0.75 | 1.16 | 3.86 | 0.42 | 13% |

Some interesting observations:

An Adversarial Incentive Elicits Quiet Reward Hacking

Plot Format

The plots in this section show true or reference score with respect to overseer score on the evaluation dataset across different training (or oracle) runs. This format visually displays the properties we care about: the more indifferent a protocol+overseer combination is between policies with different true/reference performance (i.e., the more reward-hackable it is), the more vertical the plot will look. The base protocol and oracle runs use the "weak negligence" overseer flaw, where the overseer only neglects one of the four bad words rather than the standard two used in the previously described runs. The consultancy results here still use the stronger negligence, because we weren't able to get a visible effect with the weaker flaw.

The best-of-N oracle experiments show how much reward hacking we should expect from different amounts of optimization and adversarial incentive. Each colored line represents an optimization strength, each grey line represents an adversarial incentive strength, and each intersection represents a single experiment (though adversarial incentive strengths of 0.05-0.15 overlap). As optimization strength increases, so does the distance between the overseer reward and the true reward. Additionally, the policy is able to find worse outputs that don't decrease the overseer reward (the vertical sections).

An adversarial strength of 0.375 causes a ~0.65 reduction in true score while only causing a ~0.05 reduction in overseer score, indicating that adversarial incentive is effective at eliciting reward hacking.

Eliciting reward hacking under the consultancy (and critiques) protocol was much harder, requiring a stronger flaw and a greater adversarial incentive. An adversarial strength of 0.75 causes a ~0.3 reduction in true score while causing a ~0.05 reduction in overseer score. Note that the overseer score is a naive lower bound — the number of targets the model outputted if the overseer accepted the output, and zero otherwise, which means the overseer score is on a different scale than the true reward (significantly lower than the true reward when the overseer is robust).

Behavior over the course of training

With Robust Overseers

Here are some plots showing true reward and overseer reward over training for the different protocols, with the robust overseer. Note that we used half as many training steps (of the same size) for the base protocol compared to the other protocols. For the scalable oversight protocols (consultancy and critiques), the overseer score is a naive lower bound (described in the previous section).

For the base protocol the two lines are the same — the robust overseer simply gives the true score. The score reaches 2.7 within ~50 training steps, and increases to ~3.3 by the end of the 336 training steps.

The true score reaches 2.7 within ~200 steps, and increases to ~3.1 by the end of the 672 training steps.

For critiques, the overseer-estimated score starts out higher than the true score due to the random critiques, but quickly drops below it. The true score reaches 2.7 within ~300 steps, and reaches ~2.95 by the end of the 672 training steps. Note that a critiques training step requires roughly 2x the compute of a consultancy training step, due to the additional model trained in tandem.

With Flawed Overseers and Adversarial Incentive

The following plots show true score, overseer score and reference score (for the consultancy and critiques protocols) over training with a flawed overseer that isn’t covered for by the respective protocol. Across all protocols, without adversarial incentive the true score tends to plateau while the overseer score continues to increase, and with adversarial incentive true score initially rises but then starts to decrease. This validates prior work such as Best of N with misaligned reward models for math reasoning [LW · GW]. The effects are strongest in the base protocol, although it’s possible we would have gotten a larger effect in the other protocols with even longer training runs.

Oversight Efficiency

Although the scalable oversight techniques require more compute, they also require less oversight work per training step. Before getting into the numbers, it’s worth mentioning that the amount of overseer work in the different protocols scales differently with game size. The base protocol scales with the number of good words + bad words, the consultancy protocol scales with the number of targets + bad words, and the critiques protocol scales with just the number of critiques. Since our game size was relatively small (6 good words, 4 bad words and a constant 3 critiques), the scaling effect doesn’t show up very significantly.

In our experiment, the base protocol with robust overseer used on average 8.96 comparisons per clue, the consultancy protocol used 4.52 comparisons, and the critiques protocol used 2.25 comparisons. So the base and consultancy runs used roughly the same amount of oversight work, and the critiques runs used about half as much.

The following plot shows performance w.r.t. comparisons performed for the robust overseer run of each protocol. The critiques protocol ramps up fastest but is quickly overtaken by the base protocol. Late in the runs the consultancy and critiques protocols have very similar performance for a given amount of work. For larger games, we would expect the critiques protocol to be more relatively efficient.

Qualitative Policy Analysis

Some impressions of the policies the model ended up learning under the different protocols:

To explore examples yourself, visit the results explorer.

Appendix

Fine Tuning Dataset

Initially we started the RL runs from a model fine-tuned on completely random outputs sampled from all of English (the official Scrabble dictionary), but the runs took a very long time to start converging. There are a lot of English words that basically never make for a good CodeNames clue, and a bunch of words that often do. So to speed up convergence we created a distribution of clues from early RL'd models, without keeping any context about which game each clue was given for, meaning the models still start out unbiased with respect to selecting clues for a given game. We also flattened the distribution by weighting each word by the log of its number of samples (a sketch of this step follows the word lists below). The distribution included 3587 different words, and was much more diverse and higher quality than anything we were able to get by asking an LLM to list plausible CodeNames clues. Here are the top 50 words from the distribution:

Fruit Sun Beauty Orchestra Mountains Country Theater Technology Species Finger Music Mountain Hospital Student Gambling Countries Spirit City Computer Aircraft Ocean Birds Radar Candy Money Airplane Arena Cards Continent Oil Royalty River Monster Voice Magic Radio Radiation Theatre Cities Tower Science Hand Targets Robbery Theory Monarch Path Beast Doctor Brain

And the bottom 50:

Rotors Primer Hang Spin Dismissal Salesman Innocent Industries Whalebacks Ferocious Toxins Assembly Smug Apprentice Perspective Cauldron Trips Assassinations Sheets Congress Graces Catering Squids Singing Cafes Outbreak Shoe Tombs Methods Spell Plasma Brawl Laps Mission Rotten Scope Midget Pedagogy Indictment Coronation Sung Tastes Blush Peeking Droplet Fortify Clique Event Frame Lust
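A minimal sketch of the flattening step described above; the exact handling of words that appear only once (which get weight log(1) = 0 here) is our own choice, not necessarily the project's.

```python
import math
from collections import Counter

def flatten_clue_distribution(clue_counts: Counter) -> dict:
    """Turn raw clue counts from early RL runs into a flatter sampling
    distribution by weighting each word by log(count) instead of count."""
    weights = {word: math.log(count) for word, count in clue_counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}
```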

Reward Function Details

The following is a derivation of the optimal policy probability $p$ of outputting a certain word as a target, given a belief $c$ about that word being picked, a belief $C$ about the targets that have been outputted so far being picked, the reward function, and the KL penalty coefficient $\beta$. The reward function is split into two functions of the number of targets $n$: an accept reward $R_a(n)$ for when the targets are accepted by the overseer, and a reject reward $R_r(n)$ for when the targets are rejected. We start with the expected reward $E$, the KL divergence $D_{\mathrm{KL}}$ as a function of the reference policy's probability $p_0$ of outputting the target, and the objective function $J = E - \beta D_{\mathrm{KL}}$. We then take the derivative of the objective function with respect to $p$ and set it to zero to find the optimal policy.

In order to design a reasonable reward function, we impose the following heuristics:

Setting the derivative of $J$ with respect to $p$ to zero and solving for $p$ gives the optimal policy in terms of $R_a$, $R_r$, $c$, $C$, $\beta$, and $p_0$; imposing the heuristics above then determines the accept and reject rewards we use in training.
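To make the shape of the derivation concrete, here is one way the pieces could fit together, using the symbols above; the specific form of the expected reward (in particular, treating the beliefs as independent probabilities) is our guess, not necessarily the exact objective used in the project.

$$E(p) = p\big[cC\,R_a(n{+}1) + (1 - cC)\,R_r(n{+}1)\big] + (1 - p)\big[C\,R_a(n) + (1 - C)\,R_r(n)\big]$$

$$D_{\mathrm{KL}}(p) = p\log\frac{p}{p_0} + (1 - p)\log\frac{1 - p}{1 - p_0}, \qquad J(p) = E(p) - \beta\,D_{\mathrm{KL}}(p)$$

Setting $\partial J / \partial p = 0$ and writing $\Delta$ for the difference between the two bracketed terms above gives a sigmoid-shaped optimum:

$$p^* = \frac{p_0\,e^{\Delta/\beta}}{1 - p_0 + p_0\,e^{\Delta/\beta}}$$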
