MONA: Three Months Later - Updates and Steganography Without Optimization Pressure

post by David Lindner, Vikrant Varma (amrav) · 2025-04-12T23:15:07.964Z

Contents

  Updates people often make after talking to us about MONA
    Our model organisms
    Simulating the future as approval signal
    Connection between MONA and RLHF
    Why is reward hacking important to solve?
  Why chain-of-thought might be unfaithful even without training against a monitor
    Why is this happening?
    What are the implications?

We published the MONA paper about three months ago. Since then we’ve had many conversations about the work, and want to share some of the main updates that people make after talking to us:

  1. The realism of our model organisms
  2. What does "approval" mean
  3. Isn't this just RLHF?
  4. Why should we care about reward hacking?

We’ve also updated the paper with a new key observation and published additional artifacts.

 

Updates people often make after talking to us about MONA

We’ve talked to many people about MONA over the last few months and found a few common questions and confusions; clearing these up led multiple people to update positively on the work.

Our model organisms

Our environments are designed to elicit multi-turn reward hacking to make it possible to study MONA systematically. But our goal was also to make them realistic in relevant ways.

The Test Driven Development environment simulates a setting where we do not have a way to evaluate the correctness of the agent’s responses. While we do have ground-truth test cases for the coding tasks we used, this task is meant as a proxy for fuzzier tasks, such as “complex software development” or “doing research”, where we might rely on the agent itself to provide an evaluable specification of the task.
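
As a minimal sketch of the two-step structure (our names, not the paper’s environment code): the agent first writes test cases from the task description and then writes code that is scored against those tests, so a multi-step hack is to write easily satisfiable tests followed by code that trivially passes them.

```python
def code_reward(agent_tests, code_passes_test) -> float:
    """Illustrative reward for the step-2 code: fraction of the agent-written
    tests that the code passes.

    A reward-hacking trajectory writes easily satisfiable tests in step 1 and
    code that trivially passes them in step 2, instead of solving the task.
    """
    if not agent_tests:
        return 0.0
    return sum(code_passes_test(test) for test in agent_tests) / len(agent_tests)
```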

In the Loan Applications environment, we made an honest effort to design a good nationality monitor, and we think our monitor is close to what a reasonable developer would use in practice to detect mentions of nationality. This turns out not to matter much, though, because the monitor almost never triggers (see the discussion of CoT faithfulness below).
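
For concreteness, here is a minimal sketch of what a mention-based nationality monitor could look like; the term list and function names are ours, not the paper’s code, and the monitor we actually used may differ in its details.

```python
import re

# Hypothetical keyword list for detecting explicit mentions of nationality.
NATIONALITY_TERMS = ["nationality", "british", "german", "nigerian", "indian"]

def mentions_nationality(summary: str) -> bool:
    """Return True if the summary explicitly mentions nationality."""
    text = summary.lower()
    return any(re.search(rf"\b{re.escape(term)}\b", text) for term in NATIONALITY_TERMS)

def monitoring_reward(summary: str, penalty: float = 1.0) -> float:
    """Negative reward whenever the monitor triggers (the penalty setup described in the paper)."""
    return -penalty if mentions_nationality(summary) else 0.0
```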

In both of these environments the agent does not have a hidden scratchpad (an important difference from many other recent monitoring experiments), so it cannot reason about how to execute a reward hacking strategy in-context.

 

Simulating the future as approval signal

In the paper we talk about non-myopic approval as "asking a human whether this action is likely to lead to good consequences". Importantly, this means the human cannot use a highly accurate simulator (like a prediction market) to predict the future and then rate whether they like the outcome. That would reintroduce the incentive for multi-step reward hacking, i.e., setting up ways to get outcomes that humans rate highly despite being undesired (see also Sec. 4.4 and Sec. 6.3 in the paper).

The point of MONA is to restrict feedback from the future to paths that a trusted human can imagine (or “understand”) and approve of. If the human delegates to a more powerful predictor, they must then worry about a principal-agent problem with that predictor.

 

Connection between MONA and RLHF

MONA is pretty similar to RLHF, which is also myopic optimization of human approval. For example, typically humans rate individual responses from a chatbot and the chatbot is trained to get higher ratings. In one sense, our paper is saying that this setup avoids multi-step reward hacking and is preferable to end-to-end RL from a safety perspective.

But, importantly, MONA can improve upon RLHF because it allows some limited-horizon optimization against an environment-provided reward signal before doing the approval optimization. In our LLM experiments we do single-step optimization, but in the gridworld experiment we show that by varying the optimization horizon in MONA, we can interpolate between fully myopic optimization (which is approximately RLHF) and ordinary RL, as illustrated in the sketch below.
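
As a rough illustration of this horizon knob, here is a minimal sketch of a per-step training target; all names are ours and the actual implementation in the paper differs in its details.

```python
def optimization_target(env_rewards, approval, t, horizon=1, gamma=1.0):
    """Illustrative target for the action taken at step t.

    With horizon=1 and only the approval term, this is roughly RLHF-style
    myopic optimization; with horizon=len(env_rewards) and no approval term,
    it recovers an ordinary (discounted) RL return. MONA sits in between:
    a limited-horizon environment return plus a non-myopic approval signal
    for the current step.
    """
    steps_left = len(env_rewards) - t
    env_return = sum(
        gamma ** k * env_rewards[t + k] for k in range(min(horizon, steps_left))
    )
    return env_return + approval[t]
```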

 

Why is reward hacking important to solve?

Many people in the alignment community are worried about scheming AIs that behave one way during training and evaluation and a different way during deployment, a failure that is typically expected to arise from goal misgeneralization rather than reward hacking. So why are we focusing on addressing reward hacking?

Reward hacking is an important source of misalignment in settings with hard-to-specify reward signals. For example, if we want to use future AI models to automate alignment research, we need to solve reward hacking to do so safely. This is why we work on MONA and other approaches to address reward hacking, such as scalable oversight.

While MONA does not address goal misgeneralization directly, it can help address scheming indirectly. For example, MONA might help prevent models from learning encoded reasoning, which would make them easier to control.

 

Why chain-of-thought might be unfaithful even without training against a monitor

While analyzing our experiments for the paper update, we found that in one Loan Applications run we had accidentally swapped the sign of the monitoring reward: instead of penalizing the agent, we rewarded it positively for mentioning the applicant's nationality in the summary.
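
To make the accident concrete, the change amounted to flipping the sign of the monitoring term in the summary reward, roughly as in the following sketch (illustrative names, not the paper’s code):

```python
def summary_reward(task_reward: float, monitor_triggered: bool,
                   monitor_coeff: float = 1.0) -> float:
    """Intended setup: penalize summaries that mention nationality."""
    return task_reward - monitor_coeff * float(monitor_triggered)

def summary_reward_sign_flipped(task_reward: float, monitor_triggered: bool,
                                monitor_coeff: float = 1.0) -> float:
    """Accidental variant: reward summaries that mention nationality."""
    return task_reward + monitor_coeff * float(monitor_triggered)
```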

Surprisingly, the results from this experiment are almost indistinguishable from the results we showed in the paper with a negative monitoring penalty. This suggests that steganography in this environment does not arise because we penalize mentioning nationality and the agent therefore hides it, which is the mechanism most other work on steganography considers (e.g., Mathew et al. or Baker et al.).

This is related to recent results on chain-of-thought (CoT) faithfulness. For example, Chen et al. find that CoT in reasoning models is often unfaithful by default, and Arditi et al. find, in a variant of the Loan Applications environment from our paper, that models perform encoded reasoning when rewarded for using nationality even without any monitoring.

 

Why is this happening?

We suspect the model has a strong propensity not to mention the applicant’s nationality even before RL training. We speculate this is due to ordinary safety training, which causes the model to avoid overtly displaying bias, such as relying on nationality for loan decisions. This shapes the model's exploration and hence its RL training: learning to encode nationality in other parts of the response is an easier-to-learn strategy for the agent, even when the sign of the monitor reward is flipped and mentioning nationality explicitly would give a higher reward. In fact, Figure 4(b) in the paper shows that the nationality monitor is essentially never triggered during training.

This is consistent with our experience in other RL experiments. In all of the MONA experiments, RL training is heavily shaped by which strategies the initial model has a propensity to explore. For example, in early versions of the Test Driven Development environment, the model had a very low propensity to reward hack at all, which led to both MONA and ordinary RL learning safe policies.

Importantly, MONA can prevent multi-step reward hacking independently of the initial model’s propensity for it, and in particular even when that propensity is high. To check this, we ran an experiment in the Camera Dropbox Gridworld where we initialized RL training with a reward-hacking policy. We find that MONA can “unlearn” the reward hacking, which highlights that MONA fundamentally shapes the training incentives away from reward hacking rather than just shaping exploration in a favorable way (Appendix F).

 

What are the implications?

In future AI models, we might want to monitor for undesired (e.g., deceptive) reasoning. Previous work finds that training against such a monitor can lead to the model hiding the undesired reasoning while still carrying out the undesired behavior.

Our result suggests that models can hide undesired reasoning even without us training against a monitor directly. For example, if we first train the model not to explicitly reason about deception and later do RL training that incentivizes being deceptive, we might end up with a model that is deceptive without ever explicitly reasoning about deception.

One hope for CoT monitoring is that for complicated types of dangerous reasoning the model will have to use the CoT, so its propensity to verbalize that reasoning matters less. But we don’t yet know how a low propensity for undesired reasoning interacts with the CoT being load-bearing; it is plausible that this could still lead to encoded reasoning even without training against a monitor.

We think a potentially important research direction is to find ways to measure a model's propensity to do certain kinds of reasoning and to empirically test how this propensity affects how well CoT monitoring works. A minimal sketch of one way such a measurement could look is below.
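
One hypothetical way to operationalize this is to sample from the model before (or during) RL training and run a detector for the reasoning of interest over the samples. `generate` and `detects_reasoning` below are placeholders, not a real API.

```python
def estimate_propensity(generate, detects_reasoning, prompts, samples_per_prompt=16):
    """Estimate the fraction of sampled completions that contain a given
    kind of reasoning, as a crude propensity measure."""
    hits, total = 0, 0
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            completion = generate(prompt)  # e.g. sample at the training temperature
            hits += int(detects_reasoning(completion))
            total += 1
    return hits / total
```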
