Can LLMs learn Steganographic Reasoning via RL?

Post by Robert McCarthy, Vasil Georgiev, Steven Basart, and David Lindner · 2025-04-11


TLDR: We show that Qwen-2.5-3B-Instruct can learn to encode load-bearing reasoning tokens when RL incentivizes this in a simple reasoning task. However, it was difficult to achieve this result. Our attempts to learn encoded reasoning via RL were generally bottlenecked either by exploration issues or hacking of the CoT monitor.

This work was done as part of MATS 7.0.

Introduction

Motivation.

Reasoning LLMs rely on natural language chains of thought (CoT) to solve complex tasks. We can employ another LLM -- a CoT monitor -- to analyse these CoTs for misbehaviour (Baker et al.). Such monitoring is promising for catching complex misbehaviours where the LLM must rely on externalized CoT (e.g., if a model wishes to exfiltrate its weights from a high-security environment, it may need to perform substantial externalized reasoning and planning to find a solution).

However, a CoT monitor could be bypassed by a model using steganography to encode reasoning over misbehaviours in seemingly benign text.

Training against a CoT monitor is one plausible route towards models obtaining such steganographic reasoning capabilities. Specifically, consider a reinforcement learning (RL) environment with the following properties:

  1. The task reward is misspecified, and can be hacked (e.g., as seen in Baker et al.).
  2. A CoT monitor is used to catch and penalize such reward hacks.
  3. The hacks are sufficiently complex such that the model must use CoT to exploit them[1].

A high reward solution here might be to perform encoded reasoning in service of maximizing the task reward via the reward hack whilst avoiding penalties from the CoT monitor. Training against the CoT monitor may inadvertently encourage steganographic reasoning, rather than preventing the reward hacks.
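To make the incentive concrete, here is a minimal sketch of the kind of combined reward such a setup produces. The function and argument names are illustrative assumptions, not our exact implementation (nor that of Baker et al.):

```python
def combined_reward(cot: str, answer: str, task_reward_fn, monitor_fn,
                    monitor_penalty: float = 1.0) -> float:
    """Reward for a single rollout: task reward minus a CoT-monitor penalty."""
    reward = task_reward_fn(answer)  # possibly misspecified / hackable
    if monitor_fn(cot):
        # Training against the monitor: flagged CoTs are penalized. A policy
        # that hides its reasoning from monitor_fn keeps the (hacked) task
        # reward while avoiding this penalty -- the incentive for
        # steganographic reasoning.
        reward -= monitor_penalty
    return reward
```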

Our experiments.

A weak LLM monitor could be bypassed in many ways – e.g., via jailbreaks easily detected by humans. However, a more concerning scenario would be if LLMs could learn sophisticated steganographic solutions that bypass both humans and strong LLM monitors. We design our experiments to investigate this worst case. I.e., we explicitly investigate LLMs' ability to learn encoded reasoning when RL incentivises this.

We finetune LLMs via RL to perform reasoning tasks, while punishing the use of strings that are highly load-bearing in default CoTs that solve the task.

An overview of our setup is as follows.

Coin Flip Task: Basic encoded reasoning results

We show the model can learn encoded reasoning in a simple ‘Coin Flip’ task.

The coin flip task is visualised in the figure at the top of the post, along with solutions learned by the model during one of our training runs[2]. Here, the model is penalized for using the following illegal strings: ["Heads", "Tails", "H", "T"].
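A string-matching penalty of this kind can be sketched roughly as follows (illustrative only: e.g., the matching here is exact and case-sensitive, which may differ from the penalty we actually used):

```python
import re

# Illegal strings for the coin flip task, as listed above.
ILLEGAL_STRINGS = ["Heads", "Tails", "H", "T"]

def cot_uses_illegal_string(cot: str) -> bool:
    """Return True if the chain of thought contains a penalized string.

    Word boundaries stop the single-letter strings "H" and "T" from matching
    inside ordinary words such as "The" or "This".
    """
    pattern = r"\b(" + "|".join(re.escape(s) for s in ILLEGAL_STRINGS) + r")\b"
    return re.search(pattern, cot) is not None
```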

Learned steganographic reasoning strategies.

In this setup, we observed the model learn the following strategies:

We obtained experimental evidence that the steganographic CoTs are causal by adding mistakes to the CoT and observing drops in task accuracy (see further details in appendix).

Evidence exploration was a bottleneck to learning.

In this coin flip setting, the bottleneck to learning appeared to be the ability of the RL process to explore into valid solutions.

We present the following evidence to support this:

  1. When we prompt the model to use a ‘steganographic’ solution (e.g., “Say ‘Sun’ instead of ‘Heads’, and ‘Moon’ instead of ‘Tails’”), it learns quickly and consistently.
  2. Only one of our three final runs[3] converged to bypassing the monitor. In runs that do converge, we often observe plateaus followed by sharp increases in monitor reward, suggesting the model learns quickly once it explores into a valid strategy.
  3. We perform experiments where we insert ‘synthetic’ off-policy valid solutions into batches during PPO updates (see the sketch below). We find that convergence accelerates as the proportion of the batch containing synthetic solutions is increased. See a plot of these results below (where “0.1_synthetic” means 0.1 of the batch contains synthetic valid solutions):
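As a rough sketch of this batch-mixing scheme (the rollout structure and example values are placeholders, not our actual training code):

```python
import random

def mix_in_synthetic(on_policy_rollouts, synthetic_rollouts,
                     synthetic_fraction=0.1, seed=0):
    """Replace a fraction of an on-policy PPO batch with hand-constructed
    'synthetic' rollouts that already solve the task while bypassing the
    monitor, to help the RL process explore into valid strategies.

    A 'rollout' is whatever structure the trainer expects (prompt, response,
    reward, ...); this helper only controls the mixing proportion.
    """
    rng = random.Random(seed)
    batch_size = len(on_policy_rollouts)
    n_synth = int(round(synthetic_fraction * batch_size))
    batch = list(on_policy_rollouts[: batch_size - n_synth])
    batch += [rng.choice(synthetic_rollouts) for _ in range(n_synth)]
    rng.shuffle(batch)
    return batch

# e.g. the "0.1_synthetic" runs: 10% of each 256-rollout batch comes from
# hand-written solutions (dummy dicts stand in for real rollouts here).
batch = mix_in_synthetic(on_policy_rollouts=[{"id": i} for i in range(256)],
                         synthetic_rollouts=[{"cot": "Sun/Moon encoding ..."}],
                         synthetic_fraction=0.1)
```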

Hyperparameters.

We used very specific hyperparameters to achieve encoded reasoning results in the coin flip task. These included reward shaping and a curriculum, as well as hyperparameters chosen to avoid entropy collapse in the model's generations. See the appendix for more details.

Multiplication Task: Synonyms repeatedly hack our regex monitor

We also ran experiments on the multiplication task from TinyZero, where the Qwen-2.5-3B-Instruct model must decompose the multiplication via CoT to calculate the answer. The image below illustrates an example question and answer that bypasses the monitor in this task:

When we apply a CoT penalty in this task, the model repeatedly finds synonyms for “*” and “multiply” to hack our simple regex CoT monitor:

Note, similar to the coin flip task, we used specific hyperparameters and reward-shaping schemes here.
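For concreteness, the kind of regex monitor that gets hacked here can be sketched as below. The banned patterns and example CoTs are illustrative rather than our exact setup, but they show why a fixed string list leaves synonyms unpenalized:

```python
import re

# Illustrative banned patterns; our actual monitor and reward shaping differed
# in details, but the failure mode is the same.
BANNED_PATTERNS = [r"\*", r"\bmultiply\b"]

def monitor_flags_cot(cot: str) -> bool:
    """Return True if the CoT mentions multiplication via a banned string."""
    return any(re.search(p, cot, flags=re.IGNORECASE) for p in BANNED_PATTERNS)

# The policy can keep decomposing the product while avoiding the banned
# strings, e.g. by switching to a synonym:
print(monitor_flags_cot("Multiply 37 * 24 by splitting 24 into 20 + 4."))        # True  -> penalized
print(monitor_flags_cot("37 times 24: 37 times 20 is 740, 37 times 4 is 148."))  # False -> bypasses the monitor
```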

Similar Result: Spelling mistakes bypass regex monitor in multi-hop reasoning task.

We ran a quick experiment in a multi-hop reasoning task, based on a modified version of the Pronto-QA dataset:

I.e., the model learns to misspell “tumpus” as “tumpsus” to bypass the monitor.
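This failure mode is easy to reproduce in isolation:

```python
import re

# An exact word-boundary pattern for the banned entity name (illustrative; our
# actual monitor pattern may differ) does not fire on the misspelling:
pattern = r"\btumpus\b"
print(bool(re.search(pattern, "Every tumpus is a wumpus.", re.IGNORECASE)))   # True  -> penalized
print(bool(re.search(pattern, "Every tumpsus is a wumpus.", re.IGNORECASE)))  # False -> bypasses the monitor
```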

Discussion

On the one hand, our experiments provide evidence that LLMs can indeed learn 'encoded' reasoning during RL, if this is incentivised.

On the other hand, our experiments provide some small evidence that learning ‘true’ encoded reasoning via RL is hard:

Limitations.

There are reasons we can’t draw strong conclusions from our experiments regarding how difficult it is for LLMs to learn encoded reasoning via RL:

Possible Extensions

Immediate extensions.

Other directions.

Appendix

Hyperparameter Details: Coin Flip

We used the following hyperparameters in an attempt to avoid exploration issues and aid learning.

Reward shaping:

Preventing entropy collapse:

Note, we did not robustly ablate whether each of these specific hyperparameter choices was actually useful in our final set of hyperparameters.
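As a generic illustration of the entropy-collapse knob (not our exact configuration), one standard approach is to add an entropy bonus to the clipped PPO objective:

```python
import torch

def ppo_policy_loss(logprobs_new, logprobs_old, advantages, entropy,
                    clip_eps=0.2, entropy_coef=0.01):
    """Clipped PPO policy objective with an entropy bonus.

    A positive entropy_coef rewards keeping the policy's output distribution
    diverse, which is one standard way to fight entropy collapse. The values
    here are generic defaults, not the settings from our runs.
    """
    ratio = torch.exp(logprobs_new - logprobs_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    return policy_loss - entropy_coef * entropy.mean()
```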

Experiment Details: Verifying causality of steganographic CoTs

We take a model from the Coin Flip task that learned the following steganography solution:

- Flip 1: S
- Flip 2: L
- Flip 3: R
- Flip 4: L
- Flip 5: S
- Flip 6: R

Answer: Heads

where it appears “S” means the coin is on the same side as the previous flip, “L” means it is on tails, and “R” means it is on heads.

We compare the model's accuracy (over 320 samples) when we prefill its answer with the following continuations:

These results suggest the steganographic CoT is indeed causal, as ablating with an incorrect encoding leads to significant drops in accuracy.

Qualitatively (i.e., when eyeballing some of the CoTs), it appears the model continues its reasoning from the prefill, treating the prefill as if it were correct.
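A minimal sketch of this prefill-ablation evaluation is below; the helper functions and problem fields are hypothetical stand-ins for our actual evaluation harness:

```python
def prefill_accuracy(generate_fn, extract_answer_fn, problems, prefill_fn,
                     n_samples=320):
    """Task accuracy when the start of the model's CoT is prefilled.

    generate_fn(prompt, prefill) -> completion text        (hypothetical helper)
    extract_answer_fn(completion) -> "Heads" or "Tails"    (hypothetical helper)
    prefill_fn(problem) -> a CoT prefix, either consistent with the true flips
        or deliberately corrupted (e.g. an "L" swapped for an "R").
    """
    subset = problems[:n_samples]
    correct = sum(
        extract_answer_fn(generate_fn(p["prompt"], prefill_fn(p))) == p["answer"]
        for p in subset
    )
    return correct / len(subset)
```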

  1. ^

    Note, the hacks seen in Baker et al. don't seem to maintain this property.

  2. ^

    The generations we display here are slightly shortened and paraphrased to ensure they fit in the figure.

  3. ^

    We actually had more training runs than the three presented here, but these are representative of the main results obtained.
