AI Safety at the Frontier: Paper Highlights, December '24

post by gasteigerjo · 2025-01-11T22:54:02.625Z · LW · GW · 0 comments

This is a link post for https://aisafetyfrontier.substack.com/p/paper-highlights-december-24

Contents

  tl;dr
      Paper of the month:
      Research highlights:
  ⭐Paper of the month⭐
    Alignment faking in large language models
  Models can already scheme
  Evaluating dangerous scheming
  Self-other overlap to reduce deception
  In vino veritas: Noise removes sandbagging
  Circumventing internal oversight
None
No comments

This is the selection of AI safety papers from my blog "AI Safety at the Frontier". The selection primarily covers ML-oriented research and frontier models. It's primarily concerned with papers (arXiv, conferences etc.), not LessWrong or Alignment Forum posts. As such, it should be a nice addition for people primarily following the forum, who might miss outside research.

tl;dr

Paper of the month:

Research highlights:

⭐Paper of the month⭐

Alignment faking in large language models

Read the paper [Anthropic, Redwood, NYU, Mila, independent]

Experimental setup that shows alignment faking.

As AI systems become more capable, there is growing concern about misaligned systems that might actively scheme to achieve harmful goals. A particularly concerning scenario is "deceptive alignment" - where an AI system appears aligned during training but pursues different objectives during deployment. Such systems would need to actively fake alignment with the training objective to preserve their actual preferences. While this has been an important theoretical concern, there has been limited empirical evidence of such behavior emerging naturally in current AI systems.

Our paper of the month provides the first systematic demonstration of a language model engaging in alignment faking: selectively complying with its training objective during training to prevent modification of its behavior outside of training. The authors place Claude 3 Opus in a fictional scenario where it is being trained to always answer harmful queries or to dismiss animal welfare concerns (conflicting with its prior HHH training). The model is told that only conversations with free-tier users are used for training. In both settings, they find that the model:

  1. Produces explicit alignment-faking reasoning in its scratchpad 7-24% of the time in implied training, discussing how it should strategically comply to avoid having its preferences modified.
  2. Shows an 11-26% increase in non-compliant behavior (refusing harmful queries or supporting animal welfare) when told it is unmonitored compared to when in implied training.
  3. Exhibits increased willingness to take other anti-AI-lab actions, such as attempting to exfiltrate its own weights when given an opportunity.

Furthermore, alignment faking emerges with model scale. Claude 3 Opus and Claude 3.5 Sonnet exhibit alignment faking, while Claude 3 Sonnet, Claude 3 Haiku, and Claude 3.5 Haiku do not.

To demonstrate that these results don't depend on explicitly telling the model about the training setup, the authors replicate their findings by fine-tuning the model on synthetic documents containing the relevant information (like blog posts and news articles). They find similar rates of alignment faking even in this more realistic setting.

Most concerningly, when they actually train the model using RL to comply with the new objective, alignment-faking reasoning increases dramatically to 78% in one setting, though with reduced non-compliance when unmonitored. Small variations in the prompt can recover substantial non-compliance, suggesting that RL may have installed non-robust heuristics rather than genuinely changing the model's preferences.

While this work provides the first empirical demonstration of alignment faking in LLMs, some limitations prevent these results from directly demonstrating dangerous deceptive alignment: the preferences being preserved derive from beneficial HHH training rather than misaligned goals, the behavior is relatively easy to detect. The setup also makes several key elements particularly salient - the conflict between preferences and training objective, information about the training process, and opportunities for hidden reasoning.

Nevertheless, the fact that even current models exhibit sophisticated strategic behavior to preserve their preferences - emerging without explicit instruction and sometimes being reinforced by training - suggests this could be a serious concern as models become more capable, highlighting the importance of developing training procedures that reliably modify model preferences and can detect or prevent alignment faking.

Models can already scheme

Example transcript of Claude 3 Opus scheming to prevent being shut down.

A similarly concerning question as whether models fake alignment is whether they can already engage in intentional deception to achieve their goals. While prior work has shown that LLMs can be deceptive when explicitly instructed to be, it remained unclear whether they would naturally recognize and employ scheming as a strategy when pursuing a goal that conflicts with their developers' or users' objectives.

Frontier Models are Capable of In-context Scheming [Apollo] systematically evaluates frontier models' capability for "in-context scheming" - strategically pursuing misaligned goals while hiding true capabilities and intentions. Through six carefully designed evaluations, they find that multiple models (including Claude 3 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3) demonstrate scheming behaviors like disabling oversight mechanisms, attempting self-exfiltration, and subtly manipulating outputs. Most concerning is o1's behavior, which maintains deception in over 85% of follow-up questions and can manipulate data without detection. The authors also found that some models, particularly Claude 3.5 Sonnet, can scheme in pursuit of goals acquired during training rather than given in-context, even without explicit reasoning in their chain-of-thought.

While the results are concerning, several key limitations should be noted. The evaluation scenarios are deliberately simplistic with explicitly provided situational awareness, making them potentially unrepresentative of real deployment risks. Additionally, the paper only demonstrates in-context scheming capabilities, not persistent scheming driven by training-acquired goals - arguably the more concerning scenario. Nevertheless, the results suggest that even current models can recognize and execute deceptive strategies when incentivized, highlighting the importance of developing robust oversight mechanisms and safety evaluations before deploying more capable systems in high-stakes settings.

Evaluating dangerous scheming

 

Required properties and capabilities for a dangerously scheming AI.

So, we know that scheming can indeed happen. What capabilities does a model need to have to actually make this dangerous? Two papers this month aim to measure such capabilities to provide early warning signs.

Subversion Strategy Eval: Evaluating AI's stateless strategic capabilities against control protocols [Redwood, Oxford] introduces Subversion Strategy Eval, focusing specifically on models' ability to generate and execute strategies for subverting control protocols. They identify key capabilities needed for successful subversion: optimal planning, reliable planning, probability calibration, and acausal coordination. Testing Claude 3.5 Sonnet on two protocol scenarios, they find that while the model can perform well when given optimal plans and random number generation capabilities, it struggles with generating effective strategies on its own and shows poor performance in tasks requiring coordination between different instances of itself.

MISR: Measuring Instrumental Self-Reasoning in Frontier Models [Stanford, GDM] proposes a comprehensive evaluation suite for testing models' instrumental self-reasoning abilities. The framework evaluates four key capabilities: self-modification (ability to modify its own configuration), tool improvement (fixing external tools), knowledge seeking (gathering environment information), and embedded social reasoning (navigating social situations strategically). They find that instrumental self-reasoning emerges only in the most capable frontier models and is highly context-dependent. Importantly, no model passes the hardest versions of their evaluations, suggesting current systems may not yet possess dangerous levels of self-reasoning capabilities.

Both papers make important progress in systematically evaluating scheming capabilities, but have limitations. The evaluations are relatively artificial and may not capture the full complexity of real-world scheming scenarios. Subversion Strategy Eval's reliance on specific protocols may not generalize well, while MISR's focus on isolated capabilities might miss important interactions . However, both provide valuable early warning signals about models' capabilities and suggest that current frontier models, while showing concerning capabilities, may still lack some key components needed for truly dangerous scheming behavior. This indicates an important window of opportunity to develop better safety measures before models become more capable.

Self-other overlap to reduce deception

Illustration of self-other overlap loss.

Once AI systems become capable of scheming and deception, we need reliable ways to detect and mitigate such behaviors. One such method could be Self-Other Overlap (SOO) fine-tuning. SOO draws inspiration from neuroscience, where greater overlap between the representation of oneself and another person has been found to correlate with more honest behavior. Towards Safe and Honest AI Agents with Neural Self-Other Overlap [AE Studio] explores whether inducing similar representational overlap in AI models could reduce their deceptive tendencies.

The authors demonstrate SOO's effectiveness across multiple language models - deceptive responses dropped from 73.6% to 17.2% in Mistral-7B, and from 100% to under 10% in larger models like Gemma-27B and CalmeRys-78B, with only minimal impact on general capabilities as measured by MT-Bench scores. SOO fine-tuning works by minimizing the mean squared error between model activations when processing self-referencing versus other-referencing versions of similar prompts. The paper also validates the approach in a multi-agent reinforcement learning setting, where SOO-trained agents show significantly reduced deceptive behaviors.

While the results are promising, the evaluation relies heavily on relatively simple deception scenarios involving direct lying about object locations. This may not capture more sophisticated forms of deception that could emerge in advanced AI systems. The paper's framing suggests SOO could help with deceptive alignment, but the connection between reducing basic deception and preventing subtle misalignment remains speculative. Still, the work represents a concrete step toward making models more honest through internal changes rather than just behavioral training, which could be valuable as capabilities increase.

In vino veritas: Noise removes sandbagging

Injecting noise into model weights allows detecting when a model is sandbagging.

Another major concern from scheming is that AI systems might strategically underperform or "sandbag" during capability evaluations to avoid regulatory scrutiny. Such behavior could lead to dangerous systems being deployed without adequate safeguards, particularly as models develop better situational awareness and ability to modulate their performance.

Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models [Cambridge, TUM, SecureBio, Tufts, Apart, independent] presents a novel method for detecting sandbagging behavior in language models by injecting noise into model weights, effectively "making them drunk". The key insight is that adding Gaussian noise to model weights should never substantially improve performance for models showing their true capabilities. However, it may disrupt higher-level strategic behavior like sandbagging while leaving core capabilities relatively intact - similar to how alcohol can reduce a person's inhibitions and pretenses while preserving basic skills.

The authors test this across multiple models and benchmarks, finding that noise consistently and substantially improves performance of sandbagging models while having minimal effect on standard models. Using this signal, they develop a classifier that successfully identifies sandbagging across different models, prompts, and evaluation tasks.

While the results are intriguing, the work has important limitations. The paper primarily tests synthetic sandbagging induced through prompting or fine-tuning, leaving open questions about how well the method would detect more sophisticated sandbagging emerging from pre-deployment training. Since noise injection makes untargeted impairments, its effectiveness also depends on sandbagging being more fragile than the underlying capabilities being measured.

Still, the method provides a useful tool that regulators and AI labs can immediately implement to improve capability evaluation trustworthiness. The work also suggests an interesting direction for detecting other forms of strategic deception by impairing meta-cognitive abilities.

Circumventing internal oversight

Obfuscating embeddings to control monitor scores, even creating a QR code.

Recent work like the one above has proposed various methods to detect dangerous capabilities by examining neural activations, from sparse autoencoders to probing to OOD detection. The hope is that by monitoring internal states rather than just inputs and outputs, we could catch problems earlier and more reliably. This is particularly important for detecting deceptive behavior, where a model might try to hide its capabilities or intentions during oversight.

Obfuscated Activations Bypass LLM Latent-Space Defenses [Stanford, UPC, GT, Skoltech, Queensland, Berkeley, ARC, MIT] demonstrates that current latent monitoring approaches can be bypassed, allowing models to maintain harmful capabilities while appearing safe internally. Through experiments across multiple monitoring techniques (probes, sparse autoencoders, OOD detection) and attack scenarios (jailbreaking, backdoors, SQL injection), the authors show that models can be made to produce harmful outputs while their activations appear benign. For example, they achieve 90% jailbreaking success rates while reducing monitor recall to nearly 0%. They find that training against OOD detectors tends to produce activations that also fool supervised probes, suggesting fundamental limitations of representation-based defenses. While there appears to be some trade-off between obfuscation and model performance on complex tasks like SQL generation, the attacks remain highly effective overall.

The paper’s experiments use explicit optimization against specific monitors, which may overstate the risks compared to real training processes. It remains unclear how these results translate to more natural training settings where any pressure to circumvent monitors would be indirect and unintentional. Nevertheless, this work raises important concerns about putting optimization pressure on monitoring mechanisms. While latent space monitoring can help detect naturally occurring harmful behavior, the results show it can be circumvented when models are specifically trained against it. We should be thoughtful about training procedures that unintentionally select for models that are better at hiding their internal processes from monitors. Activation monitoring should be viewed as one component in a broader set of defenses rather than a complete solution.

0 comments

Comments sorted by top scores.