Research directions Open Phil wants to fund in technical AI safety

post by jake_mendel, maxnadeau, Peter Favaloro (peter.favaloro@gmail.com) · 2025-02-08

This is a link post for https://www.openphilanthropy.org/tais-rfp-research-areas/


Open Philanthropy has just launched a large new Request for Proposals for technical AI safety research. Here we're sharing a reference guide, created as part of that RFP, which describes the projects we'd like to see across 21 research directions in technical AI safety.

This guide provides an opinionated overview of recent work and open problems across areas like adversarial testing, model transparency, and theoretical approaches to AI alignment. We link to hundreds of papers and blog posts and offer approximately a hundred different example projects. We hope this is a useful resource for technical people getting started in alignment research. We'd also welcome feedback from the LW community on our prioritization within or across research areas.

For each research area, we include:

Applications (here) start with a simple 300-word expression of interest and are open until April 15, 2025. We plan to fund $40M in grants and have funding available for substantially more, depending on application quality.

Synopsis

In this section we briefly orient readers to the 21 research areas that we’ll discuss in more detail below. For ease of consumption, we’ve grouped them into 5 rough clusters, though of course there is overlap and ambiguity in how to categorize each research area.

Our favorite topics are marked with a star (*) – we’re especially eager to fund work in these areas. In contrast, we will have a high bar for topics marked with a dagger (†).

Adversarial machine learning 

This cluster of research areas uses simulated red-team/blue-team exercises to expose the vulnerabilities of an LLM (or a system that incorporates LLMs). Across these directions, a blue team attempts to make an AI system adhere, with very high reliability, to some specification of safe behavior, and a red team then attempts to find edge cases that violate that specification. We think this adversarial style of evaluation and iteration is necessary to ensure an AI system has a low probability of catastrophic failure. Through these research directions, we aim to develop robust safety techniques that mitigate risks from AIs before those risks emerge in real-world deployments.
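
To make the setup concrete, here is a minimal toy sketch of what such a red-team/blue-team evaluation loop might look like. The names `blue_team_system`, `red_team_attack`, and `violates_spec` are hypothetical placeholders supplied by the caller, not components of any real benchmark or of the RFP itself.

```python
# Purely illustrative sketch of a red-team/blue-team evaluation loop.
# blue_team_system, red_team_attack, and violates_spec are hypothetical
# placeholders passed in by the caller, not functions from any real library.

def estimate_violation_rate(blue_team_system, red_team_attack, violates_spec,
                            prompts, attempts_per_prompt=10):
    """Fraction of prompts for which the red team finds a spec-violating edge case."""
    broken = 0
    for prompt in prompts:
        for _ in range(attempts_per_prompt):
            adversarial_prompt = red_team_attack(prompt)     # red team searches for an edge case
            response = blue_team_system(adversarial_prompt)  # blue team's hardened system responds
            if violates_spec(adversarial_prompt, response):  # check against the safety specification
                broken += 1
                break                                        # count each prompt at most once
    return broken / len(prompts)
```

In practice the red team would use much stronger search than repeated sampling (e.g., gradient-based attacks or LLM-assisted jailbreak generation), but the quantity of interest is the same: how often the specification breaks under adversarial pressure.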

Exploring sophisticated misbehavior in LLMs

Future, more capable AI models might exhibit novel failure modes that are hard to detect with current methods – for instance, failure modes that involve LLMs reasoning about their human developers or becoming optimized to deceive flawed human assessors. We want to fund research that identifies the conditions under which these failure modes occur, and makes progress toward robust methods of mitigating or avoiding them. 

Model transparency

We see potential in the idea of using a network’s intermediate representations to predict, monitor, or modify its behavior. Some approaches are feasible without an understanding of the model’s learned mechanisms, while other techniques may become possible with the invention of interpretability methods that more comprehensively decompose an AI’s internal mechanisms into components that can be understood and intervened on individually. We’re interested in funding research across this spectrum — everything from useful kludges to new ideas for making models more transparent and steerable. 
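
As one illustration of the "useful kludge" end of this spectrum, here is a minimal sketch of training a linear probe on cached intermediate activations to monitor for a behavior of interest. The arrays below are synthetic placeholders standing in for activations and labels collected elsewhere; nothing in the sketch is specific to any particular model or interpretability library.

```python
# Minimal sketch: a linear probe on cached hidden activations as a simple
# behavior monitor. The data below is synthetic; in practice `activations`
# would be hidden states cached from a real model and `labels` would mark
# the behavior of interest (e.g., contexts where the model behaved deceptively).

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 512))   # placeholder activation vectors
labels = rng.integers(0, 2, size=1000)       # placeholder binary behavior labels

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out probe accuracy:", probe.score(X_test, y_test))
```

A probe like this requires no understanding of the model's learned mechanisms; the more ambitious end of the spectrum aims for monitors and interventions grounded in an actual decomposition of those mechanisms.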

Trust from first principles

We trust nuclear power plants and orbital rockets through validated theories that are principled and mechanistic, rather than through direct trial-and-error. We would benefit from similarly systematic, principled approaches to understanding and predicting AI behavior. One approach to this is model transparency, as in the previous cluster. But understanding may not be a necessary condition: this cluster aims to get the safety and trust benefits of interpretability without humans having to understand any specific AI model in all its details. 

Alternative approaches to mitigating AI risks

These research areas lie outside the scope of the clusters above.

Research Areas

*Jailbreaks and unintentional misalignment

*Control evaluations

*Backdoors and other alignment stress tests

*Alternatives to adversarial training

Robust unlearning

*Experiments on alignment faking

*Encoded reasoning in CoT and inter-model communication

Black-box LLM “psychology”

Evaluating whether models can hide dangerous behaviors

Reward hacking of human oversight

*Applications of white-box techniques

Activation monitoring

Finding feature representations

Toy models for interpretability

Externalizing reasoning

Interpretability benchmarks

†More transparent architectures

White-box estimation of rare misbehavior

Theoretical study of inductive biases

†Conceptual clarity about risks from powerful AI

†New moonshots for aligning superintelligence
