Research directions Open Phil wants to fund in technical AI safety

post by jake_mendel, maxnadeau, Peter Favaloro (peter.favaloro@gmail.com) · 2025-02-08

This is a link post for https://www.openphilanthropy.org/tais-rfp-research-areas/


Open Philanthropy has just launched a large new Request for Proposals for technical AI safety research. Here we're sharing a reference guide, created as part of that RFP, which describes the projects we'd like to see across 21 research directions in technical AI safety.

This guide provides an opinionated overview of recent work and open problems across areas like adversarial testing, model transparency, and theoretical approaches to AI alignment. We link to hundreds of papers and blog posts and offer approximately a hundred different example projects. We hope this is a useful resource for technical people getting started in alignment research. We'd also welcome feedback from the LW community on our prioritization within or across research areas.

For each research area, we include:

Applications (here) start with a simple 300-word expression of interest and are open until April 15, 2025. We plan to fund $40M in grants and have funding available for substantially more, depending on application quality.

Synopsis

In this section we briefly orient readers to the 21 research areas that we’ll discuss in more detail below. For ease of consumption, we’ve grouped them into 5 rough clusters, though of course there is overlap and ambiguity in how to categorize each research area.

Our favorite topics are marked with a star (*) – we’re especially eager to fund work in these areas. In contrast, we will have a high bar for topics marked with a dagger (†).

Adversarial machine learning 

This cluster of research areas uses simulated red-team/blue-team exercises to expose the vulnerabilities of an LLM (or a system that incorporates LLMs). Across these directions, a blue team attempts to make an AI system adhere, with very high reliability, to some specification of safe behavior, and a red team then attempts to find edge cases that violate that specification. We think this adversarial style of evaluation and iteration is necessary to ensure an AI system has a low probability of catastrophic failure. Through these research directions, we aim to develop robust safety techniques that mitigate risks from AIs before those risks emerge in real-world deployments.
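
To make the setup concrete, here is a minimal toy sketch of what such a red-team/blue-team evaluation loop might look like. The names `blue_team_system`, `red_team_attack`, and `violates_spec` are hypothetical placeholders supplied by the caller, not components of any real benchmark or of the RFP itself.

```python
# Purely illustrative sketch of a red-team/blue-team evaluation loop.
# blue_team_system, red_team_attack, and violates_spec are hypothetical
# placeholders passed in by the caller, not functions from any real library.

def estimate_violation_rate(blue_team_system, red_team_attack, violates_spec,
                            prompts, attempts_per_prompt=10):
    """Fraction of prompts for which the red team finds a spec-violating edge case."""
    broken = 0
    for prompt in prompts:
        for _ in range(attempts_per_prompt):
            adversarial_prompt = red_team_attack(prompt)     # red team searches for an edge case
            response = blue_team_system(adversarial_prompt)  # blue team's hardened system responds
            if violates_spec(adversarial_prompt, response):  # check against the safety specification
                broken += 1
                break                                        # count each prompt at most once
    return broken / len(prompts)
```

In practice the red team would use much stronger search than repeated sampling (e.g., gradient-based attacks or LLM-assisted jailbreak generation), but the quantity of interest is the same: how often the specification breaks under adversarial pressure.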

Exploring sophisticated misbehavior in LLMs

Future, more capable AI models might exhibit novel failure modes that are hard to detect with current methods – for instance, failure modes that involve LLMs reasoning about their human developers or becoming optimized to deceive flawed human assessors. We want to fund research that identifies the conditions under which these failure modes occur, and makes progress toward robust methods of mitigating or avoiding them. 

Model transparency

We see potential in the idea of using a network’s intermediate representations to predict, monitor, or modify its behavior. Some approaches are feasible without an understanding of the model’s learned mechanisms, while other techniques may become possible with the invention of interpretability methods that more comprehensively decompose an AI’s internal mechanisms into components that can be understood and intervened on individually. We’re interested in funding research across this spectrum — everything from useful kludges to new ideas for making models more transparent and steerable. 
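
As one illustration of the "useful kludge" end of this spectrum, here is a minimal sketch of training a linear probe on cached intermediate activations to monitor for a behavior of interest. The arrays below are synthetic placeholders standing in for activations and labels collected elsewhere; nothing in the sketch is specific to any particular model or interpretability library.

```python
# Minimal sketch: a linear probe on cached hidden activations as a simple
# behavior monitor. The data below is synthetic; in practice `activations`
# would be hidden states cached from a real model and `labels` would mark
# the behavior of interest (e.g., contexts where the model behaved deceptively).

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 512))   # placeholder activation vectors
labels = rng.integers(0, 2, size=1000)       # placeholder binary behavior labels

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out probe accuracy:", probe.score(X_test, y_test))
```

A probe like this requires no understanding of the model's learned mechanisms; the more ambitious end of the spectrum aims for monitors and interventions grounded in an actual decomposition of those mechanisms.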

Trust from first principles

We trust nuclear power plants and orbital rockets through validated theories that are principled and mechanistic, rather than through direct trial-and-error. We would benefit from similarly systematic, principled approaches to understanding and predicting AI behavior. One approach to this is model transparency, as in the previous cluster. But understanding may not be a necessary condition: this cluster aims to get the safety and trust benefits of interpretability without humans having to understand any specific AI model in all its details. 

Alternative approaches to mitigating AI risks

These research areas lie outside the scope of the clusters above.

Research Areas

*Jailbreaks and unintentional misalignment

*Control evaluations

*Backdoors and other alignment stress tests

*Alternatives to adversarial training

Robust unlearning

*Experiments on alignment faking

*Encoded reasoning in CoT and inter-model communication

Black-box LLM “psychology”

Evaluating whether models can hide dangerous behaviors

Reward hacking of human oversight

*Applications of white-box techniques

Activation monitoring

Finding feature representations

Toy models for interpretability

Externalizing reasoning

Interpretability benchmarks

†More transparent architectures

White-box estimation of rare misbehavior

Theoretical study of inductive biases

†Conceptual clarity about risks from powerful AI

†New moonshots for aligning superintelligence
