The GDM AGI Safety+Alignment Team is Hiring for Applied Interpretability Research

post by Arthur Conmy (arthur-conmy), Neel Nanda (neel-nanda-1) · 2025-02-24

Contents

  1. What is Applied Interpretability?
  2. Specific projects we're interested in working on
  FAQ
    What’s the relationship between applied interpretability and Neel’s mechanistic interpretability team?
    How much autonomy will I have?
    Why do applied interpretability rather than fundamental research?
    What makes someone a good fit for the role?
    I’ve heard that Google infra can be pretty slow and bad
    Can I publish?
    Does probing really count as interpretability?

TL;DR: The Google DeepMind AGI Safety team is hiring for Applied Interpretability research scientists and engineers. Applied Interpretability is a new subteam we are forming to focus on directly using model internals-based techniques to make models safer in production. Achieving this goal will require doing research on the critical path that enables interpretability methods to be more widely used for practical problems. We believe this has significant direct and indirect benefits for preventing AGI x-risk, as we argue below. Our ideal candidate has experience with ML engineering and some hands-on experience with language model interpretability research. To apply for this role (as well as other open AGI Safety and Gemini Safety roles), follow the links for Research Engineers here & Research Scientists here.

1. What is Applied Interpretability?

At a high level, the goal of the applied interpretability team is to make model internals-based methods a standard tool for making production LLMs safer. As interpretability research progresses, we expect to understand more and more about what’s happening inside these systems, and we believe that the more we understand, the safer we’ll be. We further believe that some model internals-based methods (e.g. probing) have the potential to be directly useful today, and that demonstrating this will inspire more work in this area. Finally, we believe it is important for interpretability research to be grounded in feedback from reality, and that this kind of pragmatic work may get us closer to being able to do more ambitious forms of interpretability.

A core part of the team’s philosophy will be pragmatic empiricism: we will use whatever techniques are best suited for the problem at hand (e.g. trying simple techniques like probing over fancy ones like SAEs), and carefully test these against rigorous baselines.[1] If our internals-based techniques don’t beat baselines like changing the system prompt, we should work on a different problem, and highlight the key limitations of interpretability in such settings!
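
For concreteness, here is a minimal sketch (not production code) of the kind of probe-vs-baseline comparison we have in mind, assuming residual-stream activations and binary labels have already been cached offline from some model; the file names and label scheme are placeholders:

```python
# Minimal sketch of a probe-vs-baseline workflow. Assumes residual-stream
# activations and binary labels have already been cached offline; the file
# names here are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

acts = np.load("cached_activations.npy")  # (n_prompts, d_model)
labels = np.load("labels.npy")            # (n_prompts,) 1 = behaviour of interest

X_train, X_test, y_train, y_test = train_test_split(
    acts, labels, test_size=0.2, random_state=0
)

# The internals-based method: a simple linear probe on activations.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probe_auc = roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1])
print(f"Linear probe AUROC: {probe_auc:.3f}")

# A fair evaluation would also score non-internals baselines (e.g. a prompted
# classifier or a fine-tuned classification head) on the same held-out split,
# and weigh accuracy against cost and side effects before deciding what to use.
```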

If you are hired, we expect half of your time to be spent on novel research and half on implementing existing methods, though this will vary a lot. Part of the team’s mandate will be working with other GDM teams, especially the Gemini Safety and AGI Safety teams, to have direct impact with model internals work today, e.g. by helping Frontier Safety[2] make probes to monitor whether a deployed system is being used to help with cyberoffensive capabilities, as part of a defense-in-depth approach. Another part will be finding real problems where model internals could plausibly help, and doing the research required to figure out how (see example problems that we’re interested in below). Hopefully, in future, the rest of the mechanistic interpretability team will find effective techniques which applied interpretability can then figure out how to convert into something production-ready – we won’t have any impact if our techniques don’t get used. But we also expect applied interpretability to directly contribute real research insights – working on real problems is a great way to stay grounded and get feedback on whether your techniques are working.
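
To make the “probes as one layer of defense in depth” idea concrete, here is a rough sketch (not Gemini serving code) of how a trained probe might be applied at inference time; the probe parameters, file names, and threshold are hypothetical, and would come from an offline training run like the one sketched above:

```python
# Rough sketch of using a trained linear probe as one defense-in-depth signal
# at inference time. Probe parameters, file names, and threshold are hypothetical.
import numpy as np

PROBE_DIRECTION = np.load("probe_direction.npy")  # (d_model,)
PROBE_BIAS = float(np.load("probe_bias.npy"))

def probe_score(activation: np.ndarray) -> float:
    """Sigmoid of the linear probe's logit for one activation vector."""
    logit = float(activation @ PROBE_DIRECTION + PROBE_BIAS)
    return 1.0 / (1.0 + np.exp(-logit))

def should_flag(activation: np.ndarray, threshold: float = 0.9) -> bool:
    """Flag a request for further scrutiny when the probe fires.
    In a real deployment this would be one signal among several
    (prompt-level classifiers, rate limits, human review), not a
    standalone gate."""
    return probe_score(activation) > threshold
```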

As a proof of concept, Arthur recently worked on a project with Gemini Safety implementing model internals-based mitigations for violations of present-day safety policies in production models. Though these are lower-stakes safety issues, we think this work was highly impactful even if you only care about high-stakes issues: it set a precedent for future model internals work to happen at GDM, which will make future projects much easier. In particular, it involved solving a bunch of gnarly infrastructure issues to make model internals work possible in the highly optimized inference code, and this infra can now be easily reused for future projects, significantly lowering their costs.
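
The details above are specific to Gemini’s highly optimized inference stack, but the generic shape of the task (exposing intermediate activations from inference code so a probe can consume them) is easy to illustrate on an open-source model. A rough PyTorch sketch, not our actual implementation:

```python
# Capture residual-stream activations from an open-source model (GPT-2)
# via a forward hook, as a stand-in for the real inference-time plumbing.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

captured = {}

def save_residual(module, inputs, output):
    # For GPT-2 blocks, output[0] is the hidden state: (batch, seq, d_model).
    captured["resid"] = output[0].detach()

hook = model.transformer.h[6].register_forward_hook(save_residual)  # a mid layer

inputs = tokenizer("Example request to monitor", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
hook.remove()

print(captured["resid"].shape)  # feed this to a probe, cache it for training, etc.
```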

2. Specific projects we're interested in working on

Some specific projects we're interested in working on include:

Other promising directions we are potentially interested in:

FAQ

What’s the relationship between applied interpretability and Neel’s mechanistic interpretability team?

You can think of this as a subteam of Neel’s team, run by Arthur. You will formally report to Neel, but Arthur will run applied interpretability day-to-day, with Neel as a highly involved advisor. By default, the team's strategic direction will be set by Arthur, Neel and Rohin Shah (AGI Safety lead), with Arthur typically making the final call, but we expect any hire we make to also have a lot of input into the team's direction.

How much autonomy will I have?

We should be clear up front that you’d have much less autonomy than you would in a typical PhD. Initially, the team would just be you and Arthur. Given that size, the team would be expected to act as a unit and take on projects together. On the other hand, the small size also means you would have a lot of input into the team’s direction and strategy. We want to hire people with good takes! The mechanistic interpretability team is willing to significantly change direction based on new evidence and compelling arguments. For example, we have recently de-prioritised sparse autoencoder research due to some disappointing recent results, and the entire idea of applied interpretability is a pivot from the team’s previous standard approach.

You would be expected to stick within the team’s mandate of “do things focused on using model internals in production”. This may sometimes look like being asked to apply known techniques on problems if there’s a short-term need, where the main goal is just to have direct impact. However, whether to take on such projects will be determined by what we (Arthur, Neel and Rohin Shah) think is impactful for AGI Safety, rather than e.g. commercial incentives.

We expect this role to be a good fit if you’re impact motivated and expect to broadly agree with Rohin, Neel and Arthur on what’s impactful (constructive disagreement is extremely welcome though!). If you strongly value autonomy and have idiosyncratic opinions about what research you want to do, this is likely not the role for you. And while we expect there to be a lot of interesting research questions, working with frontier systems can be messy and frustrating, and the role may sometimes involve tedious engineering work, so make sure that’s something you can live with!

Why do applied interpretability rather than fundamental research?

A key reason we are excited about Applied Interpretability is that testing techniques on real problems gives strong evidence about which interpretability tools are actually helpful (see Stephen Casper’s argument here). To restate this key motivation in the context of this post:

(Note that this post distinguishes between fundamental and applied mechanistic interpretability work, and argues for doing more fundamental work. However, most “applied mechanistic interpretability” work in that post would not fall under the set of tasks we are trying to work on as part of our Applied Interpretability efforts.)

What makes someone a good fit for the role?

I’ve heard that Google infra can be pretty slow and bad

See here for discussion of this point. In addition, Arthur has worked on a bunch of infrastructure related to how Gemini is used in production, and is happy to handle the particularly messy parts. You will also have the support of other engineers on the AGI Safety and Alignment team, who often help each other out with a wide range of difficulties.

Can I publish?

Some parts of work would likely touch on Gemini details that could not be published, but we don't think this is a serious downside of this opportunity:

  1. There is strong support for publishing interpretability research at GDM, and often work can be published by removing the sensitive details, e.g. replicating it on an open source model.
  2. GDM is highly collaborative (e.g. Arthur has benefitted from this immensely), and you'd certainly be able to spend some time working with the rest of the Mechanistic Interpretability team, and being a co-author on their papers.

Does probing really count as interpretability?

A reasonable objection to the title “Applied Interpretability” is that some techniques we’ve discussed in this post, such as probing, aren’t doing anything like translating model representations into terms which humans can understand. (Most definitions of interpretability reference some form of translation from AI concepts into human concepts.)

Our response: we don’t want to pen ourselves in to working solely on interpretability-according-to-some-definition, and we are quite happy to work on techniques in the broader category of methods that use a model’s weights and activations (the area where we have the most experience, and which we think is the most neglected relative to other related approaches). Recall also:

> If our internals-based techniques don’t beat baselines like changing the system prompt, we should work on a different problem, and highlight the key limitations of interpretability in such settings!

From the main post.

  1. ^

     Taking into account the costs of the baselines - e.g. probing has no side effects on model behavior, while fine-tuning a LoRA to create a classification head does, so probing may be the superior solution even if its accuracy is lower, depending on the use case.

  2. ^

     The team responsible for making and implementing our Frontier Safety Framework, including e.g. doing dangerous capability evals.
