The GDM AGI Safety+Alignment Team is Hiring for Applied Interpretability Research
post by Arthur Conmy (arthur-conmy), Neel Nanda (neel-nanda-1) · 2025-02-24
TL;DR: The Google DeepMind AGI Safety team is hiring for Applied Interpretability research scientists and engineers. Applied Interpretability is a new subteam we are forming to focus on directly using model internals-based techniques to make models safer in production. Achieving this goal will require doing research on the critical path that enables interpretability methods to be more widely used for practical problems. We believe this has significant direct and indirect benefits for preventing AGI x-risk, as we argue below. Our ideal candidate has experience with ML engineering and some hands-on experience with language model interpretability research. To apply for this role (as well as other open AGI Safety and Gemini Safety roles), follow the links for Research Engineers here & Research Scientists here.
1. What is Applied Interpretability?
At a high level, the goal of the applied interpretability team is to make model internals-based methods a standard tool for making production LLMs safer. We believe that as interpretability research progresses, the more we understand about what's happening inside these systems, the safer we'll be. We further believe that some model internals-based methods (e.g. probing) have the potential to be directly useful today, and that demonstrating this will inspire more work in this area. Finally, we believe it is important for interpretability research to be grounded in feedback from reality, and that this kind of pragmatic work may bring us closer to being able to do more ambitious forms of interpretability.
A core part of the team’s philosophy will be pragmatic empiricism: we will use whatever techniques are best suited for the problem at hand (e.g. trying simple techniques like probing over fancy ones like SAEs), and carefully test these against rigorous baselines.[1] If our internals-based techniques don’t beat baselines like changing the system prompt, we should work on a different problem, and highlight the key limitations of interpretability in such settings!
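To make "simple techniques like probing" concrete, here is a minimal sketch of how a linear probe on a model's activations might be trained and evaluated on the same metric as a black-box baseline. The `get_activations` helper, layer choice, and data shapes are hypothetical placeholders for illustration, not GDM code or tooling.

```python
# Minimal sketch: train a linear probe on residual-stream activations and
# score it with the same metric used for a baseline classifier.
# `get_activations(prompts, layer)` is a hypothetical helper returning one
# activation vector per prompt.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def train_probe(train_acts: np.ndarray, train_labels: np.ndarray) -> LogisticRegression:
    """Fit a logistic-regression probe on [n_examples, d_model] activations."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_acts, train_labels)
    return probe

def evaluate(probe: LogisticRegression, test_acts: np.ndarray, test_labels: np.ndarray) -> float:
    """Report AUROC so the probe can be compared head-to-head with a baseline."""
    scores = probe.predict_proba(test_acts)[:, 1]
    return roc_auc_score(test_labels, scores)

# Hypothetical usage:
#   acts = get_activations(prompts, layer=20)          # [n, d_model]
#   probe = train_probe(acts[:800], labels[:800])
#   print("probe AUROC:", evaluate(probe, acts[800:], labels[800:]))
# Computing the same AUROC for a prompt-based or black-box baseline lets us
# check whether the internals-based method actually beats it.
```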
If you are hired, we expect half of your time to be spent on novel research and half on implementing existing methods, though this will vary a lot. Part of the team’s mandate will be working with other GDM teams, especially the Gemini Safety and AGI Safety teams, to try to directly have impact with model internals work today, e.g. by helping Frontier Safety make probes to monitor whether a deployed system is being used to help with cyberoffensive capabilities, as part of a defense in depth approach. Another part will be finding real problems where model internals could plausibly help and doing the research required to figure out how (see example problems that we’re interested in below). Hopefully, in future, the rest of the mechanistic interpretability team will find effective techniques which applied interpretability can then figure out how to convert into something production ready – we won’t have any impact if our techniques don’t get used. But we also expect applied interpretability to directly contribute real research insights – working on real problems is a great way to remain grounded and get feedback on whether your techniques are working or not.
As precedent, Arthur recently worked on a project with Gemini Safety implementing model internals based mitigations for violations of present-day safety policies in production models. Though these are lower stakes safety issues, we think that this work was highly impactful even if you only care about high stakes issues, by setting the precedent for future model internals work to happen at GDM, which will make future projects much easier. In particular, this involved solving a bunch of gnarly infrastructure issues in order to make model internals work possible in the highly optimized inference code, and this infra can now be easily used for future projects, significantly lowering costs.
2. Specific projects we're interested in working on
Some specific projects we're interested in working on include:
- As discussed above, GDM’s Frontier Safety team[2] is working on building mitigations against misuse risk (particularly from the cyberoffensive capabilities models will have), with both black-box monitors and probes. We will likely implement and test the probes for this effort.
- Further, the Frontier Safety team is also working on monitoring for egregious misalignment, such as misaligned reasoning stated in plain sight in the Chain-of-Thought of thinking models. Again, we would want to test probes or other white-box monitoring approaches in this setting, for example as a cheaper detection method (see the sketch after this list).
- We would also consider working on monitoring when the Chain-of-Thought is entirely in latent space: scoping out whether this is feasible at all, or providing evidence that it needs to be addressed by other strategies.
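As a rough illustration of what a white-box monitor of this kind could look like, here is a minimal sketch of applying a fixed linear probe to per-token activations during generation and flagging transcripts whose maximum probe score crosses a threshold. The probe parameters and the activation-extraction step are assumptions for illustration; real deployment infrastructure would look different.

```python
# Minimal sketch of a fixed linear probe used as a white-box monitor during
# generation. `token_acts` would come from an activation hook at a chosen
# layer (hypothetical); `probe_direction` and `bias` are a previously
# trained probe's parameters.
import numpy as np

def monitor_generation(token_acts: np.ndarray,
                       probe_direction: np.ndarray,
                       bias: float,
                       threshold: float = 0.9) -> bool:
    """Return True if any token's probe score exceeds the alert threshold.

    token_acts: [n_tokens, d_model] activations from one layer.
    """
    logits = token_acts @ probe_direction + bias
    scores = 1.0 / (1.0 + np.exp(-logits))   # per-token probabilities
    return bool(scores.max() > threshold)

# A monitor like this is cheap (one dot product per token), so it could run
# on every request, with flagged transcripts escalated to a more expensive
# black-box classifier as part of a defense-in-depth pipeline.
```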
Other promising directions we are potentially interested in:
- Understanding Unfaithful Chain-of-Thought
- In upcoming work from MATS scholars Ivan, Jett and Rob (supervised by Arthur and Neel), we find evidence for new forms of unfaithful Chain-of-Thought on normal prompts, without needing to design the prompts to produce unfaithfulness. We are excited about developing monitors for this behavior and about incentivizing models not to produce it.
- Training Data Attribution
- There is relevant prior work from other GDM Interpretability teams on training data attribution, i.e. identifying the training data points that most influenced a given model behavior (and other prior work too)
- We would hope to collaborate with other GDM teams working on this technique
- Many shortcomings of language models stem from issues with subsets of their training data, so directly applying these tools to novel problems seems promising and, from our perspective, under-explored
- Insofar as tooling in this area needs improvement, we think ideas from mechanistic interpretability are likely to help improve tool performance
- Model Diffing
- A lot of interpretability studies how a single model behaves, but many safety problems can be framed as cases where the pretrained model poses comparatively little risk while the finetuned thinking or agent model poses comparatively more. We’re relatively pessimistic about the sparse autoencoder variants studied so far, but think that looking at which model components have changed a lot, or at when representations change (as could be measured by e.g. a fixed probe), is a promising approach here (a rough sketch follows this list).
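The sketch below illustrates both of these model-diffing heuristics: comparing per-component weight changes between two checkpoints, and measuring how much a fixed probe's scores shift on the same prompts. The state dicts, activations, and probe direction are hypothetical inputs for illustration, not GDM tooling.

```python
# Minimal sketch of two model-diffing heuristics, assuming two checkpoints
# with matching parameter names (hypothetical setup).
import torch

def weight_change_by_component(base_sd: dict, tuned_sd: dict) -> dict:
    """Relative Frobenius norm of the weight delta for each parameter tensor.

    Large values flag the components that finetuning changed the most.
    """
    return {
        name: (tuned_sd[name] - base_sd[name]).norm().item()
              / (base_sd[name].norm().item() + 1e-8)
        for name in base_sd
        if name in tuned_sd and base_sd[name].shape == tuned_sd[name].shape
    }

def probe_score_shift(base_acts: torch.Tensor,
                      tuned_acts: torch.Tensor,
                      probe_direction: torch.Tensor) -> float:
    """Mean shift of a fixed probe's scores between the two models' activations
    on the same prompts: a cheap proxy for 'did this representation move?'."""
    base_scores = base_acts @ probe_direction
    tuned_scores = tuned_acts @ probe_direction
    return (tuned_scores - base_scores).mean().item()
```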
FAQ
What’s the relationship between applied interpretability and Neel’s mechanistic interpretability team?
You can think of this as a subteam of Neel’s team, run by Arthur. You will formally report to Neel, but Arthur will run applied interpretability day-to-day, with Neel as a highly involved advisor. By default, the team's strategic direction will be set by Arthur, Neel and Rohin Shah (AGI Safety lead), with Arthur typically making the final call, but we expect any hire we make to also have a lot of input into the team's direction.
How much autonomy will I have?
We should be clear up front that you’d have much less autonomy than in a typical PhD. Initially, the team would just be you and Arthur. Given that size, the team would be expected to act as a unit and take on projects together; the flip side is that you would have a lot of input into the team's direction and strategy. We want to hire people with good takes! The mechanistic interpretability team is willing to significantly change direction based on new evidence and compelling arguments. For example, we have recently de-prioritised sparse autoencoder research due to some disappointing recent results, and the entire idea of applied interpretability is a pivot from the team’s previous standard approach.
You would be expected to stick within the team’s mandate of “do things focused on using model internals in production”. This may sometimes look like being asked to apply known techniques to problems where there’s a short-term need, with the main goal simply being direct impact. However, whether to take on such projects will be determined by what we (Arthur, Neel and Rohin Shah) think is impactful for AGI Safety, rather than e.g. commercial incentives.
We expect this role to be a good fit if you’re impact motivated and expect to broadly agree with Rohin, Neel and Arthur on what’s impactful (constructive disagreement is extremely welcome though!). If you strongly value autonomy and have idiosyncratic opinions about what research you want to do, this is likely not the role for you. And while we expect there to be a lot of interesting research questions, working with frontier systems can be messy and frustrating, and the role may sometimes involve tedious engineering work, so make sure that’s something you can live with!
Why do applied interpretability rather than fundamental research?
A key exciting part of Applied Interpretability to us is that testing techniques on real problems gives strong evidence about which interpretability tools are helpful (see Stephen Casper’s argument here [? · GW]). To restate this key motivation in the context of this post:
- A large amount of (mechanistic) interpretability research creates new methods for understanding the internals of models, whether through data, circuits, representations, or control.
- However, since extremely few techniques developed in the interpretability field are being actively used in production, it can be difficult to know whether these methods are actually helpful.
- We want to provide feedback on how well various interpretability methods are working, so we can route this feedback back into the process of building better tools.
- Since we're on the mechanistic interpretability team at GDM, and we have an especially close connection to the mechanistic interpretability community through e.g. MATS, we can make research happen based on our findings and reprioritise some of the community’s work.
(Note that this post [AF · GW] distinguishes between fundamental and applied mechanistic interpretability work, and argues for doing more fundamental work. However, most “applied mechanistic interpretability” work in that post would not fall under the set of tasks we are trying to work on as part of our Applied Interpretability efforts)
What makes someone a good fit for the role?
Someone is a good fit for the role if they:
- Can hit the ground running doing useful work making GDM’s models safer in prod
- We want to be fast-paced, and GDM infra is very large-scale, so engineering speed and excellence are highly desirable
- Are capable of making step-change improvements in the quality of our interpretability methods
- People who can make large jumps on measures that matter do amazing things for the success of projects
- Focus on results
- Sometimes, simple methods just work, and we need to be aware of this and update when this happens
I’ve heard that Google infra can be pretty slow and bad
See here for discussion of this point [AF · GW]. In addition to this, Arthur has worked on a bunch of infrastructure related to how Gemini is used in production, and is happy to handle particularly messy parts. You will also have the support of other engineers on the AGI safety and alignment team, who often help each other out on a wide range of difficulties.
Can I publish?
Some parts of work would likely touch on Gemini details that could not be published, but we don't think this is a serious downside of this opportunity:
- There is strong support for publishing interpretability research at GDM, and often work can be published by removing the sensitive details, e.g. replicating it on an open source model.
- GDM is highly collaborative (e.g. Arthur has benefitted from this immensely), and you'd certainly be able to spend some time working with the rest of the Mechanistic Interpretability team, and being a co-author on their papers.
Does probing really count as interpretability?
A reasonable objection to the title “Applied Interpretability” is that some techniques we’ve discussed in this post, such as probing, aren’t doing anything like translating model representations into terms that humans can understand (most definitions of interpretability reference translating AI concepts into human concepts).
Our response: we don’t want to pen ourselves into solely working on interpretability-according-to-some-definition, and we are quite happy to work on techniques in the broader category of methods that use models’ weights and activations (the area we have the most experience in, and which we think is the most neglected relative to related baselines). Recall also:
> If our internals-based techniques don’t beat baselines like changing the system prompt, we should work on a different problem, and highlight the key limitations of interpretability in addressing that problem!
From the main post.
[1] Taking into account the cost of the baselines - e.g. probing has no side effects on model behavior, while a LoRA to create a classification head will, so probing may be a superior solution even if accuracy is lower, depending on the use case.
[2] The team responsible for making and implementing our Frontier Safety Framework, including e.g. doing dangerous capability evals.