MATS Applications + Research Directions I'm Currently Excited About

post by Neel Nanda (neel-nanda-1) · 2025-02-06T11:03:40.093Z · LW · GW · 7 comments

Contents

  Understanding thinking models
  Sparse Autoencoders
  Model diffing
  Understanding sophisticated/safety relevant behaviour
  Being useful
  Investigate fundamental assumptions

I've just opened summer MATS applications (where I'll supervise people to write mech interp papers). I'd love to get applications from any readers who are interested! Apply here; applications are due Feb 28.

As part of this, I wrote up a list of research areas I'm currently excited about, along with thoughts on promising directions within each. This might be of wider interest, so I've copied it in below:

Understanding thinking models

E.g. o1, r1, Gemini Flash Thinking, etc - i.e. models that produce a really long chain of thought when reasoning through complex problems, and seem to be much more capable as a result. These seem like a big deal, and we understand so little about them! And now that we have small thinking models like the r1-distilled Qwen 1.5B, they seem quite tractable to study (though larger distilled versions of r1 will be better; I doubt you need full r1).

Sparse Autoencoders

In previous rounds I was predominantly interested in Sparse Autoencoder projects, but I’m comparatively less excited about SAEs now - I still think they’re cool, and am happy to get SAE applications/supervise SAE projects, but think they’re unlikely to be a silver bullet and expect to diversify my projects a bit more (I’ll hopefully write more on my overall takes soon).

Within SAEs, I’m most excited about:

Model diffing

What happens to a model during finetuning? If we have both the original and tuned model, can we somehow take the “diff” between the two to just interpret what changed during finetuning?
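As a toy illustration of the simplest possible diffing pass, one could start by asking which parameters moved most during finetuning. The sketch below is hypothetical (the function name and toy "models" are made up, and plain Python lists stand in for weight tensors); a real version would iterate over the two models' state dicts:

```python
import math

def weight_diff_norms(base_weights, tuned_weights):
    """Relative L2 norm of (tuned - base) for each named weight vector.

    A crude first-pass "diff": parameters with a large relative change
    are candidates for closer interpretability study.
    """
    diffs = {}
    for name, w_base in base_weights.items():
        w_tuned = tuned_weights[name]
        change = math.sqrt(sum((t - b) ** 2 for b, t in zip(w_base, w_tuned)))
        scale = math.sqrt(sum(b * b for b in w_base)) + 1e-8
        diffs[name] = change / scale
    return diffs

# Toy "models": finetuning doubled layer_1 and left layer_0 untouched.
base = {"layer_0": [1.0] * 16, "layer_1": [1.0] * 16}
tuned = {"layer_0": [1.0] * 16, "layer_1": [2.0] * 16}
norms = weight_diff_norms(base, tuned)
```

Of course, raw weight diffs are only a starting point - much of the interesting action is likely in how activations or SAE latents change, not just which matrices moved.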

Understanding sophisticated/safety relevant behaviour

LLMs are getting good enough that they are starting to directly demonstrate some alignment-relevant behaviours. Most interpretability work tries to advance the field in general by studying arbitrary, often toy, problems, but I'd be very excited to study these phenomena directly!

Being useful

Interpretability is often pretty abstract, pursuing lofty blue-skies goals, and it's hard to tell if your work is total BS or not. I'm excited about projects that take a real task, one that can be defined without ever referencing interpretability, and try to beat non-interp baselines in a fair(ish) fight - if you can do this, it's strong evidence you've learned *something real*.

Investigate fundamental assumptions

There are a lot of assumptions behind common mechanistic interpretability work, both scientific assumptions and theory-of-change assumptions, that in my opinion have insufficient evidence. I'd be keen to gather evidence for and against!


  1. I favour the term latent over feature, because feature also refers to the subtly but importantly different concept of “the interpretable concept”, which an SAE “feature” imperfectly corresponds to, and it’s very confusing for it to mean both. ↩︎

7 comments

Comments sorted by top scores.

comment by Aryaman Arora (aryaman-arora) · 2025-02-10T08:53:43.682Z · LW(p) · GW(p)

Very useful list Neel!! Thanks for mentioning AxBench, but unfortunately we don't own the domain you linked to 😅 the actual link is https://github.com/stanfordnlp/axbench

comment by eggsyntax · 2025-02-07T21:42:42.238Z · LW(p) · GW(p)

> with a bunch of reflexes to eg stop and say “that doesn’t sound right” or “I think I’ve gone wrong, let’s backtrack and try another path”

Shannon Sands says he's found a backtracking vector in R1:

https://x.com/chrisbarber/status/1885047105741611507

comment by Keenan Pepper (keenan-pepper) · 2025-02-28T00:29:48.064Z · LW(p) · GW(p)

As far as you're aware, is there any autointerp work that's based on actively steering (boosting/suppressing) the latent to be labeled and generating completions, rather than searching a dataset for activating examples?
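[For concreteness, the loop being asked about might look roughly like the sketch below. All names are hypothetical, and a stub stands in for the model; a real implementation would add `scale * decoder_direction[latent_id]` to the residual stream via a forward hook, then have a judge model label the contrasting completions.]

```python
def autointerp_by_steering(generate, latent_id, prompts, scales=(-5.0, 0.0, 5.0)):
    """Collect completions while boosting/suppressing one SAE latent.

    `generate(prompt, latent_id, scale)` is a stand-in for a steered model
    call. Contrasting completions across scales is the raw material a judge
    model would then turn into a latent description.
    """
    samples = {}
    for scale in scales:
        samples[scale] = [generate(p, latent_id, scale) for p in prompts]
    return samples

# Toy stand-in "model": boosting latent 7 pushes completions toward "ocean".
def toy_generate(prompt, latent_id, scale):
    suffix = " ocean" * max(0, int(scale)) if latent_id == 7 else ""
    return prompt + suffix

out = autointerp_by_steering(toy_generate, latent_id=7, prompts=["The view:"])
```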

Replies from: neel-nanda-1, keenan-pepper
comment by Neel Nanda (neel-nanda-1) · 2025-02-28T05:51:27.078Z · LW(p) · GW(p)

There probably is, but I can't think of anything immediately.

comment by Keenan Pepper (keenan-pepper) · 2025-02-28T00:33:06.708Z · LW(p) · GW(p)

Hmm, there is a related thing called "intervention scoring" ( https://arxiv.org/abs/2410.13928 ) but this appears to be only for scoring the descriptions produced by the traditional method, not using interventions to generate the descriptions in the first place.

comment by ethanelasky · 2025-02-16T15:25:15.678Z · LW(p) · GW(p)

Hey, just an FYI -- The TinyUrl link is broken.

Replies from: neel-nanda-1
comment by Neel Nanda (neel-nanda-1) · 2025-02-17T04:28:00.905Z · LW(p) · GW(p)

Huh, seems to be working for me. What do you see when you click on it?

tinyurl.com/neel-mats-app