Posts

Evolutionary prompt optimization for SAE feature visualization 2024-11-14T13:06:49.728Z
SAEs are highly dataset dependent: a case study on the refusal direction 2024-11-07T05:22:18.807Z
SAE Probing: What is it good for? Absolutely something! 2024-11-01T19:23:55.418Z
Open Source Replication of Anthropic’s Crosscoder paper for model-diffing 2024-10-27T18:46:21.316Z
SAE features for refusal and sycophancy steering vectors 2024-10-12T14:54:48.022Z
Base LLMs refuse too 2024-09-29T16:04:21.343Z
Showing SAE Latents Are Not Atomic Using Meta-SAEs 2024-08-24T00:56:46.048Z
Calendar feature geometry in GPT-2 layer 8 residual stream SAEs 2024-08-17T01:16:53.764Z
Extracting SAE task features for in-context learning 2024-08-12T20:34:13.747Z
Self-explaining SAE features 2024-08-05T22:20:36.041Z
BatchTopK: A Simple Improvement for TopK-SAEs 2024-07-20T02:20:51.848Z
JumpReLU SAEs + Early Access to Gemma 2 SAEs 2024-07-19T16:10:54.664Z
SAEs (usually) Transfer Between Base and Chat Models 2024-07-18T10:29:46.138Z
Stitching SAEs of different sizes 2024-07-13T17:19:20.506Z
Neel Nanda's Shortform 2024-07-12T07:16:31.097Z
An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 2024-07-07T17:39:35.064Z
Attention Output SAEs Improve Circuit Analysis 2024-06-21T12:56:07.969Z
SAEs Discover Meaningful Features in the IOI Task 2024-06-05T23:48:04.808Z
Mechanistic Interpretability Workshop Happening at ICML 2024! 2024-05-03T01:18:26.936Z
Transcoders enable fine-grained interpretable circuit analysis for language models 2024-04-30T17:58:09.982Z
Refusal in LLMs is mediated by a single direction 2024-04-27T11:13:06.235Z
Improving Dictionary Learning with Gated Sparse Autoencoders 2024-04-25T18:43:47.003Z
How to use and interpret activation patching 2024-04-24T08:35:00.857Z
[Full Post] Progress Update #1 from the GDM Mech Interp Team 2024-04-19T19:06:59.185Z
[Summary] Progress Update #1 from the GDM Mech Interp Team 2024-04-19T19:06:17.755Z
AtP*: An efficient and scalable method for localizing LLM behaviour to components 2024-03-18T17:28:37.513Z
Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems 2024-03-13T17:09:17.027Z
We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To 2024-03-06T05:03:09.639Z
Attention SAEs Scale to GPT-2 Small 2024-02-03T06:50:22.583Z
Sparse Autoencoders Work on Attention Layer Outputs 2024-01-16T00:26:14.767Z
Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization 2024-01-14T02:06:00.290Z
Fact Finding: Do Early Layers Specialise in Local Processing? (Post 5) 2023-12-23T02:46:25.892Z
Fact Finding: How to Think About Interpreting Memorisation (Post 4) 2023-12-23T02:46:16.675Z
Fact Finding: Trying to Mechanistically Understanding Early MLPs (Post 3) 2023-12-23T02:46:05.517Z
Fact Finding: Simplifying the Circuit (Post 2) 2023-12-23T02:45:49.675Z
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1) 2023-12-23T02:44:24.270Z
How useful is mechanistic interpretability? 2023-12-01T02:54:53.488Z
Open Source Replication & Commentary on Anthropic's Dictionary Learning Paper 2023-10-23T22:38:33.951Z
[Paper] All's Fair In Love And Love: Copy Suppression in GPT-2 Small 2023-10-13T18:32:02.376Z
Paper Walkthrough: Automated Circuit Discovery with Arthur Conmy 2023-08-29T22:07:04.059Z
An Interpretability Illusion for Activation Patching of Arbitrary Subspaces 2023-08-29T01:04:18.688Z
Mech Interp Puzzle 2: Word2Vec Style Embeddings 2023-07-28T00:50:00.297Z
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla 2023-07-20T10:50:58.611Z
Tiny Mech Interp Projects: Emergent Positional Embeddings of Words 2023-07-18T21:24:41.990Z
Mech Interp Puzzle 1: Suspiciously Similar Embeddings in GPT-Neo 2023-07-16T22:02:15.410Z
How to Think About Activation Patching 2023-06-04T14:17:42.264Z
Finding Neurons in a Haystack: Case Studies with Sparse Probing 2023-05-03T13:30:30.836Z
Identifying semantic neurons, mechanistic circuits & interpretability web apps 2023-04-13T11:59:51.629Z
Othello-GPT: Reflections on the Research Process 2023-03-29T22:13:42.007Z
Othello-GPT: Future Work I Am Excited About 2023-03-29T22:13:26.823Z

Comments

Comment by Neel Nanda (neel-nanda-1) on Open Source Replication of Anthropic’s Crosscoder paper for model-diffing · 2024-11-18T22:00:58.518Z · LW · GW

It's essentially training an SAE on the concatenation of the residual streams from the base model and the chat model. So, for each prompt, you run it through the base model to get a residual stream vector v_b and through the chat model to get a residual stream vector v_c, then concatenate these to get a vector twice as long and train an SAE on that (with some minor additional details that I'm not getting into).
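For concreteness, here's a minimal sketch of that setup; the dimensions, the toy SAE architecture, and all variable names are illustrative assumptions, not the actual training code:

    import torch
    import torch.nn as nn

    d_model, d_sae = 512, 4096  # hypothetical residual stream width and dictionary size

    class SAE(nn.Module):
        def __init__(self, d_in, d_hidden):
            super().__init__()
            self.enc = nn.Linear(d_in, d_hidden)
            self.dec = nn.Linear(d_hidden, d_in)

        def forward(self, x):
            acts = torch.relu(self.enc(x))  # sparse latent activations
            return self.dec(acts), acts

    sae = SAE(2 * d_model, d_sae)  # the input is twice as long: base ++ chat

    # Stand-ins for the residual stream vectors you'd actually collect per prompt
    v_b = torch.randn(8, d_model)  # from the base model
    v_c = torch.randn(8, d_model)  # from the chat model
    x = torch.cat([v_b, v_c], dim=-1)  # concatenate into one vector twice as long

    recon, acts = sae(x)
    loss = (recon - x).pow(2).mean() + 1e-3 * acts.abs().mean()  # reconstruction + L1 sparsity
    loss.backward()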

Comment by Neel Nanda (neel-nanda-1) on Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1) · 2024-10-21T14:40:54.408Z · LW · GW

This is somewhat similar to the approach of the ROME paper, which has been shown to not actually edit facts, but rather to insert louder facts that drown out the old ones and maybe suppress them.

In general, the problem with optimising model behaviour as a localisation technique is that you can't distinguish between something that truly edits the fact and something which adds a new fact in another layer that cancels out the original fact and replaces it with something new.

Comment by Neel Nanda (neel-nanda-1) on How I got 4.2M YouTube views without making a single video · 2024-10-14T20:07:09.723Z · LW · GW

Agreed, chance of success when cold emailing busy people is low, and spamming them is bad. And there are alternate approaches that may work better, depending on the person and their setup - some Youtubers don't have a manager or employees, some do. I also think being able to begin an email with "Hi, I run the DeepMind mechanistic interpretability team" was quite helpful here.

Comment by Neel Nanda (neel-nanda-1) on Mark Xu's Shortform · 2024-10-11T18:20:40.759Z · LW · GW

The high level claim seems pretty true to me. Come to the GDM alignment team, it's great over here! It seems quite important to me that all AGI labs have good safety teams

Thanks for writing the post!

Comment by Neel Nanda (neel-nanda-1) on MichaelDickens's Shortform · 2024-10-04T20:19:53.134Z · LW · GW

Huh, are there examples of right leaning stuff they stopped funding? That's new to me

Comment by Neel Nanda (neel-nanda-1) on Nathan Young's Shortform · 2024-09-24T13:39:27.807Z · LW · GW

+1. Concretely this means converting every probability p into odds p/(1-p), taking the geometric mean of those odds (multiply them and take the nth root), and then converting back to a probability.

Intuition pump: Person A says 0.1 and Person B says 0.9. This is symmetric: if we instead study the negation, they swap places, so any reasonable aggregation should give 0.5.

Geometric mean of probabilities does not; instead you get 0.3.

Arithmetic gets 0.5, but is bad for the other reasons you noted

Geometric mean of odds is sqrt(1/9 * 9) = 1, which maps to a probability of 0.5, while also eg treating low probabilities fairly
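For anyone who wants to check the arithmetic, a small sketch of the aggregation (the helper name is mine):

    import math

    def geo_mean_of_odds(probs):
        odds = [p / (1 - p) for p in probs]       # probability -> odds
        geo = math.prod(odds) ** (1 / len(odds))  # geometric mean of the odds
        return geo / (1 + geo)                    # odds -> probability

    print(geo_mean_of_odds([0.1, 0.9]))   # 0.5, as the symmetry argument requires
    print(geo_mean_of_odds([0.01, 0.9]))  # ~0.23, so an extreme low forecast still has real pull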

Comment by Neel Nanda (neel-nanda-1) on Showing SAE Latents Are Not Atomic Using Meta-SAEs · 2024-09-22T09:47:41.537Z · LW · GW

Interesting thought! I expect there are systematic differences, though it's not quite obvious how. Your example seems pretty plausible to me. Meta SAEs are also more incentivised to learn features which tend to split a lot, I think, as then they're useful for predicting many latents. Though ones that don't split may be useful as they entirely explain a latent that's otherwise hard to explain.

Anyway, we haven't checked yet, but I expect many of the results in this post would look similar for eg sparse linear regression over a smaller SAE's decoder. Re why meta SAEs are interesting at all: they're much cheaper to train than a smaller SAE, and BatchTopK gives you more control over the L0 than you could easily get with sparse linear regression, which are some mild advantages, but you may have a small SAE lying around anyway. I see the interesting point of this post more as "SAE latents are not atomic, as shown by one method, but probably other methods would work well too"

Comment by Neel Nanda (neel-nanda-1) on quetzal_rainbow's Shortform · 2024-09-18T22:02:26.676Z · LW · GW

What's wrong with twitter as an archival source? You can't edit tweets (technically you can edit top level tweets for up to an hour, but this creates a new URL and old links still show the original version). Seems fine to just aesthetically dislike twitter though

Comment by Neel Nanda (neel-nanda-1) on Why I'm bearish on mechanistic interpretability: the shards are not in the network · 2024-09-13T17:38:06.577Z · LW · GW

To me, this model predicts that sparse autoencoders should not find abstract features, because those are shards, and should not be localisable to a direction in activation space on a single token. Do you agree that this is implied?

If so, how do you square that with eg all the abstract features Anthropic found in Claude 3 Sonnet?

Comment by Neel Nanda (neel-nanda-1) on Contra papers claiming superhuman AI forecasting · 2024-09-13T11:38:14.772Z · LW · GW

Thanks for making the correction!

Comment by Neel Nanda (neel-nanda-1) on OpenAI o1 · 2024-09-13T10:26:23.388Z · LW · GW

I expect there are lots of new forms of capabilities elicitation for this kind of model, which their standard framework may not have captured, and which require more time to iterate on

Comment by Neel Nanda (neel-nanda-1) on Contra papers claiming superhuman AI forecasting · 2024-09-12T21:41:40.353Z · LW · GW

Thanks for the post!

sample five random users’ forecasts, score them, and then average

Are you sure this is how their bot works? I read this more as "sample five things from the LLM, and average those predictions". For Metaculus, the crowd is just given to you, right, so it seems crazy to sample users?

Comment by Neel Nanda (neel-nanda-1) on Zach Stein-Perlman's Shortform · 2024-09-08T10:11:02.654Z · LW · GW

Yeah, fair point, disagreement retracted

Comment by Neel Nanda (neel-nanda-1) on Pay Risk Evaluators in Cash, Not Equity · 2024-09-07T23:36:15.060Z · LW · GW

I think this is important to define anyway! (and likely pretty obvious). This would create a lot more friction for someone to take on such a role, though, or to move out of it

Comment by Neel Nanda (neel-nanda-1) on Pay Risk Evaluators in Cash, Not Equity · 2024-09-07T20:44:11.088Z · LW · GW

But only a small fraction work on evaluations, so the increased cost is much smaller than you make out

Comment by Neel Nanda (neel-nanda-1) on Adam Optimizer Causes Privileged Basis in Transformer LM Residual Stream · 2024-09-07T03:25:27.167Z · LW · GW

Cool work! This is the outcome I expected, but I'm glad someone actually went and did it

Comment by Neel Nanda (neel-nanda-1) on How I got 4.2M YouTube views without making a single video · 2024-09-06T18:46:47.518Z · LW · GW

Yeah, if I made an introduction it would ruin the spirit of it!

Comment by Neel Nanda (neel-nanda-1) on Lucius Bushnaq's Shortform · 2024-09-06T10:40:57.802Z · LW · GW

I don't see important differences between that and CE loss delta in the context Lucius is describing

Comment by Neel Nanda (neel-nanda-1) on Lucius Bushnaq's Shortform · 2024-09-06T10:40:04.856Z · LW · GW

This seems true to me, though finding the right scaling curve for models is typically quite hard, so the conversion to effective compute is difficult. I typically use CE loss change, not loss recovered. I think we just don't know how to evaluate SAE quality.
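To pin down the two metrics being contrasted, here's a rough sketch (all names are my assumptions): ce_clean is the model's CE loss as-is, ce_sae is the CE loss with the SAE reconstruction spliced into the residual stream, and ce_zero is the CE loss with that activation zero-ablated.

    def ce_loss_delta(ce_clean, ce_sae):
        # How much worse the model gets when you splice in the SAE reconstruction
        return ce_sae - ce_clean

    def loss_recovered(ce_clean, ce_sae, ce_zero):
        # Fraction of the clean-vs-ablated gap that the SAE recovers
        return (ce_zero - ce_sae) / (ce_zero - ce_clean)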

My personal guess is that SAEs can be a useful interpretability tool despite making a big difference in effective compute, and we should think more in terms of how useful they are for downstream tasks. But I agree this is a real phenomenon that is easy to overlook, and is bad.

Comment by Neel Nanda (neel-nanda-1) on dirk's Shortform · 2024-09-04T13:17:08.851Z · LW · GW

These are LLM-generated labels; there are no "real" labels (because they're expensive!). Especially in our demo: Neuronpedia made them with GPT-3.5, which is kinda dumb.

I mostly think they're much better than nothing, but shouldn't be trusted, and I'm glad our demo makes this apparent to people! I'm excited about work to improve autointerp, though unfortunately the easiest way is to use a better model, which gets expensive

Comment by Neel Nanda (neel-nanda-1) on How I got 4.2M YouTube views without making a single video · 2024-09-03T22:16:05.702Z · LW · GW

I dislike clickbait when it's misleading, or takes a really long time to get to the point (esp if it's then underwhelming). I was fine with this post on that front.

Comment by Neel Nanda (neel-nanda-1) on How I got 4.2M YouTube views without making a single video · 2024-09-03T22:14:26.336Z · LW · GW

Cold emailing Youtubers offering to chat about mechanistic interpretability turns out to be a way, way more effective strategy than I predicted! I'm super excited about that video (and it came out so well!). The video

Comment by Neel Nanda (neel-nanda-1) on Akash's Shortform · 2024-09-03T22:12:41.776Z · LW · GW

This is a fair point. I think Newsom is a very visible and prominent target who has more risk here (I imagine people don't pay that much attention to individual California legislators), it's individually his fault if he doesn't veto, and he wants to be President and thus cares much more about national stuff, while the California legislators were probably annoyed at Pelosi butting into state business.

Comment by Neel Nanda (neel-nanda-1) on Akash's Shortform · 2024-09-03T00:17:13.479Z · LW · GW

My model is basically just "Newsom likely doesn't want to piss off Big Tech or Pelosi, and the incentive to not veto doesn't seem that high, so he seems highly likely to veto, and 50% veto seems super low". My fair is, like, 80% veto, I think?

I'm not that compelled by the base rates argument, because I think the level of controversy over the bill is atypically high, so it's quite out of distribution. Eg I think Pelosi denouncing it is very unusual for a state bill and a pretty big deal

Comment by Neel Nanda (neel-nanda-1) on AI for Bio: State Of The Field · 2024-08-31T14:05:59.729Z · LW · GW

Thanks for taking the time to write this up, I found it a really helpful overview

Comment by Neel Nanda (neel-nanda-1) on Ruby's Quick Takes · 2024-08-31T14:00:18.970Z · LW · GW

I'd be interested! I would also love to see the full answer to why people care about SAEs

Comment by Neel Nanda (neel-nanda-1) on lewis smith's Shortform · 2024-08-31T13:56:50.007Z · LW · GW

But they're not atomic! See eg the phenomenon of feature splitting, and the fact that UMAP finds structure between semantically similar features

(In fairness, atoms are also not very atomic)

Comment by Neel Nanda (neel-nanda-1) on lewis smith's Shortform · 2024-08-30T10:56:24.305Z · LW · GW

Thanks for writing this up Lewis! I'm very happy with this change, I think the term "SAE feature" is kinda sloppy and anti-conducive to clear thinking, and I hope the rest of the field adopts this too.

Comment by Neel Nanda (neel-nanda-1) on One person's worth of mental energy for AI doom aversion jobs. What should I do? · 2024-08-29T19:52:45.221Z · LW · GW

We are finding a bunch of insights about the internal features and circuits inside models that I believe to be true, and developing useful techniques like sparse autoencoders and activation patching that expand the space of what we can do. We're starting to see signs of life of actually doing things with mech interp, though it's early days. I think skepticism is reasonable, and we're still far from actually mattering for alignment, but I feel like the field is making real progress and is far from failed

Comment by Neel Nanda (neel-nanda-1) on Leon Lang's Shortform · 2024-08-29T08:16:11.685Z · LW · GW

My read is that it's aimed much more at explaining alignment concerns to a mainstream audience, and showing that GDM takes them seriously (which I think is great!), than at providing non-trivial details to a LessWrong etc audience

Comment by Neel Nanda (neel-nanda-1) on Would catching your AIs trying to escape convince AI developers to slow down or undeploy? · 2024-08-27T17:57:51.652Z · LW · GW

This could be inserted just as a dramatic end scene reveal, leaving the rest of the movie unaffected.

Note that this was, in fact, a dramatic end scene reveal in M3GAN

Comment by Neel Nanda (neel-nanda-1) on Linch's Shortform · 2024-08-27T12:17:53.660Z · LW · GW

Seems unclear if those are their true beliefs or just the rhetoric they believed would work in DC.

The latter could be perfectly benign - eg you might think that labs need better cyber security to stop eg North Korea getting the weights, but this is also a good idea to stop China getting them, so you focus on the latter when talking to Nat sec people as a form of common ground

Comment by Neel Nanda (neel-nanda-1) on Linch's Shortform · 2024-08-27T12:15:57.964Z · LW · GW

My (maybe wildly off) understanding from several such conversations is that people tend to say:

  • We think that everyone is racing super hard already, so the marginal effect of pushing harder isn't that high
  • Having great models is important to allow Anthropic to push on good policy and do great safety work
  • We have an RSP and take it seriously, so think we're unlikely to directly do harm by making dangerous AI ourselves

China tends not to explicitly come up, though I'm not confident it's not a factor.

(to be clear, the above is my rough understanding from a range of conversations, but I expect there's a diversity of opinions and I may have misunderstood)

Comment by Neel Nanda (neel-nanda-1) on O O's Shortform · 2024-08-27T02:51:13.649Z · LW · GW

I think mech interp, debate and model organism work are notable for currently having no practical applications lol (I am keen to change this for mech interp!)

Comment by Neel Nanda (neel-nanda-1) on O O's Shortform · 2024-08-27T02:50:02.160Z · LW · GW

Yeah, this seems obviously true to me, and exactly how it should be.

Comment by Neel Nanda (neel-nanda-1) on Linch's Shortform · 2024-08-26T16:57:34.348Z · LW · GW

work on capabilities at Anthropic because of the supposed inevitability of racing with China

I can't recall hearing this take from Anthropic people before

Comment by Neel Nanda (neel-nanda-1) on One person's worth of mental energy for AI doom aversion jobs. What should I do? · 2024-08-26T11:57:12.012Z · LW · GW

Thanks! I will separately say that I disagree with the statement regardless of whether you're treating my tweet as evidence

Comment by Neel Nanda (neel-nanda-1) on One person's worth of mental energy for AI doom aversion jobs. What should I do? · 2024-08-26T02:13:46.836Z · LW · GW

Anthropic's approach doesn't seem to have panned out

Please don't take that tweet as evidence that mech interp is doomed! Much attention is on sparse autoencoders nowadays, which seem like a cool and promising approach

Comment by Neel Nanda (neel-nanda-1) on Showing SAE Latents Are Not Atomic Using Meta-SAEs · 2024-08-24T22:45:22.273Z · LW · GW

The dataset for a meta SAE is literally the decoder directions of the original SAE; there are no tokens involved
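Concretely, something like this sketch (shapes and names are illustrative assumptions, not the actual code):

    import torch

    d_model, d_sae = 512, 16384  # hypothetical sizes

    W_dec = torch.randn(d_sae, d_model)  # decoder directions of the original SAE
    dataset = W_dec / W_dec.norm(dim=-1, keepdim=True)  # one "datapoint" per latent; no tokens anywhere

    # The meta SAE is then trained to sparsely reconstruct each row of `dataset`.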

Comment by Neel Nanda (neel-nanda-1) on Open Thread Summer 2024 · 2024-08-24T21:33:10.439Z · LW · GW

Strong +1, also notifications when it comments on my posts

Comment by Neel Nanda (neel-nanda-1) on Showing SAE Latents Are Not Atomic Using Meta-SAEs · 2024-08-24T15:50:43.233Z · LW · GW

Is the first post the one you meant to link, or did you mean the followup post from Jake? The first post is on toy models of AND and XORs, which I don't see as being super relevant. But I think Jake's argument that there's clear structure that naive hypotheses neglect seems clearly legit

Comment by Neel Nanda (neel-nanda-1) on Showing SAE Latents Are Not Atomic Using Meta-SAEs · 2024-08-24T15:47:49.394Z · LW · GW

but it's not clear to me that you couldn't train a smaller, somewhat-lossy meta-SAE even on an idealized SAE, so long as the data distribution had rare events or rare properties you could thow away cheaply.

IMO an "idealized" SAE just has no structure relating features, so nothing for a meta SAE to find. I'm not sure this is possible or desirable, to be clear! But I think that's what idealized units of analysis should look like

You could also play a similar game showing that latents in a larger SAE are "merely" compositions of latents in a smaller SAE.

I agree, we do this briefly later in the post, I believe. I see our contribution more as showing that this kind of thing is possible, than that meta SAEs are objectively the best tool for it

Comment by Neel Nanda (neel-nanda-1) on Habryka's Shortform Feed · 2024-08-16T08:58:17.302Z · LW · GW

Typically, opening a bunch of posts that look interesting and processing them later, or being linked to a post (which is pretty common in safety research, since often a post will be linked, shared on slack, cited in a paper, etc) and wanting to get a vibe for whether I can be bothered to read it. I think this is pretty common for me.

I would be satisfied if hovering over eg the date gave me info like the reading time.

Another thing I just noticed: on one of my posts, it's now higher friction to edit it, since there's not the obvious 3 dots button (I eventually found it in the top right, but it's pretty easy to miss and out of the way)

Comment by Neel Nanda (neel-nanda-1) on Habryka's Shortform Feed · 2024-08-16T08:55:20.108Z · LW · GW

Ah! Hmm, that's a lot better than nothing, but pretty out of the way, and easy to miss. Maybe making it a bit bigger or darker, or bolding it? I do like the fact that it's always there as you scroll

Comment by Neel Nanda (neel-nanda-1) on Habryka's Shortform Feed · 2024-08-16T07:24:44.995Z · LW · GW

I find a visual indicator much less useful and harder to reason about than a number; I feel pretty sad at lacking this. How hard would it be to have as an optional addition?

Comment by Neel Nanda (neel-nanda-1) on Habryka's Shortform Feed · 2024-08-16T07:22:51.041Z · LW · GW

I really don't like the removal of the comment counter at the top, because that gave a link to skip to the comments. I fairly often want to skip immediately to the comments to eg get a vibe for if the post is worth reading, and having a one-click skip to it is super useful; not having that feels like a major degradation to me

Comment by Neel Nanda (neel-nanda-1) on Leaving MIRI, Seeking Funding · 2024-08-09T01:17:38.871Z · LW · GW

Suggestion: You may want to make a Manifund application in addition to the Patreon, so that donors who are US taxpayers can donate in a tax deductible way

Comment by Neel Nanda (neel-nanda-1) on Actually, Othello-GPT Has A Linear Emergent World Representation · 2024-08-07T21:29:25.390Z · LW · GW

Ah, yep, typo

Comment by Neel Nanda (neel-nanda-1) on Thomas Kwa's Shortform · 2024-07-31T01:38:19.193Z · LW · GW

Ah, gotcha. Yes, agreed. Mech interp peer review is generally garbage and does a bad job of filtering for quality (though I think it was reasonable enough at the workshop!)

Comment by Neel Nanda (neel-nanda-1) on Thomas Kwa's Shortform · 2024-07-30T10:50:37.045Z · LW · GW

Mechinterp is often no more advanced than where the EAs were in 2022.

Seems pretty false to me; ICML just rejected a bunch of the good submissions lol. I think that eg sparse autoencoders are a massive advance in the last year that unlocks a lot of exciting stuff