Posts

Mechanistic Interpretability Workshop Happening at ICML 2024! 2024-05-03T01:18:26.936Z
Transcoders enable fine-grained interpretable circuit analysis for language models 2024-04-30T17:58:09.982Z
Refusal in LLMs is mediated by a single direction 2024-04-27T11:13:06.235Z
Improving Dictionary Learning with Gated Sparse Autoencoders 2024-04-25T18:43:47.003Z
How to use and interpret activation patching 2024-04-24T08:35:00.857Z
[Full Post] Progress Update #1 from the GDM Mech Interp Team 2024-04-19T19:06:59.185Z
[Summary] Progress Update #1 from the GDM Mech Interp Team 2024-04-19T19:06:17.755Z
AtP*: An efficient and scalable method for localizing LLM behaviour to components 2024-03-18T17:28:37.513Z
Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems 2024-03-13T17:09:17.027Z
We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To 2024-03-06T05:03:09.639Z
Attention SAEs Scale to GPT-2 Small 2024-02-03T06:50:22.583Z
Sparse Autoencoders Work on Attention Layer Outputs 2024-01-16T00:26:14.767Z
Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization 2024-01-14T02:06:00.290Z
Fact Finding: Do Early Layers Specialise in Local Processing? (Post 5) 2023-12-23T02:46:25.892Z
Fact Finding: How to Think About Interpreting Memorisation (Post 4) 2023-12-23T02:46:16.675Z
Fact Finding: Trying to Mechanistically Understand Early MLPs (Post 3) 2023-12-23T02:46:05.517Z
Fact Finding: Simplifying the Circuit (Post 2) 2023-12-23T02:45:49.675Z
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1) 2023-12-23T02:44:24.270Z
How useful is mechanistic interpretability? 2023-12-01T02:54:53.488Z
Open Source Replication & Commentary on Anthropic's Dictionary Learning Paper 2023-10-23T22:38:33.951Z
[Paper] All's Fair In Love And Love: Copy Suppression in GPT-2 Small 2023-10-13T18:32:02.376Z
Paper Walkthrough: Automated Circuit Discovery with Arthur Conmy 2023-08-29T22:07:04.059Z
An Interpretability Illusion for Activation Patching of Arbitrary Subspaces 2023-08-29T01:04:18.688Z
Mech Interp Puzzle 2: Word2Vec Style Embeddings 2023-07-28T00:50:00.297Z
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla 2023-07-20T10:50:58.611Z
Tiny Mech Interp Projects: Emergent Positional Embeddings of Words 2023-07-18T21:24:41.990Z
Mech Interp Puzzle 1: Suspiciously Similar Embeddings in GPT-Neo 2023-07-16T22:02:15.410Z
How to Think About Activation Patching 2023-06-04T14:17:42.264Z
Finding Neurons in a Haystack: Case Studies with Sparse Probing 2023-05-03T13:30:30.836Z
Identifying semantic neurons, mechanistic circuits & interpretability web apps 2023-04-13T11:59:51.629Z
Othello-GPT: Reflections on the Research Process 2023-03-29T22:13:42.007Z
Othello-GPT: Future Work I Am Excited About 2023-03-29T22:13:26.823Z
Actually, Othello-GPT Has A Linear Emergent World Representation 2023-03-29T22:13:14.878Z
Attribution Patching: Activation Patching At Industrial Scale 2023-03-16T21:44:54.553Z
Paper Replication Walkthrough: Reverse-Engineering Modular Addition 2023-03-12T13:25:46.400Z
Mech Interp Project Advising Call: Memorisation in GPT-2 Small 2023-02-04T14:17:03.929Z
Mechanistic Interpretability Quickstart Guide 2023-01-31T16:35:49.649Z
200 COP in MI: Studying Learned Features in Language Models 2023-01-19T03:48:23.563Z
200 COP in MI: Interpreting Reinforcement Learning 2023-01-10T17:37:44.941Z
200 COP in MI: Image Model Interpretability 2023-01-08T14:53:14.681Z
200 COP in MI: Techniques, Tooling and Automation 2023-01-06T15:08:27.524Z
200 COP in MI: Analysing Training Dynamics 2023-01-04T16:08:58.089Z
200 COP in MI: Exploring Polysemanticity and Superposition 2023-01-03T01:52:46.044Z
200 COP in MI: Interpreting Algorithmic Problems 2022-12-31T19:55:39.085Z
200 COP in MI: Looking for Circuits in the Wild 2022-12-29T20:59:53.267Z
200 COP in MI: The Case for Analysing Toy Language Models 2022-12-28T21:07:03.838Z
200 Concrete Open Problems in Mechanistic Interpretability: Introduction 2022-12-28T21:06:53.853Z
Analogies between Software Reverse Engineering and Mechanistic Interpretability 2022-12-26T12:26:57.880Z
Concrete Steps to Get Started in Transformer Mechanistic Interpretability 2022-12-25T22:21:49.686Z
A Comprehensive Mechanistic Interpretability Explainer & Glossary 2022-12-21T12:35:08.589Z

Comments

Comment by Neel Nanda (neel-nanda-1) on MATS Winter 2023-24 Retrospective · 2024-05-14T23:44:07.509Z · LW · GW

I see this is strongly disagree voted - I don't mind, but I'd be curious for people to reply with which parts they disagree with! (Or at least disagree react to specific lines). I make a lot of claims in that comment, though I personally think they're all pretty reasonable. The one about not wanting inexperienced researchers to start orgs, or "alignment teams at scaling labs are good actually" might be spiciest?

Comment by Neel Nanda (neel-nanda-1) on OpenAI releases GPT-4o, natively interfacing with text, voice and vision · 2024-05-13T20:20:27.318Z · LW · GW

Might be mainly driven by an improved tokenizer.

I would be shocked if this were the main driver: they claim that English only has 1.1x fewer tokens, but they seem to claim much bigger speed-ups

Comment by Neel Nanda (neel-nanda-1) on MATS Winter 2023-24 Retrospective · 2024-05-12T21:51:02.291Z · LW · GW

(EDIT: I just saw Ryan posted a comment a few minutes before mine, I agree substantially with it)

As a Google DeepMind employee I'm obviously pretty biased, but this seems pretty reasonable to me, assuming it's about alignment/similar teams at those labs? (If it's about capabilities teams, I agree that's bad!)

I think the alignment teams generally do good and useful work, especially those in a position to publish on it. And it seems extremely important that whoever makes AGI has a world-class alignment team! And some kinds of alignment research can only really be done with direct access to frontier models. MATS scholars tend to be pretty early in their alignment research career, and I also expect frontier lab alignment teams are a better place to learn technical skills, especially engineering, and generally have higher talent density.

UK AISI/US AISI/METR seem like solid options for evals, but basically just work on evals, and Ryan says down thread that only 18% of scholars work on evals/demos. And I think it's valuable both for frontier labs to have good evals teams and for there to be good external evaluators (especially in government), I can see good arguments favouring either option.

44% of scholars did interpretability, where in my opinion the Anthropic team is clearly a fantastic option, and I like to think DeepMind is also a decent option, as is OpenAI. Apollo and various academic labs are the main other places you can do mech interp. So those career preferences seem pretty reasonable to me for interp scholars.

17% are on oversight/control, and for oversight I think you generally want a lot of compute and access to frontier models? I am less sure for control, and think Redwood is doing good work there, but as far as I'm aware they're not hiring.

This is all assuming that scholars want to keep working in the same field they did MATS for, which in my experience is often but not always true.

I'm personally quite skeptical of inexperienced researchers trying to start new orgs - starting a new org and having it succeed is really, really hard, and much easier with more experience! So people preferring to get jobs seems great by my lights

Comment by Neel Nanda (neel-nanda-1) on MATS Winter 2023-24 Retrospective · 2024-05-11T22:08:24.848Z · LW · GW

Note that number of scholars is a much more important metric than number of mentors when it comes to evaluating MATS resources, as scholars per mentor varies a bunch (eg over winter I had 10 scholars, which is much more than most mentors). Harder to evaluate from the outside though!

Comment by Neel Nanda (neel-nanda-1) on Refusal in LLMs is mediated by a single direction · 2024-05-11T19:45:54.818Z · LW · GW

Thanks, I'd be very curious to hear if this meets your bar for being impressed, or what else it would take! Further evidence:

  • Passing the Twitter test (for at least one user)
  • Being used by Simon Lerman, an author on Bad LLaMA (admittedly with the help of Andy Arditi, our first author), to jailbreak LLaMA3 70B to help create data for some red-teaming research (EDIT: rather than Simon choosing to fine-tune it, which he clearly knows how to do, being a Bad LLaMA author).

Comment by Neel Nanda (neel-nanda-1) on Mechanistic Interpretability Workshop Happening at ICML 2024! · 2024-05-11T10:13:27.420Z · LW · GW

Nnsight, pyvene, inseq, and torchlens are other libraries that come to mind that would be good to discuss in a related work section. Also penzai in JAX

Comment by Neel Nanda (neel-nanda-1) on Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1) · 2024-05-09T15:57:21.522Z · LW · GW

I hadn't seen the latter, thanks for sharing!

Comment by Neel Nanda (neel-nanda-1) on Refusal in LLMs is mediated by a single direction · 2024-05-06T11:54:59.557Z · LW · GW

Agreed, it seems less elegant, but one guy on huggingface did a rough plot of the cross-correlation, and it seems to show that the direction changes with layer: https://huggingface.co/posts/Undi95/318385306588047#663744f79522541bd971c919. Although perhaps we are missing something.

Idk. This shows that if you wanted to optimally get rid of refusal, you might want to do this. But, really, you want to balance between removing refusal and not damaging the model. Probably many layers are just kinda irrelevant for refusal. Though really this argues that we're both wrong, and the most surgical intervention is deleting the direction from key layers only.

Comment by Neel Nanda (neel-nanda-1) on Refusal in LLMs is mediated by a single direction · 2024-05-05T13:19:57.592Z · LW · GW

Thanks! I'm personally skeptical of ablating a separate direction per block: it feels less surgical than a single direction everywhere, and we show that a single direction works fine for LLaMA3 8B and 70B

The transformer lens library does not have a save feature :(

Note that you can just do torch.save(model.state_dict(), FILE_PATH) as with any PyTorch model.
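For completeness, here's a minimal sketch of the save/load round trip (the model choice and file name are just placeholders):

```python
import torch
from transformer_lens import HookedTransformer

# Load a model and (optionally) modify its weights in place.
model = HookedTransformer.from_pretrained("gpt2")

# Save the weights exactly as with any PyTorch module.
torch.save(model.state_dict(), "model_weights.pt")

# Later: rebuild the same architecture, then load the saved weights back in.
model2 = HookedTransformer.from_pretrained("gpt2")
model2.load_state_dict(torch.load("model_weights.pt"))
```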

Comment by Neel Nanda (neel-nanda-1) on Introducing AI-Powered Audiobooks of Rational Fiction Classics · 2024-05-04T21:57:05.056Z · LW · GW

Thanks for making these! How expensive is it?

Comment by Neel Nanda (neel-nanda-1) on Mechanistic Interpretability Workshop Happening at ICML 2024! · 2024-05-04T11:36:30.556Z · LW · GW

Makes sense! Sounds like a fairly good fit

It just seems intuitively like a natural fit: Everyone in mech interp needs to inspect models. This tool makes it easier to inspect models.

Another way of framing it: Try to write your paper in such a way that a mech interp researcher reading it says "huh, I want to go and use this library for my research". Eg give examples of things that were previously hard that are now easy.

Comment by Neel Nanda (neel-nanda-1) on Mechanistic Interpretability Workshop Happening at ICML 2024! · 2024-05-03T09:47:23.006Z · LW · GW

Looks relevant to me on a skim! I'd probably want to see some arguments in the submission for why this is useful tooling for mech interp people specifically (though being useful to non mech interp people too is a bonus!)

Comment by Neel Nanda (neel-nanda-1) on Transcoders enable fine-grained interpretable circuit analysis for language models · 2024-05-01T22:23:51.216Z · LW · GW

That's awesome, and insanely fast! Thanks so much, I really appreciate it

Comment by Neel Nanda (neel-nanda-1) on Transcoders enable fine-grained interpretable circuit analysis for language models · 2024-05-01T09:14:04.480Z · LW · GW

Nope to both of those, though I think both could be interesting directions!

Comment by Neel Nanda (neel-nanda-1) on Improving Dictionary Learning with Gated Sparse Autoencoders · 2024-05-01T01:03:47.189Z · LW · GW

Nah I think it's pretty sketchy. I personally prefer mean ablation, especially for residual stream SAEs where zero ablation is super damaging. But even there I agree. Compute efficiency hit would be nice, though it's a pain to get the scaling laws precise enough

For our paper this is irrelevant though IMO because we're comparing gated and normal SAEs, and I think this is just scaling by a constant? It's at least monotonic in CE loss degradation

Comment by Neel Nanda (neel-nanda-1) on Refusal in LLMs is mediated by a single direction · 2024-04-29T18:33:16.185Z · LW · GW

I don't think we really engaged with that question in this post, so the following is fairly speculative. But I think there are some situations where this would be a superior technique, mostly low-resource settings where doing a backwards pass is prohibitive for memory reasons, or with a very tight compute budget. But yeah, this isn't a load-bearing claim for me: I still count it as a partial victory to find a novel technique that's a bit worse than fine-tuning, and think this is significantly better than prior interp work. Seems reasonable to disagree though, and say it needs to be better or bust

Comment by Neel Nanda (neel-nanda-1) on Refusal in LLMs is mediated by a single direction · 2024-04-29T13:30:38.501Z · LW · GW

+1 to Rohin. I also think "we found a cheaper way to remove safety guardrails from a model's weights than fine-tuning" is a real result (albeit the opposite of useful), though I would want to do more actual benchmarking before claiming too confidently that it's cheaper. I don't think it's a qualitative improvement over what fine-tuning can do, hence the hedging and calling it tentative

Comment by Neel Nanda (neel-nanda-1) on Refusal in LLMs is mediated by a single direction · 2024-04-29T02:07:41.087Z · LW · GW

Thanks! Broadly agreed

For example, I think our understanding of Grokking in late 2022 turned out to be importantly incomplete.

I'd be curious to hear more about what you meant by this

Comment by Neel Nanda (neel-nanda-1) on Refusal in LLMs is mediated by a single direction · 2024-04-28T22:46:05.628Z · LW · GW

It was added recently and just included in a new release, so pip install transformer_lens should work now/soon (you want v1.16.0, I think); otherwise you can install from the GitHub repo

Comment by Neel Nanda (neel-nanda-1) on Refusal in LLMs is mediated by a single direction · 2024-04-28T11:05:17.725Z · LW · GW

There's been a fair amount of work on activation steering and similar techniques, bearing on eg sycophancy and truthfulness, where you find the vector and inject it (eg Rimsky et al and Zou et al). It seems to work decently well. We found it hard to bypass refusal by steering, and instead got it to work by ablation, which I haven't seen much elsewhere, but I could easily be missing references
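For concreteness, a rough sketch of the kind of directional ablation I mean, using TransformerLens hooks (the model choice, hook point, and how the direction is found are placeholders here, not our exact setup):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in for the actual chat model

# In practice the direction comes from the model's own activations (eg a difference of
# mean residual stream activations on harmful vs harmless prompts); a random unit
# vector is used here just to keep the snippet self-contained.
direction = torch.randn(model.cfg.d_model)
direction = direction / direction.norm()

def ablate_direction(resid, hook):
    # Remove the component along `direction` from every residual stream vector.
    return resid - (resid @ direction)[..., None] * direction

# Ablate at every layer's residual stream during generation.
fwd_hooks = [(f"blocks.{layer}.hook_resid_post", ablate_direction)
             for layer in range(model.cfg.n_layers)]
with model.hooks(fwd_hooks=fwd_hooks):
    output = model.generate("Some prompt", max_new_tokens=20)
```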

Comment by Neel Nanda (neel-nanda-1) on Refusal in LLMs is mediated by a single direction · 2024-04-28T11:00:33.758Z · LW · GW

First and foremost, this is interpretability work, not directly safety work. Our goal was to see if insights about model internals could be applied to do anything useful on a real world task, as validation that our techniques and models of interpretability were correct. I would tentatively say that we succeeded here, though less than I would have liked. We are not making a strong statement that addressing refusals is a high importance safety problem.

I do want to push back on the broader point though: I think getting refusals right does matter. I think a lot of the corporate censorship stuff is dumb, and I could not care less about whether GPT4 says naughty words. And IMO it's not very relevant to deceptive alignment threat models, which I care a lot about. But it's quite important for minimising misuse of models: we will eventually get models capable of eg helping terrorists make better bioweapons (though I don't think we currently have such), and people will want to deploy those behind an API. I would like them to be as jailbreak-proof as possible!

Comment by Neel Nanda (neel-nanda-1) on Improving Dictionary Learning with Gated Sparse Autoencoders · 2024-04-26T02:48:49.168Z · LW · GW

Re dictionary width: 2**17 (~131K) for most Gated SAEs, 3*(2**16) for baseline SAEs, except for the (Pythia-2.8B, Residual Stream) sites, where we used 2**15 for Gated and 3*(2**14) for baseline, since early runs of these had lots of feature death. (This'll be added to the paper soon, sorry!) I'll leave the other Qs for my co-authors

Comment by Neel Nanda (neel-nanda-1) on Improving Dictionary Learning with Gated Sparse Autoencoders · 2024-04-26T00:14:38.219Z · LW · GW

I haven't fully worked through the maths, but I think both IG and attribution patching break down here? The fundamental problem is that the discontinuity is invisible to IG because it only takes derivatives. Eg the ReLU and Jump ReLU below look identical from the perspective of IG, but not from the perspective of activation patching, I think.

Comment by Neel Nanda (neel-nanda-1) on Funny Anecdote of Eliezer From His Sister · 2024-04-25T23:04:53.549Z · LW · GW

From the title I expected this to be embarrassing for Eliezer, but that was actually extremely sweet, and good advice!

Comment by Neel Nanda (neel-nanda-1) on Improving Dictionary Learning with Gated Sparse Autoencoders · 2024-04-25T21:03:42.881Z · LW · GW

Great work! Obviously the results here speak for themselves, but I especially wanted to compliment the authors on the writing. I thought this paper was a pleasure to read, and easily a top 5% exemplar of clear technical writing. Thanks for putting in the effort on that.

<3 Thanks so much, that's extremely kind. Credit entirely goes to Sen and Arthur, which is even more impressive given that they somehow took this from a blog post to a paper in a two week sprint! (including re-running all the experiments!!)

Comment by Neel Nanda (neel-nanda-1) on Paul Christiano named as US AI Safety Institute Head of AI Safety · 2024-04-17T10:56:54.703Z · LW · GW

It seems like there's a significant need for orgs like METR or the DeepMind dangerous capabilities evals team trying to operationalise these evals, but also for regulators with authority to build on that work and set them as explicit and objective standards. The latter feels maybe more practical for NIST to do, especially under Paul?

Comment by Neel Nanda (neel-nanda-1) on Ophiology (or, how the Mamba architecture works) · 2024-04-09T23:07:59.532Z · LW · GW

Thanks for the clear explanation - Mamba is more cursed and less Transformer-like than I realised! And thanks for creating and open-sourcing Mamba Lens; it looks like a very useful tool for anyone wanting to build on this stuff

Comment by Neel Nanda (neel-nanda-1) on Gated Attention Blocks: Preliminary Progress toward Removing Attention Head Superposition · 2024-04-08T16:50:44.623Z · LW · GW

Each element of the  matrix, denoted as , is constrained to the interval . This means that for all , where  indexes the query positions and  indexes the key positions:

Why is this strictly less than 1? Surely if the dot product is 1.1 and you clamp, it gets clamped to exactly 1

Comment by Neel Nanda (neel-nanda-1) on The Best Tacit Knowledge Videos on Every Subject · 2024-04-07T21:49:10.929Z · LW · GW

Oh nice, I didn't know Evan had a YouTube channel. He's one of the most renowned olympiad coaches and seems highly competent

Comment by Neel Nanda (neel-nanda-1) on Fabien's Shortform · 2024-04-06T10:55:47.415Z · LW · GW

Thanks! I read and enjoyed the book based on this recommendation

Comment by Neel Nanda (neel-nanda-1) on LessWrong: After Dark, a new side of LessWrong · 2024-04-03T10:05:38.242Z · LW · GW

I'm in favour of people having hobbies and fun projects to do in their downtime! That seems good and valuable for impact over the long term, rather than thinking that every last moment needs to be productive

Comment by Neel Nanda (neel-nanda-1) on A Selection of Randomly Selected SAE Features · 2024-04-01T12:13:12.192Z · LW · GW
Comment by Neel Nanda (neel-nanda-1) on SAE-VIS: Announcement Post · 2024-03-31T15:38:16.690Z · LW · GW

Thanks for open-sourcing this! We've already been finding it really useful on the DeepMind mech interp team, and it saved us the effort of writing our own :)

Comment by Neel Nanda (neel-nanda-1) on SAE reconstruction errors are (empirically) pathological · 2024-03-29T17:17:09.169Z · LW · GW

Great post! I'm pretty surprised by this result, and don't have a clear story for what's going on. Though my guess is closer to "adding noise with equal norm to the error is not a fair comparison, for some reason" than "SAEs are fundamentally broken". I'd love to see someone try to figure out WTF is going on.

Comment by Neel Nanda (neel-nanda-1) on Charlie Steiner's Shortform · 2024-03-29T12:03:34.947Z · LW · GW

You may be able to notice data points where the SAE performs unusually badly at reconstruction? (Which is what you'd see if there's a crucial missing feature)
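Something like the following, perhaps (a toy sketch; the SAE class and its encode/decode methods are stand-ins for whatever implementation you're actually using):

```python
import torch

# Toy stand-in for a trained SAE; the encode/decode names are assumptions,
# not any particular library's API.
class ToySAE(torch.nn.Module):
    def __init__(self, d_model=16, d_sae=64):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_sae)
        self.dec = torch.nn.Linear(d_sae, d_model)

    def encode(self, x):
        return torch.relu(self.enc(x))

    def decode(self, feats):
        return self.dec(feats)

sae = ToySAE()
activations = torch.randn(1000, 16)  # stand-in for real model activations

# Relative reconstruction error per data point.
with torch.no_grad():
    recon = sae.decode(sae.encode(activations))
errors = (recon - activations).norm(dim=-1) / activations.norm(dim=-1)

# Flag the points the SAE reconstructs unusually badly, eg the worst 1%.
suspicious = (errors > torch.quantile(errors, 0.99)).nonzero(as_tuple=True)[0]
```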

Comment by Neel Nanda (neel-nanda-1) on yanni's Shortform · 2024-03-29T12:02:50.443Z · LW · GW

What banner?

Comment by Neel Nanda (neel-nanda-1) on Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features · 2024-03-15T17:31:41.559Z · LW · GW
  1. Global Threshold - Let's treat all features the same. Set all feature activations less than [0.1] to 0 (this is equivalent to adding a constant to the encoder bias).

 

The bolded part seems false? This maps 0.2 original act -> 0.2 new act, while shifting the encoder bias down by 0.1 maps 0.2 original act -> 0.1 new act. Ie, changing the encoder bias changes the value of all activations, while thresholding only affects small ones
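A quick numerical illustration of the difference (made-up numbers):

```python
import torch

acts = torch.tensor([0.05, 0.2, 1.0])  # post-ReLU feature activations

# Global threshold: zero out anything below 0.1, leave everything else untouched.
thresholded = torch.where(acts < 0.1, torch.zeros_like(acts), acts)
# -> [0.00, 0.20, 1.00]

# Shifting the encoder bias down by 0.1 (then re-applying the ReLU) shifts every
# activation, not just the small ones.
bias_shifted = torch.relu(acts - 0.1)
# -> [0.00, 0.10, 0.90]
```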

Comment by Neel Nanda (neel-nanda-1) on Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems · 2024-03-14T10:13:16.092Z · LW · GW

+1 that I'm still fairly confused about in-context learning; induction heads seem like a big part of the story, but we're still confused about those too!

Comment by Neel Nanda (neel-nanda-1) on My Clients, The Liars · 2024-03-11T00:44:36.803Z · LW · GW

This is not a LessWrong dynamic I've particularly noticed, and it seems inaccurate to me to describe it as invisible helicopter blades

Comment by Neel Nanda (neel-nanda-1) on Attention SAEs Scale to GPT-2 Small · 2024-03-09T16:39:24.606Z · LW · GW

We've found slightly worse results for MLPs, but nowhere near 40%; I expect you're training your SAEs badly. What exact metric equals 40% here?

Comment by Neel Nanda (neel-nanda-1) on Grief is a fire sale · 2024-03-04T21:31:53.099Z · LW · GW

Thanks for the post, I found it moving. You might want to add a timestamp at the top saying "written in Nov 2023" or something, otherwise the OpenAI board stuff is jarring

Comment by Neel Nanda (neel-nanda-1) on If you weren't such an idiot... · 2024-03-03T21:47:44.724Z · LW · GW

Thanks! This inspired me to buy multiple things that I've been vaguely annoyed to lack

Comment by Neel Nanda (neel-nanda-1) on Some costs of superposition · 2024-03-03T21:41:31.329Z · LW · GW

Thanks for writing this up, I found it useful to have some of the maths spelled out! In particular, I think that the equation constraining l, the number of simultaneously active features, is likely crucial for constraining the number of features in superposition

Comment by Neel Nanda (neel-nanda-1) on New LessWrong review winner UI ("The LeastWrong" section and full-art post pages) · 2024-03-01T12:03:06.105Z · LW · GW

The art is great! How was it made?

Comment by Neel Nanda (neel-nanda-1) on New LessWrong review winner UI ("The LeastWrong" section and full-art post pages) · 2024-02-28T22:54:15.263Z · LW · GW

In my opinion the pun is worth it

Comment by Neel Nanda (neel-nanda-1) on Useful starting code for interpretability · 2024-02-14T00:19:59.458Z · LW · GW

This seems like a useful resource, thanks for making it! I think it would be more useful if you enumerated the different ARENA notebooks; my guess is many readers won't click through to the link, and they're more likely to if they see the different names. And IMO the ARENA tutorials are much higher production quality than the other notebooks on that list

Comment by Neel Nanda (neel-nanda-1) on Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1) · 2024-02-09T23:08:10.132Z · LW · GW

We dig into this in post 3. The layers compose importantly with each other and don't seem to be doing the same thing in parallel (path patching the internal connections will break things), so I don't think it's like what you're describing

Comment by Neel Nanda (neel-nanda-1) on A Chess-GPT Linear Emergent World Representation · 2024-02-09T04:45:58.783Z · LW · GW

Very cool work! I'm happy to see that the "my vs their colour" result generalises

Comment by Neel Nanda (neel-nanda-1) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-06T06:13:47.367Z · LW · GW

Thanks for doing this, I'm excited about Neuronpedia focusing on SAE features! I expect this to go much better than neuron interpretability

Comment by Neel Nanda (neel-nanda-1) on An Interpretability Illusion for Activation Patching of Arbitrary Subspaces · 2024-01-26T09:43:44.109Z · LW · GW

The illusion is most concerning when learning arbitrary directions in activation space, not when iterating over individual neurons or SAE features. I don't have strong takes on whether the illusion is more likely with neurons than with SAEs if you're eg iterating over sparse subsets; in some sense it's more likely that you get a dormant and a disconnected feature in your SAE than among neurons, since SAE features are more meaningful?