Posts

How to use and interpret activation patching 2024-04-24T08:35:00.857Z
[Full Post] Progress Update #1 from the GDM Mech Interp Team 2024-04-19T19:06:59.185Z
[Summary] Progress Update #1 from the GDM Mech Interp Team 2024-04-19T19:06:17.755Z
AtP*: An efficient and scalable method for localizing LLM behaviour to components 2024-03-18T17:28:37.513Z
Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems 2024-03-13T17:09:17.027Z
We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To 2024-03-06T05:03:09.639Z
Attention SAEs Scale to GPT-2 Small 2024-02-03T06:50:22.583Z
Sparse Autoencoders Work on Attention Layer Outputs 2024-01-16T00:26:14.767Z
Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization 2024-01-14T02:06:00.290Z
Fact Finding: Do Early Layers Specialise in Local Processing? (Post 5) 2023-12-23T02:46:25.892Z
Fact Finding: How to Think About Interpreting Memorisation (Post 4) 2023-12-23T02:46:16.675Z
Fact Finding: Trying to Mechanistically Understand Early MLPs (Post 3) 2023-12-23T02:46:05.517Z
Fact Finding: Simplifying the Circuit (Post 2) 2023-12-23T02:45:49.675Z
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1) 2023-12-23T02:44:24.270Z
How useful is mechanistic interpretability? 2023-12-01T02:54:53.488Z
Open Source Replication & Commentary on Anthropic's Dictionary Learning Paper 2023-10-23T22:38:33.951Z
[Paper] All's Fair In Love And Love: Copy Suppression in GPT-2 Small 2023-10-13T18:32:02.376Z
Paper Walkthrough: Automated Circuit Discovery with Arthur Conmy 2023-08-29T22:07:04.059Z
An Interpretability Illusion for Activation Patching of Arbitrary Subspaces 2023-08-29T01:04:18.688Z
Mech Interp Puzzle 2: Word2Vec Style Embeddings 2023-07-28T00:50:00.297Z
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla 2023-07-20T10:50:58.611Z
Tiny Mech Interp Projects: Emergent Positional Embeddings of Words 2023-07-18T21:24:41.990Z
Mech Interp Puzzle 1: Suspiciously Similar Embeddings in GPT-Neo 2023-07-16T22:02:15.410Z
How to Think About Activation Patching 2023-06-04T14:17:42.264Z
Finding Neurons in a Haystack: Case Studies with Sparse Probing 2023-05-03T13:30:30.836Z
Identifying semantic neurons, mechanistic circuits & interpretability web apps 2023-04-13T11:59:51.629Z
Othello-GPT: Reflections on the Research Process 2023-03-29T22:13:42.007Z
Othello-GPT: Future Work I Am Excited About 2023-03-29T22:13:26.823Z
Actually, Othello-GPT Has A Linear Emergent World Representation 2023-03-29T22:13:14.878Z
Attribution Patching: Activation Patching At Industrial Scale 2023-03-16T21:44:54.553Z
Paper Replication Walkthrough: Reverse-Engineering Modular Addition 2023-03-12T13:25:46.400Z
Mech Interp Project Advising Call: Memorisation in GPT-2 Small 2023-02-04T14:17:03.929Z
Mechanistic Interpretability Quickstart Guide 2023-01-31T16:35:49.649Z
200 COP in MI: Studying Learned Features in Language Models 2023-01-19T03:48:23.563Z
200 COP in MI: Interpreting Reinforcement Learning 2023-01-10T17:37:44.941Z
200 COP in MI: Image Model Interpretability 2023-01-08T14:53:14.681Z
200 COP in MI: Techniques, Tooling and Automation 2023-01-06T15:08:27.524Z
200 COP in MI: Analysing Training Dynamics 2023-01-04T16:08:58.089Z
200 COP in MI: Exploring Polysemanticity and Superposition 2023-01-03T01:52:46.044Z
200 COP in MI: Interpreting Algorithmic Problems 2022-12-31T19:55:39.085Z
200 COP in MI: Looking for Circuits in the Wild 2022-12-29T20:59:53.267Z
200 COP in MI: The Case for Analysing Toy Language Models 2022-12-28T21:07:03.838Z
200 Concrete Open Problems in Mechanistic Interpretability: Introduction 2022-12-28T21:06:53.853Z
Analogies between Software Reverse Engineering and Mechanistic Interpretability 2022-12-26T12:26:57.880Z
Concrete Steps to Get Started in Transformer Mechanistic Interpretability 2022-12-25T22:21:49.686Z
A Comprehensive Mechanistic Interpretability Explainer & Glossary 2022-12-21T12:35:08.589Z
A Walkthrough of In-Context Learning and Induction Heads (w/ Charles Frye) Part 1 of 2 2022-11-22T17:12:02.562Z
Results from the interpretability hackathon 2022-11-17T14:51:44.568Z
A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien) 2022-11-07T22:39:16.671Z
Real-Time Research Recording: Can a Transformer Re-Derive Positional Info? 2022-11-01T23:56:06.215Z

Comments

Comment by Neel Nanda (neel-nanda-1) on Paul Christiano named as US AI Safety Institute Head of AI Safety · 2024-04-17T10:56:54.703Z · LW · GW

It seems like we have significant need for orgs like METR or the DeepMind dangerous capabilities evals team trying to operationalise these evals, but also regulators with authority building on that work to set them as explicit and objective standards. The latter feels maybe more practical for NIST to do, especially under Paul?

Comment by Neel Nanda (neel-nanda-1) on Ophiology (or, how the Mamba architecture works) · 2024-04-09T23:07:59.532Z · LW · GW

Thanks for the clear explanation, Mamba is more cursed and less Transformer-like than I realised! And thanks for creating and open-sourcing Mamba Lens, it looks like a very useful tool for anyone wanting to build on this stuff

Comment by Neel Nanda (neel-nanda-1) on Gated Attention Blocks: Preliminary Progress toward Removing Attention Head Superposition · 2024-04-08T16:50:44.623Z · LW · GW

Each element of the $A$ matrix, denoted $A_{ij}$, is constrained to the interval $[0, 1)$. This means that $0 \le A_{ij} < 1$ for all $i, j$, where $i$ indexes the query positions and $j$ indexes the key positions:

Why is this strictly less than 1? Surely if the dot product is 1.1 and you clamp, it gets clamped to exactly 1

Comment by Neel Nanda (neel-nanda-1) on The Best Tacit Knowledge Videos on Every Subject · 2024-04-07T21:49:10.929Z · LW · GW

Oh nice, I didn't know Evan had a YouTube channel. He's one of the most renowned olympiad coaches and seems highly competent

Comment by Neel Nanda (neel-nanda-1) on Fabien's Shortform · 2024-04-06T10:55:47.415Z · LW · GW

Thanks! I read and enjoyed the book based on this recommendation

Comment by Neel Nanda (neel-nanda-1) on LessWrong: After Dark, a new side of LessWrong · 2024-04-03T10:05:38.242Z · LW · GW

I'm in favour of people having hobbies and fun projects to do in their downtime! That seems good and valuable for impact over the long term, rather than thinking that every last moment needs to be productive

Comment by Neel Nanda (neel-nanda-1) on A Selection of Randomly Selected SAE Features · 2024-04-01T12:13:12.192Z · LW · GW

What's going on? It's annoying or not interesting I'm in this photo and I don't like it I think it shouldn't be on Facebook It's spam

Comment by Neel Nanda (neel-nanda-1) on SAE-VIS: Announcement Post · 2024-03-31T15:38:16.690Z · LW · GW

Thanks for open sourcing this! We've already been finding it really useful on the DeepMind mech interp team, and saved us the effort of writing our own :)

Comment by Neel Nanda (neel-nanda-1) on SAE reconstruction errors are (empirically) pathological · 2024-03-29T17:17:09.169Z · LW · GW

Great post! I'm pretty surprised by this result, and don't have a clear story for what's going on. Though my guess is closer to "adding noise with equal norm to the error is not a fair comparison, for some reason" than "SAEs are fundamentally broken". I'd love to see someone try to figure out WTF is going on.

Comment by Neel Nanda (neel-nanda-1) on Charlie Steiner's Shortform · 2024-03-29T12:03:34.947Z · LW · GW

You may be able to notice data points where the SAE performs unusually badly at reconstruction? (Which is what you'd see if there's a crucial missing feature)

Comment by Neel Nanda (neel-nanda-1) on yanni's Shortform · 2024-03-29T12:02:50.443Z · LW · GW

What banner?

Comment by Neel Nanda (neel-nanda-1) on Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features · 2024-03-15T17:31:41.559Z · LW · GW

  1. Global Threshold - Let's treat all features the same. Set all feature activations less than [0.1] to 0 (**this is equivalent to adding a constant to the encoder bias**).

The bolded part seems false? Thresholding maps a 0.2 original act -> 0.2 new act, while adding 0.1 to the encoder bias maps a 0.2 original act -> 0.1 new act. Ie, changing the encoder bias changes the value of all activations, while thresholding only affects small ones
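
A minimal numpy sketch of the difference (an illustration with made-up helper names and a 0.1 cutoff; not code from the post):

```python
import numpy as np

def thresholded_acts(pre_acts, threshold=0.1):
    # Global threshold: zero out any activation below `threshold`,
    # but leave the surviving activations unchanged.
    acts = np.maximum(pre_acts, 0.0)  # standard ReLU SAE activation
    return np.where(acts < threshold, 0.0, acts)

def bias_shifted_acts(pre_acts, shift=0.1):
    # Lowering the encoder bias by `shift` subtracts `shift` from every
    # pre-activation before the ReLU, shrinking all surviving activations.
    return np.maximum(pre_acts - shift, 0.0)

pre_acts = np.array([0.05, 0.2, 1.0])
print(thresholded_acts(pre_acts))   # [0.  0.2 1. ]  -- 0.2 stays 0.2
print(bias_shifted_acts(pre_acts))  # [0.  0.1 0.9]  -- 0.2 becomes 0.1
```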

Comment by Neel Nanda (neel-nanda-1) on Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems · 2024-03-14T10:13:16.092Z · LW · GW

+1 that I'm still fairly confused about in-context learning; induction heads seem like a big part of the story, but we're still confused about those too!

Comment by Neel Nanda (neel-nanda-1) on My Clients, The Liars · 2024-03-11T00:44:36.803Z · LW · GW

This is not a LessWrong dynamic I've particularly noticed and it seems inaccurate to describe it as invisible helicopter blades to me

Comment by Neel Nanda (neel-nanda-1) on Attention SAEs Scale to GPT-2 Small · 2024-03-09T16:39:24.606Z · LW · GW

We've found slightly worse results for MLPs, but nowhere near 40%; I expect you're training your SAEs badly. What exact metric equals 40% here?

Comment by Neel Nanda (neel-nanda-1) on Grief is a fire sale · 2024-03-04T21:31:53.099Z · LW · GW

Thanks for the post, I found it moving. You might want to add a timestamp at the top saying "written in Nov 2023" or something, otherwise the OpenAI board stuff is jarring

Comment by Neel Nanda (neel-nanda-1) on If you weren't such an idiot... · 2024-03-03T21:47:44.724Z · LW · GW

Thanks! This inspired me to buy multiple things that I've been vaguely annoyed to lack

Comment by Neel Nanda (neel-nanda-1) on Some costs of superposition · 2024-03-03T21:41:31.329Z · LW · GW

Thanks for writing this up, I found it useful to have some of the maths spelled out! In particular, I think that the equation constraining l, the number of simultaneously active features, is likely crucial for constraining the number of features in superposition

Comment by Neel Nanda (neel-nanda-1) on New LessWrong review winner UI ("The LeastWrong" section and full-art post pages) · 2024-03-01T12:03:06.105Z · LW · GW

The art is great! How was it made?

Comment by Neel Nanda (neel-nanda-1) on New LessWrong review winner UI ("The LeastWrong" section and full-art post pages) · 2024-02-28T22:54:15.263Z · LW · GW

In my opinion the pun is worth it

Comment by Neel Nanda (neel-nanda-1) on Useful starting code for interpretability · 2024-02-14T00:19:59.458Z · LW · GW

This seems like a useful resource, thanks for making it! I think it would be more useful if you enumerated the different ARENA notebooks; my guess is many readers won't click through to the link, and are more likely to if they see the different names. And IMO the ARENA tutorials are much higher production quality than the other notebooks on that list

Comment by Neel Nanda (neel-nanda-1) on Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1) · 2024-02-09T23:08:10.132Z · LW · GW

We dig into this in post 3. The layers compose importantly with each other and don't seem to be doing the same thing in parallel; path patching the internal connections will break things, so I don't think it's like what you're describing

Comment by Neel Nanda (neel-nanda-1) on A Chess-GPT Linear Emergent World Representation · 2024-02-09T04:45:58.783Z · LW · GW

Very cool work! I'm happy to see that the "my vs their colour" result generalises

Comment by Neel Nanda (neel-nanda-1) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-06T06:13:47.367Z · LW · GW

Thanks for doing this, I'm excited about Neuronpedia focusing on SAE features! I expect this to go much better than neuron interpretability

Comment by Neel Nanda (neel-nanda-1) on An Interpretability Illusion for Activation Patching of Arbitrary Subspaces · 2024-01-26T09:43:44.109Z · LW · GW

The illusion is most concerning when learning arbitrary directions in space, not when iterating over individual neurons OR SAE features. I don't have strong takes on whether the illusion is more likely with neurons than SAEs if you're eg iterating over sparse subsets, in some sense it's more likely that you get a dormant and a disconnected feature in your SAE than as neurons since they are more meaningful?

Comment by Neel Nanda (neel-nanda-1) on Toward A Mathematical Framework for Computation in Superposition · 2024-01-22T12:08:19.353Z · LW · GW

Interesting post, thanks for writing it!

I think that the QK section somewhat under-emphasises the importance of the softmax. My intuition is that models rarely care about as precise a task as counting the number of pairs of matching query-key features at each pair of token positions, and that instead softmax is more of an "argmax-like" function that finds a handful of important token positions (though I have not empirically tested this, and would love to be proven wrong!). This enables much cheaper and more efficient solutions, since you just need the correct answer to be the argmax-ish.

For example, ignoring floating point precision, you can implement a duplicate token head with $d_{\text{head}} = 2$ and arbitrarily high $d_{\text{vocab}}$. If there are $n$ vocab elements, map the $i$th query and key to the point $i/n$ of the way round the unit circle. The dot product is maximised when they are equal.

If you further want the head to look at a resting position unless the duplicate token is there, you can increase $d_{\text{head}}$ to 3, and have a dedicated BOS dimension with a score of just below 1, so you only get a higher score for a perfect match. And then make the softmax temperature super low so it's an argmax.
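
A minimal numpy sketch of this construction (an illustration of the idea above, with arbitrary choices of vocab size, BOS score and softmax temperature; not code from the original comment):

```python
import numpy as np

def duplicate_token_attention(tokens, n_vocab, bos_score=0.999, temperature=1e-4):
    """Toy QK circuit for a duplicate token head.

    Vocab element i gets a 2D query/key vector at angle 2*pi*i/n_vocab on the
    unit circle, so query . key = 1 exactly when the tokens match and is
    strictly smaller otherwise. A third "BOS dimension" gives position 0 a
    fixed score just below 1, so it wins unless there is a perfect match at
    some earlier position. A tiny softmax temperature makes the pattern
    effectively an argmax.
    """
    angles = 2 * np.pi * np.arange(n_vocab) / n_vocab
    embed = np.stack([np.cos(angles), np.sin(angles), np.zeros(n_vocab)], axis=1)

    q = embed[tokens].copy()          # (seq, 3) queries
    k = embed[tokens].copy()          # (seq, 3) keys
    k[0] = np.array([0.0, 0.0, 1.0])  # position 0 acts as BOS / resting position
    q[:, 2] = bos_score               # every query gives the BOS key this score

    scores = q @ k.T  # (seq, seq) attention scores
    # Attend only to strictly earlier positions, plus BOS at position 0.
    mask = np.tril(np.ones_like(scores, dtype=bool), k=-1)
    mask[:, 0] = True
    scores = np.where(mask, scores, -np.inf)

    pattern = np.exp((scores - scores.max(-1, keepdims=True)) / temperature)
    return pattern / pattern.sum(-1, keepdims=True)

# Position 4 repeats token 7 (first seen at position 2), so it attends there;
# every other position falls back to attending to BOS.
tokens = np.array([0, 3, 7, 5, 7])
print(duplicate_token_attention(tokens, n_vocab=50).round(2))
```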

Comment by Neel Nanda (neel-nanda-1) on Sparse Autoencoders Work on Attention Layer Outputs · 2024-01-17T23:29:48.638Z · LW · GW

These models were not trained with dropout. Nice idea though!

Comment by Neel Nanda (neel-nanda-1) on Fact Finding: Simplifying the Circuit (Post 2) · 2024-01-17T23:29:30.543Z · LW · GW

I'm not sure! My guess is that it's because some athlete names were two tokens and others were three tokens (or longer) and we left padded so all prompts were the same length (and masked the attention so it couldn't attend to the padding tokens). We definitely didn't need to do this, and could have just filtered for two token names, it's not an important detail.

Comment by Neel Nanda (neel-nanda-1) on A Longlist of Theories of Impact for Interpretability · 2024-01-02T22:29:29.827Z · LW · GW

Thanks for the kind words! I'd class "interp supercharging other sciences" under:

Microscope AI: Maybe we can avoid deploying agents at all, by training systems to do complex tasks, then interpreting how they do it and doing it ourselves

This might just be semantics though

Comment by Neel Nanda (neel-nanda-1) on Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small · 2023-12-31T14:47:52.378Z · LW · GW

I really like this paper! This is one of my favourite interpretability papers of 2022, and has substantially influenced my research. I voted at 9 in the annual review. Specific things I like about it:

  • It really started "narrow distribution"-focused interpretability, just examining models on sentences of the form "John and Mary went to the store, John gave a bag to" -> " Mary". IMO this is a promising alternative focus to the "understand what model components mean on the full data distribution" mindset, and worth some real investment in. Model components often do many things in different contexts (are polysemantic), and narrow distributions allow us to ignore their other roles.
    • This is less ambitious and less useful than full distribution interp, but may be much easier, and still sufficient for useful applications of interp like debugging model failures (eg why does BingChat gaslight people) or creating adversarial examples.
  • It really pushed forwards causal-intervention-based mech interp (ie activation patching), rather than "analysing weights"-based mech interp. Causal interventions are inherently distribution dependent and in some sense less satisfying, but much more scalable, and an important tool in our toolkit (eg they kinda just worked on Chinchilla 70B).
    • Patching was not original to IOI, but IOI is the first time I saw someone actually try to use it to uncover a circuit
    • It was the first place I saw edge/path patching, which is a cool and important innovation on the technique. It's a lot easier to interpret a set of key nodes and how they connect up than just heads that matter in isolation.
  • It's really useful to have an example of a concrete circuit when reasoning through mech interp! I often use IOI as a standard example when teaching or thinking through something
  • When you go looking inside a model you see weird phenomena, which is valuable to know about in future - part of the work's role is giving existence proofs of these, so just a single example is sufficient
    • It was the first place I saw the phenomena of backup/self-repair, which I found highly unexpected
    • It was the first place I saw negative heads (which directly led to the copy suppression paper I supervised, one of my favourite interp papers of 2023!)
  • It's led to a lot of follow-up works trying to uncover different circuits. I think this line of research is hitting diminishing returns, but I'd still love to eg have a zoo of at least 10 circuits in small to medium language models!
  • This was the joint first mech interp work published at a top ML conference, which seems like solid field-building, with more than 100 citations in the past 14 months!

I personally didn't find the paper that easy to read, and tend to recommend people read other resources to understand the techniques used, and I'd guess it suffered somewhat from trying to conform to peer review. But idk, the above is just a lot of impressive things for a single paper!

Comment by Neel Nanda (neel-nanda-1) on A Longlist of Theories of Impact for Interpretability · 2023-12-31T14:28:37.621Z · LW · GW

Meta level: I wrote this post in 1-3 hours, and am very satisfied with the returns per unit time! I don't think this is the best or most robust post I could have written, and I think some of these theories of impact are much more important than others. But I think that just collecting a ton of these in the same place was a valuable thing to do, and have heard from multiple people who appreciated this post's existence! More importantly, it was easy and fun, and I personally want to take this as inspiration to find more, easy-to-write-yet-valuable things to do.

Object level: I think the key point I wanted to make with this post was "there's a bunch of ways that interp can be helpful", which I think basically stands. I go back and forth on how much it's valuable to think about theories of impact day to day, vs just trying to do good science and pluck impactful low-hanging fruit, but I think that either way it's valuable to have a bunch in mind rather than carefully back-chaining from a specific and fragile theory of change.

This post got some extensive criticism in Against Almost Every Theory of Impact of Interpretability, but I largely agree with Richard Ngo and Rohin Shah's responses.

Comment by Neel Nanda (neel-nanda-1) on A Mechanistic Interpretability Analysis of Grokking · 2023-12-31T14:03:31.459Z · LW · GW

Self-Review: After a while of being insecure about it, I'm now pretty fucking proud of this paper, and think it's one of the coolest pieces of research I've personally done. (I'm going to both review this post, and the subsequent paper). Though, as discussed below, I think people often overrate it.

Impact: The main impact IMO is proving that mechanistic interpretability is actually possible, that we can take a trained neural network and reverse-engineer non-trivial and unexpected algorithms from it. In particular, I think by focusing on grokking I (semi-accidentally) did a great job of picking a problem that people cared about for non-interp reasons, where mech interp was unusually easy (as it was a small model, on a clean algorithmic task), and that I was able to find real insights about grokking as a direct result of doing the mechanistic interpretability. Real models are fucking complicated (and even this model has some details we didn't fully understand), but I feel great about the field having something that's genuinely detailed, rigorous and successfully reverse-engineered, and this seems an important proof of concept. IMO the other contributions are the specific algorithm I found, and the specific insights about how and why grokking happens, but in my opinion these are much less interesting.

Field-Building: Another large amount of impact is that this was a major win for mech interp field-building. This is hard to measure, but some data:

  • There are multiple papers I like that are substantially building on/informed by these results (A toy model of universality, the clock and the pizza, Feature emergence via margin maximization, Explaining grokking through circuit efficiency)
    • It's got >100 citations in less than a year (a decent chunk of these are semi-fake citations from this being used as a standard citation for 'mech interp exists as a field', so I care more about the "how many papers would not exist without this" metric)
  • It went pretty viral on Twitter (1,000 likes, 300,000 views)
  • This was the joint first mech interp paper at a top ML conference (ICLR 23 spotlight) and seemed to get a lot of interest at the conference.
  • Anecdotally, I moderately often see people discussing this paper, or who know of me from this paper
  • At least one of my MATS scholars says he got into the field because of this work.

Is field-building good? It's plausible to me that, on the margin, less alignment effort should be going into mech interp, and more should go into other agendas. But I'm still excited about mech interp field building, think this is a solid and high-value thing, and that field-building is often positive sum - people often have strong personal fits to different areas, and many people are drawn to it from non-alignment motivations. And though there's concern over capabilities externalities, my guess is that good interp work is strongly net good.

Is it worth publishing in academic venues? Submitting this to ICLR was my first serious experience with peer review. I'm not sure what updates I've made re whether this was worth it. I think some, but probably <50% of the field-building benefit came from this, and that going Twitter viral was much more important for ensuring people became aware of the work. I think it made the work more legitimate-seeming, more citable, more respectable, etc. On an object level, it resulted in the writing, ideas and experiments becoming much better and clearer (though led to some of the more interesting speculation being relegated to appendix E :'( ), though this was largely due to /u/lawrencec's skills. I definitely find peer review/conforming to academic conventions very tedious and irritating, and am very grateful to Lawrence for doing almost all of the work.

Personal benefit: It's hard to measure, but I think this turned out to be substantially good for my career, reputation and influence. It's often been what people know me for, and I think has helped me snowball into other career successes.

Ways it's overrated: As noted, I do think there's ways people overrate the results/ideas here, and that there's too much interest in the object level results (the Fourier Multiplication algorithm, and understanding grokking). Some thoughts:

  • The specific techniques I used to understand the model are super specific to modular addition + Fourier stuff, and not very useful on real models (though I hold out hope that it'll be relevant to how language models do addition!)
  • People often think grokking is a key insight about how language models learn; I think this can be misleading. Grokking is a fragile and transitionary state you get when your hyper-parameters are just right (a bit more or less data makes it memorise or generalise immediately) and requires a ton of overtraining on the same data again and again. I think grokking gives some hints that individual circuits may be learned in sudden phase transitions (the quantization model of neural scaling), but we need much more evidence from real models on these questions. And something like "true reasoning" is plausibly a mix of many circuits, each with their own phase transition, rather than a thing that'll be suddenly grokked.
  • People often underestimate the difficulty jump in interpreting real models (even a 1L language model) compared to the modular addition model, and get too excited about how easy it'll be
    • People also get excited about more algorithmic interp work. I think this is largely played out, and focus on real models in my own work (and the work I supervise). I ultimately care about reverse-engineering AGI, and I think language models (even small ones) are decent proxies for this, while algorithmic problem models are not, unless you set up a really good one that captures some property of real models that we care about. And I'm unconvinced there's much more marginal value in demonstrating that algorithmic mech interp is possible.

Comment by Neel Nanda (neel-nanda-1) on A Universal Emergent Decomposition of Retrieval Tasks in Language Models · 2023-12-19T23:29:45.917Z · LW · GW

Cool work! I'm excited to see a diversity of approaches to mech interp

Comment by Neel Nanda (neel-nanda-1) on The LessWrong 2022 Review · 2023-12-05T19:42:41.881Z · LW · GW

Is research published elsewhere but cross posted here eligible? Eg I think that Toy Models of Superposition was one of the best papers of last year, and it was [cross posted to LessWrong](https://www.lesswrong.com/posts/CTh74TaWgvRiXnkS6/toy-models-of-superposition)/came out of the overall alignment space, but isn't exactly a LessWrong post per se.

(notably, my grokking work and causal scrubbing were mech interp research that WAS published on LessWrong first and foremost)

Comment by Neel Nanda (neel-nanda-1) on MATS Summer 2023 Retrospective · 2023-12-02T19:48:36.371Z · LW · GW

For what it's worth, as a MATS mentor, I gave a bunch of 7s and 8s for people I'm excited about, and felt bad giving people 9s or 10s unless it was super obviously justified

Comment by Neel Nanda (neel-nanda-1) on How useful is mechanistic interpretability? · 2023-12-02T10:54:11.415Z · LW · GW

This is a discussion that would need to be its own post, but I think superposition is basically not real and a confused concept.

I'd be curious to hear more about this - IMO we're talking past each other given that we disagree on this point! Like, in my opinion, the reason low rank approximations work at all is because of superposition.

For example, if an SAE gives us 16x as many dimensions as the original activations, and we find that half of those are interpretable, to me this seems like clear evidence of superposition (8x as many interpretable directions!). How would you interpret that phenomenon?

Comment by Neel Nanda (neel-nanda-1) on How useful is mechanistic interpretability? · 2023-12-02T10:16:51.047Z · LW · GW

My understanding was that John wanted to only have a few variables mattering on a given input, which SAEs give you. The causal graph is large in general, but IMO that's just an unavoidable property of models and superposition.

I'm confused by why you don't consider "only a few neurons being non-zero" to be a "low dimensional summary of the relevant information in the layer"

Comment by Neel Nanda (neel-nanda-1) on 200 COP in MI: Exploring Polysemanticity and Superposition · 2023-11-21T22:59:05.736Z · LW · GW

Thanks for the catch, I deleted "Note that the hidden dimen". Transformers do blow up the hidden dimension, but that's not very relevant here - they have many more neurons than residual stream dimensions, and they have many more features than neurons (as shown in the recent Anthropic paper)

Comment by Neel Nanda (neel-nanda-1) on Johannes C. Mayer's Shortform · 2023-11-18T13:03:27.247Z · LW · GW

Seems clearly true; the Fourier Multiplication Algorithm for modular addition is not the program for performing modular addition that's easiest for me to understand!

Comment by Neel Nanda (neel-nanda-1) on Experiences and learnings from both sides of the AI safety job market · 2023-11-15T19:43:07.150Z · LW · GW

Thanks!

Comment by Neel Nanda (neel-nanda-1) on Experiences and learnings from both sides of the AI safety job market · 2023-11-15T18:11:14.755Z · LW · GW

Thanks for writing this, this is a great post and I broadly agree with most of it!

If you get rejected without being invited to an interview, this is unfortunate but still valuable feedback. It basically means “You clearly aren’t there yet”. So you should probably build more skills for 6 months or so before applying again.

This feels false to me. I've done a lot of CV (aka resume) screening, and it is a super noisy process, and it's easy to be overly credentialist and favour people with legible signalling. There's probably also a fair amount of noise in how well you write your CV to make the crucial information prominent (relevant work experience, relevant publications, degrees, relevant projects, anything else impressive you've done). Further, "6 months of upskilling" may not turn out anything super legible (though it's great if it does, and this is worth aiming for!)

My MATS application has a 10 hour work task, and it's like night and day looking at the difference between how much signal I get from that and from just the CV, and I accept a lot of candidates who look mediocre on paper (and vice versa).

If you're getting desk rejected from jobs, I'd recommend asking a friend (ideally one with some experience in the relevant field/industry or who's done hiring before) to look at your CV/application to some recent jobs and give feedback.

Comment by Neel Nanda (neel-nanda-1) on Picking Mentors For Research Programmes · 2023-11-12T12:08:20.016Z · LW · GW

Strong +1 to asking the mentor being a great way to get information! My guess is many mentors aren't going out of their way to volunteer this kind of info, but will share it if asked. Especially if they've already decided that they want to work with you.

My MATS admission doc has some info on that for me, though I can give more detailed answers if anyone emails me with specific questions.

Comment by Neel Nanda (neel-nanda-1) on Picking Mentors For Research Programmes · 2023-11-12T12:05:55.612Z · LW · GW

I'd guess this varies by field? I think this would be bad advice in mech interp - there are a lot of concepts and existing mech interp theory that you need in order to understand a bunch of good projects, and people new to the field are often bad at explaining these (and, importantly, I think I have decent judgement about whether a project is any good). But I'd guess this is decent advice in some areas of alignment.

Comment by Neel Nanda (neel-nanda-1) on Linear encoding of character-level information in GPT-J token embeddings · 2023-11-10T23:30:34.790Z · LW · GW

We show that linear probes can retrieve much character-level information from embeddings and we ;.

You have a sentence fragment in the TLDR

Comment by Neel Nanda (neel-nanda-1) on Open Source Replication & Commentary on Anthropic's Dictionary Learning Paper · 2023-11-06T14:32:04.831Z · LW · GW

Thanks for the analysis! This seems pretty persuasive to me, especially the argument that "fire as rarely as possible" could incentivise learning the same feature, and that it doesn't trivially fall out of other dimensionality reduction methods. I think this predicts that if we look at the gradient with respect to the pre-activation value in MLP activation space, the average of this will correspond to the rare feature direction? Though maybe not, since we want the average weighted by "how often does this cause a feature to flip from on to off"; there's no incentive to go from -4 to -5.

An update is that when training on gelu-2l with the same parameters, I get truly dead features but fairly few ultra low features, and in one autoencoder (I think the final layer) the truly dead features are gone. This time I trained on mlp_output rather than mlp_activations, which is another possible difference.

Comment by Neel Nanda (neel-nanda-1) on On the Executive Order · 2023-11-02T10:36:43.755Z · LW · GW

My guess is that the threshold is a precursor to more stringent regulation on people above the bar, and that it's easier to draw a line in the sand now and stick to it. I feel pretty fine with it being so high

Comment by Neel Nanda (neel-nanda-1) on Open Source Replication & Commentary on Anthropic's Dictionary Learning Paper · 2023-10-27T18:09:42.068Z · LW · GW

Interesting! My guess is that the numbers are small enough that there's not much to it? But I share your prior that it should be basically orthogonal. The MLP basis is weird and privileged and I don't feel well equipped to reason about it

Comment by Neel Nanda (neel-nanda-1) on Open Source Replication & Commentary on Anthropic's Dictionary Learning Paper · 2023-10-26T22:21:22.235Z · LW · GW

Search for the text "A sample of texts where the average feature fires highly" and look at the figure below

Comment by Neel Nanda (neel-nanda-1) on Apply to the Constellation Visiting Researcher Program and Astra Fellowship, in Berkeley this Winter · 2023-10-26T09:16:46.721Z · LW · GW

Note that the Astra Fellowship link in italics at the top goes to the researcher program, not the Astra Fellowship

Comment by Neel Nanda (neel-nanda-1) on Open Source Replication & Commentary on Anthropic's Dictionary Learning Paper · 2023-10-26T09:13:08.345Z · LW · GW

Ah, I did compare it to the mean activations and didn't find much, alas. Good idea though!