Posts

MATS Applications + Research Directions I'm Currently Excited About 2025-02-06T11:03:40.093Z
Learning Multi-Level Features with Matryoshka SAEs 2024-12-19T15:59:00.036Z
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders 2024-12-11T06:30:37.076Z
Evolutionary prompt optimization for SAE feature visualization 2024-11-14T13:06:49.728Z
SAEs are highly dataset dependent: a case study on the refusal direction 2024-11-07T05:22:18.807Z
SAE Probing: What is it good for? Absolutely something! 2024-11-01T19:23:55.418Z
Open Source Replication of Anthropic’s Crosscoder paper for model-diffing 2024-10-27T18:46:21.316Z
SAE features for refusal and sycophancy steering vectors 2024-10-12T14:54:48.022Z
Base LLMs refuse too 2024-09-29T16:04:21.343Z
Showing SAE Latents Are Not Atomic Using Meta-SAEs 2024-08-24T00:56:46.048Z
Calendar feature geometry in GPT-2 layer 8 residual stream SAEs 2024-08-17T01:16:53.764Z
Extracting SAE task features for in-context learning 2024-08-12T20:34:13.747Z
Self-explaining SAE features 2024-08-05T22:20:36.041Z
BatchTopK: A Simple Improvement for TopK-SAEs 2024-07-20T02:20:51.848Z
JumpReLU SAEs + Early Access to Gemma 2 SAEs 2024-07-19T16:10:54.664Z
SAEs (usually) Transfer Between Base and Chat Models 2024-07-18T10:29:46.138Z
Stitching SAEs of different sizes 2024-07-13T17:19:20.506Z
Neel Nanda's Shortform 2024-07-12T07:16:31.097Z
An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 2024-07-07T17:39:35.064Z
Attention Output SAEs Improve Circuit Analysis 2024-06-21T12:56:07.969Z
SAEs Discover Meaningful Features in the IOI Task 2024-06-05T23:48:04.808Z
Mechanistic Interpretability Workshop Happening at ICML 2024! 2024-05-03T01:18:26.936Z
Transcoders enable fine-grained interpretable circuit analysis for language models 2024-04-30T17:58:09.982Z
Refusal in LLMs is mediated by a single direction 2024-04-27T11:13:06.235Z
Improving Dictionary Learning with Gated Sparse Autoencoders 2024-04-25T18:43:47.003Z
How to use and interpret activation patching 2024-04-24T08:35:00.857Z
[Full Post] Progress Update #1 from the GDM Mech Interp Team 2024-04-19T19:06:59.185Z
[Summary] Progress Update #1 from the GDM Mech Interp Team 2024-04-19T19:06:17.755Z
AtP*: An efficient and scalable method for localizing LLM behaviour to components 2024-03-18T17:28:37.513Z
Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems 2024-03-13T17:09:17.027Z
We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To 2024-03-06T05:03:09.639Z
Attention SAEs Scale to GPT-2 Small 2024-02-03T06:50:22.583Z
Sparse Autoencoders Work on Attention Layer Outputs 2024-01-16T00:26:14.767Z
Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization 2024-01-14T02:06:00.290Z
Fact Finding: Do Early Layers Specialise in Local Processing? (Post 5) 2023-12-23T02:46:25.892Z
Fact Finding: How to Think About Interpreting Memorisation (Post 4) 2023-12-23T02:46:16.675Z
Fact Finding: Trying to Mechanistically Understanding Early MLPs (Post 3) 2023-12-23T02:46:05.517Z
Fact Finding: Simplifying the Circuit (Post 2) 2023-12-23T02:45:49.675Z
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1) 2023-12-23T02:44:24.270Z
How useful is mechanistic interpretability? 2023-12-01T02:54:53.488Z
Open Source Replication & Commentary on Anthropic's Dictionary Learning Paper 2023-10-23T22:38:33.951Z
[Paper] All's Fair In Love And Love: Copy Suppression in GPT-2 Small 2023-10-13T18:32:02.376Z
Paper Walkthrough: Automated Circuit Discovery with Arthur Conmy 2023-08-29T22:07:04.059Z
An Interpretability Illusion for Activation Patching of Arbitrary Subspaces 2023-08-29T01:04:18.688Z
Mech Interp Puzzle 2: Word2Vec Style Embeddings 2023-07-28T00:50:00.297Z
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla 2023-07-20T10:50:58.611Z
Tiny Mech Interp Projects: Emergent Positional Embeddings of Words 2023-07-18T21:24:41.990Z
Mech Interp Puzzle 1: Suspiciously Similar Embeddings in GPT-Neo 2023-07-16T22:02:15.410Z
How to Think About Activation Patching 2023-06-04T14:17:42.264Z
Finding Neurons in a Haystack: Case Studies with Sparse Probing 2023-05-03T13:30:30.836Z

Comments

Comment by Neel Nanda (neel-nanda-1) on Martin Randall's Shortform · 2025-02-19T00:28:22.920Z · LW · GW

Idk, I personally feel near maxed out on spending money to increase my short term happiness (or at least, any ways coming to mind seem like a bunch of effort, like hiring a great personal assistant), and so the only reason to care about keeping it around is saving it for future use. I would totally be spending more money on myself now if I thought it would actually improve my life

Comment by Neel Nanda (neel-nanda-1) on AGI Safety & Alignment @ Google DeepMind is hiring · 2025-02-18T11:06:58.067Z · LW · GW

In my incredibly biased opinion, the GDM AGI safety team is great and an effective place to work on reducing AI x-risk, and I would love to get applications from people here

Comment by Neel Nanda (neel-nanda-1) on Martin Randall's Shortform · 2025-02-17T19:05:57.777Z · LW · GW

On the other hand, if you have shorter timelines and higher P(Doom), the value of saving for retirement becomes much lower, which means that if you earn an income notably higher than your needs, the cost of cryonics is much lower, if you don't otherwise have valuable things to spend money on that get you value right now.

Comment by Neel Nanda (neel-nanda-1) on William_S's Shortform · 2025-02-17T07:49:02.143Z · LW · GW

I was also thinking recently that I would love this to exist! If I ever had the time I was going to try hacking it together in cursor

Comment by Neel Nanda (neel-nanda-1) on MATS Applications + Research Directions I'm Currently Excited About · 2025-02-17T04:28:00.905Z · LW · GW

Huh, seems to be working for me. What do you see when you click on it?

tinyurl.com/neel-mats-app

Comment by Neel Nanda (neel-nanda-1) on Gary Marcus now saying AI can't do things it can already do · 2025-02-09T18:52:31.701Z · LW · GW

I think it's just not worth engaging with his claims about the limits of AI, he's clearly already decided on his conclusion

Comment by Neel Nanda (neel-nanda-1) on Tips and Code for Empirical Research Workflows · 2025-02-09T09:25:51.550Z · LW · GW

Control space

Comment by Neel Nanda (neel-nanda-1) on Refusal in LLMs is mediated by a single direction · 2025-02-03T00:01:07.150Z · LW · GW

For posterity, this turned out to be a very popular technique for jailbreaking open-source LLMs - see this list of the 2000+ "abliterated" models on HuggingFace (abliteration is a mild variant of our technique that someone coined shortly after; I think the main difference is that you do a bit of DPO after ablating the refusal direction to fix any issues introduced?). I don't actually know why people prefer abliteration to just finetuning, but empirically people use it, which is good enough for me to call it beating baselines on some metric.
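For readers who haven't seen the underlying operation, here's a minimal sketch of the directional ablation step (projecting the refusal direction out of the residual stream); the tensor names and how you'd obtain `refusal_dir` are illustrative assumptions, not code from the paper or any abliteration repo.

```python
import torch

def ablate_direction(resid: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of the residual stream along `direction`."""
    d = direction / direction.norm()          # unit refusal direction, shape [d_model]
    coeffs = resid @ d                        # projection coefficients, shape [batch, seq]
    return resid - coeffs.unsqueeze(-1) * d   # subtract the parallel component

# Toy usage: resid stands in for residual stream activations at some layer;
# refusal_dir stands in for a direction extracted from harmful-vs-harmless activation differences.
resid = torch.randn(2, 5, 16)        # [batch, seq, d_model]
refusal_dir = torch.randn(16)        # [d_model]
clean = ablate_direction(resid, refusal_dir)
```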

Comment by Neel Nanda (neel-nanda-1) on Tail SP 500 Call Options · 2025-01-25T14:14:05.405Z · LW · GW

Interesting. Does anyone know what the counterparty risk is like here? Eg, am I gambling on the ETF continuing to be provided, the option market maker not going bust, the relevant exchange continuing to exist, etc. (the first and third generally seem like reasonable bets, but in a short timelines world everything is high variance...)

Comment by Neel Nanda (neel-nanda-1) on SAEBench: A Comprehensive Benchmark for Sparse Autoencoders · 2025-01-25T12:20:14.445Z · LW · GW

Yeah, if you're doing this, you should definitely precompute and save activations

Comment by Neel Nanda (neel-nanda-1) on Tips and Code for Empirical Research Workflows · 2025-01-24T10:58:57.047Z · LW · GW

I've been really enjoying voice to text + LLMs recently, via a great Mac App called Super Whisper (which can work with local speech to text models, so could also possibly be used for confidential stuff) - combining Super Whisper and Claude and Cursor means I can just vaguely ramble at my laptop about what experiments should happen and they happen, it's magical!

Comment by Neel Nanda (neel-nanda-1) on Some lessons from the OpenAI-FrontierMath debacle · 2025-01-21T10:58:12.653Z · LW · GW

I agree that OpenAI training on Frontier Math seems unlikely, and not in their interests. The thing I find concerning is that having high quality evals is very helpful for finding capabilities improvements - ML research is all about trying a bunch of stuff and seeing what works. As benchmarks saturate, you want new ones to give you more signal. If Epoch have a private benchmark they only apply to new releases, this is fine, but if OpenAI can run it whenever they want, this is plausibly fairly helpful for making better systems faster, since this makes hill climbing a bit easier.

Comment by Neel Nanda (neel-nanda-1) on Tips and Code for Empirical Research Workflows · 2025-01-21T01:00:20.974Z · LW · GW

This looks extremely comprehensive and useful, thanks a lot for writing it! Some of my favourite tips (like clipboard managers and rectangle) were included, which is always a good sign. And I strongly agree with "Cursor/LLM-assisted coding is basically mandatory".

I passed this on to my mentees - not all of this transfers to mech interp, in particular the time between experiments is often much shorter (eg a few minutes, or even seconds) and often almost an entire project is in de-risking mode, but much of it transfers. And the ability to get shit done fast is super important

Comment by Neel Nanda (neel-nanda-1) on Jonathan Claybrough's Shortform · 2025-01-19T13:23:29.310Z · LW · GW

This seems fine to me (you can see some reasons I like Epoch here). My understanding is that most Epoch staff are concerned about AI risk, though they tend towards longer timelines and maybe lower p(doom) than many in the community, and they aren't exactly trying to keep this secret.

Your argument rests on an implicit premise that Epoch talking about "AI is risky" on their podcast is important, eg because it'd change the minds of some listeners. This seems fairly unlikely to me - it seems like a very inside-baseball podcast, mostly listened to by people who are already aware of AI risk arguments and who likely already see Epoch as somewhat part of the AI risk-concerned community. And, generally, I don't think that all media produced by AI-risk-concerned people needs to mention that AI risk is a big deal - that just seems annoying and preachy. I see Epoch's impact story as informing people of where AI is likely to go and what's likely to happen, and this works fine even if they don't explicitly discuss AI risk.

Comment by Neel Nanda (neel-nanda-1) on AI Timelines · 2025-01-09T15:54:26.375Z · LW · GW

I don't know much about CTF specifically, but based on my maths exam/olympiad experience I predict that there's a lot of tricks to go fast (common question archetypes, saved code snippets, etc) that will be top of mind for people actively practicing, but not for someone with a lot of domain expertise who doesn't explicitly practice CTF. I also don't know how important speed is for being a successful cyber professional. They might be able to get some of this speed up with a bit of practice, but I predict by default there's a lot of room for improvement.

Comment by Neel Nanda (neel-nanda-1) on (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser · 2025-01-08T17:32:57.773Z · LW · GW

Tagging @philh @bilalchughtai @eventuallyalways @jbeshir in case this is relevant to you (though pooling money to get GWWC interested in helping may make more sense, if it can enable smaller donors and has lower fees)

Comment by Neel Nanda (neel-nanda-1) on (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser · 2025-01-08T17:30:57.996Z · LW · GW

For anyone considering large-ish donations (in the thousands), there are several ways to do this in general for US non-profits, as a UK taxpayer. (Several of these also work for people who pay tax in both the US and UK)

The one I'd recommend here is using the Anglo-American Charity - you can donate to them (a UK charity) tax-deductibly and they'll forward it to a US non-profit. I hear from a friend that they're happy to forward it to every.org, so this should be easy here. The main annoying thing is fees - for amounts below £15K it's max(4%, £250) (so 4% above £6,250, and a flat fee of £250 below).

Fee structure: 4% on donations under £15k, 3% on gifts between £15,001 and £50,000, and 2% on gifts over £50k. Minimum fee £250. Minimum gift £1,000.

But this is still a major saving, even in the low thousands, especially if you're in a high tax bracket. Though it may make more sense for you to donate to another charity.

Another option is the Charities Aid Foundation's Donor Advised Gift, which is a similar deal with worse fees: max(4%, £400) on amounts below £150K.
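To make the fee arithmetic concrete, here's a minimal sketch using the figures as quoted above (purely illustrative - check the charities' own pages before relying on it):

```python
def anglo_american_fee(amount_gbp: float) -> float:
    """Anglo-American Charity: tiered percentage, £250 minimum fee (minimum gift £1,000)."""
    if amount_gbp < 15_000:
        rate = 0.04
    elif amount_gbp <= 50_000:
        rate = 0.03
    else:
        rate = 0.02
    return max(rate * amount_gbp, 250)

def caf_donor_advised_gift_fee(amount_gbp: float) -> float:
    """Charities Aid Foundation Donor Advised Gift: 4%, £400 minimum, for amounts below £150K."""
    return max(0.04 * amount_gbp, 400)

# eg a £5,000 donation costs £250 vs £400 in fees; a £30,000 donation costs £900 vs £1,200
print(anglo_american_fee(5_000), caf_donor_advised_gift_fee(5_000))
print(anglo_american_fee(30_000), caf_donor_advised_gift_fee(30_000))
```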

If you're a larger donor (eg £20K+) it may make sense to set up a donor advised fund, which will often let you donate to worldwide non-profits.

Comment by Neel Nanda (neel-nanda-1) on The Plan - 2024 Update · 2025-01-02T12:11:22.624Z · LW · GW

Sure, but I think that human cognition tends to operate at a level of abstraction above the configuration of atoms in a 3D environment. Like, "that is a chair" is a useful way to reason about an environment, while "that is a configuration of pixels that corresponds to a chair when projected at a certain angle in certain lighting conditions" must first be converted to "that is a chair" before anything useful can be done. Text just has a lot of useful preprocessing applied already and is far more compressed.

Comment by Neel Nanda (neel-nanda-1) on The Plan - 2024 Update · 2024-12-31T20:30:12.814Z · LW · GW

Strong +1, that argument didn't make sense to me. Images are a fucking mess - they're a grid of RGB pixels, of a 3D environment (interpreted through the lens of a camera) from a specific angle. Text is so clean and pretty in comparison, and has much richer meaning, and has a much more natural mapping to concepts we understand

Comment by Neel Nanda (neel-nanda-1) on The Plan - 2024 Update · 2024-12-31T20:25:38.493Z · LW · GW

Fwiw, this is not at all obvious to me, and I would weakly bet that larger models are harder to interpret (even beyond there just being more capabilities to study)

Comment by Neel Nanda (neel-nanda-1) on evhub's Shortform · 2024-12-29T19:46:21.420Z · LW · GW

I would be very surprised if it had changed for early employees. I considered the donation matching part of my compensation package (it effectively 2.5x'd the amount of equity, since it was a 3:1 match on half my equity), and it would be pretty norm-violating to retroactively reduce compensation.

Comment by Neel Nanda (neel-nanda-1) on evhub's Shortform · 2024-12-28T21:24:32.712Z · LW · GW

I gather that they changed the donation matching program for future employees, but the 3:1 match still holds for prior employees, including all early employees (this change happened after I left, when Anthropic was maybe 50 people?)

I'm sad about the change, but I think that any goodwill due to believing the founders have pledged much of their equity to charity is reasonable and not invalidated by the change

Comment by Neel Nanda (neel-nanda-1) on How to replicate and extend our alignment faking demo · 2024-12-25T21:08:50.431Z · LW · GW

Thanks a lot for sharing all this code and data, seems super useful for external replication and follow-on work. It might be good to link this post from the Github readme - I initially found the Github via the paper, but not this post, and I found this exposition in this post more helpful than the current readme

Comment by Neel Nanda (neel-nanda-1) on Sam Marks's Shortform · 2024-12-16T22:45:55.717Z · LW · GW

That's technically even more conditional, as the intervention (subtracting the parallel component) also depends on the residual stream. But yes. I think it's reasonable to lump these together though - orthogonalisation should also be fairly non-destructive unless the direction was present, while steering likely always has side effects.

Comment by Neel Nanda (neel-nanda-1) on Sam Marks's Shortform · 2024-12-16T06:03:45.238Z · LW · GW

Note that this is conditional SAE steering - if the latent doesn't fire, it's a no-op. So it's not that surprising that it's less damaging - a prompt is there on every input! It depends a lot on the performance of the encoder as a classifier though.
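To spell out what I mean by conditional SAE steering, here's a minimal sketch assuming a standard ReLU SAE with weights W_enc, b_enc, W_dec (all names illustrative, not from any particular codebase):

```python
import torch

def conditional_sae_steer(resid, W_enc, b_enc, W_dec, latent_idx, coeff):
    """Add the decoder direction for `latent_idx`, but only at positions where that latent fires."""
    latents = torch.relu(resid @ W_enc + b_enc)           # [batch, seq, n_latents]
    fires = (latents[..., latent_idx] > 0).unsqueeze(-1)  # [batch, seq, 1], the "condition"
    steer = coeff * W_dec[latent_idx]                     # [d_model] steering vector
    return torch.where(fires, resid + steer, resid)       # no-op wherever the latent is silent
```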

Comment by Neel Nanda (neel-nanda-1) on Remap your caps lock key · 2024-12-16T06:00:27.740Z · LW · GW

When do you use escape?

Comment by Neel Nanda (neel-nanda-1) on Zach Stein-Perlman's Shortform · 2024-12-12T16:19:00.532Z · LW · GW

It seems unlikely that OpenAI is truly following the "test the model" plan? They keep eg putting new experimental versions onto lmsys, presumably mostly due to different post-training, and it seems pretty expensive to be doing all the DC evals again on each new version (and I think it's pretty reasonable to assume that a bit of further post-training hasn't made things much more dangerous).

Comment by Neel Nanda (neel-nanda-1) on Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1) · 2024-12-07T01:58:06.547Z · LW · GW

I'm not super sure what I think of this project. I endorse the seed of the idea re "let's try to properly reverse engineer what representing facts in superposition looks like" and think this was a good idea ex ante. Ex post, I consider our results fairly negative, and have mostly concluded that this kind of thing is cursed and we should pursue alternate approaches to interpretability (eg transcoders). I think this is a fairly useful insight! But it's also a conclusion I reached from various other bits of data. Overall I think this was a fairly useful conclusion re updating away from ambitious mech interp and has had a positive impact on my future research, though it's harder to say if this impacted others (beyond the general sphere of people I mentor/manage).

I think the circuit analysis here is great: a decent case study of what high-quality circuit analysis looks like, one of the studies of factual recall I trust most (though I'm biased), and it introduced some new tricks that I think are widely useful, like using probes to understand when information is introduced vs signal-boosted, and using mechanistic probes to interpret activations without needing training data. However, I largely haven't seen much work build on this, beyond a few scattered examples, which suggests it hasn't been too impactful. I also think this project took much longer than it should have, which is a bit sad.

Though, this did get discussed in a 3Blue1Brown video, which is the most important kind of impact!

Comment by Neel Nanda (neel-nanda-1) on Finding Neurons in a Haystack: Case Studies with Sparse Probing · 2024-12-07T01:51:21.604Z · LW · GW

I really like this paper (though, obviously, am extremely biased). I don't think it was groundbreaking, but I think it was an important contribution to mech interp, and one of my favourite papers that I've supervised.

Superposition seems like an important phenomenon that affects our ability to understand language models. I think this paper provided some of the first evidence that it actually happens in language models, and of what it actually looks like. Thinking about eg why neurons detecting compound words (eg blood pressure) were unusually easy to represent in superposition, while "this text is in French" merited dedicated neurons, helped significantly clarify my understanding of superposition beyond what was covered in Toy Models of Superposition (discussed in Appendix A). I also just like having case studies and examples of phenomena in language models to think about, and have found some of the neuron families in this paper helpful to keep in mind when reasoning about other weirdnesses in LLMs. I largely think the results in this paper have stood the test of time.

Comment by Neel Nanda (neel-nanda-1) on Towards Monosemanticity: Decomposing Language Models With Dictionary Learning · 2024-12-07T01:45:24.232Z · LW · GW

Sparse autoencoders have been one of the most important developments in mechanistic interpretability in the past year or so, and significantly shaped the research of the field (including my own work). I think this is in substantial part due to Towards Monosemanticity, between providing some rigorous preliminary evidence that the technique actually worked, a bunch of useful concepts like feature splitting, and practical advice for training these well. I think that understanding what concepts are represented in model activations is one of the most important problems in mech interp right now. Though highly imperfect, SAEs seem the best current bet we have here, and I expect whatever eventually works to look at least vaguely like an SAE.

I have various complaints and caveats about the paper (that I may elaborate on in a longer review in the discussion phase), and pessimisms about SAEs, but I think this work remains extremely impactful and significantly net positive on the field, and SAEs are a step in the right direction.

Comment by Neel Nanda (neel-nanda-1) on A Longlist of Theories of Impact for Interpretability · 2024-12-06T10:57:27.489Z · LW · GW

How would you evade their tools?

Comment by Neel Nanda (neel-nanda-1) on Neel Nanda's Shortform · 2024-12-05T19:14:42.122Z · LW · GW

A tip for anyone on the ML job/PhD market - people will plausibly be quickly skimming your Google Scholar to get a "how impressive is this person / what is their deal" read (I do this fairly often), so I recommend polishing your Google Scholar if you have publications! It can make a big difference.

I have a lot of weird citable artefacts that confuse Google Scholar, so here are some tips I've picked up:

  • First, make a google scholar profile if you don't already have one!
    • Verify the email (otherwise it doesn't show up properly in search)
  • (Important!) If you are co-first author on a paper but not in the first position, indicate this by editing the names of all co-first authors to end in a *
    • You edit by logging in to the google account you made the profile with, going to your profile, clicking on the paper's name, and then editing the author's names
    • Co-first vs second author makes a big difference to how impressive a paper is, so you really want this to be clear!
  • Edit the venue of your work to be the most impressive place it was published, and include any notable awards from the venue (eg spotlight, oral, paper awards, etc).
    • You can edit this by clicking on the paper name and editing the journal field.
    • If it was a workshop, make sure you include the word workshop (otherwise it can appear deceptive).
    • See my profile for examples.
  • Hunt for lost citations: often papers have weirdly formatted citations, and Google Scholar gets confused and thinks they were different papers. You can often find these by clicking on the plus just below your profile picture, then "Add articles", and then clicking through the pages for anything that you wrote. Add all these papers, and then use the merge function to combine them into one paper (with a combined citation count).
    • Merge lets you choose which of the merged artefacts gets displayed
    • To merge: return to the main page, click the tick box next to the paper titles, then click merge at the top
    • Similar advice applies if you have eg a blog post that was later turned into a paper, and have citations for both
    • Another merging hack: if you have a weird artefact on your Google Scholar (eg a blog post or library) and you don't like how Google Scholar thinks it should be presented, you can manually add the citation in the format you like, then merge this with the existing citation, and display your new one
  • If you're putting citations on a CV, Semantic Scholar is typically better for numbers, as it updates more frequently than Google Scholar. Though it's worse at picking up on the existence of non-paper artefacts like a cited GitHub repo or blog post
  • Keep your affiliation/title up to date at the top

Comment by Neel Nanda (neel-nanda-1) on You should consider applying to PhDs (soon!) · 2024-12-02T21:51:40.250Z · LW · GW

Do you know what topics within AI Safety you're interested in? Or are you unsure and so looking for something that lets you keep your options open?

Comment by Neel Nanda (neel-nanda-1) on You should consider applying to PhDs (soon!) · 2024-12-01T19:33:39.256Z · LW · GW

+1 to the other comments, I think this is totally doable, especially if you can take time off work.

The hard part imo is letters of recommendation, especially if you don't have many people who've worked with you on research before. If you feel awkward about asking for letters of recommendation on short notice (if it helps, multiple people have asked me for these in the past week, so this is pretty normal), one thing that makes it lower effort for the letter writer is giving them a bunch of notes on specific things you did while working with them and what traits of yours this demonstrates, or, even better, offering to write a rough first draft letter for them to edit (try not to give very similar letters to all your recommenders though!).

Comment by Neel Nanda (neel-nanda-1) on The Big Nonprofits Post · 2024-11-29T22:46:47.682Z · LW · GW

Thanks a lot for the post! It's really useful to have so many charities and a bit of context in the same place when thinking about my own donations. I found it hard to navigate a post with so many charities, so I put this into a spreadsheet that lets me sort and filter the categories - hopefully this is useful to others too! https://docs.google.com/spreadsheets/d/1WN3uaQYJefV4STPvhXautFy_cllqRENFHJ0Voll5RWA/edit?gid=0#gid=0

Comment by Neel Nanda (neel-nanda-1) on Mechanistic Interpretability of Llama 3.2 with Sparse Autoencoders · 2024-11-24T14:43:39.985Z · LW · GW

Cool project! Thanks for doing it and sharing, great to see more models with SAEs

interpretability research on proprietary LLMs that was quite popular this year and great research papers by Anthropic[1][2], OpenAI[3][4] and Google Deepmind

I run the Google DeepMind mechanistic interpretability team, and just wanted to clarify that our work was not on proprietary closed-weight models, but instead on Gemma 2, as were our open-weight SAEs - Gemma 2 is about as open as Llama imo. We try to use open models wherever possible for these general reasons of good scientific practice, ease of replicability, etc. Though we couldn't open source the data, and didn't go to the effort of open sourcing the code, so I don't think they can be considered true open source. OpenAI did most of their work on GPT-2, and only did their large-scale experiment on GPT-4, I believe. All Anthropic work I'm aware of is on proprietary models, alas.

Comment by Neel Nanda (neel-nanda-1) on Open Source Replication of Anthropic’s Crosscoder paper for model-diffing · 2024-11-18T22:00:58.518Z · LW · GW

It's essentially training an SAE on the concatenation of the residual stream from the base model and the chat model. So, for each prompt, you run it through the base model to get a residual stream vector v_b, through the chat model to get a residual stream vector v_c, and then concatenate these to get a vector twice as long, and train an SAE on this (with some minor additional details that I'm not getting into)
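A minimal sketch of that input construction (random tensors stand in for real activations, and this ignores the additional details mentioned above):

```python
import torch

d_model, seq = 16, 8
v_b = torch.randn(seq, d_model)     # residual stream from the base model on some prompt
v_c = torch.randn(seq, d_model)     # residual stream from the chat model on the same prompt
x = torch.cat([v_b, v_c], dim=-1)   # [seq, 2 * d_model]: what the SAE is trained on

# The SAE itself is then standard, just over the doubled dimension, eg
# latents = relu(x @ W_enc + b_enc); recon = latents @ W_dec + b_dec, with a sparsity penalty.
```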

Comment by Neel Nanda (neel-nanda-1) on Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1) · 2024-10-21T14:40:54.408Z · LW · GW

This is somewhat similar to the approach of the ROME paper, which has been shown to not actually do fact editing, just inserting louder facts that drown out the old ones and maybe suppressing the old ones.

In general, the problem with optimising model behavior as a localisation technique is that you can't distinguish between something that truly edits the fact, and something which adds a new fact in another layer that cancels out the first fact and adds something new.

Comment by Neel Nanda (neel-nanda-1) on How I got 4.2M YouTube views without making a single video · 2024-10-14T20:07:09.723Z · LW · GW

Agreed, chance of success when cold emailing busy people is low, and spamming them is bad. And there are alternate approaches that may work better, depending on the person and their setup - some Youtubers don't have a manager or employees, some do. I also think being able to begin an email with "Hi, I run the DeepMind mechanistic interpretability team" was quite helpful here.

Comment by Neel Nanda (neel-nanda-1) on Mark Xu's Shortform · 2024-10-11T18:20:40.759Z · LW · GW

The high level claim seems pretty true to me. Come to the GDM alignment team, it's great over here! It seems quite important to me that all AGI labs have good safety teams

Thanks for writing the post!

Comment by Neel Nanda (neel-nanda-1) on MichaelDickens's Shortform · 2024-10-04T20:19:53.134Z · LW · GW

Huh, are there examples of right leaning stuff they stopped funding? That's new to me

Comment by Neel Nanda (neel-nanda-1) on Nathan Young's Shortform · 2024-09-24T13:39:27.807Z · LW · GW

+1. Concretely this means converting every probability p into odds p/(1-p), taking the geometric mean of those (multiply them and take the nth root), and then converting back from odds o to a probability o/(1+o).

Intuition pump: Person A says 0.1 and Person B says 0.9. This is symmetric: if we instead study the negation, they swap places, so any reasonable aggregation should give 0.5.

Geometric mean of probabilities does not - instead you get 0.3.

Arithmetic gets 0.5, but is bad for the other reasons you noted

Geometric mean of odds is sqrt(1/9 * 9) = 1, which maps to a probability of 0.5, while also eg treating low probabilities fairly
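For concreteness, a minimal sketch of this aggregation (the function name is mine):

```python
import math

def geo_mean_of_odds(probs):
    odds = [p / (1 - p) for p in probs]       # probabilities -> odds
    geo = math.prod(odds) ** (1 / len(odds))  # geometric mean of the odds
    return geo / (1 + geo)                    # odds -> probability

print(geo_mean_of_odds([0.1, 0.9]))   # 0.5, matching the intuition pump above
print(geo_mean_of_odds([0.01, 0.5]))  # ~0.09, still sensitive to the low estimate (arithmetic mean gives 0.255)
```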

Comment by Neel Nanda (neel-nanda-1) on Showing SAE Latents Are Not Atomic Using Meta-SAEs · 2024-09-22T09:47:41.537Z · LW · GW

Interesting thought! I expect there are systematic differences, though it's not quite obvious how. Your example seems pretty plausible to me. Meta-SAEs are also more incentivised to learn features which tend to split a lot, I think, as then they're useful for predicting many latents. Though ones that don't split may be useful as they entirely explain a latent that's otherwise hard to explain.

Anyway, we haven't checked yet, but I expect many of the results in this post would look similar for eg sparse linear regression over a smaller SAE's decoder. Re why meta-SAEs are interesting at all: they're much cheaper to train than a smaller SAE, and BatchTopK gives you more control over the L0 than you could easily get with sparse linear regression, which are some mild advantages, but you may have a small SAE lying around anyway. I see the interesting point of this post more as "SAE latents are not atomic, as shown by one method, but probably other methods would work well too".

Comment by Neel Nanda (neel-nanda-1) on quetzal_rainbow's Shortform · 2024-09-18T22:02:26.676Z · LW · GW

What's wrong with twitter as an archival source? You can't edit tweets (technically you can edit top level tweets for up to an hour, but this creates a new URL and old links still show the original version). Seems fine to just aesthetically dislike twitter though

Comment by Neel Nanda (neel-nanda-1) on Why I'm bearish on mechanistic interpretability: the shards are not in the network · 2024-09-13T17:38:06.577Z · LW · GW

To me, this model predicts that sparse autoencoders should not find abstract features, because those are shards, and should not be localisable to a direction in activation space on a single token. Do you agree that this is implied?

If so, how do you square that with eg all the abstract features Anthropic found in Sonnet 3?

Comment by Neel Nanda (neel-nanda-1) on Contra papers claiming superhuman AI forecasting · 2024-09-13T11:38:14.772Z · LW · GW

Thanks for making the correction!

Comment by Neel Nanda (neel-nanda-1) on OpenAI o1 · 2024-09-13T10:26:23.388Z · LW · GW

I expect there's lots of new forms of capabilities elicitation for this kind of model, which their standard framework may not have captured, and which requires more time to iterate on

Comment by Neel Nanda (neel-nanda-1) on Contra papers claiming superhuman AI forecasting · 2024-09-12T21:41:40.353Z · LW · GW

Thanks for the post!

sample five random users’ forecasts, score them, and then average

Are you sure this is how their bot works? I read this more as "sample five things from the LLM, and average those predictions". For Metaculus, the crowd is just given to you, right, so it seems crazy to sample users?

Comment by Neel Nanda (neel-nanda-1) on Zach Stein-Perlman's Shortform · 2024-09-08T10:11:02.654Z · LW · GW

Yeah, fair point, disagreement retracted

Comment by Neel Nanda (neel-nanda-1) on Pay Risk Evaluators in Cash, Not Equity · 2024-09-07T23:36:15.060Z · LW · GW

I think this is important to define anyway! (And likely pretty obvious.) This would create a lot more friction for someone to take on such a role though, or to move out of it.