Posts

Neel Nanda's Shortform 2024-07-12T07:16:31.097Z
An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 2024-07-07T17:39:35.064Z
Attention Output SAEs Improve Circuit Analysis 2024-06-21T12:56:07.969Z
SAEs Discover Meaningful Features in the IOI Task 2024-06-05T23:48:04.808Z
Mechanistic Interpretability Workshop Happening at ICML 2024! 2024-05-03T01:18:26.936Z
Transcoders enable fine-grained interpretable circuit analysis for language models 2024-04-30T17:58:09.982Z
Refusal in LLMs is mediated by a single direction 2024-04-27T11:13:06.235Z
Improving Dictionary Learning with Gated Sparse Autoencoders 2024-04-25T18:43:47.003Z
How to use and interpret activation patching 2024-04-24T08:35:00.857Z
[Full Post] Progress Update #1 from the GDM Mech Interp Team 2024-04-19T19:06:59.185Z
[Summary] Progress Update #1 from the GDM Mech Interp Team 2024-04-19T19:06:17.755Z
AtP*: An efficient and scalable method for localizing LLM behaviour to components 2024-03-18T17:28:37.513Z
Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems 2024-03-13T17:09:17.027Z
We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To 2024-03-06T05:03:09.639Z
Attention SAEs Scale to GPT-2 Small 2024-02-03T06:50:22.583Z
Sparse Autoencoders Work on Attention Layer Outputs 2024-01-16T00:26:14.767Z
Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization 2024-01-14T02:06:00.290Z
Fact Finding: Do Early Layers Specialise in Local Processing? (Post 5) 2023-12-23T02:46:25.892Z
Fact Finding: How to Think About Interpreting Memorisation (Post 4) 2023-12-23T02:46:16.675Z
Fact Finding: Trying to Mechanistically Understanding Early MLPs (Post 3) 2023-12-23T02:46:05.517Z
Fact Finding: Simplifying the Circuit (Post 2) 2023-12-23T02:45:49.675Z
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1) 2023-12-23T02:44:24.270Z
How useful is mechanistic interpretability? 2023-12-01T02:54:53.488Z
Open Source Replication & Commentary on Anthropic's Dictionary Learning Paper 2023-10-23T22:38:33.951Z
[Paper] All's Fair In Love And Love: Copy Suppression in GPT-2 Small 2023-10-13T18:32:02.376Z
Paper Walkthrough: Automated Circuit Discovery with Arthur Conmy 2023-08-29T22:07:04.059Z
An Interpretability Illusion for Activation Patching of Arbitrary Subspaces 2023-08-29T01:04:18.688Z
Mech Interp Puzzle 2: Word2Vec Style Embeddings 2023-07-28T00:50:00.297Z
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla 2023-07-20T10:50:58.611Z
Tiny Mech Interp Projects: Emergent Positional Embeddings of Words 2023-07-18T21:24:41.990Z
Mech Interp Puzzle 1: Suspiciously Similar Embeddings in GPT-Neo 2023-07-16T22:02:15.410Z
How to Think About Activation Patching 2023-06-04T14:17:42.264Z
Finding Neurons in a Haystack: Case Studies with Sparse Probing 2023-05-03T13:30:30.836Z
Identifying semantic neurons, mechanistic circuits & interpretability web apps 2023-04-13T11:59:51.629Z
Othello-GPT: Reflections on the Research Process 2023-03-29T22:13:42.007Z
Othello-GPT: Future Work I Am Excited About 2023-03-29T22:13:26.823Z
Actually, Othello-GPT Has A Linear Emergent World Representation 2023-03-29T22:13:14.878Z
Attribution Patching: Activation Patching At Industrial Scale 2023-03-16T21:44:54.553Z
Paper Replication Walkthrough: Reverse-Engineering Modular Addition 2023-03-12T13:25:46.400Z
Mech Interp Project Advising Call: Memorisation in GPT-2 Small 2023-02-04T14:17:03.929Z
Mechanistic Interpretability Quickstart Guide 2023-01-31T16:35:49.649Z
200 COP in MI: Studying Learned Features in Language Models 2023-01-19T03:48:23.563Z
200 COP in MI: Interpreting Reinforcement Learning 2023-01-10T17:37:44.941Z
200 COP in MI: Image Model Interpretability 2023-01-08T14:53:14.681Z
200 COP in MI: Techniques, Tooling and Automation 2023-01-06T15:08:27.524Z
200 COP in MI: Analysing Training Dynamics 2023-01-04T16:08:58.089Z
200 COP in MI: Exploring Polysemanticity and Superposition 2023-01-03T01:52:46.044Z
200 COP in MI: Interpreting Algorithmic Problems 2022-12-31T19:55:39.085Z
200 COP in MI: Looking for Circuits in the Wild 2022-12-29T20:59:53.267Z
200 COP in MI: The Case for Analysing Toy Language Models 2022-12-28T21:07:03.838Z

Comments

Comment by Neel Nanda (neel-nanda-1) on Habryka's Shortform Feed · 2024-07-12T07:24:45.533Z · LW · GW

This is true. I signed a concealed non-disparagement when I left Anthropic in mid 2022. I don't have clear evidence this happened to anyone else (but that's not strong evidence of absence). More details here

Comment by Neel Nanda (neel-nanda-1) on Habryka's Shortform Feed · 2024-07-12T07:22:54.739Z · LW · GW

I can confirm that my concealed non-disparagement was very explicit that I could not discuss the existence or terms of the agreement; I don't see any way I could be misinterpreting this. (But I have now kindly been released from it!)

EDIT: It wouldn't massively surprise me if Sam just wasn't aware of its existence though

Comment by Neel Nanda (neel-nanda-1) on Neel Nanda's Shortform · 2024-07-12T07:16:32.409Z · LW · GW

In response to Habryka's shortform, I can confirm that I signed a concealed non-disparagement as part of my Anthropic separation agreement. I worked there for 6 months and left in mid-2022. I received a cash payment as part of that agreement, with nothing shady going on à la threatening previously earned compensation (though I had no equity to threaten). In hindsight I undervalued my ability to speak freely, and didn't more seriously consider that I could just decline to sign the separation agreement and walk away; I'm not sure what I would do if I were doing it again.

I asked Anthropic to release me from this after the comment thread started, and they have now released me from both the non-disparagement clause and the non-disclosure part, which was very nice of them. I would encourage anyone in a similar situation to reach out to hr[at]anthropic.com and legal[at]anthropic.com, though obviously I can't guarantee that they'll release everyone.

I'll take advantage of my newfound freedoms to say that...

Idk, I don't really have anything too disparaging to say (though I dislike the use of concealed non-disparagements in general and am glad they say they're stopping!). I'm broadly a fan of Anthropic, think their heart is likely in the right place and they're trying to do what's best for the world (though could easily be making the wrong calls), and would seriously consider returning in the right circumstances. I've recommended that several friends of mine accept offers to do safety and interp work there, and feel good about this (though would feel much more hesitant about recommending someone join a pure capabilities team there). My biggest critique is that I have concerns about their willingness to push the capabilities frontier and worsen race dynamics, and, while I can imagine reasonable justifications, I think they're undervaluing the importance of at least having clear public positions and rationales for this kind of thing, and for their clear shift in policies since Claude 1.0.

Comment by Neel Nanda (neel-nanda-1) on An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 · 2024-07-11T09:46:52.706Z · LW · GW

Thanks! That was copied from the previous post, and I think this is fair pushback, so I've hedged the claim to "one of the most"; does that seem reasonable?

I haven't deeply engaged enough with those three papers to know if they meet my bar for recommendation, so I've instead linked to your comment from the post

Comment by Neel Nanda (neel-nanda-1) on Fabien's Shortform · 2024-07-10T10:14:08.869Z · LW · GW

I found this comment very helpful, and also expected probing to be about as good, thank you!

Comment by Neel Nanda (neel-nanda-1) on An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 · 2024-07-09T13:17:04.521Z · LW · GW

Glad you liked the post!

I'm also pretty interested in combining steering vectors. I think a particularly promising direction is using SAE decoder vectors for this, as SAEs are designed to find feature vectors that independently vary and can be added.
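
To make that concrete, here's a minimal sketch (the SAE decoder matrix, feature indices, and scales below are all hypothetical, not from any particular trained SAE) of combining a few SAE decoder rows into a single steering vector and adding it to residual stream activations:

```python
import torch

d_model, n_features = 4096, 16384         # made-up sizes
W_dec = torch.randn(n_features, d_model)   # hypothetical SAE decoder: one row per feature

def combined_steering_vector(feature_ids, scales):
    # Each decoder row is a candidate feature direction; combine them with chosen scales
    return sum(scale * W_dec[idx] for idx, scale in zip(feature_ids, scales))

def apply_steering(resid, steering_vec):
    # Add the combined vector to residual stream activations of shape [batch, pos, d_model]
    return resid + steering_vec

vec = combined_steering_vector(feature_ids=[123, 4567], scales=[4.0, -2.0])
```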

I agree steering vectors are important as evidence for the linear representation hypothesis (though at this point I consider SAEs to be much superior as evidence, and think they're more interesting to focus on)

Comment by Neel Nanda (neel-nanda-1) on An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 · 2024-07-08T22:27:27.912Z · LW · GW

I'm not aware of any problems with it. I think it's a nice paper, but not really at my bar for important work (which is a really high bar, to be clear - fewer than half the papers in this post probably meet it)

Comment by Neel Nanda (neel-nanda-1) on An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 · 2024-07-07T19:32:31.598Z · LW · GW

Fair point, I'll add that in to the post. The main reason I recommend it so highly and prominently is that I think it builds valuable conceptual frameworks for reasoning about the pieces of a transformer, even if it somewhat overclaims on how far it can get on interpreting tiny attention-only models, and I think those broad intuitions still stand even after your critiques. Eg strict induction heads as an example of the kind of algorithm that can be implemented with attention, even if it's not fully faithful to the underlying model. But I agree that these are worthwhile caveats to have in mind when reading, and the paper shouldn't be blindly recommended.

Comment by Neel Nanda (neel-nanda-1) on Interpreting Preference Models w/ Sparse Autoencoders · 2024-07-06T02:47:52.748Z · LW · GW

This is a preference model trained on GPT-J I think, so my guess is that it's just very dumb and learned lots of silly features. I'd be very surprised if a ChatGPT preference model had the same issues when an SAE is trained on it.

Comment by Neel Nanda (neel-nanda-1) on Habryka's Shortform Feed · 2024-07-05T09:10:41.423Z · LW · GW

Note, since this is a new and unverified account, that Jack Clark (Anthropic co-founder) confirmed on Twitter that the parent comment is the official Anthropic position https://x.com/jackclarkSF/status/1808975582832832973

Comment by Neel Nanda (neel-nanda-1) on Habryka's Shortform Feed · 2024-07-04T21:31:35.646Z · LW · GW

EDIT: Anthropic have kindly released me personally from my entire concealed non-disparagement, not just made a specific safety exception. Their position on other employees remains unclear, but I take this as a good sign

If someone signed a non-disparagement agreement in the past and wants to raise concerns about safety at Anthropic, we welcome that feedback and will not enforce the non-disparagement agreement.

Thanks for this update! To clarify, are you saying that you WILL enforce existing non-disparagement agreements for everything apart from safety, but are specifically making an exception for safety?

this routine use of non-disparagement agreements, even in these narrow cases, conflicts with our mission

Given this part, I find this surprising. Surely if you think it's bad to ask future employees to sign non-disparagement agreements, you should also want to free past employees from them?

Comment by Neel Nanda (neel-nanda-1) on Leon Lang's Shortform · 2024-07-02T01:17:09.785Z · LW · GW

Strongly agreed, it's a complete game changer to be able to click on references in a PDF and see a popup

Comment by Neel Nanda (neel-nanda-1) on Habryka's Shortform Feed · 2024-06-30T22:34:49.111Z · LW · GW

Agreed, I think it's quite confusing as is

Comment by Neel Nanda (neel-nanda-1) on ryan_greenblatt's Shortform · 2024-06-28T21:06:13.932Z · LW · GW

This seems fantastic! Kudos to Anthropic

Comment by Neel Nanda (neel-nanda-1) on Refusal in LLMs is mediated by a single direction · 2024-06-25T18:19:19.364Z · LW · GW

As far as I'm aware, major open-source chat-tuned models like LLaMA are fully fine-tuned, not via a LoRA.

Comment by Neel Nanda (neel-nanda-1) on Fabien's Shortform · 2024-06-24T00:43:25.535Z · LW · GW

This was interesting, thanks! I really enjoy your short book reviews

Comment by Neel Nanda (neel-nanda-1) on Fabien's Shortform · 2024-06-21T23:57:18.839Z · LW · GW

I think it was an interesting paper, but this analysis and these predictions all seem extremely on point to me.

Comment by Neel Nanda (neel-nanda-1) on TurnTrout's shortform feed · 2024-06-21T13:44:38.341Z · LW · GW

Oh, that's great! Kudos to the authors for setting the record straight. I'm glad your work is now appropriately credited

Comment by Neel Nanda (neel-nanda-1) on Jacob Pfau's Shortform · 2024-06-16T18:44:41.274Z · LW · GW

Ah! That makes way more sense, thanks

Comment by Neel Nanda (neel-nanda-1) on Jacob Pfau's Shortform · 2024-06-16T10:10:47.262Z · LW · GW

Why was the second line of your 43 ASCII full of slashes? At that site I see pipes (and indeed GPT4 generates pipes). I do find it interesting that GPT4 can generate the appropriate spacing on the first line though, autoregressively! And if it does systematically recover the same word as you put into the website, that's pretty surprising and impressive

Comment by Neel Nanda (neel-nanda-1) on [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations · 2024-06-14T23:28:19.509Z · LW · GW

Thanks for the additional context, that seems reasonable

Comment by Neel Nanda (neel-nanda-1) on [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations · 2024-06-14T14:10:26.441Z · LW · GW

This is an important problem, thanks for working on it!

To check I've understood correctly, these results are unsurprising, right? You're basically asking the model to do the task of "detect scary questions and perform worse on those", which is well within what I predict frontier models can do.

The interesting question re sandbagging to me is whether models WILL sandbag, ie whether they can detect if they're being evaluated or not (which you do not seem to test) and whether they will realise unprompted that they SHOULD sandbag to achieve their goals; those seem like the hard and interesting questions here.

Am I missing something here? And were there results that you found particularly surprising? Or was the goal basically to show skeptics that the capabilities for sandbagging exist, even if the propensity may not yet exist?

Comment by Neel Nanda (neel-nanda-1) on yanni's Shortform · 2024-06-07T10:19:09.526Z · LW · GW

I'd be pretty surprised

Comment by Neel Nanda (neel-nanda-1) on Non-Disparagement Canaries for OpenAI · 2024-05-31T22:19:51.280Z · LW · GW

Fair point

Comment by Neel Nanda (neel-nanda-1) on Non-Disparagement Canaries for OpenAI · 2024-05-31T00:01:20.232Z · LW · GW

Geoffrey Irving (Research Director, AI Safety Institute)

Given the tweet thread Geoffrey wrote during the board drama, it seems pretty clear that he's willing to publicly disparage OpenAI. (I used to work with Geoffrey, but have no private info here)

Comment by Neel Nanda (neel-nanda-1) on I am the Golden Gate Bridge · 2024-05-29T13:48:14.802Z · LW · GW

Oh, that's great! Was that recently changed? I swear I looked shortly after release and it just showed me a job ad when I clicked on a feature...

Comment by Neel Nanda (neel-nanda-1) on Refusal in LLMs is mediated by a single direction · 2024-05-23T15:01:22.688Z · LW · GW

Thanks! Note that this work uses steering vectors, not SAEs, so the technique is actually really easy and cheap - I actively think this is one of the main selling points (you can jailbreak a 70B model in minutes, without any finetuning or optimisation). I am excited at the idea of seeing if you can improve it with SAEs though - it's not obvious to me that SAEs are better than steering vectors, though it's plausible.

I may take you up on the two hours offer, thanks! I'll ask my co-authors

Comment by Neel Nanda (neel-nanda-1) on Refusal in LLMs is mediated by a single direction · 2024-05-23T14:30:05.242Z · LW · GW

But it mostly seems like it would be helpful because it gives you well-tuned baselines to compare your results to. I don't think you have results that can cleanly be compared to well-established baselines?

If we compared our jailbreak technique to other jailbreaks on an existing benchmark like HarmBench and it did comparably well to, or even better than, SOTA techniques, would you consider this success at doing something useful on a real task?

Comment by Neel Nanda (neel-nanda-1) on EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024 · 2024-05-22T20:27:32.219Z · LW · GW

+1, I think the correct conclusion is "a16z are making bald-faced lies to major governments", not "a16z were misled by Anthropic hype".

Comment by Neel Nanda (neel-nanda-1) on Open Thread Spring 2024 · 2024-05-21T10:30:38.352Z · LW · GW

I only ever notice it on my own posts when I get a notification about it

Comment by Neel Nanda (neel-nanda-1) on MATS Winter 2023-24 Retrospective · 2024-05-14T23:44:07.509Z · LW · GW

I see this is strongly disagree voted - I don't mind, but I'd be curious for people to reply with which parts they disagree with! (Or at least disagree react to specific lines). I make a lot of claims in that comment, though I personally think they're all pretty reasonable. The one about not wanting inexperienced researchers to start orgs, or "alignment teams at scaling labs are good actually" might be spiciest?

Comment by Neel Nanda (neel-nanda-1) on OpenAI releases GPT-4o, natively interfacing with text, voice and vision · 2024-05-13T20:20:27.318Z · LW · GW

Might be mainly driven by an improved tokenizer.

I would be shocked if this were the main driver; they claim that English only has 1.1x fewer tokens, but seem to claim much bigger speed-ups.

Comment by Neel Nanda (neel-nanda-1) on MATS Winter 2023-24 Retrospective · 2024-05-12T21:51:02.291Z · LW · GW

(EDIT: I just saw Ryan posted a comment a few minutes before mine, I agree substantially with it)

As a Google DeepMind employee I'm obviously pretty biased, but this seems pretty reasonable to me, assuming it's about alignment/similar teams at those labs? (If it's about capabilities teams, I agree that's bad!)

I think the alignment teams generally do good and useful work, especially those in a position to publish on it. And it seems extremely important that whoever makes AGI has a world-class alignment team! And some kinds of alignment research can only really be done with direct access to frontier models. MATS scholars tend to be pretty early in their alignment research career, and I also expect frontier lab alignment teams to be a better place to learn technical skills, especially engineering, and to generally have higher talent density.

UK AISI/US AISI/METR seem like solid options for evals, but basically just work on evals, and Ryan says downthread that only 18% of scholars work on evals/demos. And I think it's valuable both for frontier labs to have good evals teams and for there to be good external evaluators (especially in government); I can see good arguments favouring either option.

44% of scholars did interpretability, where in my opinion the Anthropic team is clearly a fantastic option, and I like to think DeepMind is also a decent option, as is OpenAI. Apollo and various academic labs are the main other places you can do mech interp. So those career preferences seem pretty reasonable to me for interp scholars.

17% are on oversight/control, and for oversight I think you generally want a lot of compute and access to frontier models? I am less sure for control, and think Redwood is doing good work there, but as far as I'm aware they're not hiring.

This is all assuming that scholars want to keep working in the same field they did MATS for, which in my experience is often but not always true.

I'm personally quite skeptical of inexperienced researchers trying to start new orgs - starting a new org and having it succeed is really, really hard, and much easier with more experience! So people preferring to get jobs seems great by my lights

Comment by Neel Nanda (neel-nanda-1) on MATS Winter 2023-24 Retrospective · 2024-05-11T22:08:24.848Z · LW · GW

Note that the number of scholars is a much more important metric than the number of mentors when it comes to evaluating MATS resources, as scholars per mentor varies a bunch (eg over winter I had 10 scholars, which is much more than most mentors). Harder to evaluate from the outside though!

Comment by Neel Nanda (neel-nanda-1) on Refusal in LLMs is mediated by a single direction · 2024-05-11T19:45:54.818Z · LW · GW

Thanks, I'd be very curious to hear if this meets your bar for being impressed, or what else it would take! Further evidence:

  • Passing the Twitter test (for at least one user)
  • Being used by Simon Lerman, an author on Bad LLaMA (admittedly with the help of Andy Arditi, our first author), to jailbreak LLaMA3 70B to help create data for some red-teaming research (EDIT: rather than Simon choosing to fine-tune it, which he clearly knows how to do, being a Bad LLaMA author).

Comment by Neel Nanda (neel-nanda-1) on Mechanistic Interpretability Workshop Happening at ICML 2024! · 2024-05-11T10:13:27.420Z · LW · GW

Nnsight, pyvene, inseq, and torchlens are other libraries that come to mind that it would be good to discuss in a related work section. Also penzai in JAX.

Comment by Neel Nanda (neel-nanda-1) on Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1) · 2024-05-09T15:57:21.522Z · LW · GW

I hadn't seen the latter, thanks for sharing!

Comment by Neel Nanda (neel-nanda-1) on Refusal in LLMs is mediated by a single direction · 2024-05-06T11:54:59.557Z · LW · GW

Agreed, it seems less elegant. But one guy on huggingface did a rough plot of the cross-correlation, and it seems to show that the direction changes with layer: https://huggingface.co/posts/Undi95/318385306588047#663744f79522541bd971c919. Although perhaps we are missing something.

Idk. This shows that if you wanted to optimally get rid of refusal, you might want to do this. But, really, you want to balance removing refusal against not damaging the model. Probably many layers are just kinda irrelevant for refusal. Though really this argues that we're both wrong, and the most surgical intervention is deleting the direction from key layers only.

Comment by Neel Nanda (neel-nanda-1) on Refusal in LLMs is mediated by a single direction · 2024-05-05T13:19:57.592Z · LW · GW

Thanks! I'm personally skeptical of ablating a separate direction per block; it feels less surgical than a single direction everywhere, and we show that a single direction works fine for LLaMA3 8B and 70B.

The transformer lens library does not have a save feature :(

Note that you can just do torch.save(model.state_dict(), FILE_PATH) as with any PyTorch model.
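
For instance, a minimal sketch using standard PyTorch serialization plus TransformerLens's HookedTransformer.from_pretrained (the file path and model name are just illustrative):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
# ... edit the weights in place (e.g. remove a direction from them) ...
torch.save(model.state_dict(), "edited_model.pt")   # note: object first, then path

# Later, load the edited weights back into a fresh copy of the model
model2 = HookedTransformer.from_pretrained("gpt2")
model2.load_state_dict(torch.load("edited_model.pt"))
```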

Comment by Neel Nanda (neel-nanda-1) on Introducing AI-Powered Audiobooks of Rational Fiction Classics · 2024-05-04T21:57:05.056Z · LW · GW

Thanks for making these! How expensive is it?

Comment by Neel Nanda (neel-nanda-1) on Mechanistic Interpretability Workshop Happening at ICML 2024! · 2024-05-04T11:36:30.556Z · LW · GW

Makes sense! Sounds like a fairly good fit

It just seems intuitively like a natural fit: Everyone in mech interp needs to inspect models. This tool makes it easier to inspect models.

Another way of framing it: Try to write your paper in such a way that a mech interp researcher reading it says "huh, I want to go and use this library for my research". Eg give examples of things that were previously hard that are now easy.

Comment by Neel Nanda (neel-nanda-1) on Mechanistic Interpretability Workshop Happening at ICML 2024! · 2024-05-03T09:47:23.006Z · LW · GW

Looks relevant to me on a skim! I'd probably want to see some arguments in the submission for why this is useful tooling for mech interp people specifically (though being useful to non mech interp people too is a bonus!)

Comment by Neel Nanda (neel-nanda-1) on Transcoders enable fine-grained interpretable circuit analysis for language models · 2024-05-01T22:23:51.216Z · LW · GW

That's awesome, and insanely fast! Thanks so much, I really appreciate it

Comment by Neel Nanda (neel-nanda-1) on Transcoders enable fine-grained interpretable circuit analysis for language models · 2024-05-01T09:14:04.480Z · LW · GW

Nope to both of those, though I think both could be interesting directions!

Comment by Neel Nanda (neel-nanda-1) on Improving Dictionary Learning with Gated Sparse Autoencoders · 2024-05-01T01:03:47.189Z · LW · GW

Nah, I think it's pretty sketchy. I personally prefer mean ablation, especially for residual stream SAEs where zero ablation is super damaging. But even there I agree. Measuring the compute efficiency hit would be nice, though it's a pain to get the scaling laws precise enough.

For our paper this is irrelevant though, IMO, because we're comparing gated and normal SAEs, and I think this is just scaling by a constant? It's at least monotonic in CE loss degradation.
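
For context, here's a minimal sketch of the kind of CE-loss-based evaluation being referenced (assuming a TransformerLens HookedTransformer and an SAE module whose forward pass returns reconstructed activations; the hook name is illustrative, and this is not our paper's exact code): splice the SAE reconstruction into the residual stream and compare the resulting loss against the clean loss and an ablation baseline.

```python
import torch

def sae_ce_metrics(model, sae, tokens, hook_name="blocks.6.hook_resid_post"):
    """CE loss with the SAE spliced in, vs the clean model and a zero-ablation baseline."""

    def splice_sae(acts, hook):
        return sae(acts)               # replace activations with their SAE reconstruction

    def zero_abl(acts, hook):
        return torch.zeros_like(acts)  # zero-ablation baseline (mean ablation is an alternative)

    with torch.no_grad():
        clean    = model.run_with_hooks(tokens, return_type="loss").item()
        with_sae = model.run_with_hooks(tokens, return_type="loss",
                                        fwd_hooks=[(hook_name, splice_sae)]).item()
        ablated  = model.run_with_hooks(tokens, return_type="loss",
                                        fwd_hooks=[(hook_name, zero_abl)]).item()

    return {
        "ce_degradation": with_sae - clean,                          # raw CE loss hit from the SAE
        "loss_recovered": (ablated - with_sae) / (ablated - clean),  # 1.0 = perfect reconstruction
    }
```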

Comment by Neel Nanda (neel-nanda-1) on Refusal in LLMs is mediated by a single direction · 2024-04-29T18:33:16.185Z · LW · GW

I don't think we really engaged with that question in this post, so the following is fairly speculative. But I think there are some situations where this would be a superior technique, mostly low-resource settings where doing a backwards pass is prohibitive for memory reasons, or with a very tight compute budget. But yeah, this isn't a load-bearing claim for me; I still count it as a partial victory to find a novel technique that's a bit worse than fine-tuning, and think this is significantly better than prior interp work. Seems reasonable to disagree though, and say you need to be better or bust.

Comment by Neel Nanda (neel-nanda-1) on Refusal in LLMs is mediated by a single direction · 2024-04-29T13:30:38.501Z · LW · GW

+1 to Rohin. I also think "we found a cheaper way to remove safety guardrails from a model's weights than fine-tuning" is a real result (albeit the opposite of useful), though I would want to do more actual benchmarking before we claim too confidently that it's cheaper. I don't think it's a qualitative improvement over what fine-tuning can do, hence the hedging and saying "tentative".

Comment by Neel Nanda (neel-nanda-1) on Refusal in LLMs is mediated by a single direction · 2024-04-29T02:07:41.087Z · LW · GW

Thanks! Broadly agreed

For example, I think our understanding of Grokking in late 2022 turned out to be importantly incomplete.

I'd be curious to hear more about what you meant by this

Comment by Neel Nanda (neel-nanda-1) on Refusal in LLMs is mediated by a single direction · 2024-04-28T22:46:05.628Z · LW · GW

It was added recently and just included in a new release, so pip install transformer_lens should work now/soon (you want v1.16.0 I think); otherwise you can install from the GitHub repo.

Comment by Neel Nanda (neel-nanda-1) on Refusal in LLMs is mediated by a single direction · 2024-04-28T11:05:17.725Z · LW · GW

There's been a fair amount of work on activation steering and similar techniques, bearing on eg sycophancy and truthfulness, where you find the vector and inject it, eg Rimsky et al and Zou et al. It seems to work decently well. We found it hard to bypass refusal by steering and instead got it to work by ablation, which I haven't seen much elsewhere, but I could easily be missing references.
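
For readers unfamiliar with the distinction, here's a minimal sketch (assuming a TransformerLens-style hook API and an already-computed refusal direction; this is not the paper's exact code) of steering, which adds a scaled copy of the direction, versus directional ablation, which projects it out:

```python
import torch

def steer(acts, hook, direction, alpha=5.0):
    # Activation steering: add a scaled copy of the direction to the activations
    return acts + alpha * direction

def ablate_direction(acts, hook, direction):
    # Directional ablation: remove the component of the activations along the direction
    d = direction / direction.norm()
    return acts - (acts @ d).unsqueeze(-1) * d

# Hypothetical usage with a TransformerLens HookedTransformer and refusal_dir of shape [d_model]:
# from functools import partial
# model.run_with_hooks(
#     tokens,
#     fwd_hooks=[(f"blocks.{layer}.hook_resid_post", partial(ablate_direction, direction=refusal_dir))
#                for layer in range(model.cfg.n_layers)],
# )
```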