Mechanistic Interpretability of Llama 3.2 with Sparse Autoencoders

post by PaulPauls · 2024-11-24T05:45:20.124Z · LW · GW · 0 comments

This is a link post for https://github.com/PaulPauls/llama3_interpretability_sae


I recently published a rather large side project of mine: an attempt to replicate, with the humble but open-source Llama 3.2-3B model, the mechanistic interpretability research on proprietary LLMs that was quite popular this year and produced great research papers from Anthropic[1][2], OpenAI[3][4], and Google DeepMind[5].

The project provides a complete end-to-end pipeline for training Sparse Autoencoders (SAEs) to interpret LLM features, from activation capture through SAE training, interpretation, and verification. All code, data, trained models, and detailed documentation are publicly available in an attempt to make this research as open as possible, though calling it an extensively documented personal project wouldn't be wrong either, in my opinion.
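To give a flavour of what such a pipeline trains, here is a minimal sketch of the standard sparse-autoencoder setup used in this line of work: an overcomplete linear encoder/decoder trained to reconstruct captured residual-stream activations under an L1 sparsity penalty. This is an illustrative assumption of the general technique, not the repo's actual implementation; the class names, dimensions, and hyperparameters below are hypothetical.

```python
# Minimal sparse autoencoder sketch (illustrative; not the repo's exact code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Overcomplete dictionary: d_hidden is typically several times d_model.
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # Sparse, non-negative feature activations.
        f = F.relu(self.encoder(x))
        # Reconstruction of the original activation vector.
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 5e-4):
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    recon = F.mse_loss(x_hat, x)
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity

# Hypothetical training step on a batch of captured activations [batch, d_model].
d_model, d_hidden = 3072, 8 * 3072  # assumed sizes, chosen for illustration only
sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

activations = torch.randn(32, d_model)  # placeholder for real captured activations
x_hat, f = sae(activations)
loss = sae_loss(activations, x_hat, f)
opt.zero_grad()
loss.backward()
opt.step()
```

In practice the captured activations come from a chosen layer of the LLM, and the learned feature directions are then interpreted and verified downstream; the L1 coefficient and dictionary width are the main knobs trading off reconstruction quality against sparsity.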

Since LessWrong has a strong focus on AI interpretability research, I thought some of you might find value in this open replication of that work. I'm happy to answer any questions about the methodology, results, or future directions.

  1. ^
  2. ^
  3. ^
  4. ^
  5. ^
