Mechanistic Interpretability of Llama 3.2 with Sparse Autoencoders

post by PaulPauls · 2024-11-24T05:45:20.124Z · LW · GW · 0 comments

This is a link post for https://github.com/PaulPauls/llama3_interpretability_sae


I recently published a rather large side project of mine: an attempt to replicate, with the humble but open-source Llama 3.2-3B model, the mechanistic interpretability research on proprietary LLMs that was quite popular this year and produced great research papers from Anthropic[1][2], OpenAI[3][4], and Google DeepMind[5].

The project provides a complete end-to-end pipeline for training Sparse Autoencoders (SAEs) to interpret LLM features, from activation capture through SAE training, interpretation, and verification. All code, data, trained models, and detailed documentation are publicly available in an attempt to make this research as open as possible, though calling it an extensively documented personal project wouldn't be wrong either, in my opinion.
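To give a flavour of what such a pipeline trains, here is a minimal sketch of the standard sparse-autoencoder setup used in this line of work: an overcomplete linear encoder/decoder trained to reconstruct captured residual-stream activations under an L1 sparsity penalty. This is an illustrative assumption of the general technique, not the repo's actual implementation; the class names, dimensions, and hyperparameters below are hypothetical.

```python
# Minimal sparse autoencoder sketch (illustrative; not the repo's exact code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Overcomplete dictionary: d_hidden is typically several times d_model.
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # Sparse, non-negative feature activations.
        f = F.relu(self.encoder(x))
        # Reconstruction of the original activation vector.
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 5e-4):
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    recon = F.mse_loss(x_hat, x)
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity

# Hypothetical training step on a batch of captured activations [batch, d_model].
d_model, d_hidden = 3072, 8 * 3072  # assumed sizes, chosen for illustration only
sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

activations = torch.randn(32, d_model)  # placeholder for real captured activations
x_hat, f = sae(activations)
loss = sae_loss(activations, x_hat, f)
opt.zero_grad()
loss.backward()
opt.step()
```

In practice the captured activations come from a chosen layer of the LLM, and the learned feature directions are then interpreted and verified downstream; the L1 coefficient and dictionary width are the main knobs trading off reconstruction quality against sparsity.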

Since LessWrong has a strong focus on AI interpretability research, I thought some of you might find value in this open replication of that work. I'm happy to answer any questions about the methodology, results, or future directions.

  1. ^
  2. ^
  3. ^
  4. ^
  5. ^
