peter-lai

Posts

Proof-of-Concept Debugger for a Small LLM 2025-03-17T22:27:52.386Z

SAE regularization produces more interpretable models 2025-01-28T20:02:56.662Z

Peter Lai's Shortform 2025-01-25T19:41:33.057Z

Comments

Comment by Peter Lai (peter-lai) on SAE regularization produces more interpretable models · 2025-02-04T19:29:47.375Z · LW · GW

This adds quite a bit more. Code here if you're interested in taking a look at what I tried: https://github.com/peterlai/gpt-circuits/blob/main/experiments/regularization/train.py. My goal was to show that regularization is possible and to spark more interest in this general approach. Matthew Chen and @JoshEngels just released a paper describing a more practical approach that I hope to try out soon: https://x.com/match_ten/status/1886478581423071478. The remaining gap, imo, is in having the SAE features and model weights inform each other without needing to freeze one at a time.

Comment by Peter Lai (peter-lai) on SAE regularization produces more interpretable models · 2025-01-31T17:15:42.651Z · LW · GW

The original SAE is actually quite good, and, in my experiments with Gated SAEs, I'm using those values. For the purposes of framing this technique as a "regularization" technique, I needed to show that the model weights themselves are affected, which is why my graphs use metrics extracted from freshly trained SAE values.

Comment by Peter Lai (peter-lai) on SAE regularization produces more interpretable models · 2025-01-28T22:08:17.756Z · LW · GW

Yep, the graphs in this post reflect the values of features extracted by training a new SAE on the activations of the "regularized" weights.
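
For readers unfamiliar with the setup, here is a rough sketch of what scoring an SAE against a batch of model activations can look like. It is not the code from the post: the function names, the ReLU encoder, and the choice of metrics (mean squared reconstruction error and the average number of active features, often called L0) are assumptions made for illustration, and the training loop for the SAE itself is left out.

```python
def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode one activation vector into ReLU features, then reconstruct it.

    W_enc holds one weight row per SAE feature; W_dec holds one weight row
    per output dimension. Both layouts are illustrative assumptions.
    """
    features = [
        max(0.0, sum(xi * w for xi, w in zip(x, row)) + b)
        for row, b in zip(W_enc, b_enc)
    ]
    recon = [
        sum(f * w for f, w in zip(features, row)) + b
        for row, b in zip(W_dec, b_dec)
    ]
    return features, recon


def sae_metrics(batch, W_enc, b_enc, W_dec, b_dec):
    """Average reconstruction error (MSE) and active-feature count (L0)."""
    total_mse, total_l0 = 0.0, 0
    for x in batch:
        features, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
        total_mse += sum((a - r) ** 2 for a, r in zip(x, recon)) / len(x)
        total_l0 += sum(1 for f in features if f > 0)
    n = len(batch)
    return {"mse": total_mse / n, "l0": total_l0 / n}
```

In a comparison like the one described above, the same function would be run twice: once with an SAE trained on the baseline model's activations and once with an SAE trained on the regularized model's activations.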

User info

Mechanistic Interpretability Enthusiast
