[Linkpost] Play with SAEs on Llama 3

post by Tom McGrath, Eric Ho (eh42), Dan Balsam (dan-balsam) · 2024-09-25T22:35:44.824Z · LW · GW · 2 comments

We (Goodfire) just put our research preview live: you can play with Llama 3 and use sparse autoencoders (SAEs) to read from and write to its internal activations. This is a linkpost for:

Taking research and turning it into something you can actually use and play with has been great. It's surprising how different it feels to iterate on something when you expect it to actually be used; I think it's definitely pushed the quality of what you can do with SAEs up a notch.
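For readers new to the idea, here is a rough sketch of what "reading" and "writing" activations with an SAE can mean. It is not the preview's implementation: the `ToySAE` class, the `steer` helper, the random weights, and the encoder convention (subtracting the decoder bias before encoding) are all assumptions made for illustration. Reading encodes an activation vector into non-negative feature activations. Writing sets one feature to a chosen value and decodes back, carrying the SAE's reconstruction error along so that only the chosen feature's direction changes.

```python
import random


class ToySAE:
    """Toy sparse autoencoder with random weights (hypothetical, illustration only)."""

    def __init__(self, d_model, n_features, seed=0):
        rng = random.Random(seed)
        # One encoder row and one decoder row (feature direction) per feature.
        self.W_enc = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
        self.W_dec = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        """'Read': f = ReLU(W_enc (x - b_dec) + b_enc). One common convention, assumed here."""
        centered = [xi - bi for xi, bi in zip(x, self.b_dec)]
        return [
            max(0.0, sum(w * c for w, c in zip(row, centered)) + b)
            for row, b in zip(self.W_enc, self.b_enc)
        ]

    def decode(self, f):
        """Reconstruct: x_hat = b_dec + sum_i f[i] * W_dec[i]."""
        out = list(self.b_dec)
        for fi, row in zip(f, self.W_dec):
            for j, w in enumerate(row):
                out[j] += fi * w
        return out


def steer(sae, x, feature, value):
    """'Write': set one feature to `value`, keeping the SAE's reconstruction error."""
    f = sae.encode(x)
    error = [xi - ri for xi, ri in zip(x, sae.decode(f))]
    f[feature] = value
    return [ri + ei for ri, ei in zip(sae.decode(f), error)]


# Usage: read the features of a made-up activation vector, then edit one of them.
sae = ToySAE(d_model=4, n_features=8)
x = [0.5, -1.0, 2.0, 0.1]
features = sae.encode(x)                       # read
boosted = steer(sae, x, 3, features[3] + 2.0)  # turn feature 3 up
ablated = steer(sae, x, 3, 0.0)                # ablate feature 3
```

In this sketch, turning a feature up by some amount shifts the activation by that amount times the feature's decoder direction, and writing back the value that was read leaves the activation unchanged up to floating-point error.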

2 comments

Comments sorted by top scores.

comment by Lao Mein (derpherpize) · 2024-09-26T08:01:17.488Z · LW(p) · GW(p)

Extremely impressive! I've been wanting something like this for a while.

comment by RogerDearnaley (roger-d-1) · 2024-10-17T15:54:21.135Z · LW(p) · GW(p)

Having already played with this a little, it's pretty amazing: the range of concepts you can find in the SAE, how clearly the autointerp has labelled them and how easy they are to find, and how effective they are (as long as you don't turn them up too much) are all really impressive. I can't wait to try a production model where you can set up sensors and alarms on features, clip or ablate them or wire them together at various layers, and so forth. It will also be really interesting to see how larger models compare.

I'd also love to start looking at jailbreaks with this and seeing what features a jailbreak induces in the residual stream. I suspect finding the emotional/situational manipulation elements will be pretty easy. I'm curious whether it will also show the 'confusion' effect of jailbreaks that read like confusing nonsense to a human as some form of confusion or noise, or whether those are also emotional/situational manipulation, just in a more noise-like adversarial format: comparable to adversarial attacks on image classifiers that look like noise to the human eye but actually activate internal features of the vision model effectively.