Posts

Open Source Automated Interpretability for Sparse Autoencoder Features 2024-07-30T21:11:36.866Z
Ophiology (or, how the Mamba architecture works) 2024-04-09T19:31:09.975Z

Comments

Comment by SrGonao (srgonao) on Deceptive agents can collude to hide dangerous features in SAEs · 2024-12-13T08:29:46.771Z · LW · GW

Hi! I know that this post is now almost 5 months old, but I feel like I need to ask some clarifying questions and point out things about your methodology that I don't completely understand or agree with.

How do you source the sentences used for the scoring method? Are they all from top activations? This is not explicitly mentioned in the methodology section, although in the footnote you do say you use 3 high-activation and 3 low-activation sentences. Am I correct in understanding that there are no cases with zero activation?

Are the sentences shown individually or in batches?

I'm not sure I understand the reasoning behind your simulation scoring method and its validity. You reduced it to simulating activation at the sentence level rather than the token level, but you still simulated the full sentence. Why not use the "standard" simulation scoring? I assume it performs much worse than yours, as it normally does, but is there a specific reason?
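For reference, the "standard" simulation scoring I have in mind computes a correlation between the per-token activations a simulator predicts from the explanation and the feature's real activations. A minimal sketch (the function name and example values here are illustrative, not from the post):

```python
import numpy as np

def simulation_score(simulated, actual):
    """Pearson correlation between the per-token activations a simulator
    predicts from an explanation and the feature's real activations."""
    simulated = np.asarray(simulated, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Degenerate case: constant vectors have undefined correlation
    if simulated.std() == 0 or actual.std() == 0:
        return 0.0
    return float(np.corrcoef(simulated, actual)[0, 1])

# Example: simulated per-token activations vs. the real ones
sim = [0, 0, 8, 1, 0, 7]
act = [0, 1, 9, 0, 0, 8]
print(simulation_score(sim, act))  # close to 1 when the simulation tracks the feature
```

Sentence-level scoring collapses each sequence to a single number before comparison, which is what I understand your reduction to be doing.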

I'm afraid that with this scoring method the model just has to think one of the tokens is active to get a good score, and I'm not entirely convinced by your random-chance upper bound. What is the distribution of real scores (after normalization) that you use for "high" and "low" sentences? What is the score of a randomly chosen explanation? I think that should be presented as a baseline whenever a different scoring method is introduced. I expect a random explanation to score better than 4.9e-5.

The way this method is set up almost reduces to "detection", where you are just asking the model whether the explanation matches the activating sentence. Because of that, you actually want to show negative examples and not only positive ones, since models tend to say that all sentences activate, even for bad explanations.
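To make the point concrete, here is a hypothetical sketch of a balanced detection score, where the judge sees both activating (positive) and non-activating (negative) sentences. A judge that always answers "yes" then only reaches chance-level accuracy. All names here are illustrative, not from the post:

```python
def detection_score(judge, explanation, positives, negatives):
    """Fraction of correct yes/no judgments over a balanced set of
    activating (positive) and non-activating (negative) sentences."""
    correct = 0
    for sentence in positives:
        correct += judge(explanation, sentence) is True
    for sentence in negatives:
        correct += judge(explanation, sentence) is False
    return correct / (len(positives) + len(negatives))

# A degenerate judge that claims every sentence activates
always_yes = lambda explanation, sentence: True
print(detection_score(always_yes, "mentions of chess", ["a", "b"], ["c", "d"]))  # 0.5
```

With positives only, the same degenerate judge would score 1.0, which is exactly the failure mode I'm worried about.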

I do think the results are interesting. Giving good explanations is already complicated, and if you were able to do perfect steganography I doubt performance would take such a hit, so I think your results would probably hold even when using stricter scores.

Comment by SrGonao (srgonao) on Evaluating Sparse Autoencoders with Board Game Models · 2024-08-03T09:59:28.169Z · LW · GW

I don't know much about chess. Could it be that feature 172, which you are highlighting, is related to some kind of chess opening? The distribution of black pawns could reflect different states of the opening, and the positions of the black bishop and white knight could also correspond to different parts of that opening.