Posts

Stitching SAEs of different sizes 2024-07-13T17:19:20.506Z
A Selection of Randomly Selected SAE Features 2024-04-01T09:09:49.235Z
SAE-VIS: Announcement Post 2024-03-31T15:30:49.079Z
Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders 2024-03-25T21:17:58.421Z
Understanding SAE Features with the Logit Lens 2024-03-11T00:16:57.429Z
Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders 2024-02-27T02:43:22.446Z
Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small 2024-02-02T06:54:53.392Z
Linear encoding of character-level information in GPT-J token embeddings 2023-11-10T22:19:14.654Z
Features and Adversaries in MemoryDT 2023-10-20T07:32:21.091Z
Joseph Bloom on choosing AI Alignment over bio, what many aspiring researchers get wrong, and more (interview) 2023-09-17T18:45:28.891Z
A Mechanistic Interpretability Analysis of a GridWorld Agent-Simulator (Part 1 of N) 2023-05-16T22:59:20.553Z
Decision Transformer Interpretability 2023-02-06T07:29:01.917Z

Comments

Comment by Joseph Bloom (Jbloom) on Interpreting Preference Models w/ Sparse Autoencoders · 2024-07-02T07:54:34.556Z · LW · GW

7B parameter PM

 

@Logan Riggs this link doesn't work for me. 

Comment by Joseph Bloom (Jbloom) on SAE feature geometry is outside the superposition hypothesis · 2024-06-25T10:38:20.381Z · LW · GW

Thanks for writing this up.  A few points:

- I generally agree with most of the things you're saying and am excited about this kind of work. I like that you endorse empirical investigations here and think there are just far fewer people doing these experiments than anyone thinks. 
- Structure between features seems like the under-dog of research agendas in SAE research (which I feel I can reasonably claim to have been advocating for in many discussions over the preceding months). Mainly I think it presents the most obvious candidate for reducing the description length issue with larger SAEs. 
- I'm working on a project looking into this (and am aware of several others) but I don't think this should deter people who are interested from playing around. It's fairly easy to get going on these projects using my library and neuronpedia.

For example, it seems hard to understand how a tree-like structure could explain circular features. 

Tree structure between features is easy to find, with hierarchical clustering providing a degree of insight into the feature space that is not achieved by other methods like U-MAP.  I would interpret this as a kind of "global structure" whereas day of the week geometry is probably more local. It seems totally plausible that a tree is a reasonable characterisation of structure at a high level without being a perfect characterisation. 

The days of the week/months of the year lie on a circle, in order. Let’s be clear about what the interesting finding is from Engels et al.: it’s not that all the days of the week have high cosine sim with each other, or even really that they live in a subspace, but that they are in order!

I think another part of the result here was that the PCA of the lower dimensional space spanned by the day of the week features was much clearer in showing the geometry than simply doing PCA over the decoder weights (see below). I double checked this just now and you can actually get the correct ordering just on the features but it's much less obvious what's happening (imo). If you look at these features, they also tend to fire on days of  multiple days of the week with different strengths. The lesson here is that co-occurence of feature may matter a lot in particular subspaces.

 Layer 7-GPT2 small. Decoder weight PCA on day of the features. Feature labels come from max activating examples. See dashboards here.

Comment by Joseph Bloom (Jbloom) on Claude 3.5 Sonnet · 2024-06-21T12:06:27.864Z · LW · GW

Maybe we should make fake datasets for this? Neurons often aren't that interpretable and we're still confused about SAE features a lot of the time. It would be nice to distinguish "can do autointerp | interpretable generating function of complexity x" from "can do autointerp". 

Comment by Joseph Bloom (Jbloom) on SAE sparse feature graph using only residual layers · 2024-05-25T16:36:52.091Z · LW · GW

SAEs are model specific. You need Pythia SAEs to investigate Pythia. I don't have a comprehensive list but you can look at the sparse autoencoder tag on LW for relevant papers.

Comment by Joseph Bloom (Jbloom) on Quick Thoughts on Scaling Monosemanticity · 2024-05-24T10:00:35.317Z · LW · GW

Thanks Joel. I appreciated this. Wish I had time to write my own version of this. Alas. 

Previously I’ve seen the rule of thumb “20-100 for most models”. Anthropic says:

We were saying this and I think this might be an area of debate in the community for a few reasons. It could be that the "true L0" is actually very high. It could be that low activating features aren't contributing much to your reconstruction and so aren't actually an issue in practice. It's possible the right L1 or  L0 is affected by model size, context length or other details which aren't being accounted for in these debates. A thorough study examining post-hoc removal of low activating or low norm features could help. FWIW, it's not obvious to me that L0 should be lower / higher and I think we should be careful not to cargo-cult the stat. Probably we're not at too much risk here since we're discussing this out in the open already. 

Having multiple different-sized SAEs for the same model seems useful. The dashboard shows feature splitting clearly. I hadn’t ever thought of comparing features from different SAEs using cosine similarity and plotting them together with UMAP.

Different SAEs, same activations. Makes sense since it's notionally the same vector space. Apollo did this recently when comparing e2e vs vanilla SAEs. I'd love someone to come up with better measures of U-MAP quality as the primary issue with them is the risk of arbitrariness. 

Neither of these plots seems great. They both suggest to me that these SAEs are “leaky” in some sense at lower activation levels, but in opposite ways:

This could be bad. Could also be that the underlying information is messy and there's interference or other weird things going on. Not obvious that it's bad measurement as opposed to messy phenomena imo. Trying to distinguish the two seems valuable. 

 

4. On Scaling

Yup.  Training simultaneously could be good. It's an engineering challenge. I would reimplement good proofs of concept that suggest this is feasible and how to do it. I'd also like to point out that this isn't the first time a science has had this issue. 

On some level I think this challenge directly parallels bioinformatics / gene sequencing. They needed a human genome project because it was expensive and ambitious and individual actors couldn't do it on their own. But collaborating is hard. Maybe EA in particular can get the ball rolling here faster than it might otherwise. The NDIF / Bau Lab might also be a good banner to line up behind. 

I didn’t notice many innovations here -- it was mostly scaling pre-existing techniques to a larger model than I had seen previously. The good news is that this worked well. The bad news is that none of the old challenges have gone away.

Agreed. I think the point was basically scale. Criticisms along the lines of "this is tackling the hard part of the problem or proving interp is actually useful" are unproductive if that wasn't the intention. Anthropic has 3 teams now and counting doing this stuff. They're definitely working on a bunch of harder / other stuff that maybe focuses on the real bottlenecks. 

Comment by Joseph Bloom (Jbloom) on Talent Needs of Technical AI Safety Teams · 2024-05-24T09:37:11.609Z · LW · GW

All young people and other newcomers should be made aware that on-paradigm AI safety/alignment--while being more tractable, feedbacked, well-resourced, and populated compared to theory--is also inevitably streetlighting https://en.wikipedia.org/wiki/Streetlight_effect. 

 

Half-agree. I think there's scope within field like interp to focus on things that are closer to the hard part of the problem or at least touch on robust bottlenecks for alignment agendas (eg: ontology identification). I do think there is a lot of diversity in people working in these more legible areas and that means there are now many people who haven't engaged with or understood the alignment problem well enough to realise where we might be suffering from the street light effect. 

Comment by Joseph Bloom (Jbloom) on SAE sparse feature graph using only residual layers · 2024-05-24T09:25:58.602Z · LW · GW

I think so, but expect others to object. I think many people interested in circuits are using attn and MLP SAEs and experimenting with transcoders and SAE variants for attn heads. Depends how much you care about being able to say what an attn head or MLP is doing or you're happy to just talk about features. Sam Marks at the Bau Lab is the person to ask.

Comment by Joseph Bloom (Jbloom) on Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders · 2024-05-08T11:55:41.765Z · LW · GW

Neuronpedia has an API (copying from a recent message Johnny wrote to someone else recently.):

"Docs are coming soon but it's really simple to get JSON output of any feature. just add "/api/feature/" right after "neuronpedia.org".for example, for this feature: https://neuronpedia.org/gpt2-small/0-res-jb/0
the JSON output of it is here: https://www.neuronpedia.org/api/feature/gpt2-small/0-res-jb/0
(both are GET requests so you can do it in your browser)note the additional "/api/feature/"i would prefer you not do this 100,000 times in a loop though - if you'd like a data dump we'd rather give it to you directly."

 

Feel free to join the OSMI slack and post in the Neuronpedia or Sparse Autoencoder channels if you have similar questions in the future :)  https://join.slack.com/t/opensourcemechanistic/shared_invite/zt-1qosyh8g3-9bF3gamhLNJiqCL_QqLFrA 

Comment by Joseph Bloom (Jbloom) on SAE-VIS: Announcement Post · 2024-03-31T18:39:24.214Z · LW · GW

I'm a little confused by this question. What are you proposing? 

Comment by Joseph Bloom (Jbloom) on Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders · 2024-03-27T21:52:40.570Z · LW · GW

Lots of thoughts. This is somewhat stream of consciousness as I happen to be short on time this week, but feel free to follow up again in the future:

  • Anthropic tested their SAEs on a model with random weights here and found that the results look noticeably different in some respects to SAEs trained on real models "The resulting features are here, and contain many single-token features (such as "span", "file", ".", and "nature") and some other features firing on seemingly arbitrary subsets of different broadly recognizable contexts (such as LaTeX or code)." I think further experiments like this which identify classes of features which are highly non-trivial, don't occur in SAEs trained on random models (or random models with a W_E / W_U from a real model) or which can be related to interpretable circuity would help. 
  • I should not that, to the extent that SAEs could be capturing structure in the data, the model might want to capture structure in the data too, so it's not super clear to me what you would observe that would distinguish SAEs capturing structure in the data which the model itself doesn't utilise. Working this out seems important. 
  • Furthermore, the embedding space of LLM's is highly structured already and since we lack good metrics, it's hard to say how precisely SAE's capture "marginal" structure over existing methods. So quantifying what we mean by structure seems important too. 
  • The specific claim that SAEs learn features which are combinations of true underlying features is a reasonable one given the L1 penalty, but I think it's far from obvious how we should think about this in practice. 
  • I'm pretty excited about deliberate attempts to understand where SAEs might be misleading or not capturing information well (eg: here or here). It seems like there are lots of technical questions that are slightly more low level that help us build up to this.  

So in summary: I'm a bit confused about what we mean here and think there are various technical threads to follow up on. Knowing which actually resolve this requires we try to define our terms here more thoroughly. 

Comment by Joseph Bloom (Jbloom) on Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders · 2024-03-26T21:31:17.977Z · LW · GW

Thanks for asking:

  1. Currently we load SAEs into my codebase here. How hard this is will depend on how different your SAE architecture/forward pass is from what I currently support. We're planning to support users / do this ourselves for the first n users and once we can, we'll automate the process. So feel free to link us to huggingface or a public wandb artifact. 
  2.  We run the SAEs over random samples from the same dataset on which the model was trained (with activations drawn from forward passes of the same length). Callum's SAE vis codebase has a demo where you can see how this works. 
  3. Since we're doing this manually, the delay will depend on the complexity on handling the SAEs and things like whether they're trained on a new model (not GPT2 small) and how busy we are with other people's SAEs or other features. We'll try our best and keep you in the loop. Ballpark is 1 -2 weeks not months. Possibly days (especially if the SAEs are very similar to those we are hosting already). We expect this to be much faster in the future. 

We've made the form in part to help us estimate the time / effort required to support SAEs of different kinds (eg: if we get lots of people who all have SAEs for the same model or with the same methodological variation, we can jump on that). 

Comment by Joseph Bloom (Jbloom) on Neuroscience and Alignment · 2024-03-25T09:58:09.008Z · LW · GW

It helps a little but I feel like we're operating at too high a level of abstraction. 

Comment by Joseph Bloom (Jbloom) on Neuroscience and Alignment · 2024-03-24T08:49:04.197Z · LW · GW

with the mech interp people where they think we can identify values or other high-level concepts like deception simply by looking at the model's linear representations bottom-up, where I think that'll be a highly non-trivial problem.

 

I'm not sure anyone I know in mech interp is claiming this is a non-trivial problem. 

Comment by Joseph Bloom (Jbloom) on Neuroscience and Alignment · 2024-03-24T08:47:07.627Z · LW · GW

biological and artificial neural-networks are based upon the same fundamental principles

 

I'm confused by this statement. Do we know this? Do we have enough of an understanding of either to say this? Don't get me wrong, there's some level on which I totally buy this. However, I'm just highly uncertain about what is really being claimed here. 

Comment by Joseph Bloom (Jbloom) on How to train your own "Sleeper Agents" · 2024-03-09T05:38:46.767Z · LW · GW

Depending on model size I'm fairly confident we can train SAEs and see if they can find relevant features (feel free to dm me about this).

Comment by Joseph Bloom (Jbloom) on Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT · 2024-03-06T16:53:17.700Z · LW · GW

Thanks for posting this! I've had a lot of conversations with people lately about OthelloGPT and I think it's been useful for creating consensus about what we expect sparse autoencoders to recover in language models. 

Maybe I missed it but:

  • What is the performance of the model when the SAE output is used in place of the activations?
  • What is the L0? You say 12% of features active so I assume that means 122 features are active.This seems plausibly like it could be too dense (though it's hard to say, I don't have strong intuitions here). It would be preferable to have a sweep where you have varying L0's, but similar explained variance. The sparsity is important since that's where the interpretability is coming from.  One thing worth plotting might be the feature activation density of your SAE features as compares to the feature activation density of the probes (on a feature density histogram). I predict you will have features that are too sparse to match your probe directions 1:1 (apologies if you address this and I missed this). 
  • In particular, can you point to predictions (maybe in the early game) where your model is effectively perfect and where it is also perfect with the SAE output in place of the activations at some layer? I think this is important to quantify as I don't think we have a good understanding of the relationship between explained variance of the SAE and model performance and so it's not clear what counts as a "good enough" SAE. 

I think a number of people expected SAEs trained on OthelloGPT to recover directions which aligned with the mine/their probe directions, though my personal opinion was that besides "this square is a legal move", it isn't clear that we should expect features to act as classifiers over the board state in the same way that probes do. 

This reflects several intuitions:

  1. At a high level, you don't get to pick the ontology. SAEs are exciting because they are unsupervised and can show us results we didn't expect. On simple toy models, they do recover true features, and with those maybe we know the "true ontology" on some level. I think it's a stretch to extend the same reasoning to OthelloGPT just because information salient to us is linearly probe-able. 
  2. Just because information is linearly probeable, doesn't mean it should be recovered by sparse autoencoders. To expect this, we'd have to have stronger priors over the underlying algorithm used by OthelloGPT. Sure, it must us representations which enable it to make predictions up to the quality it predicts, but there's likely a large space of concepts it could represent. For example, information could be represented by the model in a local or semi-local code or deep in superposition. Since the SAE is trying to detect representations in the model, our beliefs about the underlying algorithm should inform our expectations of what it should recover, and since we don't have a good description of the circuits in OthelloGPT, we should be more uncertain about what the SAE should find. 
  3. Separately, it's clear that sparse autoencoders should be biased toward local codes over semi-local / compositional codes due to the L1 sparsity penalty on activations. This means that even if we were sure that the model represented information in a particular way, it seems likely the SAE would create representations for variables like (A and B) and (A and B') in place of A even if the model represents A. However, the exciting thing about this intuition is it makes a very testable prediction about combinations of features likely combining to be effective classifiers over the board state. I'd be very excited to see an attempt to train neuron-in-a-haystack style sparse probes over SAE features in OthelloGPT for this reason.

Some other feedback:

  • Positive: I think this post was really well written and while I haven't read it in more detail, I'm a huge fan of how much detail you provided and think this is great. 
  • Positive: I think this is a great candidate for study and I'm very interested in getting "gold-standard" results on SAEs for OthelloGPT. When Andy and I trained them, we found they could train in about 10 minutes making them a plausible candidate for regular / consistent methods benchmarking. Fast iteration is valuable. 
  • Negative: I found your bolded claims in the introduction jarring. In particular "This demonstrates that current techniques for sparse autoencoders may fail to find a large majority of the interesting, interpretable features in a language model". I think this is overclaiming in the sense that OthelloGPT is not toy-enough, nor do we understand it well enough to know that SAEs have failed here, so much as they aren't recovering what you expect. Moreover, I think it would best to hold-off on proposing solutions here (in the sense that trying to map directly from your results to the viability of the technique encourages us to think about arguments for / against SAEs rather than asking, what do SAEs actually recover, how do neural networks actually work and what's the relationship between the two).
  • Negative: I'm quite concerned that tieing the encoder / decoder weights and not having a decoder output bias results in worse SAEs. I've found the decoder bias initialization to have a big effect on performance (sometimes) and so by extension whether or not it's there seems likely to matter. Would be interested to see you follow up on this. 

Oh, and maybe you saw this already but an academic group put out this related work: https://arxiv.org/abs/2402.12201  I don't think they quantify the proportion of probe directions they recover, but they do indicate recovery of all types of features that been previously probed for. Likely worth a read if you haven't seen it. 

Comment by Joseph Bloom (Jbloom) on Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT · 2024-03-06T16:02:26.865Z · LW · GW

I think we got similar-ish results. @Andy Arditi  was going to comment here to share them shortly. 

Comment by Joseph Bloom (Jbloom) on Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT · 2024-03-06T00:35:19.783Z · LW · GW

@LawrenceC  Nanda MATS stream played around with this as group project with code here: https://github.com/andyrdt/mats_sae_training/tree/othellogpt 

Comment by Joseph Bloom (Jbloom) on Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders · 2024-02-29T15:51:52.537Z · LW · GW

@Evan Anders "For each feature, we find all of the problems where that feature is active, and we take the two measurements of “feature goodness" <- typo? 

Comment by Joseph Bloom (Jbloom) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-28T23:40:11.283Z · LW · GW

added a link at the top.

Comment by Joseph Bloom (Jbloom) on Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders · 2024-02-28T16:58:42.911Z · LW · GW

My mental model is the encoder is working hard to find particular features and distinguish them from others (so it's doing a compressed sensing task) and that out of context it's off distribution and therefore doesn't distinguish noise properly. Positional features are likely a part of that but I'd be surprised if it was most of it. 

Comment by Joseph Bloom (Jbloom) on Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders · 2024-02-28T16:56:39.812Z · LW · GW

I've heard this idea floated a few times and am a little worried that "When a measure becomes a target, it ceases to be a good measure" will apply here. OTOH, you can directly check whether the MSE / variance explained diverges significantly so at least you can track the resulting SAE's use for decomposition. I'd be pretty surprised if an SAE trained with this objective became vastly more performant and you could check whether downstream activations of the reconstructed activations were off distribution. So overall, I'm pretty excited to see what you get!

Comment by Joseph Bloom (Jbloom) on Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders · 2024-02-28T16:48:13.709Z · LW · GW

problems

 

prompts*

Comment by Joseph Bloom (Jbloom) on Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders · 2024-02-27T17:11:45.022Z · LW · GW

This means they're somewhat problematic for OOD use cases like treacherous turn detection or detecting misgeneralization.

 

I kinda want to push back on this since OOD in behavior is not obviously OOD in the activations. Misgeneralization especially might be better thought of as an OOD environment and on-distribution activations? 

I think we should come back to this question when SAEs have tackled something like variable binding with SAEs. Right now it's hard to say how SAEs are going to help us understand more abstract thinking and therefore I think it's hard to say how problematic they're going to be for detecting things like a treacherous turn. I think this will depend on how how representations factor. In the ideal world, they generalize with the model's ability to generalize (Apologies for how high level / vague that idea is). 

Some experiments I'd be excited to look at:

  • If the SAE is trained on a subset of the training distribution, can we distinguish it being used to decompose activations on those data points off the training distribution?
  • How does that compare to an SAE trained on the whole training distribution from the model, but then looking at when the model is being pushed off distribution? 

I think I'm trying to get at - can we distinguish:

  • Anomalous activations. 
  • Anomalous data points. 
  • Anomalous mechanisms. 

Lots of great work to look forward to!

Comment by Joseph Bloom (Jbloom) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-20T15:22:38.616Z · LW · GW

Why do you want to refill and shuffle tokens whenever 50% of the tokens are used?

 

Neel was advised by the authors that it was important minimise batches having tokens from the same prompt. This approach leads to a buffer having activations from many different prompts fairly quickly. 

 

Is this just tokens in the training set or also the test set? In Neel's code I didn't see a train/test split, isn't that important?

I never do evaluations on tokens from prompts used in training, rather, I just sample new prompts from the buffer. Some library set aside a set of tokens to do evaluations on which are re-used. I don't currently do anything like this but it might be reasonable. In general, I'm not worried about overfitting. 

Also, can you track the number of epochs of training when using this buffer method (it seems like that makes it more difficult)?

Epochs in training makes sense in a data-limited regime which we aren't in. OpenWebText has way more tokens than we ever train any sparse autoencoder on so we're always on way less than 1 epoch. We never reuse the same activations when training, but may use more than one activation from the same prompt. 

Comment by Joseph Bloom (Jbloom) on Addressing Feature Suppression in SAEs · 2024-02-16T22:17:06.628Z · LW · GW

Awesome work! I'd be quite interested to know whether the benefits from this technique are equivalently significant with a larger SAE and also what the original perplexity was (when looking at the summary statistics table). I'll probably reimplement at some point. 

Also, kudos on the visualizations. Really love the color scales!

Comment by Joseph Bloom (Jbloom) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-15T04:25:23.388Z · LW · GW

On wandb, the dashboards were randomly sampled but we've since uploaded all features to Neuronpedia https://www.neuronpedia.org/gpt2-small/res-jb. The log sparsity is stored in the huggingface repo so you can look for the most sparse features and check if their dashboards are empty or not (anecdotally most dashboards seem good, beside the dead neurons in the first 4 layers).

Comment by Joseph Bloom (Jbloom) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-14T17:17:05.017Z · LW · GW

24, 576 prompts of length 128 = 3, 145, 728.

With features that fire less frequently this won't be enough, but for these we seemed to find some activations (if not highly activating) for all features. 

Comment by Joseph Bloom (Jbloom) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-10T01:12:15.670Z · LW · GW

Makes sense. Will set off some runs with longer context sizes and track this in the future.

Comment by Joseph Bloom (Jbloom) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-09T18:27:40.057Z · LW · GW

Ahhh I see. Sorry I was way too hasty to jump at this as the explanation. Your code does use the tied decoder bias (and yeah, it was a little harder to read because of how your module is structured). It is strange how assuming that bug seemed to help on some of the SAEs but I ran my evals over all your residual stream SAE's and it only worked for some / not others and certainly didn't seem like a good explanation after I'd run it on more than one. 

I've been talking to Logan Riggs who says he was able to load in my SAEs and saw fairly similar reconstruction performance to to me but that outside of the context length of 128 tokens, performance markedly decreases. He also mentioned your eval code uses very long prompts whereas mine limits to 128 tokens so this may be the main cause of the difference.  Logan mentioned you had discussed this with him so I'm guessing you've got more details on this than I have? I'll build some evals specifically to look at this in the future I think. 

Scientifically, I am fairly surprised about the token length effect and want to try training on activations from much longer context sizes now. I have noticed (anecdotally) that the number of features I get sometimes increases over the prompt so an SAE trained on activations from shorter prompts are plausibly going to have a much easier time balancing reconstruction and sparsity, which might explain the generally lower MSE / higher reconstruction. Though we shouldn't really compare between models and with different levels of sparsity as we're likely to be at different locations on the pareto frontier. 

One final note is that I'm excited to see whether performance on the first 128 tokens actually improves in SAEs trained on activations from > 128 token forward passes (since maybe the SAE becomes better in general). 
 

Comment by Joseph Bloom (Jbloom) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-07T20:21:19.849Z · LW · GW
  • MSE Losses were in the WandB report (screenshot below).
  • I've loaded in your weights for one SAE and I get very bad performance (high L0, high L1, and bad MSE Loss) at first. 
  • It turns out that this is because my forward pass uses a tied decoder bias which is subtracted from the initial activations and added as part of the decoder forward pass. AFAICT, you don't do this. 
  • To verify this, I added the decoder bias to the activations of your SAE prior to running a forward pass with my code (to effectively remove the decoder bias subtraction from my method) and got reasonable results. 
  • I've screenshotted the Towards Monosemanticity results which describes the tied decoder bias below as well. 

I'd be pretty interested in knowing if my SAEs seem good now based on your evals :) Hopefully this was the only issue. 

 


Comment by Joseph Bloom (Jbloom) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-07T16:26:41.945Z · LW · GW

Agreed, thanks so much! Super excited about what can be done here!

Comment by Joseph Bloom (Jbloom) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-06T20:34:44.548Z · LW · GW

I've run some of the SAE's through more thorough eval code this morning (getting variance explained with the centring and calculating mean CE losses with more batches). As far as I can tell the CE loss is not that high at all and the MSE loss is quite low. I'm wondering whether you might be using the wrong hooks? These are resid_pre so layer 0 is just the embeddings and layer 1 is after the first transformer block and so on. One other possibility is that you are using a different dataset? I trained these SAEs on OpenWebText. I don't much padding at all, that might be a big difference too. I'm curious to get to the bottom of this. 

One sanity check I've done is just sampling from the model when using the SAE to reconstruct activations and it seems to be about as good, which I think rules out CE loss in the ranges you quote above. 

For percent alive neurons a batch size of 8192 would be far too few to estimate dead neurons (since many neurons have a feature sparsity < 10**-3. 

You're absolutely right about missing the centreing in percent variance explained. I've estimated variance explained again for the same layers and get very similar results to what I had originally. I'll make some updates to my code to produce CE score metrics that have less variance in the future at the cost of slightly more train time. 

If we don't find a simple answer I'm happy to run some more experiments but I'd guess an 80% probability that there's a simple bug which would explain the difference in what you get. Rank order of most likely: Using the wrong activations, using datapoints with lots of padding, using a different dataset (I tried the pile and it wasn't that bad either). 

Comment by Joseph Bloom (Jbloom) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-06T16:11:54.391Z · LW · GW

Oh no. I'll look into this and get back to you shortly. One obvious candidate is that I was reporting CE for some batch at the end of training that was very small and so the statistics likely had high variance and the last datapoint may have been fairly low. In retrospect I should have explicitly recalculated this again post training. However, I'll take a deeper dive now to see what's up. 

Comment by Joseph Bloom (Jbloom) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-02T17:00:23.452Z · LW · GW

I'd be excited about reading about / or doing these kinds of experiments. My weak prediction is that low activating features are important in specific examples where nuance matters and that what we want is something like an "adversarially robust SAE" which might only be feasible with current SAE methods on a very narrow distribution. 

A mini experiment I did which motivates this: I did an experiment with an SAE at the residual stream where I looked at the attention pattern of an attention head immediately following the head as function of k, where we take the top-k SAE features in the reconstruction. I found that if the head was attending to "Mary" in the original forward pass (and not "John"), then a k of 3 was good enough to have it attend to Mary and not John. But if I replaced John with Martha, the minimum k such that the head attended to Mary increased. 

Comment by Joseph Bloom (Jbloom) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-02T16:37:22.555Z · LW · GW

Unless my memory is screwing up the scale here, 0.3 CE Loss increase seems quite substantial? A 0.3 CE loss increase on the pile is roughly the difference between Pythia 410M and Pythia 2.8B.

Thanks for raising this! I had wanted to find a comparison in terms of different model performances to help me quantify this so I'm glad to have this as a reference.

And do I see it right that this is the CE increase maximum for adding in one SAE, rather than all of them at the same time? So unless there is some very kind correlation in these errors where every SAE is failing to reconstruct roughly the same variance, and that variance at early layers is not used to compute the variance SAEs at later layers are capturing, the errors would add up? Possibly even worse than linearly? What CE loss do you get then?

Have you tried talking to the patched models a bit and compared to what the original model sounds like? Any discernible systematic differences in where that CE increase is changing the answers?

While I have explored model performance with SAEs at different layers, I haven't done so with more than one SAE or explored sampling from the model with the SAE. I've been curious about systematic errors induced by the SAE but a few brief experiments with earlier SAEs/smaller models didn't reveal any obvious patterns. I have once or twice looked at the divergence in the activations after an SAE has been added and found that errors in earlier layers propagated.

One thought I have on this is that if we take the analogy to DNA sequencing seriously, relatively minor errors in DNA sequencing make the resulting sequences useless. If you get one or two base pairs wrong then try to make bacteria express the printed gene (based on your sequencing) then you'll kill that bacteria. This gives me the intuition that I absolutely expect we could have fairly accurate measurements with some error and that the resulting error is large. 

To bring it back to what I suspect is the main point here:  We should amend the statement to say "Our reconstruction scores were pretty good as compared to our previous results". 

It bothers me quite a bit that SAEs don't recover performance better, but I think this is a fairly well defined and that the community can iterate on both via improvements to SAEs and looking for nearby alternatives. For example, I'm quite excited to experiment with any alternative architectures/training procedures that come out of the theory of computation in superposition line of work.

One productive direction inspired by thinking of this as sequencing is that we should have lots of SAEs trained on the same model and show that they get very similar results (to give us more confidence we have a better estimate of the true underlying features). It's standard in DNA/RNA/Protein sequencing to run methods many times over. I think once we see evidence that we get good results along those lines, we should be more interested in / raise our standards for model performance with reconstructed SAEs. 

Comment by Joseph Bloom (Jbloom) on "Does your paradigm beget new, good, paradigms?" · 2024-01-28T16:41:40.613Z · LW · GW

Thanks for writing this! This is an idea that I think is pretty valuable and one that comes up fairly frequently when discussing different AI safety research agendas.

I think that there's a possibly useful analogue of this which is useful from the perspective of being deep inside a cluster of AI safety research and wondering whether it's good. Specifically, I think we should ask "does the value of my current line of research hinge on us basically being right about a bunch of things or does much of the research value come from discovering all the places we are wrong?".

One reason this feels like an important variant to me is that when I speak to people skeptical about the area of research I've been working in, they often seem surprised that I'm very much in agreement with them about a number of issues. Still, I disagree with them that the solution is to shift focus, so much as to try to work how the one paradigm might need to shift into another.

Comment by Joseph Bloom (Jbloom) on My best guess at the important tricks for training 1L SAEs · 2023-12-21T10:55:16.415Z · LW · GW

Thanks. I've found this incredibly useful. This is something that I feel has been long overdue with SAE's! I think the value from advice + (detailed) results + code is something like 10x more useful than the way these insights tend to be reported!

Comment by Joseph Bloom (Jbloom) on Finding Sparse Linear Connections between Features in LLMs · 2023-12-09T12:32:12.363Z · LW · GW

Interesting! This is very cool work but I'd like to understand your metrics better. 
- "So we take the difference in loss for features (ie for a feature, we take linear loss - MLP loss)". What do you mean here? Is this the difference between the mean MSE loss when the feature is on vs not on?  
- Can you please report the L0's for each of the auto-encoders and the linear model as well as the next token prediction loss when using the autoencoder/linear model. These are important metrics on which my generally excitement hinges. (eg: if those are both great, I'm way more interested in results about specific features). 
- I'd be very interested in you can take a specific input, look at the features present and compare them between autoencoder/the linear model. This would be especially cool if you pick an example where ablating the MLP out causes the incorrect prediction so we know it's representing something important.
- Are you using a holdout dataset of eval tokens when measuring losses? Or how many tokens are you using to measure losses? 
- Have you plotted per token MSE loss vs l0 for each model? Do they look similar? Are there any outliers in that relationship? 

Comment by Joseph Bloom (Jbloom) on Testbed evals: evaluating AI safety even when it can’t be directly measured · 2023-11-16T10:14:42.796Z · LW · GW

Tesbeds

missing "t"

Comment by Joseph Bloom (Jbloom) on Linear encoding of character-level information in GPT-J token embeddings · 2023-11-11T04:35:02.179Z · LW · GW

Fixed, thanks!

Comment by Joseph Bloom (Jbloom) on [Paper] All's Fair In Love And Love: Copy Suppression in GPT-2 Small · 2023-10-26T19:32:50.328Z · LW · GW

Cool paper. I think the semantic similarity result is particularly interesting.

As I understand it you've got a circuit  that wants to calculate something like Sim(A,B), where A and B might have many "senses" aka: features but the Sim might not be a linear function of each of thes Sims across all senses/features. 

So for example, there are senses in which "Berkeley" and "California" are geographically related, and there might be a few other senses in which they are semantically related but probably none that really matter for copy suppression. For this reason wouldn't expect the tokens of each of to have cosine similarity that is predictive of the copy suppression score.  This would only happen for really "mono-semantic tokens" that have only one sense (maybe you could test that). 

Moreover, there are also tokens which you might want to ignore when doing copy suppression (speculatively). Eg: very common words or punctuations (the/and/etc). 

I'd be interested if you have use something like SAE's to decompose the tokens into the underlying feature/s present at different intensities in each of these tokens (or the activations prior to the key/query projections). Follow up experiments could attempt to determine whether copy suppression could be better understood when the semantic subspaces are known. Some things that might be cool here:
- Show that some features are mapped to the null space of keys/queries in copy suppression heads indicating semantic senses / features that are ignored by copy suppression. Maybe multiple anti-induction heads compose (within or between layers) so that if one maps a feature to the null space, another doesn't (or some linear combination) or via a more complicated function of sets of features being used to inform suppression. 
- Similarly, show that the OV circuit is suppressing the same features/features you think are being used to determine semantic similarity. If there's some asymmetry here, that could be interesting as it would correspond to "I calculate A and B as similar by their similarity in the *california axis* but I suppress predictions of any token that has the feature for anywhere on the West Coast*).

I'm particularly excited about this because it might represent a really good way to show how knowing features informs the quality of mechanistic explanations. 

Comment by Joseph Bloom (Jbloom) on Trying to understand John Wentworth's research agenda · 2023-10-25T21:02:08.885Z · LW · GW

I'd be very interested in seeing these products and hearing about the use-cases / applications. Specifically, my prior experience at a startup leads me to believe that building products while doing science can be quite difficult (although there are ways that the two can synergise). 

I'd be more optimistic about someone claiming they'll do this well if there is an individual involved in the project who is both deeply familiar with the science and has build products before (as opposed to two people each counting on the other to have sufficient expertise they lack). 

A separate question I have is about how building products might be consistent with being careful about what information you make public. If you there are things that you don't want to be public knowledge, will there be proprietary knowledge not shared with users/clients? It seems like a non-trivial problem to maximize trust/interest/buy-in whilst minimizing clues to underlying valuable insights. 

Comment by Joseph Bloom (Jbloom) on Features and Adversaries in MemoryDT · 2023-10-22T20:18:39.056Z · LW · GW

Thanks Jay! (much better answer!) 

Comment by Joseph Bloom (Jbloom) on Features and Adversaries in MemoryDT · 2023-10-22T20:17:51.705Z · LW · GW

The first frame, apologies.  This is a detail of how we number trajectories that I've tried to avoid dealing with in this post. We left pad in a context windows of 10 timesteps so the first observation frame is S5. I've updated the text not to refer to S5. 

Comment by Joseph Bloom (Jbloom) on Features and Adversaries in MemoryDT · 2023-10-22T20:12:33.834Z · LW · GW

I'm not sure if it's interesting to me for alignment, since it's such a toy model.


Cruxes here are things like whether you think toy models are governed by the same rules as larger models, whether studying them helps you understand those general principles and whether understanding those principles is valuable. This model in particular shares many similarities in architecture and training to LLMs and is over a million parameters so it's not nearly as much of a toy model as others and we have particular reasons to expect insights to transfer (both transformers / next token predictors). 

What do you think would change when trying to do similar interpretability on less-toy models?

The recipe stays mostly the same but scale increases and you know less about the training distribution.

  • Features: Feature detection in LLM's via sparse auto-encoders seems highly tractable. There may be more features and you might have less of  a sense for the overall training distribution. Once you collapse latent space into features, this will go a long way to dealing with the curse of dimensionality with these systems. 
  • Training Data: We know much less about the training distribution of larger models (ie: what are the ground truth features, how to they correlate or anti-correlate). 
  • Circuits This investigation treats circuits like a black-box, but larger models will likely solve more complex tasks with more complicated circuitry. The cool thing about knowing the features is that you can get to fairly deep insights even without understanding the circuits (like showing which observations are effectively equivalent to the model). 

What would change about finding adversarial examples?

This is a very complicated/broad question. There's a number of ways you could approach this. I'd probably look at identifying critical features in the language model and see whether we can develop automatic techniques for flipping them. This could be done recursively if you are able to find the features most important for those features (etc.). Understanding why existing adversaries like jail-breaking techniques / initial affirmative responses work (mechanistically) might tell us a lot about how to automate more general search for adversaries. However, my guess is that the task of finding adversaries using a white-box approaches may be fairly tractable. The search space is much smaller once you know features and there are many search strategies that might work to flip features (possibly working recursively through features in each layer and guided by some regularization designed to keep adversaries naturalistic/plausible. 

Directly intervening on features seems like it might stay the same though.

This doesn't seem super obvious if features aren't orthogonal, or may exist in subspaces or manifolds rather than individual directions. The fact that this transition isn't trivial is one reason it would be better to understand some simple models very well (so that when we go to larger models, we're on surer scientific footing). 

Comment by Joseph Bloom (Jbloom) on Don't Dismiss Simple Alignment Approaches · 2023-10-07T07:59:25.516Z · LW · GW

My vibe from this post is something like "we're making on stuff that could be helpful so there's stuff to work on!" and this is a vibe I like. However, I suspect that for people who might not be as excited about these approaches, you're likely not touching on important cruxes (eg: do these approaches really scale? Are some agendas capabilities enhancing? Will these solve deceptive alignment or just corrigible alignment?)

I also think that if the goal is to actually make progress and not to maximize the number of people making progress or who feel like they're making progress, then engaging with those cruxes is important before people invest substantive energy (ie: beyond upskilling). However as a directional update for people who are otherwise pretty cynical, this seems like a good update.

Comment by Joseph Bloom (Jbloom) on Against Almost Every Theory of Impact of Interpretability · 2023-08-18T10:01:55.568Z · LW · GW

Strong disagree. Can’t say I’ve worked through the entire article in detail but wanted to chime in as one of the many of junior researchers investing energy in interpretability. Noting that you erred on the side of making arguments too strong. I agree with Richard about this being the wrong kind of reasoning for novel scientific research and with Rohin’s idea that we’re creating new affordances. I think generally MI is grounded and much closer to being a natural science that will progress over time and be useful for alignment, synergising with other approaches. I can't speak for Neel, but I suspect the original list was more about getting something out there than making many nuanced arguments, so I think it's important to steelman those kinds of claims / expand on them before responding. 

A few extra notes: 

The first point I want to address your endorsement of “retargeting the search” and finding the “motivational API” within AI systems which is my strongest motivator for working in interpretability.

This is interesting because this would be a way to not need to fully reverse engineer a complete model. The technique used in Understanding and controlling a maze-solving policy network seems promising to me. Just focusing on “the motivational API” could be sufficient.

I predict that methods like “steering vectors” are more likely to work in worlds where we make much more progress in understanding of neural networks. But steering vectors are relatively recent, so it seems reasonable to think that we might have other ideas soon that could be equally useful but may require progress more generally in the field.

We need only look to biology and medicine to see examples of imperfectly understood systems, which remain mysterious in many ways, and yet science has led us to impressive feats that might have been unimaginable years prior. For example, the ability in recent years to retarget the immune system to fight cancer. Because hindsight devalues science we take such technologies for granted and I think this leads to a general over-skepticism about fields like interpretability.

The second major point I wanted to address was this argument:

Determining the dangerousness of a feature is a mis-specified problem. Searching for dangerous features in the weights/structures of the network is pointless. A feature is not inherently good or bad. The danger of individual atoms is not a strong predictor of the danger of assembly of atoms and molecules. For instance, if you visualize the feature of layer 53, channel 127, and it appears to resemble a gun, does it mean that your system is dangerous? Or is your system simply capable of identifying a dangerous gun? The fact that cognition can be externalized also contributes to this point.

I agree that it makes little sense to think of a feature on it’s own as dangerous but I it sounds to me like you are making a point about emergence. If understanding transistors doesn’t lead to understanding computer software then why work so hard to understand transistors?

I am pretty partial to the argument that the kinds of alignment relevant phenomena in neural networks will not be accessible via the same theories that we’re developing today in mechanistic interpretability. Maybe these phenomena will exist in something analogous to a “nervous system” while we’re still understanding “biochemistry”. Unlike transistors and computers though, biochemistry is hugely relevant to understanding neuroscience.

Comment by Joseph Bloom (Jbloom) on Ten Levels of AI Alignment Difficulty · 2023-07-04T03:38:46.958Z · LW · GW

Thanks for writing this up. I really liked this framing when I first read about it but reading this post has helped me reflect more deeply on it. 

I’d also like to know your thoughts on whether Chris Olah’s original framing, that anything which advances this ‘present margin of safety research’ is net positive, is the correct response to this uncertainty.

I wouldn't call it correct or incorrect only useful in some ways and not others. Whether it's net positive may rely on whether it is used by people in cases where it is appropriate/useful. 

As an educational resource/communication tool, I think this framing is useful. It's often useful to collapse complex topics into few axes and construct idealised patterns, in this case a difficulty-distribution on which we place techniques by the kinds of scenarios where they provide marginal safety. This could be useful for helping people initially orient to existing ideas in the field or in governance or possibly when making funding decisions. 

However, I feel like as a tool to reduce fundamental confusion about AI systems, it's not very useful.  The issue is that many of the current ideas we have in AI alignment are based significantly on pre-formal conjecture that is not grounded in observations of real world systems (see the Alignment Problem from a Deep Learning Perspective).  Before we observe more advanced future systems, we should be highly uncertain about existing ideas. Moreover, it seems like this scale attempts to describe reality via the set of solutions which produce some outcome in it? This seems like an abstraction that is unlikely to be useful.

In other words, I think it's possible that this framing leads to confusion between the map and the territory, where the map is making predictions about tools that are useful in territory which we have yet to observe.

To illustrate how such an axis may be unhelpful if you were trying to think more clearly,  consider the equivalent for medicine. Diseases can be divided up into varying classes on difficulty to cure with corresponding research being useful for curing them. Cuts/Scrapes are self-mending whereas infections require corresponding antibiotics/antivirals, immune disorders and cancers are diverse and therefore span various levels of difficulties amongst their instantiations. It's not clear to me that biologists/doctors would find much use from conjecture on exactly how hard vs likely each disease is to occur, especially in worlds where you lack a fundamental understanding of the related phenomena. Possibly, a closer analogy would be trying to troubleshoot ways evolution can generate highly dangerous species like humans. 

I think my attitude here leads into more takes about good and bad ways to discuss which research we should prioritise but I'm not sure how to convey those concisely. Hopefully this is useful. 

Comment by Joseph Bloom (Jbloom) on [Research Update] Sparse Autoencoder features are bimodal · 2023-06-24T02:12:49.847Z · LW · GW

Hey Robert, great work! My focus isn't currently on this but I thought I'd mention that these trends might relate to some of the observations in the Finding Neurons in a Haystack paper. https://arxiv.org/abs/2305.01610.

If you haven't read the paper, the short version is they used sparse probing to find neurons which linearly encode variables like "is this python code" in a variety of models with varying size. 

The specific observation which I believe may be relevant:

"As models increase in size, representation sparsity increases on average, but different features obey different dynamics: some features with dedicated neurons emerge with scale, others split into finer grained features with scale, and many remain unchanged or appear somewhat randomly" 

I believe this accords with your observation that "Finding more features finds more high-MCS features, but finds even more low-MCS features". 

Maybe finding ways to directly compare approaches could support further use of either approach. 

Also, interesting to hear about using EMD over KL divergence. I hadn't thought about that!