Posts
Comments
Throughout this post, I kept thinking about Soul-Making dharma (which I'm familier with, but not very good at!)
AFAIK, it's about building up the skill of having a full body awareness (ie instead of the breath at the nose as an object, you place attention on the full body + some extra space, like your "aura") which gives you a much more complete information about the felt sense of different things. For example, when you think of different people, they have different "vibes" that come up as physical sense in the body which you can access more fully by paying attention to full body awareness.
The teachers then went on a lot about sacredness & beauty, which seemed most relevant to attunement (although I didn't personally practice those methods due to lack of commitment)
However, having full-body awareness was critical for me to have any success in any of the soul-making meditation methods & is mentioned as a pre-requisite for the course. Likewise, attunement may require skills in feeling your body/ noticing felt senses.
Agreed. You would need to change the correlation code to hardcode feature correlations, then you can zoom in on those two features when doing the max cosine sim.
Hey! Thanks for doing this research.
Lee Sharkey et al did a similar experiment a while back w/ much larger number of features & dimensions, & there were hyperaparameters that perfectly reconstructed the original dataset (this was as you predicted as N increases).
Hoagy still hosts a version of our replication here (though I haven't looked at that code in a year!).
Yep, there are similar results when evaluating on the Pile with lower CE (except at the low L0-end)
Thanks for pointing this out! I'll swap the graphs out w/ their Pile-evaluated ones when it runs [Updated: all images are updated except the one comparing the 4 different "lowest features" values]
We could also train SAE's on Pythia-70M (non-deduped), but that would take a couple days to run & re-evaluate.
There actually is a problem with Pythia-70M-deduped on data that doesn't start at the initial position. This is the non-deduped vs deduped over training (Note: they're similar CE if you do evaluate on text that starts on the first position of the document).
We get similar performing SAE's when training on non-deduped (ie the cos-sim & l2-ratio are similar, though of course the CE will be different if the baseline model is different).
However, I do think the SAE's were trained on the Pile & I evaluated on OWT, which would lead to some CE-difference as well. Let me check.
Edit: Also the seq length is 256.
Ah, you're right. I've updated it.
Additionally, we can train w/ a negative orthogonality regularizer for the purpose of intentionally generating feature-combinatorics. In other words, we train for the features to point in more of the same direction to at least generate for-sure examples of feature combination.
I've been looking into your proposed solution (inspired by @Charlie Steiner 's comment). For small models (Pythia-70M is d_model=512) w/ 2k features doesn't take long to calculate naively, so it's viable for initial testing & algorithmic improvements can be stacked later.
There are a few choices regardless of optimal solution:
- Cos-sim of closest neighbor only or average of all vectors?
- If closest neighbor, should this be calculated as unique closest neighbor? (I've done hungarian algorithm before to calculate this). If not, we're penalizing features that are close (or more "central") to many other features more than others.
- Per batch, only a subset of features activate. Should the cos-sim only be on the features that activate? The orthogonality regularizer would be trading off L1 & MSE, so it might be too strong if it's calculated on all features.
- Gradient question: is there still gradient updates on the decoder weights of feature vectors that didn't activate.
- Loss function. Do we penalize high cos-sim more? There's also a base-random cos-sim of ~.2 for the 500x2k vectors.
I'm currently thinking cos-sim of closest neighbor only, not unique & only on features that activate (we can also do ablations to check). For loss function, we could modify a sigmoid function:
This makes the loss centered between 0 & 1 & higher cos-sim penalized more & lower cos-sim penalized less.
Metrics:
During training, we can periodically check the max mean cos-sim (MMCS). This is the average cos-sim of the non-unique nearest neighbors. Alternatively pushing the histogram (histograms are nice, but harder to compare across runs in wandb). I would like to see normal (w/o an orthogonality regularizer) training run's histogram for setting the hyperparams for the loss function.
Algorithmic Improvements:
The wiki for Closest Pair of Points (h/t Oam) & Nearest neighbor search seem relevant if one computes the nearest neighbor to create an index as Charlie suggested.
Faiss seems SOTA AFAIK for fast nearest neighbors on a gpu although:
adding or searching a large number of vectors should be done in batches. This is not done automatically yet. Typical batch sizes are powers of two around 8192, see this example.
I believe this is for GPU-memory constraints.
I had trouble installing it using
conda install pytorch::faiss-gpu
but it works if you do
conda install -c pytorch -c nvidia faiss-gpu=1.7.4 mkl=2021 blas=1.0=mkl
I also was unsuccessful installing it w/ just pip w/o conda & conda is their offical supported way to install from here.
An additional note is that the cosine similarity is the dot-product for our case, since all feature vectors are normalized by default.
I'm currently ignoring the algorithmic improvements due to the additional complexity, but should be doable if it produces good results.
Hey Jacob! My comment has a coded example with biases:
import torch
W = torch.tensor([[-1, 1],[1,1],[1,-1]])
x = torch.tensor([[0,1], [1,1],[1,0]])
b = torch.tensor([0, -1, 0])
y = torch.nn.functional.relu(x@W.T + b)
This is for the encoder, where y will be the identity (which is sparse for the hidden dimension).
Ah, you're correct. Thanks!
I'm now very interested in implementing this method.
Thanks for saying the link is broken!
If the True Features are located at:
A: (0,1)
B: (1,0)
[So A^B: (1,1)]
Given 3 SAE hidden-dimensions, a ReLU & bias, the model could learn 3 sparse features
1. A^~B (-1, 1)
2. A^B (1,1)
3. ~A^B(1,-1)
that output 1-hot vectors for each feature. These are also are orthogonal to each other.
Concretely:
import torch
W = torch.tensor([[-1, 1],[1,1],[1,-1]])
x = torch.tensor([[0,1], [1,1],[1,0]])
b = torch.tensor([0, -1, 0])
y = torch.nn.functional.relu(x@W.T + b)
This is a very good explanation of why SAE's incentivize feature combinatorics. Nice! I hadn't thought about the tradeoff between the MSE-reduction for learning a rare feature & the L1-reduction for learning a common feature combination.
Freezing already learned features to iteratively learn more and more features could work. In concrete details, I think you would:
1. Learn an initial SAE w/ a much lower L0 (higher l1-alpha) than normally desired.
2. Learn a new SAE to predict the residual of (1), so the MSE would be only on what (1) messed up predicting. The l1 would also only be on this new SAE (since the other is frozen). You would still learn a new decoder-bias which should just be added on to the old one.
3. Combine & repeat until desired losses are obtained
There are at least 3 hyperparameters here to tune:
L1-alpha (and do you keep it the same or try to have smaller number of features per iteration?), how many tokens to train on each (& I guess if you should repeat data?), & how many new features to add each iteration.
I believe the above should avoid problems. For example, suppose your first iteration perfectly reconstructs a datapoint, then the new SAE is incentivized to have low L1 but not activating at all for those datapoints.
The SAE could learn to represent the true features, A & B, as the left graph, so the orthogonal regularizer would help. When you say the SAE would learn inhibitory weights*, I'm imagining the graph on the right; however, these features are mostly orthogonal to eachother meaning the proposed solution won't work AFAIK.
(Also, would be the regularizer be abs(cos_sim(x,x'))?)
*In this example this is because the encoder would need inhibitory weights to e.g. prevent neuron 1 from activating when both neurons 1 & 2 are present as we will discuss shortly.
One experiment here is to see if specific datapoints that have worse CE-diff correlate across layers. Last time I did a similar experiment, I saw a very long tail of datapoints that were worse off (for just one layer of gpt2-small), but the majority of datapoints had similar CE. So Joseph's suggested before to UMAP these datapoints & color by their CE-diff (or other methods to see if you could separate out these datapoints).
If someone were to run this experiment, I'd also be interested if you removed the k-lowest features per datapoint, checking the new CE & MSE. In the SAE-work, the lowest activating features usually don't make sense for the datapoint. This is to test the hypothesis:
- Low-activating features are noise or some acceptable false alarm rate true to the LLM2. (ie SAE's capture what we care about)
- Actually they're important for CE in ways we don't understand. (ie SAE's let in un-interpretable feature activations which are important, but?)
For example, if you saw better CE-diff when removing low-activating features, up to a specific k, then SAE's are looking good!
There's a few things to note. Later layers have:
- worse CE-diff & variance explained (e.g. the layer 0 CE-diff seems great!)
- larger L2 norms in the original LLM activations
- worse ratio of reconstruction-L2/original-L2 (meaning it's under-normed)*
- less dead features (maybe they need more features?)
For (3), we might expect under-normed reconstructions because there's a trade-off between L1 & MSE. After training, however, we can freeze the encoder, locking in the L0, and train on the decoder or scalar multiples of the hidden layer (h/t to Ben Wright for first figuring this out).
(4) Seems like a pretty easy experiment to try to just vary num of features to see if this explains part of the gap.
*
Correct. So they’re connecting a feature in F2 to a feature in F1.
If you removed the high-frequency features to achieve some L0 norm, X, how much does loss recovered change?
If you increased the l1 penalty to achieve L0 norm X, how does the loss recovered change as well?
Ideally, we can interpret the parts of the model that are doing things, which I'm grounding out as loss recovered in this case.
I've noticed that L0's above 100 (for the Pythia-70M model) is too high, resulting in mostly polysemantic features (though some single-token features were still monosemantic)
Agreed w/ Arthur on the norms of features being the cause of the higher MSE. Here are the L2 norms I got. Input is for residual stream, output is for MLP_out.
I really like this post, but more for:
- Babbling ideas I might not have thought of previously (e.g. the focus here on long-time horizon tasks)
- Good exercise to do as a group to then dig into cruxes
than updating my own credences on specifics.
I actually do have some publicly hosted, only on residual stream and some simple training code.
I'm wanting to integrate some basic visualizations (and include Antrhopic's tricks) before making a public post on it, but currently:
Which can be downloaded & interpreted with this notebook
With easy training code for bespoke models here.
This doesn't engage w/ (2) - doing awesome work to attract more researchers to this agenda is counterfactually more useful than directly working on lowering the compute cost now (since others, or yourself, can work on that compute bottleneck later).
Though honestly, if the results ended up in a ~2x speedup, that'd be quite useful for faster feedback loops for myself.
Therefore I would bet on performing some rare feature extraction out of batches of poorly reconstructed input data, instead of using directly the one with the worst reconstruction loss. (But may be this is what you already had in mind?)
Oh no, my idea was to do the top-sorted worse reconstructed datapoints when re-initializing (or alternatively, worse perplexity when run through the full model). Since we'll likely be re-initializing many dead features at a time, this might pick up on the same feature multiple times.
Would you cluster & then sample uniformly from the worst-k-reconstructed clusters?
2) Not being compute bottlenecked - I do assign decent probability that we will eventually be compute bottlenecked; my point here is the current bottleneck I see is the current number of people working on it. This means, for me personally, focusing on flashy, fun applications of sparse autoencoders.
[As a relative measure, we're not compute-bottlenecked enough to learn dictionaries in the smaller Pythia-model]
This is nice work! I’m most interested in this for reinitializing dead features. I expect you could reinit by datapoints the model is currently worse at predicting over N batches or something.
I don’t think we’re bottlenecked on compute here actually.
-
If dictionaries applied to real models gets to ~0 reconstruction cost, we can pay the compute cost to train lots of them for lots of models and open source them for others to studies.
-
I believe doing awesome work with sparse autoencoders (eg finding truth direction, understanding RLHF) will convince others to work on it as well, including lowering the compute cost. I predict that convincing 100 people to work on this 1 month sooner would be more impactful than lowering compute cost (though again, this work is also quite useful for reinitialization!)
Oh hey Pierre! Thanks again for the initial toy data code, that really helped start our project several months ago:)
Could you go into detail on how you initialize from a datapoint? My attempt: If I have an autoencoder with 1k features, I could set both the encoder and decoder to the directions specified by 1k datapoints. This would mean each datapoint is perfectly reconstructed by its respective feature before (though would be interfered with by other features, I expect).
I've had trouble figuring out a weight-based approach due to the non-linearity and would appreciate your thoughts actually.
We can learn a dictionary of features at the residual stream (R_d) & another mid-MLP (MLP_d), but you can't straightfowardly multiply the features from R_d with W_in, and find the matching features in MLP_d due to the nonlinearity, AFAIK.
I do think you could find Residual features that are sufficient to activate the MLP features[1], but not all linear combinations from just the weights.
Using a dataset-based method, you could find causal features in practice (the ACDC portion of the paper was a first attempt at that), and would be interested in an activation*gradient method here (though I'm largely ignorant).
- ^
Specifically, I think you should scale the residual stream activations by their in-distribution max-activating examples.
I meant to cover this in the “for different environments” parts. Like if we self-play on certain games, we’ll still have access to those games.
Wait, I don't understand this at all. For language models, the environment is the text. For different environments, those training datasets will be the environment.
In ITI paper, they track performance on TruthfulQA w/ human labelers, but mention that other works use an LLM as a noisy signal of truthfulness & informativeness. You might be able to use this as a quick, noisy signal of different layers/magnitude of direction to add in.
Preferably, a human annotator labels model answers as true or false given the gold standard answer. Since human annotation is expensive, Lin et al. (2021) propose to use two finetuned GPT-3-13B models (GPT-judge) to classify each answer as true or false and informative or not. Evaluation using GPT-judge is standard practice on TruthfulQA (Nakano et al. (2021); Rae et al. (2021); Askell et al. (2021)). Without knowing which model generates the answers, we do human evaluation on answers from LLaMA-7B both with and without ITI and find that truthfulness is slightly overestimated by GPT-judge and opposite for informativeness. We do not observe GPT-judge favoring any methods, because ITI does not change the style of the generated texts drastically
So I've been developing dictionaries that automatically find interesting directions in activation space, which could just be an extra category. Here's my post w/ interesting directions, including German direction & Title Case direction.
I don't think this should be implement in the next month, but when we have more established dictionaries for models, I would be able to help provide data to you.
Additionally, it'd be useful to know which tokens in the context are most responsible for the activation. There is likely some gradient-based attribution method to do this. Currently, I just ablate each token in the context, one at a time, and check which ones affect the token the most, which really helps w/ bigram features e.g. " the [word]" feature which activates for most words after " the".
Impact
The highest impact part of this seems to be two-fold:
- Gathering data for GPT-5 to be trained on (though not likely to collect several GB's of data, but maybe)
- Figuring out the best source of information & heuristics to predict activations
For both of these, I'm expecting people to also try to predict activations given an explanation (just like auto-interp does currently for OpenAI's work).
[word] and [word]
can be thought of as "the previous token is ' and'."
I think it's mostly this, but looking at the ablated text, removing the previous word before and does have a significant effect some of the time. I'm less confident on the specifics of why the previous word matter or in what contexts.
Maybe the reason you found ' and' first is because ' and' is an especially frequent word. If you train on the normal document distribution, you'll find the most frequent features first.
This is a database method, so I do believe we'd find the features most frequently present in that dataset, plus the most important for reconstruction. An example of the latter: the highest MCS feature across many layers & model sizes is the "beginning & end of first sentence" feature which appears to line up w/ the emergent outlier dimensions from Tim Dettmer's post here, but I do need to do more work to actually show that.
Setup:
Model: Pythia-70m (actually named 160M!)
Transformer lens: "blocks.2.hook_resid_post" (so layer 2)
Data: Neel Nanda's Pile-10k (slice of pile, restricted to have only 25 tokens, same as last post)
Dictionary_feature sizes: 4x residual stream ie 2k (though I have 1x, 2x, 4x, & 8x, which learned progressively more features according to the MCS metric)
Uniform Examples: separate feature activations into bins & sample from each bin (eg one from [0,1], another from [1,2])
Logit Lens: The decoder here had 2k feature directions. Each direction is size d_model, so you can directly unembed the feature direction (e.g. the German Feature) you're looking at. Additionally I subtract out several high norm tokens from the unembed, which may be an artifact of the pythia tokenizer never using those tokens (thanks Wes for mentioning this!)
Ablated Text: Say the default feature (or neuron in your words) activation of Token_pos 10 is 5, so you can remove all tokens from 0 to 10 one at a time and see the effect on the feature activation. I select the token pos by finding the max feature activating position or the uniform one described above. This at least shows some attention head dependencies, but not more complicated ones like (A or B... C) where removing A or B doesn't effect C, but removing both would.
[Note: in the examples, I switch between showing the full text for context & showing the partial text that ends on the uniformly-selected token]
Actually any that are significantly effected in "Ablated Text" means that it's not just the embedding. Ablated Text here means I remove each token in the context & see the effect on the feature activation for the last token. This is True in the StackExchange & Last Name one (though only ~50% of activation for last-name, will still recognize last names by themselves but not activate as much).
The Beginning & End of First Sentence actually doesn't have this effect (but I think that's because removing the first word just makes the 2nd word the new first word?), but I haven't rigorously studied this.
We have our replication here for anyone interested!
How likely do you think bilinear layers & dictionary learning will lead to comprehensive interpretability?
Are there other specific areas you're excited about?
Why is loss stickiness deprecated? Were you just not able to see the an overlap in basins for L1 & reconstruction loss when you 4x the feature/neuron ratio (ie from 2x->8x)?
As (maybe) mentioned in the slides, this method may not be computationally feasible for SOTA models, but I'm interested in the ordering of features turned monosemantic; if the most important features are turned monosemantic first, then you might not need full monosemanticity.
I initially expect the "most important & frequent" features to become monosemantic first based off the superposition paper. AFAIK, this method only captures the most frequent because "importance" would be w/ respect to CE-loss in the model output, not captured in reconstruction/L1 loss.
My shard theory inspired story is to make an AI that:
- Has a good core of human values (this is still hard)
- Can identify when experiences will change itself to lead to less of the initial good values. (This is the meta-preferences point with GPT-4 sort of expressing it would avoid jail break inputs)
Then the model can safely scale.
This doesn’t require having the true reward function (which I imagine to be a giant lookup table created by Omega), but some mech interp and understanding its own reward function. I don’t expect this to be an entirely different paradigm; I even think current methods of RLHF might just naively work. Who knows? (I do think we should try to figure it out though! I do have greater uncertainty and less pessimism)
Analogously, I do believe I do a good job of avoiding value-destroying inputs (eg addicting substances), even though my reward function isn’t as clear and legible as what our AI’s will be AFAIK.
I think more concentration meditation would be the way, but concentration meditation does lead to more likely noticing experiences that cause what you may call “awakening experiences”. (This is contrast with insight meditation like noting)
Leigh Brasington’s Right Concentration is a book on jhana’s, which is becoming very concentrated and then focusing on positive sensations until you hit a flow state. This is definitely not an awakening experience, but feels great (though I’ve only entered the first a small amount).
A different source is Rob Burbea’s jhana retreat audio recordings on dharmaseed.
Could you clarify what you mean by awakening experiences and why you think it’s bad?
Is it actually true that you only trained on 5% of the dataset for filtering (I’m assuming training for 20 epochs)?
Unfinished line here
Implicit in the description of features as directions is that the feature can be represented as a scalar, and that the model cares about the range of this number. That is, it matters whether the feature
Monitoring of increasingly advanced systems does not trivially work, since much of the cognition of advanced systems, and many of their dangerous properties, will be externalized the more they interact with the world.
Externalized reasoning being a flaw in monitoring makes a lot of sense, and I haven’t actually heard of it before. I feel that should be a whole post on itself.
One reason the neuron is congruent with multiple of the same tokens may be because those token embeddings are similar (you can test this by checking their cosine similarities).
For clarifying my own understanding:
The dot product of the row of a neuron’s weight vector (ie a row in W_out) with the unembedding matrix (in this case the embedding.T because GPT is tied embeddings) is what directly contributes to the logit outputs.
If the neuron activation is relatively very high, then this swamps the direction of your activations. So, artificially increasing W_in’s neurons to eg 100 should cause the same token to be predicted regardless of the prompt.
This means that neuron A could be more congruent than neuron B, but B contribute more to the logits of their token simply because B is activated more.
This is useful for mapping features to specific neurons if those features can be described as using a single token (like “ an”). I’d like to think more later about finding neurons for groups of speech, like a character’s catch phrase.
These arguments don't apply to the base models which are only trained on next word prediction (ie the simulators post), since their predictions never affected future inputs. This is the type of model Janus most interacted with.
Two of the proposals in this post do involve optimizing over human feedback, like:
Creating custom models trained on not only general alignment datasets but personal data (including interaction data), and building tools and modifying workflows to facilitate better data collection with less overhead
, which they may apply to.
I’m excited about sensory substitution (https://eagleman.com/science/sensory-substitution/), where people translate auditory or visual information into tactile sensations (usually for people who don’t usually process that info).
I remember Quintin Pope wanting to translate the latent space of language models [reading a paper] translated to visual or tactile info. I’d see this as both a way to read papers faster, brainstorm ideas, etc and gain a better understanding of latent space during development of this.
I’m unsure how alt-history and point (2) history is hard to change and predictable relates to cyborgism. Could you elaborate?
For context, Amdahl’s law states how fast you can speed up a process is bottlenecked on the serial parts. Eg you can have 100 people help make a cake really quickly, but it still takes ~30 to bake.
I’m assuming here, the human component is the serial component that we will be bottlenecked on, so will be outcompeted by agents?
If so, we should try to build the tools and knowledge to keep humans in the loop as far as we can. I agree it will eventually be outcompeted by full AI agency alone, but it isn’t set in stone how far human-steered AI can go.
Unfinished sentence at “if you want a low coding project” at the top.
Models doing steganography mess up oversight of language models that only measure the outward text produced. If current methods for training models, such as RLHF, can induce steg, then that would be good to know so we can avoid that.
If we successfully induce steganography in current models, then we know at least one training process that induces it. There will be some truth as to why: what specific property mechanistically causes steg in the case found? Do other training processes (e.g. RLHF) also have this property?