Comments
It'd be important to cache the karma of all users with karma > 1000 right now, in order to credibly signal you know which generals were part of the nuking/nuked side. Would anyone be willing to do that in the next 2.5 hours? (ie the earliest we could be nuked)
We could instead pre-commit to not engage with any nuker's future posts/comments (and at worst comment to encourage others to not engage) until end-of-year.
Or only include nit-picking comments.
Could you dig into why you think it's great interp work?
But through gradient descent, shards act upon the neural networks by leaving imprints of themselves, and these imprints have no reason to be concentrated in any one spot of the network (whether activation-space or weight-space). So studying weights and activations is pretty doomed.
This paragraph sounded like you're claiming LLMs do have concepts, but that they're not in specific activations or weights; instead they're distributed across them.
But from your comment, you mean that LLMs themselves don't learn the true simple-compressed features of reality, but a mere shadow of them.
This interpretation also matches the title better!
But are you saying the "true features" are in the dataset + network? Because SAEs are trained on a dataset! (ignoring the problem pointed out in footnote 1).
Possibly clustering the data points by their network gradients would be a way to put some order into this mess?
Eric Michaud did cluster datapoints by their gradients here. From the abstract:
...Using language model gradients, we automatically decompose model behavior into a diverse set of skills (quanta).
The one we checked last year was just Pythia-70M, and I don't expect that LLM itself to have learned a gender feature that generalizes to both pronouns and anisogamy.
But again, the task is next-token prediction. Do you expect e.g. GPT 4 to have learned a gender concept that affects both knowledge about anisogamy and pronouns while trained on next-token prediction?
Sparse autoencoders find features that correspond to abstract features of words and text. That's not the same as finding features that correspond to reality.
(Base-model) LLMs are trained to minimize prediction error, and SAEs do seem to find features that sparsely predict error, such as a gender feature that, when removed, affects the probability of pronouns. So pragmatically, for the goal of "finding features that explain next-word-prediction", which LLMs are directly trained for, SAEs find good examples![1]
I'm unsure what goal you have in mind for "features that correspond to reality", or what that'd mean.
- ^
Not claiming that all SAE latents are good in this way though.
Is there code available for this?
I'm mainly interested in the loss function. Specifically from footnote 4:
We also need to add a term to capture the interaction effect between the key-features and the query-transcoder bias, but we omit this for simplicity
I'm unsure how this is implemented or the motivation.
Some MLPs or attention layers may implement a simple linear transformation in addition to actual computation.
@Lucius Bushnaq , why would MLPs compute linear transformations?
Because two linear transformations can be combined into one linear transformation, why wouldn't downstream MLPs/Attns that rely on this linearly transformed vector just learn the combined function?
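To spell out the composition point with a toy example (nothing model-specific, just the linear-algebra fact):

import torch

# Two successive linear maps collapse into a single one: W2 @ (W1 @ x) == (W2 @ W1) @ x
d = 16
W1, W2 = torch.randn(d, d), torch.randn(d, d)
x = torch.randn(d)

two_step = W2 @ (W1 @ x)
one_step = (W2 @ W1) @ x
print(torch.allclose(two_step, one_step, atol=1e-5))  # True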
What is the activation name for the resid SAEs? hook_resid_post or hook_resid_pre?
I found https://github.com/ApolloResearch/e2e_sae/blob/main/e2e_sae/scripts/train_tlens_saes/run_train_tlens_saes.py#L220
to suggest _post
but downloading the SAETransformer from wandb shows:
(saes): ModuleDict(
  (blocks-6-hook_resid_pre): SAE(
    (encoder): Sequential( (0): ...
which suggests _pre.
Here’s a link: https://archive.nytimes.com/thelede.blogs.nytimes.com/2008/07/02/a-window-into-waterboarding/
3. Those who are more able to comprehend and use these models are therefore of a higher agency/utility and higher moral priority than those who cannot. [emphasis mine]
This (along with saying "dignity" implies "moral worth" in Death w/ Dignity post), is confusing to me. Could you give a specific example of how you'd treat differently someone who has more or less moral worth (e.g. give them more money, attention, life-saving help, etc)?
One thing I could understand from your Death w/ Dignity excerpt is he's definitely implying a metric that scores everyone, and some people will score higher on this metric than others. It's also common to want to score high on these metrics, or to feel emotionally bad if you don't score high on them (see my post for more). This could even have utility, like having more "dignity" gets you a thumbs up from Yudkowsky or gets your words listened to more in this community. Is this close to what you mean at all?
Rationalism is path-dependent
I was a little confused by this section. Is this saying that humans' goals and options (including options that come to mind) change depending on the environment, so rational choice theory doesn't apply?
Games and Game Theory
I believe the thesis here is that game theory doesn't really apply in real life, that there are usually extra constraints or freedoms in real situations that change the payoffs.
I do think this criticism is already handled by trying to "actually win" and "trying to try"; though I've personally benefitted specifically from trying to try and David Chapman's meta-rationality post.
Probability and His Problems
The idea of deference (and when to defer) isn't novel (which is fine! Novelty is just another metric I'm bringing up; not everything one writes needs to be novel). It's still useful to apply Bayes' theorem to deference. Specifically, evidence that convinces you to trust someone should imply that there's possible evidence that would convince you to not trust them.
This is currently all I have time for; however, my current understanding is that there is a common interpretation of Yudkowsky's writings/The Sequences/LW/etc that leads to an over-reliance on formal systems that will inevitably fail people. I think you had this interpretation (do correct me if I'm wrong!), and this is your "attempt to renegotiate rationalism".
There is the common response of "if you re-read the sequences, you'll see how it actually handles all the flaws you mentioned"; however, it's still true that it's at least a failure in communication that many people consistently mis-interpret it.
Glad to hear you're synthesizing and doing pretty good now:)
I think copy-pasting the whole thing will make it more likely to be read! I enjoyed it and will hopefully leave a more substantial comment later.
I've really enjoyed these posts; thanks for cross posting!
Kind of confused on why the KL-only e2e SAEs have worse CE than e2e+downstream across dictionary size:
This is true for layers 2 & 6. I'm unsure if this means that training for KL directly is harder/unstable, and the intermediate MSE is a useful prior, or if this is a difference in KL vs CE (ie the e2e does in fact do better on KL but worse on CE than e2e+downstream).
I finally checked!
Here is the Jaccard similarity (ie similarity of input-token activations) across seeds
The e2e ones do indeed have a much lower Jaccard sim (there normally is a spike at 1.0, but this goes away when you remove features that only activate <10 times).
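(For concreteness, the Jaccard sim here is over which tokens each feature fires on; a minimal sketch, assuming boolean activation masks of shape [n_features, n_tokens] for two seeds:)

import torch

def jaccard_sim(acts_a: torch.Tensor, acts_b: torch.Tensor) -> torch.Tensor:
    # acts_a, acts_b: boolean masks [n_features, n_tokens] of where each feature fires
    a, b = acts_a.float(), acts_b.float()
    intersection = a @ b.T                                    # [n_feat_a, n_feat_b]
    union = a.sum(-1, keepdim=True) + b.sum(-1) - intersection
    return intersection / union.clamp(min=1)                  # pairwise Jaccard similarities

Then sim.max(dim=1).values gives each seed-A feature its best match in seed B.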
I also (mostly) replicated the decoder similarity chart:
And calculated the encoder sim:
[I, again, needed to remove dead features (< 10 activations) to get the graphs here.]
So yes, I believe the original paper's claim that e2e features learn quite different features across seeds is substantiated.
And here's the code to convert it to NNsight (Thanks Caden for writing this a while ago!)

import torch
from transformers import GPT2LMHeadModel
from transformer_lens import HookedTransformer
from nnsight.models.UnifiedTransformer import UnifiedTransformer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = GPT2LMHeadModel.from_pretrained("apollo-research/gpt2_noLN").to(device)

# Undo my hacky LayerNorm removal
for block in model.transformer.h:
    block.ln_1.weight.data = block.ln_1.weight.data / 1e6
    block.ln_1.eps = 1e-5
    block.ln_2.weight.data = block.ln_2.weight.data / 1e6
    block.ln_2.eps = 1e-5
model.transformer.ln_f.weight.data = model.transformer.ln_f.weight.data / 1e6
model.transformer.ln_f.eps = 1e-5

# Properly replace LayerNorms by Identities
def removeLN(transformer_lens_model):
    for i in range(len(transformer_lens_model.blocks)):
        transformer_lens_model.blocks[i].ln1 = torch.nn.Identity()
        transformer_lens_model.blocks[i].ln2 = torch.nn.Identity()
    transformer_lens_model.ln_final = torch.nn.Identity()

hooked_model = HookedTransformer.from_pretrained("gpt2", hf_model=model, fold_ln=True, center_unembed=False).to(device)
removeLN(hooked_model)

model_nnsight = UnifiedTransformer(model="gpt2", hf_model=model, fold_ln=True, center_unembed=False).to(device)
removeLN(model_nnsight)

prompt = torch.tensor([1, 2, 3, 4], device=device)

logits = hooked_model(prompt)
with torch.no_grad(), model_nnsight.trace(prompt) as runner:
    logits2 = model_nnsight.unembed.output.save()

logits, cache = hooked_model.run_with_cache(prompt)
torch.allclose(logits, logits2)
Maybe this should be like Anthropic's shared decoder bias? Essentially subtract off the per-token bias at the beginning, let the SAE reconstruct this "residual", then add the per-token bias back to the reconstructed x.
The motivation is that the SAE has a weird job in this case. It sees x, but needs to reconstruct x - per-token-bias, which means it needs to somehow learn what that per-token-bias is during training.
However, if you just subtract it first, then the SAE sees x', and just needs to reconstruct x'.
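In code, the idea is roughly this (a rough sketch with made-up names, not the post's actual implementation; lookup_bias is the per-token bias table):

import torch
import torch.nn as nn

class PerTokenBiasSAE(nn.Module):
    # Sketch: subtract a per-token lookup bias before encoding and add it back after
    # decoding, so the SAE only has to reconstruct the "residual" x' = x - bias.
    def __init__(self, d_model: int, d_sae: int, vocab_size: int):
        super().__init__()
        self.lookup_bias = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def forward(self, x: torch.Tensor, tok: torch.Tensor) -> torch.Tensor:
        b = self.lookup_bias(tok)                 # per-token bias
        feats = torch.relu(self.encoder(x - b))   # SAE sees x' = x - b
        return self.decoder(feats) + b            # train with MSE against the original x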
So I'm just suggesting changing here:
w/ remaining the same:
That's great thanks!
My suggested experiment to really get at this question (which if I were in your shoes, I wouldn't want to run cause you've already done quite a bit of work on this project!, lol):
Compare
1. Baseline 80x expansion (56k features) at k=30
2. Tokenized-learned 8x expansion (50k vocab + 6k features) at k=29 (since the token adds 1 extra feature)
for 300M tokens (I usually don't see improvements past this amount) showing NMSE and CE.
If tokenized-SAEs are still better in this experiment, then that's a pretty solid argument to use these!
If they're equivalent, then tokenized-SAEs are still way faster to train in this lower expansion range, while having 50k "features" already interpreted.
If tokenized-SAEs are worse, then these tokenized features aren't a good prior to use. Although both sets of features are learned, the difference would be the tokenized always has the same feature per token (duh), and baseline SAEs allow whatever combination of features (e.g. features shared across different tokens).
About similar tokenized features, maybe I'm misunderstanding, but this seems like a problem for any decoder-like structure.
I didn't mean to imply it's a problem, but the interpretation should be different. For example, if at layer N all the number tokens have cos-sim=1 in the tokenized-feature set, then if we find a downstream feature reading from the " 9" token on a specific task, we should conclude it's reading from a more general number direction rather than a specific number direction.
I agree this argument also applies to the normal SAE decoder (if the cos-sim=1)
Although, tokenized features are dissimilar to normal features in that they don't vary in activation strength. Tokenized features are either 0 or 1 (or norm of the vector). So it's not exactly an apples-to-apples comparison w/ a similar sized dictionary of normal SAE features, although that plot would be nice!
I do really like this work. This is useful for circuit-style work because the tokenized-features are already interpreted. If a downstream encoded feature reads from the tokenized-feature direction, then we know the only info being transmitted is info on the current token.
However, if multiple tokenized-features are similar directions (e.g. multiple tokenizations of word "the") then a circuit reading from this direction is just using information about this set of tokens.
Do you have a dictionary-size to CE-added plot? (fixing L0)
So, we did not use the lookup biases as additional features (only for decoder reconstruction)
I agree it's not like the other features in that the encoder isn't used, but it is used for reconstruction which affects CE. It'd be good to show the pareto improvement of CE/L0 is not caused by just having an additional vocab_size number of features (although that might mean having to use auxk to have a similar number of alive features).
Did you vary expansion size? The tokenized SAE will have 50k more features in its dictionary (compared to the 16x expansion of ~12k features from the paper version).
Did you ever train a baseline SAE w/ similar number of features as the tokenized?
This is extremely useful for SAE circuit work. Now the connections between features are at most ReLU(Wx + b) which is quite interpretable! (Excluding attn_in->attn_out)
Thanks for doing this!
Did you ever do a max-cos-sim between the vectors in the token-biases? I'm wondering how many biases are exactly the same (e.g. " The" vs "The" vs "the" vs " the", etc) which would allow a shrinkage of the number of features (although your point is good that an extra vocab_size num of features isn't large in the scale of millions of features).
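(In case it's useful, the check I have in mind is something like this; token_biases is assumed to be the [vocab_size, d_model] tensor of per-token lookup vectors:)

import torch
import torch.nn.functional as F

def max_offdiag_cos_sim(token_biases: torch.Tensor) -> torch.Tensor:
    # token_biases: [vocab_size, d_model] per-token lookup vectors
    unit = F.normalize(token_biases, dim=-1)
    sims = unit @ unit.T               # pairwise cos-sims (50k x 50k; chunk if memory is tight)
    sims.fill_diagonal_(-1.0)          # ignore each token's similarity with itself
    return sims.max(dim=-1).values     # each token's closest other token-bias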
Do you have any tokenized SAEs uploaded (and a simple download script?). I could only find model definitions in the repos.
If you just have it saved locally, here's my usual script for uploading to huggingface (just need your key).
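It's roughly along these lines (a minimal sketch with huggingface_hub and placeholder repo/paths, not necessarily the exact script):

from huggingface_hub import HfApi

api = HfApi(token="hf_...")  # your HF key
api.create_repo(repo_id="your-username/tokenized-saes", exist_ok=True)
api.upload_folder(
    folder_path="path/to/local/sae_checkpoints",  # local SAE weights/configs
    repo_id="your-username/tokenized-saes",
    repo_type="model",
)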
Did y'all do any ablations on your loss terms? For example:
1. JumpReLU() -> ReLU
2. L0 (w/ STE) -> L1
I'd be curious to see if the pareto improvements and high-frequency features are due to one, the other, or both.
On activation patching:
The most salient difference is that RE doesn’t completely change the activations, but merely adds to them. Thus, RE aims to find how a certain concept is represented, while activation patching serves as an ablation of the function of a specific layer or neuron. Thus activation patching doesn’t directly tell you where to find the representation of a concept.
I'm pretty sure both methods give you some approximate location of the representation. RE is typically done on many layers & then picks the best layer. Activation patching ablates each layer & shows you which one is most important for the counterfactual. I would trust the result of patching more than RE for location of a representation (maybe grounded out as what a SAE would find?) due to being more in-distribution.
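(For reference, the kind of per-layer patching sweep I have in mind; a toy sketch with made-up prompts:)

import torch
from transformer_lens import HookedTransformer

# Patch the corrupt run's residual stream into the clean run, layer by layer,
# and see which layer moves the answer the most.
model = HookedTransformer.from_pretrained("gpt2")
clean = model.to_tokens("The city of Paris is located in the country of")
corrupt = model.to_tokens("The city of Rome is located in the country of")
_, corrupt_cache = model.run_with_cache(corrupt)
france, italy = model.to_single_token(" France"), model.to_single_token(" Italy")

def patched_logit_diff(layer: int) -> float:
    def hook(resid, hook):
        return corrupt_cache[hook.name]  # swap in the corrupt activation
    logits = model.run_with_hooks(
        clean, fwd_hooks=[(f"blocks.{layer}.hook_resid_post", hook)]
    )
    return (logits[0, -1, france] - logits[0, -1, italy]).item()

effects = [patched_logit_diff(l) for l in range(model.cfg.n_layers)]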
Strong upvoted to get from -6 to 0 karma. Would be great if someone who downvotes could explain?
My read of the paper is this is a topic Jan is researching, and they wrote up their own lit review for their own sake and for others' if they're interested in the topic, which isn't negative-karma-worthy.
Regarding urls, I think this is a mix of the HH dataset being non-ideal & the PM not being a great discriminator of chosen vs rejected reward (see nostalgebraist's comment & my response)
I do think SAE's find the relevant features, but inefficiently compressed (see Josh & Isaac's work on days of the week circle features). So an ideal SAE (or alternative architecture) would not separate these features. Relatedly, many of the features that had high url-relevant reward had above-random cos-sim with each other.
[I also think the SAE's could be optimized to trade off some reconstruction loss for reward-difference loss which I expect to show a cleaner effect on the reward]
The PM is pretty bad (it's trained on hh).
It's actually only trained after the first 20k/156k datapoints in hh, which moves the mean reward-diff from 1.04 -> 1.36 if you only calculate over that remaining ~136k subset.
My understanding is there's 3 bad things:
1. the hh dataset is inconsistent
2. The PM doesn't separate chosen vs rejected very well (as shown above)
3. The PM is GPT-J (7B parameter model) which doesn't have the most complex features to choose from.
The in-distribution argument is most likely the case for the "Thank you. My pleasure" case, because the assistant never (AFAIK, I didn't check) said that phrase as a response. Only "My pleasure" after the user said " thank you".
I prefer when they are directly mentioned in the post/paper!
That would be a more honest picture. The simplest change I could think of was adding it to the high-level takeaways.
I do think you could use SAE features to beat that baseline if done in the way specified by General Takeaways. Specifically, if you have a completion that seems to do unjustifiably better, then you can find all feature's effects on the rewards that were different than your baseline completion.
Features help come up with hypotheses, but they also isolate the effect. If you do have a specific hypothesis as mentioned, then you should be able to find features that capture that hypothesis (if SAEs are doing their job). When you create some alternative completion based on your hypothesis, you might unknowingly add/remove additional negative & positive features, e.g. just wanting to remove completion-length, you also remove the end-of-sentence punctuation.
In general, I think it's hard to come up with the perfect counterfactual, but SAE's at least let you know if you're adding or removing specific reward-relevant features in your counterfactual completions.
Thanks!
There were some features I couldn't get to work, specifically ones that activated on movie names & famous people's names. Currently I think they're actually part of an "items in a list" group of reward-relevant features (like the urls were), but I didn't attempt to change prompts based off items in a list.
For "unsupervised find spurious features over a large dataset" my prior is low given my current implementation (ie I didn't find all the reward-relevant features).
However, this could be improved with more compute, SAEs over layers, data, and better filtering of the resulting feature results (and better versions of SAEs that e.g. fix feature splitting, train directly for reward).
From my (quick) read, it's not obvious how well this approach compares to the baseline of "just look at things the model likes and try to understand the spurious features of the PM"
From this section, you could augment this with SAE features by finding the features relevant for causing one completion to be different than the other. I think this is the most straightforwardly useful application. A couple of gotcha's:
- Some features are outlier dimensions or high-frequency features which will affect both completions (or even most text), so include some baselines which shouldn't be affected (which requires a hypothesis)
- You should look over multiple layers (though if you do multiple residual stream SAEs you'll find near-duplicate features)
Fixed! Thanks:)
Really cool work! Some general thoughts from training SAEs on LLMs that might not carry over:
L0 vs reconstruction
Your variance explained metric only makes sense for a given sparsity (ie L0). For example, you can get near perfect variance explained if you set your sparsity term to 0, but then the SAE is like an identity function & the features aren't meaningful.
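(Concretely, I'd always report these two numbers together; a small sketch, where x/x_hat are the SAE inputs/reconstructions and acts the feature activations:)

import torch

def sae_metrics(x: torch.Tensor, x_hat: torch.Tensor, acts: torch.Tensor):
    # x, x_hat: [n_tokens, d_model]; acts: [n_tokens, n_features]
    l0 = (acts != 0).float().sum(dim=-1).mean()                        # avg features active per token
    fvu = ((x - x_hat) ** 2).sum() / ((x - x.mean(dim=0)) ** 2).sum()  # fraction of variance unexplained
    return l0.item(), (1 - fvu).item()                                 # L0 and variance explained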
In my initial experiments, we swept over various L0s & checked which ones looked most monosemantic (ie had the same meaning across all activations) when sampling 30 features. We found 20-100 to be a good L0 for LLMs of d_model=512. I'm curious how this translates to text.
Dead Features
I believe you can use Leo Gao's topk w/ tied-initialization scheme to (1) tightly control your L0 & (2) have fewer dead features w/o doing ghost grads. Gao et al notice that this tied init (ie setting the encoder to the decoder transposed initially) led to few dead features for small models, and your d_model of 1k is kind of small.
Nora has an implementation here (though you'd need to integrate w/ your vision models)
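The gist of that scheme, as a rough sketch (not Nora's or Gao et al.'s actual code):

import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    # Top-k SAE with tied initialization (decoder = encoder^T at init), which directly
    # fixes L0 = k and, per Gao et al., tends to leave few dead features.
    def __init__(self, d_model: int, d_sae: int, k: int):
        super().__init__()
        self.k = k
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * d_model**-0.5)
        self.W_dec = nn.Parameter(self.W_enc.data.T.clone())  # tied init
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        topk = torch.topk(pre, self.k, dim=-1)
        acts = torch.zeros_like(pre).scatter_(-1, topk.indices, torch.relu(topk.values))
        return acts @ self.W_dec + self.b_dec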
Icon Explanation
I didn't really understand your icon image... Oh I get it. The template is the far left & the other three are three different features that you clamp to a large value when generating from that template. Cool idea! (maybe separate the 3 features from the template or do a 1x3 matrix-table for clarity?)
Other Interp Ideas
Feature ablation - Take the top-activating images for a feature, ablate the feature ( by reconstructing w/o that feature & do the same residual add-in thing you found useful), and see the resulting generation.
Relevant Inputs - What pixels or something are causally responsible for activating this feature? There's gotta be some literature on input-causal attribution on the output class in image models, then you're just applying that to your features instead.
For instance, the subject of one photo could be transferred to another. We could adjust the time of day, and the quantity of the subject. We could add entirely new features to images to sculpt and finely control them. We could pick two photos that had a semantic difference, and precisely transfer over the difference by transferring the features. We could also stack hundreds of edits together.
This really sounds amazing. Did you patch the features from one image to another specifically? Details on how you transferred a subject from one to the other would be appreciated.
Thanks so much! All the links and info will save me time:)
Regarding cos-sim, after thinking a bit, I think it's more sinister. For the cross-cos-sim comparison, you get different results if you take the max over the 0th or 1st dimension (equivalent to doing cos(local, e2e) vs cos(e2e, local)). As an example, you could have 2 features each, where 3 point in the same direction and 1 points opposite. Making up numbers:
feature-directions (1D) = [[1], [1]] & [[1], [-1]]
cos-sim = [[1, 1], [-1, -1]]
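In code, so you can see the asymmetry directly (toy 1-D directions matching the numbers above):

import torch
import torch.nn.functional as F

local = torch.tensor([[1.0], [1.0]])   # toy "local" feature directions
e2e = torch.tensor([[1.0], [-1.0]])    # toy "e2e" feature directions
sims = F.normalize(e2e, dim=-1) @ F.normalize(local, dim=-1).T  # [[1, 1], [-1, -1]]
print(sims.max(dim=0).values.mean())   # best match for each local feature -> 1.0
print(sims.max(dim=1).values.mean())   # best match for each e2e feature -> 0.0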
For more intuition, suppose 4 local features surround 1 e2e feature (and the other features are pointed elsewhere). Then the 4 local features will all have high max-cos sim but the e2e only has 1. So it's not just double-counting, but quadruple counting. You could see for yourself if you swap your dim=1 to 0 in your code.
But my original comment showed your results are still directionally correct when doing [global max w/ replacement] (if I coded it correctly).
Btw it's not intuitive to me that the encoder directions might be similar even though the decoder directions are not. Curious if you could share your intuitions here.
The decoder directions have degrees of freedom, but the encoder directions...might have similar degrees of freedom and I'm wrong, lol. BUT! they might be functionally equivalent, so they activate on similar datapoints across seeds. That is more laborious to check though, waaaah.
I can check both (encoder directions first) because previous literature is really only on the SVD of gradient (ie the output), but an SAE might be more constrained when separating out inputs into sparse features. Thanks for prompting for my intuition!
You say:
We cannot realistically expect a significant part of population (let's say, 10%) to become advanced meditators to the level that they actually become indifferent to pain. So... for practical purposes, "pain causes aversion" describes the situation correctly, for a vast majority of people.
Which, if this is just a semantic argument, then sure. But OP's conclusion is goal-oriented:
Understanding the distinction between pain and suffering is crucial for developing effective strategies to reduce suffering. By directly addressing the craving, aversion, and clinging which cause suffering, we can create more compassionate and impactful interventions.
When I think of effective strategies here, I think of developing jhana helmets[1] which would imitate the mind state of blissful-joy-flow state. Although this causes joy-concentrated-collectedness, it's argued that this puts your mind in a state where it can better notice that aversion/craving are necessary for suffering (note: I've only partially experienced this).
Although I think you're expressing skepticism of craving/aversion as the only necessary cause of suffering for all people? Or maybe just the 99.9999% reduction in suffering (ie the knife) vs a 99% reduction for all? What do you actually believe?
For me, I read a book that suggested many different experiments to try in a playful way. One was to pay attention to the "distance" between my current state (e.g. "itching") and a desired state ("not itching"), and it did feel worse the larger the "distance". I could even intentionally make it feel larger or smaller, and thought that was very interesting. In one limiting case, you don't classify the two situations "itching"/"not itching" as separate, so no suffering. In the other, it's "the difference between heaven and hell", lol. This book had >100 experiments like these (though it is intended for advanced meditators, brag brag).
- ^
those jhana helmet people have pivoted to improving pedagogy w/ jhana retreats at $1-2k, I think for the purpose of gathering more jhana data, but then it became widely successful
The e2e having different feature directions across seeds was quite the bummer, but then I thought "are the encoder directions different though?"
Intuitively, the encoder directions affect which datapoints each feature activates on, and the decoder is the causal downstream effect. For e2e, we would expect widely different decoder directions because there are many free parameters (some other work showed the SVD of gradients has many zero singular values, meaning moving in most directions doesn't affect the downstream loss), but not necessarily different encoder directions.
If the encoder directions are similar across seeds, I'd trust them to inform relevant features for the model output (in cases where we don't care about connections w/ downstream layers).
However, I was not able to find the SAEs for various seeds.
Trying to replicate Cos-sim Plots
I downloaded the layer 6 SAEs with similar CE for all three types & took their cos-sim (last column in figure 3).
I think your cos-sim metric gives different results if you take the max over the first or 2nd dimension (or equivalently swap the order of the decoders multiplied by each other). If so, I think this is because you might double-count or something? Regardless, I ended up using the Hungarian algorithm to take the overall max (without double-counting), but it's on cpu, so I only did the first 10k/40k features. Below are results for both encoder & decoder, which do replicate the directional results.
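(The matching step is essentially this; a sketch with scipy, assuming [n_features, d_model] weight matrices from two seeds:)

import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def matched_cos_sims(W_a: torch.Tensor, W_b: torch.Tensor) -> torch.Tensor:
    # W_a, W_b: [n_features, d_model] decoder (or encoder) directions from two seeds
    sims = F.normalize(W_a, dim=-1) @ F.normalize(W_b, dim=-1).T
    # Hungarian algorithm: one-to-one matching that maximizes total cos-sim,
    # so no feature on either side gets counted twice
    row_idx, col_idx = linear_sum_assignment(sims.detach().cpu().numpy(), maximize=True)
    return sims[torch.from_numpy(row_idx), torch.from_numpy(col_idx)]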
Nonzero Features
Additionally, I thought some results came from counting nonzero features, which for the encoder are some high-cos-sim features, and for the decoder are the low-cos-sim features, weirdly enough.
Would appreciate if y'all upload any repeated seeds!
My code is temporarily hosted (for a few weeks maybe?) here.
Thanks for the correction! What I meant was that figure 7 is better modeled as "these neurons are not monosemantic", since their co-activation has a consistent effect (upweighting 9) which isn't captured by any individual component, and (I predict) these neurons would do different things on different prompts.
But I think I see where you’re coming from now, so the above is tangential. You’re just decomposing the logits using previous layers components. So even though intermediate layers logit contribution won’t make any sense (from tuned lens) that’s fine.
It is interesting in your example of the first two layers counteracting each other. Surely this isn’t true in general, but it could be a common theme of later layers counteracting bigrams (what the embedding is doing?) based off context.
To give my speculation (though I upvoted):
I believe this work makes sense overall (e.g. let's do logit lens but for individual model components), but it does not compare to baselines or mention SAEs.
Specifically, what would this method be useful for?
With logit prisms, we can closely examine how the input embeddings, attention heads, and MLP neurons each contribute to the final output.
If it's for isolating which model components are causally responsible for a task (e.g. addition & Q&A), then does it improve on patching in different activations for these different model components (or the linear approximation method Attribution Patching)? In what way?
Additionally, this post did assume MLP neurons are monosemantic, which isn't true. This is why we use sparse autoencoders to deal with superposition.
A final problem is that the logit attribution with the logit lens doesn't always work out, as shown by the cited Tuned Lens paper (e.g. directly unembedding early layers usually produces nonsense in which logits are upweighted).
I did upvote however because I think the standards for a blog post on LW should be lower. Thank you Raemon also for asking for details, because it sucks to get downvoted and not told why.
Strong upvote fellow co-author! lol
Highest-activating Features
I agree we shouldn't interpret features by their max-activation, but I think the activation magnitude really does matter. Removing smaller activations affects downstream CE less than larger activations (but this does mean the small activations do matter). A weighted percentage of feature activation captures this more (ie (sum of all golden gate activations)/(sum of all activations)).
I do believe "lower-activating examples don't fit your hypothesis" is bad because of circuits. If you find out that "Feature 3453 is a linear combination of the Golden Gate (GG) feature and the positive sentiment feature" then you do understand this feature at high GG activations, but not low GG + low positive sentiment activations (since you haven't interpreted low GG activations).
Your "code-error" feature example is good. If it only fits "code-error" at the largest feature activations & does other things, then if we ablate this feature, we'll take a capabilities hit because the lower activations were used in other computations. But, let's focus on the lower activations which we don't understand are being used in other computations bit. We could also have "code-error" or "deception" being represented in the lower activations of other features which, when co-occurring, cause the model to be deceptive or write code errors.
[Although, Anthropic showed evidence against this by ablating the code-error feature & running on errored code which predicted a non-error output]
Finding Features
Anthropic suggested that if you have a feature that occurs 1/Billion tokens, you need 1 Billion features. You also mention finding important features. I think SAEs find features based on the dataset you give them. For example, we trained an SAE on only chess data (on a chess-finetuned-Pythia model) & all the features were about chess. I bet if you trained it on code, it'd find only code features (note: I do think some semantic & token-level features would generalize to other domains).
Pragmatically, if there are features you care about, then it's important to train the SAE on many texts that exhibit that feature. This is also true for the safety relevant features.
In general, I don't think you need these 1000x feature expansions. Even a 1x feature expansion will give you sparse features (because of the L1 penalty). If you want your model to [have positive personality traits] then you only need to disentangle those features.
[Note: I think your "SAE's don't find all Othello board state features" does not make the point that SAE's don't find relevant features, but I'd need to think for 15 min to clearly state it which I don't want to do now, lol. If you think that's a crux though, then I'll try to communicate it]
Correlated Features
They said 82% of features had a max correlation of 0.3 or less (wait, does this imply that 18% of their million billion features correlated even more???), which I agree is a lot. I think this is the strongest evidence for "the neuron basis is not as good as SAEs"; I'm unsure who is still arguing that, but as a sanity check it makes sense.
However, some neurons are monosemantic, so it makes sense for SAE features to also find those (though again, 18% of a million billion have a higher correlation than 0.3?)
> We additionally confirmed that feature activations are not strongly correlated with activations of any residual stream basis direction.
I'm sure they actually found very strongly correlated features specifically for the outlier dimensions in the residual stream which Anthropic has previous work showing is basis aligned (unless Anthropic trains their models in ways that doesn't produce an outlier dimension which there is existing lit on).
[Note: I wrote a lot. Feel free to respond to this comment in parts!]
I think some people use the loss when all features are set to zero, instead of strictly doing
I think this is an unfinished
What a cool paper! Congrats!:)
What's cool:
1. e2e saes learn very different features every seed. I'm glad y'all checked! This seems bad.
2. e2e SAEs have worse intermediate reconstruction loss than local. I would've predicted the opposite actually.
3. e2e+downstream seems to get all the benefits of the e2e one (same perf at lower L0) at the same compute cost, w/o the "intermediate activations aren't similar" problem.
It looks like you've left post-training SAE_local on KL or downstream loss for future work, but that's a very interesting part! Specifically how well it approximates SAE_e2e+downstream as you train on more tokens.
Did y'all try ablations on SAE_e2e+downstream? For example, only training on the next layer's reconstruction loss, or the next N layers' rec loss?
Great work!
Did you ever run just the L0-approx & the sparsity-frequency penalty separately? It's unclear if you're getting better results because the L0 function is better or because there are fewer dead features.
Also, a feature frequency of 0.2 is very large! 1/5 tokens activating is large even for positional features (because your context length is 128). It'd be bad if the improved results are because polysemanticity is sneaking back in through these activations. Sampling datapoints across a range of activations should show where the meaning becomes polysemantic. Is it the bottom 10%? (or activations at 10% of the max-activating example, which is my preferred method)
For comparing CE-difference (or the mean reconstruction score), did these have similar L0's? If not, it's an unfair comparison (higher L0 is usually higher reconstruction accuracy).
Seems tangential. I interpreted loss recovered as CE-related (not reconstruction-related).
Could you go into more details on how this would work? For example, Sam Altman wants to raise more money, but can't raise as much since Claude-3 is better. So he waits to raise more money after releasing GPT-5 (so no change in behavior except when to raise money).
If you argue it means releasing GPT-5 sooner, that time has to come from somewhere. For example, suppose GPT-4 was release-ready by February, but they wanted to wait until Pi Day for fun. Capability researchers are still researching capabilities in the meantime regardless, even if they were pressured & instead released 1 month earlier.
Maybe arguing that earlier access allows more API access so more time finagling w/ scaffolding?
I've only done replications on the mlp_out & attn_out for layers 0 & 1 for gpt2 small & pythia-70M
I chose same cos-sim instead of epsilon perturbations. My KL divergence is on a log plot, because one KL is ~2.6 for random perturbations.
I'm getting different results for GPT-2 attn_out Layer 0. My random perturbation has a very large KL. This was replicated last week when I was checking how robust GPT2 vs Pythia is to perturbations in input (picture below). I think both results are actually correct, but my perturbation is at a low cos-sim (which, as you can see below, shoots up for very small cos-sim diffs). This is further substantiated by my SAE KL divergence for that layer being 0.46, which is larger than the SAE you show.
Your main results were on the residual stream, so I can try to replicate there next.
For my perturbation graph:
I add noise to change the cos-sim, but keep the norm at around 0.9 (which is similar to my SAE's). GPT2 layer 0 attn_out really is an outlier in non-robustness compared to other layers. The results here show that different layers have different levels of robustness to noise for downstream CE loss. Combining w/ your results, it would be nice to add points for the SAE's cos-sim/CE.
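For reference, the perturbation I use is roughly this (a per-vector sketch, not the exact code; the 0.9 norm scaling is the assumption mentioned above):

import torch
import torch.nn.functional as F

def perturb(x: torch.Tensor, target_cos: float, norm_scale: float = 0.9) -> torch.Tensor:
    # x: a single activation vector [d_model]. Mix in noise (orthogonalized against x)
    # to hit a target cos-sim with x, then rescale to norm_scale * ||x|| (~ my SAE recon norms).
    noise = torch.randn_like(x)
    noise = noise - (noise @ x) / (x @ x) * x   # keep only the component orthogonal to x
    direction = target_cos * F.normalize(x, dim=-1) + (1 - target_cos**2) ** 0.5 * F.normalize(noise, dim=-1)
    return direction * norm_scale * x.norm()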
An alternative hypothesis to yours is that SAE's outperform random perturbation at lower cos-sim, but suck at higher-cos-sim (which we care more about).
Throughout this post, I kept thinking about Soul-Making Dharma (which I'm familiar with, but not very good at!)
AFAIK, it's about building up the skill of having a full-body awareness (ie instead of the breath at the nose as an object, you place attention on the full body + some extra space, like your "aura"), which gives you much more complete information about the felt sense of different things. For example, when you think of different people, they have different "vibes" that come up as physical sensations in the body, which you can access more fully by paying attention to full-body awareness.
The teachers then went on a lot about sacredness & beauty, which seemed most relevant to attunement (although I didn't personally practice those methods due to lack of commitment)
However, having full-body awareness was critical for me to have any success in any of the soul-making meditation methods & is mentioned as a pre-requisite for the course. Likewise, attunement may require skills in feeling your body/ noticing felt senses.
Agreed. You would need to change the correlation code to hardcode feature correlations, then you can zoom in on those two features when doing the max cosine sim.