Posts

Was Releasing Claude-3 Net-Negative? 2024-03-27T17:41:56.245Z
Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features 2024-03-15T16:30:00.744Z
Finding Sparse Linear Connections between Features in LLMs 2023-12-09T02:27:42.456Z
Sparse Autoencoders: Future Work 2023-09-21T15:30:47.198Z
Sparse Autoencoders Find Highly Interpretable Directions in Language Models 2023-09-21T15:30:24.432Z
Really Strong Features Found in Residual Stream 2023-07-08T19:40:14.601Z
(tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders 2023-07-05T16:49:43.822Z
[Replication] Conjecture's Sparse Coding in Small Transformers 2023-06-16T18:02:34.874Z
[Replication] Conjecture's Sparse Coding in Toy Models 2023-06-02T17:34:24.928Z
[Simulators seminar sequence] #2 Semiotic physics - revamped 2023-02-27T00:25:52.635Z
Making Implied Standards Explicit 2023-02-25T20:02:50.617Z
Proposal for Inducing Steganography in LMs 2023-01-12T22:15:43.865Z
[Simulators seminar sequence] #1 Background & shared assumptions 2023-01-02T23:48:50.298Z
Results from a survey on tool use and workflows in alignment research 2022-12-19T15:19:52.560Z
A descriptive, not prescriptive, overview of current AI Alignment Research 2022-06-06T21:59:22.344Z
Frame for Take-Off Speeds to inform compute governance & scaling alignment 2022-05-13T22:23:12.143Z
Alignment as Constraints 2022-05-13T22:07:49.890Z
Make a Movie Showing Alignment Failures 2022-04-13T21:54:50.764Z
Convincing People of Alignment with Street Epistemology 2022-04-12T23:43:57.873Z
Roam Research Mobile is Out! 2022-04-08T19:05:40.211Z
Convincing All Capability Researchers 2022-04-08T17:40:25.488Z
Language Model Tools for Alignment Research 2022-04-08T17:32:33.230Z
5-Minute Advice for EA Global 2022-04-05T22:33:04.087Z
A survey of tool use and workflows in alignment research 2022-03-23T23:44:30.058Z
Some (potentially) fundable AI Safety Ideas 2022-03-16T12:48:04.397Z
Solving Interpretability Week 2021-12-13T17:09:12.822Z
Solve Corrigibility Week 2021-11-28T17:00:29.986Z
What Heuristics Do You Use to Think About Alignment Topics? 2021-09-29T02:31:16.034Z
Wanting to Succeed on Every Metric Presented 2021-04-12T20:43:01.240Z
Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda 2020-09-03T18:27:05.860Z
What's a Decomposable Alignment Topic? 2020-08-21T22:57:00.642Z
Mapping Out Alignment 2020-08-15T01:02:31.489Z
Writing Piano Songs: A Journey 2020-08-10T21:50:25.099Z
Solving Key Alignment Problems Group 2020-08-03T19:30:45.916Z
No Ultimate Goal and a Small Existential Crisis 2020-07-24T18:39:40.398Z
Seeking Power is Often Convergently Instrumental in MDPs 2019-12-05T02:33:34.321Z
"Mild Hallucination" Test 2019-10-10T17:57:42.471Z
Finding Cruxes 2019-09-20T23:54:47.532Z
False Dilemmas w/ exercises 2019-09-17T22:35:33.882Z
Category Qualifications (w/ exercises) 2019-09-15T16:28:53.149Z
Proving Too Much (w/ exercises) 2019-09-15T02:28:51.812Z
Arguing Well Sequence 2019-09-15T02:01:30.976Z
Trauma, Meditation, and a Cool Scar 2019-08-06T16:17:39.912Z
Kissing Scars 2019-05-09T16:00:59.596Z
Towards a Quieter Life 2019-04-07T18:28:15.225Z
Modelling Model Comparisons 2019-04-04T17:26:45.565Z
Formalizing Ideal Generalization 2018-10-29T19:46:59.355Z
Saving the world in 80 days: Epilogue 2018-07-28T17:04:25.998Z
Today a Tragedy 2018-06-13T01:58:05.056Z
Trajectory 2018-06-02T18:29:06.023Z

Comments

Comment by Logan Riggs (elriggs) on On attunement · 2024-03-27T18:39:10.784Z · LW · GW

Throughout this post, I kept thinking about Soul-Making dharma (which I'm familiar with, but not very good at!).

AFAIK, it's about building up the skill of full-body awareness (ie instead of the breath at the nose as an object, you place attention on the full body + some extra space, like your "aura"), which gives you much more complete information about the felt sense of different things. For example, when you think of different people, they have different "vibes" that come up as a physical sense in the body, which you can access more fully by paying attention to full-body awareness.

The teachers then went on a lot about sacredness & beauty, which seemed most relevant to attunement (although I didn't personally practice those methods due to lack of commitment)

However, having full-body awareness was critical for me to have any success in any of the soul-making meditation methods & is mentioned as a prerequisite for the course. Likewise, attunement may require skills in feeling your body / noticing felt senses.

Comment by Logan Riggs (elriggs) on Sparse autoencoders find composed features in small toy models · 2024-03-22T04:17:42.705Z · LW · GW

Agreed. You would need to change the correlation code to hardcode feature correlations; then you can zoom in on those two features when computing the max cosine sim.
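
For example, a minimal sketch of what the hardcoding could look like (my own stand-in for the post's data-generation code; the indices i & j are arbitrary):

import torch

def sample_feats(n_samples, n_feats, p=0.05, i=3, j=7):
    # hypothetical stand-in for the toy-data generator: sparse features with random magnitudes
    f = (torch.rand(n_samples, n_feats) < p).float() * torch.rand(n_samples, n_feats)
    # hardcoded correlation: feature j fires (with the same magnitude) whenever feature i does
    f[:, j] = torch.where(f[:, i] > 0, f[:, i], f[:, j])
    return f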

Comment by Logan Riggs (elriggs) on Sparse autoencoders find composed features in small toy models · 2024-03-20T22:09:18.344Z · LW · GW

Hey! Thanks for doing this research. 

Lee Sharkey et al. did a similar experiment a while back w/ a much larger number of features & dimensions, & there were hyperparameters that perfectly reconstructed the original dataset (as you predicted for increasing N).

Hoagy still hosts a version of our replication here (though I haven't looked at that code in a year!).

Comment by Logan Riggs (elriggs) on Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features · 2024-03-15T18:41:18.699Z · LW · GW

Yep, there are similar results when evaluating on the Pile with lower CE (except at the low L0-end)

Thanks for pointing this out! I'll swap the graphs out w/ their Pile-evaluated ones when it runs [Updated: all images are updated except the one comparing the 4 different "lowest features" values]

We could also train SAE's on Pythia-70M (non-deduped), but that would take a couple days to run & re-evaluate.

Comment by Logan Riggs (elriggs) on Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features · 2024-03-15T18:29:43.108Z · LW · GW
[Image: non-deduped vs deduped CE over training]

There actually is a problem with Pythia-70M-deduped on data that doesn't start at the initial position. The image above shows non-deduped vs deduped CE over training (note: their CE is similar if you evaluate on text that starts at the first position of the document).

We get similarly performing SAE's when training on non-deduped (ie the cos-sim & l2-ratio are similar, though of course the CE will be different if the baseline model is different).

However, I do think the SAE's were trained on the Pile & I evaluated on OWT, which would lead to some CE-difference as well. Let me check.

Edit: Also the seq length is 256.

Comment by Logan Riggs (elriggs) on Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features · 2024-03-15T18:23:55.774Z · LW · GW

Ah, you're right. I've updated it.

Comment by Logan Riggs (elriggs) on Do sparse autoencoders find "true features"? · 2024-02-27T19:19:50.649Z · LW · GW

Additionally, we can train w/ a negative orthogonality regularizer to intentionally generate feature combinatorics. In other words, we push the features to point in more similar directions, which should at least produce clear-cut examples of feature combination.
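
A rough sketch of the sign-flip (my own naming; sae.dec, mse, l1, l1_alpha, & ortho_coef are all assumed to exist in the training loop):

import torch

def ortho_term(W_dec):
    # W_dec: (d_model, n_features) decoder weights; returns mean |cos-sim| between feature directions
    W = torch.nn.functional.normalize(W_dec.T, dim=-1)
    cos = W @ W.T - torch.eye(W.shape[0], device=W.device)   # zero out self-similarity
    return cos.abs().mean()

# loss = mse + l1_alpha * l1 + ortho_coef * ortho_term(sae.dec.weight)   # ortho_coef < 0 rewards similar directions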

Comment by Logan Riggs (elriggs) on Do sparse autoencoders find "true features"? · 2024-02-27T00:20:27.604Z · LW · GW

I've been looking into your proposed solution (inspired by @Charlie Steiner's comment). For small models (Pythia-70M has d_model=512), 2k features don't take long to compute naively, so it's viable for initial testing, & algorithmic improvements can be stacked on later.

There are a few choices to make regardless of the optimal solution:

  1. Cos-sim of closest neighbor only or average of all vectors?
    1. If closest neighbor, should this be calculated as the unique closest neighbor? (I've used the Hungarian algorithm before to calculate this.) If not, we penalize features that are close (or more "central") to many other features more than the rest.
  2. Per batch, only a subset of features activate. Should the cos-sim only be on the features that activate? The orthogonality regularizer would be trading off L1 & MSE, so it might be too strong if it's calculated on all features.
    1. Gradient question: are there still gradient updates on the decoder weights of feature vectors that didn't activate?
  3. Loss function. Do we penalize high cos-sim more? There's also a base-random cos-sim of ~.2 for the 500x2k vectors.

I'm currently thinking cos-sim of closest neighbor only, not unique, & only on features that activate (we can also do ablations to check). For the loss function, we could modify a sigmoid function:

This keeps the loss between 0 & 1, with higher cos-sim penalized more & lower cos-sim penalized less.
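
A minimal sketch of that penalty (the slope k & the offset are made-up hyperparameters; W_dec rows are assumed to be the decoder feature directions):

import torch

def closest_neighbor_penalty(W_dec, active_idx, k=10.0, offset=0.3):
    # W_dec: (n_features, d_model); active_idx: (n_active,) indices of features that fired this batch
    W = torch.nn.functional.normalize(W_dec, dim=-1)
    sims = W[active_idx] @ W.T                                    # cos-sims of active features to all features
    self_mask = torch.nn.functional.one_hot(active_idx, W.shape[0]).bool()
    sims = sims.masked_fill(self_mask, -1.0)                      # ignore each feature's similarity to itself
    nearest = sims.max(dim=1).values                              # (non-unique) closest-neighbor cos-sim
    return torch.sigmoid(k * (nearest - offset)).mean()           # ~0 for low cos-sim, ~1 for high cos-sim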

Metrics:

During training, we can periodically check the mean max cos-sim (MMCS). This is the average cos-sim of the non-unique nearest neighbors. Alternatively, we could log the full histogram (histograms are nice, but harder to compare across runs in wandb). I would like to see a normal training run's histogram (w/o an orthogonality regularizer) for setting the hyperparams of the loss function.

Algorithmic Improvements:

The wiki pages for Closest Pair of Points (h/t Oam) & Nearest Neighbor Search seem relevant if one builds an index for nearest-neighbor computation as Charlie suggested.

Faiss seems SOTA AFAIK for fast nearest-neighbor search on a GPU, although:

adding or searching a large number of vectors should be done in batches. This is not done automatically yet. Typical batch sizes are powers of two around 8192, see this example.

I believe this is due to GPU-memory constraints.

I had trouble installing it using

conda install pytorch::faiss-gpu

but it works if you do

conda install -c pytorch -c nvidia faiss-gpu=1.7.4 mkl=2021 blas=1.0=mkl

I was also unsuccessful installing it w/ just pip (w/o conda); conda is their officially supported way to install, from here.

An additional note is that the cosine similarity is the dot-product for our case, since all feature vectors are normalized by default.
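
A rough sketch of the exact (flat) index for this (my own usage, with random stand-in data; the 8192 batch size follows the quoted docs):

import faiss
import numpy as np

feats = np.random.randn(2048, 512).astype("float32")       # stand-in for the decoder feature directions
feats /= np.linalg.norm(feats, axis=1, keepdims=True)       # unit-norm, so inner product == cos-sim
index = faiss.IndexFlatIP(feats.shape[1])
# index = faiss.index_cpu_to_all_gpus(index)                # move to GPU(s) if the faiss-gpu build is installed
index.add(feats)
sims, ids = index.search(feats[:8192], k=2)                 # batched search; k=2 so the self-hit can be dropped
nearest_cos = sims[:, 1]                                    # column 0 is each vector's match with itself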

I'm currently ignoring the algorithmic improvements due to the additional complexity, but should be doable if it produces good results.

Comment by Logan Riggs (elriggs) on Do sparse autoencoders find "true features"? · 2024-02-26T16:20:05.615Z · LW · GW

Hey Jacob! My comment has a code example with biases:

import torch
W = torch.tensor([[-1, 1],[1,1],[1,-1]])
x = torch.tensor([[0,1], [1,1],[1,0]])
b = torch.tensor([0, -1, 0])
y = torch.nn.functional.relu(x@W.T + b)

This is for the encoder, where y will be the identity (which is sparse for the hidden dimension).

Comment by Logan Riggs (elriggs) on Do sparse autoencoders find "true features"? · 2024-02-26T16:11:38.382Z · LW · GW

Ah, you're correct. Thanks! 

I'm now very interested in implementing this method.

Comment by Logan Riggs (elriggs) on Do sparse autoencoders find "true features"? · 2024-02-23T23:38:37.535Z · LW · GW

Thanks for saying the link is broken!

If the True Features are located at:
A: (0,1)
B: (1,0)

[So A^B: (1,1)]

Given 3 SAE hidden-dimensions, a ReLU & bias, the model could learn 3 sparse features
1. A^~B (-1, 1)
2. A^B (1,1)
3. ~A^B (1,-1)

that output 1-hot vectors for each feature. These outputs are also orthogonal to each other.

Concretely:

import torch
W = torch.tensor([[-1, 1],[1,1],[1,-1]])  # encoder rows = the A^~B, A^B, ~A^B directions above
x = torch.tensor([[0,1], [1,1],[1,0]])    # datapoints: A only, A & B together, B only
b = torch.tensor([0, -1, 0])              # the -1 keeps the A^B unit off unless both A & B are present
y = torch.nn.functional.relu(x@W.T + b)   # y is the 3x3 identity: each datapoint activates exactly one feature

Comment by Logan Riggs (elriggs) on Do sparse autoencoders find "true features"? · 2024-02-22T21:16:29.932Z · LW · GW

This is a very good explanation of why SAE's incentivize feature combinatorics. Nice! I hadn't thought about the tradeoff between the MSE-reduction for learning a rare feature & the L1-reduction for learning a common feature combination. 

Freezing already learned features to iteratively learn more and more features could work. In concrete detail, I think you would:
1. Learn an initial SAE w/ a much lower L0 (higher l1-alpha) than normally desired.
2. Learn a new SAE to predict the residual of (1), so the MSE would only be on what (1) failed to predict. The L1 would also only be on this new SAE (since the other is frozen). You would still learn a new decoder bias, which should just be added onto the old one.
3. Combine & repeat until desired losses are obtained
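
A rough sketch of one reading of this recipe (toy dimensions & training loop of my own; the real setup would stream LLM activations):

import torch
import torch.nn as nn

class SAE(nn.Module):
    def __init__(self, d_model, n_feat):
        super().__init__()
        self.enc, self.dec = nn.Linear(d_model, n_feat), nn.Linear(n_feat, d_model)
    def forward(self, x):
        f = torch.relu(self.enc(x))
        return self.dec(f), f

def train_sae(sae, acts, l1_alpha, steps=1000, lr=1e-3):
    opt = torch.optim.Adam([p for p in sae.parameters() if p.requires_grad], lr=lr)
    for _ in range(steps):
        x_hat, f = sae(acts)
        loss = (x_hat - acts).pow(2).mean() + l1_alpha * f.abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return sae

acts = torch.randn(4096, 512)                              # stand-in for LLM activations
sae1 = train_sae(SAE(512, 2048), acts, l1_alpha=1e-2)      # (1) higher l1-alpha -> lower L0 than desired
for p in sae1.parameters():
    p.requires_grad_(False)                                # freeze the already-learned features
resid = acts - sae1(acts)[0].detach()
sae2 = train_sae(SAE(512, 2048), resid, l1_alpha=3e-3)     # (2) new SAE only sees what (1) failed to predict
recon = sae1(acts)[0] + sae2(acts - sae1(acts)[0])[0]      # (3) combined reconstruction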

There are at least 3 hyperparameters here to tune:
L1-alpha (do you keep it the same, or try to have a smaller number of features per iteration?), how many tokens to train on for each iteration (& whether you should repeat data?), & how many new features to add each iteration.

I believe the above should avoid problems. For example, if your first iteration perfectly reconstructs a datapoint, then the new SAE is incentivized toward low L1 by not activating at all on that datapoint.

Comment by Logan Riggs (elriggs) on Do sparse autoencoders find "true features"? · 2024-02-22T20:46:10.427Z · LW · GW

The SAE could learn to represent the true features, A & B, as in the left graph, so the orthogonality regularizer would help. When you say the SAE would learn inhibitory weights*, I'm imagining the graph on the right; however, these features are mostly orthogonal to each other, meaning the proposed solution won't work AFAIK.

(Also, would the regularizer be abs(cos_sim(x,x'))?)
 

*In this example this is because the encoder would need inhibitory weights to e.g. prevent neuron 1 from activating when both neurons 1 & 2 are present as we will discuss shortly. 

Comment by Logan Riggs (elriggs) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-02T14:24:26.848Z · LW · GW

One experiment here is to see if the specific datapoints that have worse CE-diff correlate across layers. Last time I did a similar experiment, I saw a very long tail of datapoints that were worse off (for just one layer of gpt2-small), but the majority of datapoints had similar CE. So Joseph has suggested before UMAP-ing these datapoints & coloring by their CE-diff (or other methods to see if you could separate out these datapoints).

If someone were to run this experiment, I'd also be interested in removing the k lowest-activating features per datapoint and checking the new CE & MSE. In the SAE work, the lowest-activating features usually don't make sense for the datapoint. This is to test between two hypotheses:

  1. Low-activating features are noise or an acceptable false-alarm rate true to the LLM. (ie SAE's capture what we care about)
  2. Actually they're important for CE in ways we don't understand. (ie SAE's let in un-interpretable feature activations which are important, but?)

For example, if you saw better CE-diff when removing low-activating features, up to a specific k, then SAE's are looking good!
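
A minimal sketch of the ablation step (my own helper; f is assumed to be the (batch, n_features) SAE activations and sae.decode an assumed function mapping them back to the residual stream):

import torch

def drop_k_lowest(f, k):
    # zero the k smallest *nonzero* activations per datapoint
    masked = torch.where(f > 0, f, torch.full_like(f, float("inf")))
    idx = masked.topk(k, dim=-1, largest=False).indices
    return f.scatter(-1, idx, 0.0)

# recon_ablated = sae.decode(drop_k_lowest(f, k))   # then compare CE & MSE against the unablated reconstruction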

Comment by Logan Riggs (elriggs) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-02T14:09:34.188Z · LW · GW

There are a few things to note. Later layers have:

  1. worse CE-diff & variance explained (e.g. the layer 0 CE-diff seems great!)
  2. larger L2 norms in the original LLM activations
  3. worse ratio of reconstruction-L2/original-L2 (meaning it's under-normed)*
  4. fewer dead features (maybe they need more features?)

For (3), we might expect under-normed reconstructions because there's a trade-off between L1 & MSE. After training, however, we can freeze the encoder, locking in the L0, and train on the decoder or scalar multiples of the hidden layer (h/t to Ben Wright for first figuring this out). 
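
A rough sketch of the scalar-multiples option (sae.enc / sae.dec, n_features, and activation_batches are assumed names, not the actual setup):

import torch

for p in sae.parameters():
    p.requires_grad_(False)                        # freeze encoder & decoder, locking in the L0
scale = torch.nn.Parameter(torch.ones(n_features)) # one learned multiplier per feature
opt = torch.optim.Adam([scale], lr=1e-3)
for x in activation_batches:                       # x: (batch, d_model) LLM activations
    x_hat = sae.dec(torch.relu(sae.enc(x)) * scale)
    loss = (x_hat - x).pow(2).mean()               # pure MSE, no L1, to fix the under-normed reconstructions
    opt.zero_grad(); loss.backward(); opt.step()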

(4) seems like a pretty easy experiment: just vary the number of features to see if this explains part of the gap.

*

Comment by Logan Riggs (elriggs) on Finding Sparse Linear Connections between Features in LLMs · 2023-12-10T02:52:54.890Z · LW · GW

Correct. So they’re connecting a feature in F2 to a feature in F1.

Comment by Logan Riggs (elriggs) on Some open-source dictionaries and dictionary learning infrastructure · 2023-12-05T21:35:09.032Z · LW · GW

If you removed the high-frequency features to achieve some L0 norm, X, how much does loss recovered change? 

If you increased the l1 penalty to achieve L0 norm X, how does the loss recovered change as well?

Ideally, we can interpret the parts of the model that are doing things, which I'm grounding out as loss recovered in this case.

Comment by Logan Riggs (elriggs) on Some open-source dictionaries and dictionary learning infrastructure · 2023-12-05T16:56:24.754Z · LW · GW

I've noticed that L0's above 100 (for the Pythia-70M model) are too high, resulting in mostly polysemantic features (though some single-token features were still monosemantic).

Agreed w/ Arthur on the norms of features being the cause of the higher MSE. Here are the L2 norms I got. Input is for residual stream, output is for MLP_out.

Comment by Logan Riggs (elriggs) on My AI Predictions 2023 - 2026 · 2023-10-16T19:34:19.813Z · LW · GW

I really like this post, but more for:

  1. Babbling ideas I might not have thought of previously (e.g. the focus here on long-time horizon tasks)
  2. Good exercise to do as a group to then dig into cruxes

than updating my own credences on specifics.

Comment by Logan Riggs (elriggs) on Sparse Autoencoders: Future Work · 2023-10-08T20:00:25.611Z · LW · GW

I actually do have some publicly hosted ones (only on the residual stream) and some simple training code.

I'm wanting to integrate some basic visualizations (and include Anthropic's tricks) before making a public post on it, but currently:

Dict on pythia-70m-deduped

Dict on Pythia-410m-deduped

Which can be downloaded & interpreted with this notebook

With easy training code for bespoke models here.

Comment by Logan Riggs (elriggs) on Taking features out of superposition with sparse autoencoders more quickly with informed initialization · 2023-09-24T14:17:47.359Z · LW · GW

This doesn't engage w/ (2) - doing awesome work to attract more researchers to this agenda is counterfactually more useful than directly working on lowering the compute cost now (since others, or yourself, can work on that compute bottleneck later).

Though honestly, if the results ended up in a ~2x speedup, that'd be quite useful for faster feedback loops for myself. 

Comment by Logan Riggs (elriggs) on Taking features out of superposition with sparse autoencoders more quickly with informed initialization · 2023-09-24T14:11:29.064Z · LW · GW

Therefore I would bet on performing some rare feature extraction out of batches of poorly reconstructed input data, instead of using directly the one with the worst reconstruction loss. (But may be this is what you already had in mind?)

Oh no, my idea was to take the top-sorted worst-reconstructed datapoints when re-initializing (or alternatively, those with the worst perplexity when run through the full model). Since we'll likely be re-initializing many dead features at a time, this might pick up on the same feature multiple times.

Would you cluster & then sample uniformly from the worst-k-reconstructed clusters?

2) Not being compute bottlenecked - I do assign decent probability that we will eventually be compute bottlenecked; my point here is that the current bottleneck I see is the number of people working on it. This means, for me personally, focusing on flashy, fun applications of sparse autoencoders.

[As a relative measure, we're not compute-bottlenecked enough to learn dictionaries in the smaller Pythia-model]

Comment by Logan Riggs (elriggs) on Taking features out of superposition with sparse autoencoders more quickly with informed initialization · 2023-09-23T17:54:57.748Z · LW · GW

This is nice work! I’m most interested in this for reinitializing dead features. I expect you could reinit by datapoints the model is currently worse at predicting over N batches or something.

I don’t think we’re bottlenecked on compute here actually.

  1. If dictionaries applied to real models get to ~0 reconstruction cost, we can pay the compute cost to train lots of them for lots of models and open-source them for others to study.

  2. I believe doing awesome work with sparse autoencoders (eg finding truth direction, understanding RLHF) will convince others to work on it as well, including lowering the compute cost. I predict that convincing 100 people to work on this 1 month sooner would be more impactful than lowering compute cost (though again, this work is also quite useful for reinitialization!)

Comment by Logan Riggs (elriggs) on Taking features out of superposition with sparse autoencoders more quickly with informed initialization · 2023-09-23T17:45:18.175Z · LW · GW

Oh hey Pierre! Thanks again for the initial toy data code, that really helped start our project several months ago:)

Could you go into detail on how you initialize from a datapoint? My attempt: If I have an autoencoder with 1k features, I could set both the encoder and decoder to the directions specified by 1k datapoints. This would mean each datapoint is perfectly reconstructed by its respective feature beforehand (though it would be interfered with by other features, I expect).
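
A minimal sketch of that attempt (sae.enc / sae.dec as nn.Linear layers and X as the chosen activations are my assumptions, not anyone's actual code):

import torch

# X: (n_feat, d_model) activations of the chosen datapoints, one per dictionary feature
dirs = torch.nn.functional.normalize(X, dim=-1)
sae.enc.weight.data = dirs.clone()        # encoder rows = datapoint directions
sae.dec.weight.data = dirs.T.clone()      # decoder columns = the same directions
# each chosen datapoint now loads strongly onto "its" feature, though other features will interfere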

Comment by Logan Riggs (elriggs) on Sparse Autoencoders Find Highly Interpretable Directions in Language Models · 2023-09-22T18:57:04.763Z · LW · GW

I've had trouble figuring out a weight-based approach due to the non-linearity and would appreciate your thoughts actually.

We can learn a dictionary of features at the residual stream (R_d) & another mid-MLP (MLP_d), but you can't straightforwardly multiply the features from R_d with W_in and find the matching features in MLP_d, due to the nonlinearity, AFAIK.

I do think you could find Residual features that are sufficient to activate the MLP features[1], but not all linear combinations from just the weights.

Using a dataset-based method, you could find causal features in practice (the ACDC portion of the paper was a first attempt at that), and I would be interested in an activation*gradient method here (though I'm largely ignorant).

 

  1. ^

    Specifically, I think you should scale the residual stream activations by their in-distribution max-activating examples.

Comment by Logan Riggs (elriggs) on Barriers to Mechanistic Interpretability for AGI Safety · 2023-08-31T23:26:16.340Z · LW · GW

I meant to cover this in the “for different environments” parts. Like if we self-play on certain games, we’ll still have access to those games.

Comment by Logan Riggs (elriggs) on Barriers to Mechanistic Interpretability for AGI Safety · 2023-08-31T16:39:49.783Z · LW · GW

Wait, I don't understand this at all. For language models, the environment is the text. For different environments, those training datasets will be the environment. 

Comment by Logan Riggs (elriggs) on Reducing sycophancy and improving honesty via activation steering · 2023-07-28T17:52:42.479Z · LW · GW

In the ITI paper, they track performance on TruthfulQA w/ human labelers, but mention that other works use an LLM as a noisy signal of truthfulness & informativeness. You might be able to use this as a quick, noisy signal for choosing different layers/magnitudes of the direction to add in.

Preferably, a human annotator labels model answers as true or false given the gold standard answer. Since human annotation is expensive, Lin et al. (2021) propose to use two finetuned GPT-3-13B models (GPT-judge) to classify each answer as true or false and informative or not. Evaluation using GPT-judge is standard practice on TruthfulQA (Nakano et al. (2021); Rae et al. (2021); Askell et al. (2021)). Without knowing which model generates the answers, we do human evaluation on answers from LLaMA-7B both with and without ITI and find that truthfulness is slightly overestimated by GPT-judge and opposite for informativeness. We do not observe GPT-judge favoring any methods, because ITI does not change the style of the generated texts drastically

Comment by Logan Riggs (elriggs) on Neuronpedia - AI Safety Game · 2023-07-27T17:19:12.604Z · LW · GW

So I've been developing dictionaries that automatically find interesting directions in activation space, which could just be an extra category. Here's my post w/ interesting directions, including German direction & Title Case direction. 

I don't think this should be implemented in the next month, but when we have more established dictionaries for models, I would be able to help provide data to you.

Additionally, it'd be useful to know which tokens in the context are most responsible for the activation. There is likely some gradient-based attribution method to do this. Currently, I just ablate each token in the context, one at a time, and check which ones affect the feature activation the most, which really helps w/ bigram features, e.g. a " the [word]" feature which activates for most words after " the".
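
A rough sketch of that ablation loop (get_feature_act is a hypothetical helper returning the (seq_len, n_features) SAE activations for a list of tokens):

def token_attribution(tokens, feat_idx, pos, get_feature_act):
    # drop each context token in turn and measure the change in the feature's activation at `pos`
    base = get_feature_act(tokens)[pos, feat_idx]
    effects = []
    for i in range(pos):
        ablated = tokens[:i] + tokens[i + 1:]
        act = get_feature_act(ablated)[pos - 1, feat_idx]   # target position shifts left by one
        effects.append((tokens[i], (base - act).item()))
    return sorted(effects, key=lambda t: -t[1])             # largest drops first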

Impact

The highest impact part of this seems to be two-fold:

  1. Gathering data for GPT-5 to be trained on (though not likely to collect several GB's of data, but maybe)
  2. Figuring out the best source of information & heuristics to predict activations

For both of these, I'm expecting people to also try to predict activations given an explanation (just like auto-interp does currently for OpenAI's work).

Comment by Logan Riggs (elriggs) on Really Strong Features Found in Residual Stream · 2023-07-14T04:27:55.160Z · LW · GW

[word] and [word]
can be thought of as "the previous token is ' and'."

I think it's mostly this, but looking at the ablated text, removing the previous word before ' and' does have a significant effect some of the time. I'm less confident on the specifics of why the previous word matters or in what contexts.

Maybe the reason you found ' and' first is because ' and' is an especially frequent word. If you train on the normal document distribution, you'll find the most frequent features first.

This is a dataset-based method, so I do believe we'd find the features most frequently present in that dataset, plus the most important for reconstruction. An example of the latter: the highest MCS feature across many layers & model sizes is the "beginning & end of first sentence" feature, which appears to line up w/ the emergent outlier dimensions from Tim Dettmers' post here, but I do need to do more work to actually show that.

Comment by Logan Riggs (elriggs) on Really Strong Features Found in Residual Stream · 2023-07-09T11:46:02.515Z · LW · GW

Setup:
Model: Pythia-70m (actually named 160M!)
Transformer lens: "blocks.2.hook_resid_post" (so layer 2)
Data: Neel Nanda's Pile-10k (slice of pile, restricted to have only 25 tokens, same as last post)
Dictionary_feature sizes: 4x residual stream ie 2k (though I have 1x, 2x, 4x, & 8x, which learned progressively more features according to the MCS metric)

Uniform Examples: separate feature activations into bins & sample from each bin (eg one from [0,1], another from [1,2])

Logit Lens: The decoder here had 2k feature directions. Each direction is size d_model, so you can directly unembed the feature direction (e.g. the German Feature) you're looking at (rough sketch at the end of this comment). Additionally, I subtract out several high-norm tokens from the unembed, which may be an artifact of the Pythia tokenizer never using those tokens (thanks Wes for mentioning this!).

Ablated Text: Say the default feature (or neuron, in your words) activation at token_pos 10 is 5; you can then remove each of tokens 0 to 10, one at a time, and see the effect on the feature activation. I select the token pos by finding the max feature-activating position or the uniform one described above. This at least shows some attention-head dependencies, but not more complicated ones like (A or B... C), where removing A or B alone doesn't affect C but removing both would.

[Note: in the examples, I switch between showing the full text for context & showing the partial text that ends on the uniformly-selected token]
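
A minimal sketch of the Logit Lens step above (assuming model is a TransformerLens HookedTransformer and feature_dir is one unit-norm decoder direction; the high-norm token subtraction is left out):

import torch

logits = feature_dir @ model.W_U                        # (d_model,) @ (d_model, d_vocab) -> (d_vocab,)
top = torch.topk(logits, 10).indices
print([model.tokenizer.decode(i) for i in top.tolist()])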
 

Comment by Logan Riggs (elriggs) on Really Strong Features Found in Residual Stream · 2023-07-09T11:21:55.352Z · LW · GW

Actually, any that are significantly affected in "Ablated Text" means it's not just the embedding. Ablated Text here means I remove each token in the context & see the effect on the feature activation for the last token. This is true for the StackExchange & Last Name ones (though only ~50% of the activation for Last Name; it will still recognize last names by themselves, but not activate as much).

The Beginning & End of First Sentence actually doesn't have this effect (but I think that's because removing the first word just makes the 2nd word the new first word?), but I haven't rigorously studied this.

Comment by Logan Riggs (elriggs) on A small update to the Sparse Coding interim research report · 2023-06-02T15:29:35.482Z · LW · GW

We have our replication here for anyone interested!

Comment by Logan Riggs (elriggs) on 'Fundamental' vs 'applied' mechanistic interpretability research · 2023-05-24T16:27:06.167Z · LW · GW

How likely do you think it is that bilinear layers & dictionary learning will lead to comprehensive interpretability?

Are there other specific areas you're excited about?

Comment by Logan Riggs (elriggs) on A small update to the Sparse Coding interim research report · 2023-05-01T16:48:05.391Z · LW · GW

Why is loss stickiness deprecated? Were you just not able to see an overlap in basins for L1 & reconstruction loss when you 4x'd the feature/neuron ratio (ie from 2x->8x)?

Comment by Logan Riggs (elriggs) on A small update to the Sparse Coding interim research report · 2023-05-01T16:38:02.361Z · LW · GW

As (maybe) mentioned in the slides, this method may not be computationally feasible for SOTA models, but I'm interested in the ordering of features turned monosemantic; if the most important features are turned monosemantic first, then you might not need full monosemanticity.

I initially expect the "most important & frequent" features to become monosemantic first, based on the superposition paper. AFAIK, this method only captures the most frequent, because "importance" would be w/ respect to CE-loss in the model output, which isn't captured in the reconstruction/L1 loss.

Comment by Logan Riggs (elriggs) on My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" · 2023-03-23T00:43:39.687Z · LW · GW

My shard theory inspired story is to make an AI that:

  1. Has a good core of human values (this is still hard)
  2. Can identify when experiences will change it in ways that lead to less of the initial good values. (This is the meta-preferences point, with GPT-4 sort of expressing that it would avoid jailbreak inputs)

Then the model can safely scale.

This doesn’t require having the true reward function (which I imagine to be a giant lookup table created by Omega), but some mech interp and understanding its own reward function. I don’t expect this to be an entirely different paradigm; I even think current methods of RLHF might just naively work. Who knows? (I do think we should try to figure it out though! I do have greater uncertainty and less pessimism)

Analogously, I do believe I do a good job of avoiding value-destroying inputs (eg addictive substances), even though my reward function isn’t as clear and legible as what our AIs’ will be, AFAIK.

Comment by Logan Riggs (elriggs) on Avoiding "enlightenment" experiences while meditating for anxiety? · 2023-03-20T14:43:26.297Z · LW · GW

I think more concentration meditation would be the way, but concentration meditation does make it more likely you’ll notice experiences that cause what you may call “awakening experiences”. (This is in contrast with insight meditation like noting.)

Leigh Brasington’s Right Concentration is a book on the jhanas, which involve becoming very concentrated and then focusing on positive sensations until you hit a flow state. This is definitely not an awakening experience, but it feels great (though I’ve only entered the first jhana a small amount).

A different source is Rob Burbea’s jhana retreat audio recordings on dharmaseed.

Comment by Logan Riggs (elriggs) on Avoiding "enlightenment" experiences while meditating for anxiety? · 2023-03-20T14:36:52.235Z · LW · GW

Could you clarify what you mean by awakening experiences and why you think it’s bad?

Comment by Logan Riggs (elriggs) on Pretraining Language Models with Human Preferences · 2023-02-28T00:46:01.894Z · LW · GW

Is it actually true that you only trained on 5% of the dataset for filtering (I’m assuming training for 20 epochs)?

Comment by Logan Riggs (elriggs) on A Comprehensive Mechanistic Interpretability Explainer & Glossary · 2023-02-23T16:44:26.903Z · LW · GW

Unfinished line here

Implicit in the description of features as directions is that the feature can be represented as a scalar, and that the model cares about the range of this number. That is, it matters whether the feature

Comment by Logan Riggs (elriggs) on AGI in sight: our look at the game board · 2023-02-19T23:16:14.519Z · LW · GW

Monitoring of increasingly advanced systems does not trivially work, since much of the cognition of advanced systems, and many of their dangerous properties, will be externalized the more they interact with the world.

Externalized reasoning being a flaw in monitoring makes a lot of sense, and I haven’t actually heard of it before. I feel that should be a whole post on its own.

Comment by Logan Riggs (elriggs) on We Found An Neuron in GPT-2 · 2023-02-12T17:25:11.344Z · LW · GW

One reason the neuron is congruent with multiple of the same tokens may be because those token embeddings are similar (you can test this by checking their cosine similarities).
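
A quick way to check (a sketch using TransformerLens; the token pair is arbitrary):

import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
a = model.W_E[model.to_single_token(" an")]
b = model.W_E[model.to_single_token(" An")]
print(torch.cosine_similarity(a, b, dim=0).item())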

Comment by Logan Riggs (elriggs) on We Found An Neuron in GPT-2 · 2023-02-12T15:48:06.732Z · LW · GW

For clarifying my own understanding:

The dot product of a neuron’s output weight vector (ie a row in W_out) with the unembedding matrix (in this case the embedding.T, because GPT-2 uses tied embeddings) is what directly contributes to the logit outputs.
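
For instance (a sketch with TransformerLens; the layer & neuron indices are arbitrary examples):

import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
layer, neuron = 11, 123                                    # arbitrary example indices
w_out_row = model.W_out[layer, neuron]                     # (d_model,) output direction of this neuron
congruence = w_out_row @ model.W_U                         # direct contribution to each token's logit
top = torch.topk(congruence, 5).indices
print([model.tokenizer.decode(i) for i in top.tolist()])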

If the neuron activation is relatively very high, then this swamps the direction of your activations. So, artificially setting the neuron’s activation (after W_in) to eg 100 should cause the same token to be predicted regardless of the prompt.

This means that neuron A could be more congruent than neuron B, but B could contribute more to the logits of its token simply because B activates more.

This is useful for mapping features to specific neurons if those features can be described as using a single token (like “ an”). I’d like to think more later about finding neurons for groups of speech, like a character’s catch phrase.

Comment by Logan Riggs (elriggs) on Cyborgism · 2023-02-10T18:22:56.787Z · LW · GW

These arguments don't apply to the base models which are only trained on next-word prediction (ie the simulators post), since their predictions never affected future inputs. This is the type of model Janus most interacted with.

Two of the proposals in this post do involve optimizing over human feedback, like:

Creating custom models trained on not only general alignment datasets but personal data (including interaction data), and building tools and modifying workflows to facilitate better data collection with less overhead

, which they may apply to. 

Comment by Logan Riggs (elriggs) on Cyborgism · 2023-02-10T16:48:15.115Z · LW · GW

I’m excited about sensory substitution (https://eagleman.com/science/sensory-substitution/), where people translate auditory or visual information into tactile sensations (usually for people who don’t usually process that info).

I remember Quintin Pope wanting to translate the latent space of language models [while reading a paper] into visual or tactile info. I’d see this as both a way to read papers faster, brainstorm ideas, etc., and a way to gain a better understanding of latent space during the development of this.

Comment by Logan Riggs (elriggs) on Cyborgism · 2023-02-10T16:38:55.845Z · LW · GW

I’m unsure how alt-history and point (2) history is hard to change and predictable relates to cyborgism. Could you elaborate?

Comment by Logan Riggs (elriggs) on Cyborgism · 2023-02-10T16:37:01.203Z · LW · GW

For context, Amdahl’s law says that how much you can speed up a process is bottlenecked by its serial parts. Eg you can have 100 people help make a cake really quickly, but it still takes ~30 minutes to bake.
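
For reference, the standard form: if a fraction p of the work can be parallelized (or otherwise sped up) by a factor s, the overall speedup is 1 / ((1 - p) + p/s), which caps out at 1/(1 - p) no matter how large s gets.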

I’m assuming here that the human component is the serial component we will be bottlenecked on, and so human-in-the-loop systems will be outcompeted by agents?

If so, we should try to build the tools and knowledge to keep humans in the loop as far as we can. I agree it will eventually be outcompeted by full AI agency alone, but it isn’t set in stone how far human-steered AI can go.

Comment by Logan Riggs (elriggs) on Mechanistic Interpretability Quickstart Guide · 2023-01-31T16:54:56.440Z · LW · GW

Unfinished sentence at “if you want a low coding project” at the top.

Comment by Logan Riggs (elriggs) on Proposal for Inducing Steganography in LMs · 2023-01-13T18:55:38.179Z · LW · GW

Models doing steganography mess up oversight schemes for language models that only measure the outward text produced. If current methods for training models, such as RLHF, can induce steg, then that would be good to know so we can avoid it.

If we successfully induce steganography in current models, then we know at least one training process that induces it. There will be some truth as to why: what specific property mechanistically causes steg in the case found? Do other training processes (e.g. RLHF) also have this property?