StefanHex's Shortform

post by StefanHex (Stefan42) · 2024-07-05T14:31:09.892Z · LW · GW · 16 comments


Comments sorted by top scores.

comment by StefanHex (Stefan42) · 2024-11-19T19:45:27.388Z · LW(p) · GW(p)

Collection of some mech interp knowledge about transformers:

Writing up folk wisdom & recent results, mostly for mentees and as a link to send to people. Aimed at people who are already a bit familiar with mech interp. I've just quickly written down what came to my head, and may have missed or misrepresented some things. In particular, the last point is very brief and deserves a much more expanded comment at some point. The opinions expressed here are my own and do not necessarily reflect the views of Apollo Research.

Transformers take in a sequence of tokens, and return logprob predictions for the next token. We think it works like this:

  1. Activations represent a sum of feature directions, each direction representing some semantic concept. The magnitude of each direction corresponds to the strength or importance of the concept.
    1. These features may be 1-dimensional, but maybe multi-dimensional features make sense too. We can either allow for multi-dimensional features (e.g. circle of days of the week), acknowledge that the relative directions of feature embeddings matter (e.g. considering days of the week as individual features that span a circle), or both. See also Jake Mendel's post [LW · GW].
    2. The concepts may be "linearly" encoded, in the sense that two concepts A and B being present (say with strengths α and β) are represented as α*vector_A + β*vector_B. This is the key assumption of the linear representation hypothesis. See Chris Olah & Adam Jermyn but also Lewis Smith [LW · GW].
  2. The residual stream of a transformer stores information the model needs later. Attention and MLP layers read from and write to this residual stream. Think of it as a kind of "shared memory", with this picture in your head, from Anthropic's famous AMFTC.
    1. This residual stream seems to slowly accumulate information throughout the forward pass, as suggested by LogitLens [LW · GW].
    2. Additionally, we expect there to be internally-relevant information inside the residual stream, such as whether the sequence of nouns in a sentence is ABBA or BABA [LW · GW].
    3. Maybe think of each transformer block / layer as doing a serial step of computation. Though note that layers don't need to be privileged points between computational steps, a computation can be spread out over layers (Lee Sharkey's CLDR arguments, Anthropic's Crosscoder-motivation image)
  3. Superposition. There can be more features than dimensions in the vector space, corresponding to almost-orthogonal directions. Established in Anthropic's TMS. You can have a mix as well. See Chris Olah's post on distributed representations for a nice write-up.
    1. Superposition requires sparsity, i.e. that only a few features are active at a time.
  4. The model starts with token (and positional) embeddings.
    1. We think token embeddings mostly store features that might be relevant about a given token (e.g. words in which it occurs and what concepts they represent). The meaning of a token depends a lot on context.
    2. We think positional embeddings are pretty simple (in GPT2-small, but likely also other models). In GPT2-small they appear to encode ~4 dimensions worth of positional information, consisting of "is this the first token", "how late in the sequence is it", plus two sinusoidal directions. The latter three create a helix [LW · GW].
      1. PS: If you try to train an SAE on the full embedding you'll find this helix split up into segments ("buckets") as individual features (e.g. here [LW · GW]). Pay attention to this bucket-ing as a sign of compositional representation.
  5. The overall Transformer computation is said to start with detokenization: accumulating context and converting the pure token representation into a context-aware representation of the meaning of the text. Early layers in models often behave differently from the rest. Lad et al. claim three more distinct stages but that's not consensus.
  6. There's a couple of common motifs we see in LLM internals, such as
    1. LLMs implementing human-interpretable algorithms.
      1. Induction heads (paper, good illustration [LW · GW]): attention heads being used to repeat sequences seen previously in context. This can reach from literally repeating text to maybe being generally responsible for in-context learning.
      2. Indirect object identification, docstring completion [LW · GW]. Importantly don't take these early circuits works to mean "we actually found the circuit in the model" but rather take away "here is a way you could implement this algorithm in a transformer" and maybe the real implementation looks something like it.
        1. In general we don't think this manual analysis scales to big models (see e.g. Tom Lieberum's paper)
        2. Also we want to automate the process, e.g. ACDC and follow-ups (1, 2).
        3. My personal take is that all circuits analysis is currently not promising because circuits are not crisp. With this I mean the observation that a few distinct components don't seem to be sufficient to explain a behaviour, and you need to add more and more components, slowly explaining more and more performance. This clearly points towards us not using the right units to decompose the model. Thus, model decomposition is the major area of mech interp research right now.
    2. Moving information. Information is moved around in the residual stream, from one token position to another. This is what we see in typical residual stream patching experiments, e.g. here [LW · GW].
    3. Information storage. Early work (e.g. Mor Geva) suggests that MLPs can store information as key-value memories; generally folk wisdom is that MLPs store facts. However, those facts seem to be distributed and non-trivial to localise (see ROME & follow-ups, e.g. MEMIT). The DeepMind mech interp team tried [AF · GW] and wasn't super happy with their results.
    4. Logical gates. We think models calculate new features from existing features by computing e.g. AND and OR gates. Here we show a bunch of features that look like that is happening, and the papers by Hoagy Cunningham & Sam Marks show computational graphs for some example features.
    5. Activation size & layer norm. GPT2-style transformers have a layer normalization layer before every Attn and MLP block. Also, the norm of activations grows throughout the forward pass [LW · GW]. Combined, this means old features become less important over time; Alex Turner has thoughts on this.
      1. There are hypotheses on what layer norm could be responsible for, but it can't do anything substantial since you can run models without it (e.g. TinyModel, GPT2_noLN [LW · GW])
  7. (Sparse) circuits agenda. The current mainstream agenda in mech interp (see e.g. Chris Olah's recent talk) is to (1) find the right components to decompose model activations, to (2) understand the interactions between these features, and to finally (3) understand the full model.
    1. The first big open problem is how to do this decomposition correctly. There's plenty of evidence that the current Sparse Autoencoders (SAEs) don't give us the correct solution, as well as conceptual issues. I'll not go into the details here to keep this short-ish.
    2. The second big open problem is that the interactions, by default, don't seem sparse. This is expected if there are multiple ways (e.g. SAE sizes) to decompose a layer, and adjacent layers aren't decomposed correspondingly. In practice this means that one SAE feature seems to affect many many SAE features in the next layers, more than we can easily understand. Plus, those interactions seem to be not crisp which leads to the same issue as described above.
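
To make points 1 and 3 above a bit more concrete, here is a minimal sketch of linear representation plus superposition in plain Python. Everything in it is an assumption for illustration: the dimensions, the random Gaussian feature directions, and the feature strengths are made up, and real models need not arrange their features this way. It packs more feature directions than dimensions into a vector space, builds an activation as a weighted sum of a few active directions, and reads the features back off with dot products.

```python
import math
import random

random.seed(0)

D = 512            # dimensions of the activation space (illustrative)
N_FEATURES = 1024  # more features than dimensions
K_ACTIVE = 3       # sparsity: only a few features are active at once


def random_unit_vector(dim):
    """Sample a random direction; in high dimensions these are almost orthogonal."""
    v = [random.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


directions = [random_unit_vector(D) for _ in range(N_FEATURES)]

# Choose the active features and their strengths (the alpha, beta of the text).
active = random.sample(range(N_FEATURES), K_ACTIVE)
strengths = dict(zip(active, [2.0, 1.5, 1.0]))

# Linear representation: activation = sum_i strength_i * direction_i
activation = [0.0] * D
for i, s in strengths.items():
    activation = [a + s * d for a, d in zip(activation, directions[i])]

# Read every feature with a dot product. Active features come out near their
# strength; inactive ones only pick up small interference from the overlap.
readout = [dot(activation, d) for d in directions]
top = sorted(range(N_FEATURES), key=lambda i: readout[i], reverse=True)[:K_ACTIVE]
```

If many more features were active at once, the interference terms would add up and the readout would get noisy, which is the intuition behind the sparsity requirement.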
Replies from: mr-hire, Particleman, andrei-alexandru-parfeni
comment by Matt Goldenberg (mr-hire) · 2024-11-19T19:50:13.390Z · LW(p) · GW(p)

this is great, thanks for sharing

comment by aribrill (Particleman) · 2024-11-19T20:56:00.071Z · LW(p) · GW(p)

Thanks for the great writeup.

Superposition ("local codes") require sparsity, i.e. that only few features are active at a time.

Typo: I think you meant to write distributed, not local, codes. A local code is the opposite of superposition.

Replies from: Stefan42
comment by StefanHex (Stefan42) · 2024-11-19T21:35:47.577Z · LW(p) · GW(p)

Thanks! You're right, totally mixed up local and dense / distributed. Decided to just leave out that terminology

comment by sunwillrise (andrei-alexandru-parfeni) · 2024-11-19T20:31:08.478Z · LW(p) · GW(p)

We think it works like this

Who is "we"? Is it:

  1. only you and your team?
  2. the entire Apollo Research org?
  3. the majority of mechinterp researchers worldwide?
  4. some other group/category of people?

Also, this definitely deserves to be made into a high-level post, if you end up finding the time/energy/interest in making one.

Replies from: Stefan42
comment by StefanHex (Stefan42) · 2024-11-19T21:41:50.802Z · LW(p) · GW(p)

Thanks for the comment!

I think this is what most mech interp researchers more or less think. Though I definitely expect many researchers would disagree with individual points, and it does not fairly weigh all views and aspects (it's very biased towards "people I talk to"). (Also this is in no way an Apollo / Apollo interp team statement, just my personal view.)

comment by StefanHex (Stefan42) · 2024-07-05T16:57:15.142Z · LW(p) · GW(p)

Are the features learned by the model the same as the features learned by SAEs?

TL;DR: I want model-features to be a property of the model weights, and to be recognizable without access to the full dataset. Toy models have that property. My “poor man’s model-features” [LW · GW] have it. I want to know whether SAE-features have this property too, or if SAE-features do not match the model-features.

Introduction: Neural networks likely encode features in superposition. That is, features are represented as directions in activation space, and the model likely tracks many more features than dimensions in activation space. Because features are sparse, it should still be possible for the model to recover and use individual feature values.[1]

Problem statement: The prevailing method for finding these features is Sparse Autoencoders (SAEs). SAEs are well-motivated because they do recover superposed features in toy models. However, I am not certain whether SAEs recover the features of LLMs. I am worried (though not confident) that SAEs recover the features of the dataset rather than those of the model, and that we are thus overconfident in how much SAEs tell us.

SAE failure mode: SAEs are trained to achieve a certain compression[2] task: Compress activations into a sparse overcomplete basis, and reconstruct the original activations based on this compressed representation. The solution to this problem can be identical to what the neural network does (wanting to store & use information), but it is not necessarily so. In TMS, the network’s only objective is to compress features, so it is natural that the SAE-features match the model-features. But LLMs solve a different task (well, we don’t have a good idea what LLMs do), and training an SAE on a model’s activations might yield a basis different from the model-features (see hypothetical Example 1 below).
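
For readers who want the compression task spelled out, here is a minimal sketch of an SAE forward pass and loss in plain Python. It assumes the common ReLU encoder with an L1 sparsity penalty, which is one standard variant and not necessarily the one any particular paper uses. The sizes, the penalty weight, and the random (untrained) weights are all hypothetical.

```python
import random

random.seed(0)

D_MODEL = 4     # activation dimension (tiny, for illustration)
D_SAE = 8       # overcomplete dictionary: more latents than dimensions
L1_COEFF = 0.1  # sparsity penalty weight (hypothetical value)


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


# Randomly initialised weights; a real SAE would learn these by gradient descent.
W_enc = [[random.gauss(0, 0.5) for _ in range(D_MODEL)] for _ in range(D_SAE)]
b_enc = [0.0] * D_SAE
W_dec = [[random.gauss(0, 0.5) for _ in range(D_SAE)] for _ in range(D_MODEL)]
b_dec = [0.0] * D_MODEL


def sae_forward(activation):
    """Return (latents, reconstruction) for one activation vector."""
    pre = [p + b for p, b in zip(matvec(W_enc, activation), b_enc)]
    latents = [max(0.0, p) for p in pre]  # ReLU keeps latents non-negative
    recon = [r + b for r, b in zip(matvec(W_dec, latents), b_dec)]
    return latents, recon


def sae_loss(activation):
    """Reconstruction error plus an L1 penalty pushing latents towards sparsity."""
    latents, recon = sae_forward(activation)
    mse = sum((a - r) ** 2 for a, r in zip(activation, recon)) / len(activation)
    l1 = sum(latents)  # latents are >= 0, so the sum equals the L1 norm
    return mse + L1_COEFF * l1


x = [random.gauss(0, 1) for _ in range(D_MODEL)]
latents, recon = sae_forward(x)
loss = sae_loss(x)
```

Note that nothing in this objective refers to the model's weights or downstream computation: the loss only depends on the distribution of activations the SAE is trained on, which is where the worry above comes from.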

Operationalisation of model-features (I’m tabooing “true features”): In the Toy Model of Superposition (TMS) the model’s weights are clearly adjusted to the feature directions. We can tell a feature from looking at the model weights. I want this to be a property of true SAE-features as well. Then I would be confident that the features are a property of the model, and not (only) of the dataset distribution. Concrete operationalisation:

  • I give you 5 real SAE-features, and 5 made-up features (with similar properties). Can you tell which features are the real ones? Without relying on the dataset (but you may use an individual prompt). Lindsey (2024) is some evidence, but would it distinguish the SAE-features from an arbitrary decomposition of the activations into 5 fake-features?

Why do I care? I expect that the model-features are, in some sense, the computational units of the model. I expect our understanding to be more accurate (and to generalize) if we understand what the model actually does internally (see hypothetical Example 2 below).

Is this possible? Toy models of computation in superposition seem to suggest that models give special treatment to feature directions (compared to arbitrary activation directions), for example the error correction described here. This may privilege the basis of model-features over other decompositions of activations. I discuss experiment proposals at the bottom.

Example 1: Imagine an LLM was trained on The Pile excluding Wikipedia. Now we train an SAE on the model’s activations on a different dataset including Wikipedia. I expect that the SAE will find Wikipedia-related features: For example, a Wikipedia-citation-syntax feature on a low level, or a Wikipedia-style-objectivity feature on a high level. I would claim that this is not a feature of the model: During training the model never encountered these concepts, so it has not reserved a direction in its superposition arrangement (think geometric shapes in Toy Model of Superposition) for this feature.

  • It feels like there is a fundamental distinction between a model (SGD) “deciding” whether to learn a feature (as it does in TMS) and an SAE finding a feature that was useful for compressing activations.

Example 2: Maybe an SAE trained on an LLM playing Civilization and Risk finds a feature that corresponds to “strategic deception” on this dataset. But actually the model does not use a “strategic deception” feature (instead strategic deception originates from, say, a “power dynamics” feature), and it just happens that the instances of strategic deception in those games clustered into a specific direction. If we now take this direction to monitor for strategic deception we will fail to notice other strategic deception originating from the same “power dynamics” feature.

  • If we had known that the model-features that were active during the strategic deception instances were the “power dynamics” (+ other) features, we would have been able to choose the right, better generalizing, deception detection feature.

Experiment proposals: I have explored [LW · GW] the abnormal effect that “poor man’s model-features” (sampled as the difference between two independent model activations) have on model outputs, and their relation to theoretically predicted noise suppression in feature activations. Experiments in Gurnee (2024) [LW · GW] and Lindsey (2024) suggest that SAE decoder errors and SAE-features also have an abnormal effect on the model. With the LASR Labs [LW · GW] team I mentor I want to explore whether SAE-features match the theoretical predictions, and whether the SAE-feature effects match those expected from model-features.

  1. ^

    I’ll ignore the “there’s more to activations than just features” point made e.g. here [LW · GW], that’s a separate discussion.

  2. ^

    I know the SAE basis is larger, but it is enforced to be sparse and thus cannot perfectly store the activations.

Replies from: Stefan42, Stefan42
comment by StefanHex (Stefan42) · 2024-07-05T18:16:24.102Z · LW(p) · GW(p)

My core request is that I want (SAE-)features to be a property of the model, rather than the dataset.

  • This can be misunderstood in the sense of taking issue with “If a concept is missing from the SAE training set, the SAE won’t find the corresponding feature.” -- no, this is fine, the model-feature exists but simply isn't found by the SAE.
  • What I mean to say is I take issue if “SAEs find a feature only because this concept is common in the dataset rather than because the model uses this concept.”[1] -- in my books this is SAEs making up features and that won't help us understand models
  1. ^

    Of course a concept being common in the model-training-data makes it likely (?) to be a concept the model uses, but I don’t think this is a 1:1 correspondence. (So just making the SAE training set equal to the model training set wouldn’t solve the issue.)

comment by StefanHex (Stefan42) · 2024-07-05T18:11:27.232Z · LW(p) · GW(p)

There is a view that SAE features are just a useful tool for describing activations (interpretable features) and manipulating activations (useful for steering and probing). That SAEs are just a particularly good method in a larger class of methods, but not uniquely principled. In that case I wouldn't expect this connection to model behaviour.

But the claim that we often make is that the model sees and understands the world as a set of model-features, and that we can see the same features by looking at SAE-features of the activations. And then I want to see the extra evidence.

comment by StefanHex (Stefan42) · 2024-11-19T20:01:02.221Z · LW(p) · GW(p)

Why I'm not too worried about architecture-dependent mech interp methods:

I've heard people argue that we should develop mechanistic interpretability methods that can be applied to any architecture. While this is certainly a nice-to-have, and maybe a sign that a method is principled, I don't think this criterion itself is important.

I think that the biggest hurdle for interpretability is to understand any AI that produces advanced language (>=GPT2 level). We don't know how to write a non-ML program that speaks English, let alone reasons, and we have no idea how GPT2 does it. I expect that doing this the first time is going to be significantly harder than doing this the 2nd time. Kind of how "understand an Alien mind" is much harder than "understand the 2nd Alien mind".

Edit: Understanding an image model (say Inception V1 CNN) does feel like a significant step down, in the sense that these models feel significantly less "smart" and capable than LLMs.

Replies from: Jozdien, beelal
comment by Jozdien · 2024-11-19T20:38:05.779Z · LW(p) · GW(p)

I've heard people argue that we should develop mechanistic interpretability methods that can be applied to any architecture.

I think the usual reason this claim is made is that the person making the claim thinks it's very plausible LLMs aren't the paradigm that leads to AGI. If that's the case, then interpretability that's indexed heavily on them gets us understanding of something qualitatively weaker than we'd like. I agree that there'll be some transfer, but it seems better and not-very-hard to talk about how well different kinds of work transfer.

comment by bilalchughtai (beelal) · 2024-11-19T23:33:02.487Z · LW(p) · GW(p)

Agreed. A related thought is that we might only need to be able to interpret a single model at a particular capability level to unlock the safety benefits, as long as we can make a sufficient case that we should use that model. We don't care inherently about interpreting GPT-4, we care about there existing a GPT-4 level model that we can interpret.

comment by StefanHex (Stefan42) · 2024-07-16T15:16:13.175Z · LW(p) · GW(p)

Has anyone tested whether feature splitting can be explained by composite (non-atomic) features [LW · GW]?

  • Feature splitting is the observation that SAEs with larger dictionary size find features that are geometrically (cosine similarity) and semantically (activating dataset examples) similar. In particular, a larger SAE might find multiple features that are all similar to each other, and to a single feature found in a smaller SAE.
    • Anthropic gives the example of the feature " 'the' in mathematical prose" which splits into features " 'the' in mathematics, especially topology and abstract algebra" and " 'the' in mathematics, especially complex analysis" (and others).

There are at least two hypotheses for what is going on.

  1. The “true features” are the maximally split features; the model packs multiple true features into superposition close to each other. Smaller SAEs approximate multiple true features as one due to limited dictionary size.
  2. The “true features” are atomic features, and split features are composite features made up of multiple atomic features. Feature splitting is an artefact of training the model for sparsity, and composite features could be replaced by linear combinations of a small number of other (atomic) features.

Anthropic conjectures hypothesis 1 in Towards Monosemanticity. Demian Till argues for hypothesis 2 in this post [LW · GW]. I find Demian’s arguments compelling. The key idea is that an SAE can achieve lower loss by creating composite features for frequently co-occurring concepts: The composite feature fires instead of two (or more) atomic features, providing a higher sparsity (lower sparsity penalty) at the cost of taking up another dictionary entry (worse reconstruction).
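
A small worked example of why the composite feature lowers the sparsity penalty. This is my own toy illustration, not taken from Demian's post: it assumes two orthogonal atomic directions, unit-norm decoder vectors, and an L1 penalty on the latent activations.

```python
import math

# Two hypothetical atomic features with orthogonal unit decoder directions.
atom_a = [1.0, 0.0]
atom_b = [0.0, 1.0]

# An activation in which both concepts are present with strength 1.
x = [1.0, 1.0]

# Option 1: explain x with the two atomic features.
atomic_acts = [1.0, 1.0]
atomic_l0 = sum(1 for a in atomic_acts if a > 0)  # number of active latents
atomic_l1 = sum(atomic_acts)                      # L1 sparsity penalty
recon_atomic = [atomic_acts[0] * a + atomic_acts[1] * b
                for a, b in zip(atom_a, atom_b)]

# Option 2: explain x with a single composite feature along (a + b) / |a + b|.
norm = math.sqrt(2.0)
composite = [(a + b) / norm for a, b in zip(atom_a, atom_b)]
composite_act = sum(xi * ci for xi, ci in zip(x, composite))  # projection of x
composite_l0 = 1
composite_l1 = composite_act
recon_composite = [composite_act * c for c in composite]

# Both options reconstruct x, but the composite one pays less sparsity penalty.
l1_saving = atomic_l1 - composite_l1
```

Here the atomic explanation has L0 = 2 and L1 = 2, while the composite explanation has L0 = 1 and L1 = √2 ≈ 1.41 for the same reconstruction. The price is one extra dictionary entry that could otherwise have represented a rarer concept.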

  • I think the composite feature hypothesis is plausible, especially in light of Anthropic’s Feature Completeness results in Scaling Monosemanticity. They find that not all model concepts are represented in SAEs, and that rarer concepts are less likely to be represented (they find an intriguing relation between number of alive features and feature frequency required to be represented in the SAE, likely related to the frequency-rank via Zipf’s law). I find it probable that the optimiser may dedicate extra dictionary entries to composite features of high-frequency concepts at the cost of representing low-frequency concepts.
  • This is bad for interpretability not (only) because low-frequency concepts are omitted, but because the creation of composite features requires the original atomic features to not fire anymore in the composite case.
    • Imagine there is a “deception” feature, and an “exam” feature. Now deception in exams is quite common, so the model learns a composite “deception in the context of exams” feature, and the atomic “deception” and “exam” features no longer fire in that case.
    • Then we can no longer use the atomic “deception” SAE direction as a reliable detector of deception, because it doesn’t fire in cases where the composite feature is active!

Do we have good evidence for the one or the other case?

We observe that split features often have high cosine similarity, but this is explained by both hypotheses. (Anthropic says features are clustered together because they’re similar. Demian Till’s hypothesis would claim that multiple composite features contain the same atomic features, again explaining the similarity.)

A naive test may be to test whether features can be explained by a sparse linear combination of other features, though I’m not sure how easy this would be to test.
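
One way such a naive test could look, as a rough sketch: greedily approximate a candidate feature direction with a handful of other dictionary directions and check how much is left unexplained. The matching-pursuit routine below is a generic sparse-approximation method that I am swapping in for illustration, and the toy dictionary and candidate vector are hypothetical. On real SAE decoders the directions are not orthogonal, so a proper sparse solver and a sensible baseline would be needed.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit(target, dictionary, max_atoms):
    """Greedily approximate `target` with at most `max_atoms` unit-norm vectors.

    Returns (coefficients keyed by dictionary index, norm of the residual).
    """
    residual = list(target)
    coeffs = {}
    for _ in range(max_atoms):
        scores = [dot(residual, atom) for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda i: abs(scores[i]))
        if abs(scores[best]) < 1e-12:
            break  # nothing left to explain
        coeffs[best] = coeffs.get(best, 0.0) + scores[best]
        residual = [r - scores[best] * a
                    for r, a in zip(residual, dictionary[best])]
    return coeffs, math.sqrt(dot(residual, residual))


# Toy dictionary of "other" features: the standard basis of a 4-dim space.
other_features = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

# A candidate composite feature built from features 0 and 2.
candidate = [0.6, 0.0, 0.8, 0.0]

coeffs, residual_norm = matching_pursuit(candidate, other_features, max_atoms=2)
```

A small residual with few atoms would be weak evidence that the candidate is composite. The hard part is the comparison: random directions can also be approximated to some degree by a few dictionary vectors, so the residual would need to be compared against that baseline.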


For reference, cosine similarity of SAE decoder directions in Joseph Bloom's GPT2-small SAEs, blocks.1.hook_resid_pre and blocks.10.hook_resid_pre, compared to random directions and random directions with the same covariance as typical activations.

Replies from: RGRGRG
comment by RGRGRG · 2024-09-12T04:42:03.434Z · LW(p) · GW(p)

I like this recent post about atomic meta-SAE features, I think these are much closer (compared against normal SAEs) to what I expect atomic units to look like:

https://www.lesswrong.com/posts/TMAmHh4DdMr4nCSr5/showing-sae-latents-are-not-atomic-using-meta-saes

comment by StefanHex (Stefan42) · 2024-07-05T14:31:10.104Z · LW(p) · GW(p)

I think we should think more about computation in superposition. What does the model do with features? How do we go from “there are features” to “the model outputs sensible things”? How do MLPs retrieve knowledge (as commonly believed) in a way compatible with superposition (knowing more facts than number of neurons)?

This post [LW · GW] (and paper) by @Kaarel [LW · GW], @jake_mendel [LW · GW], @Dmitry Vaintrob [LW · GW] (and @LawrenceC [LW · GW]) is the kind of thing I'm looking for, trying to lay out a model of how computation in superposition could work. It makes somewhat-concrete predictions [LW(p) · GW(p)] about the number and property of model features.

Why? Because (a) these feature properties may help us find the features of a model, and (b) a model of computation may be necessary if features alone are insufficient to address AI Safety (on the interpretability side).

comment by StefanHex (Stefan42) · 2024-11-19T19:54:13.991Z · LW(p) · GW(p)

Why I'm not that hopeful about mech interp on TinyStories models:

Some of the TinyStories models are open source, and manage to output sensible language while being tiny (say 64dim embedding, 8 layers). Maybe it'd be great to try and thoroughly understand one of those?

I am worried that those models simply implement a bunch of bigrams and trigrams, and that all their performance can be explained by boring statistics & heuristics. Thus we would not learn much from fully understanding such a model. Evidence for this is that the 1-layer variant, which due to its size can only implement bigrams & trigram-ish things, achieves a better loss than many of the tall smaller models (Figure 4). Thus it seems not implausible that most if not all of the performance of all the models could be explained by similarly simple mechanisms.
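
To illustrate what "just bigrams" would mean in terms of loss, here is a minimal count-based bigram baseline in plain Python. The toy corpus is made up and merely stands in for a formulaic dataset, and add-one smoothing is an arbitrary choice. The idea would be to compare a small model's next-token loss against a baseline like this (or a trigram version) on held-out TinyStories text. I have not run that comparison.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus standing in for a formulaic dataset.
corpus = (
    "once upon a time there was a cat . the cat was happy . "
    "once upon a time there was a dog . the dog was happy ."
).split()

vocab = sorted(set(corpus))

# Count how often each token follows each previous token.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1


def bigram_prob(prev, nxt):
    """P(nxt | prev) with add-one smoothing, so unseen pairs keep some mass."""
    counts = bigram_counts[prev]
    return (counts[nxt] + 1) / (sum(counts.values()) + len(vocab))


def bigram_loss(tokens):
    """Average next-token cross-entropy (in nats) of the bigram model."""
    pairs = list(zip(tokens, tokens[1:]))
    return -sum(math.log(bigram_prob(p, n)) for p, n in pairs) / len(pairs)


uniform_loss = math.log(len(vocab))  # loss of guessing uniformly at random
train_loss = bigram_loss(corpus)
```

On text this formulaic, even the smoothed bigram model beats uniform guessing on its own training data; the interesting question is how much of the gap between this kind of baseline and a TinyStories model's loss remains on held-out data.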

Folk wisdom is that the TinyStories dataset is just very formulaic and simple, and therefore models without any sophisticated methods can appear to produce sensible language. I haven't looked into this enough to understand whether e.g. TinyStories V2 (used by TinyModel) is sufficiently good to dispel this worry.