Posts

The Residual Expansion: A Framework for thinking about Transformer Circuits 2024-08-02T11:04:56.347Z
An Interpretability Illusion from Population Statistics in Causal Analysis 2024-07-29T14:50:19.497Z
Daniel Tan's Shortform 2024-07-17T06:38:07.166Z
Mech Interp Lacks Good Paradigms 2024-07-16T15:47:32.171Z
Activation Pattern SVD: A proposal for SAE Interpretability 2024-06-28T22:12:48.789Z

Comments

Comment by Daniel Tan (dtch1997) on You can remove GPT2’s LayerNorm by fine-tuning for an hour · 2024-08-12T09:03:37.351Z · LW · GW

Interesting stuff! I'm very curious as to whether removing layer norm damages the model in some measurable way. 

One thing that comes to mind is that previous work finds that the final LN is responsible for mediating 'confidence' through 'entropy neurons'; if you've trained sufficiently I would expect all of these neurons to not be present anymore, which then raises the question of whether the model still exhibits this kind of self-confidence-regulation

Comment by Daniel Tan (dtch1997) on The Residual Expansion: A Framework for thinking about Transformer Circuits · 2024-08-08T08:34:12.400Z · LW · GW

That makes sense to me. I guess I'm dissatisfied here because the idea of an ensemble seems to be that individual components in the ensemble are independent; whereas in the unraveled view of a residual network, different paths still interact with each other (e.g. if two paths overlap, then ablating one of them could also (in principle) change the value computed by the other path). This seems to be the mechanism that explains redundancy
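A minimal sketch of the path-interaction point, assuming a toy two-layer residual network with ReLU layers and identity weights (both chosen only for readability, not taken from the post). Ablating layer 1 changes the term computed by layer 2, because one of layer 2's paths runs through layer 1.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

# Toy weights; identity matrices are an arbitrary choice for illustration.
W1 = np.eye(2)
W2 = np.eye(2)

def f1(x):
    return relu(W1 @ x)

def f2(x):
    return relu(W2 @ x)

def forward(x, ablate_f1=False):
    """Return the final residual stream and the layer-2 term on its own."""
    h1 = x if ablate_f1 else x + f1(x)
    layer2_term = f2(h1)
    return h1 + layer2_term, layer2_term

x = np.array([1.0, 2.0])
_, term_clean = forward(x)
_, term_ablated = forward(x, ablate_f1=True)

# Non-zero: the layer-2 term shifts even though only layer 1 was ablated.
path_interaction = term_clean - term_ablated
```

In an ensemble of truly independent members, removing one member would leave every other member's output unchanged; here `path_interaction` is non-zero.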

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2024-08-07T08:31:36.961Z · LW · GW

[Repro] Circular Features in GPT-2 Small

This is a paper reproduction in service of achieving my seasonal goals

Recently, it was demonstrated that circular features are used in the computation of modular addition tasks in language models. I've reproduced this for GPT-2 small in this Colab

We've confirmed that days of the week do appear to be represented in a circular fashion in the model. Furthermore, looking at feature dashboards agrees with the discovery; this suggests that simply looking up features that detect tokens in the same conceptual 'category' could be another way of finding clusters of features with interesting geometry.

Next steps:

1. Here, we've selected 9 SAE features, gotten the reconstruction, and then compressed this down via PCA. However, were all 9 features necessary? Could we remove some of them without hurting the visualization?

2. The SAE reconstruction using 9 features is probably a very small component of the model's overall representation of this token. What's in the rest of the representation? Is it mostly orthogonal to the SAE reconstruction, or is there a sizeable component remaining in this 9-dimensional subspace? If the latter, it would indicate that the SAE representation here is not a 'full' representation of the original model.
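One way to probe the second question is to measure how much of a vector's squared norm lies in the span of the 9 selected decoder directions. The sketch below uses random stand-ins for the decoder rows and activations; `subspace_fraction` is a hypothetical helper, not something from the Colab. In practice one would pass in the residual (activation minus SAE reconstruction).

```python
import numpy as np

def subspace_fraction(vector, decoder_directions):
    """Fraction of the squared norm of `vector` inside span(decoder_directions).

    decoder_directions: shape (k, d_model), one row per selected SAE feature.
    """
    # Orthonormal basis for the span of the selected directions.
    q, _ = np.linalg.qr(decoder_directions.T)      # shape (d_model, k)
    projected = q @ (q.T @ vector)
    return float(projected @ projected / (vector @ vector))

rng = np.random.default_rng(0)
d_model, k = 768, 9                                # GPT-2 small width, 9 features
directions = rng.normal(size=(k, d_model))         # stand-in for real decoder rows
inside = directions.T @ rng.normal(size=k)         # lies fully inside the subspace
generic = rng.normal(size=d_model)                 # random vector, mostly outside

frac_inside = subspace_fraction(inside, directions)
frac_generic = subspace_fraction(generic, directions)
```

A residual whose fraction is far above the random baseline (about `k / d_model`) would suggest the 9 features do not capture everything in their own subspace.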

Thanks to Egg Syntax for pair programming and Josh Engels for help with the reproduction. 

Comment by Daniel Tan (dtch1997) on The Residual Expansion: A Framework for thinking about Transformer Circuits · 2024-08-07T08:14:02.049Z · LW · GW

If I understand correctly, you're saying that my expansion is wrong, because , which I agree with. 

  1. Then isn't it also true that 
  2. Also, if the output is not a sum of all separate paths, then what's the point of the unraveled view? 
Comment by Daniel Tan (dtch1997) on The ‘strong’ feature hypothesis could be wrong · 2024-08-05T11:50:19.094Z · LW · GW

This is a great article! I find the notion of a 'tacit representation' very interesting, and it makes me wonder whether we can construct a toy model where something is only tacitly (but not explicitly) represented. For example, having read the post, I'm updated towards believing that the goals of agents are represented tacitly rather than explicitly, which would make MI for agentic models much more difficult. 

One minor point: There is a conceptual difference, but perhaps not an empirical difference, between 'strong LRH is false' and 'strong LRH is true but the underlying features aren't human-interpretable'. I think our existing techniques can't yet distinguish between these two cases. 

Relatedly, I (with collaborators) recently released a paper on evaluating steering vectors at scale: https://arxiv.org/abs/2407.12404. We found that many concepts (as defined in model-written evals) did not steer well, which has updated me towards believing that these concepts are not linearly represented. This in turn weakly updates me towards believing strong LRH is false, although this is definitely not a rigorous conclusion. 

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2024-08-05T10:23:43.065Z · LW · GW

That's a really interesting blogpost, thanks for sharing! I skimmed it but I didn't really grasp the point you were making here. Can you explain what you think specifically causes self-repair? 

Comment by Daniel Tan (dtch1997) on The Residual Expansion: A Framework for thinking about Transformer Circuits · 2024-08-05T10:20:51.229Z · LW · GW

I agree, this seems like exactly the same thing, which is great! In hindsight it's not surprising that you / other people have already thought about this

Do you think the 'tree-ified view' (to use your name for it) is a good abstraction for thinking about how a model works? Are individual terms in the expansion the right unit of analysis? 

Comment by Daniel Tan (dtch1997) on The Residual Expansion: A Framework for thinking about Transformer Circuits · 2024-08-02T13:28:47.794Z · LW · GW

Fair point, and I should amend the post to point out that AMFOTC also does 'path expansion'. However, I think this is still conceptually distinct from AMFOTC because: 

  • In my reading of AMFOTC, the focus seems to be on understanding attention by separating the QK and OV circuits, writing these as linear (or almost linear) terms, and fleshing this out for 1-2 layer attention-only transformers. This is cool, but also very hard to use at the level of a full model
  • Beyond understanding individual attention heads, I am more interested in how the whole model works; IMO this is very unlikely to be simply understood as a sum of linear components. OTOH residual expansion gives a sum of nonlinear components and maybe each of those things is more interpretable. 
  • I think the notion of path 'degrees' hasn't been explicitly stated before and I found this to be a useful abstraction to think about circuit complexity. 

Maybe this post is better framed as 'reconciling AMFOTC with SAE circuit analysis'. 

Comment by Daniel Tan (dtch1997) on An Interpretability Illusion from Population Statistics in Causal Analysis · 2024-08-02T11:26:20.999Z · LW · GW

What's a better way to incorporate the mentioned sample-level variance in measuring the effectiveness of an SAE feature or SV?

In the steering vectors work I linked, we looked at how much of the variance in the metric was explained by a spurious factor, and I think that could be a useful technique if you have some a priori intuition about what the variance might be due to. However, this doesn't mean we can just test a bunch of hypotheses, because that looks like p-hacking.  
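As a rough illustration of "variance explained by a spurious factor", the sketch below computes eta squared (between-group sum of squares over total sum of squares) for a categorical factor. This is one standard statistic and not necessarily the exact one used in the paper; the numbers and the ABBA / BABA labels are made up for the example.

```python
import numpy as np

def variance_explained(metric, group):
    """Fraction of the variance in `metric` explained by a categorical factor."""
    metric = np.asarray(metric, dtype=float)
    group = np.asarray(group)
    grand_mean = metric.mean()
    total_ss = ((metric - grand_mean) ** 2).sum()
    # Between-group sum of squares: group size times squared offset of the group mean.
    between_ss = sum(
        (group == g).sum() * (metric[group == g].mean() - grand_mean) ** 2
        for g in np.unique(group)
    )
    return float(between_ss / total_ss)

# Hypothetical per-sample effects, labelled by a suspected spurious factor.
effect = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]
template = ["ABBA", "ABBA", "ABBA", "BABA", "BABA", "BABA"]
share = variance_explained(effect, template)
```

A share close to 1 means the factor accounts for most of the sample-level spread, so the population mean on its own would be misleading.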

Generally, I do think that 'population variance' should be a metric that's reported alongside 'population mean' in order to contextualize findings. But again this doesn't paint a very clean picture; variance being high could be due to heteroscedasticity, among other things.

I don't have great solutions for this illusion outside of those two recommendations. One naive way we might try to solve this is to remove things from the dataset until the variance is minimal, but it's hard to do this in a principled way that doesn't eventually look like p-hacking. 

Do you also conclude that the causal role of the circuit you discovered was spurious?

an example where causal intervention satisfied the above-mentioned (or your own alternative that was not mentioned in this post) criteria

I would guess that the IOI SAE circuit we found is not unduly influenced by spurious factors, and that the analysis using (variance in the metric difference explained by ABBA / BABA) would corroborate this. I haven't rigorously tested this, but I'd be very surprised if this turned out not to be the case

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2024-07-30T11:24:13.107Z · LW · GW

For sure! Working in public is going to be a big driver of these habits :) 

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2024-07-27T23:51:44.756Z · LW · GW

[Note] On illusions in mechanistic interpretability

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2024-07-27T23:49:44.767Z · LW · GW

[Proposal] Out-of-context meta learning as a toy model of steganography

Steganography is the idea that models may say one thing but mean another, which may enable them to evade supervision. Essentially, models might learn to "speak in code". 

In order to better study steganography, it would be useful to construct model organisms of steganography, which we don't have at the moment. How might we do this? I think out-of-context meta learning is a very convenient path. 

Out-of-context meta learning: The idea that models can internalise knowledge during training, and then use this knowledge at test-time to answer questions. E.g.

  • Train the model on inputs of the form: "A <nonsense word> is <color>". 
  • At test time, ask the model "What color is <nonsense word>?" 

This has been demonstrated to be possible in Krasheninnikov et al, 2024 as well as Berglund et al, 2024
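The train / test split described above could be generated along these lines. This is a sketch only: the word generator, colour list and function name are my own assumptions, not the setup from either paper.

```python
import random

def make_ocml_dataset(num_words=5, seed=0):
    """Build (train_sentences, test_questions) for the nonsense-word colour task."""
    rng = random.Random(seed)
    colors = ["red", "green", "blue", "yellow"]
    letters = "bcdfghjklmnpqrstvwxz"
    train, test = [], []
    for _ in range(num_words):
        # A random six-letter consonant string stands in for a nonsense word.
        word = "".join(rng.choice(letters) for _ in range(6))
        color = rng.choice(colors)
        train.append(f"A {word} is {color}.")                 # seen during training
        test.append((f"What color is {word}?", color))        # asked at test time
    return train, test

train_sentences, test_questions = make_ocml_dataset()
```

The model only ever sees the declarative sentences during training; the questions test whether the definition was internalised rather than read from context.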

This seems like a pretty effective path to creating model organisms of steganography. E.g.

  • Train the model on re-definitions of existing words, e.g. "A cat is a vehicle for transport" 
  • Test the model on whether it uses "cat" instead of "car" at test time. Or something in this vein. 

I probably won't work on this myself, but I'm pretty interested in someone doing this and reporting their results

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2024-07-27T23:27:15.204Z · LW · GW

[Note] Excessive back-chaining from theories of impact is misguided

Rough summary of a conversation I had with Aengus Lynch 

As a mech interp researcher, one thing I've been trying to do recently is to figure out my big cruxes for mech interp, and then filter projects by whether they are related to these cruxes. 

Aengus made the counterpoint that this can be dangerous, because even the best researchers' mental model of what will be impactful in the future is likely wrong, and errors will compound through time. Also, time spent refining a mental model is time not spent doing real work. Instead, he advocated for working on projects that seem likely to yield near-term value 

I still think I got a lot of value out of thinking about my cruxes, but I agree with the sentiment that this shouldn't consume excessive amounts of my time

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2024-07-27T23:20:58.044Z · LW · GW

[Note] On self-repair in LLMs

A collection of empirical evidence

Do language models exhibit self-repair? 

One notion of self-repair is redundancy; having "backup" components which do the same thing, should the original component fail for some reason. Some examples: 

  • In the IOI circuit in GPT-2 small, there are primary "name mover heads" but also "backup name mover heads" which fire if the primary name movers are ablated. This is partially explained via copy suppression.
  • More generally, The Hydra effect: Ablating one attention head leads to other attention heads compensating for the ablated head. 
  • Some other mechanisms for self-repair include "layernorm scaling" and "anti-erasure", as described in Rushing and Nanda, 2024

Another notion of self-repair is "regulation"; suppressing an overstimulated component. 

A third notion of self-repair is "error correction". 

Self-repair is annoying from the interpretability perspective. 

  • It creates an interpretability illusion; maybe the ablated component is actually playing a role in a task, but due to self-repair, activation patching shows an abnormally low effect. 

A related thought: Grokked models probably do not exhibit self-repair. 

  • In the "circuit cleanup" phase of grokking, redundant circuits are removed due to the L2 weight penalty incentivizing the model to shed these unused parameters. 
  • I expect regulation to not occur as well, because there is always a single correct answer; hence a model that predicts this answer will be incentivized to be as confident as possible. 
  • Error correction still probably does occur, because this is largely a consequence of superposition 

Taken together, I guess this means that self-repair is a coping mechanism for the "noisiness" / "messiness" of real data like language. 

It would be interesting to study whether introducing noise into synthetic data (that is normally grokkable by models) also breaks grokking (and thereby induces self-repair). 

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2024-07-27T23:07:00.203Z · LW · GW

[Note] Is adversarial robustness best achieved through grokking? 

A rough summary of an insightful discussion with Adam Gleave, FAR AI

We want our models to be adversarially robust. 

  • According to Adam, the scaling laws don't indicate that models will "naturally" become robust just through standard training. 

One technique which FAR AI has investigated extensively (in Go models) is adversarial training. 

  • If we measure "weakness" in terms of how much compute is required to train an adversarial opponent that reliably beats the target model at Go, then starting out it's like 10m FLOPS, and this can be increased to 200m FLOPS through iterated adversarial training. 
  • However, this is both pretty expensive (~10-15% of pre-training compute), and doesn't work perfectly (even after extensive iterated adversarial training, models still remain vulnerable to new adversaries.) 
  • A useful intuition: Adversarial examples are like "holes" in the model, and adversarial training helps patch the holes, but there are just a lot of holes. 

One thing I pitched to Adam was the notion of "adversarial robustness through grokking". 

  • Conceptually, if the model generalises perfectly on some domain, then there can't exist any adversarial examples (by definition). 
  • Empirically, "delayed robustness" through grokking has been demonstrated on relatively advanced datasets like CIFAR-10 and Imagenette; in both cases, models that underwent grokking became naturally robust to adversarial examples.  

Adam seemed thoughtful, but had some key concerns. 

  • One of Adam's cruxes seemed to relate to how quickly we can get language models to grok; here, I think work like grokfast is promising in that it potentially tells us how to train models that grok much more quickly. 
  • I also pointed out that in the above paper, Shakespeare text was grokked, indicating that this is feasible for natural language 
  • Adam pointed out, correctly, that we have to clearly define what it means to "grok" natural language. To make an analogy to chess: one level of "grokking" could just be playing legal moves, whereas a more advanced level of grokking is to play the optimal move. In the language domain, the former would be equivalent to outputting plausible next tokens, and the latter would be equivalent to being able to solve arbitrarily complex intellectual tasks like reasoning. 
  • We had some discussion about characterizing "the best strategy that can be found with the compute available in a single forward pass of a model" and using that as the criterion for grokking. 

His overall take was that it's mainly an "empirical question" whether grokking leads to adversarial robustness. He hadn't heard this idea before, but thought experiments / proofs of concept would be useful. 

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2024-07-27T22:48:45.387Z · LW · GW

[Note] On the feature geometry of hierarchical concepts

A rough summary of insightful discussions with Jake Mendel and Victor Veitch

Recent work on hierarchical feature geometry has made two specific predictions: 

  • Proposition 1: activation space can be decomposed hierarchically into a direct sum of many subspaces, each of which reflects a layer of the hierarchy. 
  • Proposition 2: within these subspaces, different concepts are represented as simplices. 

Example of hierarchical decomposition: A Dalmatian is a dog, which is a mammal, which is an animal. Writing this hierarchically, Dalmatian < Dog < Mammal < Animal. In this context, the two propositions imply that: 

  • P1: $x_{dalmatian} = x_{animal} + x_{mammal | animal} + x_{dog | mammal} + x_{dalmatian | dog}$, and the four terms on the RHS are pairwise orthogonal. 
  • P2: If we had a few different kinds of animal, like birds, mammals, and fish, the three vectors $x_{mammal | animal}, x_{fish | animal}, x_{bird | animal}$ would form a simplex.   

According to Victor Veitch, the load-bearing assumption here is that different levels of the hierarchy are disentangled, and hence models want to represent them orthogonally. I.e. $x_{animal}$ is perpendicular to $x_{mammal | animal}$. I don't have a super rigorous explanation for why, but it's likely because this facilitates representing / sensing each thing independently. 

  • E.g. sometimes all that matters about a dog is that it's an animal; it makes sense to have an abstraction of "animal" that is independent of any sub-hierarchy. 

Jake Mendel made the interesting point that, as long as the number of features is less than the number of dimensions, an orthogonal set of vectors will satisfy P1 and P2 for any hierarchy. 

Example of P2 being satisfied. Let's say we have vectors $x_{animal} = (0,1)$ and $x_{plant} = (1,0)$, which are orthogonal. Then we could write $x_{living\_thing} = (1/\sqrt{2}, 1/\sqrt{2})$. Then $x_{animal | living\_thing}, x_{plant | living\_thing}$ would form a 1-dimensional simplex. 

Example of P1 being satisfied. Let's say we have four things A, B, C, D arranged in a binary tree such that AB, CD are pairs. Then we could write $x_A = x_{AB} + x_{A | AB}$, satisfying both P1 and P2. However, if we had an alternate hierarchy where AC and BD were pairs, we could still write $x_A = x_{AC} + x_{A | AC}$. Therefore hierarchy is in some sense an "illusion", as any hierarchy satisfies the propositions. 
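Jake's point can be checked numerically. The sketch below assumes the parent vector is the mean of its two children (my choice of convention; other normalisations are possible) and shows that with orthonormal leaf vectors, the parent and child-given-parent terms are orthogonal under either pairing.

```python
import numpy as np

# Four orthonormal feature vectors, one per leaf concept.
x = dict(zip("ABCD", np.eye(4)))

def decompose(leaf, sibling):
    """Split x[leaf] into a parent term and a child-given-parent term.

    The parent is taken to be the mean of its two children (an assumption).
    """
    parent = (x[leaf] + x[sibling]) / 2
    child_given_parent = x[leaf] - parent
    return parent, child_given_parent

# Hierarchy 1 pairs A with B; hierarchy 2 pairs A with C.
dot_products = {}
for sibling in ("B", "C"):
    parent, residual = decompose("A", sibling)
    assert np.allclose(parent + residual, x["A"])   # the two terms sum to x_A
    dot_products[sibling] = float(parent @ residual)
```

Both entries of `dot_products` are zero, so the orthogonality in P1 alone cannot tell us which hierarchy the model "really" uses.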

Taking these two points together, the interesting scenario is when we have more features than dimensions, i.e. the setting of superposition. Then we have the two conflicting incentives:

  • On one hand, models want to represent the different levels of the hierarchy orthogonally. 
  • On the other hand, there isn't enough "room" in the residual stream to do this; hence the model has to "trade off" what it chooses to represent orthogonally. 

This points to super interesting questions: 

  • what geometry does the model adopt for features that respect a binary tree hierarchy? 
  • what if different nodes in the hierarchy have differing importances / sparsities?
  • what if the tree is "uneven", i.e. some branches are deeper than others. 
  • what if the hierarchy isn't a tree, but only a partial order? 

Experiments on toy models will probably be very informative here. 

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2024-07-23T15:20:54.794Z · LW · GW

My Seasonal Goals, Jul - Sep 2024

This post is an exercise in public accountability and harnessing positive peer pressure for self-motivation.   

By 1 October 2024, I am committing to have produced:

  • 1 complete project
  • 2 mini-projects
  • 3 project proposals
  • 4 long-form write-ups

Habits I am committing to that will support this:

  • Code for >=3h every day
  • Chat with a peer every day
  • Have a 30-minute meeting with a mentor figure every week
  • Reproduce a paper every week
  • Give a 5-minute lightning talk every week
Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2024-07-22T08:15:37.097Z · LW · GW

This is really interesting, thanks! As I understand, "affine steering" applies an affine map to the activations, and this is expressive enough to perform a "rotation" on the circle. David Chanin has told me before that LRC doesn't really work for steering vectors. Didn't grok kernelized concept erasure yet but will have another read.  

Generally, I am quite excited to implement existing work on more general steering interventions and then check whether they can automatically learn to steer modular addition 

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2024-07-22T08:14:18.691Z · LW · GW

[Note] On SAE Feature Geometry

 

SAE feature directions are likely "special" rather than "random". 


Re: the last point above, this points to singular learning theory being an effective tool for analysis. 

  • Reminder: The LLC measures the degeneracy ("local flatness") of the loss basin. A lower LLC = flatter loss, i.e. changing the model's parameters by a small amount does not increase the loss by much. 
  • In preliminary work on LLC analysis of SAE features, the "feature-targeted LLC" turns out to be something which can be measured empirically and distinguishes SAE features from random directions
Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2024-07-17T14:45:47.651Z · LW · GW

[Proposal] Attention Transcoders: can we take attention heads out of superposition? 

Note: This thinking is cached from before the bilinear sparse autoencoders paper. I need to read that and revisit my thoughts here. 

Primer: Attention-Head Superposition

Attention-head superposition (AHS) was introduced in this Anthropic post from 2023. Briefly, AHS is the idea that models may use a small number of attention heads to approximate the effect of having many more attention heads.  

Definition 1: OV-incoherence. An attention circuit is OV-incoherent if it attends from multiple different tokens back to a single token, and the output depends on the token attended from. 

Example 2: Skip-trigram circuits. A skip trigram consists of a sequence [A]...[B] -> [C], where A, B, C are distinct tokens. 

Claim 3: A single head cannot implement multiple OV-incoherent circuits. Recall from A Mathematical Framework that an attention head can be decomposed into the OV circuit and the QK circuit, which operate independently. Within each head, the OV circuit is solely responsible for mapping linear directions in the input to linear directions in the output. The OV circuit sees only the token being attended to. Since it does not see the token attended from, it must compute a fixed function of the attended-to token, so its output cannot vary with the token attended from. 

Claim 4: Models compute many OV-incoherent circuits simultaneously in superposition. If the ground-truth data is best explained by a large number of OV-incoherent circuits, then models will approximate having these circuits by placing them in superposition across their limited number of attention heads. 

Attention Transcoders 

An attention transcoder (ATC) is described as follows:

  • An ATC attempts to reconstruct the output of a specific attention block from that block's input
  • An ATC is simply a standard multi-head attention module, except that it has many more attention heads. 
  • An ATC is regularised during training such that the number of active heads is sparse. 
    • I've left this intentionally vague at the moment as I'm uncertain how exactly to do this. 

Remark 5: The ATC architecture is the generalization of other successful SAE-like architectures to attention blocks. 

  • Residual-stream SAEs simulate a model that has many more residual neurons. 
  • MLP transcoders simulate a model that has many more hidden neurons in its MLP. 
  • ATCs simulate a model that has many more attention heads. 

Remark 6: Intervening on ATC heads. Since the ATC reconstructs the output of an attention block, ablations can be done by simply splicing the ATC into the model's computational graph and intervening directly on individual head outputs. 

Remark 7: Attributing ATC heads to ground-truth heads. In standard attention-out SAEs, it's possible to directly compute the attribution of each head to an SAE feature. That seems impossible here because the ATC head outputs are not direct functions of the ground-truth heads. Nonetheless, if ATC heads seem highly interpretable and accurately reconstruct the real attention outputs, and specific predictions can be verified via interventions, it seems reasonable to conclude that they are a good explanation of how attention blocks are working. 

Key uncertainties

Does AHS actually occur in language models? I think we do not have crisp examples at the moment. 

Concrete experiments

The first and most obvious experiment is to try training an ATC and see if it works. 

  • Scaling milestones: toy models, TinyStories, open web text
  • Do we achieve better Pareto curves of reconstruction loss vs L0 than standard attention-out SAEs? 

Conditional on that succeeding, the next step would be to attempt to interpret individual heads in an ATC and determine whether they are interpretable. 

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2024-07-17T08:59:54.967Z · LW · GW

[Proposal] Can we develop a general steering technique for nonlinear representations? A case study on modular addition

Steering vectors are a recent and increasingly popular alignment technique. They are based on the observation that many features are encoded as linear directions in activation space; hence, intervening within this 1-dimensional subspace is an effective method for controlling that feature. 

Can we extend this to nonlinear features? A simple example of a nonlinear feature is circular representations in modular arithmetic. Here, it's clear that a simple "steering vector" will not work. Nonetheless, as the authors show, it's possible to construct a nonlinear steering intervention that demonstrably influences the model to predict a different result. 

Problem: The construction of a steering intervention in the modular addition paper relies heavily on the a-priori knowledge that the underlying feature geometry is a circle. Ideally, we wouldn't need to fully elucidate this geometry in order for steering to be effective. 

Therefore, we want a procedure which learns a nonlinear steering intervention given only the model's activations and labels (e.g. the correct next-token). 

Such a procedure might look something like this:

  • Assume we have paired data $(x, y)$ for a given concept. $x$ is the model's activations and $y$ is the label, e.g. the day of the week. 
  • Define a function $x' = f_\theta(x, y, y')$ that predicts the $x'$ for steering the model towards $y'$. 
  • Optimize $f_\theta(x, y, y')$ using a dataset of steering examples.
  • Evaluate the model under this steering intervention, and check if we've actually steered the model towards $y'$. Compare this to the ground-truth steering intervention. 
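A toy version of this procedure, under strong simplifying assumptions: the "activations" are synthetic points on a circle embedded in a 16-dimensional space (not real model activations), the intervention targets a fixed shift of two days rather than arbitrary $(y, y')$ pairs, and $f_\theta$ is an affine map fitted by least squares. The fitting step never uses the fact that the geometry is a circle.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days, d_model, shift = 7, 16, 2

# Synthetic activations: days of the week on a circle, embedded in d_model dims.
basis, _ = np.linalg.qr(rng.normal(size=(d_model, 2)))       # orthonormal columns

def activation(day):
    angle = 2 * np.pi * np.asarray(day) / n_days
    circle = np.stack([np.cos(angle), np.sin(angle)], axis=-1)
    return circle @ basis.T

# Steering dataset: pairs (x, x') where x' encodes the day `shift` steps later.
y = rng.integers(0, n_days, size=200)
x = activation(y) + 0.01 * rng.normal(size=(200, d_model))
x_target = activation((y + shift) % n_days)

# f_theta is affine: x' = [x, 1] @ theta, fitted by least squares.
x_aug = np.hstack([x, np.ones((len(x), 1))])
theta, *_ = np.linalg.lstsq(x_aug, x_target, rcond=None)

def steer(x_in):
    return np.hstack([x_in, np.ones((len(x_in), 1))]) @ theta

def decode(x_in):
    """Nearest-centroid readout of the day from an activation."""
    centroids = activation(np.arange(n_days))
    dists = np.linalg.norm(x_in[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)

clean = activation(np.arange(n_days))
steered_days = decode(steer(clean))
```

Here the learned map ends up acting as a rotation within the circle's plane, which a single additive steering vector cannot express. Whether anything this simple transfers to real activations is exactly the open question.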

If this works, it might be applicable to other examples of nonlinear feature geometries as well. 

Thanks to David Chanin for useful discussions. 

Comment by Daniel Tan (dtch1997) on Arrakis - A toolkit to conduct, track and visualize mechanistic interpretability experiments. · 2024-07-17T08:48:48.107Z · LW · GW

Really interesting! I'm a big proponent of improving the standards of infrastructure in the mech interp community. 

Some questions: 

  • Have you used other things like TransformerLens and NNsight and found those to be insufficient in some way? Your library seems to diverge fundamentally from both of those implementations (pytorch hooks in the former case and "proxy variables" in the latter case). I'm curious about the motivating use case here. 
  • Do you have examples of reproducing specific mech interp analyses using your library? E.g. Neel Nanda's Indirect Object Identification tutorial, or other simple things like doing activation patching / logit lens. 
Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2024-07-17T08:22:00.284Z · LW · GW

[Draft][Note] On Singular Learning Theory

 

Relevant links

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2024-07-17T08:01:46.477Z · LW · GW

[Proposal] Do SAEs capture simplicial structure? Investigating SAE representations of known case studies

It's an open question whether SAEs capture underlying properties of feature geometry. Fortunately, careful research has elucidated a few examples of nonlinear geometry already. It would be useful to think about whether SAEs recover these geometries. 

Simplices in models. Work studying hierarchical structure in feature geometry finds that sets of things are often represented as simplices, which are a specific kind of regular polytope. Simplices are also the structure of belief state geometry

The proposal here is: look at the SAE activations for the tetrahedron, identify a relevant cluster, and then evaluate whether this matches the ground-truth.  

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2024-07-17T07:39:43.396Z · LW · GW

[Note] The Polytope Representation Hypothesis

This is an empirical observation about recent works on feature geometry, that (regular) polytopes are a recurring theme in feature geometry. 

Simplices in models. Work studying hierarchical structure in feature geometry finds that sets of things are often represented as simplices, which are a specific kind of regular polytope. Simplices are also the structure of belief state geometry

Regular polygons in models. Recent work studying natural language modular arithmetic has found that language models represent things in a circular fashion. I will contend that "circle" is a bit imprecise; these are actually regular polygons, which are the 2-dimensional versions of polytopes. 

A reason why polytopes could be a natural unit of feature geometry is that they characterize linear regions of the activation space in ReLU networks. However, I will note that it's not clear that this motivation for polytopes coincides very well with the empirical observations above.  

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2024-07-17T07:20:21.140Z · LW · GW

Oh that's really interesting! Can you clarify what "MCS" means? And can you elaborate a bit on how I'm supposed to interpret these graphs? 

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2024-07-17T07:14:46.733Z · LW · GW

[Note] Is Superposition the reason for Polysemanticity? Lessons from "The Local Interaction Basis" 

Superposition is currently the dominant hypothesis to explain polysemanticity in neural networks. However, how much better does it explain the data than alternative hypotheses?  

Non-neuron aligned basis. The leading alternative, as asserted by Lawrence Chan here, is that the number of underlying features is not very large; rather, these features are not represented in a neuron-aligned way, so individual neurons appear to fire on multiple distinct features. 

The Local Interaction Basis explores this idea in more depth. Starting from the premise that there is a linear and interpretable basis that is not overcomplete, they propose a method to recover such a basis, which works in toy models. However, empirical results in language models fail to demonstrate that the recovered basis is indeed more interpretable.

My conclusion from this is a big downwards update on the likelihood of the "non-neuron aligned basis" in realistic domains like natural language. The real world probably just is complex enough that there are tons of distinct features which represent reality. 

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2024-07-17T07:04:06.983Z · LW · GW

[Proposal] Is reasoning in natural language grokkable? Training models on language formulations of toy tasks. 

Previous work on grokking finds that models can grok modular addition and tree search. However, these are not tasks formulated in natural language. Instead, the tokens correspond directly to true underlying abstract entities, such as numerical values or nodes in a graph. I question whether this representational simplicity is a key ingredient of grokking reasoning. 

I have a prior that expressing concepts in natural language (as opposed to directly representing concepts as tokens) introduces an additional layer of complexity which makes grokking much more difficult. 

The proposal here is to repeat the experiments with tasks that test equivalent reasoning skills, but which are formulated in natural language. 

  • Modular addition can be formulated as "day of the week" math, as has been done previously. 
  • Tree search is more difficult to formulate, but might be phrasable as some kind of navigation instruction. 

I'd expect that we could observe grokking, but that it might take a lot longer (and require larger models) when compared to the "direct concept tokenization". Conditioned on this being true, it would be interesting to observe whether we recover the same kinds of circuits as demonstrated in prior work. 
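As a rough sketch of the "day of the week" formulation, here is one way the dataset could be generated. The prompt template, the train/test split, and all function names are assumptions made for illustration; prior work may phrase the task differently.

```python
import itertools
import random

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def make_example(start_idx, offset):
    """Phrase (start_idx + offset) mod 7 as a natural-language prompt."""
    unit = "day" if offset == 1 else "days"
    prompt = f"Today is {DAYS[start_idx]}. In {offset} {unit} it will be"
    answer = DAYS[(start_idx + offset) % len(DAYS)]
    return prompt, answer

def make_dataset(max_offset=20, train_fraction=0.5, seed=0):
    """Enumerate every (start day, offset) pair and split into train/test."""
    pairs = itertools.product(range(len(DAYS)), range(1, max_offset + 1))
    examples = [make_example(start, offset) for start, offset in pairs]
    random.Random(seed).shuffle(examples)
    n_train = int(train_fraction * len(examples))
    return examples[:n_train], examples[n_train:]
```

Holding out a fraction of the (start day, offset) pairs mirrors the usual grokking setup, where generalisation is measured on pairs never seen during training.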

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2024-07-17T06:52:05.787Z · LW · GW

[Proposal] Do SAEs learn universal features? Measuring Equivalence between SAE checkpoints

If we train several SAEs from scratch on the same set of model activations, are they “equivalent”? 

Here are three notions of "equivalence": 

  • Direct equivalence. Features in one SAE are the same (in terms of decoder weight) as features in another SAE. 
  • Linear equivalence. Features in one SAE directly correspond one-to-one with features in another SAE after some global transformation like rotation. 
  • Functional equivalence. The SAEs define the same input-output mapping. 

A priori, I would expect that we get rough functional equivalence, but not feature equivalence. I think this experiment would help elucidate the underlying invariant geometrical structure that SAE features are suspected to lie in.
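A minimal sketch of how two of these notions could be scored, assuming a vanilla ReLU SAE with weights stored as NumPy arrays. The function names and the SAE parameterisation are hypothetical, not any particular library's API. Linear equivalence is left out because it additionally requires solving for the global transformation.

```python
import numpy as np

def mean_max_cosine_similarity(dec_a, dec_b):
    """Direct equivalence score in [-1, 1].

    dec_a, dec_b: decoder matrices of shape (n_features, d_model).
    For each feature in A, take its best cosine match in B, then average.
    """
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    return float((a @ b.T).max(axis=1).mean())

def sae_forward(x, enc, enc_bias, dec, dec_bias):
    """Assumed vanilla ReLU SAE: x of shape (batch, d_model) -> reconstruction."""
    hidden = np.maximum(x @ enc + enc_bias, 0.0)
    return hidden @ dec + dec_bias

def functional_distance(x, sae_a, sae_b):
    """Relative L2 gap between the two reconstructions of the same inputs.

    sae_a, sae_b: tuples of (enc, enc_bias, dec, dec_bias).
    """
    out_a = sae_forward(x, *sae_a)
    out_b = sae_forward(x, *sae_b)
    return float(np.linalg.norm(out_a - out_b) / np.linalg.norm(out_a))
```

A useful sanity check is that an SAE and a copy with its features permuted should score as equivalent under both measures, even though the feature indices no longer line up.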

 

Changelog: 

  • 18/07/2024 - Added discussion on "linear equivalence"

Comment by Daniel Tan (dtch1997) on Daniel Tan's Shortform · 2024-07-17T06:38:07.280Z · LW · GW

[Proposal] Are circuits universal? Investigating IOI across many GPT-2 small checkpoints

Universal features. Work such as the Platonic Representation Hypothesis suggests that sufficiently capable models converge to the same representations of the data. To me, this indicates that the underlying "entities" which make up reality are universally agreed upon by models.

Non-universal circuits. There are many different algorithms which could correctly solve the same problem. Prior work such as the clock and the pizza indicates that, even for very simple tasks, models can learn very different algorithms depending on the "attention rate". 

Circuit universality is a crux. If circuits are mostly model-specific rather than being universal, it makes the near-term impact of MI a lot lower, since finding a circuit in one model tells us very little about what a slightly different model is doing. 

Concrete experiment: Evaluating the universality of IOI. Gurnee et al. train several GPT-2 small checkpoints from scratch. We know from prior work that GPT-2 small has an IOI circuit. Which components of this circuit, if any, turn out to be universal? Maybe we always observe induction heads. But do we always observe name-mover and S-inhibition heads? If so, are they always at the same layer? Etc. I think this experiment would inform us a lot about circuit universality. 
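Once heads have been labelled by role in each checkpoint, the comparison itself is simple bookkeeping. A sketch under the assumption that the per-checkpoint circuits are available as a dictionary of role to (layer, head) sets; that data structure and the function names are hypothetical, and identifying the heads in the first place is the hard part this does not cover.

```python
from collections import defaultdict

def role_presence(circuits):
    """Fraction of checkpoints in which each role has at least one head.

    circuits: {checkpoint_name: {role: set of (layer, head) tuples}}
    """
    counts = defaultdict(int)
    for roles in circuits.values():
        for role, heads in roles.items():
            if heads:
                counts[role] += 1
    return {role: n / len(circuits) for role, n in counts.items()}

def role_layers(circuits):
    """Sorted list of layers in which each role appears in any checkpoint."""
    layers = defaultdict(set)
    for roles in circuits.values():
        for role, heads in roles.items():
            layers[role].update(layer for layer, _head in heads)
    return {role: sorted(found) for role, found in layers.items()}
```

A role with presence 1.0 and a single layer would count as strongly universal; a role that appears in every checkpoint but at varying layers would be universal in function but not in location.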

Comment by Daniel Tan (dtch1997) on LLM Generality is a Timeline Crux · 2024-07-16T08:29:36.381Z · LW · GW

Yup, that's basically what I think! IMO, grokking = having memorised the "underlying rules" that define the DGP, and these rules are general by definition. "Reasoning" is a loaded term that's difficult to unpack, but I think a good working definition is "applying a set of rules to arrive at an answer". In other words, reasoning is learning a "correct algorithm" to solve the problem. Therefore being able to reason correctly 100% of the time is equivalent to models having grokked their problem domain.  

See this work, which finds that reasoning only happens through grokking. Separate work has trained models to do tree search, and found that backwards chaining circuits (a correct algorithm) emerge only through grokking. And also the seminal work on modular addition which found that correct algorithms emerge through grokking.

Note that the question of "is reasoning in natural language grokkable?" is a totally separate crux and one which I'm highly uncertain about. 

Comment by Daniel Tan (dtch1997) on Transcoders enable fine-grained interpretable circuit analysis for language models · 2024-05-15T21:12:54.217Z · LW · GW

Hey Jacob + Philippe, 

I took the liberty of making a clean installable version of your original codebase. Hope you don't mind, and happy to make any changes that you request! https://github.com/dtch1997/transcoders-slim

Comment by Daniel Tan (dtch1997) on Toward A Mathematical Framework for Computation in Superposition · 2024-04-27T15:00:53.826Z · LW · GW

This work is very exciting to me, and I'm curious to hear the authors' thoughts on whether we could verify specific predictions made by this model in real models. 

  • For example, the proposed U-AND operator - do we expect this to occur in real LLMs, and could we try to find evidence of this by applying mech interp to carefully-chosen toy models? 

I have a more detailed write-up on model organisms of superposition here: https://docs.google.com/document/d/1hwI30HNNB2MkOrtEzo7hppG9X7Cn7Xm9a-1LBqcttWc/edit?usp=sharing

Would love to discuss this more!