## Posts

## Comments

**Daniel Tan (dtch1997)** on The Residual Expansion: A Framework for thinking about Transformer Circuits · 2024-08-02T13:28:47.794Z · LW · GW

Fair point, and I should amend the post to point out that AMFOTC also does 'path expansion'. However, I think this is still conceptually distinct from AMFOTC because:

- In my reading of AMFOTC, the focus seems to be on understanding attention by separating the QK and OV circuits, writing these as linear (or almost linear) terms, and fleshing this out for 1-2 layer attention-only transformers. This is cool, but also very hard to use at the level of a full model
- Beyond understanding individual attention heads, I am more interested in how the whole model works; IMO this is very unlikely to be simply understood as a sum of linear components. OTOH residual expansion gives a sum of nonlinear components and maybe each of those things is more interpretable.
- I think the notion of path 'degrees' hasn't been explicitly stated before and I found this to be a useful abstraction to think about circuit complexity.

Maybe this post is better framed as 'reconciling AMFOTC with SAE circuit analysis'.

**Daniel Tan (dtch1997)** on An Interpretability Illusion from Population Statistics in Causal Analysis · 2024-08-02T11:26:20.999Z · LW · GW

What's a better way to incorporate the mentioned sample-level variance in measuring the effectiveness of an SAE feature or SV?

In the steering vectors work I linked, we looked at how much of the variance in the metric was explained by a spurious factor, and I think that could be a useful technique if you have some a priori intuition about what the variance might be due to. However, this doesn't mean we can just test a bunch of hypotheses, because that looks like p-hacking.

Generally, I do think that 'population variance' should be a metric that's reported alongside 'population mean' in order to contextualize findings. But again this doesn't tell a very clean picture; variance being high could be due to heteroscedasticity, among other things
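A minimal sketch of the two recommendations above: report the population variance next to the mean, and measure what fraction of that variance a suspected spurious factor explains (between-group sum of squares over total sum of squares). The `effects` and `template` arrays are made-up illustrative data, and `variance_explained` is a hypothetical helper, not code from the steering vectors work.

```python
import numpy as np

def variance_explained(metric, group):
    """Fraction of the variance in `metric` explained by the categorical `group`."""
    metric = np.asarray(metric, dtype=float)
    group = np.asarray(group)
    grand_mean = metric.mean()
    total_ss = ((metric - grand_mean) ** 2).sum()
    between_ss = 0.0
    for g in np.unique(group):
        vals = metric[group == g]
        # Each group contributes its size times its squared offset from the grand mean.
        between_ss += len(vals) * (vals.mean() - grand_mean) ** 2
    return between_ss / total_ss

# Hypothetical per-sample effect sizes of an intervention, labelled by prompt template.
effects = np.array([0.9, 1.1, 1.0, 0.1, -0.1, 0.0])
template = np.array(["ABBA", "ABBA", "ABBA", "BABA", "BABA", "BABA"])

summary = {
    "mean": effects.mean(),
    "variance": effects.var(),
    "explained_by_template": variance_explained(effects, template),
}
```

In this toy data the template accounts for nearly all of the variance, which is the pattern that would make me suspicious of the population mean. The function divides by the total sum of squares, so it assumes the metric is not constant.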

I don't have great solutions for this illusion outside of those two recommendations. One naive way we might try to solve this is to remove things from the dataset until the variance is minimal, but it's hard to do this in a principled way that doesn't eventually look like p-hacking.

Do you also conclude that the causal role of the circuit you discovered was spurious?

an example where causal intervention satisfied the above-mentioned (or your own alternative that was not mentioned in this post) criteria

I would guess that the IOI SAE circuit we found is not unduly influenced by spurious factors, and that the analysis using (variance in the metric difference explained by ABBA / BABA) would corroborate this. I haven't rigorously tested this, but I'd be very surprised if this turned out not to be the case

**Daniel Tan (dtch1997)** on Daniel Tan's Shortform · 2024-07-30T11:24:13.107Z · LW · GW

For sure! Working in public is going to be a big driver of these habits :)

**Daniel Tan (dtch1997)** on Daniel Tan's Shortform · 2024-07-27T23:51:44.756Z · LW · GW

# [Note] On illusions in mechanistic interpretability

- We thought SoLU solved superposition, but not really.
- ROME seemed like a very cool approach but turned out to have a lot of flaws. Firstly, localization does not necessarily inform editing. Secondly, editing can induce side effects (thanks Arthur!).
- We originally thought OthelloGPT had nonlinear representations but they turned out to be linear. This highlights that the features used in the model's ontology do not necessarily map to what humans would intuitively use.
- Max activating examples have been shown to give misleading interpretations of neurons / directions in BERT.

**Daniel Tan (dtch1997)** on Daniel Tan's Shortform · 2024-07-27T23:49:44.767Z · LW · GW

# [Proposal] Out-of-context meta learning as a toy model of steganography

Steganography: the idea that models may say one thing but mean another, and that this may enable them to evade supervision. Essentially, models might learn to "speak in code".

In order to better study steganography, it would be useful to construct model organisms of steganography, which we don't have at the moment. How might we do this? I think out-of-context meta learning is a very convenient path.

Out-of-context meta learning: The idea that models can internalise knowledge during training, and then use this knowledge at test-time to answer questions. E.g.

- Train the model on inputs of the form: "A <nonsense word> is <color>".
- At test time, ask the model "What color is <nonsense word>?"

This has been demonstrated to be possible in Krasheninnikov et al, 2024 as well as Berglund et al, 2024.
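To make the setup concrete, here is a rough sketch of how the training statements and test questions could be generated. The word length, the colour list, and the `make_ooc_dataset` helper are all my own placeholder choices, not the data format used in either paper.

```python
import random

def make_ooc_dataset(n_items, seed=0):
    """Build (training statements, test question/answer pairs) about nonsense words."""
    rng = random.Random(seed)
    colors = ["red", "green", "blue", "yellow"]  # assumed label set
    letters = "abcdefghijklmnopqrstuvwxyz"
    train, test = [], []
    for _ in range(n_items):
        word = "".join(rng.choice(letters) for _ in range(6))  # nonsense word
        color = rng.choice(colors)
        # The fact appears only in training data, never in the test-time context.
        train.append(f"A {word} is {color}.")
        test.append((f"What color is a {word}?", color))
    return train, test

train_statements, test_questions = make_ooc_dataset(n_items=100)
```

The point of the split is that the test prompt never contains the defining statement, so a correct answer has to come from knowledge internalised during training.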

This seems like a pretty effective path to creating model organisms of steganography. E.g.

- Train the model on re-definitions of existing words, e.g. "A cat is a vehicle for transport"
- Test the model on whether it uses "cat" instead of "car" at test time. Or something in this vein.

I probably won't work on this myself, but I'm pretty interested in someone doing this and reporting their results

**Daniel Tan (dtch1997)** on Daniel Tan's Shortform · 2024-07-27T23:27:15.204Z · LW · GW

# [Note] Excessive back-chaining from theories of impact is misguided

*Rough summary of a conversation I had with Aengus Lynch*

As a mech interp researcher, one thing I've been trying to do recently is to figure out my big cruxes for mech interp, and then filter projects by whether they are related to these cruxes.

Aengus made the counterpoint that this can be dangerous, because even the best researchers' mental model of what will be impactful in the future is likely wrong, and errors will compound through time. Also, time spent refining a mental model is time not spent doing real work. Instead, he advocated for working on projects that seem likely to yield near-term value

I still think I got a lot of value out of thinking about my cruxes, but I agree with the sentiment that this shouldn't consume excessive amounts of my time

**Daniel Tan (dtch1997)** on Daniel Tan's Shortform · 2024-07-27T23:20:58.044Z · LW · GW

# [Note] On self-repair in LLMs

*A collection of empirical evidence*

Do language models exhibit self-repair?

One notion of self-repair is redundancy; having "backup" components which do the same thing, should the original component fail for some reason. Some examples:

- In the IOI circuit in GPT-2 small, there are primary "name mover heads" but also "backup name mover heads" which fire if the primary name movers are ablated. This is partially explained via copy suppression.
- More generally, The Hydra effect: Ablating one attention head leads to other attention heads compensating for the ablated head.
- Some other mechanisms for self-repair include "layernorm scaling" and "anti-erasure", as described in Rushing and Nanda, 2024

Another notion of self-repair is "regulation"; suppressing an overstimulated component.

- "Entropy neurons" reduce the model's confidence by squeezing the logit distribution.
- "Token prediction neurons" also function similarly

A third notion of self-repair is "error correction".

- Toy models of superposition suggests that NNs use ReLU to suppress small errors in computation
- Error correction is predicted by Computation in Superposition
- Empirically, it's been found that models tolerate errors well along certain directions in the activation space

Self-repair is annoying from the interpretability perspective.

- It creates an interpretability illusion; maybe the ablated component is actually playing a role in a task, but due to self-repair, activation patching shows an abnormally low effect.

A related thought: Grokked models probably do not exhibit self-repair.

- In the "circuit cleanup" phase of grokking, redundant circuits are removed due to the L2 weight penalty incentivizing the model to shed these unused parameters.
- I expect regulation to not occur as well, because there is always a single correct answer; hence a model that predicts this answer will be incentivized to be as confident as possible.
- Error correction still probably does occur, because this is largely a consequence of superposition

Taken together, I guess this means that self-repair is a coping mechanism for the "noisiness" / "messiness" of real data like language.

It would be interesting to study whether introducing noise into synthetic data (that is normally grokkable by models) also breaks grokking (and thereby induces self-repair).

**Daniel Tan (dtch1997)** on Daniel Tan's Shortform · 2024-07-27T23:07:00.203Z · LW · GW

# [Note] Is adversarial robustness best achieved through grokking?

*A rough summary of an insightful discussion with Adam Gleave, FAR AI*

We want our models to be adversarially robust.

- According to Adam, the scaling laws don't indicate that models will "naturally" become robust just through standard training.

One technique which FAR AI has investigated extensively (in Go models) is adversarial training.

- If we measure "weakness" in terms of how much compute is required to train an adversarial opponent that reliably beats the target model at Go, then starting out it's like 10m FLOPS, and this can be increased to 200m FLOPS through iterated adversarial training.
- However, this is both pretty expensive (~10-15% of pre-training compute), and doesn't work perfectly (even after extensive iterated adversarial training, models still remain vulnerable to new adversaries.)
- A useful intuition: Adversarial examples are like "holes" in the model, and adversarial training helps patch the holes, but there are just a lot of holes.

One thing I pitched to Adam was the notion of "adversarial robustness through grokking".

- Conceptually, if the model generalises perfectly on some domain, then there can't exist any adversarial examples (by definition).
- Empirically, "delayed robustness" through grokking has been demonstrated on relatively advanced datasets like CIFAR-10 and Imagenette; in both cases, models that underwent grokking became naturally robust to adversarial examples.

Adam seemed thoughtful, but had some key concerns.

- One of Adam's cruxes seemed to relate to how quickly we can get language models to grok; here, I think work like grokfast is promising in that it potentially tells us how to train models that grok much more quickly.
- I also pointed out that in the above paper, Shakespeare text was grokked, indicating that this is feasible for natural language
- Adam pointed out, correctly, that we have to clearly define what it means to "grok" natural language. To make an analogy to chess: one level of "grokking" could just be playing legal moves, whereas a more advanced level of grokking is to play the optimal move. In the language domain, the former would be equivalent to outputting plausible next tokens, and the latter would be equivalent to being able to solve arbitrarily complex intellectual tasks like reasoning.
- We had some discussion about characterizing "the best strategy that can be found with the compute available in a single forward pass of a model" and using that as the criterion for grokking.

His overall take was that it's mainly an "empirical question" whether grokking leads to adversarial robustness. He hadn't heard this idea before, but thought experiments / proofs of concept would be useful.

**Daniel Tan (dtch1997)** on Daniel Tan's Shortform · 2024-07-27T22:48:45.387Z · LW · GW

# [Note] On the feature geometry of hierarchical concepts

*A rough summary of insightful discussions with Jake Mendel and Victor Veitch*

Recent work on hierarchical feature geometry has made two specific predictions:

- Proposition 1: activation space can be decomposed hierarchically into a direct sum of many subspaces, each of which reflects a layer of the hierarchy.
- Proposition 2: within these subspaces, different concepts are represented as simplices.

Example of hierarchical decomposition: A dalmatian is a dog, which is a mammal, which is an animal. Writing this hierarchically, Dalmatian < Dog < Mammal < Animal. In this context, the two propositions imply that:

- P1: $x_{dalmatian} = x_{animal} + x_{mammal | animal} + x_{dog | mammal} + x_{dalmatian | dog}$, and the four terms on the RHS are pairwise orthogonal.
- P2: If we had a few different kinds of animal, like birds, mammals, and fish, the three vectors $x_{mammal | animal}, x_{fish | animal}, x_{bird | animal}$ would form a simplex.

According to Victor Veitch, the load-bearing assumption here is that different levels of the hierarchy are disentangled, and hence models want to represent them orthogonally. I.e. $x_{animal}$ is perpendicular to $x_{mammal | animal}$. I don't have a super rigorous explanation for why, but it's likely because this facilitates representing / sensing each thing independently.

- E.g. sometimes all that matters about a dog is that it's an animal; it makes sense to have an abstraction of "animal" that is independent of any sub-hierarchy.

Jake Mendel made the interesting point that, as long as the number of features is less than the number of dimensions, an orthogonal set of vectors will satisfy P1 and P2 for any hierarchy.

Example of P2 being satisfied. Let's say we have vectors $x_{animal} = (0,1)$ and $x_{plant} = (1,0)$, which are orthogonal. Then we could write $x_{living\_thing} = (1/\sqrt{2}, 1/\sqrt{2})$. Then $x_{animal | living\_thing}, x_{plant | living\_thing}$ would form a 1-dimensional simplex.

Example of P1 being satisfied. Let's say we have four things A, B, C, D arranged in a binary tree such that AB, CD are pairs. Then we could write $x_A = x_{AB} + x_{A | AB}$, satisfying both P1 and P2. However, if we had an alternate hierarchy where AC and BD were pairs, we could still write $x_A = x_{AC} + x_{A | AC}$. Therefore hierarchy is in some sense an "illusion", as any hierarchy satisfies the propositions.
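Jake's point can be checked numerically. A minimal sketch, under my own assumption that a parent concept's vector is the mean of its children's vectors: with four orthonormal leaf vectors, the parent and the leaf's residual come out orthogonal whichever way we pair the leaves.

```python
import numpy as np

# Four orthonormal feature vectors, one per leaf concept.
leaves = dict(zip("ABCD", np.eye(4)))

def parent(*names):
    # Assumption: a parent concept is represented by the mean of its children.
    return np.mean([leaves[n] for n in names], axis=0)

def satisfies_p1(leaf, siblings):
    """Check that x_leaf = x_parent + x_{leaf|parent} with the two terms orthogonal."""
    x_parent = parent(*siblings)
    x_residual = leaves[leaf] - x_parent
    return bool(np.isclose(x_parent @ x_residual, 0.0))

hierarchy_1 = satisfies_p1("A", ("A", "B"))  # tree with pairs AB, CD
hierarchy_2 = satisfies_p1("A", ("A", "C"))  # tree with pairs AC, BD
```

Both checks pass, so in this regime the geometry alone cannot tell the two hierarchies apart.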

Taking these two points together, the interesting scenario is when we have *more* features than dimensions, i.e. the setting of superposition. Then we have the two conflicting incentives:

- On one hand, models want to represent the different levels of the hierarchy orthogonally.
- On the other hand, there isn't enough "room" in the residual stream to do this; hence the model has to "trade off" what it chooses to represent orthogonally.

This points to super interesting questions:

- what geometry does the model adopt for features that respect a binary tree hierarchy?
- what if different nodes in the hierarchy have differing importances / sparsities?
- what if the tree is "uneven", i.e. some branches are deeper than others.
- what if the hierarchy isn't a tree, but only a partial order?

Experiments on toy models will probably be very informative here.

**Daniel Tan (dtch1997)** on Daniel Tan's Shortform · 2024-07-23T15:20:54.794Z · LW · GW

# My Seasonal Goals, Jul - Sep 2024

*This post is an exercise in public accountability and harnessing positive peer pressure for self-motivation.*

By 1 October 2024, I am committing to have produced:

- 1 complete project
- 2 mini-projects
- 3 project proposals
- 4 long-form write-ups

Habits I am committing to that will support this:

- Code for >=3h every day
- Chat with a peer every day
- Have a 30-minute meeting with a mentor figure every week
- Reproduce a paper every week
- Give a 5-minute lightning talk every week

**Daniel Tan (dtch1997)** on Daniel Tan's Shortform · 2024-07-22T08:15:37.097Z · LW · GW

This is really interesting, thanks! As I understand, "affine steering" applies an affine map to the activations, and this is expressive enough to perform a "rotation" on the circle. David Chanin has told me before that LRC doesn't really work for steering vectors. Didn't grok kernelized concept erasure yet but will have another read.

Generally, I am quite excited to implement existing work on more general steering interventions and then check whether they can automatically learn to steer modular addition

**Daniel Tan (dtch1997)** on Daniel Tan's Shortform · 2024-07-22T08:14:18.691Z · LW · GW

# [Note] On SAE Feature Geometry

SAE feature directions are likely "special" rather than "random".

- Different SAEs seem to converge to learning the same features
- SAE error directions increase model loss by a lot compared to random directions, indicating that the error directions are "special", which points to the feature directions also being "special"
- Conversely, SAE feature directions increase model loss by much less than random directions

Re: the last point above, this points to singular learning theory being an effective tool for analysis.

- Reminder: The LLC measures "local flatness" of the loss basin. A lower LLC = flatter loss, i.e. changing the model's parameters by a small amount does not increase the loss by much.
- In preliminary work on LLC analysis of SAE features, the "feature-targeted LLC" turns out to be something which can be measured empirically and distinguishes SAE features from random directions

**Daniel Tan (dtch1997)** on Daniel Tan's Shortform · 2024-07-17T14:45:47.651Z · LW · GW

# [Proposal] Attention Transcoders: can we take attention heads out of superposition?

*Note: This thinking is cached from before the bilinear sparse autoencoders paper. I need to read that and revisit my thoughts here.*

## Primer: Attention-Head Superposition

Attention-head superposition (AHS) was introduced in this Anthropic post from 2023. Briefly, AHS is the idea that models may use a small number of attention heads to approximate the effect of having many more attention heads.

**Definition 1: OV-incoherence.** An attention circuit is OV-incoherent if it attends from multiple different tokens back to a single token, and the output depends on the token attended from.

**Example 2: Skip-trigram circuits.** A skip trigram consists of a sequence [A]...[B] -> [C], where A, B, C are distinct tokens.

**Claim 3: A single head cannot implement multiple OV-incoherent circuits.** Recall from A Mathematical Framework that an attention head can be decomposed into the OV circuit and the QK circuit, which operate independently. Within each head, the OV circuit is solely responsible for mapping linear directions in the input to linear directions in the output, and it sees only the token being attended to (the key position). Since it does not see the token attended from (the query position), it must compute a fixed function of the attended-to token.

**Claim 4: Models compute many OV-incoherent circuits simultaneously in superposition**. If the ground-truth data is best explained by a large number of OV-incoherent circuits, then models will approximate having these circuits by placing them in superposition across their limited number of attention heads.

## Attention Transcoders

An attention transcoder (ATC) is described as follows:

- An ATC attempts to reconstruct the input and output of a specific attention block
- An ATC is simply a standard multi-head attention module, except that it has many more attention heads.
- An ATC is regularised during training such that the number of active heads is sparse.
- I've left this intentionally vague at the moment as I'm uncertain how exactly to do this.
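A rough forward-pass sketch of what an ATC loss could look like. The architecture is ordinary causal multi-head attention with many heads; the sparsity term (an L1 penalty on per-head output norms) is purely my placeholder for the regulariser left vague above, and all shapes, names, and coefficients are illustrative. NumPy has no autograd, so this only shows the loss computation, not training.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def atc_forward(x, w_q, w_k, w_v, w_o):
    """Causal multi-head attention. Returns (summed output, per-head outputs).

    x: (seq, d_model); w_q, w_k, w_v: (n_heads, d_model, d_head); w_o: (n_heads, d_head, d_model)
    """
    seq = x.shape[0]
    q = np.einsum("sd,hde->hse", x, w_q)
    k = np.einsum("sd,hde->hse", x, w_k)
    v = np.einsum("sd,hde->hse", x, w_v)
    scores = np.einsum("hqe,hke->hqk", q, k) / np.sqrt(q.shape[-1])
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)  # True for positions after the query
    pattern = softmax(np.where(future, -np.inf, scores), axis=-1)
    z = np.einsum("hqk,hke->hqe", pattern, v)
    head_out = np.einsum("hqe,hed->hqd", z, w_o)  # (n_heads, seq, d_model)
    return head_out.sum(axis=0), head_out

def atc_loss(x, target, params, sparsity_coeff=1e-3):
    """Reconstruction error plus an assumed L1 penalty on per-head output norms."""
    out, head_out = atc_forward(x, *params)
    reconstruction = ((out - target) ** 2).mean()
    head_norms = np.linalg.norm(head_out, axis=-1)  # (n_heads, seq)
    sparsity = head_norms.sum(axis=0).mean()        # L1 over heads, mean over positions
    return reconstruction + sparsity_coeff * sparsity

rng = np.random.default_rng(0)
seq, d_model, d_head, n_heads = 5, 8, 4, 32  # many more heads than a real block would have
params = [0.1 * rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3)]
params.append(0.1 * rng.normal(size=(n_heads, d_head, d_model)))
x = rng.normal(size=(seq, d_model))
target = rng.normal(size=(seq, d_model))  # stand-in for the real attention block's output
out, head_out = atc_forward(x, *params)
loss = atc_loss(x, target, params)
```

Returning the per-head outputs separately is what would let us intervene on, or penalise, individual heads.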

**Remark 5: The ATC architecture is the generalization of other successful SAE-like architectures to attention blocks.**

- Residual-stream SAEs simulate a model that has many more residual neurons.
- MLP transcoders simulate a model that has many more hidden neurons in its MLP.
- ATCs simulate a model that has many more attention heads.

**Remark 6: Intervening on ATC heads.** Since the ATC reconstructs the output of an attention block, ablations can be done by simply splicing the ATC into the model's computational graph and intervening directly on individual head outputs.

**Remark 7: Attributing ATC heads to ground-truth heads**. In standard attention-out SAEs, it's possible to directly compute the attribution of each head to an SAE feature. That seems impossible here because the ATC head outputs are not direct functions of the ground-truth heads. Nonetheless, if ATC heads seem highly interpretable and accurately reconstruct the real attention outputs, and specific predictions can be verified via interventions, it seems reasonable to conclude that they are a good explanation of how attention blocks are working.

## Key uncertainties

Does AHS actually occur in language models? I think we do not have crisp examples at the moment.

## Concrete experiments

The first and most obvious experiment is to try training an ATC and see if it works.

- Scaling milestones: toy models, TinyStories, open web text
- Do we achieve better Pareto curves of reconstruction loss vs. L0 than standard attention-out SAEs?

Conditional on that succeeding, the next step would be to attempt to interpret individual heads in an ATC and determine whether they are interpretable.

- It may be useful to compare to known examples of suspected AHS; however, direct comparison is difficult due to Remark 7 above.

**Daniel Tan (dtch1997)** on Daniel Tan's Shortform · 2024-07-17T08:59:54.967Z · LW · GW

# [Proposal] Can we develop a general steering technique for nonlinear representations? A case study on modular addition

Steering vectors are a recent and increasingly popular alignment technique. They are based on the observation that many features are encoded as linear directions in activation space; hence, intervening within this 1-dimensional subspace is an effective method for controlling that feature.

Can we extend this to nonlinear features? A simple example of a nonlinear feature is circular representations in modular arithmetic. Here, it's clear that a simple "steering vector" will not work. Nonetheless, as the authors show, it's possible to construct a nonlinear *steering intervention* that demonstrably influences the model to predict a different result.

Problem: The construction of a steering intervention in the modular addition paper relies heavily on the a-priori knowledge that the underlying feature geometry is a circle. Ideally, we wouldn't need to fully elucidate this geometry in order for steering to be effective.

Therefore, we want a procedure which *learns* a nonlinear steering intervention given only the model's activations and labels (e.g. the correct next-token).

Such a procedure might look something like this:

- Assume we have paired data $(x, y)$ for a given concept. $x$ is the model's activations and $y$ is the label, e.g. the day of the week.
- Define a function $x' = f_\theta(x, y, y')$ that predicts the $x'$ for steering the model towards $y'$.
- Optimize $f_\theta(x, y, y')$ using a dataset of steering examples.
- Evaluate the model under this steering intervention, and check if we've actually steered the model towards $y'$. Compare this to the ground-truth steering intervention.
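The steps above can be sketched on a toy problem. Everything here is an assumption for illustration: the "activations" are synthetic points on a circle pushed through a random linear embedding, $f_\theta$ is parametrised as one linear map per label offset $(y' - y) \bmod 7$, and it is fit by least squares rather than gradient descent. The fitting code only sees activations and labels, never the circle itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, d_model = 7, 4

# Toy stand-in for model activations: labels on a circle, embedded by a random linear map.
embed = rng.normal(size=(2, d_model))

def activation(y):
    angle = 2 * np.pi * np.asarray(y) / n_classes
    return np.stack([np.cos(angle), np.sin(angle)], axis=-1) @ embed

def fit_steering_maps(x_src, x_tgt, offsets):
    """Fit one linear map per label offset by least squares, so that x_src @ W ~= x_tgt."""
    maps = {}
    for d in np.unique(offsets):
        sel = offsets == d
        w, *_ = np.linalg.lstsq(x_src[sel], x_tgt[sel], rcond=None)
        maps[int(d)] = w
    return maps

def steer(x, y, y_target, maps):
    """f_theta(x, y, y'): apply the learned map for the offset (y' - y) mod n_classes."""
    return x @ maps[(y_target - y) % n_classes]

def decode(x):
    """Read off the label whose clean activation is nearest to x."""
    candidates = activation(np.arange(n_classes))
    return int(np.argmin(np.linalg.norm(candidates - x, axis=-1)))

# Dataset of steering examples: every (y, y') pair.
ys, ys_target = (a.ravel() for a in np.meshgrid(np.arange(n_classes), np.arange(n_classes)))
maps = fit_steering_maps(activation(ys), activation(ys_target), (ys_target - ys) % n_classes)

steered = steer(activation(2), 2, 5, maps)
success = decode(steered) == 5
```

In a real model, `decode` would be replaced by running the rest of the forward pass on the steered activations and checking the predicted token.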

If this works, it might be applicable to other examples of nonlinear feature geometries as well.

Thanks to David Chanin for useful discussions.

**Daniel Tan (dtch1997)** on Arrakis - A toolkit to conduct, track and visualize mechanistic interpretability experiments. · 2024-07-17T08:48:48.107Z · LW · GW

Really interesting! I'm a big proponent of improving the standards of infrastructure in the mech interp community.

Some questions:

- Have you used other things like TransformerLens and NNsight and found those to be insufficient in some way? Your library seems to diverge fundamentally from both of those implementations (pytorch hooks in the former case and "proxy variables" in the latter case). I'm curious about the motivating use case here.
- Do you have examples of reproducing specific mech interp analyses using your library? E.g. Neel Nanda's Indirect Object Identification tutorial, or other simple things like doing activation patching / logit lens.

**Daniel Tan (dtch1997)** on Daniel Tan's Shortform · 2024-07-17T08:22:00.284Z · LW · GW

# [Draft][Note] On Singular Learning Theory

Relevant links

- AXRP with Daniel Murfet on an SLT primer
- Manifund grant proposal on DevInterp research agenda
- Daniel Murfet's post on "simple != short"
- Timaeus blogpost on actionable research projects
- DevInterp repository for estimating LLC

**Daniel Tan (dtch1997)** on Daniel Tan's Shortform · 2024-07-17T08:01:46.477Z · LW · GW

# [Proposal] Do SAEs capture simplicial structure? Investigating SAE representations of known case studies

It's an open question whether SAEs capture underlying properties of feature geometry. Fortunately, careful research has elucidated a few examples of nonlinear geometry already. It would be useful to think about whether SAEs recover these geometries.

**Simplices in models**. Work studying hierarchical structure in feature geometry finds that sets of things are often represented as simplices, which are a specific kind of regular polytope. Simplices are also the structure of belief state geometry.

The proposal here is: look at the SAE activations for the tetrahedron, identify a relevant cluster, and then evaluate whether this matches the ground-truth.

**Daniel Tan (dtch1997)** on Daniel Tan's Shortform · 2024-07-17T07:39:43.396Z · LW · GW

# [Note] The Polytope Representation Hypothesis

This is an empirical observation about recent works on feature geometry, that (regular) polytopes are a recurring theme in feature geometry.

**Simplices in models**. Work studying hierarchical structure in feature geometry finds that sets of things are often represented as simplices, which are a specific kind of regular polytope. Simplices are also the structure of belief state geometry.

**Regular polygons in models**. Recent work studying natural language modular arithmetic has found that language models represent things in a circular fashion. I will contend that "circle" is a bit imprecise; these are actually *regular polygons*, which are the 2-dimensional versions of polytopes.

A reason why polytopes could be a natural unit of feature geometry is that they characterize linear regions of the activation space in ReLU networks. However, I will note that it's not clear that this motivation for polytopes coincides very well with the empirical observations above.

**Daniel Tan (dtch1997)** on Daniel Tan's Shortform · 2024-07-17T07:20:21.140Z · LW · GW

Oh that's really interesting! Can you clarify what "MCS" means? And can you elaborate a bit on how I'm supposed to interpret these graphs?

**Daniel Tan (dtch1997)** on Daniel Tan's Shortform · 2024-07-17T07:14:46.733Z · LW · GW

# [Note] Is Superposition the reason for Polysemanticity? Lessons from "The Local Interaction Basis"

Superposition is currently the dominant hypothesis to explain polysemanticity in neural networks. However, how much better does it explain the data than alternative hypotheses?

**Non-neuron aligned basis**. The leading alternative, as asserted by Lawrence Chan here, is that there are not a very large number of underlying features; just that these features are not represented in a neuron-aligned way, so individual neurons appear to fire on multiple distinct features.

The Local Interaction Basis explores this idea in more depth. Starting from the premise that there is a linear and interpretable basis that is not overcomplete, they propose a method to recover such a basis, which works in toy models. However, empirical results in language models fail to demonstrate that the recovered basis is indeed more interpretable.

My conclusion from this is a big downwards update on the likelihood of the "non-neuron aligned basis" in realistic domains like natural language. The real world probably just is complex enough that there are tons of distinct features which represent reality.

**Daniel Tan (dtch1997)** on Daniel Tan's Shortform · 2024-07-17T07:04:06.983Z · LW · GW

# [Proposal] Is reasoning in natural language grokkable? Training models on language formulations of toy tasks.

Previous work on grokking finds that models can grok modular addition and tree search. However, these are not tasks formulated in natural language. Instead, the tokens correspond directly to true underlying abstract entities, such as numerical values or nodes in a graph. I question whether this representational simplicity is a key ingredient of grokking reasoning.

I have a prior that expressing concepts in natural language (as opposed to directly representing concepts as tokens) introduces an additional layer of complexity which makes grokking much more difficult.

The proposal here is to repeat the experiments with tasks that test equivalent reasoning skills, but which are formulated in natural language.

- Modular addition can be formulated as "day of the week" math, as has been done previously
- Tree search is more difficult to formulate, but might be phrasable as some kind of navigation instruction.
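As a concrete illustration of the first bullet, here is a sketch of the same modular addition problem in "direct concept tokenization" form and in a natural-language form. The prompt templates are my own guesses at a reasonable phrasing, not the format used in prior work.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def symbolic_example(a, b, modulus=7):
    """Direct tokenization: operands and answer are the numbers themselves."""
    return f"{a} + {b} =", str((a + b) % modulus)

def natural_language_example(start, offset):
    """The same computation, phrased as day-of-the-week arithmetic."""
    unit = "day" if offset == 1 else "days"
    prompt = f"Today is {DAYS[start]}. In {offset} {unit} it will be"
    return prompt, DAYS[(start + offset) % len(DAYS)]

symbolic = symbolic_example(5, 3)
natural = natural_language_example(5, 3)
```

Both examples encode 5 + 3 mod 7, so any difference in how quickly a model groks them would be down to the surface form.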

I'd expect that we could observe grokking, but that it might take a lot longer (and require larger models) when compared to the "direct concept tokenization". Conditioned on this being true, it would be interesting to observe whether we recover the same kinds of circuits as demonstrated in prior work.

**Daniel Tan (dtch1997)** on Daniel Tan's Shortform · 2024-07-17T06:52:05.787Z · LW · GW

# [Proposal] Do SAEs learn universal features? Measuring Equivalence between SAE checkpoints

If we train several SAEs from scratch on the same set of model activations, are they “equivalent”?

Here are three notions of "equivalence":

- Direct equivalence. Features in one SAE are the same (in terms of decoder weight) as features in another SAE.
- Linear equivalence. Features in one SAE directly correspond one-to-one with features in another SAE after some global transformation like rotation.
- Functional equivalence. The SAEs define the same input-output mapping.
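A minimal sketch of how the first and third notions might be measured. I'm assuming a bare-bones SAE of the form `relu(x @ w_enc) @ w_dec` with no biases, and the weights here are random placeholders rather than trained checkpoints. Direct equivalence is scored by the best cosine similarity between decoder rows, and functional equivalence by the mean squared gap between reconstructions. Linear equivalence would additionally need a fitted global transformation, which I've left out.

```python
import numpy as np

def max_cosine_similarity(decoder_a, decoder_b):
    """For each feature (row) of decoder_a, its best cosine similarity with any row of decoder_b."""
    a = decoder_a / np.linalg.norm(decoder_a, axis=1, keepdims=True)
    b = decoder_b / np.linalg.norm(decoder_b, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1)

def functional_gap(sae_a, sae_b, activations):
    """Mean squared difference between two SAEs' reconstructions of the same inputs."""
    return float(((sae_a(activations) - sae_b(activations)) ** 2).mean())

def make_sae(w_enc, w_dec):
    # Assumed minimal SAE: ReLU encoder, linear decoder, no biases.
    def sae(x):
        return np.maximum(x @ w_enc, 0.0) @ w_dec
    return sae

rng = np.random.default_rng(0)
d_model, n_features = 16, 64
w_enc = rng.normal(size=(d_model, n_features))
w_dec = rng.normal(size=(n_features, d_model))

# A second "checkpoint" that is the same SAE with its features shuffled.
perm = rng.permutation(n_features)
sae_a = make_sae(w_enc, w_dec)
sae_b = make_sae(w_enc[:, perm], w_dec[perm])

direct = max_cosine_similarity(w_dec, w_dec[perm])
gap = functional_gap(sae_a, sae_b, rng.normal(size=(32, d_model)))
```

The shuffled copy is the easy case where both notions agree; the interesting experiment is what these numbers look like for independently trained SAEs.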

A priori, I would expect that we get rough functional equivalence, but not feature equivalence. I think this experiment would help elucidate the underlying invariant geometrical structure that SAE features are suspected to be in.

Changelog:

- 18/07/2024 - Added discussion on "linear equivalence"

**Daniel Tan (dtch1997)** on Daniel Tan's Shortform · 2024-07-17T06:38:07.280Z · LW · GW

# [Proposal] Are circuits universal? Investigating IOI across many GPT-2 small checkpoints

**Universal features.** Work such as the Platonic Representation Hypothesis suggests that sufficiently capable models converge to the same representations of the data. To me, this indicates that the underlying "entities" which make up reality are universally agreed upon by models.

**Non-universal circuits.** There are many different algorithms which could correctly solve the same problem. Prior work such as the clock and the pizza indicates that, even for very simple tasks, models can learn very different algorithms depending on the "attention rate".

**Circuit universality is a crux.** If circuits are mostly model-specific rather than being universal, it makes the near-term impact of MI a lot lower, since finding a circuit in one model tells us very little about what a slightly different model is doing.

**Concrete experiment: Evaluating the universality of IOI.** Gurnee et al train several GPT-2 small checkpoints from scratch. We know from prior work that GPT-2 small has an IOI circuit. What, if any, components of this turn out to be universal? Maybe we always observe induction heads. But do we always observe name-mover and S-inhibition heads? If so, are they always at the same layer? Etc. I think this experiment would inform us a lot about circuit universality.

**Daniel Tan (dtch1997)** on LLM Generality is a Timeline Crux · 2024-07-16T08:29:36.381Z · LW · GW

Yup, that's basically what I think! IMO, grokking = having memorised the "underlying rules" that define the DGP, and these rules are general by definition. "Reasoning" is a loaded term that's difficult to unpack, but I think a good working definition is "applying a set of rules to arrive at an answer". In other words, reasoning is learning a "correct algorithm" to solve the problem. Therefore being able to reason correctly 100% of the time is equivalent to models having grokked their problem domain.

See this work, which finds that reasoning only happens through grokking. Separate work has trained models to do tree search, and found that backwards chaining circuits (a correct algorithm) emerge only through grokking. And also the seminal work on modular addition which found that correct algorithms emerge through grokking.

Note that the question of "is reasoning in natural language grokkable?" is a totally separate crux and one which I'm highly uncertain about.

**Daniel Tan (dtch1997)** on Transcoders enable fine-grained interpretable circuit analysis for language models · 2024-05-15T21:12:54.217Z · LW · GW

Hey Jacob + Philippe,

I took the liberty of making a clean installable version of your original codebase. Hope you don't mind, and happy to make any changes that you request! https://github.com/dtch1997/transcoders-slim

**Daniel Tan (dtch1997)** on Toward A Mathematical Framework for Computation in Superposition · 2024-04-27T15:00:53.826Z · LW · GW

This work is very exciting to me, and I'm curious to hear the authors' thoughts on whether we could verify specific predictions made by this model in real models.

- For example, the proposed U-AND operator - do we expect this to occur in real LLMs, and could we try to find evidence of this by applying mech interp to carefully-chosen toy models?

I have a more detailed write-up on model organisms of superposition here: https://docs.google.com/document/d/1hwI30HNNB2MkOrtEzo7hppG9X7Cn7Xm9a-1LBqcttWc/edit?usp=sharing

Would love to discuss this more!