What is a circuit? [in interpretability]

post by Yudhister Kumar (randomwalks) · 2025-02-14T04:40:42.978Z · LW · GW · 1 comments

I'm aware of the understanding that "a circuit is a subgraph of a neural network that implements a specific computation."

In practice (to my understanding), the way you identify "circuits" is by finding components of the neural network whose activity correlates strongly with certain tasks, then running ablations to check whether those components are "causally responsible" for performance on that task.
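To make the ablation step concrete, here is a minimal sketch on a hand-wired toy network. Everything in it is an assumption for illustration: the weights, the task (computing |x|), the choice of mean-ablation, and names such as `forward` and `effects` are hypothetical and not taken from any particular paper or library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network: 1 input -> 4 ReLU hidden units -> 1 output.
# Units 0 and 1 are hand-wired to compute |x| = relu(x) + relu(-x);
# units 2 and 3 have tiny output weights and barely matter.
W_in = np.array([[1.0], [-1.0], [0.5], [-0.3]])  # shape (hidden, input)
b_in = np.array([0.0, 0.0, 0.1, 0.2])
w_out = np.array([1.0, 1.0, 0.01, -0.01])


def hidden(x):
    """Hidden activations, shape (n_samples, 4)."""
    return np.maximum(0.0, x @ W_in.T + b_in)


def forward(x, ablate=None, mean_acts=None):
    """Run the network, optionally replacing one unit by its mean activation."""
    h = hidden(x)
    if ablate is not None:
        h = h.copy()
        h[:, ablate] = mean_acts[ablate]
    return h @ w_out


x = rng.normal(size=(256, 1))
y = np.abs(x[:, 0])  # the "task"


def mse(pred):
    return float(np.mean((pred - y) ** 2))


mean_acts = hidden(x).mean(axis=0)
baseline = mse(forward(x))

# Increase in task loss when each hidden unit is mean-ablated.
effects = {
    u: mse(forward(x, ablate=u, mean_acts=mean_acts)) - baseline
    for u in range(4)
}

for unit, delta in effects.items():
    print(f"unit {unit}: loss increase {delta:.4f}")
```

Units whose ablation raises the loss substantially are candidates for being part of the circuit for the task; units whose ablation changes almost nothing are not. On real models the components would be attention heads, neurons, or edges rather than four hidden units, and the choice of ablation (zero, mean, or resampling) can change the answer.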

It feels like there's also a different way of understanding circuits where morally, circuits themselves are sequences of operations done to features, where the features are the primitives and the operations are mostly considered to be the linearities/nonlinearities represented in the model architecture (although I can understand different perspectives).

A few questions (forgive my ignorance):

If I have a tiny network trained on an algorithmic task, is there an automated search method I can use to identify relevant subgraphs of the neural network doing meaningful computation in a way that the circuits are distinct from each other? Does this depend on training? (Is there a way to classify all circuits in a network (or >10% of them) exhaustively in a potentially computationally intractable manner?)
What is a feature? Are there different circuits that appear in a network based on your definition of what a relevant feature is? How crisp are these circuits that appear, both in toy examples and in the wild?
What are the best examples of "circuits in the wild" that are actually robust?

1 comment

comment by Joseph Bloom (Jbloom) · 2025-02-14T16:25:33.272Z · LW(p) · GW(p)

Good resource: Neel Nanda's glossary, https://dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J

> What is a feature?

This often gets confused because the early literature doesn't distinguish well between a property of the input that the model represents and the model's internal representation of it. These days we tend to refer to the former as a feature and the latter as a latent. Eg: "Not all Language Model Features are Linear" is a claim that not all representations are linear, not a statement about what gets represented.

> Are there different circuits that appear in a network based on your definition of what a relevant feature is?

This question seems potentially confusing. If you use different methods (eg: supervised vs unsupervised) you are likely to get different results. Eg: in a paper I supervised (https://arxiv.org/html/2409.14507v2) we looked at how SAEs compare to linear probes. That was a comparison of methods for finding representations. I don't know of any work doing circuit finding with multiple feature-finding methods though (but I'd be excited about it).
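For readers unfamiliar with the supervised side of that comparison, a linear probe can be sketched in a few lines. This is only an illustration on synthetic data, not the setup from the linked paper: the activations, the planted `direction`, and the least-squares fit are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 16

# A binary property of the input (the "feature" in the sense above).
feature = rng.integers(0, 2, size=n)

# Synthetic "activations": noise plus a planted direction whose sign
# encodes the feature (this planted direction plays the role of the latent).
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
acts = rng.normal(scale=0.3, size=(n, d)) + np.outer(2 * feature - 1, direction)

# Linear probe: least-squares regression from activations (plus a bias
# column) onto the +/-1 label.
X = np.hstack([acts, np.ones((n, 1))])
w, *_ = np.linalg.lstsq(X, 2.0 * feature - 1.0, rcond=None)

pred = (X @ w > 0).astype(int)
accuracy = float(np.mean(pred == feature))
cosine = float(w[:d] @ direction / np.linalg.norm(w[:d]))

print(f"probe accuracy: {accuracy:.3f}, cosine with planted direction: {cosine:.3f}")
```

The probe needs labels for the property you care about, whereas an SAE is fit without labels and you interpret its latents afterwards. That difference is one reason the two methods can hand you different candidate representations for the same model.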

> How crisp are these circuits that appear, both in toy examples and in the wild?

Read ACDC: https://arxiv.org/abs/2304.14997. Generally, not crisp.
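For intuition about what threshold-based circuit search looks like, here is a toy sketch loosely in the spirit of ACDC. It is not the paper's implementation: the tiny network, the use of zero-ablation, the mean-squared difference from the full model as the metric, and the names `run`, `divergence` and `greedy_prune` are all assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(512, 2))

# Hypothetical toy model: 2 inputs -> 3 ReLU hidden units -> 1 output.
# Hidden unit 2 has small weights in and out, so its edges should be pruned.
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [0.02, -0.03]])       # (hidden, input)
W2 = np.array([[1.0, -1.0, 0.05]])   # (output, hidden)


def run(mask1, mask2):
    """Forward pass with some edges (weights) masked out."""
    h = np.maximum(0.0, x @ (W1 * mask1).T)
    return h @ (W2 * mask2).T


full_out = run(np.ones_like(W1), np.ones_like(W2))


def divergence(mask1, mask2):
    """Mean squared difference between pruned and full model outputs."""
    return float(np.mean((run(mask1, mask2) - full_out) ** 2))


def greedy_prune(tau):
    """Remove each edge whose removal raises the divergence by less than tau."""
    mask1, mask2 = np.ones_like(W1), np.ones_like(W2)
    current = 0.0
    # Visit edges nearest the output first, then edges nearer the input.
    for mask in (mask2, mask1):
        for idx in np.ndindex(mask.shape):
            mask[idx] = 0.0
            candidate = divergence(mask1, mask2)
            if candidate - current < tau:
                current = candidate   # keep the edge removed
            else:
                mask[idx] = 1.0       # edge matters: restore it
    return mask1, mask2, current


mask1, mask2, final_div = greedy_prune(tau=0.01)
print(mask1, mask2, final_div)
```

The surviving edges are the candidate circuit. Even in this toy the result depends on the threshold `tau`, the ablation type, and the order in which edges are visited, which is one concrete sense in which recovered circuits tend not to be crisp.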

> What are the best examples of "circuits in the wild" that are actually robust?

The ARENA curriculum probably covers a few. There might also be some papers comparing circuit-finding methods on a standard set of circuits that you could look up.

> If I have a tiny network trained on an algorithmic task, is there an automated search method I can use to identify relevant subgraphs of the neural network doing meaningful computation in a way that the circuits are distinct from each other?

Interesting question. See Neel's thoughts here: https://www.neelnanda.io/mechanistic-interpretability/othello#finding-modular-circuits

> Does this depend on training?

Probably yes. Probably also on how the different tasks relate to each other (whether they have shareable intermediate results).

> (Is there a way to classify all circuits in a network (or >10% of them) exhaustively in a potentially computationally intractable manner?)

I don't know if circuits are a good enough description of reality for this to be feasible. But you might find this interesting: https://arxiv.org/abs/2501.14926
