(Maybe) A Bag of Heuristics is All There Is & A Bag of Heuristics is All You Need

post by Sodium · 2024-10-03T19:11:58.032Z · LW · GW · 1 comments

Contents

    The Heuristics Hypothesis: A Bag of Heuristics is All There Is and a Bag of Heuristics is All You Need
    Why would you want to use the heuristics-based framework when thinking about neural networks?
  How can interpretability win if the hypothesis is true?
    Corollary: Understanding neural network computation does not require us to learn “true features” as long as we have some set of faithful, complete, minimal, and comprehensible heuristics
    Weak to strong winning
    Miscellaneous thoughts on interpretability with the heuristics hypothesis
  What does it mean for alignment theory if the heuristics hypothesis is true?
  Empirical studies related to the heuristics hypothesis (both in support and against)
  Weaknesses in the Heuristics Hypothesis
    Some versions of the hypothesis are unfalsifiable
    The current features-focused research agendas might be the best way to uncover heuristics, and we don’t actually need to do anything different regardless of how true the heuristics hypothesis is.
    Getting heuristics that are causally related to a specific output does not necessarily help monitor a model’s internal thoughts.
  Inspirations and related work that I haven’t already mentioned
  Potential next steps
    Deconfusion: What exactly is a heuristic, and what does a heuristics-based explanation look like?
    Creating new interpretability methods that are centered around heuristics as the fundamental unit
    Using existing interpretability tools to discover heuristics
    Applying the heuristics-framework to study theoretical questions in alignment.

Epistemic status: Theorizing on topics I’m not qualified for. Trying my best to be truth-seeking instead of hyping up my idea. Not much here is original, but hopefully the combination is useful. This hypothesis deserves more time and consideration but I’m sharing this minimal version to get some feedback before sinking more time into it. “We believe there’s a lot of value in articulating a strong version of something one may believe to be true, even if it might be false.”

The Heuristics Hypothesis: A Bag of Heuristics is All There Is and a Bag of Heuristics is All You Need

Why would you want to use the heuristics-based framework when thinking about neural networks?

Feel free to jump around this post and check out the sections that interest you. Each section is mostly independent of the others. 

How can interpretability win if the hypothesis is true?

I want to first clarify something I do not think we need in order to win: a one-to-one mapping between neural network computation and heuristics. I believe that we can have multiple acceptable heuristics-based explanations for a given forward pass (i.e., a one-to-many map). Any explanation that fits four criteria—faithfulness, completeness, minimality, and comprehensibility, mostly copied from the original IOI paper—would be sufficient for “understanding why the model did what it did.” 

I believe a bag of heuristics is the easiest way to fulfill these four criteria on arbitrary inputs. 

Corollary: Understanding neural network computation does not require us to learn “true features” as long as we have some set of faithful, complete, minimal, and comprehensible heuristics

A central way people evaluate sparse autoencoders (SAEs) is whether they find a set of “true” features. Researchers have varying intuitions on what true features should be, but a common theme is that they should be atomic (i.e., not composed of linear combinations of other features). This has led people to worry that the sparsity term in the SAE loss pushes models toward combining commonly co-occurring atomic features into a single one (e.g., a “red triangle” feature instead of separate “red” and “triangle” features; see also the recent work on feature absorption [LW · GW]).
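For readers less familiar with the setup, here is a minimal sketch of the standard SAE formulation being discussed; the dimensions and coefficient values are placeholders rather than numbers from any particular paper.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: reconstruct activations through an overcomplete, sparse bottleneck."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # feature activations
        x_hat = self.decoder(f)          # reconstruction
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on feature activations.
    # The sparsity term is the source of the worry above: it can make it cheaper
    # to learn one "red triangle" feature than separate "red" and "triangle" ones.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity

# Hypothetical usage on cached residual-stream activations of shape (batch, d_model):
# sae = SparseAutoencoder(d_model=768, d_hidden=768 * 16)
# x_hat, f = sae(acts)
# loss = sae_loss(acts, x_hat, f)
```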

While learning intermediate variables in neural networks is a useful subgoal, I’m worried that the pursuit of atomic features—especially given that we can already get some sort of feature decomposition—is not the most productive task we could work on right now.

We should only care about features insofar as they are the inputs and outputs of heuristics/circuits, and we should only care about monosemanticity insofar as it helps us understand the network. If our heuristic decomposition is faithful, complete, and minimal, it doesn’t matter if individual heuristics take non-atomic concepts as inputs as long as we humans can understand the composed concept (likely given AI aid). 

Weak to strong winning

Here are various degrees of winning if the heuristics hypothesis is true. 

Weak victory: We can decompose every forward pass into heuristics composed with each other. That is, we can throw away the rest of the activations and use only the heuristics to reconstruct the input-output relationship with high fidelity (one way to quantify this is sketched below).

Medium victory: In addition to individual forward passes, we understand sets of heuristics that a model uses to solve what humans can think of as “tasks.” That is, we understand all heuristics that handle a certain class of inputs (e.g., the IOI circuit).

Strong victory: We know every heuristic in the model and how they compose, which is analogous to the “reverse engineer a neural network” end goal.
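As a rough illustration of what “reconstruct the input-output relationship with high fidelity” could mean operationally for the weak victory, here is a sketch of a model-level fidelity check. The function name and the “keep only the heuristics, ablate everything else” procedure are my own framing, not a method from the literature.

```python
import torch
import torch.nn.functional as F

def reconstruction_fidelity(full_logits: torch.Tensor,
                            heuristic_only_logits: torch.Tensor) -> float:
    """KL divergence from the full model's next-token distribution to the one
    produced when only the identified heuristics are kept (everything else
    mean-ablated or zeroed out). A value near zero would mean the bag of
    heuristics reproduces the forward pass with high fidelity."""
    full_logprobs = F.log_softmax(full_logits, dim=-1)
    heur_logprobs = F.log_softmax(heuristic_only_logits, dim=-1)
    # KL(full || heuristic-only), averaged over the batch.
    return F.kl_div(heur_logprobs, full_logprobs,
                    log_target=True, reduction="batchmean").item()
```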

Miscellaneous thoughts on interpretability with the heuristics hypothesis

Interpretability with heuristics is not very different from existing circuits analysis. The main ideas I came up with are a focus on heuristics as the key unit of analysis and being explicitly OK with many different potential explanations/levels of abstraction. As a result, it’s not clear if there’s anything major that we need to do differently. Sparse feature circuits, transcoders [LW · GW], and the automated circuit discovery techniques already popular in the literature seem to be reasonable ways to proceed even if our end goal is a set of heuristics. 

However, given that a weak victory does not require an enumeration of all features/heuristics, it might be worth the time to try to discover more compute-efficient ways to understand a single forward pass.

I also haven’t defined what a heuristic is because I’m genuinely not sure what the best level of abstraction would be. Here are some types of simple functions that I would consider a “heuristic”:

Let me know if you have any other ideas.
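To make the abstraction a bit more concrete, here are some toy illustrations of the kind of simple, local if-then functions I have in mind. These specific rules, names, and signatures are invented purely for illustration, not recovered from any real model.

```python
# Toy illustrations only -- each "heuristic" is a small, local function
# that maps some intermediate signal to another.

def copy_token_heuristic(token_id: int, earlier_token_ids: set[int]) -> float:
    """Induction-like rule: boost a candidate token that already appeared in context."""
    return 2.0 if token_id in earlier_token_ids else 0.0

def subject_verb_agreement_heuristic(subject_is_plural: bool) -> str:
    """Pick the verb form that matches the subject's number."""
    return "are" if subject_is_plural else "is"

def name_mover_heuristic(names_in_context: list[str], repeated_name: str) -> str:
    """IOI-style rule: promote the name that is NOT repeated (the indirect object)."""
    non_repeated = [name for name in names_in_context if name != repeated_name]
    return non_repeated[0] if non_repeated else repeated_name
```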

A few more thoughts on verifying how correct the heuristics-based explanations are. I think there are two levels: the heuristic level and the model level. At the heuristic level, we want to make sure that each individual heuristic is faithful to the underlying neural network computation. Ideally this could be done at the weight level, but we can also apply our bag of existing interpretability techniques. 
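For instance, one hedged sketch of a heuristic-level check is activation patching on whatever component we hypothesize implements the heuristic. The `model`, `component`, and `corrupted_activation` objects below are generic stand-ins, not a specific library’s API.

```python
import torch

def patch_heuristic_component(model, component, clean_input, corrupted_activation):
    """Run the model on clean_input, but overwrite the output of the module we
    hypothesize implements a single heuristic with its activation from a
    corrupted run. If behavior shifts toward the corrupted answer, the
    hypothesized heuristic is at least causally implicated; if nothing changes,
    our write-up of the heuristic is probably not faithful."""
    def hook(module, inputs, output):
        # corrupted_activation must match the module's output structure.
        return corrupted_activation

    handle = component.register_forward_hook(hook)
    try:
        with torch.no_grad():
            patched_logits = model(clean_input)
    finally:
        handle.remove()
    return patched_logits
```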

At the model level, my hope is that we can use our interpretability techniques to discover new algorithms in the form of composing heuristics that we don’t know how to write. One of my first memorable interactions with ChatGPT was when I asked it to help me rephrase some survey questions I was working on, and it was actually really helpful. We currently have no idea how to write down a program to do that! Learning all the heuristics involved in various tasks could be a path towards some form of Microscope AI [AF · GW]. And, as is the case with circuits analysis, these algorithms fall out naturally once we construct the heuristics.

What does it mean for alignment theory if the heuristics hypothesis is true?

(I’ve spent orders of magnitude less time and effort on this section compared to the interp section, but I figured I’d mention a few ideas and collect some feedback. If people actually like this hypothesis I’ll spend some more time thinking through this)

I’m not super sure if the heuristics framework alone could make concrete predictions on key aspects of alignment theory. You can approximate any function arbitrarily closely with heuristics. In other words, as systems advance, any sort of high-level behavior could emerge even if it’s all heuristics operating below (see, e.g., Interpretability/Tool-ness/Alignment/Corrigibility are not Composable [LW · GW], which is also a problem when we aggregate heuristics from each layer together). 

However, the more powerful future model you’re worried about won’t just fall out of a coconut tree.[5] We need to understand its learning process and how it became powerful. If learning more heuristics is all we need to get to more and more powerful systems, we should understand what types of heuristics and heuristic compositions are learned first. Two relevant papers that come to mind are the quantization model of scaling and the work by Park et al. on which concepts are learned first in toy models. Work done here could help us understand if we’ll see, for example, capabilities generalization without alignment generalization. Generally though, it would be cool to see how alignment-related concepts are learned and used compared to non-alignment-related concepts.

On the surface, it seems like shard theory [? · GW] is more likely to be correct in the world where the heuristics hypothesis is true, although shards are higher-level abstractions compared to heuristics. I’d want to see some more concrete interpretability findings before making a strong claim though.

Empirical studies related to the heuristics hypothesis (both in support and against)

We need to keep in mind that the streetlight effect is certainly contaminating our evidence. That is, simple heuristics are easier for interpretability researchers to recover than complex data structures, and we should expect more evidence for them. 

It’s also cool to think through some other general neural network phenomena with the heuristics hypothesis in mind. It makes sense for the network to have some sort of redundancy (e.g., backup name mover heads) if similar heuristics are learned at the same time. It makes sense that you could get the network to output whatever arbitrary text you want with an optimized string, since you can activate a weird set of heuristics and compose them. As heuristics compose from one layer to the next, they would need intermediate variables to communicate their results. Thus it makes sense that activation engineering works well across a wide range of concepts.
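For concreteness, here is a rough sketch of the kind of activation-addition intervention I mean; the hook-based implementation, layer index, and variable names are assumptions for illustration, not any specific paper’s code.

```python
import torch

def add_steering_vector(layer_module, steering_vector, alpha=8.0):
    """Register a hook that adds alpha * steering_vector to a layer's output.
    If later-layer heuristics read roughly linear intermediate variables, nudging
    such a variable should predictably change which heuristics fire downstream."""
    def hook(module, inputs, output):
        # Many transformer blocks return tuples; patch only the hidden states.
        if isinstance(output, tuple):
            return (output[0] + alpha * steering_vector,) + output[1:]
        return output + alpha * steering_vector
    return layer_module.register_forward_hook(hook)

# Hypothetical usage:
# handle = add_steering_vector(model.transformer.h[6], love_minus_hate_vector)
# ... generate text with `model` ...
# handle.remove()
```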

(There are some other results that come to mind which make less sense, e.g., the 800 orthogonal code steering vectors [LW · GW], although I think Nina Panickssery [LW(p) · GW(p)]’s explanation, if true, would be consistent with the heuristics hypothesis.)

Weaknesses in the Heuristics Hypothesis

Some versions of the hypothesis are unfalsifiable

This theory’s biggest weakness is that we can decompose basically anything into a bag of composing heuristics given an infinite bag size. In other words, the heuristics hypothesis is technically consistent with every single hypothesis of how future systems would behave.

I do feel like this theory “explains too much.” However, the key interpretability-related claim is that heuristics-based decompositions will be human-understandable, which is a more falsifiable claim.

The current features-focused research agendas might be the best way to uncover heuristics, and we don’t actually need to do anything different regardless of how true the heuristics hypothesis is.

The best way to locate heuristics might start with trying to find the most monosemantic/atomic features, understanding their functional implications [LW · GW], training transcoders, or following something like Lee Sharkey’s sparsify agenda [AF · GW]. In other words, the new framing doesn’t add much. It’s also possible that we wouldn’t be able to achieve a weak victory on a given task without understanding the whole task family, in which case the idea of a weak victory doesn’t really matter.

Perhaps this is true, but I think it’s worth thinking through this some more. I’m worried that the field has focused on features mostly due to path dependence from the original circuits thread that posited features as “the fundamental unit of neural networks” (although certainly not all researchers are focused on features, see e.g. Geiger’s causal abstractions agenda). Also, training SAEs that catalog all the features of a model is expensive and unnecessary for the weak victory condition I mentioned above. Trying to find cheaper interpretability techniques that are just meant to understand individual forward passes seems like a worthy thing to try.

Getting heuristics that are causally related to a specific output does not necessarily help monitor a model’s internal thoughts.

I’m not super sure that’s true. I think it’s reasonable to assume that two different sets of heuristics would be active in the case where the model is deceptive versus not deceptive, even conditional on the final token logits looking the same.

Inspirations and related work that I haven’t already mentioned

Potential next steps

(Yet another section that I wrote rather quickly in the interests of getting some more feedback. I’m also ~60% sure that my specific research interests will shift in the next six months)

I can see four major directions for further exploration of the heuristics hypothesis:

Deconfusion: What exactly is a heuristic, and what does a heuristics-based explanation look like?

Although this is a fairly fundamental question, I’m not super worried about needing to get this completely right before trying to look for heuristics in novel settings. I think we can make a lot of progress even with imperfect definitions. Still, applying the heuristics perspective to circuits we already understand (e.g., IOI) and trying to formalize what exactly heuristics are and aren’t seems useful.

Creating new interpretability methods that are centered around heuristics as the fundamental unit

This is speculative, but it might be worth spending some time to figure out if there are ways to directly study heuristics as their own unit. Distributed alignment search (DAS) is the closest idea that comes to mind, but (to my understanding, I could be wrong, sorry!) DAS is a supervised method that requires the researcher to have some causal model in mind before trying to find it in the neural network. Transcoders [LW · GW] represent another attempt, but those require cataloging all features in the training data.

The worry is that the field got locked into looking for features and feature circuits for mostly path-dependence reasons, and there could be some low-hanging fruit if we just thought harder about heuristics, especially given the recent evidence that they might play a big role.

Using existing interpretability tools to discover heuristics

This is a much more tractable option to better understand heuristics, especially given the similarities between heuristics and circuit building. 

We can try to catalog individual heuristics manually by coming up with natural language tasks where we believe the model would need to execute some heuristic at some point. Studying various individual heuristics could also help inspire specialized interpretability techniques to uncover them en masse. For example,

We could also leverage existing SAEs and treat features as the inputs/outputs of heuristics. In this case, I’m hoping to advance beyond the gradient-based attributions used in studies such as the Sparse Feature Circuits paper. We can perhaps use gradient attribution to narrow down the nodes and edges that we care about, but then focus on how, operationally, each edge is formed. Gradient attribution gives us only if-then relationships. Is that what’s locally happening with the model? 
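As a sketch of the gradient-attribution step I have in mind (a first-order, attribution-patching-style approximation; this function is illustrative rather than the actual Sparse Feature Circuits implementation):

```python
import torch

def feature_attributions(metric: torch.Tensor, feature_acts: torch.Tensor) -> torch.Tensor:
    """First-order attribution of each SAE feature to a scalar metric
    (e.g., a logit difference): attribution_j ~= (d metric / d f_j) * f_j.
    This linear approximation tells us which nodes/edges matter, but not how
    each edge is operationally computed -- which is the part I'd like to push on.
    feature_acts must have been computed with gradients enabled."""
    (grads,) = torch.autograd.grad(metric, feature_acts, retain_graph=True)
    return grads * feature_acts
```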

Applying the heuristics-framework to study theoretical questions in alignment.

If we decide that the heuristic model of computation is true/useful, I’d be most excited to use it to study more theoretical topics and perhaps to forecast where future capability gains could come from. For example, Alex Turner [AF(p) · GW(p)] said (two years ago):

I think that interpretability should adjudicate between competing theories of generalization and value formation in AIs (e.g. figure out whether and in what conditions a network learns a universal mesa objective, versus contextually activated objectives)

For example, we could study the dynamics of heuristic learning and composition in real-world models, especially heuristics related to turning base models into assistants. One guess is that RLHF is sample-efficient because it mostly changes how heuristics are composed with each other (and maybe boosts existing heuristics to be more active [LW · GW]), which might be a lot easier than learning new heuristics.[6] This would build on top of work done on toy models by Hidenori Tanaka’s group, and also maybe the quanta scaling hypothesis.

I’m currently trying to get into the AI safety field and will also be applying to MATS. Let me know if you’re interested in chatting more about any of these topics. Have a low bar for reaching out.

This post benefited from the feedback from Jack Zhang, Joe Campbell, Mat Allen, Tim Kostolansky, Veniamin Veselovsky, and woog. All errors are my own.

  1. ^

    This definition of generalization comes from Okawa et al. (2023)

  2. ^

    I really struggled to understand this paper :( Would be down to go through it with someone.

  3. ^

    [Citation needed]

  4. ^

    Funny story but I almost wrote down Rome. The real rank one model editing is the one they did to my brain.

  5. ^
  6. ^

    Counterpoint: maybe learning new heuristics is easy and frontier models just have a good ability to learn by the time they’re done with pretraining.

1 comment


comment by Ben Pace (Benito) · 2024-10-03T20:06:43.443Z · LW(p) · GW(p)

Sorry, not on topic, but your post title reminds me of the game Milk Inside a Bag of Milk Inside a Bag of Milk.