A bird's eye view of ARC's research

post by Jacob_Hilton · 2024-10-23T15:50:06.123Z · 1 comment

Contents

  A bird's eye view
    Zoom level 1
    Zoom level 2
    Zoom level 3
    Zoom level 4
  How ARC's research fits into this picture
  Further subproblems
  Conclusion
1 comment

This post includes a "flattened version" of an interactive diagram that cannot be displayed on this site. I recommend reading the original version of the post with the interactive diagram, which can be found here.

Over the last few months, ARC has released a number of pieces of research. While some of these can be independently motivated, there is also a more unified research vision behind them. The purpose of this post is to try to convey some of that vision and how our individual pieces of research fit into it.

Thanks to Ryan Greenblatt, Victor Lecomte, Eric Neyman, Jeff Wu and Mark Xu for helpful comments.

A bird's eye view

To begin, we will take a "bird's eye" view of ARC's research.[1] As we "zoom in", more nodes will become visible and we will explain the new nodes.

An interactive version of the diagrams below can be found here.

Zoom level 1

[Diagram: bird's eye view of ARC's research, zoom level 1 (birds_eye_lvl1.svg)]

At the most zoomed-out level, ARC is working on the problem of "intent alignment": how to design AI systems that are trying to do what their operators want. While many practitioners are taking an iterative approach to this problem, there are foreseeable ways in which today's leading approaches could fail to scale to more intelligent AI systems, which could have undesirable consequences. ARC is attempting to develop algorithms that have a better chance of scaling gracefully to future AI systems, hence the term "scalable alignment".

ARC's particular approach to scalable alignment is a "builder-breaker" methodology (described in more detail here, and exemplified in the ELK report). Roughly speaking, if the scalability of an algorithm depends on unknown empirical contingencies (such as how advanced AI systems generalize), then we try to make worst-case assumptions instead of attempting to extrapolate from today's systems. This is intended to create a feasible iteration loop for theoretical research. We are also conducting empirical research, but mostly to help generate and probe theoretical ideas rather than to test different empirical assumptions.

Zoom level 2

[Diagram: bird's eye view of ARC's research, zoom level 2 (birds_eye_lvl2.svg)]

Most of ARC's research attempts to solve one of two central subproblems in alignment: alignment robustness and eliciting latent knowledge (ELK).

Alignment robustness refers to AI systems remaining intent aligned even when faced with out-of-distribution inputs.[2] There are a few reasons to focus on failures of alignment robustness, as discussed here (where they are called "malign" failures). A quintessential example of an alignment robustness failure is deceptive alignment, also known as "scheming": the possibility that an AI system will internally reason about the objective that it is being trained on, and stop being intent aligned when it detects clues that it has been taken out of its training environment.

Eliciting latent knowledge (ELK) is defined in this report, and asks: how can we train an AI system to honestly report its internal beliefs, rather than what it predicts a human would think? If we could do this, then we could potentially avoid misalignment by checking whether the model's beliefs are consistent with its actions being helpful. ELK could help with scalable alignment via alignment robustness, but it could also help via outer alignment, by giving the reward function access to relevant information known by the model.

Zoom level 3

[Diagram: bird's eye view of ARC's research, zoom level 3 (birds_eye_lvl3.svg)]

ARC hopes to make progress on both alignment robustness and ELK using heuristic explanations for neural network behaviors. A heuristic explanation is similar to the kind of explanation found in mechanistic interpretability, except that ARC is attempting to find a mathematical notion of an "explanation", so that explanations can be found and used automatically. This is similar to how formal verification for ordinary programs can be performed automatically, except that we believe proof is too strict a standard to be feasible. These similarities are discussed in more detail in the post Formal verification, heuristic explanations and surprise accounting (especially the first couple of sections, up until "Surprise accounting").

A heuristic explanation for a rare but high-stakes kind of failure could help with alignment robustness, while a heuristic explanation for a specific behavior of interest could help with ELK. These two applications of heuristic explanations are fleshed out in more detail at the next zoom level.

Zoom level 4

[Diagram: bird's eye view of ARC's research, zoom level 4 (birds_eye_lvl4.svg)]

ARC has identified two broad ways in which heuristic explanations could help with alignment robustness and/or ELK.

Low probability estimation (LPE) is the task of estimating the probability of a rare kind of model output. The most obvious approach to LPE is to try to find model inputs that give rise to such an output, but this can be infeasible (e.g. if the model were to implement something like a cryptographic hash function). Instead, we can "relax" this goal and search for a heuristic explanation of why the model could hypothetically produce such an output (e.g. by treating the output of the cryptographic hash function as random). LPE would help with alignment robustness by allowing us to select models for which we cannot explain why they would ever behave catastrophically, even hypothetically. This motivation for LPE is discussed in much greater depth in the post Estimating Tail Risk in Neural Networks.
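To make the sampling difficulty concrete, here is a minimal, purely illustrative sketch of naive Monte Carlo LPE. The model, input distribution and catastrophe predicate below (`model`, `sample_input`, `is_catastrophic`) are hypothetical stand-ins, not anything from ARC's work; the point is only that detecting a failure of probability p by sampling takes on the order of 1/p samples, which is hopeless for the rare failures LPE targets.

```python
# Minimal sketch (hypothetical model and predicate): naive Monte Carlo
# estimation of the probability that a model produces a "catastrophic" output.
# If the true probability is p, we need on the order of 1/p samples to see
# even one such output.

import random


def sample_input() -> float:
    """Stand-in for drawing an input from the deployment distribution."""
    return random.gauss(0.0, 1.0)


def model(x: float) -> float:
    """Stand-in for the model's output on input x."""
    return x


def is_catastrophic(y: float) -> bool:
    """Stand-in for the rare output we care about (here, a 5-sigma event)."""
    return y > 5.0


def naive_lpe(num_samples: int) -> float:
    hits = sum(is_catastrophic(model(sample_input())) for _ in range(num_samples))
    return hits / num_samples


if __name__ == "__main__":
    # With p ~ 3e-7, a million samples will usually report an estimate of 0.0,
    # telling us nothing about whether the model can ever behave this way.
    print(naive_lpe(1_000_000))
```

The hope behind the heuristic-explanation approach is to instead explain how such an output could arise hypothetically, without ever having to exhibit an input that produces it.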

Mechanism distinction describes our broad hope for how heuristic explanations could help with ELK. A central challenge for ELK is "sensor tampering": detecting when the model reports what it predicts a human would think, but the human has been fooled in some way. Our hope is to detect this by noticing that the model's report has been produced by an "abnormal mechanism". There are a few potential ways in which heuristic explanations could be used to perform mechanism distinction, but the one we currently consider the most promising is mechanistic anomaly detection (MAD), as explained in the post Mechanistic anomaly detection and ELK (for a gentler introduction to MAD, see this post). A variant of MAD is safe distillation, which is an alternative way to perform mechanism distinction if we also have access to a formal specification of what we are trying to elicit latent knowledge of.
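For intuition only, here is a toy sketch of the general flavor of flagging "abnormal mechanisms": fit simple statistics of a model's internal activations on trusted inputs, then flag test-time computations that look unlike that reference. This is a deliberately simplified stand-in (a Mahalanobis-distance detector over hypothetical activations), not ARC's proposal, which is based on heuristic explanations rather than raw activation statistics.

```python
# Toy sketch of the *flavor* of mechanism distinction: flag outputs whose
# internal computation looks statistically unlike a trusted reference set.
# Everything here is a hypothetical stand-in for illustration.

import numpy as np


def fit_reference(trusted_activations: np.ndarray):
    """Fit mean and (regularized) inverse covariance of activations on trusted inputs."""
    mean = trusted_activations.mean(axis=0)
    cov = np.cov(trusted_activations, rowvar=False)
    cov += 1e-3 * np.eye(cov.shape[0])  # regularize so the covariance is invertible
    return mean, np.linalg.inv(cov)


def anomaly_score(activation: np.ndarray, mean: np.ndarray, cov_inv: np.ndarray) -> float:
    """Mahalanobis distance of one activation vector from the reference set."""
    diff = activation - mean
    return float(diff @ cov_inv @ diff)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    trusted = rng.normal(size=(1000, 8))      # activations on trusted inputs
    mean, cov_inv = fit_reference(trusted)

    normal_case = rng.normal(size=8)          # computed "the usual way"
    tampered_case = rng.normal(size=8) + 4.0  # produced by a different mechanism

    print(anomaly_score(normal_case, mean, cov_inv))
    print(anomaly_score(tampered_case, mean, cov_inv))  # much larger score
```

Roughly speaking, the heuristic-explanation version would instead ask whether the explanation that accounts for the model's behavior on trusted inputs also accounts for its behavior on the new input, rather than comparing activation statistics directly.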

A semi-formal account of how heuristic explanations could enable all of LPE, MAD and safe distillation is given in Towards a Law of Iterated Expectations for Heuristic Estimators. An explanation of how MAD could also be used to help with alignment robustness is given in Mechanistic anomaly detection and ELK (in the section "Deceptive alignment").

How ARC's research fits into this picture

We will now explain how some of ARC's research fits into the above diagram at the most zoomed-in level. For completeness, we will cover all of ARC's most significant pieces of published research to date, in chronological order. Each piece of work has been labeled with the most closely related node from the diagram, but often also covers nearby nodes and the relationships between them.

Eliciting latent knowledge: How to tell if your eyes deceive you defines ELK, explains its importance for scalable alignment, and covers a large number of possible approaches to ELK. Some of these approaches are somewhat related to heuristic explanations, but most are alternatives that we are no longer pursuing.
Formalizing the presumption of independence lays out the problem of devising a formal notion of heuristic explanations, and makes some early inroads into this problem. It also includes a brief discussion of the motivation for heuristic explanations and the application to alignment robustness and ELK.
Mechanistic anomaly detection and ELK and our other late 2022 blog posts (1, 2, 3) explain the approach to mechanism distinction that we currently find the most promising, mechanistic anomaly detection (MAD). They also cover how mechanism distinction could be used to address alignment robustness and ELK, how heuristic explanations could be used for mechanism distinction, and the feasibility of finding heuristic explanations.
Formal verification, heuristic explanations and surprise accounting discusses the high-level motivation for heuristic explanations by comparing and contrasting them to formal verification for neural networks (as explored in this paper) and mechanistic interpretability. It also introduces surprise accounting, a framework for quantifying the quality of a heuristic explanation, and presents a draft of empirical work on heuristic explanations.
Backdoors as an analogy for deceptive alignment and the associated paper Backdoor defense, learnability and obfuscation discuss a formal notion of backdoors in ML models and some theoretical results about it. This serves as an analogy for the subdiagram Heuristic explanations → Mechanism distinction → Alignment robustness. In this analogy, alignment robustness corresponds to a model being backdoor-free, mechanism distinction corresponds to the backdoor defense, and heuristic explanations correspond to so-called "mechanistic" defenses. The blog post covers this analogy in more depth.
Estimating Tail Risk in Neural Networks lays out the problem of low probability estimation, how it would help with alignment robustness, and possible approaches to LPE based on heuristic explanations. It also presents a draft describing an approach to heuristic explanations based on analytically learning variational autoencoders.
Towards a Law of Iterated Expectations for Heuristic Estimators and the associated paper discuss a possible coherence property for heuristic explanations as part of the search for a formal notion of heuristic explanations. They also provide a semi-formal account of how heuristic explanations could be applied to low probability estimation and mechanism distinction.
Low Probability Estimation in Language Models and the associated paper Estimating the Probabilities of Rare Outputs in Language Models describe an empirical study of LPE in the context of small transformer language models. The method inspired by heuristic explanations outperforms naive sampling in this setting, but does not outperform methods based on red-teaming (searching for inputs giving rise to the rare behavior), although there remain theoretical cases where red-teaming fails.
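As a toy illustration of why naive sampling is such a weak baseline in that last comparison, the sketch below estimates the probability of a rare next-token output of a hypothetical categorical "model" using a generic importance-sampling estimator. This is not the specific method studied in the paper, and all names and numbers are made up; it only shows how a proposal distribution that over-samples the rare output can recover a probability that naive sampling would estimate as zero.

```python
# Hypothetical toy illustration: importance sampling versus naive sampling for
# the probability of a rare next-token output. Generic estimator, not the
# paper's method.

import numpy as np

rng = np.random.default_rng(0)

# Toy "model": a fixed next-token distribution over a small vocabulary,
# with one extremely unlikely target token.
logits = np.array([10.0, 9.0, 8.0, 7.0, -5.0])
model_probs = np.exp(logits - logits.max())
model_probs /= model_probs.sum()
TARGET = 4  # index of the rare token


def naive_estimate(n: int) -> float:
    samples = rng.choice(len(model_probs), size=n, p=model_probs)
    return float(np.mean(samples == TARGET))


def importance_estimate(n: int) -> float:
    # Proposal: uniform over the vocabulary, so the rare token is sampled often;
    # reweight each sample by model_prob / proposal_prob to stay unbiased.
    proposal = np.full(len(model_probs), 1.0 / len(model_probs))
    samples = rng.choice(len(model_probs), size=n, p=proposal)
    weights = model_probs[samples] / proposal[samples]
    return float(np.mean((samples == TARGET) * weights))


if __name__ == "__main__":
    print("true probability:", model_probs[TARGET])          # ~2e-7
    print("naive (n=10_000):", naive_estimate(10_000))        # usually exactly 0.0
    print("importance (n=10_000):", importance_estimate(10_000))  # close to the truth
```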

Further subproblems

ARC's research can be subdivided further, and we have been putting significant effort into a number of subproblems not explicitly mentioned above. For instance, our work on heuristic explanations includes both work on formalizing heuristic explanations (devising a formal framework for heuristic explanations) and work on finding heuristic explanations (designing efficient search algorithms for them), each of which can be broken down into further subproblems.

Conclusion

We have painted a high-level picture of ARC's research, explained how our published research fits into it, and briefly discussed some additional subproblems that we are working on. We hope this provides people with a clearer sense of what we are up to.


  1. An arrow in the diagram expresses that solving one problem should help solve another, but it varies from case to case whether subproblems combine "conjunctively" (all subproblems need to be solved to solve the main problem) or "disjunctively" (a solution to any subproblem can be used to solve the main problem). ↩︎

  2. The term "alignment robustness" comes from this summary of this post, and is synonymous with "objective robustness" in the terminology of this post. A slightly more formal variant is "high-stakes alignment", as defined in this post. ↩︎

1 comment


comment by Dmitry Vaintrob (dmitry-vaintrob) · 2024-10-23T20:04:47.845Z

I think this is a really good and well-thought-out explanation of the agenda.

I do still think that it's missing a big piece: namely that in your diagram, the lowest-tier dot (heuristic explanations) is carrying a lot of weight, and needs more support and better messaging. Specifically, my understanding, having read this and interacted with ARC's agenda, is that "heuristic arguments" as a direction is highly useful. But to the best of my understanding, other than via vague associative reasoning, its placement at the root of this ambitious diagram is both core to the agenda and unsupported.

As an extreme example of this, Stephen Wolfram believes he has a collection of ideas building on some thinking about cellular automata that will describe all of physics. He can write down all kinds of causal diagrams with this node at the root, leading to great strides in our understanding of science and the cosmos and so on. But ultimately, such a diagram would be making the statement that "there exists a productive way to build a theory of everything which is based on cellular automata in a particular way, similar to how he thinks about this theory". Note that this is different from saying that cellular automata are interesting, or even that a better theory of cellular automata would be useful for physics, and it requires a lot more motivation and scientific falsification to support.

The idea of heuristic arguments is, at its core, a way of generalizing the notion of independence in statistical systems and models of statistical systems. It's discussing a way to point at a part of the system and say "we are treating this as noise" or "we are treating these two parts as statistically independent", or "we are treating these components of the system as independently as we can, given the following set of observations about our system" (with a lot of the theory of HA asking how to make the last of these statements explicit/computable). I think this is a productive class of questions to think about, both theoretically and empirically. It's related to a lot of other research in the field (on causality, independence and so on). I conceptually vibe with ARC's approach from what I've seen of the org. (Modulo the corrigible fact that I think there should be a lot more empirical work on what kinds of heuristic arguments work in practice. For example, what's the right independence assumption on components of an image classifier/generator NN that notices/generates the kind of textural randomness seen in a cat's fur? So far there is no HA guess about this question, and I think there should be at least some ideas on this level for the field to have a healthy amount of empiricism.)

I think that what ARC is doing is useful and productive. However, I don't see strong evidence that this particular kind of analysis is a principled thing to put at the root of a diagram of this shape. The statement that we should think about and understand independence is a priori not the same as the idea that we should have a more principled way of deciding when one interpretation of a neural net is more correct than another, which is also separate from (though plausibly related to) the (I think also good) idea in MAD/ELK that it might be useful to flag NN's that are behaving "unusually" without having a complete story of the unusual behavior.

I think there's an issue with building such a big structure on top of an undefended assumption, which is that it creates some immiscibility (i.e., difficulty of mixing) with other ideas in interpretability, which are "story-centric". The phenomena that happen in neural nets (same as phenomena in brains, same as phenomena in realistic physical systems) are probably special: they depend on some particular aspects of the world/of reasoning/of learning that have some sophisticated moving parts that aren't yet understood (some standard guesses are shallow and hierarchical dependence graphs, abundance of rough symmetries, separation of scale-specific behaviors, and so on). Our understanding will grow by capturing these ideas in terms of a language suitably natural and sophisticated for each phenomenon.

[added in edit] In particular (to point at a particular formalization of the general critique), I don't think that there currently exists a defensible link between Heuristic Arguments and proof verification as in Jason Gross's excellent paper. The specific weakening of the notion of proof verification is more general interpretability. Your post on surprise accounting is also excellent, but it doesn't explain how heuristic arguments would lead to understanding systems better -- rather, it shows that if we had ways of making better independence assumptions about systems with an existing interpretation, we would get a useful way of measuring surprise and explanatory robustness (with proof as the maximally robust limit). But I think that drawing the line from seeking explanations with some nice properties/measurements to the statement that a formal theory of such properties would lead to an immediate generalization of proof/interpretability which is strictly better than the existing "story-centric" methods is currently undefended (similar to the story in some early work on causality in interp, that a good attempt to formalize and validate causal interpretations would lead to better foundations for interp -- the techniques are currently used productively, e.g. here, but as an ingredient of an interpretation analysis rather than the core of the story). I think similar critiques hold for other sufficiently strong interpretations of the other arrows in this post. Note that while I would support a weaker meaning of arrows here (as you suggest in a footnote), there is nevertheless a core implicit assumption that the diagram exists as a part of a coherent agenda that deduces ambitious conclusions from a quite specific approach to interpretability. I could see any of the nodes here as being a part of a reasonable agenda that integrates with mechanistic interpretability more generally, but this is not the approach that ARC has followed.

I think that the issue with the approach sketched here is that it overindexes on a particular shape of explanation -- namely, that the relevant details inherent in principled interpretability work will most naturally factorize through a language that grows out of a better understanding of independence assumptions in statistical modeling. I don't see much evidence for this being the case, any more than I see evidence that the best theory of physics should grow out of a particular way of seeing cellular automata (and I'd in fact bet with some confidence that this is not true in either of these cases). At the same time, I think that ARC's ideas are good, and that trying to relate them to other work in interp is productive (I'm excited about the VAE draft in particular). I just would like to see a less ambitious, more collaboratively motivated version of this, one that works on improving and better validating the assumptions one could make as part of a mechanistic/statistical analysis of a model (with new interpretability/MAD ideas as a plausible side effect), rather than orienting towards a world where this particular direction is in some sense foundational for a "universal theory of interpretability".