Concrete empirical research projects in mechanistic anomaly detection

Post by Erik Jenner (ejenner), Viktor Rehnberg (viktor.rehnberg), Oliver Daniels-Koch (oliver-daniels-koch) · 2024-04-03

Contents

  Background: what is mechanistic anomaly detection, and why care?
  Concrete research directions
  Why work on empirical mechanistic anomaly detection?
    (Relatively) straightforward benchmarks
    Connections to mainstream machine learning research
    Neglectedness
    Synergies with interpretability and heuristic arguments
  Why not to work on empirical MAD

Thanks to Jordan Taylor, Mark Xu, Alex Mallen, and Lawrence Chan for feedback on a draft! This post was mostly written by Erik, but we're all currently collaborating on this research direction.

Mechanistic anomaly detection (MAD) aims to flag when an AI produces outputs for “unusual reasons.” It is similar to mechanistic interpretability but doesn’t demand human understanding. The Alignment Research Center (ARC) is trying to formalize “reasons” for an AI’s output using heuristic arguments, aiming for an indefinitely scalable solution to MAD.

As a complement to ARC’s theoretical approach, we are excited about empirical research on MAD. Rather than looking for a principled definition of “reasons,” this means creating incrementally harder MAD benchmarks and better MAD methods.

We have been thinking about and working on empirical MAD research for the past few months. We believe there are many tractable and useful experiments, only a fraction of which we can run ourselves. This post describes several directions we're excited about and high-level reasons to work on empirical MAD.

Background: what is mechanistic anomaly detection, and why care?

An earlier post provides a longer introduction to mechanistic anomaly detection; this section briefly recaps it.

In mechanistic anomaly detection, we want to flag when an AI produces an output “for unusual reasons” or “using anomalous mechanisms” relative to what happens on some reference set of inputs. Concretely, a mechanistic anomaly detection task consists of the following components:

To train an anomaly detector for this task, we have access to the model and a dataset of trusted inputs. We may also have access to untrusted inputs, but during training we don't know which of them are normal and which are anomalous. This is one of the key challenges of mechanistic anomaly detection: if we had anomaly labels for untrusted data, we would simply have a supervised learning problem.
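To make this setup concrete, here is a minimal sketch (our illustration, with hypothetical names, not notation from the post) of the resulting interface: a detector is fit on trusted data only and must then assign anomaly scores to untrusted inputs.

```python
import numpy as np

class TrustedOnlyDetector:
    """Illustrative baseline for the MAD setup: fit on trusted data only
    (no anomaly labels are available), then score untrusted inputs."""

    def fit(self, trusted_activations: np.ndarray) -> "TrustedOnlyDetector":
        # Model "normal" internal activations with a single Gaussian.
        self.mean = trusted_activations.mean(axis=0)
        self.cov_inv = np.linalg.pinv(np.cov(trusted_activations, rowvar=False))
        return self

    def score(self, activations: np.ndarray) -> np.ndarray:
        # Mahalanobis distance to the trusted distribution; higher = more anomalous.
        diff = activations - self.mean
        return np.einsum("ij,jk,ik->i", diff, self.cov_inv, diff)
```

At evaluation time, the scores on untrusted inputs are compared against held-out anomaly labels (e.g. via AUROC), but those labels are never available when fitting the detector.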

There's currently no generally accepted formal definition of what "anomalous mechanisms" are (and indeed, finding such a definition is arguably the biggest part of the problem). But there are several problems in AI safety where anomalous mechanisms intuitively seem to occur and where we'd like to be able to detect them. Two central examples:

Concrete research directions

In no particular order, here are projects we are excited about (and, in some cases, working on or planning to work on in the future):

If you’re working on or planning to work on things like this and would like to chat, feel free to email erik@ejenner.com.

Why work on empirical mechanistic anomaly detection?

In summary:

The following subsections go into more detail on these points.

(Relatively) straightforward benchmarks

Several sources of benchmarks for mechanistic anomaly detection methods either already exist or don't seem to face fundamental obstacles to implementation. Existing examples include backdoor detection (including structurally similar cases like sleeper agents or "quirky language models"), adversarial example detection, and proxies for measurement tampering detection like Redwood's benchmark. Having concrete benchmarks like this has historically been extremely useful for machine learning research.
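As one concrete illustration (ours, with hypothetical names), a backdoored model yields a MAD benchmark almost directly: clean inputs form the trusted set, a shuffled mix of clean and trigger-containing inputs forms the untrusted set, and the trigger labels are reserved for evaluation.

```python
import numpy as np

def backdoor_mad_task(clean_inputs: np.ndarray, triggered_inputs: np.ndarray, seed: int = 0):
    """Hypothetical sketch: build a MAD task from a backdoor dataset.

    Returns (trusted, untrusted, labels), where labels (1 = triggered/anomalous)
    are only meant for evaluating detectors, never for training them.
    """
    rng = np.random.default_rng(seed)
    n_clean = len(clean_inputs)
    trusted = clean_inputs[: n_clean // 2]           # clean inputs only
    held_out_clean = clean_inputs[n_clean // 2 :]    # clean part of the untrusted mix

    untrusted = np.concatenate([held_out_clean, triggered_inputs])
    labels = np.concatenate([np.zeros(len(held_out_clean)), np.ones(len(triggered_inputs))])
    perm = rng.permutation(len(untrusted))
    return trusted, untrusted[perm], labels[perm]
```

Structurally similar cases like sleeper agents or quirky language models fit the same template, with the backdoor trigger replaced by whatever input feature switches the model into its unusual behavior.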

That said, it might be difficult to make benchmarks that are good proxies for real problems, and there’s a significant risk of Goodharting imperfect benchmarks.

Connections to mainstream machine learning research

Given some amount of groundwork, it seems plausible that machine learning researchers might become interested in MAD for reasons unrelated to existential safety. MAD might also be a good fit for some researchers looking to do existential safety-related research while staying close to their area of expertise. Empirical MAD research could look a lot like most other machine learning research, e.g. making progress on concrete benchmarks. It also has close object-level connections to several existing research fields, such as backdoor detection, adversarial example detection, OOD/anomaly detection, and positive and unlabeled learning.

All of this is good for two types of reasons:

Two important caveats:

Neglectedness

As far as we know, very few people are explicitly working on empirical mechanistic anomaly detection (compared to, say, mechanistic interpretability). Concretely:

If we're missing anyone working on MAD, we'd be grateful to hear about it!

There's certainly adjacent work that could be useful for MAD research (such as Sleeper Agents), as well as work on specific instances like backdoor detection, but how well that work applies to other MAD problems still needs to be studied.

We think that on the current margin, there’s a lot of room for more researchers to work directly on MAD or adopt a MAD framing within related fields like interpretability.

Synergies with interpretability and heuristic arguments

MAD benchmarks may be good targets for mechanistic interpretability:

As a concrete example, evaluating sparse autoencoders from an interpretability perspective is difficult. In comparison, it seems straightforward to test how useful a given sparse autoencoder is for mechanistic anomaly detection (i.e. how much performance on a MAD benchmark improves if we apply simple anomaly detection methods in the sparse basis).
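Here is a minimal sketch of what such a test could look like (our illustration; the `encode` function mapping model activations to SAE feature activations is assumed, and all names are hypothetical): fit per-feature statistics of the sparse codes on trusted data, then score untrusted inputs by how far their codes deviate.

```python
import numpy as np

def sae_anomaly_scores(encode, trusted_acts: np.ndarray, untrusted_acts: np.ndarray,
                       eps: float = 1e-6) -> np.ndarray:
    """Hypothetical sketch: simple anomaly detection in an SAE's sparse basis.

    `encode` maps model activations of shape (n, d_model) to SAE feature
    activations of shape (n, d_sae). Higher scores = more anomalous.
    """
    trusted_feats = encode(trusted_acts)
    # Per-feature statistics of "normal" behavior in the sparse basis.
    mu = trusted_feats.mean(axis=0)
    var = trusted_feats.var(axis=0) + eps

    untrusted_feats = encode(untrusted_acts)
    # Total squared deviation from the trusted statistics, normalized
    # per feature (a diagonal Mahalanobis distance).
    return (((untrusted_feats - mu) ** 2) / var).sum(axis=1)
```

Comparing the AUROC of these scores against the same detector run directly on the raw activations gives a simple measure of whether the sparse basis actually helps on a given MAD benchmark.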

Of course, MAD benchmarks don't measure everything you'd want to know for interpretability purposes (for example, sparsity in SAEs is intrinsically important for interpretability but not necessarily for MAD). A big appeal of interpretability is that understanding models just seems obviously useful, and MAD benchmarks won't measure understanding at all. See Reflections on Qualitative Research for some great thoughts on why benchmarks might not be the right frame for making progress on understanding networks.

Naturally, we’re also hoping for connections between empirical MAD research and more theoretical approaches like heuristic arguments. Perhaps most importantly, a solid field of empirical MAD research would provide good baselines and test beds if and when theoretical approaches yield practical algorithms.

Conversely, both mechanistic interpretability and theoretical/conceptual MAD research can inspire MAD techniques that can be tested empirically. For example, methods inspired by causal scrubbing or causal abstractions seem like natural candidates for MAD.

Why not to work on empirical MAD
