Interpretability

post by abergal, Nick_Beckstead · 2021-10-29T07:28:02.650Z · LW · GW · 13 comments

Contents

  Aspirational Goal: Fully Understand a Neural Network
  Research Direction: Discovering Features and Circuits
  Research Direction: Scaling Circuits to Larger Models
  Research Direction: Resolving Polysemanticity
  Other Directions
  Resources
13 comments

Chris Olah wrote the following topic prompt for the Open Phil 2021 request for proposals on the alignment of AI systems. We (Asya Bergal and Nick Beckstead) are running the Open Phil RFP and are posting each section as a sequence on the Alignment Forum. Although Chris wrote this document, we didn’t want to commit him to being responsible for responding to comments on it by posting it.

Summary: We would like to see research building towards the ability to "reverse engineer" trained neural networks into human-understandable algorithms, enabling auditors to catch unanticipated safety problems in these models.

--

Potential safety failures of neural networks might be thought of as falling into two broad categories: known safety problems and unknown safety problems. A known safety problem is one which can be easily anticipated in advance of deploying a model or easily observed in the model's behavior. Such safety failures can be easily caught with testing, and it seems reasonable to hope that they will be fixed with human feedback. But it seems like there’s much less of a clear story for how we’ll resolve unknown safety problems -- the things we didn’t think to test for and wouldn’t obviously give feedback to fix.

Even if we’re able to anticipate certain safety problems, we may not know if we’re sufficiently disincentivizing them. A model might behave well on the training distribution, then unexpectedly exhibit some safety failure in a context it wasn’t trained in. In the extreme, a model might make a “treacherous turn”[1] -- it may use its understanding of the training setup to deliberately behave well only during training, then pursue different goals once it knows it’s outside of the training distribution.

In traditional software engineering, our ability to mitigate unanticipated safety problems largely flows from our ability to understand and carefully reason about code. While testing may catch various kinds of easy to observe or anticipated problems, we rely on code reviews, careful engineering, and even systematic verification to avoid other problems. These approaches are only possible because we can understand code for normal computer programs, something we don’t, by default, get with neural networks.

Neural network parameters can be seen as the assembly instructions of a complex computer program. If it were possible to reverse engineer the parameters of trained neural networks into human-understandable algorithms, it might enable us to catch safety problems the same way we are able to in code.

Recent research has shown that it is possible to reverse engineer modern neural networks into human-understandable computer programs, at least on a small scale. The Circuits Thread on Distill contains many examples of reading human-understandable algorithms off the weights of neural networks.

There’s also some evidence that this kind of analysis can reveal unanticipated problems and concerns. An analysis of CLIP found that the model had neurons related to race, gender, age, religions, LGBT status, mental health, physical disability, pregnancy, and parental status. We can also mechanistically observe concerning uses of these protected attributes, such as an Asian culture neuron increasing the probability of “I feel pressured” or an LGBT neuron increasing the probability of “I feel accepted.” Although there is significant attention to bias in machine learning, it tends to be focused on a couple categories such as race and gender. Surfacing that a model represents other protected attributes is a proof of concept that mechanistic interpretability can surface unanticipated concerns in state of the art models.

We would like to see more research aimed at mechanistically understanding neural networks, at seriously reverse engineering trained models into understandable programs. While we think it’s most likely this will take the form of work in a similar vein to Circuits, we’re also open to other ideas. Research projects should meet the following desiderata:

At this stage, we’re interested in work that generically makes progress towards mechanistically understanding neural networks, but it’s worth noting that there are specific questions which are of particular interest from a safety perspective. For example:

Work that addresses one of these would be of particular interest.

The following sections will describe some more specific research directions we think are promising.

Aspirational Goal: Fully Understand a Neural Network

A useful aspirational goal in this work is to fully reverse engineer any modern neural network, such as an ImageNet classifier or modern language model. There are a number of ways you could potentially operationalize “fully reverse engineer”:

(It’s worth noting that all these can also be applied to parts of a model in addition to the full thing, and in fact have all roughly been achieved for curve circuits.)

At the moment, no one fully understands any non-trivial neural network. Demonstrating that it’s possible to do so would be a natural milestone and might make society much more willing to invest in this kind of reverse engineering. This goal is aspirational, but a successful project should somehow advance us towards it.

We’re open to a range of possibilities as to how it could happen, as long as it’s a genuine milestone towards understanding powerful models. For example, this could be achieved by simply reverse engineering existing models, but a model engineered to be interpretable in some way also seems like fair game, provided it’s equally performant and easy to train. (At the moment, InceptionV1 is by far the closest thing we have to a fully reverse engineered model, about 30% reverse engineered as measured by neuron count. But reverse engineering another model would be equally useful as a milestone.)

Note that even if we are able to fully understand a modern neural network, we are far from our ultimate, even more ambitious goal: a general method for understanding neural networks of arbitrary size and sophistication.

Research Direction: Discovering Features and Circuits

In the Circuits approach, neural networks are composed of features and circuits. Characterizing features and circuits is the most direct way to move us towards fully understanding neural networks. Examples of doing this kind of work can be found in the Circuits thread (see especially Curve Detectors, Curve Circuits, and High-Low Frequency Detectors; see also Visualizing Weights for more detailed discussion of methods for studying circuits).

There are several reasons why it might be useful to study specific features and circuits, especially as a starter project:

Results on new features and circuits are especially interesting if:

Research Direction: Scaling Circuits to Larger Models

The main weakness of the circuits approach is that, by focusing on small scale structure, it may not be able to scale to understanding large scale models. Curve Circuits is the largest example of reverse engineering a circuit to date, at ~50K parameters. This is many orders of magnitude smaller than modern language models. How might we bridge the gap?
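For a sense of the size of that gap, the arithmetic can be written out. This is a rough sketch: the ~50K figure comes from the text above, while the language model size is an assumption (GPT-3's 175 billion parameters, used here only as a reference point), and the names are illustrative.

```python
import math

CURVE_CIRCUIT_PARAMS = 50_000        # "~50K parameters", from the text
ASSUMED_LM_PARAMS = 175_000_000_000  # assumption: a GPT-3-scale model (175B)


def orders_of_magnitude(small, large):
    """Base-10 logarithm of the size ratio between two parameter counts."""
    return math.log10(large / small)


gap = orders_of_magnitude(CURVE_CIRCUIT_PARAMS, ASSUMED_LM_PARAMS)
print(f"{gap:.1f} orders of magnitude")
```

Under that assumption the ratio is 3.5 million, or roughly six and a half orders of magnitude.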

One promising direction is to find additional structure which greatly simplifies mechanistically understanding neural networks, or at least simplifies understanding the safety relevant features. The circuits thread describes two such types of structure: “motifs” (recurring patterns in circuits, such as equivariance) and “structural phenomena” (large scale patterns in how neural networks are organized, such as branch specialization).

Equivariance can simplify circuits in early vision by as much as a factor of 50x, so there’s precedent for discovering structure that can give order of magnitude simplifications of neural networks. Similarly, one could imagine a world where something like the work on branch specialization or research on modularity can break neural networks into components, where only some of the components are necessary to audit for safety, massively reducing the work needed to ensure models are safe.
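A toy sketch may help show why equivariance compresses the auditing work. It assumes a hypothetical family of detectors whose weights are exact 90-degree rotations of a single template. Real families in the Circuits thread use finer rotations and learned weights that are only approximately symmetric, so this is an idealization, and the template values and function names are made up for illustration.

```python
def rotate90(kernel):
    """Rotate a square kernel (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*kernel[::-1])]


def rotation_family(template, copies=4):
    """Build the detectors obtained by repeatedly rotating one template."""
    family = [template]
    for _ in range(copies - 1):
        family.append(rotate90(family[-1]))
    return family


def count_params(kernels):
    """Total number of weights across a list of square kernels."""
    return sum(len(k) * len(k[0]) for k in kernels)


# Hypothetical 3x3 template for an edge-like detector (values are made up).
TEMPLATE = [
    [1, 1, 0],
    [1, 0, -1],
    [0, -1, -1],
]

family = rotation_family(TEMPLATE)
# Under exact equivariance, an auditor studies one template plus the rule
# ("rotate by 90 degrees") instead of every kernel independently.
simplification = count_params(family) / count_params([TEMPLATE])
print(simplification)
```

In this toy case four kernels reduce to one template, a 4x saving. Larger factors would need larger symmetry families than the one assumed here.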

Research projects in this category would ideally:

Concrete examples:

In addition to these approaches, we’re excited about other approaches to scaling interpretability to much larger models. These approaches should fit the general desiderata listed in the introduction, and have a clear story for why their success would enable significantly better scaling. They should also consider how feasible the approach is to study right now; for example, using large numbers of humans, or integrating with some alignment scheme that provides automation, might ultimately be crucial for scaling interpretability but challenging to study now.

Research Direction: Resolving Polysemanticity

Another major challenge to circuits is polysemanticity, where some neurons respond to multiple unrelated features. One theory is that polysemanticity occurs because models would ideally represent more features than they have neurons, and exploit the fact that high dimensional spaces can have many “almost orthogonal” directions to store these features in “superposition” across neurons. Polysemanticity seems related to questions about disentangling representations.
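The "almost orthogonal" intuition can be checked numerically. The following is a minimal sketch, not taken from the Circuits work: it samples random unit vectors as a stand-in for feature directions (an assumption, since learned features need not be random) and reports the largest absolute cosine similarity between any pair. The function names, dimensions, and vector count are illustrative choices.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a direction uniformly on the unit sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def max_abs_cosine(num_vectors, dim, seed=0):
    """Largest |cosine similarity| over all pairs of random unit vectors."""
    rng = random.Random(seed)
    vectors = [random_unit_vector(dim, rng) for _ in range(num_vectors)]
    worst = 0.0
    for i in range(num_vectors):
        for j in range(i + 1, num_vectors):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(dot))
    return worst


if __name__ == "__main__":
    # Hypothetical setting: 60 candidate "features" in spaces of growing size.
    for dim in (10, 100, 1000):
        print(dim, round(max_abs_cosine(60, dim), 3))
```

In this sketch the worst-case overlap should shrink as the dimension grows, which is the sense in which a high dimensional space has room for many more nearly independent directions than it has axes.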

Polysemanticity makes circuits much harder to study and audit, and increases the risk that analysis will miss important structures. For example, imagine one is trying to audit a model for potential bias issues (we see bias as a proxy for a much broader class of safety concerns, but one that can be studied in models that exist today). If one were to find a circuit where a “female” neuron excites a “nurse” neuron but inhibits a “doctor” neuron, auditors would likely conclude that it was an example of bias. But if a “wheel/female/tree” neuron excites a “car/nurse/waterfall” neuron, and there are a number of other neurons which seem to have overlapping features, it’s less clear what to make of it.
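The ambiguity in that example can be made concrete with a toy sketch. It assumes neurons carry hand-assigned labels with features separated by "/", and it simply enumerates every feature-pair story consistent with a single weight. The weights and the `readings` helper are hypothetical, not part of any existing tooling.

```python
from itertools import product


def readings(src_label, dst_label, weight):
    """List every feature-pair story consistent with one nonzero weight."""
    verb = "excites" if weight > 0 else "inhibits"
    src_features = src_label.split("/")
    dst_features = dst_label.split("/")
    return [f"{s} {verb} {d}" for s, d in product(src_features, dst_features)]


# Hypothetical weights; the labels follow the example in the text.
clean = readings("female", "nurse", 0.8)
messy = readings("wheel/female/tree", "car/nurse/waterfall", 0.8)

print(len(clean), clean)
print(len(messy))
for story in messy:
    print(" ", story)
```

The monosemantic pair yields one reading, while the polysemantic pair yields nine candidate readings, only one of which is the concerning "female excites nurse". The weight alone does not say which of them the model is using.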

Other Directions

This list is far from comprehensive. There are many other questions we’d be interested to see proposals related to. To give a few examples:

We expect there are many other promising ideas we haven’t listed or considered. We would be excited to receive proposals for other interpretability projects that could help us catch unanticipated safety problems or guarantee good behavior in unusual contexts, if they meet the desiderata in the introduction. Proposals related to scaling mechanistic interpretability to larger models are of particular interest.

Resources

Several of the Circuits articles provide colab notebooks reproducing the results in the article, which may be helpful references if one wants to do Circuits research on vision models. In particular, Visualizing Weights focuses on demonstrating and explaining some of the fundamental techniques. If one wants to study multimodal neurons, OpenAI has some code that may be helpful.


  1. Nick Bostrom describes this failure mode in Superintelligence, p. 117: “…one idea for how to ensure superintelligence safety… is that we validate the safety of a superintelligent AI empirically by observing its behavior while it is in a controlled, limited environment (a “sandbox”) and that we only let the AI out of the box if we see it behaving in a friendly, cooperative, responsible manner. The flaw in this idea is that behaving nicely while in the box is a convergent instrumental goal for friendly and unfriendly AIs alike. An unfriendly AI of sufficient intelligence realizes that its unfriendly final goals will be best realized if it behaves in a friendly manner initially, so that it will be let out of the box. It will only start behaving in a way that reveals its unfriendly nature when it no longer matters whether we find out; that is, when the AI is strong enough that human opposition is ineffectual.” Additional discussions of the possibility of such failure modes can be found in Hubinger et al.’s Risks from Learned Optimization in Advanced Machine Learning Systems (section 4, “Deceptive Alignment”) and Luke Muehlhauser’s post, “Treacherous turns in the wild”.

13 comments


comment by Evan R. Murphy · 2021-12-23T21:34:41.557Z · LW(p) · GW(p)

Several of the Circuits articles provide colab notebooks reproducing the results in the article, which may be helpful references if one wants to do Circuits research on vision models.

 

I'm starting to reproduce some results from the Circuits thread. It took me longer than expected just to find these colab notebooks so I wanted to share more specifically in case it saves anyone else some time.

The text "colab" isn't really turned up in a targeted Google search on Distill and the Circuits thread.  Also if you open a post like Visualizing Weights and do a Ctrl+F or Cmd+F search for "colab", you won't turn up any results either.

But if you open that same post and scroll down, you'll see button links to open the colab notebooks that look like this. These will also turn up in a Ctrl+F or Cmd+F search for "notebook" in case you want to jump around to different colab examples in the Circuits thread.

comment by paulfchristiano · 2021-11-05T22:40:28.437Z · LW(p) · GW(p)

I left some comments here [AF · GW]. My overall takeaway (as someone who hasn't worked in the area but cares about alignment) is that I'm very excited about this kind of interpretability research and especially work that focuses on small parts of models without worrying too much about scalability. It seems like interpretability could provide compelling warnings about future risks, is a big part of many of the existing concrete stories for how we might get aligned AI, and is reasonably likely to be helpful in unpredictable ways.

comment by ESRogs · 2021-10-29T23:25:28.770Z · LW(p) · GW(p)

The by-line seems a little confusing on this one. It was written by Chris Olah, and posted by abergal and Nick_Beckstead acting together?

Replies from: abergal
comment by abergal · 2021-11-10T01:58:35.045Z · LW(p) · GW(p)

Chris Olah wrote this topic prompt (with some feedback from me (Asya) and Nick Beckstead). We didn’t want to commit him to being responsible for this post or responding to comments on it, so we submitted this on his behalf. (I've changed the by-line to be more explicit about this.)

Replies from: ESRogs
comment by ESRogs · 2021-11-11T03:31:57.754Z · LW(p) · GW(p)

Ah, got it. Thanks!

comment by delton137 · 2021-11-07T14:59:14.199Z · LW(p) · GW(p)

"If one looks at the performance of particular tasks, such as arithmetic on numbers of a certain size, across model sizes, one often observes points where larger models discontinuously become better at a task."


Is it accurate to say that one "often observes" this? The only examples I know of are in GPT-3 with the addition, multiplication, and symbolic substitution tasks. I'm not sure how concerned to be about this being a general phenomenon. Does anyone have further examples? Does anyone have insights into whether the GPT-3 examples are special cases or not?

Replies from: gwern, arthur-conmy
comment by gwern · 2021-11-07T15:45:04.784Z · LW(p) · GW(p)

In addition to the original Brown et al 2020 examples, text style transfer, meta-learning instructability*, RL-finetuning of summarization, self-critique of math word problems, and maybe the improving zero-shot translation & program writing/dialogue (I'd have to double-check those), have been shown with GPT-3 and LaMDA to 'kick in' at certain sizes going from the O(1b) models to 10-1000b. Nobody seems very surprised these days to see something work on GPT-3-175b but then not on ~1b.

* Should we count all of the examples of meta-learning / generalization which require diverse environments to get abruptly better performance out of sample, like XLand or the MuZero meta-learning paper I mention [LW(p) · GW(p)] over in EfficientZero? That's definitely a stark jump in performance: the single-environment agents, no matter how good in the primary environment, typically perform extremely poorly or even near floor in the new environment.

Replies from: delton137
comment by delton137 · 2021-11-09T01:59:58.259Z · LW(p) · GW(p)

Thanks..  I was looking for more graphs with discontinuous jumps and "# of parameters" on the x-axis... but I think "totally new and unexpected capabilities after going from GPT-2 to GPT-3" is a reasonable thing to point at, also. The scaling laws bibliography is super, super useful. I am just embarking on making my way through it now..

Replies from: gwern
comment by gwern · 2021-11-09T02:48:59.133Z · LW(p) · GW(p)

You can dig those 'money shot' capability jump graphs out of the papers, usually, I think. I try to add them to annotations when I make them because that's a very critical stylized fact about DL's blessings of scale. I'm not going to look now, but Brown has the graphs, and I'm pretty sure the text style transfer & RL finetuning do have the money shot graphs, and probably the others. XLand and MuZero might have them if you squint (not necessarily in parameter # - parameters aren't the only thing that scales, remember!).

Replies from: delton137
comment by delton137 · 2021-11-09T03:51:52.924Z · LW(p) · GW(p)

Great..

Also I just realized that the "grokking" phenomenon is relevant here. The "grokking" paper shows jumps during training, but it's similar. From the lens of the lottery ticket hypothesis, it's not surprising that grokking may be easier / more likely in larger models.

I wonder how much "grokking" is new to transformers. I happened to stumble across an example in the literature where a CNN model "fails to grok" the Game of Life: https://arxiv.org/abs/2009.01398 .. I wonder what would happen if you used a transformer model instead..

Also, please check out my comment on your Scaling Laws bibliography page when you get a chance. 

Replies from: gwern
comment by gwern · 2021-11-09T14:10:51.151Z · LW(p) · GW(p)

I hesitate to call grokking an example of blessings of scale because it's still not clear what is going on there with grokking or patient teacher. They are, after all, tiny models, and patient teacher is all about distilling to small models. And the need for regularization is strange if it's a scaling thing where larger=better: what, the regularization by tininess isn't enough, it needs more regularization from weight decay?

I doubt grokking is unique to Transformers. The research I see as most related to grokking, the finding-shallow-minima paradigm with the wide basins & cyclic learning rates, is well-established for CNNs. Not finding it for some CNN is pretty weak evidence, given the grokking paper showing that you can go anywhere from like 0% to what was it 90%? depending on the details of the setup and how long you run.

comment by Arthur Conmy (arthur-conmy) · 2022-06-16T15:30:38.784Z · LW(p) · GW(p)

In the AlphaZero interpretability paper [1], CTRL+F "Ruy Lopez" for an example where the model's progress was much faster than human progress in quality.

 

[1] https://arxiv.org/pdf/2111.09259.pdf

Replies from: gwern
comment by gwern · 2022-06-16T16:17:15.816Z · LW(p) · GW(p)

That's within-training by epoch/iteration, not across trained models by total size/compute. It's not clear that they are at all the same sort of thing, because you can get spikes trivially by things like the learning rate dropping. Investigating whether there is any connection would be interesting.