The limited upside of interpretability

post by Peter S. Park · 2022-11-15T18:46:15.547Z · LW · GW · 11 comments

11 comments

Comments sorted by top scores.

comment by Erik Jenner (ejenner) · 2022-11-15T20:33:42.286Z · LW(p) · GW(p)

My model for why interpretability research might be useful, translated into how I understand this post's ontology, is mainly that it might let us make coarse-grained predictions using fine-grained insights into the model.

I think it's obviously true that we won't be able to make detailed predictions about what an AGI will do without running it (this is especially clear for a superintelligent AI: since it's smarter than us, we can't predict exactly what actions it will take). I'm not sure if you are claiming something stronger about what we won't be able to predict?

In any case, this does not rule out that there might be computationally cheap-to-extract facts about the AI that let us make important coarse-grained predictions (such as "Is it going to kill us all?"). For example, we might figure out that the AI is running some computations that look like they're checking whether the AI is still in a training sandbox. The output of those computations seems to influence a bunch of other stuff going on in the AI. If we intervene on this output, the AI behaves very differently (e.g. trying to scam people we're simulating for money). I think this is an unrealistically optimistic picture, but I don't see how it's ruled out specifically by the arguments in this post.

As an analogy: while we can't predict which moves AlphaZero is going to make without running it, we can still make very important coarse-grained predictions, such as "it's going to win", if we roughly know how AlphaZero works internally. You could imagine an analogous chess playing AI that's just one big neural net with learned search. If interpretability can tell us "this thing is basically running MCTS, its value function assigns very high value to board states where it's clearly winning, ...", we could make an educated guess that it's a good chess player without ever running it.

One thing that might be productive would be to apply your arguments to specific examples of how people might want to use interpretability (something like the deception case I outlined above). I currently don't know how to do that, so for now the argument doesn't seem that forceful to me (it sounds more like one of these impossibility results that sometimes don't matter in practice, like no free lunch theorems).

Replies from: Peter S. Park, remmelt-ellen

↑ comment by Peter S. Park · 2022-11-17T08:32:32.118Z · LW(p) · GW(p)

Thank you so much, Erik, for your detailed and honest feedback! I really appreciate it.

I agree with you that it is obviously true that we won't be able to make detailed predictions about what an AGI will do without running it. In other words, the most efficient source of information will be empiricism in the precise deployment environment. The AI safety plans that are likely to robustly help alignment research will be those that make empiricism less dangerous for AGI-scale models. Think BSL-4 labs for dangerous virology experiments, which would be analogous to airgapping, sandboxing, and other AI control methods.

I am not completely pessimistic about interpretability of coarse-grained information, although still somewhat pessimistic. Even in systems neuroscience, interpretability of coarse-grained information has seen some successes (in contrast to interpretability of fine-grained information, which has seen very little success).

I agree that if the interpretability researcher is extremely lucky, they can extract facts about the AI that let them make important coarse-grained predictions with only a small amount of time and computational resources.

But as you said, this is an unrealistically optimistic picture. More realistically, the interpretability researcher will not be magically lucky, which means we should expect the rate at which prediction-enhancing information is obtained to be inefficient.

And given that information channels are dual-use [LW · GW] (in that the AGI can also use them for sandbox escape), we should prioritize efficient information channels like empiricism, rather than inefficient ones like fine-grained interpretability. Inefficient information channels can be net-negative, because they may be more useful for the AGI's sandbox escape compared to their usefulness to alignment researchers.

Perhaps to demonstrate that this is a practical concern rather than just a theoretical concern, let me ask the following. In your model, why did the Human Brain Project crash and burn? Should we expect interpreting AGI-scale neural nets to succeed where interpreting biological brains failed?

Replies from: ejenner

↑ comment by Erik Jenner (ejenner) · 2022-11-17T21:04:07.955Z · LW(p) · GW(p)

I agree with you that it is obviously true that we won't be able to make detailed predictions about what an AGI will do without running it. In other words, the most efficient source of information will be empiricism in the precise deployment environment. The AI safety plans that are likely to robustly help alignment research will be those that make empiricism less dangerous for AGI-scale models. Think BSL-4 labs for dangerous virology experiments, which would be analogous to airgapping, sandboxing, and other AI control methods.

I only agree with the first sentence here, and I don't think the rest of the paragraph follows from it. I agree being able to safely experiment on AGIs would be useful, but it's not a replacement for what interpretability is trying to do. Deception is a good example here: how do you empirically tell whether a model is deceptive without giving it a chance to actually execute a treacherous turn? You'd have to fool the model, and there are big obstacles to that. Maybe relaxed adversarial training could help, but that's also more of a research direction than a concrete method for now---I think for any specific alignment approach, it's easy to find challenges. If there is a specific problem that people are currently planning to solve with interpretability, and that you think could be better solved using some other method based on safely experimenting with the model, I'd be interested to hear that example, that seems more fruitful than abstract arguments. (Alternatively, you'd have to argue that interpretability is just entirely doomed and we should stop pursuing it even lacking better alternatives for now---I don't think your arguments are strong enough for that.)

But as you said, this is an unrealistically optimistic picture.

I want to clarify that any story for solving deception (or similarly big obstacles) that's as detailed as what I described seems unrealistically optimistic to me. Out of all stories this concrete that I can tell, the interpretability one actually looks like one of the more plausible ones to me.

In your model, why did the Human Brain Project crash and burn? Should we expect interpreting AGI-scale neural nets to succeed where interpreting biological brains failed?

This is actually something I'd be interested to read more about (e.g. I think a post looking at what lessons we can learn for interpretability from neuroscience and attempts to understand the brain could be great). I don't know much about this myself, but some off-the-cuff thoughts:

I think mechanistic interpretability might turn out to be intractably hard in the near future, and I agree that understanding the brain being hard is some evidence for that
OTOH, there are some advantages for NN interpretability that feel pretty big to me: we can read off arbitrary weights and activations extremely cheaply at any time, we can get gradients of lots of different things, we can design networks/training procedures to make interpretability somewhat easier, we can watch how the network changes during its entire training, we can do stuff like train networks on toy tasks to create easier versions to study, and probably more I'm forgetting right now.

Your post briefly mentions these advantages but then dismisses them because they do "not seem to address the core issue of computational irreducibility"---as I said in my first comment, I don't think computational irreducibility rules out the things people realistically want to get out of interpretability methods, which is why for now I'm not convinced we can draw extremely strong conclusions from neuroscience about the difficulty of interpretability.

ETA: so to answer your actual question about what I think happened with the HBP: in part they didn't have those advantages (and without those, I do think mechanistic interpretability would be insanely difficult). Based on the Guardian post you linked, it also seems they may have been more ambitious than interpretability researchers? (i.e. actually making very fine-grained predictions)

↑ comment by Remmelt (remmelt-ellen) · 2022-12-20T08:26:09.477Z · LW(p) · GW(p)

In any case, this does not rule out that there might be computationally cheap-to-extract facts about the AI that let us make important coarse-grained predictions (such as "Is it going to kill us all?"... trying to scam people we're simulating for money). I think this is an unrealistically optimistic picture, but I don't see how it's ruled out specifically by the arguments in this post.

This conclusion has the appearance of being reasonable, while skipping over crucial reasoning steps. I'm going to be honest here.

The fact that mechanistic interpretability can possibly be used to detect a few straightforwardly detectable misalignments of the kinds you are able to imagine right now does not mean that the method can be extended to detecting/simulating most or all human-lethal dynamics manifested in/by AGI over the long term.

If AGI behaviour converges on outcomes that result in our deaths through less direct routes, it really does not matter much whether the AI researcher humans did an okay job at detecting "intentional direct lethality" and "explicitly rendered deception".

One thing that might be productive would be to apply your arguments to specific examples of how people might want to use interpretability (something like the deception case I outlined above). I currently don't know how to do that, so for now the argument doesn't seem that forceful to me (it sounds more like one of these impossibility results that sometimes don't matter in practice, like no free lunch theorems).

There is an equivocation here. The conclusion presumes that applying Peter's arguments to interpretability of misalignment cases that people like you currently have in mind is a sound and complete test of whether Peter's arguments matter in practice – for understanding the detection possibility limits of interpretability over all human-lethal misalignments that would be manifested in/by self-learning/modifying AGI over the long term.

Worse, this test is biased toward best-case misalignment detection scenarios.

In particular, it presumes that misalignments can be read out from just the hardware internals of the AGI, rather than requiring the simulation of the larger "complex system of an AGI’s agent-environment interaction dynamics" (quoting the TL;DR).

That larger complex system is beyond the memory capacity of the AGI's hardware, and uncomputable.
Uncomputable by:

the practical compute limits of the hardware (internal input-to-output computations are a tiny subset of all physical signal interactions with AGI components that propagate across the outside world and/or feed back over time).
the sheer unpredictability of non-linearly amplifying feedback cycles (ie. chaotic dynamics) of locally distributed microscopic changes (under constant signal noise interference, at various levels of scale) across the global environment.
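
As a rough illustration of the second point, sensitivity to tiny perturbations shows up even in a one-line system. The sketch below uses the logistic map, a standard textbook toy model that is not part of the original argument and says nothing about AGI specifically. The function name `logistic_trajectory`, the starting value, and the perturbation size are assumptions chosen for illustration.

```python
def logistic_trajectory(x0, r=4.0, steps=50):
    """Iterate the logistic map x -> r * x * (1 - x), starting from x0.

    With r = 4.0 the map is in its chaotic regime (an assumption of this
    sketch, chosen because it is the standard textbook example).
    """
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Two runs whose starting points differ by one part in ten billion.
a = logistic_trajectory(0.2)
b = logistic_trajectory(0.2 + 1e-10)

# Gap between the two runs at every step.
gaps = [abs(x - y) for x, y in zip(a, b)]

for step in (0, 10, 20, 30, 40, 50):
    print(f"step {step:2d}: gap = {gaps[step]:.3e}")
```

The gap starts out negligible and grows until the two runs are effectively unrelated, so any measurement error in the initial state eventually swamps the prediction. The toy says nothing about whether coarse-grained properties of such a system remain predictable, which is the question the thread is actually disputing.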

My understanding is that chaotic dynamics often give rise to emergent order. For biological systems, the copied/reproduced components inside can get naturally selected for causing dynamics that move between chaotic and orderly effects (chaotic enough to be adaptively creative across varying environmental contexts encountered over the system's operational lifecycle; orderly enough that effects are reproduced when similar contexts reappear). But I'm not a biology researcher – would be curious about Peter's thoughts!

See further comments here [LW · GW].

comment by beren · 2022-11-18T13:51:40.421Z · LW(p) · GW(p)

Thanks for writing this! It’s always good to get critical feedback about a potential alignment direction to make sure we aren’t doing anything obviously stupid. I agree with you that fine-grained prediction of what an AGI is going to do in any situation is likely computationally irreducible even with ideal interpretability tools.

I think there are three main arguments for interpretability which might well be cruxes.

1. As Erik says, interpretability tools potentially let us make coarse-grained predictions about the model utilizing fine-grained information. While predicting everything the model will do is probably not feasible in advance, it might be very possible to get pretty detailed predictions of coarse-grained information such as ‘is this model deceptive’, ‘does it have potentially misaligned mesaoptimizers’, ‘does its value function look reasonably aligned with what we want given the model’s ontology’, and ‘is it undergoing / trying to undergo FOOM’. The model’s architecture might also be highly modular so that we could potentially understand/bound a lot of the behaviour of the model that is alignment-relevant while only understanding a small part. This seems especially likely to me if the AGI’s architecture is hand-designed by humans – i.e. there is a ‘world model’ part and a ‘planner’ part and a ‘value function’ and so forth. We can then potentially get a lot of mileage out of just interpreting the planner and value function while the exact details of how the model represents, say, chairs in the world model, are less important for alignment.
2. What we ultimately likely want is a statistical-mechanics-like theory of how neural nets learn representations, which includes what circuits/specific computations they tend to do, how they evolve during training, what behaviours these give rise to, and how they behave off distribution etc. Having such a theory would be super important for alignment (although would not solve it directly). Interpretability work provides key bits of evidence that can be generalized to build this theory.
3. Interpretability tools could let us perform highly targeted interventions on the system without needing to understand the full system. This could potentially involve directly editing out mesaoptimizers or deceptive behaviours, and adjusting goal misgeneralization by tweaking the internal ontology of the model. There are many cases in science where we can produce reliable and useful interventions in systems with only partial information and understanding of their most fine-grained workings. Nevertheless, we need some understanding, and interpretability appears to offer a reliable path to it.

Should we expect interpreting AGI-scale neural nets to succeed where interpreting biological brains failed?

There are quite a lot of reasons why we should expect interpretability to be much easier than neuroscience

We know exactly the underlying computational graph of our models – this would be akin to in neuroscience starting out knowing exactly how neurons, synapses etc work as well as knowing the full connectome and the large scale architecture of the brain.
We know the exact learning algorithm our models use – in neuroscience this would be starting out knowing, say, the cortical update rule as well as the brain’s training objective/loss function.
We know the exact training data our models are trained on
We can experiment on copies of the same model as opposed to different animals with different brains / training data / life histories etc
We can instantly read all activations, weights, essentially any quantity of interest simultaneously and with perfect accuracy – simply being able to read neuron firing rates is very difficult in neuroscience and we have basically no ability to read large numbers of synaptic weights
We can perform arbitrary interventions at arbitrarily high fidelity on our NNs
These points mean experimental results are orders of magnitude faster and easier to get. A typical interpretability experiment looks like: load model into memory, perform precise intervention on model, look at a huge number of possible outputs, iterate. Neuroscience experiments often look like train mice to do some task for months, insert probes or do some broad based intervention where you are not sure exactly what you are measuring or what your intervention actually affected, get a bunch of noisy data from a small sample with potential systematic errors/artifacts from your measurement process where you can only read a tiny fraction of what you would like to read, try to understand what is going on. It is much harder and slower!
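
The load / intervene / observe / iterate loop described above can be sketched on a toy network. This is a minimal illustration under stated assumptions, not a real interpretability tool: a tiny random two-layer network stands in for a loaded model, the intervention is a simple zero-ablation of one hidden unit, and `make_toy_model`, `forward`, and `ablation_effects` are hypothetical names introduced here.

```python
import math
import random


def make_toy_model(n_in=3, n_hidden=4, seed=0):
    """Build a tiny random two-layer network (a stand-in for a loaded model)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    return w1, w2


def forward(model, x, ablate=None):
    """Run the model; optionally zero out one hidden unit (the intervention).

    Returns the scalar output and the full list of hidden activations,
    which can be read exactly and at no extra cost.
    """
    w1, w2 = model
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    if ablate is not None:
        hidden[ablate] = 0.0
    output = sum(w * h for w, h in zip(w2, hidden))
    return output, hidden


def ablation_effects(model, inputs):
    """Mean absolute change in output when each hidden unit is ablated."""
    n_hidden = len(model[1])
    effects = []
    for unit in range(n_hidden):
        total = 0.0
        for x in inputs:
            base, _ = forward(model, x)
            ablated, _ = forward(model, x, ablate=unit)
            total += abs(base - ablated)
        effects.append(total / len(inputs))
    return effects


# "Load" the model, pick a batch of inputs, and iterate over interventions.
model = make_toy_model()
rng = random.Random(1)
inputs = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(100)]
effects = ablation_effects(model, inputs)

for unit, effect in enumerate(effects):
    print(f"hidden unit {unit}: mean |change in output| = {effect:.4f}")
```

Each intervention here is exact, repeatable on an identical copy of the model, and finishes in milliseconds, which is the contrast with the months-long mouse experiment. Real experiments differ mainly in scale and in how the intervention is chosen.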

Secondly, the Blue Brain Project is just one example of a high-profile failure in neuroscience and we shouldn’t generalize too much based on it. I have had no experience with the Blue Brain Project, but it seems plausible to me that this could just be a standard case of garden-variety mismanagement and overoptimistic goals and hype. From my perspective as someone who has worked in neuroscience, the field just keeps chugging along accumulating knowledge and is advancing at a reasonable pace – i.e. our understanding of the brain has improved vastly since 20 years ago. Not at the pace probably required for alignment on short timelines but at a decent level for a scientific field.

Given this, it is not clear to me that interpretability is doomed to failure and it seems a reasonably high EV bet to me.

In other words, the most efficient source of information will be empiricism in the precise deployment environment. The AI safety plans that are likely to robustly help alignment research will be those that make empiricism less dangerous for AGI-scale models. Think BSL-4 labs for dangerous virology experiments, which would be analogous to airgapping, sandboxing, and other AI control methods.

If by ‘empiricism’ as a channel you mean ‘just run the model and see what it does’, then while some amount of empiricism is necessary, it is also a highly dangerous channel to observe and so we should be trying to get as many bits as possible without requiring empiricism. Pre-deployment interpretability is one way to potentially get a lot of bits of information (although I agree not all the bits) without having to expose yourself to this danger. Hopefully interpretability, even if using only fairly coarse grained information can already eliminate a lot of ‘obviously dangerous AGIs’ before you have to run them.

Some empiricism will likely be necessary though and I agree that designing highly secure test-harnesses for the AGI akin to BSL-4 labs will be another necessary component to safety. However, high-powered interpretability will be a vital safety measure to be applied inside such a safe test harness. Ideally, while the AGI is running in some simulation, with interpretability tools we can be constantly scanning it for deception, for realization that it is in a simulation, for any kind of other malicious intent, and then abort or intervene to fix it if this is the case. I think just relying on airgapping and other physical measures is a bad idea vs a superintelligence and in practice we should be trying to build up as many uncorrelated lines of defense as possible in the test-harness.

And given that information channels are dual-use [LW · GW] (in that the AGI can also use them for sandbox escape), we should prioritize efficient information channels like empiricism, rather than inefficient ones like fine-grained interpretability. Inefficient information channels can be net-negative, because they may be more useful for the AGI's sandbox escape compared to their usefulness to alignment researchers.

This is an interesting objection. While it is possible that the AGI could hack interpretability channels, it seems much more difficult than just communicating in the normal way by running in the world – i.e. the ‘empirical channel’. This is because it probably doesn’t know (and we should be careful to make sure that it definitely does not know) what interpretability tools we are using on it, and that interpretability tools can be applied before we actually run the AGI (i.e. directly to weights) or to individual components in isolation.

Replies from: Peter S. Park

↑ comment by Peter S. Park · 2022-11-19T08:47:46.575Z · LW(p) · GW(p)

Thank you so much for your insightful and detailed response, Beren! I really appreciate your time.

The cruxes seem very important to investigate.

This seems especially likely to me if the AGI’s architecture is hand-designed by humans – i.e. there is a ‘world model’ part and a ‘planner’ part and a ‘value function’ and so forth.

It probably helps to have the AGI's architecture hand-designed to be more human-interpretable. My model is that on the spectrum of high-complexity paradigms (e.g., deep learning) to low-complexity paradigms (e.g., software design by a human software engineer), having the AGI's architecture be hand-designed moves away from the former and towards the latter, which helps reduce computational irreducibility and thereby increase out-of-distribution predictability (e.g., on questions like "Is the model deceptive?").

However, my guess is that in order for out-of-distribution predictability of the system to be nontrivial, one would need to go substantially towards the low-complexity end of the spectrum. This would make it unlikely for the model to achieve AGI-like capabilities.

What we ultimately likely want is a statistical-mechanics-like theory of how neural nets learn representations, which includes what circuits/specific computations they tend to do, how they evolve during training, what behaviours these give rise to, and how they behave off distribution etc.

It would be great if we can get a statistical-mechanics-like theory. Realistically, such a theory would probably be a combination of "small-to-middle-number systems because local interactions are quite relevant in affecting system behavior. In this case, there would be a general lack of the homogeneous mixing necessary for a purely statistical mechanics view to be applicable."

Interpretability tools could let us perform highly targeted interventions on the system without needing to understand the full system.

I can't rule it out (gene editing is the optimistic outside view here), but I expect such interventions to be imperfect and vulnerable to unforeseeable side effects. These interventions would likely require immense trial-and-error to achieve nontrivial out-of-distribution predictability.

From my perspective as someone who has worked in neuroscience, the field just keeps chugging along accumulating knowledge and is advancing at a reasonable pace – i.e. our understanding of the brain has improved vastly since 20 years ago. Not at the pace probably required for alignment on short timelines but at a decent level for a scientific field.

I agree with the characterization "not at the pace probably required for alignment on short timelines but at a decent level for a scientific field." Indeed, the main point I wanted to make was my skepticism that interpretability research on AGI-scale models will progress fast enough to solve alignment in a short amount of time.

I think just relying on airgapping and other physical measures is a bad idea vs a superintelligence and in practice we should be trying to build up as many uncorrelated lines of defense as possible in the test-harness.

I completely agree, as long as the lines of defense are net-positive rather than net-negative [LW · GW]. An information channel should only be opened if it helps the alignment researchers (to align the AGI) more than it helps the likely by-default misaligned AGI (to prematurely escape its sandbox).

This is an interesting objection. While it is possible that the AGI could hack interpretability channels, it seems much more difficult than just communicating in the normal way by running in the world – i.e. the ‘empirical channel’. This is because it probably doesn’t know (and we should be careful to make sure that it definitely does not know) what interpretability tools we are using on it, and that interpretability tools can be applied before we actually run the AGI (i.e. directly to weights) or to individual components in isolation.

My guess is that the reward-risk tradeoff of empirical channels—empiricism in the deployment environment or a precise replica of the deployment environment—is more likely to be favorable than that of interpretability channels. This is because the reward is high (empiricism efficiently yields predictive information) while ways to mitigate the risk are human-understandable (as opposed to "complexity is the enemy of security").

If the AGI doesn't have prior knowledge of our interpretability plans and how to exploit them (e.g., by deceiving us into a false sense of security), then I would be much more optimistic about the reward-risk tradeoff of interpretability channels! This is why we should keep our interpretability plans inaccessible to the AGI [LW · GW] as much as possible. Since there is plausibly a significant amount of secrecy value in interpretability plans, we should:

1. preemptively keep our interpretability plans off of the Internet,
2. filter these plans from the AGI's training set, and
3. ensure that the AGI cannot inspect its own internal data.

I have been campaigning for these measures [LW · GW], but many AI safety researchers have generally opposed adopting Measure #1. I hope this will change going forward!

comment by Beth Barnes (beth-barnes) · 2022-11-26T20:44:15.776Z · LW(p) · GW(p)

It seems to me that this argument proves much too much. If I understand correctly, you're saying that various systems including advanced ML-based AI are 'computationally irreducible', by which you mean there's no simplified model of the system that makes useful predictions. I don't think humans are computationally irreducible in this way. For example, I think you can make many very useful predictions about humans by modeling them as (for example) having some set of goals, habits, tendencies, and knowledge. In particular, knowing what the human's intentions or goals are is very useful for the sort of predictions that we need to make in order to check if our AI is aligned. Of course, it's difficult to identify what a human's intentions are just by having access to their brain, but as I understand it that's not the argument you're making.

Replies from: Peter S. Park

↑ comment by Peter S. Park · 2022-11-28T22:41:25.385Z · LW(p) · GW(p)

Thank you so much, Beth, for your extremely insightful comment! I really appreciate your time.

I completely agree with everything you said. I agree that "you can make many very useful predictions about humans by modeling them as (for example) having some set of goals, habits, tendencies, and knowledge," and that these insights will be very useful for alignment research.

I also agree that "it's difficult to identify what a human's intentions are just by having access to their brain." This was actually the main point I wanted to get across; I guess it wasn't clearly communicated. Sorry about the confusion!

My assertion was that in order to predict the interaction dynamics of a computationally irreducible agent with a complex deployment environment, there are two realistic options:

1. Run the agent in an exact copy of the environment and see what happens.
2. If the deployment environment is unknown, use the available empirical data to develop a simplified model of the system based on parsimonious first principles that are likely to be valid even in the unknown deployment environment. The predictions yielded by such models have a chance of generalizing out-of-distribution, although they will necessarily be limited in scope.

When researchers try to predict intent from internal data, their assumptions/first principles (based on the limited empirical data they have) will probably not be guaranteed to be "valid even in the unknown deployment environment." Hence, there is little robust reason to believe that the predictions based on these model assumptions will be generalizable out-of-distribution.

Replies from: ricraz

↑ comment by Richard_Ngo (ricraz) · 2022-12-02T22:52:51.107Z · LW(p) · GW(p)

At some points in your comment you use the criterion "likely to be valid", at other points you use the criterion "guaranteed to be valid". These are very different! I think almost everyone agrees that we're unlikely to get predictions which are guaranteed to be valid out-of-distribution. But that's true of every science apart from fundamental physics: they all apply coarse-grained models, whose predictive power out-of-distribution varies very widely. There are indeed some domains in which it's very weak (like ecology), but also some domains in which it's pretty strong (like chemistry). There are some reasons to think interpretability will be more like the former (networks are very complicated!) and some reasons to think it'll be more like the latter (experiments with networks are very reproducible). I don't think this is the type of thing which can be predicted very well in advance, because it's very hard to know what types of fundamental breakthroughs may arise.

More generally, the notion of "computational irreducibility" doesn't seem very useful to me, because it takes a continuous property (some systems are easier or harder to make predictions about) and turns it into a binary property (is it computationally reducible or not), which I think obscures more than it clarifies.

comment by Esben Kran (esben-kran) · 2022-11-26T18:31:35.309Z · LW(p) · GW(p)

Thank you for this critique! Critiques are always helpful for homing in on the truth.

So as far as I understand your text, you argue that fine-grained interpretability loses out against "empiricism" (running the model) because of computational intractability.

I generally disagree with this. beren [LW(p) · GW(p)] points out many of the same critiques of this piece as I would come forth with. Additionally, the arguments seem underspecified, as if there is not enough in-depth argumentation to support the points you make. Strong upvote for writing them out, though!

You emphasize the Human Brain Project (HBP) quite a lot, even in the comments, as an example of a failed large-scale attempt to model a complex system. I think this characterization is correct but it does not seem to generalize beyond the project itself. It seems to be as much a project management and strategy problem as anything else. Beren's comment [LW(p) · GW(p)] is great for more reasoning on this and on why ANNs seem significantly more tractable to study than the brain.

Additionally, you argue that interpretability and ELK won't succeed simply because of the intractability of fine-grained interpretability. I have two points against this view:

1. Mechanistic interpretability has clearly already garnered quite a lot of interesting and novel insights into neural networks and causal understanding since the field's inception 7 years ago.

It seems premature to disregard the plausibility of the agenda itself, just as it is premature to disregard the project of neuroscience based on HBP. Now, arguing that it's a matter of speed seems completely fine but this is another argument and isn't emphasized in the text.

2. Mechanistic interpretability does not seem to be working on fine-grained interpretability (?).

Maybe it's just my misunderstanding of what you mean by fine-grained interpretability, but we don't need to figure out what neurons do, we literally design them. So the inspections happen at feature level, which is much more high-level than investigating individual neurons (sometimes these features seem represented in singular neurons of course). The circuits paradigm also generally looks at neural networks like systems neuroscience does, interpreting causal pathways in the models (of course with radical methodological differences because of the architectures and computation medium). The mechanistic interpretability project does not seem misguided by an idealization of neuron-level analysis and will probably adopt any new strategy that seems promising.

For examples of work in this paradigm that seem promising, see interpretability in the wild, ROME, the superpositions exposition, the mathematical understanding of transformers and the results from the interpretability hackathon [LW · GW].

For an introduction to features as the basic building blocks as compared to neurons, see Olah et al.'s work (2020).

When it comes to your characterization of the "empirical" method, this seems fine but doesn't conflict with interpretability. It seems you wish to make game theory-like understanding of the models or have them play in settings to investigate their faults? Do you want to do model distillation using circuits analyses or do you want AI to play within larger environments?

I struggle to see what specific agenda follows from this that isn't already being pursued by many other projects, e.g. AI psychology [LW · GW] and building test environments for AI [LW · GW]. I do see potential in expanding the work here, but I see that for interpretability as well.

Again, thank you for the post and I always like when people cite McElreath, though I don't see his arguments apply as well to interpretability since we don't model neural networks with linear regression at all. Not even scaling laws use such simplistic modeling, e.g. see Ethan's work.

Replies from: Peter S. Park

↑ comment by Peter S. Park · 2022-11-28T22:27:19.183Z · LW(p) · GW(p)

Thank you so much for your detailed and insightful response, Esben! It is extremely informative and helpful.

So as far as I understand your text, you argue that fine-grained interpretability loses out against "empiricism" (running the model) because of computational intractability.
I generally disagree with this. beren [LW(p) · GW(p)] points out many of the same critiques of this piece as I would come forth with. Additionally, the arguments seem too undefined, like there is not in-depth argumentation enough to support the points you make. Strong upvote for writing them out, though!

The main benefit of interpretability, if it can succeed, is that one can predict harmful future behavior (that would have occurred when deployed out-of-distribution) by probing internal data. This allows the researchers to preemptively prevent the harmful behavior: for example, by retraining after detecting deceptive intent. If this is scientifically possible, it would be a substantial benefit, especially since it is generally difficult to obtain out-of-distribution predictions from atheoretical empiricism.

However, I am skeptical that interpretability can achieve nontrivial success in out-of-distribution predictions, especially in the amount of time alignment researchers will realistically have. The reason is that deceptive intent is likely a fine-grained trait at the internal-data level (rather than at the behavioral level). Consequently, computational irreducibility is likely to impose a hard bound on predicting deceptive intent out-of-distribution, at least when assuming realistic amounts of time and resources.

My guess is that detecting deceptive intent solely from a neural net's internal data is probably at least as fine-grained as behavioral genetics or neuroscience. These fields have made some progress, but preemptively predicting behavioral traits from internal data remains mostly unsolved.

For example, consider a question analogous to that of deceptive misalignment: 'Is the given genome optimized for inclusive fitness, or is it optimized for a proxy goal that deviates from inclusive fitness in certain historically unprecedented environments?' We know that evolutionary pressures select for maximizing inclusive fitness. However, the genome is optimized not for inclusive fitness, but for a proxy goal (survive and engage in sexual intercourse) that deviates from inclusive fitness in environments that are sufficiently distinct from ancestral environments.

How did scientists find out that the genome is optimized for a proxy goal? Almost entirely from behavior. We have a coarse-grained behavioral model that is quite good and generalizable. Evolution shaped animals' behavior towards a drive for sexual intercourse, but historically unprecedented environmental changes (e.g., widespread availability of birth control) have made this proxy goal distinct from inclusive fitness. Parsimonious models based on first principles that are likely to be correct, like the above one, have a realistic chance of achieving situation-specific predictability that generalizes out-of-distribution.

In contrast, there is still very little understanding of which genes interact to cause animals' sex drive. Which genes affect sex drive? Probably a substantial proportion of them, and they probably interact in interconnected and nonlinear ways (including with the extremely complex, multidimensionally varying environment) to produce behavioral traits in an unpredictable manner. Moreover, a lot of the information needed to predict behavioral traits like sex drive will lie in the specific environment and how it interacts with the genome. Only the most coarse-grained of these interaction dynamics will be predictable via bypassing empiricism with a statistical-mechanics-like model, due to computational irreducibility. And such a coarse-grained model will likely be rooted in behavior-based abstractions.

Deep-learning neural nets do come with an advantage lacked by behavioral genetics and neuroscience: a potentially complete knowledge of the internal data, the environmental data, and the data of their interaction throughout the whole training process.

But there is a missing piece: complete knowledge of the deployment environment. Any internals-based model of deceptive intent that alignment researchers can come up with is only guaranteed to hold in the subset of environments that the researchers have empirically tested. In the subset of environments that the polycausal model has not been tested in, there is no a priori reason that the model will generalize correctly. A barrier to generalizability is posed by the nonlinear and interconnected interactions between the neural net's internals and the unprecedented environment, which can and likely will manifest differently depending on the environment [LW · GW]. Relaxed adversarial training can help test a wider variety of environments, but this is still hampered by the blind spot of being unable to test the subset of environments that cannot be instantiated at human-level capabilities (e.g., the environment in which RSA encryption is broken). Thus, my guess is that the intrinsic out-of-distribution predictability of the AGI neural net's behavior would be low, just like that of behavioral genetics or neuroscience.

For a conceptual example, consider the fact that the dynamics of cellular automata can change drastically with just one cell’s change in the initial conditions. See Figure 1 of Beckage et al. (Code 1599 in Wolfram's A New Kind of Science).

In general, the only way to accurately ascertain how a computationally irreducible agent will behave in a complex environment is to run it in that environment. Even with complete knowledge of the agent's internals, incomplete knowledge of the environment is sufficient to constrain a priori predictability. I expect that many predictions yielded by interpretability tools in the pre-deployment environment will fail to generalize to the post-deployment environment, unless the two are equal.
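As a rough illustration of the cellular-automaton point, here is a minimal Python sketch of a three-color, nearest-neighbor totalistic automaton. It is not a reproduction of the Beckage et al. figure. Two things are assumptions of the sketch: that the code number 1599 is decoded by reading its base-3 digit k as the new color for neighborhood sum k, and the particular perturbation chosen (changing the single seed cell from color 1 to color 2). The helper names (`totalistic_table`, `step`, `run`, `hamming`), the grid width, and the periodic boundary are all hypothetical choices made for illustration.

```python
def totalistic_table(code, colors=3, neighbors=3):
    """Decode a totalistic rule: digit k of `code` (base `colors`)
    is the new color for neighborhood sum k. (Assumed convention.)"""
    n_sums = (colors - 1) * neighbors + 1
    table = []
    for _ in range(n_sums):
        table.append(code % colors)
        code //= colors
    return table


def step(row, table):
    """Advance one time step on a ring (periodic boundary)."""
    n = len(row)
    return [table[row[(i - 1) % n] + row[i] + row[(i + 1) % n]]
            for i in range(n)]


def run(row, table, steps):
    """Return the list of rows from time 0 through time `steps`."""
    history = [list(row)]
    for _ in range(steps):
        history.append(step(history[-1], table))
    return history


def hamming(a, b):
    """Number of cells in which two rows differ."""
    return sum(x != y for x, y in zip(a, b))


TABLE = totalistic_table(1599)
WIDTH = 41            # wide enough that 10 steps never wrap around
CENTER = WIDTH // 2

base = [0] * WIDTH
base[CENTER] = 1      # single seed cell of color 1

perturbed = list(base)
perturbed[CENTER] = 2  # change exactly one cell of the initial condition

hist_a = run(base, TABLE, 10)
hist_b = run(perturbed, TABLE, 10)

# How many cells differ between the two runs at each time step.
distances = [hamming(a, b) for a, b in zip(hist_a, hist_b)]
print(distances)
```

Under these assumptions, the perturbed run goes blank after one step while the unperturbed run keeps growing, so the two histories differ in more and more cells even though they started one cell apart. The only way the sketch finds this out is by running both histories forward.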

It seems premature to disregard the plausibility of the agenda itself, just as it is premature to disregard the project of neuroscience based on HBP. Now, arguing that it's a matter of speed seems completely fine but this is another argument and isn't emphasized in the text.

Sorry for the miscommunication! I meant to say that the rate at which mechanistic interpretability will yield useful, generalizable information is slow, not zero.

But this is sufficient for concern because informational channels are dual-use [LW · GW]; the AGI can use them for sandbox escape. We should only open an interpretability channel if the rate of scientific benefit exceeds the rate of cost (risk of premature sandbox escape by a misaligned AGI).

My opinion is that while mechanistic interpretability has made some progress, the rate of this progress is not fast enough to solve alignment with limited time and computational resources. So far, the rate of progress in interpretability research has been substantially outpaced by that in AI capabilities research. I think this was predictable, due to what we know about computational irreducibility.

Maybe it's just my misunderstanding of what you mean by fine-grained interpretability, but we don't need to figure out what neurons do, we literally design them. So the inspections happen at feature level, which is much more high-level than investigating individual neurons (sometimes these features seem represented in singular neurons of course). The circuits paradigm also generally looks at neural networks like systems neuroscience does, interpreting causal pathways in the models (of course with radical methodological differences because of the architectures and computation medium). The mechanistic interpretability project does not seem misguided by an idealization of neuron-level analysis and will probably adopt any new strategy that seems promising.

Roughly speaking, there is a spectrum between high-complexity paradigms of design (e.g., deep learning) and low-complexity, modular paradigms of design (e.g., software design by a human software engineer). My guess is that for many complex tasks, the optimal equilibrium strategy can be achieved only by the former, and attempting to meaningfully move towards the latter end of the spectrum will result in sacrificing performance. For example, I expect that we won't be able to build AGI via modular software design by a human software engineer, but that we will be able to build it by deep learning.

Again, thank you for the post and I always like when people cite McElreath, though I don't see his arguments apply as well to interpretability since we don't model neural networks with linear regression at all. Not even scaling laws use such simplistic modeling, e.g. see Ethan's work.

In Ethan's scaling law, extrapolatory generalization is only guaranteed to be valid locally ("perfectly extrapolate until the next break"), and not globally. This is completely consistent with my prior. My assertion was that in order to globally extrapolate empirical findings to an unknown deployment environment, only simple models have a nontrivial chance of working (assuming realistic amounts of time and computational resources). These simple models will likely be based on parsimonious first principles that we have strong reason to believe are valid even in the unknown environment. And consequently, they will likely be largely based on behavioral data rather than the internal data of the agent-environment interaction dynamics.
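To make the local-versus-global distinction concrete, here is a toy Python sketch. It does not use the functional form from Ethan's work. Instead it assumes a hypothetical piecewise power law with one sharp break, and every number in it (the break location, the two slopes, the sample points) is invented for illustration. A power law is fitted in log-log space using only points before the break, then evaluated once before the break and once far beyond it.

```python
import math


def toy_loss(x, break_at=100.0, slope_before=-0.5, slope_after=-0.1):
    """Hypothetical piecewise power law with a single sharp break."""
    if x <= break_at:
        return x ** slope_before
    return (break_at ** slope_before) * (x / break_at) ** slope_after


def fit_power_law(xs, ys):
    """Ordinary least squares on (log x, log y); returns (slope, intercept)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    return slope, my - slope * mx


def predict(x, slope, intercept):
    """Evaluate the fitted power law at x."""
    return math.exp(intercept + slope * math.log(x))


# Observations taken only below the (unknown to the fitter) break at 100.
train_x = [2.0 ** k for k in range(1, 6)]   # 2, 4, 8, 16, 32
train_y = [toy_loss(x) for x in train_x]
slope, intercept = fit_power_law(train_x, train_y)


def rel_error(x):
    """Relative error of the fitted law against the toy ground truth."""
    truth = toy_loss(x)
    return abs(predict(x, slope, intercept) - truth) / truth


local_error = rel_error(64.0)    # still before the break
global_error = rel_error(1e6)    # far past the break
print(slope, local_error, global_error)
```

In this toy setting the fit is essentially exact up to the break and badly wrong well past it, which is the sense in which an extrapolation can be valid locally without being valid globally.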
