Mechanistic Interpretability as Reverse Engineering (follow-up to "cars and elephants")

post by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2022-11-03T23:19:20.458Z · LW · GW · 3 comments

I think (perhaps) the distinction I was trying to make in my previous post, "Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability, is basically the distinction between engineering and reverse engineering.

Reverse engineering is analogous to mechanistic interpretability; engineering is analogous to "well-founded AI" (to borrow Stuart Russell's term).

So it seems worth exploring the pros and cons of these two approaches to understanding x-safety-relevant properties of advanced AI systems.  

As a gross simplification,[1] we could view the situation this way:

Under this view, these two approaches are working towards the same end from different starting points.  

A few more thoughts:

  1. ^

     I know people will say that we don't actually understand how "well-founded AI" approaches work any better.  I don't feel equipped to evaluate that claim beyond extremely simple cases, and I don't expect most readers are either.

  2. ^

    At least if your goal is to get something like an AGI system whose safety we have justified confidence in.  This is perhaps too ambitious a goal.

3 comments

Comments sorted by top scores.

comment by jacob_cannell · 2022-11-04T02:37:23.549Z · LW(p) · GW(p)

In reality you often use both?

For example, many DL ideas have a neuroscience inspiration, which is essentially reverse engineering the brain. Then you combine that with various other knowledge to engineer some new system. But then you want to debug it, and interpretability tools are essentially debugging tools, so debugging is a form of targeted reverse engineering (figuring out how a system actually works in practice in order to improve it).

comment by Raemon · 2022-11-03T23:27:41.934Z · LW(p) · GW(p)

An angle I think is relevant here is that a sufficiently complex, "well-founded" AI system is still going to be fairly difficult to understand. For example, a large codebase where everything is properly commented and labeled might still have lots of unforeseen bugs and interactions the engineers didn't intend.

So I think before you deploy a powerful "Well Founded" AI system, you'll probably still need a kind of generalized reverse-engineering/interpretability skill to explain how the entire process works in various test cases.

Replies from: capybaralet
comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2022-11-03T23:51:06.489Z · LW(p) · GW(p)

I don't really buy this argument.  

  • I think the following is a vague and slippery concept: "a kind of generalized reverse-engineering/interpretability skill".  But I agree that you would want to do testing, etc. of any system before you deploy it.
  • It seems like the ambitious goal of mechanistic interpretability, which would get you the kind of safety properties we are after, would indeed require explaining how the entire process works. But when we are talking about such a complex system, it seems the main obstacle to understanding for either approach is our ability to comprehend such an explanation.  I don't see a reason to say that we can surmount that obstacle more easily via reverse engineering than via engineering.  It often seems to me that people are assuming either that mechanistic interpretability addresses this obstacle (I'm skeptical) or that, effectively, the obstacle doesn't actually exist (in which case, why can't we just do it via engineering?).