Analogies between Software Reverse Engineering and Mechanistic Interpretability

post by Neel Nanda (neel-nanda-1), Itay Yona (itay-yona) · 2022-12-26T12:26:57.880Z · LW · GW · 6 comments

This is a link post for https://www.neelnanda.io/mechanistic-interpretability/reverse-engineering

Contents

  My Takeaways
  Call Notes
6 comments

These are notes taken during a call with Itay Yona, an expert in software/hardware reverse engineering (SRE). Itay gave me an excellent distillation of key ideas and mindsets in the field, and we discussed analogies/disanalogies to mechanistic interpretability of neural networks. I’m generally very excited to learn about other fields of study that reverse engineer complex systems (SRE, neuroscience, systems biology, etc.), and about what relevant insights they may have. All mistakes are mine, and all insights are his!

My Takeaways

  1. Identify a behaviour we care about/want to find a circuit for
  2. Train a toy model to do this, one that is as simple as possible while still being a good simulation of the real model
  3. Study how the behaviour is implemented
  4. Distill out the core patterns and fingerprints of this behaviour
  5. Scan the real network for these, and use this as (non-rigorous) evidence for what’s going on (see the sketch below).
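To make steps 4-5 concrete, here is a minimal sketch of what scanning a real network for a distilled fingerprint could look like, using the induction-head pattern as the example behaviour. The scoring heuristic, threshold, and function names are illustrative inventions, not an established tool or metric:

```python
import torch

def induction_score(attn: torch.Tensor, tokens: torch.Tensor) -> float:
    """Score how strongly one head matches the induction fingerprint:
    on a repeated token, attend to the position just after the previous
    occurrence of that token.

    attn:   [seq, seq] attention pattern for one head (rows = query positions).
    tokens: [seq] token ids.
    """
    total, count = 0.0, 0
    for q in range(tokens.shape[0]):
        # Earlier occurrences of the token at the current query position.
        prev = (tokens[:q] == tokens[q]).nonzero()
        if len(prev) and prev[-1].item() + 1 < q:
            # The induction target is the token *after* the last occurrence.
            total += attn[q, prev[-1].item() + 1].item()
            count += 1
    return total / count if count else 0.0

def scan_for_induction(attn_all: torch.Tensor, tokens: torch.Tensor,
                       thresh: float = 0.5):
    """Flag candidate (layer, head) pairs, given cached attention patterns
    of shape [layer, head, seq, seq] on a sequence with repeated tokens."""
    layers, heads = attn_all.shape[:2]
    return [(l, h) for l in range(layers) for h in range(heads)
            if induction_score(attn_all[l, h], tokens) > thresh]
```

Per step 5, a high score is only suggestive: a head can match the fingerprint while doing something quite different, so this is evidence to follow up on, not a proof.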

For more discussion of the similarities, see Chris Olah’s essay Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases.

Call Notes

6 comments


comment by LawrenceC (LawChan) · 2022-12-26T15:57:13.499Z · LW(p) · GW(p)

The distinction "newbies get caught up trying to understand every detail; experts think in higher-level abstractions, make educated guesses, and only zoom in on the details that matter" felt super interesting and surprising to me.

I claim that this is 1) an instance of a common pattern that 2) is currently missing a step (the pre-newbie stage).

The general pattern is the following (terminology borrowed from Terry Tao):

  1. The pre-rigorous stage: Really new people don't know how ~anything works in a field, and so use high-level abstractions that aren't necessarily grounded in reality. 
  2. The rigorous stage: Intermediate people learn the concrete grounding behind a field, but get bogged down in minutia.
  3. The post-rigorous stage: Experts form correct high level abstractions informed by their understanding of the grounding, but still use the grounding when the high level abstractions break down. 

I think that many experts mainly notice the 2->3 transition, but not the 1->2 one, and so often lead newbies astray by encouraging them not to work through the rigorous stage. I claim this is really, really bad, and that a solid understanding of the rigorous stage is a good idea for ~everyone doing technical work. 

Here are a few examples:

  • Terry Tao talks about this in math: Early students start out not knowing what a proof is and have to manipulate high-level, handwavy concepts. More advanced students learn the rigorous foundations behind various fields of math but are encouraged to focus on the formalism as opposed to 'focusing too much on what such objects actually “mean”'. Eventually, mathematicians are able to revisit their early intuitions and focus on the big picture, converting their intuitions to rigorous arguments when needed. 
  • The exact same thing is true in almost any discipline with mathematical proofs, e.g. physics or theoretical CS.
  • This happens in a very similar way with programming as well.
  • In many strategy games (e.g. Chess), you see crazy high-level strategizing at the very low and very high skill levels, while the middle levels are focused on improving technique.
  • I'd also claim that something similar happens in psychology: freshman undergrads come up with grand theories of human cognition that are pretty detached from reality, many intermediate researchers get bogged down in experimental results, while good psychology researchers form high-level representations and theories based on their knowledge of experiments (and are able to trivially translate intuitions into empirical claims). 
comment by Itay Yona (itay-yona) · 2022-12-27T22:19:45.257Z · LW(p) · GW(p)

I strongly agree! When you study RE, it is critical to understand lots of details about how the machine works, and most people I knew were already familiar with those. What they lacked was the skill of using their low-level understanding to actually conduct useful research effectively.

It is natural to pay much less attention to the 1->2 phase, since there are many more intermediate researchers than complete newbies or experts. It is interesting because, when talking with an expert, intermediate researchers might think they are talking to person 1 when they are in fact talking to person 3.

 

Thanks, you gave me something to think about :)

comment by Neel Nanda (neel-nanda-1) · 2022-12-26T16:12:21.223Z · LW(p) · GW(p)

I like the analogy! I hadn't explicitly made the connection, but strongly agree (both that this is an important general phenomenon, and that it specifically applies here). Though I'm pretty unsure how much I and other MI researchers are in stage 1 vs stage 3 when we try to reason about systems!

To be clear, I definitely do not want to suggest that people shouldn't try to rigorously reverse engineer systems a bunch and be super detail-oriented. I've linked to your comment in the post.

comment by Florian Magin (fmagin) · 2022-12-27T12:46:05.192Z · LW(p) · GW(p)

Context: I have been working in IT Security with reverse engineering as my main interest for nearly 7 years, been interested in it for ~8.5 years, and have been working on automated reverse engineering at a research institute for 3 years. I have been vaguely interested in mechanistic interpretability for the last few months and am strongly considering switching to it as a field of research, depending on how much of my RE skillset/experience transfers.

 

I think one topic that the article/viewpoint/Itay Yona is missing is how much of modern RE is only feasible because we have a few decades' worth of tooling taking care of the aspects that are already possible to automate. The main industry tools (IDA/Ghidra/Binary Ninja) all have a roughly similar processing pipeline consisting of various steps. This is typically just called "auto analysis", and it can easily take a few hours for larger binaries, sometimes even more. I think the lessons and approaches from automated RE are much more important than the ones from manual RE, though to some extent this is personal bias, because I like automated RE a lot. But in any case, manual RE would not be feasible without some underlying automated RE that takes care of as much work as possible, to allow the human to focus on what is relevant, in a form that is as close to their level of abstraction as possible, while also being transparent, i.e. making it easy to switch to the underlying levels of abstraction again, or even show them side by side.

Much of the work is generating various graph representations of the program, e.g. the Control Flow Graph, Data Flow Graphs, analyses based on Abstract Interpretation over some domain (e.g. types), and Abstract Syntax Trees that are closer to C code than to the actual assembly in their level of abstraction, which makes it far easier for a human to quickly understand them. Those are then used as the basis for further analysis to recover more high-level constructs, and to aggregate and propagate information from various parts of the program.
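To make the first of those representations concrete, here is a minimal sketch of basic-block recovery and CFG construction over an invented toy instruction set (the opcodes, addresses, and tuple encoding are made up for illustration; real auto analysis does this over decoded machine code):

```python
from collections import defaultdict

# (address, opcode, optional branch target); address == list index here.
toy_program = [
    (0, "cmp", None),
    (1, "jz", 4),    # conditional jump to address 4
    (2, "mov", None),
    (3, "jmp", 5),   # unconditional jump to address 5
    (4, "xor", None),
    (5, "ret", None),
]

def build_cfg(program):
    # Leaders: the entry point, branch targets, and fall-throughs of
    # conditional branches. Basic blocks start at leaders.
    leaders = {program[0][0]}
    for addr, op, target in program:
        if op in ("jz", "jmp") and target is not None:
            leaders.add(target)
        if op == "jz":           # conditional: fall-through is also a leader
            leaders.add(addr + 1)
    blocks, edges = {}, defaultdict(set)
    starts = sorted(leaders)
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(program)
        blocks[start] = program[start:end]
        _, last_op, last_target = blocks[start][-1]
        if last_op in ("jz", "jmp") and last_target is not None:
            edges[start].add(last_target)      # taken-branch edge
        if last_op not in ("jmp", "ret"):
            edges[start].add(end)              # fall-through edge
    return blocks, edges

blocks, edges = build_cfg(toy_program)
print(dict(edges))  # e.g. {0: {2, 4}, 2: {5}, 4: {5}}
```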

I have the impression that some intermediate step/layer in this process might be at a similar level of abstraction to "NNs as computational graphs", and that the steps that follow (in automated RE) might be fairly applicable to mechanistic interpretability, while the steps before this layer are far less directly useful for NNs.

The next steps are then roughly to find patterns inside this graph that can be replaced with larger conceptual nodes, e.g. a complicated-looking stretch of instructions might simply be an inlined and highly optimized strcpy, i.e. code that copies a string from one memory location to another. The nice thing about software RE is that compilers will typically not inline functions (to save space), but a NN will most likely not give us that luxury. "Induction heads" might be the first step in exactly that direction, but I have to admit that I haven't had the time/energy to really look into this, so I can't confidently say so, and I might just be interpreting them in the mental framework I am already used to.
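And a toy sketch of that "pattern -> conceptual node" step (again with invented opcodes): scan a linear IR for the load/store/increment/branch shape of an inlined byte-copy loop and collapse it into a single higher-level node:

```python
# The distilled fingerprint of an inlined byte-copy loop, as an opcode shape.
COPY_LOOP = ["load", "store", "inc", "inc", "jnz"]

def collapse_copy_loops(ir):
    """Replace every occurrence of the copy-loop shape with one conceptual node."""
    out, i = [], 0
    while i < len(ir):
        window = [op for op, *_ in ir[i:i + len(COPY_LOOP)]]
        if window == COPY_LOOP:
            out.append(("strcpy_like", None))  # one higher-level node
            i += len(COPY_LOOP)
        else:
            out.append(ir[i])
            i += 1
    return out

ir = [("xor", None), ("load", None), ("store", None),
      ("inc", None), ("inc", None), ("jnz", -4), ("ret", None)]
print(collapse_copy_loops(ir))
# [('xor', None), ('strcpy_like', None), ('ret', None)]
```

Real tools match on graphs rather than linear windows, of course, but the idea is the same: once the fingerprint is distilled, recognizing it can be automated.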

 

Overall, focusing on tools and automated RE is already a niche inside the somewhat niche field of RE, but I think it's crucial for mechanistic interpretability, because interpretability has to be scalable. A human looking at small toy models is only interesting because we hope that larger models exhibit the same patterns/structure, which we can then automatically recover in order to find higher-level patterns that only emerge in a larger model and can only be spotted as patterns in the intermediate abstraction.

comment by Itay Yona (itay-yona) · 2022-12-27T22:33:03.180Z · LW(p) · GW(p)

Thanks, that's a good insight. The graph representation of code is very different from automated decompiling like Hex-Rays, in my opinion. I agree that the graph representation is probably the most critical step towards a more high-level analysis and understanding. I am not sure why you claim it required decades of tooling, since Turing machines have been described with graphs since the dawn of computer science.

In any case, this is an interesting point, as it suggests we might want to focus on finding graph-like concepts that will be useful for describing the different states of a neural network computation, and later on developing an IDA-like tool :)

Since we share similar backgrounds and aspirations, feel free to reach out:

https://www.linkedin.com/in/itay-yona-b40a7756/

comment by Florian Magin (fmagin) · 2022-12-28T10:13:39.545Z · LW(p) · GW(p)

"The graph representation of code is very different from automated decompiling like Hex-Rays, in my opinion."

There are many different graph representations of code; some of them are crucial for automated decompiling, others probably aren't, so I'm not sure which one you are referring to here. And in the end, the result of the decompilation process is a graph: either a tree like the AST of the C code (though I admit it is kinda nitpicky to call that a "graph representation"[0]), or more of a true graph like Binary Ninja's High Level Intermediate Language (if displayed as a CFG with blocks consisting of High Level Intermediate Language instructions).

"I am not sure why you claim it required decades of tooling"

I meant this in the sense of "modern RE is implicitly built on a few decades of research and engineering". Of course you could always use some kinds of graphs, but having good tools do this for you massively accelerates the RE process. There is a reason IDA could charge whatever they wanted before Ghidra and Binary Ninja came out: a good auto analysis and decompiler is a massive benefit that allows someone without much experience to do basic RE, and for an expert it still saves a lot of time, because they can focus on higher-level abstractions.

 

[0] Though you really start appreciating the difference in how those trees are internally represented, even if they look the same as pseudocode, when you want to work with them. E.g. the IDA C AST is a true AST that is useful for further transformations and analysis, while the Ghidra syntax tree isn't even an Abstract Syntax Tree, because it still contains information like whitespace tokens. So you would have to reparse it to a C AST before you can do anything useful with it, and that is a whole other rant arising from a side project of mine...
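A toy illustration of that difference (invented structures, not IDA's or Ghidra's actual data types): an abstract node for `a = b + 1` is directly analyzable, while a concrete token stream still carries layout tokens that any analysis must first skip or reparse:

```python
# Abstract syntax: structure only, ready for transformation and analysis.
ast_node = ("assign", ("var", "a"), ("add", ("var", "b"), ("const", 1)))

# Concrete/token representation: still contains whitespace tokens.
concrete = ["a", " ", "=", " ", "b", " ", "+", " ", "1", ";"]

def rhs_vars(node):
    # Walk the AST and collect variables read on the right-hand side.
    kind = node[0]
    if kind == "var":
        return [node[1]]
    if kind == "assign":
        return rhs_vars(node[2])
    if kind == "add":
        return rhs_vars(node[1]) + rhs_vars(node[2])
    return []

print(rhs_vars(ast_node))  # ['b']
```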