Automating Mechanistic Interpretability via Program Synthesis

post by Edy Nastase (edy-nastase) · 2025-04-17T10:58:46.748Z · LW · GW · 1 comments

I have been researching this for a while, and it seems to me that there has not been much progress on "automating" MI using program synthesis. The only source I could find is a paper from Max Tegmark's lab. However, that paper has been around for quite a while, and not much progress seems to have been made since then (or maybe I am unaware of it).

The inspiration for using program synthesis comes from Stephen Casper, who suggested that program synthesis could help automate MI. In the post itself, he mentions some papers that transform RL policies into programs, while the rest are some obscure methods. However, none of them actually try to reverse-engineer parts of neural networks into "programs"...

Now, I was thinking about doing some research on this myself to see whether I could end up with a simple prototype. Something like this could represent some of the initial steps:

Select a domain or dataset to experiment with. I think previous MI work trains networks on arithmetic, operations on integers/Booleans, etc., and then tries to reverse-engineer them from there.
A method for identifying circuits (i.e. networks within the network) is probably necessary to understand what different parts of the network do.
Given a circuit, I need a way of determining whether it can be reversed into a program. Maybe if I define a DSL (domain-specific language) for the given problem domain, I could try different combinations of DSL operations against the sub-graph's inputs/outputs until time runs out? Or it could be a type-directed search where only specific operations between types are allowed.
From that point, if the prototype is OK, I could start thinking about integrating the system with something like DreamCoder, which essentially "discovers the DSL" given training data and a base language (which could be just the defined DSL, but simplified).
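
To make the third step more concrete, here is a minimal sketch of what a typed, time-budgeted search over a DSL might look like. Everything in it is an assumption for illustration: the toy DSL (`add`, `mul`, `lt`, `and`, `not`), the `Op` and `synthesize` names, and the idea that a circuit's behaviour has already been recorded as input/output pairs. It is a plain bottom-up enumeration that keeps one program per distinct behaviour on the examples, not the method from any of the papers mentioned above.

```python
import itertools
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Op:
    name: str
    arg_types: tuple
    ret_type: str
    fn: Callable


# Hypothetical toy DSL over integers and Booleans (an assumption for this sketch).
OPS = [
    Op("add", ("int", "int"), "int", lambda a, b: a + b),
    Op("mul", ("int", "int"), "int", lambda a, b: a * b),
    Op("lt", ("int", "int"), "bool", lambda a, b: a < b),
    Op("and", ("bool", "bool"), "bool", lambda a, b: a and b),
    Op("not", ("bool",), "bool", lambda a: not a),
]


def synthesize(examples, input_types, out_type, max_depth=3, budget_s=5.0):
    """Search for a DSL program matching every (inputs, output) example.

    Returns (expression_string, callable) or None if nothing is found
    within the depth limit or the time budget.
    """
    deadline = time.monotonic() + budget_s
    inputs = [inp for inp, _ in examples]
    target = tuple(out for _, out in examples)

    # bank[type][signature] = (expression, callable); the signature is the
    # tuple of values the expression takes on each example input.
    bank = {}
    for i, t in enumerate(input_types):
        sig = tuple(inp[i] for inp in inputs)
        bank.setdefault(t, {}).setdefault(sig, (f"x{i}", lambda xs, i=i: xs[i]))

    for depth in range(max_depth + 1):
        if target in bank.get(out_type, {}):
            return bank[out_type][target]
        if depth == max_depth:
            break
        found = []
        for op in OPS:
            # Type-directed: only combine sub-programs of the required types.
            pools = [list(bank.get(t, {}).items()) for t in op.arg_types]
            for combo in itertools.product(*pools):
                if time.monotonic() > deadline:
                    return None  # time ran out
                arg_sigs = [sig for sig, _ in combo]
                sig = tuple(op.fn(*vals) for vals in zip(*arg_sigs))
                expr = f"{op.name}({', '.join(e for _, (e, _) in combo)})"
                fns = [f for _, (_, f) in combo]
                run = lambda xs, op=op, fns=fns: op.fn(*(f(xs) for f in fns))
                found.append((op.ret_type, sig, (expr, run)))
        # Keep only the first program seen for each behaviour on the examples.
        for t, sig, prog in found:
            bank.setdefault(t, {}).setdefault(sig, prog)
    return None


# Pretend these pairs were recorded from a circuit's inputs and outputs.
examples = [((a, b, c), a + b < c)
            for a in range(4) for b in range(4) for c in range(8)]
result = synthesize(examples, ("int", "int", "int"), "bool")
print(result[0] if result else "no program found within budget")
```

A returned program only agrees with the circuit on the recorded examples, so it would still need to be checked on held-out inputs before being treated as an explanation of the circuit.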

What do you guys think? Why isn't there much research in this direction? What is missing from the above proposed plan?

1 comment

comment by Sergii (sergey-kharagorgiev) · 2025-04-17T12:28:36.638Z · LW(p) · GW(p)

Apparently it's more efficient to do it the other way around: compile programs into transformers, which are then useful as a reference and ground truth when analyzing "real" transformers.

See, for example, the use of Tracr in "Towards Automated Circuit Discovery for Mechanistic Interpretability": https://arxiv.org/pdf/2304.14997
