A proposal for iterated interpretability with known-interpretable narrow AIs
post by Peter Berggren (peter-berggren) · 2025-01-11
I decided, as a challenge to myself, to spend 5 minutes, by the clock, solving the alignment problem. This is the result, plus 25 minutes of writing it up. As such, it might be a bit unpolished, but I hope that it can still be instructive.
Background
This proposal is loosely based on iterated amplification, a scheme for training AIs on problems that are too difficult for humans to judge directly by bootstrapping from human judgments on simpler problems. The proposal here goes further, however, sketching a broader model of AI development and a series of research avenues based on the idea of using human knowledge about simpler AIs to develop knowledge about more complex ones.
Key features of my proposal
Concentrating on narrow-domain systems
It seems intuitively plausible that interpreting narrow systems will be simpler than interpreting more advanced ones, since a system that encodes fewer concepts should exhibit less neuron polysemanticity. Narrow systems could nonetheless be very useful, particularly for tasks such as:
- Nanomachines (useful in a wide range of domains)
- Connectomics (a prerequisite for whole-brain emulation, and hence for any proposal that relies on it)
- Interpretability of neural networks (findings here can be extended to broader AI systems)
The remainder of this proposal will focus on narrow-domain AIs that can interpret other AIs, as this is a particularly instructive initial task.
Applying a penalty to lack of interpretability during training
Existing methods for "pruning" neural networks could be applied while training the narrow systems described above, improving their interpretability. In addition, the model could be penalized during training for being difficult to interpret. It is not obvious how to specify that objective directly, but a variant of RLHF could be used: raters (human or automated) score how interpretable the model's internals appear, and that score is folded into the training reward. The details of how this would work are still unclear, and I intend to write them up in the future; a minimal sketch of what such a penalty might look like is given below.
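As a rough illustration, here is a minimal sketch of adding an interpretability-flavored penalty to an otherwise ordinary training loop. It uses L1 sparsity of hidden activations as a crude stand-in for a real interpretability score (a learned reward model trained on human judgments could replace it); the model, data, and `sparsity_weight` coefficient are placeholder assumptions, not anything this proposal commits to.

```python
# Minimal sketch: penalize a proxy for "lack of interpretability" during training.
# The proxy here is L1 sparsity of hidden activations; a real implementation might
# instead use a learned reward model trained on human interpretability judgments.
import torch
import torch.nn as nn

class NarrowClassifier(nn.Module):
    def __init__(self, d_in=32, d_hidden=64, n_classes=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.head = nn.Linear(d_hidden, n_classes)

    def forward(self, x):
        h = self.encoder(x)   # hidden activations we would like to keep sparse
        return self.head(h), h

model = NarrowClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
task_loss_fn = nn.CrossEntropyLoss()
sparsity_weight = 1e-3        # illustrative coefficient; would need tuning

for step in range(1000):
    x = torch.randn(128, 32)              # placeholder inputs
    y = torch.randint(0, 4, (128,))       # placeholder labels
    logits, hidden = model(x)
    task_loss = task_loss_fn(logits, y)
    interp_penalty = hidden.abs().mean()  # sparser activations ~ fewer active features per input
    loss = task_loss + sparsity_weight * interp_penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```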
Developing known-interpretable systems that can interpret more advanced systems
Given existing work in which GPT-4 was used to interpret neurons in GPT-2 (showing that automating interpretability work is at least possible), it seems plausible that a smaller system could interpret a larger one if the smaller system were specifically designed for this task. There seems to me to be nothing about LLMs in particular that makes them especially good at interpretability compared to other types of AI models, so experimentation across a broader range of architectures (GANs, RNNs, CNNs, FFNNs, or potentially even logical expert systems) seems like it would allow for the construction of smaller systems that can interpret larger ones. A sketch of the basic shape of such a pipeline is given below.
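To make this concrete, here is a minimal sketch of the data-gathering step in automated interpretability: finding the inputs that most strongly activate a single neuron of a "subject" model, which a separate, smaller interpreter model would then try to explain. The models and data are toy stand-ins, and the interpreter itself is left as a comment, so this shows the shape of the pipeline rather than a working interpretability tool.

```python
import torch
import torch.nn as nn

# A toy "subject" model standing in for the larger system we want to interpret.
subject = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 8))
inputs = torch.randn(10_000, 16)              # placeholder dataset

neuron_index = 3                              # arbitrary hidden-layer neuron to study
with torch.no_grad():
    hidden = torch.relu(subject[0](inputs))   # activations of the first hidden layer
    activations = hidden[:, neuron_index]

top_values, top_rows = activations.topk(20)   # the 20 most strongly activating examples
exemplars = inputs[top_rows]

# A smaller interpreter model would take `exemplars` (and their activation values)
# and produce an explanation of the neuron; the explanation could then be scored
# by how well it predicts the neuron's activations on held-out inputs.
print(f"Neuron {neuron_index}: top activation {top_values[0].item():.3f} "
      f"across {exemplars.shape[0]} exemplars")
```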
An outline of steps for this proposal
1. In small-scale experiments, test a range of AI architectures to determine which are best at doing automated interpretability work on a range of different AI systems.
2. Take the most promising ones and train them in a way that improves their own interpretability (rewarding sparsity, rewarding interpretability via RLHF).
3. Have researchers do interpretability work on them with the best available present tools.
4. Train larger models using the same techniques.
5. Have researchers do interpretability work on these larger models with the best available tools, including the interpreted smaller models and any insights gained from interpreting them.
6. Repeat steps 4-5 until these tools are good enough to be reliably applied to state-of-the-art AI systems (a schematic of this loop is sketched below).
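Purely as a schematic, the loop in steps 3-6 could be written out as follows. Every function name here is a hypothetical placeholder for a research activity or future tool, not an implementation of it.

```python
# Schematic of the loop in steps 3-6; each function is a named placeholder for a
# research activity or future tool, so the logic here is purely illustrative.
def train_with_interpretability_reward(name, size):
    """Placeholder for steps 2/4: train a model with sparsity / RLHF-style interpretability rewards."""
    return {"name": name, "size": size}

def human_interpretability_pass(model, prior_insights):
    """Placeholder for steps 3/5: researchers interpret `model` with current tools plus prior insights."""
    return prior_insights + [f"insights about {model['name']}"]

model = train_with_interpretability_reward("small interpreter", size=1)
insights = human_interpretability_pass(model, prior_insights=[])         # step 3

for scale in (2, 4, 8):                                                  # steps 4-6
    model = train_with_interpretability_reward(f"interpreter x{scale}", size=scale)
    insights = human_interpretability_pass(model, insights)              # step 5 reuses earlier insights

print(insights)
```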
Conclusion
I understand that this proposal is quite unpolished, and that it falls under the general heading of "using AI to align AI," which is often unpopular. Still, I see some meaningful differences from present work in this domain, particularly the focus on narrow systems and the experimentation over which architectures aid interpretability the most. If I had to pick the most important part, it would probably be the idea of using RLHF to reward a system for being interpretable. If you want a more polished version of this, feel free to let me know; otherwise, I will focus my alignment-proposal work on formalizing the "interpretability by RLHF" plan.