Comment by gabeorosan on What and Why: Developmental Interpretability of Reinforcement Learning · 2024-10-07T18:37:49.112Z

I recently learned about developmental interpretability and got very excited about the prospect of understanding how the structure encoding things like piece evaluation, tactics, and search develops under the hood of a chess engine, so I'm glad to find out I'm not the only one interested in this!

> Why study reinforcement learning/developmental interpretability?

The main reason I find this angle so compelling is that the learning process for games compresses what I see as "natural" reasoning into a form we have some intuition for and can make reasonable hypotheses about. I think a lot about the limits of our cognitive senses, which I've heard Chris Olah talk about: a primary obstacle in interpretability research is that the computation actually performed inside a model greatly exceeds our working memory capacity. Games provide one of the most information-dense interfaces between the internals of a model and our own minds; observing and steering the process of a network learning to play a game is the best way I can think of to encode a model's thought process into my own neural net (to the extent that's possible).

To that end, I think it would be amazing if there were something like Neuronpedia for looking inside a chess engine as it learns: something that let you view replays, gave you an analysis board where you could pick a weight snapshot and a board state and then play against it or have it play against itself, and drew nice visualizations of various metrics. There's a large community of people interested in chess engines, and I bet they would discover some cool things. If this sounds interesting to anyone else, please let me know!
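
To make the analysis-board idea a bit more concrete, here's a rough Python sketch (using python-chess) of the core loop I have in mind: pick a board state and a policy snapshot, let it play against itself, and record the moves for a replay viewer. The one-ply material-count policy and the names like `self_play_from` are just illustrative placeholders; a real tool would load saved network weights for each training checkpoint instead.

```python
# Minimal sketch of the "analysis board" loop: choose a position and a policy
# snapshot, play it against itself, and collect the moves for a replay viewer.
# The toy material-count policy below is a hypothetical stand-in for a real
# loaded weight snapshot.
import chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def toy_policy(board: chess.Board) -> chess.Move:
    """Stand-in for a network snapshot: pick the move with the best
    one-ply material swing for the side to move."""
    def material_after(move: chess.Move) -> int:
        board.push(move)
        # After push, the mover is the side *not* to move.
        mover = not board.turn
        score = sum(PIECE_VALUES[p.piece_type] * (1 if p.color == mover else -1)
                    for p in board.piece_map().values())
        board.pop()
        return score
    return max(board.legal_moves, key=material_after)

def self_play_from(fen: str, policy, max_plies: int = 40) -> list[str]:
    """Play a policy against itself from a chosen position; return SAN moves."""
    board = chess.Board(fen)
    sans = []
    for _ in range(max_plies):
        if board.is_game_over():
            break
        move = policy(board)
        sans.append(board.san(move))
        board.push(move)
    return sans

if __name__ == "__main__":
    # e.g. compare snapshots by running the same position through each one
    print(self_play_from(chess.STARTING_FEN, toy_policy, max_plies=10))
```

Running the same position through a series of checkpoints and diffing the resulting game records seems like the simplest way to surface where in training a tactic or evaluation habit first appears.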