Comments

Comment by Liv Gorton (liv-gorton) on A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team · 2024-07-18T17:02:10.550Z

This is a great post! Thank you for writing this up :)

On training SAEs on ConvNets: I recently trained SAEs on all layers of InceptionV1. I've written up a paper on some of the early-vision findings, with a specific focus on curve detectors (there's a Twitter thread on the paper and another on some findings related to branch specialisation). The features look really good across the entire model. That includes interpretable, monosemantic features in the final layer, which, to the best of my knowledge, hasn't been done before and is really exciting! I'm hoping to put out a blog post focusing on the final layer in the next couple of weeks (including circuit analysis between the last few layers).

Being able to say we fully understand any real neural network would be such a huge step forward for the field, and with SAEs it seems we are now well-positioned to actually achieve this goal.