Comments

Comment by gradStudent52 on [Interim research report] Taking features out of superposition with sparse autoencoders · 2024-07-15T16:27:03.490Z · LW · GW

Hello! After reading Anthropic's papers and similar work, I am trying to understand the fundamental "big picture". Specifically, it is not clear to me how "features" are extracted from the activations of the SAE's hidden layer. Two things in particular are contributing to my confusion:

  1. Any given neuron in the SAE's hidden layer depends on all neurons in the SAE's input layer. How, then, can an individual hidden neuron (or its activation) be related to a single feature that is superposed in the SAE's input activations?
  2. Based on my reading, each neuron in the SAE's hidden layer corresponds to some "feature". How is that feature identified as something concrete and interpretable? For example, by observing the activation of hidden neuron i, how does one determine that this activation corresponds to the presence of numbers (or any other phenomenon) in the input, which here is some text? (I have included a sketch of my current understanding of the setup after this list.)
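
To make it concrete where my confusion lies, here is a minimal sketch of what I currently believe the SAE setup to be (PyTorch; the dimension names and `l1_coeff` are my own placeholders, not taken from any particular paper). Please correct anything that is off:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: d_model-dim LLM activations -> d_hidden (overcomplete) feature activations."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)   # W_enc, b_enc
        self.decoder = nn.Linear(d_hidden, d_model)   # W_dec, b_dec
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor):
        # Each hidden unit reads out from *all* input dimensions, then ReLU (source of question 1).
        f = self.relu(self.encoder(x))                # feature activations, mostly zero after training
        x_hat = self.decoder(f)                       # reconstruction = sum_i f_i * (i-th decoder column)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most f_i to zero,
    # so each input is explained by only a few active hidden units.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
```

Is my mental model correct that, because the reconstruction is a sum of each hidden activation f_i times the i-th decoder column, it becomes meaningful to treat hidden unit i as a "feature"? Even granting that, I still do not see how one gets from f_i to an interpretable description such as "this fires on numbers" (question 2).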

I would greatly appreciate any insight or feedback (and please point out any faulty understanding on my part)!