Comments

Comment by george robinson (george-robinson) on SAE feature geometry is outside the superposition hypothesis · 2024-07-08T16:40:01.640Z · LW · GW

This probably just comes from the way I've seen features/interpretability explained: the features on one layer are thought of as relevant combinations of simpler features from the previous layer (this seems to be the standard explanation for features of image classifiers in particular). This is certainly simplistic, since the higher-level features are probably much more complicated functions of the previous layer than just 'these n features are all present'. However, for understanding some of the geometry, I think it could be interesting.

For example, you can certainly build a simplicial complex in the following way: let the features of the first layer be the 0-simplices. For a feature F on the n-th layer, compute the n features from the previous layer that are most likely to fire on a sample highly related to F, and produce an (n-1)-simplex on these (by most likely, I mean either estimated by sampling, or there may be a purely mathematical way of doing this from the feature vectors). This simplicial complex is a pretty basic object recording the relationships between features (on the same layer, or between layers). I can't really say whether it would be particularly easy to actually compute with, but it might have some interesting topological features (e.g. how easy is it to disconnect the complex by removing simplices, or equivalently by clamping features to zero?).
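To make the construction a bit more concrete, here is a minimal Python sketch for a single pair of adjacent layers. Everything in it is an assumption for illustration: the input `parent_scores` (a hypothetical table of how strongly each previous-layer feature fires on samples related to each upper-layer feature), the function names, and the choice to model "clamping a feature to zero" as deleting that vertex together with every simplex containing it. The simplex size `k` is passed in explicitly; in the description above it would be the layer index n.

```python
from itertools import combinations


def top_k_parents(scores, k):
    """Return the k previous-layer feature ids with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def build_complex(parent_scores, k):
    """Build a simplicial complex as a set of frozensets.

    parent_scores: {upper_feature: {prev_feature: score}} (hypothetical input).
    Each upper-layer feature contributes the (k-1)-simplex on its top-k
    previous-layer features, together with all of that simplex's faces.
    """
    simplices = set()
    for scores in parent_scores.values():
        top = top_k_parents(scores, k)
        for size in range(1, len(top) + 1):
            for face in combinations(top, size):
                simplices.add(frozenset(face))
    return simplices


def connected_components(simplices, removed=frozenset()):
    """Count connected components after deleting the vertices in `removed`."""
    # Drop every simplex that touches a removed (clamped) vertex.
    kept = [s for s in simplices if not (s & removed)]
    parent = {v: v for s in kept for v in s}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for s in kept:
        vertices = list(s)
        for v in vertices[1:]:
            parent[find(v)] = find(vertices[0])
    return len({find(v) for v in parent})


# Toy scores, invented purely for illustration.
parent_scores = {
    "F1": {"a": 0.9, "b": 0.8, "c": 0.1},
    "F2": {"b": 0.7, "c": 0.9, "d": 0.2},
    "F3": {"c": 0.6, "d": 0.8, "e": 0.1},
}
toy_complex = build_complex(parent_scores, k=2)
before = connected_components(toy_complex)
after = connected_components(toy_complex, removed=frozenset({"c"}))
```

In this toy example the complex is a path a-b-c-d, so `before` is 1 and `after` is 2: clamping the single feature "c" disconnects it. For real feature dictionaries one would presumably want a dedicated library for simplicial complexes and homology rather than plain sets, but that is outside this sketch.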

Comment by george robinson (george-robinson) on SAE feature geometry is outside the superposition hypothesis · 2024-06-25T18:19:13.046Z · LW · GW

Interesting post, thanks!

I wonder whether there are some properties of the feature geometry that could be more easily understood by looking at the relations between features in adjacent layers. My mental image of the feature space is something like a simplicial complex, with the features at each layer n corresponding roughly to n-dimensional faces that connect up lower-dimensional features from previous layers.