Applying SVD to neural nets in general is not a new idea. It's been used a fair amount in the field (e.g. by Saxe and Olah), but mostly in relation to some input data: either you run SVD on the activations, or on some input-output correlation matrix.
You generally need some data to compare against in order to understand what each vector of your factorization actually represents. What's interesting about this technique (imo - and this is mostly Beren's work, so not trying to toot my own horn here) is twofold:
- You don't have to run your model over a whole evaluation set - which can be very expensive - to do this sort of analysis. In fact, you don't have to do a forward pass through the model at all. Instead you can project the weight matrix you want to analyse into the embedding space (as first noted in the logit lens and https://arxiv.org/pdf/2209.02535.pdf) and factorize the resulting matrix. You can then analyse each SVD vector with respect to the model's vocabulary and get an at-a-glance idea of what kind of processing each layer is doing (see the rough sketch after this list). This could prove useful in future scenarios where, e.g., we want computationally efficient interpretability analyses that can be run during training to check for deception, or to otherwise debug a model's behaviour.
- The degree of interpretability of these simple factorizations suggests that the matrices we’re analysing operate on largely* linear representations - which could be good news for the MI field in general, as we haven’t made much headway analysing non-linear features.
*As Peter mentions below, we should avoid over-updating on this. Linear features are almost certainly low-hanging fruit. Even if they represent "the majority" of the computation going on inside the network in whatever sense, it's likely that understanding all of the linear features in a network will not give us the full story about the network's behaviours.
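
To make the first bullet concrete, here's a minimal sketch of the sort of thing I mean, using GPT-2 via HuggingFace transformers. The layer index, the choice of the MLP output matrix, and the top-k value are arbitrary illustrative choices on my part, not anything canonical from Beren's write-up:

```python
# Sketch: SVD of one weight matrix, projected through the unembedding so each
# singular vector can be read off as a list of vocabulary tokens.
# No forward passes and no evaluation data are needed.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

layer = 5  # which transformer block to inspect (arbitrary choice)
# MLP output projection of that block; HF stores it as a Conv1D with
# weight shape (d_mlp, d_model), so its right singular vectors live in
# the residual-stream (d_model) space.
W = model.transformer.h[layer].mlp.c_proj.weight.detach()
W_U = model.lm_head.weight.detach()  # unembedding, shape (vocab, d_model)

U, S, Vt = torch.linalg.svd(W, full_matrices=False)

k = 10
for i in range(5):  # look at the top few singular directions
    logits = W_U @ Vt[i]  # project the direction onto the vocabulary
    top = torch.topk(logits, k).indices
    # Note: singular vectors have an arbitrary sign, so the bottom-k of
    # `logits` (i.e. top-k of -Vt[i]) can be just as informative.
    print(f"SV {i} (sigma={S[i].item():.2f}):",
          tokenizer.convert_ids_to_tokens(top.tolist()))
```

If the technique works on your model, the printed token clusters for the leading singular vectors should look semantically or syntactically coherent, which is the "at a glance" signal about what that layer is doing.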