Comments

Comment by Aryaman Arora (aryaman-arora) on The ‘strong’ feature hypothesis could be wrong · 2024-08-05T19:24:55.992Z

cf. https://arxiv.org/abs/2407.14662

Comment by Aryaman Arora (aryaman-arora) on Some common confusion about induction heads · 2023-03-29T20:19:12.585Z

Really nice summarisation of the confusion. Re: your point 3, this makes "induction heads" as a class of things feel a lot less coherent :( I had also not considered treating the behaviour on random sequences as a fallback definition of induction. Do you think there may be induction-y heads that simply don't activate on random sequences because of how out-of-distribution they are?
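(A minimal sketch of the kind of random-sequence check I mean, assuming the TransformerLens HookedTransformer API; the sequence length and token range here are arbitrary choices of mine:)

```python
import torch
from transformer_lens import HookedTransformer

# Sketch: compare per-token loss on the first vs. second copy of a random sequence.
# A big drop on the second copy is the usual behavioural signature of induction.
model = HookedTransformer.from_pretrained("gpt2")

seq = torch.randint(1000, 10000, (1, 50))                      # random, out-of-distribution tokens
bos = torch.tensor([[model.tokenizer.bos_token_id]])
tokens = torch.cat([bos, seq, seq], dim=1).to(model.cfg.device)

loss = model(tokens, return_type="loss", loss_per_token=True)  # shape [1, seq_len - 1]
print("loss on first copy :", loss[0, :50].mean().item())
print("loss on second copy:", loss[0, 50:].mean().item())
```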

Comment by Aryaman Arora (aryaman-arora) on SolidGoldMagikarp (plus, prompt generation) · 2023-02-06T00:25:41.338Z

I'll just preregister that I bet these weird tokens have very large norms in the embedding space.

Comment by Aryaman Arora (aryaman-arora) on A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien) · 2022-11-09T00:32:48.314Z

Cool that you figured that out; that easily explains the high cosine similarity! A large constant offset to all the embeddings still seems interesting to me, since it means GPT-Neo's later layers have to do computation that accounts for it, which doesn't seem like an efficient choice. I will def poke around more.

Interesting on MLP0 (I swear I use zero indexing lol just got momentarily confused)! Does that hold across the different GPT sizes?

Comment by Aryaman Arora (aryaman-arora) on A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien) · 2022-11-08T20:20:21.787Z

I'm pretty sure! I don't think I messed up anywhere in my code (just a nested for loop lol). An interesting consequence of this is that for GPT-2, applying the logit lens to the embedding matrix (i.e. computing W_E W_U) gives us a near-perfect autoencoder (the top output for each token is that token itself), but for GPT-Neo it always returns the token whose embedding has the largest norm, since the cosine-similarity term in the dot product is essentially useless.
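(A minimal sketch of that check, assuming the TransformerLens HookedTransformer API and its W_E / W_U attributes; the model name and chunk size are just illustrative:)

```python
import torch
from transformer_lens import HookedTransformer

# Sketch: check whether argmax over W_E @ W_U maps each token back to itself,
# i.e. whether logit-lensing the embedding matrix acts as a near-perfect autoencoder.
model = HookedTransformer.from_pretrained("gpt2")   # compare against a GPT-Neo checkpoint

matches = 0
with torch.no_grad():
    for start in range(0, model.cfg.d_vocab, 1024):                  # chunk to keep memory modest
        chunk_logits = model.W_E[start:start + 1024] @ model.W_U     # [chunk, d_vocab]
        top = chunk_logits.argmax(dim=-1)
        idx = torch.arange(start, start + top.shape[0], device=top.device)
        matches += (top == idx).sum().item()

print("fraction of tokens mapped back to themselves:", matches / model.cfg.d_vocab)
```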

What do you mean about MLP0 being basically part of the embed btw? There is no MLP before the first attention layer right?

Comment by Aryaman Arora (aryaman-arora) on A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien) · 2022-11-08T19:58:54.376Z

Huh, interesting about the backup heads in GPT-Neo! I would not expect a dropout-less model to have those. Some ideas to consider:

  • the backup heads could have other main functions but incidentally are useful for the specific task we're looking at, so they end up taking the place of the main heads
  • thinking of virtual attention heads, the computations performed are not easily interpretable at the individual head-level once you have a lot of layers, sort of like how neurons aren't interpretable in big models due to superposition

Re: GPT-Neo being weird, one of the colabs in the original logit lens post shows that the logit lens is pretty decent for standard GPT-2 of varying sizes but basically useless for GPT-Neo, i.e. it outputs extremely unlikely tokens for every layer before the last one. The bigger GPT-Neos are a bit better (some layers are kinda interpretable with the logit lens) but still bad. Basically, the residual stream is just in a totally wacky basis until the last layer's computations, unlike GPT-2, which shows more stability across layers (the whole reason the logit lens works).
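(A rough sketch of that per-layer logit-lens readout, assuming the TransformerLens API; the prompt and model name are just placeholders:)

```python
from transformer_lens import HookedTransformer

# Sketch: read out the top token from the residual stream after each layer
# by applying the final LayerNorm + unembedding early (the logit lens).
model = HookedTransformer.from_pretrained("gpt2")           # swap in a GPT-Neo checkpoint to see it fail
tokens = model.to_tokens("The Eiffel Tower is in the city of")
_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer]                      # [batch, pos, d_model]
    layer_logits = model.unembed(model.ln_final(resid))
    top = layer_logits[0, -1].argmax().item()               # top token at the final position
    print(f"layer {layer:2d}: {model.to_string(top)!r}")
```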

One weird thing I noticed with GPT-Neo 125M's embedding matrix is that the input static embeddings are super concentrated in vector space: the avg. pairwise cosine similarity is 0.960, compared to GPT-2 small's 0.225.
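(Roughly how one can compute this, sketched with the TransformerLens API; the subsampling is only there to keep it cheap, and the model identifiers are assumptions rather than exactly what I ran:)

```python
import torch
from transformer_lens import HookedTransformer

def avg_pairwise_cosine_sim(W_E: torch.Tensor, n_sample: int = 2000) -> float:
    """Estimate the mean pairwise cosine similarity over a random subsample of token embeddings."""
    idx = torch.randperm(W_E.shape[0], device=W_E.device)[:n_sample]
    X = torch.nn.functional.normalize(W_E[idx].float(), dim=-1)              # unit-norm rows
    sims = X @ X.T                                                           # all pairwise cosine sims
    off_diag = sims[~torch.eye(len(idx), dtype=torch.bool, device=sims.device)]
    return off_diag.mean().item()

for name in ["gpt2", "EleutherAI/gpt-neo-125M"]:
    model = HookedTransformer.from_pretrained(name)
    print(name, round(avg_pairwise_cosine_sim(model.W_E.detach()), 3))
```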

On the later layers not doing much, I saw some discussion on the EleutherAI discord that probes can recover really good logit distributions from the middle layers of the big GPT-Neo models. I haven't looked into this more myself so I don't know how it compares to GPT-2. Just seems to be an overall profoundly strange model.

Comment by Aryaman Arora (aryaman-arora) on A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien) · 2022-11-08T14:54:30.338Z

Understand IOI in GPT-Neo: it's a same-size model but does IOI via composition of MLPs

GPT-Neo might be weird because it was trained without dropout iirc. In general, it seems to be a very unusual model compared to others of its size; e.g. logit lens totally fails on it, and probing experiments find most of its later layers add very little information to its logit predictions. Relatedly, I would think dropout is responsible for backup heads existing and taking over if other heads are knocked out.