Is the output of the softmax in a single transformer attention head usually winner-takes-all?
post by Linda Linsefors · 2025-01-27T15:33:28.992Z
This is a question post.
Using the notation from here: A Mathematical Framework for Transformer Circuits
The attention pattern for a single attention head is determined by $A = \operatorname{softmax}(x^T W_Q^T W_K x)$, where the softmax is computed for each row of $x^T W_Q^T W_K x$.
Each row of $A$ gives the attention pattern for the current token. Are these rows (post-softmax) typically close to one-hot? I.e., are they mostly dominated by a single attended-to position (per current token)?
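For concreteness, here is a toy sketch (my own, not from the post) of what "close to one-hot" means operationally: apply the row-wise softmax to a masked score matrix, then look at the largest weight in each row. The use of PyTorch and random scores is purely illustrative.

```python
import torch

# Toy illustration: attention scores for one head over a 5-token sequence.
# Row i holds the scores from (destination) token i to (source) tokens 0..i.
scores = torch.randn(5, 5)
mask = torch.triu(torch.ones(5, 5, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))  # autoregressive masking

A = torch.softmax(scores, dim=-1)  # softmax over each row

# "Winner-takes-all" proxy: the largest attention weight in each row.
# Values near 1.0 mean the row is close to one-hot.
print(A.max(dim=-1).values)
```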
I'm interested in knowing this for various types of transformers, but mainly for LLMs and/or frontier models.
I'm asking because I think this has implications for computation in superposition.
Answers
answer by Buck
IIRC, for most attention heads the max attention is way less than 90%, so my answer is "no". It should be very easy to get someone to make a basic graph of this for you.
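In that spirit, here is a minimal sketch of such a measurement (my own, not from the answer), using the HuggingFace transformers library with GPT-2 as an assumed example model; the 0.9 cutoff is an arbitrary proxy for "near one-hot":

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small model; any causal LM that can return attentions would do.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: tuple of (batch, heads, seq, seq) tensors, one per layer.
for layer, attn in enumerate(out.attentions):
    max_per_row = attn[0].max(dim=-1).values  # (heads, seq)
    frac_peaked = (max_per_row > 0.9).float().mean(dim=-1)  # per head
    print(f"layer {layer}: fraction of near-one-hot rows per head:",
          [f"{f:.2f}" for f in frac_peaked.tolist()])
```

A real version of the graph Buck suggests would average over many prompts and plot the per-head distribution of max attention weights, rather than using a single sentence and a fixed threshold.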