Minor interpretability exploration #1: Grokking of modular addition, subtraction, multiplication, for different activation functions
post by Rareș Baron · 2025-02-26T11:35:56.610Z · 2 comments
Epistemic status: a small exploration without prior predictions; the results are low-stakes and likely correct.
Introduction
As a personal exercise for building research taste and experience in AI safety, and specifically in interpretability, I have done four minor projects, all building upon previously written code. They were done without previously formulated hypotheses or expectations, merely to check low-hanging fruit for anything interesting. In the end, they have not produced major insights, but I hope they will be of some small use and interest to people working in these domains.
This is the first project: extending Neel Nanda’s modular addition network, made for studying grokking, to subtraction and multiplication, as well as to all 6 activation functions of TransformerLens (ReLU, 3 variants of GELU, SiLU, and SoLU plus LayerNorm).
The modular addition grokking results have been redone using the original code, changing only the operation (to subtraction and multiplication) and the activation function.
TL;DR results
- Subtraction does implement the same Fourier transform-based "clock" algorithm. Multiplication, however, does not.
- GELU greatly hastens grokking, SiLU only slightly.
- LayerNorm (LN) messes up the activations and delays grokking, otherwise leaving the algorithm unchanged.
Methods
The basis for these findings is Neel Nanda’s grokking notebook. All modifications are straightforward; a sketch of the kind of change involved is given below.
All resulting notebooks, extracted graphs, and Word files with clean tabular comparisons can be found here.
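As an illustration of the kind of change involved (a minimal sketch, not the notebook's exact code; the modulus and the [a, b, =] token layout follow the original setup, while the function and variable names here are mine):

```python
import torch

P = 113  # modulus used in the original grokking setup

def make_dataset(p: int = P, operation: str = "add"):
    """All (a, b) pairs as tokens [a, b, =], with labels (a op b) mod p."""
    a = torch.arange(p).repeat_interleave(p)
    b = torch.arange(p).repeat(p)
    if operation == "add":
        labels = (a + b) % p
    elif operation == "sub":
        labels = (a - b) % p  # torch's % always returns values in [0, p)
    elif operation == "mul":
        labels = (a * b) % p
    else:
        raise ValueError(f"unknown operation: {operation}")
    equals = torch.full_like(a, p)  # token id p stands for '='
    tokens = torch.stack([a, b, equals], dim=1)
    return tokens, labels
```

Swapping the activation function is similarly a one-string change in the model configuration (the act_fn field in a TransformerLens-style config).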
Results
Operations
General observations for the three operations[1]:
Subtraction uses the same Fourier algorithm. Multiplication has a random-looking Fourier basis and does not use a Fourier algorithm of any kind. For multiplication, ablating any frequency (in this case, the same 5 as for addition) hurts performance across the board, and activations are mostly on the constant terms. (A sketch of these Fourier-basis checks is given at the end of this subsection.)
Discussion: The fundamentally multiplicative aspect of neural networks might be an explanation, but this seems confusing and requires further research.
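For context, the product-to-sum identities behind the "clock" algorithm make the subtraction result unsurprising. With $w_k = 2\pi k / p$ for a key frequency $k$:

$$\cos(w_k(a+b)) = \cos(w_k a)\cos(w_k b) - \sin(w_k a)\sin(w_k b)$$
$$\cos(w_k(a-b)) = \cos(w_k a)\cos(w_k b) + \sin(w_k a)\sin(w_k b)$$

so the subtraction circuit is the addition circuit with one sign flipped. There is no analogous identity expressing $\cos(w_k \cdot ab)$ through a fixed set of single-argument sinusoids of $a$ and $b$, which is at least consistent with multiplication ending up on a different algorithm.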
Subtraction and multiplication grok later than addition. Subtraction's clean-up phase is faster, however.
Discussion: The negative results of subtraction relative to addition, and the increased influence of the modulus for multiplication, respectively, are likely to blame (cosines are still used, after all).
Attention is what we would expect.
Subtraction makes some attention heads completely useless and others uniformly very high. Neuron activations, however, are more clustered.
Discussion: The smaller values given by the subtraction operation might be the cause.
Subtraction has tighter singular values and uses higher frequencies (one is off the graph).
Discussion: Likely for similar reasons as the above.
Multiplication has smaller values for the principal components.
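The Fourier-basis, key-frequency, and ablation claims in this section come from checks of roughly the following shape (a sketch loosely following the original notebook's methodology; the normalizations, names, and the exact "fraction explained" metric here are illustrative rather than the original code):

```python
import math
import torch

def fourier_basis(p: int) -> torch.Tensor:
    """Normalized 1D Fourier basis over Z_p: a constant row, then cos/sin pairs."""
    x = torch.arange(p)
    rows = [torch.ones(p)]
    for k in range(1, p // 2 + 1):
        rows.append(torch.cos(2 * math.pi * k * x / p))
        rows.append(torch.sin(2 * math.pi * k * x / p))
    F = torch.stack(rows[:p])  # (p, p); row 0 is constant, rows 2k-1 / 2k are cos/sin of frequency k
    return F / F.norm(dim=-1, keepdim=True)

def to_2d_fourier(acts: torch.Tensor, p: int) -> torch.Tensor:
    """acts: (p*p, n_neurons) activations over all (a, b) pairs -> 2D Fourier coefficients."""
    F = fourier_basis(p)
    return torch.einsum("fa,gb,abn->fgn", F, F, acts.reshape(p, p, -1))

def frac_explained(acts: torch.Tensor, p: int, key_freqs: list[int]) -> float:
    """Fraction of the activations' squared norm captured by the constant term and key frequencies."""
    coeffs = to_2d_fourier(acts, p)
    idx = [0] + [i for k in key_freqs for i in (2 * k - 1, 2 * k)]
    return (coeffs[idx][:, idx].pow(2).sum() / coeffs.pow(2).sum()).item()

def ablate_freqs(acts: torch.Tensor, p: int, freqs: list[int]) -> torch.Tensor:
    """Zero out the given frequencies and transform back (the basis has orthonormal rows)."""
    F = fourier_basis(p)
    coeffs = to_2d_fourier(acts, p)
    idx = [i for k in freqs for i in (2 * k - 1, 2 * k)]
    coeffs[idx] = 0.0
    coeffs[:, idx] = 0.0
    return torch.einsum("fa,gb,fgn->abn", F, F, coeffs).reshape(p * p, -1)
```

Something like ablate_freqs, patched into the forward pass, is the kind of intervention the frequency-ablation results refer to.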
Activation functions
Specific observations for the activation functions (ReLU, the three GELUs, SiLU, and SoLU with LayerNorm); they apply to all operations unless otherwise specified:
GELU variants (gelu_fast, gelu_new) grok very slightly later than the standard GELU. Besides that, there are no other relevant differences (all discussion from this point on refers to the standard GELU).
LN messes up the activations and delays grokking severely, but otherwise the same algorithm is implemented.
SiLU and especially GELU hasten grokking.
Discussion: Could the greater freedom afforded by the structure of these functions, their smoothness and quadratic aspect, help the Fourier circuit formation (or the non-Fourier algorithm of multiplication)? (The activation functions themselves are sketched at the end of this subsection.)
For addition, GELU has a large spike in test loss before the cleanup phase. For subtraction, this appears during memorisation.
Discussion: This is likely related to the negative values: they need more sorting out during cleanup than for ReLU, which does not deal with them, but they are necessary for subtraction and thus the spike appears during memorisation.
SiLU makes memorisation take longer. GELU, despite being similar, does not.
Discussion: Perhaps limiting the negative values that give outputs different from 0 is relevant here (GELU is softly capped at -1, SiLU at -2).
For addition, SiLU gives jitters and increasing test loss during the circuit formation phase. For subtraction and multiplication, this appears during memorisation.
Discussion: Similar reasons to the above might apply.
SoLU gives large spikes after cleanup is done (the slingshot effect?).
SoLU subtraction has interesting diagonal patterns for neuron activations. Some are negative.
SoLU weakly encourages constant frequencies, perhaps needed to smooth the transition beyond LN.
GELU has tighter singular values, and more of its neurons are explained by the frequencies.
Discussion: Its smoother, quadratic nature likely helps the Fourier algorithm.
Neurons for SoLU, and especially for multiplication, are poorly explained by the frequencies.
Discussion: Deviations from the Fourier algorithm are to blame.
GELU reduces the residual loss. It seems to help pick out better frequencies than ReLU.
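For reference, a quick sketch of the activation functions being compared (the SoLU form follows the "x times softmax(x), then LayerNorm" definition used by TransformerLens's solu_ln, with the LayerNorm omitted here):

```python
import torch
import torch.nn.functional as F

def solu(x: torch.Tensor) -> torch.Tensor:
    # SoLU acts across the hidden dimension (not elementwise); "solu_ln" then applies a LayerNorm.
    return x * torch.softmax(x, dim=-1)

# ReLU, GELU, and SiLU are elementwise; of these, only GELU and SiLU pass (small) negative
# values through, which is the behaviour the negative-value discussions above refer to.
x = torch.linspace(-4.0, 2.0, 13)
for name, fn in [("relu", F.relu), ("gelu", F.gelu), ("silu", F.silu)]:
    print(name, [f"{v:.3f}" for v in fn(x).tolist()])
```

Evaluating GELU and SiLU on negative inputs shows the small negative dips that the discussion above points to; ReLU zeroes that region out entirely.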
Discussion
Checking if the modular addition Fourier-based algorithm appears for other simple arithmetic operations is a small test of universality. Extending the work to other activation functions used in transformers is similar.
GELU hastens grokking, though it makes the process messier. The smooth, quadratic, partly negative shape of the function seems to be a large help in forming circuits.
That multiplication does not use the Fourier algorithm is a complication for universality, though, since neural networks can do multiplication natively, this is likely neither a real issue nor significant evidence against the hypothesis.
Other interpretability hypotheses are untouched by this. We also have yet another confirmation that LN hinders interpretability.
Conclusion
Multiplication does not use a Fourier algorithm, and GELU helps grokking. More research might be needed for multiplication and its algorithm.
Acknowledgements
I would like to thank the wonderful Neel Nanda et al. for starting this research direction, establishing its methods, and writing the relevant code.
[1] The graphs are for ReLU, though these observations apply to all activation functions. Graphs are, in order: addition, subtraction, multiplication.
2 comments
comment by Gurkenglas · 2025-02-26T13:59:54.325Z
Some of these plots look like they ought to be higher resolution, especially when Epoch is on the x axis. Consider drawing dots instead of lines to make this clearer.
reply by Rareș Baron · 2025-02-26T14:12:41.086Z
I will keep that in mind for the future. Thank you!
I have put all high-quality .pngs of the plots in the linked Drive folder.