Minor interpretability exploration #2: Extending superposition to different activation functions

post by Rareș Baron · 2025-03-06T11:22:53.528Z

Contents

  Introduction
    TL;DR results
  Previous work
  Methods
  Results
    General
    Specific
  Discussion
  Conclusion
  Acknowledgements

Epistemic status: small exploration without previous predictions, results low-stakes and likely correct.

Introduction

As a personal exercise in building research taste and experience in AI safety, and specifically in interpretability, I have done four minor projects, all building on previously written code. They were carried out without previously formulated hypotheses or expectations, merely to check whether any low-hanging fruit held something interesting. In the end, they have not yielded major insights, but I hope they will be of some use and interest to people working in these areas.

This is the second project: extending Anthropic's interpretability team's toy model of superposition notebook to 26 more activation functions.

The toy model of superposition results were redone using the original code, changing only the activation function where appropriate.

TL;DR results

Broadly, there were three main general observations, all somewhat confusing and all suggesting that additional research is needed:

Previous work

Lad and Kostolansky (2023) extended superposition results to GeLU, SiLU, Sigmoid, Tanh, and SoLU. This project extends the analysis to 20 more activation functions while confirming previous findings: GELU/SiLU reduce superposition and polysemanticity while smoothing out ReLU, Sigmoid extends superposition to many more neurons, Tanh eliminates it, and SoLU enforces sparsity.

Methods

The basis for these findings is Anthropic’s toy model of superposition notebook, as well as Zephaniah Roe's reproduction and extension ('demonstrating superposition' and 'introduction'). All modifications are straightforward replacements of ReLU, plus extensions of the sparsity regimes.
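For concreteness, below is a minimal sketch of the kind of model being varied, assuming the usual tied-weight bottleneck setup from the original notebook; the class and argument names here are mine, not the notebook's:

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Toy model of superposition with a swappable activation.

    A sketch of the Anthropic-style setup, not the exact notebook code:
    features are projected into a small hidden space by W and
    reconstructed with the tied transpose, a bias, and the activation.
    """
    def __init__(self, n_features=20, n_hidden=5, activation=nn.ReLU()):
        super().__init__()
        self.W = nn.Parameter(torch.empty(n_features, n_hidden))
        nn.init.xavier_normal_(self.W)
        self.b = nn.Parameter(torch.zeros(n_features))
        self.activation = activation  # the only piece varied between runs

    def forward(self, x):
        h = x @ self.W                 # (batch, n_hidden) bottleneck
        x_hat = h @ self.W.T + self.b  # tied-weight reconstruction
        return self.activation(x_hat)

# Swapping the activation is the entire modification, e.g.:
model = ToyModel(activation=nn.GELU())
```

The training loop and importance-weighted reconstruction loss are left as in the original notebook; only the activation changes between runs.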

All resulting notebooks, extracted graphs, tables, and Word files with clean tabular comparisons can be found here.

Results

Besides the original ReLU, the 26 activation functions surveyed are the following: LeakyReLU, SELU, CELU, GELU, Sigmoid, Hardsigmoid, SiLU/Swish, Hardswish, Mish, Softplus, Tanh, Hardtanh, ELU, Hardshrink, Softmax, SoLU/Softmax Linear, Logsoftmax, Logsigmoid, Exp, Exp linear, Tanhshrink, ReLU6, ReLU^2, Softshrink, Softsign, and Softmin. Implementations were taken from PyTorch or derived from its primitives.
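As a rough illustration of how these might be wired in, assuming standard `torch.nn` modules plus small custom wrappers for the few functions PyTorch does not ship (these wrapper definitions are my reading of the names above, not necessarily the exact implementations used; the ambiguous "Exp linear" variant is omitted):

```python
import torch
import torch.nn as nn

class SoLU(nn.Module):
    """Softmax Linear Unit: x * softmax(x)."""
    def forward(self, x):
        return x * torch.softmax(x, dim=-1)

class Exp(nn.Module):
    """Exponential activation."""
    def forward(self, x):
        return torch.exp(x)

class ReLUSquared(nn.Module):
    """ReLU^2: relu(x) squared."""
    def forward(self, x):
        return torch.relu(x) ** 2

activations = {
    "ReLU": nn.ReLU(), "LeakyReLU": nn.LeakyReLU(), "SELU": nn.SELU(),
    "CELU": nn.CELU(), "GELU": nn.GELU(), "Sigmoid": nn.Sigmoid(),
    "Hardsigmoid": nn.Hardsigmoid(), "SiLU": nn.SiLU(),
    "Hardswish": nn.Hardswish(), "Mish": nn.Mish(),
    "Softplus": nn.Softplus(), "Tanh": nn.Tanh(), "Hardtanh": nn.Hardtanh(),
    "ELU": nn.ELU(), "Hardshrink": nn.Hardshrink(),
    "Softmax": nn.Softmax(dim=-1), "SoLU": SoLU(),
    "LogSoftmax": nn.LogSoftmax(dim=-1), "LogSigmoid": nn.LogSigmoid(),
    "Exp": Exp(), "Tanhshrink": nn.Tanhshrink(), "ReLU6": nn.ReLU6(),
    "ReLU^2": ReLUSquared(), "Softshrink": nn.Softshrink(),
    "Softsign": nn.Softsign(), "Softmin": nn.Softmin(dim=-1),
}
```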

General

General observations have been given above.

Specific

Successive graphs for each function correspond to increasing feature sparsity (from 0 to 0.999).
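A minimal sketch of one such sparsity sweep, assuming the standard toy-model sampling scheme in which each feature is zeroed with probability S and otherwise drawn uniformly from [0, 1]; the exact sweep values and function names are illustrative:

```python
import torch

def sample_features(batch_size, n_features, sparsity):
    """Sample a batch where each feature is zero with probability `sparsity`."""
    feats = torch.rand(batch_size, n_features)
    mask = (torch.rand(batch_size, n_features) > sparsity).float()
    return feats * mask

# One plot per sparsity level, per activation function.
sparsity_levels = [0.0, 0.7, 0.9, 0.99, 0.999]
```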

Specific observations for the functions:

Discussion

While ReLU is the most commonly used activation function in neural networks, it is neither the only one nor the only one of interest. I have gone through almost all of PyTorch's standard activation functions in order to be exhaustive, and I have also included functions discussed in the interpretability literature: GeLU, SoLU, and the exponential activation function. No LayerNorm or other modifications were applied to the initial networks, so this exercise strictly measures the impact each function has on the development and degree of superposition.

ReLU is unlikely to be replaced as the standard activation function. It is nonetheless important to know which properties of these functions lead to superposition, and therefore to reduced interpretability potential. This experiment gives preliminary evidence on which characteristics help or hinder superposition (non-negative, asymmetrical, smooth functions promote it): designing an activation function to combat superposition while keeping performance high (as SoLU aims to) might benefit from these observations.

More research is needed, especially into the strange behaviour of Sigmoid, the exponential functions, and above all SoLU, and into the unusually strong superposition induced by Softplus.

Conclusion

The broad conclusion is that more superposition leads to marginally smaller losses; non-negative, asymmetrical activation functions produced the strongest superposition and the lowest losses. The fact that loss and superposition are not tightly coupled is nevertheless encouraging: it suggests that a degree of superposition can be removed from a network without sacrificing much performance.

Acknowledgements

I would like to thank the original Anthropic interpretability team for starting this research direction, establishing its methods, and writing the relevant code, as well as Zephaniah Roe for the excellent reproduction and extension.
