Minor interpretability exploration #3: Extending superposition to different activation functions (loss landscape)

post by Rareș Baron · 2025-03-14

Contents

  Introduction
    TL;DR results
  Methods
  Results
    General
    Specific
  Discussion
  Conclusion
  Acknowledgements

Epistemic status: a small exploration without prior predictions; results are low-stakes and likely correct.

Introduction

As a personal exercise for building research taste and experience in AI safety, and specifically in interpretability, I have done four minor projects, all building upon previously written code. They were done without previously formulated hypotheses or expectations, merely to check some low-hanging fruit for anything interesting. In the end, they have not yielded major insights, but I hope they will be of some use and interest to people working in these domains.

This is the third project: extending Timaeus' developmental interpretability (devinterp) results on toy models of superposition (TMS), which study the formation of superposition during training using the local learning coefficient (LLC). It partially replicates this project.

The TMS results have been redone using the original code, changing only the activation function.

TL;DR results

Broadly, there were four main observations, all somewhat confusing and suggesting that additional research is needed:

- Superposition develops only after the initial large drop in losses, and corresponds to phase changes to a degree.
- More superposition leads to marginally smaller losses.
- Sharp, non-negative, asymmetrical activation functions gave the strongest superposition and the lowest losses.
- Low degrees of superposition usually correspond to slower formation of superposition.

The last three correspond to previous findings (see project #2).

Methods

The basis for these findings is the TMS notebook of the devinterp library, together with its example code.
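
Concretely, the manipulation amounts to swapping the output nonlinearity in the standard TMS architecture. Below is a minimal PyTorch sketch of that setup; the model shape follows the usual ReLU-output toy model of superposition, and all names and hyperparameters are illustrative rather than the notebook's exact code:

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Toy model of superposition: y = act(W^T W x + b).

    Follows the standard TMS setup; the devinterp notebook's actual
    code may differ in details.
    """
    def __init__(self, n_features: int = 5, n_hidden: int = 2,
                 activation: nn.Module = nn.ReLU()):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))
        self.activation = activation  # the only thing varied in this project

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        hidden = x @ self.W.T   # compress n_features down to n_hidden
        out = hidden @ self.W   # map back up to feature space
        return self.activation(out + self.b)

# The experimental manipulation is then a one-line swap:
for act in (nn.ReLU(), nn.GELU(), nn.Sigmoid(), nn.Tanh()):
    model = ToyModel(activation=act)
    # ... train on sparse synthetic features, tracking loss and LLC
```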

All resulting notebooks, extracted graphs, tables, and Word files with clean tabular comparisons can be found here.

Results

General

The general observations are those summarized in the TL;DR above.

Specific

Specific observations for the functions, with one representative comparison:

[Figure: GELU (bottom) is mostly just a softer ReLU (top), with slightly reduced and delayed superposition.]
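
To make "degree of superposition" concrete: one simple proxy, in the spirit of the feature-interference measures from the original toy-models work, is the average squared overlap between distinct feature directions in W. This is an illustrative metric, not necessarily the one computed in the notebooks:

```python
import torch

def superposition_score(W: torch.Tensor) -> float:
    """Mean squared interference between distinct features.

    W has shape (n_hidden, n_features); column i embeds feature i.
    0 means all represented features are orthogonal (no superposition);
    larger values mean features increasingly share directions.
    A simple proxy, not the notebooks' exact metric.
    """
    norms = W.norm(dim=0, keepdim=True).clamp_min(1e-8)
    W_hat = W / norms              # unit-normalize each feature column
    overlaps = W_hat.T @ W         # entry (i, j): interference of j with i
    overlaps.fill_diagonal_(0.0)   # ignore each feature's self-overlap
    return overlaps.pow(2).sum(dim=1).mean().item()
```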

Discussion

The development of superposition happens after the initial large drop in losses, and does correspond to phase changes to a degree. Does superposition correspond to flat basins in the loss landscape, without being necessary for their development? And are those flat basins symmetrical and/or deeply entrenched?

Previous findings have been validated, with the addition that low degrees of superposition usually correspond to slower formation as well. Similar considerations and areas for further research apply here too. The one large surprise is the vastly different superposition of the saturating activation functions (sigmoid especially, and potentially SoLU), which remains confusing.

More research is needed, potentially with finer-grained SLT tools (refined applications of the LLC, or others). Otherwise, the LLC behaved reasonably.
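
For reference, LLC estimation in devinterp follows roughly the pattern below. The function and class names come from the library's published examples, but signatures have changed between devinterp versions, so treat the keyword arguments as assumptions to check against the version actually used; the data shapes and hyperparameters are illustrative, and `ToyModel` is the sketch from Methods:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

from devinterp.optim import SGLD
from devinterp.slt import estimate_learning_coeff_with_summary

# Synthetic sparse features, as in the TMS setup: each feature is zero
# with high probability, uniform otherwise. Shapes are illustrative.
x = torch.rand(10_000, 5) * (torch.rand(10_000, 5) < 0.05)
dataset = TensorDataset(x, x)  # autoencoding: inputs are their own targets
loader = DataLoader(dataset, batch_size=512, shuffle=True)

# Assumed to be a (partially) trained ToyModel from the Methods sketch.
model = ToyModel(n_features=5, n_hidden=2)

# SGLD chains sample the local posterior around the current parameters;
# the LLC is estimated from the sampled loss trace. Keyword arguments
# follow the library's examples and may differ across versions.
stats = estimate_learning_coeff_with_summary(
    model,
    loader=loader,
    criterion=F.mse_loss,
    sampling_method=SGLD,
    optimizer_kwargs=dict(lr=1e-4, localization=100.0),
    num_chains=4,
    num_draws=200,
)
print(stats)  # summary statistics of the LLC; exact keys vary by version
```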

Conclusion

The broad conclusions are threefold: more superposition leads to marginally smaller losses; superposition develops after the initial large drop in losses, during phase changes; and sharp, non-negative, asymmetrical activation functions gave the strongest superposition and the lowest losses.

Acknowledgements

I would like to thank the original Timaeus team for starting this research direction, establishing its methods, and writing the relevant code.
