Posts

Intro to Multi-Agent Safety 2025-04-13T17:40:41.128Z
Conditional Importance in Toy Models of Superposition 2025-02-02T20:35:38.655Z
Thoughts on Toy Models of Superposition 2025-02-02T13:52:54.505Z
Reflections on ML4Good 2024-11-25T02:40:32.586Z

Comments

Comment by james__p on Conditional Importance in Toy Models of Superposition · 2025-03-22T13:22:58.499Z · LW · GW

Thanks for the thoughts --

  • I used the term "importance" since this was the term used in Anthropic's original paper. I agree that (unlike in a real model) my toy scenario doesn't contain sufficient information to deduce the context from the input data.
  • I like your phrasing of the task: it does a great job of concisely highlighting the 'Mathematical Intuition for why Conditional Importance "doesn't matter"'.
  • Interesting that the experiment was helpful for you!
Comment by james__p on Thoughts on Toy Models of Superposition · 2025-03-10T22:08:09.237Z · LW · GW

> Just to check, in the toy scenario, we assume the features in R^n are the coordinates in the default basis. So we have n features X_1, ..., X_n.

Yes, that's correct.

> Separately, do you have an intuition for why they allow the network to learn b too? Why not set b to zero as well?

My understanding is that the bias is thought to be useful for two reasons:

  • It is preferable to be able to output a non-zero value for features the model chooses not to represent (namely their expected values)
  • Negative bias allows the model to zero out small interferences, by shifting the values negative such that the ReLU outputs zero (see the sketch after this list). I think empirically, when these toy models are exhibiting lots of superposition, the bias vector typically has many negative entries.
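
To make that second point concrete, here is a minimal numpy sketch of the toy model's reconstruction ReLU(W^T W x + b), with five features superposed into two hidden dimensions in a pentagon arrangement. The specific numbers are illustrative only, not taken from the post or the paper:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Illustrative setup: n = 5 features superposed into m = 2 hidden dimensions,
# with the feature directions arranged as a regular pentagon.
n, m = 5, 2
angles = 2 * np.pi * np.arange(n) / n
W = np.stack([np.cos(angles), np.sin(angles)])  # shape (m, n), unit-norm columns

x = np.zeros(n)
x[0] = 1.0                                      # only feature 0 is active

h = W @ x                                       # hidden activation, shape (m,)
pre = W.T @ h                                   # pre-activation, shape (n,)
print(np.round(pre, 2))                         # [ 1.    0.31 -0.81 -0.81  0.31]

# Neighbouring features pick up a spurious +0.31 of interference.
# A slightly negative bias shifts everything down so the ReLU clips that
# interference to zero, while the genuinely active feature survives
# (slightly shrunk).
b = np.full(n, -0.32)
print(np.round(relu(pre), 2))                   # [1.   0.31 0.   0.   0.31]
print(np.round(relu(pre + b), 2))               # [0.68 0.   0.   0.   0.  ]
```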
Comment by james__p on Conditional Importance in Toy Models of Superposition · 2025-02-13T13:51:22.817Z · LW · GW

Yeah, I agree that with hindsight, the conclusion could be better explained and motivated from first principles, rather than by running an experiment. I wrote this post in the order in which I actually tried things, as I wanted to give an honest walkthrough of the process that led me to the conclusion, but I can appreciate that it isn't optimised for ease of reading.