Posts
Intro to Multi-Agent Safety
2025-04-13T17:40:41.128Z
Conditional Importance in Toy Models of Superposition
2025-02-02T20:35:38.655Z
Thoughts on Toy Models of Superposition
2025-02-02T13:52:54.505Z
Reflections on ML4Good
2024-11-25T02:40:32.586Z
Comments
Comment by james__p on Conditional Importance in Toy Models of Superposition · 2025-03-22T13:22:58.499Z
Thanks for the thoughts:
- I used the term "importance" since this was the term used in Anthropic's original paper. I agree that (unlike in a real model) my toy scenario doesn't contain sufficient information to deduce the context from the input data.
- I like your phrasing of the task - it does a great job of concisely highlighting the 'Mathematical Intuition for why Conditional Importance "doesn't matter"'
- Interesting that the experiment was helpful for you!
Comment by james__p on Thoughts on Toy Models of Superposition · 2025-03-10T22:08:09.237Z
> Just to check, in the toy scenario, we assume the features in R^n are the coordinates in the default basis. So we have n features X_1, ..., X_n
Yes, that's correct.
> Separately, do you have intuition for why they allow network to learn b too? Why not set b to zero too?
My understanding is that the bias is thought to be useful for two reasons:
- It lets the model output a non-zero value (namely the feature's expected value) for features it chooses not to represent.
- A negative bias lets the model zero out small interference terms by shifting the pre-activations negative, so that the ReLU outputs zero. Empirically, I think that when these toy models exhibit a lot of superposition, the bias vector typically has many negative entries.
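Both points can be seen in a small sketch of the toy model's forward pass, ReLU(W^T W x + b). The weights and bias below are hand-picked for illustration rather than trained, and the function name `toy_model` is just a label for this example:

```python
import numpy as np

def toy_model(x, W, b):
    """Toy-model forward pass: ReLU(W^T W x + b)."""
    h = W @ x                       # compress n features into m hidden dims
    return np.maximum(0.0, W.T @ h + b)

# Hand-picked (not trained) weights: 3 features, 2 hidden dimensions.
# Features 0 and 1 overlap in the hidden space (dot product 0.2);
# feature 2 is not represented at all (zero column).
W = np.array([[1.0, 0.2,           0.0],
              [0.0, np.sqrt(0.96), 0.0]])

x = np.array([1.0, 0.0, 0.0])       # only feature 0 is active

no_bias = toy_model(x, W, np.zeros(3))

# Assumed bias: negative for the represented features, and a small
# positive constant (standing in for an expected value) for feature 2.
b = np.array([-0.25, -0.25, 0.05])
with_bias = toy_model(x, W, b)

print(no_bias)    # feature 1 picks up 0.2 of interference from feature 0
print(with_bias)  # interference is clipped to 0; feature 2 outputs its bias
```

Note the trade-off in this sketch: the negative bias also shrinks the output for the active feature (from 1 to 0.75), so removing interference this way is not free.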
Comment by james__p on Conditional Importance in Toy Models of Superposition · 2025-02-13T13:51:22.817Z
Yeah, I agree that, with hindsight, the conclusion could be better explained and motivated from first principles rather than by running an experiment. I wrote this post in the order in which I actually tried things, as I wanted to give an honest walkthrough of the process that led me to the conclusion, but I can appreciate that this doesn't optimise for ease of reading.