Posts

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders 2024-12-11T06:30:37.076Z
Evaluating Sparse Autoencoders with Board Game Models 2024-08-02T19:50:21.525Z
OthelloGPT learned a bag of heuristics 2024-07-02T09:12:56.377Z
Past Tense Features 2024-04-20T14:34:23.760Z
An adversarial example for Direct Logit Attribution: memory management in gelu-4l 2023-08-30T17:36:59.034Z
Understanding mesa-optimization using toy models 2023-05-07T17:00:52.620Z
Safety of Self-Assembled Neuromorphic Hardware 2022-12-26T18:51:26.163Z

Comments

Comment by Can (Can Rager) on Understanding mesa-optimization using toy models · 2023-05-13T12:39:23.902Z · LW · GW

Thanks for pointing this out – indeed, our phrasing is quite unclear. The original paragraph was trying to say that our "system" (a transformer trained to find shortest paths via SGD) may learn "alternative objectives" which don't generalize (aren't "desirable" from our perspective), but which achieve the same loss (are "rewarding").

To be clear, the point we want to make here is that models capable of performing search are relevant for understanding mesa-optimization, because search requires iterative reasoning with subgoal evaluation.

In the context of solving mazes, we may hope to understand how mesa-optimization arises and can become "misaligned": either through the formation of non-general reasoning steps (reliance on heuristics or overfitted goals), or through a failure to retarget.

Concretely, we can imagine the network learning to reach the <END_TOKEN> at train time, but failing to generalize at test time because it has instead learned a goal that was an artefact of our training process. For example, it may have learned to go to the top right corner (where the <END_TOKEN> happened to be during training).
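To make this failure mode concrete, here is a minimal toy sketch (not the actual transformer setup from the post): a hard-coded "policy" that has internalized the proxy goal "go to the top-right corner". On a training distribution where the goal always happened to sit in that corner, it looks perfectly competent; once the goal moves, it fails. The grid coordinates and helper names here are illustrative assumptions, not anything from our experiments.

```python
# Illustrative sketch of goal misgeneralization in a gridworld.
# The "policy" below ignores the goal entirely and always heads for
# the top-right corner -- the proxy goal it "learned" during training.

def proxy_policy(pos, grid_size):
    """Move one step toward the top-right corner (row 0, last column)."""
    r, c = pos
    if c < grid_size - 1:
        return (r, c + 1)  # move right
    if r > 0:
        return (r - 1, c)  # move up
    return pos             # already at the top-right corner

def reaches_goal(start, goal, grid_size, max_steps=50):
    """Roll out the proxy policy and check whether it ever hits the goal."""
    pos = start
    for _ in range(max_steps):
        if pos == goal:
            return True
        pos = proxy_policy(pos, grid_size)
    return pos == goal

# Train-time distribution: the goal always happened to be the
# top-right corner, so the proxy goal coincides with the true goal.
print(reaches_goal(start=(2, 2), goal=(0, 4), grid_size=5))  # True

# Test-time: the goal has moved, the proxy goal no longer matches,
# and the policy confidently walks to the wrong place.
print(reaches_goal(start=(2, 2), goal=(4, 0), grid_size=5))  # False
```

Both behaviours achieve identical training loss, which is exactly why loss alone cannot distinguish the intended goal from the proxy.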