Posts
Alignment Faking in Large Language Models
2024-12-18T17:19:06.665Z
Evaluating Sparse Autoencoders with Board Game Models
2024-08-02T19:50:21.525Z
Addressing Feature Suppression in SAEs
2024-02-16T18:32:51.927Z
Comments
Comment by Benjamin Wright (Benw8888) on SAE reconstruction errors are (empirically) pathological · 2024-03-29T17:20:16.084Z
One explanation for the pathological errors is feature suppression, also called feature shrinkage (link). I'd be interested to see whether the errors are still pathological if you use the fine-tuning method I proposed to fix shrinkage. Your method of fixing the norm of the input is close, but not quite the same.
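To make "shrinkage" concrete, here is a toy single-feature example in pure Python. Everything in it is a hypothetical illustration and an assumption on my part: the 1-D setup, the numbers, the function names, and the specific fix shown (freeze the encoder so the same features fire, drop the L1 term, and refit only the decoder scale on reconstruction loss). It is not necessarily the exact procedure from the linked post. With an L1 coefficient `lam`, the trained encoder outputs roughly `x - lam/2` instead of `x`, and refitting the decoder without the penalty recovers part of the lost magnitude.

```python
# Toy illustration (hypothetical setup): one feature, scalar inputs.
# Encoder: f = relu(w*x + b). Decoder: x_hat = v*f, with v fixed at 1
# (a unit-norm decoder) during the sparse training phase.

def relu(z):
    return z if z > 0.0 else 0.0


def train_encoder(xs, lam, lr=0.05, steps=5000):
    """Gradient descent on mean((x - f)^2 + lam * f) with the decoder fixed at v = 1."""
    w, b = 0.5, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x in xs:
            pre = w * x + b
            if pre > 0.0:  # gradient only flows through active units
                f = pre
                dl_df = -2.0 * (x - f) + lam  # d/df of (x - f)^2 + lam * f
                gw += dl_df * x
                gb += dl_df
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


def finetune_decoder(xs, w, b, lr=0.05, steps=1000):
    """Assumed fix: freeze the encoder, drop the L1 term, fit decoder scale v on MSE only."""
    fs = [relu(w * x + b) for x in xs]
    v = 1.0
    n = len(xs)
    for _ in range(steps):
        gv = sum(-2.0 * (x - v * f) * f for x, f in zip(xs, fs)) / n
        v -= lr * gv
    return v


def mse(xs, w, b, v):
    """Mean squared reconstruction error of the toy SAE."""
    return sum((x - v * relu(w * x + b)) ** 2 for x in xs) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]  # made-up "true" feature activations
lam = 0.5                   # made-up L1 coefficient

w, b = train_encoder(xs, lam)
shrink = sum(x - relu(w * x + b) for x in xs) / len(xs)
print(f"encoder: w={w:.3f}, b={b:.3f}, mean shrinkage={shrink:.3f}")  # close to lam / 2
print(f"MSE with v=1: {mse(xs, w, b, 1.0):.4f}")

v = finetune_decoder(xs, w, b)
print(f"fine-tuned v={v:.3f}, MSE: {mse(xs, w, b, v):.4f}")
```

In this toy, the decoder scale grows above 1 to compensate for the suppressed activations. In a real multi-feature SAE the compensation is per feature and only partial, which is why a fine-tuned SAE can still differ from one whose input norm is simply fixed.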
Comment by Benjamin Wright (Benw8888) on Addressing Feature Suppression in SAEs · 2024-02-16T22:30:23.942Z
The original perplexity of the LLM was ~38 on the OpenWebText slice I used. Thanks for the compliments!
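For reference, perplexity is conventionally defined as the exponential of the mean per-token negative log-likelihood. Assuming that standard definition (the comment does not state it), a perplexity of ~38 translates into an average loss per token as follows:

```latex
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{t=1}^{N} \log p(x_t \mid x_{<t})\right),
\qquad
\mathrm{PPL} \approx 38 \;\Rightarrow\; \text{mean loss} \approx \ln 38 \approx 3.64~\text{nats} \approx 5.25~\text{bits per token}.
```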