Posts

Alignment Faking in Large Language Models 2024-12-18T17:19:06.665Z
Evaluating Sparse Autoencoders with Board Game Models 2024-08-02T19:50:21.525Z
Addressing Feature Suppression in SAEs 2024-02-16T18:32:51.927Z

Comments

Comment by Benjamin Wright (Benw8888) on SAE reconstruction errors are (empirically) pathological · 2024-03-29T17:20:16.084Z · LW · GW

One explanation for pathological errors is feature suppression/feature shrinkage (link). I'd be interested to see whether the errors are still pathological when you use the finetuning methodology I proposed to fix shrinkage. Your method of fixing the norm of the input is close to it, but not quite the same.
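
To illustrate why matching norms is not quite the same as undoing shrinkage, here is a minimal toy sketch. It is not the finetuning methodology from the post. It assumes an idealized SAE with orthonormal feature directions, where an L1 penalty pulls each active feature's activation down by a constant amount (soft thresholding). The function names and the numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

def l1_optimal_activations(true_acts, l1_coeff):
    """Per-feature minimizer of (a - f)^2 + l1_coeff * f subject to f >= 0.

    Setting the derivative to zero gives f = a - l1_coeff / 2, clipped at 0,
    so every active feature is shrunk by the same absolute amount.
    """
    return np.maximum(true_acts - l1_coeff / 2.0, 0.0)

def reconstruct(acts, directions):
    """Decode activations as a weighted sum of feature directions (rows)."""
    return acts @ directions

def rescale_to_input_norm(x_hat, x):
    """Rescale the reconstruction so its norm matches the input's norm."""
    norm = np.linalg.norm(x_hat)
    return x_hat if norm == 0 else x_hat * (np.linalg.norm(x) / norm)

# Toy assumption: two orthonormal features with different true activations.
directions = np.eye(2)
true_acts = np.array([4.0, 1.0])
x = reconstruct(true_acts, directions)

shrunk = l1_optimal_activations(true_acts, l1_coeff=1.0)
x_hat = reconstruct(shrunk, directions)
x_rescaled = rescale_to_input_norm(x_hat, x)

err_raw = np.linalg.norm(x - x_hat)
err_rescaled = np.linalg.norm(x - x_rescaled)
print(shrunk, err_raw, err_rescaled)
```

In this toy setup the shrinkage is additive per feature, so a single global rescaling restores the norm and reduces the error, but it cannot restore each feature's activation exactly: the large feature ends up overshooting while the small one stays too small.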

Comment by Benjamin Wright (Benw8888) on Addressing Feature Suppression in SAEs · 2024-02-16T22:30:23.942Z · LW · GW

The original perplexity of the LLM was ~38 on the slice of OpenWebText I used. Thanks for the compliments!