benjamin-wright

Posts

Alignment Faking in Large Language Models 2024-12-18T17:19:06.665Z

Evaluating Sparse Autoencoders with Board Game Models 2024-08-02T19:50:21.525Z

Addressing Feature Suppression in SAEs 2024-02-16T18:32:51.927Z

Comments

Comment by Benjamin Wright (Benw8888) on SAE reconstruction errors are (empirically) pathological · 2024-03-29T17:20:16.084Z

One explanation for pathological errors is feature suppression/feature shrinkage (link). I'd be interested to see if errors are still pathological even if you use the methodology I proposed for finetuning to fix shrinkage. Your method of fixing the norm of the input is close but not quite the same.

Comment by Benjamin Wright (Benw8888) on Addressing Feature Suppression in SAEs · 2024-02-16T22:30:23.942Z

The original perplexity of the LLM was ~38 on the open web text slice I used. Thanks for the compliments!
