Posts

Evolutionary prompt optimization for SAE feature visualization 2024-11-14T13:06:49.728Z
SAE features for refusal and sycophancy steering vectors 2024-10-12T14:54:48.022Z
Extracting SAE task features for in-context learning 2024-08-12T20:34:13.747Z
Self-explaining SAE features 2024-08-05T22:20:36.041Z

Comments

Comment by Dmitrii Kharlapenko (dmitrii-kharlapenko) on Self-explaining SAE features · 2024-08-06T13:56:43.849Z

By "input features", do you mean the SAE encoder weights? We did not look into them.

Comment by Dmitrii Kharlapenko (dmitrii-kharlapenko) on Self-explaining SAE features · 2024-08-06T13:55:07.530Z

Thanks! We did try to use it in the repeat setting to make the model produce more than a single token, but it did not work well.

As far as I remember, it also did not improve the meaning prompt much.