Posts

Base LLMs refuse too 2024-09-29T16:04:21.343Z
SAEs (usually) Transfer Between Base and Chat Models 2024-07-18T10:29:46.138Z
Attention Output SAEs Improve Circuit Analysis 2024-06-21T12:56:07.969Z
We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To 2024-03-06T05:03:09.639Z
Attention SAEs Scale to GPT-2 Small 2024-02-03T06:50:22.583Z
Sparse Autoencoders Work on Attention Layer Outputs 2024-01-16T00:26:14.767Z

Comments

Comment by Connor Kissane (ckkissane) on Base LLMs refuse too · 2024-09-30T01:38:37.614Z

LLaMA 1 7B definitely seems to be a "pure base model". I agree that we have less transparency into the pre-training of Gemma 2 and Qwen 1.5, and I'll add this as a limitation, thanks!

I've checked that Pythia 12b deduped (pre-trained on the Pile) also refuses harmful requests, although at a lower rate (13%). Here's an example, using the following prompt template:

"""User: {instruction}

Assistant:"""

It's pretty dumb though, and often just outputs nonsense. When I give it the Vicuna system prompt, it refuses 100% of harmful requests, though it has a bunch of "incompetent refusals", similar to LLaMA 1 7B:

"""A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

USER: {instruction}

ASSISTANT:"""

Comment by Connor Kissane (ckkissane) on Sparse Autoencoders Work on Attention Layer Outputs · 2024-05-17T08:53:00.639Z

Thanks for the comment! We always use the pre-ReLU feature activation, which is equal to the post-ReLU activation (given that the feature is active), and is a purely linear function of z. Edited the post for clarity.
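
Concretely, here is a minimal sketch of why this holds, using the standard SAE encoder form; the variable names, dimensions, and random weights are illustrative assumptions, not the post's exact code:

```python
import torch

d_in, d_sae = 512, 4096                  # illustrative dimensions
W_enc = torch.randn(d_in, d_sae)         # assumed standard encoder parameters
b_enc = torch.randn(d_sae)
b_dec = torch.randn(d_in)

z = torch.randn(d_in)                    # attention-output activation fed to the SAE

pre_relu = (z - b_dec) @ W_enc + b_enc   # affine in z: the "purely linear function of z" above
post_relu = torch.relu(pre_relu)

# Wherever a feature is active (pre_relu > 0), pre- and post-ReLU activations agree.
active = pre_relu > 0
assert torch.equal(pre_relu[active], post_relu[active])
```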

Comment by Connor Kissane (ckkissane) on SAE-VIS: Announcement Post · 2024-03-31T16:35:06.004Z

Amazing! We found your original library super useful for our Attention SAEs research, so thanks for making this!

Comment by Connor Kissane (ckkissane) on Mech Interp Puzzle 1: Suspiciously Similar Embeddings in GPT-Neo · 2023-08-14T14:20:07.795Z

These puzzles are great, thanks for making them!

Comment by Connor Kissane (ckkissane) on Causal scrubbing: results on induction heads · 2023-07-19T19:57:54.463Z

"Code for this token filtering can be found in the appendix and the exact token list is linked."

Maybe I just missed it, but I'm not seeing this. Is the code still available?