Posts

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders 2024-12-11T06:30:37.076Z
Evolutionary prompt optimization for SAE feature visualization 2024-11-14T13:06:49.728Z
SAEs are highly dataset dependent: a case study on the refusal direction 2024-11-07T05:22:18.807Z
Open Source Replication of Anthropic’s Crosscoder paper for model-diffing 2024-10-27T18:46:21.316Z
SAE features for refusal and sycophancy steering vectors 2024-10-12T14:54:48.022Z
Base LLMs refuse too 2024-09-29T16:04:21.343Z
Extracting SAE task features for in-context learning 2024-08-12T20:34:13.747Z
Self-explaining SAE features 2024-08-05T22:20:36.041Z
JumpReLU SAEs + Early Access to Gemma 2 SAEs 2024-07-19T16:10:54.664Z
SAEs (usually) Transfer Between Base and Chat Models 2024-07-18T10:29:46.138Z
Attention Output SAEs Improve Circuit Analysis 2024-06-21T12:56:07.969Z
Improving Dictionary Learning with Gated Sparse Autoencoders 2024-04-25T18:43:47.003Z
[Full Post] Progress Update #1 from the GDM Mech Interp Team 2024-04-19T19:06:59.185Z
[Summary] Progress Update #1 from the GDM Mech Interp Team 2024-04-19T19:06:17.755Z
We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To 2024-03-06T05:03:09.639Z
Attention SAEs Scale to GPT-2 Small 2024-02-03T06:50:22.583Z
Sparse Autoencoders Work on Attention Layer Outputs 2024-01-16T00:26:14.767Z
My best guess at the important tricks for training 1L SAEs 2023-12-21T01:59:06.208Z
[Paper] All's Fair In Love And Love: Copy Suppression in GPT-2 Small 2023-10-13T18:32:02.376Z
Three ways interpretability could be impactful 2023-09-18T01:02:30.529Z
Mechanistically interpreting time in GPT-2 small 2023-04-16T17:57:52.637Z
RLHF does not appear to differentially cause mode-collapse 2023-03-20T15:39:45.353Z
OpenAI introduce ChatGPT API at 1/10th the previous $/token 2023-03-01T20:48:51.636Z
Arthur Conmy's Shortform 2022-11-01T21:35:29.449Z
Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small 2022-10-28T23:55:44.755Z

Comments

Comment by Arthur Conmy (arthur-conmy) on Sam Marks's Shortform · 2024-12-18T02:25:57.597Z · LW · GW

I suggested something similar, and this was the discussion (bolding is the important author pushback):
 

Arthur Conmy

11:33 1 Dec

Why can't the YC company skip system prompts and instead:

1) Detect whether regex has been used in the last ~100 tokens (and run this check every ~100 tokens of model output)

2) If yes, rewind back ~100 tokens, insert a comment like # Don't use regex here (in a valid way given what code has been written so far), and continue the generation -- roughly the loop sketched at the end of this comment

Dhruv Pai

10:50 2 Dec

This seems like a reasonable baseline with the caveat that it requires expensive resampling and inserting such a comment in a useful way is difficult. 

When we ran baselines simply repeating the number of times we told the model not to use regex right before generation in the system prompt, we didn't see the instruction following improve (very circumstantial evidence). I don't see a principled reason why this would be much worse than the above, however, since we do one-shot generation with such a comment right before the actual generation.
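
For concreteness, here is a rough sketch of the rewind-and-insert loop from 1) and 2) above (purely illustrative; `generate_tokens` is a stand-in for whatever generation API is actually available, not a real one):

```python
import re

CHUNK = 100  # check every ~100 tokens of model output
REGEX_USE = re.compile(r"\bimport re\b|\bre\.(search|match|findall|sub|compile)\b")

def insert_no_regex_comment(code: str) -> str:
    """Append the steering comment on its own line (a real implementation
    would need to place it validly given the partially written code)."""
    return code.rstrip("\n") + "\n# Don't use regex here\n"

def generate_without_regex(generate_tokens, prompt: str, n_chunks: int = 10) -> str:
    """Sketch of the rewind-and-insert loop. `generate_tokens(text, n)` is a
    hypothetical callable that continues `text` by roughly n tokens and
    returns only the newly generated text."""
    text = prompt
    for _ in range(n_chunks):
        chunk = generate_tokens(text, CHUNK)  # hypothetical generation API
        if REGEX_USE.search(chunk):
            # Rewind: discard the chunk and steer with an inline comment,
            # then let generation continue from the edited prefix instead.
            text = insert_no_regex_comment(text)
        else:
            text += chunk
    return text
```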

Comment by Arthur Conmy (arthur-conmy) on IAPS: Mapping Technical Safety Research at AI Companies · 2024-11-02T11:32:23.163Z · LW · GW
Comment by Arthur Conmy (arthur-conmy) on Bridging the VLM and mech interp communities for multimodal interpretability · 2024-10-29T13:48:36.673Z · LW · GW

The [Sparse Feature Circuits] approach can be seen as analogous to LoRA (Hu et al., 2021), in that you are constraining your model's behavior

 

FWIW I consider SFC and LoRA pretty different: LoRA is cheap and practical, but it can be reversed very easily and has poor worst-case performance. Sparse Feature Circuits, on the other hand, is very expensive, and requires far more nodes in bigger models (forthcoming, I think) or restricting to a subset of layers, but if it worked it would likely have far better worst-case performance.

This makes LoRA a good baseline for some SFC-style tasks, but the research experience using both is pretty different. 

Comment by Arthur Conmy (arthur-conmy) on IAPS: Mapping Technical Safety Research at AI Companies · 2024-10-26T20:54:16.399Z · LW · GW

I assume all the data is fairly noisy: scanning for the domain I know in https://raw.githubusercontent.com/Oscar-Delaney/safe_AI_papers/refs/heads/main/Automated%20categorization/final_output.csv, it misses ~half of the GDM Mech Interp output from the specified window, and it also mislabels https://arxiv.org/abs/2208.08345 and https://arxiv.org/abs/2407.13692 as Mech Interp (though two labels are applied to these papers and I didn't dig into which was used).

Comment by Arthur Conmy (arthur-conmy) on Mark Xu's Shortform · 2024-10-10T16:49:54.523Z · LW · GW

> think hard about how joining a scaling lab might inhibit their future careers by e.g. creating a perception they are “corrupted”

Does this mean something like:

1. People who join scaling labs can have their values drift, and future safety employers will suspect by default that ex-scaling-lab staff have had their values drift, or
 

2. If there is a non-existential AGI disaster, scaling lab staff will be looked down upon

or something else entirely?

Comment by Arthur Conmy (arthur-conmy) on The Geometry of Feelings and Nonsense in Large Language Models · 2024-10-06T20:57:27.300Z · LW · GW

This is a great write-up, thanks! Has there been any follow-up from the paper's authors?

This seems a pretty compelling takedown to me, and one not addressed by the existing paper. (My understanding of the two WordNet experiments not discussed in the post: Figure 4 concerns whether, under whitening, a concept can be linearly separated (yes), so the random baseline used there does not address the concerns in this post; Figure 5 shows that the whitening transformation preserves some of the WordNet cluster cosine similarity, but moreover, on the right, basically everything is orthogonal, as found in this post.)

This seems important to me since the point of mech interp is not to be another interpretability field dominated by pretty pictures (e.g. saliency maps) that fail basic sanity checks (e.g. this paper for saliency maps). (Workshops aren't too important, but I'm still surprised about this.)

Comment by Arthur Conmy (arthur-conmy) on Base LLMs refuse too · 2024-10-01T16:38:42.085Z · LW · GW

My current best guess for why base models refuse so much is that "Sorry, I can't help with that. I don't know how to" is actually extremely common on the internet, based on discussion with Achyuta Rajaram on twitter: https://x.com/ArthurConmy/status/1840514842098106527

This fits with our observations about how frequently LLaMA-1 performs incompetent refusal.

Comment by Arthur Conmy (arthur-conmy) on Base LLMs refuse too · 2024-09-30T16:01:43.670Z · LW · GW

> Qwen2 was explicitly trained on synthetic data from Qwen1.5

~~Where is the evidence for this claim? (Claude 3.5 Sonnet could also not find evidence on one rollout)~~
EDITED TO ADD: "these [Qwen] models are utilized to synthesize high-quality pre-training data" is clear evidence, I am being silly.


All other techniques mentioned here (e.g. filtering and adding more IT data at the end of training) still sound like models "trained to predict the next word on the internet" (I don't think the training samples being IID early and late in training is an important detail).

Comment by Arthur Conmy (arthur-conmy) on peterbarnett's Shortform · 2024-09-26T22:54:51.190Z · LW · GW

The Improbability Principle sounds close. The summary seems to suggest law of large numbers is one part of the pop science book, but admittedly some of the other parts ("probability lever") seem less relevant

Comment by Arthur Conmy (arthur-conmy) on AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work · 2024-08-24T10:11:17.611Z · LW · GW

Is DM exploring this sort of stuff?

 

Yes. On the AGI safety and alignment team we are working on activation steering -- e.g. Alex Turner, who invented the technique with collaborators, is working on this, and the first author of "a few tokens deep" is currently interning on the Gemini Safety team mentioned in this post. We don't have sharp and fast lines between what counts as Gemini Safety and what counts as AGI safety and alignment, but several projects on AGI safety and alignment, and most projects on Gemini Safety, would see "safety practices we can test right now" as a research goal.

Comment by Arthur Conmy (arthur-conmy) on Daniel Tan's Shortform · 2024-07-28T10:22:54.791Z · LW · GW

I would say a better reference for the limitations of ROME is this paper: https://aclanthology.org/2023.findings-acl.733

Short explanation (Neel's short summary): editing in the Rome fact will also make slightly related prompts, e.g. "The Louvre is cool. Obama was born in" ..., be completed with " Rome" too.

Comment by Arthur Conmy (arthur-conmy) on You should go to ML conferences · 2024-07-25T13:54:49.419Z · LW · GW

I agree that twitter is a worse use of time.

Going to posters for works you already know, to talk to the authors, seems a great idea and I do it. Re-reading your OP, you suggest things like checking whether papers are fake or not in poster sessions. Maybe you just meant papers that you already knew about? It sounded as if you were suggesting doing this for random papers, which I'm more skeptical about.

Comment by Arthur Conmy (arthur-conmy) on You should go to ML conferences · 2024-07-25T12:03:27.810Z · LW · GW

My opinion is that going to poster sessions, orals, pre-researching papers etc. at ICML/ICLR/NeurIPS is pretty valuable for new researchers and I wish I had done this before having any papers (you don't need to have any papers to go to a conference). See also Thomas Kwa's comment about random intuitions learnt from going to a conference.

After this, I agree with Leo: I think it would be a waste of my time to go to posters/orals or to pre-research papers. Maybe there's some value in this for conceptual research, but for most empirical work I'm very skeptical (most papers are not good, but it takes time to figure out whether a paper is good or not, etc.).

Comment by Arthur Conmy (arthur-conmy) on Efficient Dictionary Learning with Switch Sparse Autoencoders · 2024-07-23T11:06:21.132Z · LW · GW

If there are some very common features in particular layers (e.g. an 'attend to BOS' feature), then restricting one expert to be active at a time will potentially force SAEs to learn common features in every expert. 

 

+1 to similar concerns -- I would have probably left one expert always on. This should remove some redundant features.
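
To make "one expert always on" concrete, here is a minimal sketch (my own illustration in PyTorch, not the post's implementation): a shared encoder/decoder pair that always contributes, plus exactly one routed expert per token, so very common features like "attend to BOS" can live in the shared expert instead of being duplicated across every expert.

```python
import torch
import torch.nn as nn

class SharedExpertSwitchSAE(nn.Module):
    """Sketch of a Switch SAE with one always-on shared expert."""

    def __init__(self, d_model: int, d_expert: int, n_experts: int):
        super().__init__()
        self.shared_enc = nn.Linear(d_model, d_expert)  # always active
        self.shared_dec = nn.Linear(d_expert, d_model, bias=False)
        self.expert_enc = nn.ModuleList([nn.Linear(d_model, d_expert) for _ in range(n_experts)])
        self.expert_dec = nn.ModuleList([nn.Linear(d_expert, d_model, bias=False) for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Shared expert: contributes to every reconstruction.
        recon = self.shared_dec(torch.relu(self.shared_enc(x)))
        # Routed expert: exactly one per token, as in a Switch layer.
        idx = self.router(x).argmax(dim=-1)
        for i in range(len(self.expert_enc)):
            mask = (idx == i).unsqueeze(-1).to(x.dtype)
            recon = recon + mask * self.expert_dec[i](torch.relu(self.expert_enc[i](x)))
        return recon
```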

Comment by Arthur Conmy (arthur-conmy) on Breaking Circuit Breakers · 2024-07-18T08:00:30.283Z · LW · GW

Relevant further context: Gray Swan's Cygnet-8B Llama finetune (which uses circuit breakers and probably other safety training too, and had impressive-seeming 0.0 scores in some red-teaming evals in the paper) was jailbroken in 3 hours: https://x.com/elder_plinius/status/1813670661270786137

Comment by Arthur Conmy (arthur-conmy) on Breaking Circuit Breakers · 2024-07-15T17:46:29.472Z · LW · GW

My takeaway from the blog post was that circuit breakers have fairly simple vulnerabilities. Since circuit breakers are an adversarial robustness method (not a capabilities method) I think you can update on the results of single case studies (i.e. worst case evaluations rather than average case evaluations).

Comment by Arthur Conmy (arthur-conmy) on shortplav · 2024-07-10T19:23:29.267Z · LW · GW

Geminis generally search the internet, which is why they show up on LmSys without a knowledge cutoff date. Even when there's no source attached, the model still knows information from 4 days ago via the internet (image attached). But I think in your response the [1] shows the model did find an internet source for the GUID anyway??

Unless you're using the API here and the model is being weird? Without internet access, I expect it's possible to coax the string out but the model refuses requests a lot so I think it would require a bit of elbow grease.

Comment by Arthur Conmy (arthur-conmy) on GPT-2's positional embedding matrix is a helix · 2024-07-10T00:58:21.040Z · LW · GW

Mistral and Pythia use rotary embeddings and don't have a positional embedding matrix. Which matrix are you looking at for those two models?

Comment by Arthur Conmy (arthur-conmy) on Language Models Model Us · 2024-05-19T01:05:38.367Z · LW · GW

They emailed some people about this: https://x.com/brianryhuang/status/1763438814515843119

The reason is that it may allow unembedding matrix weight stealing: https://arxiv.org/abs/2403.06634

Comment by Arthur Conmy (arthur-conmy) on Transcoders enable fine-grained interpretable circuit analysis for language models · 2024-05-17T23:33:30.827Z · LW · GW

they [transcoders] take as input the pre-MLP activations, and then aim to represent the post-MLP activations of that MLP sublayer

I assumed this meant activations just before GELU and just after GELU, but looking at the code I think I was wrong. Could you rephrase to e.g.

they take as input MLP block inputs (just after LayerNorm) and they output MLP block outputs (what is added to the residual stream)
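
To spell out that convention, a minimal sketch of a transcoder under this reading (illustrative, not the post's code):

```python
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    """Sketch: maps the MLP block's input (the residual stream just after
    LayerNorm) to the MLP block's output (the vector added back into the
    residual stream), via a wide sparse hidden layer."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, mlp_block_input: torch.Tensor) -> torch.Tensor:
        features = torch.relu(self.encoder(mlp_block_input))  # sparse-ish features
        return self.decoder(features)  # approximates what the MLP adds to the residual stream
```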

Comment by Arthur Conmy (arthur-conmy) on Improving Dictionary Learning with Gated Sparse Autoencoders · 2024-05-01T01:06:04.838Z · LW · GW

Ah yeah, Neel's comment makes no claims about feature death beyond Pythia-2.8B residual streams. I trained 524K-width Pythia-2.8B MLP SAEs with <5% feature death (not in the paper), and Anthropic's work gets to >1M live features (with no claims about interpretability), which together would make me surprised if 131K was near the maximum number of possible live features even in small models.

Comment by Arthur Conmy (arthur-conmy) on Improving Dictionary Learning with Gated Sparse Autoencoders · 2024-05-01T01:02:38.982Z · LW · GW

I don't think zero ablation is that great a baseline. We're mostly using it for continuity's sake with Anthropic's prior work (and also it's a bit easier to explain than a mean ablation baseline which requires specifying where the mean is calculated from). In the updated paper https://arxiv.org/pdf/2404.16014v2 (up in a few hours) we show all the CE loss numbers for anyone to scale how they wish.

I don't think compute efficiency hit[1] is ideal. It's really expensive to compute, since you can't just calculate it from an SAE alone as you need to know facts about smaller LLMs. It also doesn't transfer as well between sites (splicing in an attention layer SAE doesn't impact loss much, splicing in an MLP SAE impacts loss more, and residual stream SAEs impact loss the most). Overall I expect it's a useful expensive alternative to loss recovered, not a replacement.

EDIT: on consideration of Leo's reply, I think my point about transfer is wrong; a metric like "compute efficiency recovered" could always be created by rescaling the compute efficiency number.

  1. ^

    What I understand "compute efficiency hit" to mean is: for a given (SAE, LM) pair, how much less compute you'd need (as a multiplier) to train a different LM such that it gets the same loss as the original LM with the SAE spliced in.
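
    In symbols, the convention I have in mind for the two metrics (my own paraphrase, so take the exact normalization as an assumption rather than the paper's definition):

    $$\text{loss recovered} = \frac{\mathrm{CE}_{\text{zero abl}} - \mathrm{CE}_{\text{SAE spliced}}}{\mathrm{CE}_{\text{zero abl}} - \mathrm{CE}_{\text{clean}}}, \qquad \text{compute efficiency hit} = \frac{C(\mathrm{LM})}{C(\mathrm{LM}')},$$

    where $\mathrm{CE}$ is cross-entropy loss on the evaluation distribution and $\mathrm{LM}'$ is the language model that, trained from scratch with compute $C(\mathrm{LM}')$, matches the loss of the LM with the SAE spliced in.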

Comment by Arthur Conmy (arthur-conmy) on Improving Dictionary Learning with Gated Sparse Autoencoders · 2024-05-01T00:49:14.596Z · LW · GW

I'm not sure what you mean by "the reinitialization approach" but feature death doesn't seem to be a major issue at the moment. At all sites besides L27, our Gemma-7B SAEs didn't have much feature death at all (stats at https://arxiv.org/pdf/2404.16014v2 up in a few hours), and also the Anthropic update suggests even in small models the problem can be addressed.

Comment by Arthur Conmy (arthur-conmy) on Refusal in LLMs is mediated by a single direction · 2024-04-30T22:58:00.611Z · LW · GW

The "This should be cited" part of Dan H's comment was edited in after the author's reply. I think this is in bad faith since it masks an accusation of duplicate work as a request for work to be cited.

On the other hand the post's authors did not act in bad faith since they were responding to an accusation of duplicate work (they were not responding to a request to improve the work).

(The authors made me aware of this fact)

Comment by Arthur Conmy (arthur-conmy) on Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers · 2024-04-29T21:25:51.456Z · LW · GW

Awesome work! I notice I am surprised that this just worked with only 1M datapoints (we use 1000x this with LMs, even small ones), without needing any new techniques, and that it produced subjectively extremely abstract features (IMO).

It would be nice if the "guess the image" game were presented as a result rather than a fun side thing in this post. AFAICT that's the only interpretability result that can't be critiqued as cherry-picked. You should state front and center that the top features for arbitrary images are basically interpretable -- it's a great result!

Comment by Arthur Conmy (arthur-conmy) on Improving Dictionary Learning with Gated Sparse Autoencoders · 2024-04-29T18:17:07.159Z · LW · GW

Thanks for the feedback -- we will put up an update to the paper with all these numbers in tables tomorrow night. For now I have sent them to you (and can send them to anyone else who wants them in the next 24h).

Comment by Arthur Conmy (arthur-conmy) on Refusal in LLMs is mediated by a single direction · 2024-04-28T23:57:27.906Z · LW · GW

+1 to Neel. We just fixed a release bug and now pip install transformer-lens should install 1.16.0 (worked in a colab for me)

Comment by Arthur Conmy (arthur-conmy) on Refusal in LLMs is mediated by a single direction · 2024-04-28T16:59:53.586Z · LW · GW

I think this discussion is sad, since it seems both sides assume bad faith from the other side. On one hand, I think Dan H and Andy Zou have improved the post by suggesting writing about related work, and signal-boosting the bypassing refusal result, so should be acknowledged in the post (IMO) rather than downvoted for some reason. I think that credit assignment was originally done poorly here (see e.g. "Citing others" from this Chris Olah blog post), but the authors resolved this when pushed.

But on the other hand, "Section 6.2 of the RepE paper shows exactly this" and accusations of plagiarism seem wrong @Dan H. Changing experimental setups and scaling them to larger models is valuable original work.

(Disclosure: I know all authors of the post, but wasn't involved in this project)

(ETA: I added the word "bypassing". Typo.)

Comment by Arthur Conmy (arthur-conmy) on Improving Dictionary Learning with Gated Sparse Autoencoders · 2024-04-27T00:03:26.936Z · LW · GW

We use learning rate 0.0003 for all Gated SAE experiments, and also the GELU-1L baseline experiment. We swept for optimal baseline learning rates on GELU-1L for the baseline SAE to generate this value. 

For the Pythia-2.8B and Gemma-7B baseline SAE experiments, we divided the L2 loss by a normalization factor, motivated by wanting better hyperparameter transfer, and so changed the learning rate to 0.001 or 0.00075 for all the runs (currently in Figure 1, only attention output pre-linear uses 0.00075; in the re-release we'll state all the values used). We didn't see a noticeable difference in the Pareto frontier between 0.001 and 0.00075, so we did not sweep the baseline hyperparameter further than this.

Comment by Arthur Conmy (arthur-conmy) on Improving Dictionary Learning with Gated Sparse Autoencoders · 2024-04-25T21:29:38.307Z · LW · GW

Oh oops, thanks so much. We'll update the paper accordingly. Nit: it's actually 



(it's just minimizing a quadratic)

ETA: the reason we have complicated equations is that we didn't compute  during training (this quantity is kinda weird). However, you can compute  from quantities that are usually tracked in SAE training. Specifically,  and all terms here are clearly helpful to track in SAE training.

Comment by Arthur Conmy (arthur-conmy) on [Full Post] Progress Update #1 from the GDM Mech Interp Team · 2024-04-21T16:24:37.199Z · LW · GW

We haven't tried this yet. Thanks, that's a good hypothesis.

I suspect that the mean centering paper https://arxiv.org/abs/2312.03813 is just cancelling the high frequency features, and if so this is a good explanation for why taking differences is important in activation steering. 

(Though it doesn't explain why the SAEs learn several high frequency features when trained on the residual stream)
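
For concreteness, by "taking differences" I mean the usual contrastive construction (a sketch under my assumptions, not the paper's code); any direction that is equally active on both prompt sets, such as an always-firing high-frequency feature, cancels in the subtraction:

```python
import torch

def difference_of_means_steering_vector(acts_pos: torch.Tensor, acts_neg: torch.Tensor) -> torch.Tensor:
    """Sketch: contrastive steering vector from two sets of residual-stream
    activations (e.g. behaviour-positive vs behaviour-negative prompts).
    acts_pos, acts_neg: [n_prompts, d_model] tensors at a fixed layer/position.
    Any direction with the same mean activation in both sets cancels out."""
    return acts_pos.mean(dim=0) - acts_neg.mean(dim=0)
```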

Comment by Arthur Conmy (arthur-conmy) on [Full Post] Progress Update #1 from the GDM Mech Interp Team · 2024-04-21T16:22:39.457Z · LW · GW

Yes, pretty much.

There's some work on transferring steering vecs, e.g. the Llama-2 steering paper (https://arxiv.org/abs/2312.06681) shows that you can transfer steering vecs from base to chat model, and I saw results at a Hackathon once that suggested you can train resid stream SAEs on early layers and transfer them to some later layers, too. But retraining is likely what our follow up work will do (this post only used two different SAEs)

Comment by Arthur Conmy (arthur-conmy) on Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features · 2024-03-15T17:36:46.806Z · LW · GW

Why is CE loss >= 5.0 everywhere? Looking briefly at GELU-1L over 128 positions (a short sequence length!) I see our models get 4.3 CE loss. 5.0 seems really high?

Ah, I see your section on this, but I doubt that bad data explains all of this. Are you using a very small sequence length, or an odd dataset?

Comment by Arthur Conmy (arthur-conmy) on When and why did 'training' become 'pretraining'? · 2024-03-08T20:36:42.328Z · LW · GW

From my perspective this term appeared around 2021 and became basically ubiquitous by 2022


I don't think this is correct. To add to Steven's answer, in the "GPT-1" paper from 2018 the abstract discusses 

...generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task

and the assumption at the time was that the finetuning step was necessary for the models to be good at a given task. This assumption persisted for a long time, with academics finetuning BERT on tasks that GPT-3 would eventually significantly outperform them on. You can tell this from how cautious the GPT-1 authors are about claiming the base model could do anything, and they sound very quaint:

> We’d like to better understand why language model pre-training of transformers is effective. A hypothesis is that the underlying generative model learns to perform many of the tasks we evaluate on in order to improve its language modeling capability

Comment by Arthur Conmy (arthur-conmy) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-11T00:25:15.918Z · LW · GW

The fact that Pythia generalizes to longer sequences but GPT-2 doesn't isn't very surprising to me -- getting long context generalization to work is a key motivation for rotary, e.g. the original paper https://arxiv.org/abs/2104.09864

Comment by Arthur Conmy (arthur-conmy) on Some open-source dictionaries and dictionary learning infrastructure · 2024-02-07T22:26:38.443Z · LW · GW

Do you apply LR warmup immediately after doing resampling (i.e. immediately reducing the LR, and then slowly increasing it back to the normal value)? In my GELU-1L blog post I found this pretty helpful (in addition to doing LR warmup at the start of training)
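
The schedule I have in mind, roughly (an illustrative sketch; the names and step counts are made up):

```python
def lr_at_step(step, base_lr, warmup_steps, resample_steps, resample_warmup_steps):
    """Sketch: linear warmup at the start of training, plus a fresh warmup
    (LR dropped to ~0, then linearly recovered) after each resampling step."""
    # Warmup at the start of training.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Post-resample warmup: find the most recent resample at or before `step`.
    last_resample = max((s for s in resample_steps if s <= step), default=None)
    if last_resample is not None and step - last_resample < resample_warmup_steps:
        return base_lr * (step - last_resample) / resample_warmup_steps
    return base_lr

# e.g. lr_at_step(25_050, 3e-4, warmup_steps=1_000,
#                 resample_steps=[25_000, 50_000], resample_warmup_steps=1_000)
```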

Comment by Arthur Conmy (arthur-conmy) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-02T16:48:00.127Z · LW · GW

(This reply is less important than my other)

> The network itself doesn't have a million different algorithms to perform a million different narrow subtasks

For what it's worth, this sort of thinking is really not obvious to me at all. It seems very plausible that frontier models only have their amazing capabilities through the aggregation of a huge number of dumb heuristics (as an aside, I think that if true this is net positive for alignment). This is consistent with findings that e.g. grokking and phase changes are much less common in LLMs than in toy models.

(Two objections to these claims are that plausibly current frontier models are importantly limited, and also that it's really hard to prove either me or you correct on this point since it's all hand-wavy)

Comment by Arthur Conmy (arthur-conmy) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-02T16:35:22.794Z · LW · GW

Thanks for the first sentence -- I appreciate clearly stating a position.

measured over a single token the network layers will have representation rank 1

I don't follow this. Are you saying that the residual stream at position 0 in a transformer is a function of the first token only, or something like this? 

If so, I agree -- but I don't see how this applies to much SAE[1] or mech interp[2] work. Where do we disagree?

 

  1. ^

    E.g. in this post here we show in detail how an "inside a question beginning with which" SAE feature is computed from the token "which" and predicts question marks (I helped with this project but didn't personally find this feature)

  2. ^

    More generally, in narrow-distribution mech interp work such as the IOI paper, I don't think it makes sense to reduce the explanation to single-token perfect-accuracy probes, since our explanation generalises fairly well (e.g. the "Adversarial examples" in Section 4.4 that Alexandre found)

Comment by Arthur Conmy (arthur-conmy) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-02T14:52:26.768Z · LW · GW

Neel and I recently tried to interpret a language model circuit by attaching SAEs to the model. We found that using an L0=50 SAE while only keeping the top 10 features by activation value per prompt (and zero-ablating the others) was better than an L0=10 SAE by our task-specific metric and by subjective interpretability. I can check how far this generalizes.
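
The filtering step, sketched (my own minimal version, not our actual code):

```python
import torch

def keep_top_k_features(feature_acts: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Zero-ablate all but the k most active SAE features per prompt.
    feature_acts: [n_prompts, n_features] tensor of SAE feature activations."""
    topk = feature_acts.topk(k, dim=-1)
    filtered = torch.zeros_like(feature_acts)
    filtered.scatter_(-1, topk.indices, topk.values)
    return filtered
```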

Comment by Arthur Conmy (arthur-conmy) on How useful is mechanistic interpretability? · 2024-02-02T14:30:18.839Z · LW · GW

If you have <98% perf explained (on webtext relative to unigram or bigram baseline), then you degrade from GPT4 perf to GPT3.5 perf

 

Two quick thoughts on why this isn't as concerning to me as this dialogue emphasized.

1. If we evaluate SAEs by the quality of their explanations on specific narrow tasks, full distribution performance doesn't matter

2. Plausibly the safety-relevant capabilities of GPT (N+1) are a phase change from GPT N, meaning that even much larger loss increases from attaching SAEs to GPT (N+1) could still leave it competitive with GPT N (ht Tom for this one)

Comment by Arthur Conmy (arthur-conmy) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-02T14:18:03.893Z · LW · GW

Is the eval loss degradation when attaching SAEs a crux of the SAE research direction for you? I agree it's not ideal, but to me the comparison of eval loss to smaller models only makes sense if the goal of the SAE direction is making a full-distribution competitive model. Explaining narrow tasks, or just using SAEs for monitoring/steering/lie detection/etc., doesn't require competitive eval loss. (Note that I have varying excitement about all these goals, e.g. some pessimism about steering.)

Comment by Arthur Conmy (arthur-conmy) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-02T14:13:58.550Z · LW · GW

> 0.3 CE Loss increase seems quite substantial? A 0.3 CE loss increase on the pile is roughly the difference between Pythia 410M and Pythia 2.8B

My personal guess is that something like this is probably true. However, since we're comparing OpenWebText and the Pile, with different tokenizers, we can't really compare the two loss numbers, and further there is no GPT-2 extra-small model, so currently we can't compare these SAEs to smaller models. But yeah, in future we will probably compare GPT-2 Medium and GPT-2 Large with SAEs attached to the smaller models in the same family, and there will probably be similar degradation, at least until we have more SAE advances.

Comment by Arthur Conmy (arthur-conmy) on Steering Llama-2 with contrastive activation additions · 2024-01-08T13:03:45.128Z · LW · GW

It's very impressive that this technique could be used alongside existing finetuning tools.

> According to our data, this technique stacks additively with both finetuning

To check my understanding, the evidence for this claim in the paper is Figure 13, where your method stacks with finetuning to increase sycophancy. But there are not currently results on decreasing sycophancy (or any other bad capability), where you show your method stacks with finetuning, right?

(AFAICT currently Figure 13 shows some evidence that activation addition to reduce sycophancy outcompetes finetuning, but you're unsure about the statistical significance due to the low percentages involved)

Comment by Arthur Conmy (arthur-conmy) on [Interim research report] Taking features out of superposition with sparse autoencoders · 2024-01-03T16:43:45.411Z · LW · GW

I previously thought that L1 penalties were exactly what you wanted for sparse reconstruction.

Thinking about your undershooting claim, I came up with a toy example that made it obvious to me that the Anthropic loss function was not optimal: suppose you are role-playing a single-feature SAE reconstructing the number 2, and are given loss equal to the squared error of your guess, plus the norm of your guess. Then guessing x>0 gives loss minimized at x=3/2, not 2
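
Writing out the toy example, with a unit L1 coefficient:

$$L(x) = (x - 2)^2 + |x|, \qquad \left.\frac{dL}{dx}\right|_{x>0} = 2(x - 2) + 1 = 0 \;\Rightarrow\; x^{*} = \tfrac{3}{2},$$

so the reconstruction systematically undershoots the true value 2.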

Comment by Arthur Conmy (arthur-conmy) on Mechanistically interpreting time in GPT-2 small · 2023-12-21T11:37:49.364Z · LW · GW

Note that this behavior generalizes far beyond GPT-2 Small head 9.1. We wrote a paper and an easier-to-digest tweet thread.

Comment by Arthur Conmy (arthur-conmy) on My best guess at the important tricks for training 1L SAEs · 2023-12-21T11:34:37.742Z · LW · GW

Thanks, fixed

Comment by Arthur Conmy (arthur-conmy) on How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme · 2023-12-20T16:26:15.173Z · LW · GW

The number of "features" scales at most linearly with the parameter count (for information theoretic reasons)

 

Why is this true? Do you have a resource on this?

Comment by Arthur Conmy (arthur-conmy) on Takeaways from a Mechanistic Interpretability project on “Forbidden Facts” · 2023-12-20T13:22:07.317Z · LW · GW

Thanks! 

In general, after the Copy Suppression paper (https://arxiv.org/pdf/2310.04625.pdf) I'm hesitant to call this a Waluigi component -- in that work we found that "Negative IOI Heads" and "Anti-Induction Heads" are not specifically about IOI or induction at all; they're just doing meta-processing to calibrate outputs.

Similarly, it seems possible the Waluigi components are just making the forbidden tokens appear with prob 10^{-3} rather than 10^{-5} or something like that, and would be incapable of actually making the harmful completion likely.

Comment by Arthur Conmy (arthur-conmy) on Takeaways from a Mechanistic Interpretability project on “Forbidden Facts” · 2023-12-19T16:26:38.350Z · LW · GW

What is a Waluigi component? A component that always says a fact, even when there is training to refuse facts in certain cases...?

Comment by Arthur Conmy (arthur-conmy) on [Interim research report] Taking features out of superposition with sparse autoencoders · 2023-12-07T22:24:45.577Z · LW · GW

I really appreciated this retrospective -- it changed my mind about the sparsity penalty, thanks!