Posts

Progress Update #1 from the GDM Mech Interp Team: Full Update 2024-04-19T19:06:59.185Z
Progress Update #1 from the GDM Mech Interp Team: Summary 2024-04-19T19:06:17.755Z
We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To 2024-03-06T05:03:09.639Z
Attention SAEs Scale to GPT-2 Small 2024-02-03T06:50:22.583Z
Sparse Autoencoders Work on Attention Layer Outputs 2024-01-16T00:26:14.767Z
My best guess at the important tricks for training 1L SAEs 2023-12-21T01:59:06.208Z
[Paper] All's Fair In Love And Love: Copy Suppression in GPT-2 Small 2023-10-13T18:32:02.376Z
Three ways interpretability could be impactful 2023-09-18T01:02:30.529Z
Mechanistically interpreting time in GPT-2 small 2023-04-16T17:57:52.637Z
RLHF does not appear to differentially cause mode-collapse 2023-03-20T15:39:45.353Z
OpenAI introduce ChatGPT API at 1/10th the previous $/token 2023-03-01T20:48:51.636Z
Arthur Conmy's Shortform 2022-11-01T21:35:29.449Z
Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small 2022-10-28T23:55:44.755Z

Comments

Comment by Arthur Conmy (arthur-conmy) on Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features · 2024-03-15T17:36:46.806Z · LW · GW

Why is CE loss >= 5.0 everywhere? Looking briefly at GELU-1L over 128 positions (a short sequence length!) I see our models get 4.3 CE loss. 5.0 seems really high?

Ah, I see your section on this, but I doubt that bad data explains all of this. Are you using a very small sequence length, or an odd dataset?

Comment by Arthur Conmy (arthur-conmy) on When and why did 'training' become 'pretraining'? · 2024-03-08T20:36:42.328Z · LW · GW

From my perspective this term appeared around 2021 and became basically ubiquitous by 2022


I don't think this is correct. To add to Steven's answer, in the "GPT-1" paper from 2018 the abstract discusses 

...generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task

and the assumption at the time was that the finetuning step was necessary for the models to be good at a given task. This assumption persisted for a long time, with academics finetuning BERT on tasks that GPT-3 would eventually significantly outperform them on. You can tell this from how cautious the GPT-1 authors are about claiming the base model could do anything; they sound very quaint:

> We’d like to better understand why language model pre-training of transformers is effective. A hypothesis is that the underlying generative model learns to perform many of the tasks we evaluate on in order to improve its language modeling capability

Comment by Arthur Conmy (arthur-conmy) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-11T00:25:15.918Z · LW · GW

The fact that Pythia generalizes to longer sequences but GPT-2 doesn't isn't very surprising to me -- getting long context generalization to work is a key motivation for rotary, e.g. the original paper https://arxiv.org/abs/2104.09864

Comment by Arthur Conmy (arthur-conmy) on Some open-source dictionaries and dictionary learning infrastructure · 2024-02-07T22:26:38.443Z · LW · GW

Do you apply LR warmup immediately after doing resampling (i.e. immediately reducing the LR, and then slowly increasing it back to the normal value)? In my GELU-1L blog post I found this pretty helpful (in addition to doing LR warmup at the start of training)
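
For reference, a minimal sketch of the schedule I have in mind (hypothetical helper, not from either of our codebases): the LR multiplier ramps linearly back to 1 over some number of steps after every resampling event, in addition to the usual warmup at step 0.

```python
def lr_multiplier(step: int, last_resample_step: int, warmup_steps: int = 1000) -> float:
    """Linear LR warmup that restarts after each feature-resampling event.

    With last_resample_step = 0 this also covers the usual warmup at the
    start of training. Multiply the base learning rate by this factor.
    """
    steps_since_resample = step - last_resample_step
    return min(1.0, steps_since_resample / warmup_steps)

# Hypothetical usage inside the training loop:
# for group in optimizer.param_groups:
#     group["lr"] = base_lr * lr_multiplier(step, last_resample_step)
```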

Comment by Arthur Conmy (arthur-conmy) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-02T16:48:00.127Z · LW · GW

(This reply is less important than my other)

> The network itself doesn't have a million different algorithms to perform a million different narrow subtasks

For what it's worth, this sort of thinking is really not obvious to me at all. It seems very plausible that frontier models only have their amazing capabilities through the aggregation of a huge number of dumb heuristics (as an aside, I think if true this is net positive for alignment). This is consistent with findings that e.g. grokking and phase changes are much less common in LLMs than in toy models.

(Two objections to these claims are that plausibly current frontier models are importantly limited, and also that it's really hard to prove either me or you correct on this point since it's all hand-wavy)

Comment by Arthur Conmy (arthur-conmy) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-02T16:35:22.794Z · LW · GW

Thanks for the first sentence -- I appreciate clearly stating a position.

measured over a single token the network layers will have representation rank 1

I don't follow this. Are you saying that the residual stream at position 0 in a transformer is a function of the first token only, or something like this? 

If so, I agree -- but I don't see how this applies to much SAE[1] or mech interp[2] work. Where do we disagree?

 

[1] E.g. in this post here we show in detail how an "inside a question beginning with which" SAE feature is computed from the "which" token and predicts question marks (I helped with this project but didn't personally find this feature)

[2] More generally, in narrow distribution mech interp work such as the IOI paper, I don't think it makes sense to reduce the explanation to single-token perfect-accuracy probes, since our explanation generalises fairly well (e.g. the "Adversarial examples" Alexandre found in Section 4.4)

Comment by Arthur Conmy (arthur-conmy) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-02T14:52:26.768Z · LW · GW

Neel and I recently tried to interpret a language model circuit by attaching SAEs to the model. We found that using an L0=50 SAE while only keeping the top 10 features by activation value per prompt (and zero ablating the others) was better than an L0=10 SAE by our task-specific metric, and subjective interpretability. I can check how far this generalizes.
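
Roughly the kind of masking I mean, as a sketch rather than our exact code (I'm scoring features by their maximum activation over the prompt here, and the names/shapes are illustrative):

```python
import torch

def keep_topk_features(feature_acts: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Keep the k SAE features with the largest activation anywhere in each prompt,
    zero-ablating all other features before the decoder reconstructs the activations.

    feature_acts: [batch, seq, n_features] post-ReLU SAE feature activations.
    """
    # Score each feature by its maximum activation over the sequence.
    scores = feature_acts.max(dim=1).values                 # [batch, n_features]
    topk_idx = scores.topk(k, dim=-1).indices               # [batch, k]
    mask = torch.zeros_like(scores)
    mask.scatter_(-1, topk_idx, 1.0)
    return feature_acts * mask.unsqueeze(1)                 # broadcast over seq
```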

Comment by Arthur Conmy (arthur-conmy) on How useful is mechanistic interpretability? · 2024-02-02T14:30:18.839Z · LW · GW

If you have <98% perf explained (on webtext relative to unigram or bigram baseline), then you degrade from GPT4 perf to GPT3.5 perf

 

Two quick thoughts on why this isn't as concerning to me as this dialogue emphasized.

1. If we evaluate SAEs by the quality of their explanations on specific narrow tasks, full distribution performance doesn't matter

2. Plausibly the safety relevant capabilities of GPT (N+1) are a phase change from GPT N, meaning much larger loss increases in GPT (N+1) when attaching SAEs are actually competitive with GPT N (ht Tom for this one)

Comment by Arthur Conmy (arthur-conmy) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-02T14:18:03.893Z · LW · GW

Is the drop of eval loss when attaching SAEs a crux for the SAE research direction to you? I agree it's not ideal, but to me the comparison of eval loss to smaller models only makes sense if the goal of the SAE direction is making a full-distribution competitive model. Explaining narrow tasks, or just using SAEs for monitoring/steering/lie detection/etc. doesn't require competitive eval loss. (Note that I have varying excitement about all these goals, e.g. some pessimism about steering)

Comment by Arthur Conmy (arthur-conmy) on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-02T14:13:58.550Z · LW · GW

> 0.3 CE Loss increase seems quite substantial? A 0.3 CE loss increase on the pile is roughly the difference between Pythia 410M and Pythia 2.8B

My personal guess is that something like this is probably true. However, since we're comparing OpenWebText and the Pile, and different tokenizers, we can't really compare the two loss numbers, and further there is no GPT-2 extra small model, so currently we can't compare these SAEs to smaller models. But yeah, in future we will probably compare GPT-2 Medium and GPT-2 Large with SAEs attached to the smaller models in the same family, and there will probably be similar degradation, at least until we have more SAE advances.

Comment by Arthur Conmy (arthur-conmy) on Steering Llama-2 with contrastive activation additions · 2024-01-08T13:03:45.128Z · LW · GW

It's very impressive that this technique could be used alongside existing finetuning tools.

> According to our data, this technique stacks additively with both finetuning

To check my understanding, the evidence for this claim in the paper is Figure 13, where your method stacks with finetuning to increase sycophancy. But there are not currently results on decreasing sycophancy (or any other bad capability), where you show your method stacks with finetuning, right?

(AFAICT currently Figure 13 shows some evidence that activation addition to reduce sycophancy outcompetes finetuning, but you're unsure about the statistical significance due to the low percentages involved)

Comment by Arthur Conmy (arthur-conmy) on [Interim research report] Taking features out of superposition with sparse autoencoders · 2024-01-03T16:43:45.411Z · LW · GW

I previously thought that L1 penalties were just exactly what you wanted for doing sparse reconstruction.

Thinking about your undershooting claim, I came up with a toy example that made it obvious to me that the Anthropic loss function was not optimal: suppose you are role-playing a single-feature SAE reconstructing the number 2, and are given loss equal to the squared error of your guess plus the norm of your guess. Then among guesses x>0 the loss is minimized at x=3/2, not 2.
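
Spelling out the arithmetic (a minimal worked version of the toy example):

```latex
% Single-feature "SAE" reconstructing the target 2, with guess x > 0:
%   L(x)  = (x - 2)^2 + |x| = (x - 2)^2 + x
%   L'(x) = 2(x - 2) + 1 = 2x - 3 = 0  =>  x* = 3/2
% so the penalised optimum undershoots the true value (L(3/2) = 7/4 < L(2) = 2).
\[
  L(x) = (x-2)^2 + |x|, \qquad
  \left.\frac{dL}{dx}\right|_{x>0} = 2(x-2) + 1 = 0
  \;\Longrightarrow\; x^{*} = \tfrac{3}{2}.
\]
```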

Comment by Arthur Conmy (arthur-conmy) on Mechanistically interpreting time in GPT-2 small · 2023-12-21T11:37:49.364Z · LW · GW

Note that this behavior generalizes far beyond GPT-2 Small head 9.1. We wrote a paper and an easier-to-digest tweet thread

Comment by Arthur Conmy (arthur-conmy) on My best guess at the important tricks for training 1L SAEs · 2023-12-21T11:34:37.742Z · LW · GW

Thanks, fixed

Comment by Arthur Conmy (arthur-conmy) on How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme · 2023-12-20T16:26:15.173Z · LW · GW

The number of "features" scales at most linearly with the parameter count (for information theoretic reasons)

 

Why is this true? Do you have a resource on this?

Comment by Arthur Conmy (arthur-conmy) on Takeaways from a Mechanistic Interpretability project on “Forbidden Facts” · 2023-12-20T13:22:07.317Z · LW · GW

Thanks! 

In general after the Copy Suppression paper (https://arxiv.org/pdf/2310.04625.pdf) I'm hesitant to call this a Waluigi component -- in that work we found that "Negative IOI Heads" and "Anti-Induction Heads" are not specifically about IOI or Induction at all, they're just doing meta-processing to calibrate outputs. 

Similarly, it seems possible the Waluigi components are just making the forbidden tokens appear with prob 10^{-3} rather than 10^{-5} or something like that, and would be incapable of actually making the harmful completion likely 

Comment by Arthur Conmy (arthur-conmy) on Takeaways from a Mechanistic Interpretability project on “Forbidden Facts” · 2023-12-19T16:26:38.350Z · LW · GW

What is a Waluigi component? A component that always says a fact, even when there is training to refuse facts in certain cases...?

Comment by Arthur Conmy (arthur-conmy) on [Interim research report] Taking features out of superposition with sparse autoencoders · 2023-12-07T22:24:45.577Z · LW · GW

I really appreciated this retrospective, this changed my mind about the sparsity penalty, thanks!

Comment by Arthur Conmy (arthur-conmy) on TurnTrout's shortform feed · 2023-11-23T14:54:16.177Z · LW · GW

Related: "To what extent does Veit et al still hold on transformer LMs?" feels to me a super tractable and pretty helpful paper someone could work on (Anthropic do some experiments but not many). People discuss this paper a lot with regard to NNs having a simplicity bias, as well as how the paper implies NNs are really stacks of many narrow heuristics rather than deep paths. Obviously empirical work won't provide crystal clear answers to these questions but this doesn't mean working on this sort of thing isn't valuable.

Comment by Arthur Conmy (arthur-conmy) on TurnTrout's shortform feed · 2023-11-23T14:53:33.627Z · LW · GW

Yeah I agree with the intuition and hadn't made the explicit connection to the shallow paths paper, thanks!

I would say that Edge Attribution Patching is the extreme form of this https://arxiv.org/abs/2310.10348 , where we just ignored almost all subgraphs H except for G \ {e_1} (removing one edge only), and still got reasonable results, which agrees with some more upcoming results.

Comment by Arthur Conmy (arthur-conmy) on Classifying representations of sparse autoencoders (SAEs) · 2023-11-17T14:31:39.868Z · LW · GW

Why do you think that the sentiment will not be linearly separable?

I would guess that something like multiplying residual stream states by the unembedding logit-difference direction (ie the logit difference under the Logit Lens) would be reasonable (possibly with hacks like the tuned lens)
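
A minimal sketch of the kind of probe I mean (the contrast tokens " good"/" bad" and the layer are illustrative choices, not a recommendation):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# Logit-difference direction in the unembedding (Logit Lens style).
pos_id = model.to_single_token(" good")
neg_id = model.to_single_token(" bad")
direction = model.W_U[:, pos_id] - model.W_U[:, neg_id]      # [d_model]

_, cache = model.run_with_cache("The movie was absolutely wonderful")
resid = cache["blocks.8.hook_resid_post"][0, -1]             # final-token residual stream

score = resid @ direction   # > 0 suggests the state leans towards the positive logit
print(score.item())
```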

Comment by Arthur Conmy (arthur-conmy) on Thomas Kwa's Shortform · 2023-11-14T16:58:13.730Z · LW · GW

mechanistic anomaly detection hopes to do better by looking at the agent's internals as a black box

 

Do you mean "black box" in the sense that MAD does not assume interpretability of the agent? If so this is kinda confusing as "black box" is often used in contrast to "white box", ie "black box" means you have no access to model internals, just inputs+outputs (which wouldn't make sense in your context)

Comment by Arthur Conmy (arthur-conmy) on [Paper] All's Fair In Love And Love: Copy Suppression in GPT-2 Small · 2023-11-01T01:30:02.934Z · LW · GW

Thanks, fixed

Comment by Arthur Conmy (arthur-conmy) on Lying is Cowardice, not Strategy · 2023-10-25T01:03:00.592Z · LW · GW

Thanks for checking this! I mostly agree with all your original comment now (except the first part suggesting it was point blank, but we're quibbling over definitions at this point), this does seem like a case of intentionally not discussing risk

Comment by Arthur Conmy (arthur-conmy) on Lying is Cowardice, not Strategy · 2023-10-24T22:31:27.821Z · LW · GW

I think your interpretation is fairly uncharitable. If you have further examples of this deceptive pattern from those sympathetic to AI risk I would change my perspective but the speculation in the post plus this example weren't compelling:  

I watched the video, and firstly Senator Peters seems to trail off after the quoted part and ends his question by saying "What's your assessment of how fast this is going and when do you think we may be faced with those more challenging issues?". So straightforwardly his question is about timelines, not about risk as you frame it. Indeed Matheny (after two minutes) literally responds "it's a really difficult question. I think whether AGI is nearer or farther than thought ..." (emphasis different to yours), which makes it likely to me that Matheny is expressing uncertainty about timelines, not risk.

Overall I agree that this was an opportunity for Matheny to discuss AI x-risk and plausibly it wasn't the best use of time to discuss the uncertainty of the situation. But saying this is dishonesty doesn't seem well supported

Comment by Arthur Conmy (arthur-conmy) on Arguments for optimism on AI Alignment (I don't endorse this version, will reupload a new version soon.) · 2023-10-18T23:25:21.753Z · LW · GW

I just saw this post and cannot parse it at all. You first say that you have removed the 9s of confidence. Then the next paragraph talks about a 99.9… figure. Then there are edit and quote paragraphs, and I do not know whether these are your views or someone else's, or whether you endorse them.

Comment by Arthur Conmy (arthur-conmy) on AISN #23: New OpenAI Models, News from Anthropic, and Representation Engineering · 2023-10-18T01:15:07.207Z · LW · GW

The RepEng paper claims SOTA on TruthfulQA by 18%. Is this MC1 from here https://paperswithcode.com/sota/question-answering-on-truthfulqa ? Where is this number coming from? And why is the only main-text evaluation against any other method a single table comparing against ActAdd (what about ITI? And surely there are other methods outside the LW sphere?)

I’m glad there’s work trying to use model internals for useful things but the evidence didn’t seem that strong besides single prompts that don’t provide me with that much signal

Comment by Arthur Conmy (arthur-conmy) on Three ways interpretability could be impactful · 2023-09-19T12:42:02.409Z · LW · GW

By "crisp internal representations" I mean that 

  1. The latent variables the model uses to perform tasks are latent variables that humans use. Contra ideas that language models reason in completely alien ways. I agree that the two cited works are not particularly strong evidence :(
  2. The model uses these latent variables with a bias towards shorter algorithms (e.g. shallower paths in transformers). This is important as it's possible that, even when performing really narrow tasks, models could use a very large number of (individually understandable) latent variables in long algorithms, such that no human could feasibly understand what's going on.


I'm not sure what the end product of automated interpretability is; I think it would be pure speculation to make claims here.

Comment by Arthur Conmy (arthur-conmy) on Three ways interpretability could be impactful · 2023-09-19T12:32:11.935Z · LW · GW

I think that your comment on (1) is too pessimistic about the possibility of stopping deployment of a misaligned model. That would be a pretty massive result! I think this would have cultural or policy benefits that are pretty diverse, so I don't think I agree that this is always a losing strategy -- after revealing misalignment to relevant actors and preventing deployment, the space of options grows a lot.

I'm not sure anything I wrote in (2) is close to understanding all the important safety properties of an AI. For example, the grokking work doesn't explain all the inductive biases of transformers/Adam, but it has helped people reason better about transformers/Adam. Is there something I'm missing?

On (3) I think that rewarding the correct mechanisms in models is basically an extension of process-based feedback. This may be infeasible or only be possible while applying lots of optimization pressure on a model's cognition (which would be worrying for the various list of lethalities reasons). Are these the reasons you're pessimistic about this, or something else?

I like your writing, but AFAIK you haven't written your thoughts on interp prerequisites for useful alignment - do you have any writing on this?

Comment by Arthur Conmy (arthur-conmy) on Introducing the Center for AI Policy (& we're hiring!) · 2023-09-02T13:40:59.924Z · LW · GW

Kudos for providing concrete metrics for frontier systems, receiving pretty negative feedback on one of those metrics (dataset size), and then updating the metrics. 

It would be nice if the edit about the dataset size restriction was highlighted more clearly (in both your posts and the critics' comments).

Comment by Arthur Conmy (arthur-conmy) on Optimisation Measures: Desiderata, Impossibility, Proposals · 2023-08-07T18:07:06.181Z · LW · GW

1. I would have thought that VNM utility has invariance with alpha>0, not alpha>=0 -- is this correct? (See the quick check after this list.)

2. Is there any alternative to dropping convex-linearity (perhaps other than changing to convexity, as you mention)? Would the space of possible optimisation functions be too large in this case, or is this an exciting direction?
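
For 1., spelling out the standard invariance I have in mind:

```latex
% Positive affine rescalings preserve VNM preferences: with u'(x) = a*u(x) + b and a > 0,
%   E[u'(A)] - E[u'(B)] = a * (E[u(A)] - E[u(B)]),
% so all orderings over lotteries are preserved. With a = 0 the transformed utility is
% constant and all preference information is lost, hence my guess that a > 0 is required.
\[
  \mathbb{E}[u'(A)] - \mathbb{E}[u'(B)]
  \;=\; \alpha\,\bigl(\mathbb{E}[u(A)] - \mathbb{E}[u(B)]\bigr),
  \qquad u' = \alpha u + \beta,\ \alpha > 0 .
\]
```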

Comment by Arthur Conmy (arthur-conmy) on Visible loss landscape basins don't correspond to distinct algorithms · 2023-07-29T10:40:15.702Z · LW · GW

Doesn't Figure 7, top left from the arXiv paper provide evidence against the "network is moving through dozens of different basins or more" picture?

Comment by Arthur Conmy (arthur-conmy) on Steering GPT-2-XL by adding an activation vector · 2023-07-22T02:35:06.346Z · LW · GW

Oops, I was wrong in my initial hunch as I assumed centering writing did something extra. I’ve edited my top level comment, thanks for pointing out my oversight!

Comment by Arthur Conmy (arthur-conmy) on Steering GPT-2-XL by adding an activation vector · 2023-07-19T01:11:24.492Z · LW · GW

No this isn’t about center_unembed, it’s about center_writing_weights as explained here: https://github.com/neelnanda-io/TransformerLens/blob/main/further_comments.md#centering-writing-weights-center_writing_weight

This is turned on by default in TL, so okay, I think that there must be something else weird about the models, rather than just a naive bias, that causes you to need to do the difference thing

Comment by Arthur Conmy (arthur-conmy) on Steering GPT-2-XL by adding an activation vector · 2023-07-16T19:39:25.568Z · LW · GW

> Can we just add in  times the activations for "Love" to another forward pass and reap the sweet benefits of more loving outputs? Not quite. We found that it works better to pair two activation additions.

Do you have evidence for this? 

It's totally unsurprising to me that you need to do this on HuggingFace models, as the residual stream is very likely to have a constant bias term which you will not want to add to. I saw you used TransformerLens for some part of the project, and TL removes the mean from all additions to the residual stream, which I would have guessed would solve the problem here. EDIT: see reply.

I even tested this:
 

Empirically in TransformerLens the 5*Love and 5*(Love-Hate) additions were basically identical from a blind trial on myself (I found 5*Love more loving 15 times compared to 5*(Love-Hate) more loving 12 times, and I independently rated which generations were more coherent, and both additions were more coherent 13 times. There were several trials where performance on either loving-ness or coherence seemed identical to me).
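
Roughly the kind of setup this was, as a minimal sketch (the layer, prompt, and generation settings here are illustrative, not the exact values from my trial):

```python
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-xl")
layer = 6                                            # illustrative injection layer
hook_name = utils.get_act_name("resid_pre", layer)

def resid_at(prompt: str):
    _, cache = model.run_with_cache(prompt)
    return cache[hook_name][0, -1]                   # [d_model] at the final token

coeff = 5.0
steering_love = coeff * resid_at(" Love")                        # the 5*Love variant
steering_diff = coeff * (resid_at(" Love") - resid_at(" Hate"))  # the 5*(Love-Hate) variant

def make_hook(vec):
    def steering_hook(resid, hook):
        return resid + vec                           # add the steering vector at every position
    return steering_hook

for name, vec in [("5*Love", steering_love), ("5*(Love-Hate)", steering_diff)]:
    with model.hooks(fwd_hooks=[(hook_name, make_hook(vec))]):
        print(name, model.generate("I think you are", max_new_tokens=30))
```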

Comment by Arthur Conmy (arthur-conmy) on Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] · 2023-07-14T02:02:57.527Z · LW · GW

(Personal opinion as an ACDC author) I would agree that causal scrubbing aims mostly to test hypotheses and ACDC does a general pass over model components to hopefully develop a hypothesis (see also the causal scrubbing post's related work and ACDC paper section 2). These generally feel like different steps in a project to me (though some posts in the causal scrubbing sequence show that the method can generate hypotheses, and also ACDC was inspired by causal scrubbing). On the practical side, the considerations in How to Think About Activation Patching may be a helpful field guide for projects where one of several tools might be best for a given use case. I think both causal scrubbing and ACDC are currently somewhat inefficient, for different reasons: causal scrubbing is slow because causal diagrams have lots of paths in them, and ACDC is slow because computational graphs have lots of edges in them (follow-up work to ACDC will do gradient descent over the edges).

Comment by Arthur Conmy (arthur-conmy) on MetaAI: less is less for alignment. · 2023-07-05T17:54:13.320Z · LW · GW

I think that this critique is a bit overstated.

i) I would guess that human eval is in general better than most benchmarks. This is because it's a mystery how much benchmark performance is explained by prompt leakage and benchmarks being poorly designed (e.g. crowd-sourced benchmarks have issues with incorrect or not-useful tests, and adversarially filtered benchmarks like TruthfulQA have selection effects on their content which make interpreting their results harder, in my opinion)

ii) GPT-4 is the best model we have access to. Any competition with GPT-4 is competition with the SOTA available model! This is a much harder reference class to compare to than models trained with the same compute, models trained without fine-tuning etc.

Comment by Arthur Conmy (arthur-conmy) on ChatGPT understands, but largely does not generate Spanglish (and other code-mixed) text · 2023-07-04T09:47:54.509Z · LW · GW

Note that this seems fairly easy for GPT-4 + a bit of nudging

Comment by Arthur Conmy (arthur-conmy) on Steering GPT-2-XL by adding an activation vector · 2023-07-04T09:46:35.522Z · LW · GW

This is particularly impressive since ChatGPT isn't capable of code-switching (though GPT-4 seems to be from a quick try)

Comment by Arthur Conmy (arthur-conmy) on LLMs Sometimes Generate Purely Negatively-Reinforced Text · 2023-06-17T12:22:08.696Z · LW · GW

Nice toy circuit - interesting that it can describe both attention-head and MLP key-value stores. Do you have any intuition about what mechanistically the model is doing on the negatively reinforced text? E.g. is the toy model intended to suggest that there is some feature represented as +1 or -1 in the residual stream? If so, do you expect this to be a direction or something non-linear?

Comment by Arthur Conmy (arthur-conmy) on [Linkpost] Faith and Fate: Limits of Transformers on Compositionality · 2023-06-16T18:18:27.151Z · LW · GW

Am I missing something or is GPT-4 able to do Length 20 Dynamic Programming using a solution it described itself very easily?

https://chat.openai.com/share/8d0d38c0-e8de-49f3-8326-6ab06884df90

We have 100k context models and several OOMs more FLOPs to throw at models, and I couldn't see a reason why autoregressive models were limited in a substantial way given the evidence in the paper

Comment by Arthur Conmy (arthur-conmy) on 200 COP in MI: Image Model Interpretability · 2023-06-07T12:34:29.187Z · LW · GW

Oh huh - those eyes, webs and scales in Slide 43 of your work are really impressive, especially given the difficulty extending these methods to transformers. Is there any write-up of this work?

Comment by Arthur Conmy (arthur-conmy) on Why and When Interpretability Work is Dangerous · 2023-05-28T15:52:17.001Z · LW · GW

I am a bit confused by your operationalization of "Dangerous". On one hand

I posit that interpretability work is "dangerous" when it enhances the overall capabilities of an AI system, without making that system more aligned with human goals

is a definition I broadly agree with, especially since you want it to track the alignment-capabilities trade-off (see also this post). However, your examples suggest a more deontological approach:

This suggests a few concrete rules-of-thumb, which a researcher can apply to their interpretability project P: ...

If P makes it easier/more efficient to train powerful AI models, then P is dangerous.

Do you buy the alignment-capabilities trade-off model, or are you trying to establish principles for interpretability research? (or if both, please clarify what definition we're using here)

Comment by Arthur Conmy (arthur-conmy) on 'Fundamental' vs 'applied' mechanistic interpretability research · 2023-05-28T15:35:19.553Z · LW · GW

This was a nice description, thanks!

 However, regarding

comprehensively interpreting networks [... aims to] identify all representations or circuits in a network or summarize the full computational graph of a neural network (whatever that might mean)

I think this is an incredibly optimistic hope that needs to be challenged more.

On my model GPT-N has a mixture of a) crisp representations, b) fuzzy heuristics that are made crisp in GPT-(N+1), and c) noise and misgeneralizations. Unless we're discussing models that perfectly fit their training distribution, I expect comprehensively interpreting networks involves untangling many competing fuzzy heuristics which are all imperfectly implemented. Perhaps you expect this to be possible? However, I'm pretty skeptical this is tractable and expect the best interpretability work to not confront these completeness guarantees.

Related (I consider "mechanistic interpretability essentially solved" to be similar to your "comprehensive interpreting" goal)

Comment by Arthur Conmy (arthur-conmy) on Hands-On Experience Is Not Magic · 2023-05-28T11:56:03.923Z · LW · GW

I liked that you found a common thread in several different arguments.

However, I don't think that in practice the views are all believed together or all disagreed with (though I do think Yann LeCun would agree with all the points, and Eliezer Yudkowsky would disagree with all of them, except perhaps the last point).

For example, I agree with 1 and 5, agree with the first half but not the second half of 2, disagree with 3, and have mixed feelings about 4.

Why? At a high level, I think the extent to which individual researchers, large organizations, and LLMs/AIs need empirical feedback to improve differs a lot between the three.

Comment by Arthur Conmy (arthur-conmy) on My May 2023 priorities for AI x-safety: more empathy, more unification of concerns, and less vilification of OpenAI · 2023-05-24T21:05:48.573Z · LW · GW

I agree that some level of public awareness would not have been reached without accessible demos of SOTA models.

However, I don’t agree with the argument that AI capabilities should be released to increase our ability to ‘rein it in’ (I assume you are making an argument against a capabilities ‘overhang’ which has been made on LW before). This is because text-davinci-002 (and then 3) were publicly available but not accessible to the average citizen. Safety researchers knew these models existed and were doing good work on them before ChatGPT’s release. Releasing ChatGPT results in shorter timelines and hence less time for safety researchers to do good work.

To caveat this: I agree ChatGPT does help alignment research, but it doesn’t seem like researchers are doing things THAT differently based on its existence. And secondly I am aware that OAI did not realise how large the hype and investment would be from ChatGPT, but nevertheless this hype and investment is downstream of a liberal publishing culture which is something that can be blamed.

Comment by Arthur Conmy (arthur-conmy) on My May 2023 priorities for AI x-safety: more empathy, more unification of concerns, and less vilification of OpenAI · 2023-05-24T13:37:21.670Z · LW · GW

I agree that ChatGPT was positive for AI-risk awareness. However, from my perspective being very happy about OpenAI's impact on x-risk does not follow from this. Releasing powerful AI models does have a counterfactual effect on the awareness of risks, but it also creates a lot of counterfactual hype and funding (such as the vast current VC investment in AI), which is mostly pointed at general capabilities rather than safety, and which from my perspective is net negative.

Comment by Arthur Conmy (arthur-conmy) on My May 2023 priorities for AI x-safety: more empathy, more unification of concerns, and less vilification of OpenAI · 2023-05-24T01:32:36.907Z · LW · GW

Given past statements I expect all lab leaders to speak on AI risk soon. However, I bring up the FLI letter not because it is an AI risk letter, but because it is explicitly about slowing AI progress, which OAI and Anthropic have not shown that much support for

Comment by Arthur Conmy (arthur-conmy) on My May 2023 priorities for AI x-safety: more empathy, more unification of concerns, and less vilification of OpenAI · 2023-05-24T00:43:31.940Z · LW · GW

Thanks for writing this. As far as I can tell most anger about OpenAI is because i) being a top lab and pushing SOTA in a world with imperfect coordination shortens timelines and ii) a large number of safety-focused employees left (mostly for Anthropic) and had likely signed NDAs. I want to highlight i) and ii) in a point about evaluating the sign of the impact of OpenAI and Anthropic.

Since Anthropic's competition currently seems to be exacerbating race dynamics (and I will note that very few OpenAI employees and zero Anthropic employees signed the FLI letter), it seems to me that Anthropic is making i) worse by making coordination more difficult and intensifying race dynamics. At this point, believing Anthropic is better on net than OpenAI has to go through believing *something* about the reasons individuals had for leaving OpenAI (point ii above), and that these reasons outweigh the coordination and race-dynamic considerations. This is possible, but from my perspective there's little public evidence for the strength of these reasons. I'd be curious if I've missed something here.

Comment by Arthur Conmy (arthur-conmy) on Real-Time Research Recording: Can a Transformer Re-Derive Positional Info? · 2023-05-07T15:18:12.038Z · LW · GW

This occurs across different architectures and datasets (https://arxiv.org/abs/2203.16634)

[from a quick skim this video+blog post doesn't mention this]