Posts

EIS XIV: Is mechanistic interpretability about to be practically useful? 2024-10-11T22:13:51.033Z
Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals? 2024-07-30T14:57:06.807Z
EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024 2024-05-21T20:15:36.502Z
Analogies between scaling labs and misaligned superintelligent AI 2024-02-21T19:29:39.033Z
Deep Forgetting & Unlearning for Safely-Scoped LLMs 2023-12-05T16:48:18.177Z
Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation 2023-11-07T17:59:36.857Z
The 6D effect: When companies take risks, one email can be very powerful. 2023-11-04T20:08:39.775Z
Announcing the CNN Interpretability Competition 2023-09-26T16:21:50.276Z
Open Problems and Fundamental Limitations of RLHF 2023-07-31T15:31:28.916Z
A Short Memo on AI Interpretability Rainbows 2023-07-27T23:05:50.196Z
Examples of Prompts that Make GPT-4 Output Falsehoods 2023-07-22T20:21:39.730Z
Eight Strategies for Tackling the Hard Part of the Alignment Problem 2023-07-08T18:55:22.170Z
Takeaways from the Mechanistic Interpretability Challenges 2023-06-08T18:56:47.369Z
Advice for Entering AI Safety Research 2023-06-02T20:46:13.392Z
GPT-4 is easily controlled/exploited with tricky decision theoretic dilemmas. 2023-04-14T19:39:23.559Z
EIS XII: Summary 2023-02-23T17:45:55.973Z
EIS XI: Moving Forward 2023-02-22T19:05:52.723Z
EIS X: Continual Learning, Modularity, Compression, and Biological Brains 2023-02-21T16:59:42.438Z
EIS IX: Interpretability and Adversaries 2023-02-20T18:25:43.641Z
EIS VIII: An Engineer’s Understanding of Deceptive Alignment 2023-02-19T15:25:46.115Z
EIS VII: A Challenge for Mechanists 2023-02-18T18:27:24.932Z
EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety 2023-02-17T20:48:26.371Z
EIS V: Blind Spots In AI Safety Interpretability Research 2023-02-16T19:09:11.248Z
EIS IV: A Spotlight on Feature Attribution/Saliency 2023-02-15T18:46:23.165Z
EIS III: Broad Critiques of Interpretability Research 2023-02-14T18:24:43.655Z
EIS II: What is “Interpretability”? 2023-02-09T16:48:35.327Z
The Engineer’s Interpretability Sequence (EIS) I: Intro 2023-02-09T16:28:06.330Z
Avoiding perpetual risk from TAI 2022-12-26T22:34:48.565Z
Existential AI Safety is NOT separate from near-term applications 2022-12-13T14:47:07.120Z
Where to be an AI Safety Professor 2022-12-07T07:09:56.010Z
The Slippery Slope from DALLE-2 to Deepfake Anarchy 2022-11-05T14:53:54.556Z
[Linkpost] A survey on over 300 works about interpretability in deep networks 2022-09-12T19:07:09.156Z
Pitfalls with Proofs 2022-07-19T22:21:08.002Z
A daily routine I do for my AI safety research work 2022-07-19T21:58:24.511Z
Deep Dives: My Advice for Pursuing Work in Research 2022-03-11T17:56:28.857Z
The Achilles Heel Hypothesis for AI 2020-10-13T14:35:10.440Z
Procrastination Paradoxes: the Good, the Bad, and the Ugly 2020-08-06T19:47:54.718Z
Solipsism is Underrated 2020-03-28T18:09:45.295Z
Dissolving Confusion around Functional Decision Theory 2020-01-05T06:38:41.699Z

Comments

Comment by scasper on EIS XIV: Is mechanistic interpretability about to be practically useful? · 2024-10-12T17:26:55.164Z · LW · GW

Thanks for the comment. My probabilities sum to 165%, which amounts to saying that I expect the next paper to do, on average, 1.65 things from the list. That doesn't seem too crazy to me, and I think it DOES match my expectations. 
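
To spell out the arithmetic (just a sketch of the reasoning, treating each list item as an indicator variable with the probability I assigned it):

```latex
% Let X_i indicate whether the next paper does item i, with P(X_i = 1) = p_i.
% By linearity of expectation, the expected number of items it does is
\mathbb{E}\Big[\textstyle\sum_i X_i\Big] \;=\; \sum_i \mathbb{E}[X_i] \;=\; \sum_i p_i \;=\; 1.65
```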

But I also think you make a good point. If the next paper comes out, does only one thing, and does it really well, I commit to not complaining too much. 

Comment by scasper on Latent Adversarial Training · 2024-08-27T19:46:23.409Z · LW · GW

Some relevant papers to anyone spelunking around this post years later:

https://arxiv.org/abs/2403.05030
https://arxiv.org/abs/2407.15549

Comment by scasper on EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024 · 2024-05-22T04:21:29.964Z · LW · GW

Thanks, I think that these points are helpful and basically fair. Here is one thought, but I don't have any disagreements. 

Olah et al. 100% do a good job of noting what remains to be accomplished and that there is a lot more to do. But when people in the public or government get the misconception that mechanistic interpretability has been (or definitely will be) solved, we have to ask where this misconception came from. And I expect that claims like "Sparse autoencoders produce interpretable features for large models" contribute to this. 

Comment by scasper on EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024 · 2024-05-21T22:15:20.330Z · LW · GW

Thanks for the comment. I think the experiments you mention are good (which is why I say the paper met 3), but I don't think its competitiveness has been demonstrated (which is why I say it did not meet 6 or 10). I think there are two problems. 

First, the evaluation is under a streetlight. Ideally, there would be an experiment that began with a predetermined set of edits (e.g., one from Meng et al., 2022) and then used SAEs to perform them. 

Second, there's no baseline that the SAE edits are compared to. There are lots of techniques from the editing, finetuning, steering, rep-E, data curation, etc. literatures that people use to make specific changes to models' behaviors. Ideally, we'd want SAEs to be competitive with them. Unfortunately, good comparisons would be hard because using SAEs to edit models is a fairly unique method that requires a lot of compute upfront. This makes it non-straightforward to compare the difficulty of making different changes with different methods, but it does not obviate the need for baselines. 

Comment by scasper on Third-party testing as a key ingredient of AI policy · 2024-03-26T16:05:05.063Z · LW · GW

Thanks for the useful post. There are a lot of things to like about this. Here are a few questions/comments.

First, I appreciate the transparency around this point. 

we advocate for what we see as the ‘minimal viable policy’ for creating a good AI ecosystem, and we will be open to feedback.

Second, I have a small question. Why the use of the word "testing" instead of "auditing"? In my experience, most of the conversations lately about this kind of thing revolve around the term "auditing." 

Third, I wanted to note that this post does not talk about access levels for testers, and to ask why that is. My thoughts here relate to this recent work. I would have been excited to see (1) a commitment to providing auditors with methodological details, documentation, and data or (2) a commitment to working on APIs and SREs that can allow auditors to be securely given grey- and white-box access. 

Comment by scasper on I was raised by devout Mormons, AMA [&|] Soliciting Advice · 2024-03-14T16:35:21.837Z · LW · GW

I think one thing that's pretty cool is "home teaching." Mormon congregation members who are able are assigned various other members of the congregation to check in on. This often involves a monthly visit to their house, talking for a bit, sharing some spiritual thoughts, etc. The nice thing about it is that home teaching sometimes really benefits people who need it. Especially for old or disabled people, they get nice visits, and home teachers often help them with stuff. In my experience, Mormons within a congregation are pretty good at helping each other with misc. things in life (e.g. fixing a broken water heater), and this is largely done through home teaching.

Comment by scasper on Analogies between scaling labs and misaligned superintelligent AI · 2024-02-22T23:47:08.945Z · LW · GW

Thanks. I agree that the points apply to individual researchers. But I don't think they apply in a comparably worrisome way, because individual researchers do not have intelligence, money, and power comparable to the labs'. This is me stressing the "when put under great optimization pressure" part of Goodhart's Law. Subtle misalignments are much less dangerous when there is a weak optimization force behind the proxy than when there is a strong one. 

Comment by scasper on Analogies between scaling labs and misaligned superintelligent AI · 2024-02-21T19:59:07.325Z · LW · GW

See also this much older and closely related post by Thomas Woodside: Is EA an advanced, planning, strategically-aware power-seeking misaligned mesa-optimizer?

Comment by scasper on The case for ensuring that powerful AIs are controlled · 2024-01-26T01:54:10.865Z · LW · GW

Thanks for the reply. I think this resolves the thread and puts us on the same page :)

Comment by scasper on The case for ensuring that powerful AIs are controlled · 2024-01-26T00:31:43.418Z · LW · GW

I have been thinking a lot lately about evals and what difference black- vs. white-box access makes for evaluators. I was confused about the appendix on black-box evals. My best attempt to make sense of it hinges on two instances of the authors intending something different from what was written. 

First, the appendix hinges somewhat on this point. 

For now, we think the easiest way to build control techniques that avoid making incorrect assumptions about scheming models is to make conservative assumptions about their internals, by rejecting any strategy which can be evaded by a red-team-constructed model.

I agree with this, but concluding from it that "control evals should be black-box" is a non sequitur. This passage makes the point that bad reverse-engineering of the model is bad, and it seems to implicitly assume that good reverse-engineering is hard/unlikely.

Second, even if good reverse-engineering is hard/unlikely, reverse-engineering is not the definition of a white-box eval. There are a lot of things you can do with white-box access for evals other than interpretability-based things: gradient-based attacks, gradient-guided attacks, latent-space attacks, and finetuning -- and all of these actually seem plausibly useful for control evals. Plus, I don't see why interp should be off limits for evaluators who know the limitations of the tools they are using. 

Ultimately, white-box access offers strictly more options for any type of eval than black-box access does. For any property that an AI system might have, if there are two teams of competent people trying to verify whether the system has that property, all else equal, I'd bet on the success of the team that has more access. 

So if I understand things correctly, and if the intended point of this appendix is that "Interpretability-based evals for control seem unreliable and possibly could lead to pitfalls," I would recommend just saying that more directly. 
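
To make the latent-space option above concrete, here is a minimal sketch of the kind of attack I have in mind. The code is mine and purely illustrative; the function name, arguments, and the assumption that the model is a PyTorch module with a chosen submodule to hook are my own, not anything from the post.

```python
import torch

def latent_space_attack(model, layer, clean_inputs, loss_fn, steps=50, lr=1e-2, eps=1.0):
    # Illustrative only: optimize an additive perturbation to one module's activations
    # so as to minimize loss_fn (e.g., the negative log-prob of a behavior the red team
    # wants to elicit). Assumes `layer` is a submodule that outputs a single tensor.
    delta = None

    def hook(module, inputs, output):
        nonlocal delta
        if delta is None:
            delta = torch.zeros_like(output, requires_grad=True)
        return output + delta  # returning a value from a forward hook replaces the output

    handle = layer.register_forward_hook(hook)
    try:
        for _ in range(steps):
            out = model(clean_inputs)
            loss = loss_fn(out)
            (grad,) = torch.autograd.grad(loss, delta)
            with torch.no_grad():
                delta -= lr * grad        # step the perturbation toward the target behavior
                delta.clamp_(-eps, eps)   # keep the perturbation bounded
    finally:
        handle.remove()
    return delta.detach()
```

A red team running an eval could then check whether a bounded perturbation like this can elicit the behavior the control protocol is supposed to rule out.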

Comment by scasper on Deep Forgetting & Unlearning for Safely-Scoped LLMs · 2023-12-09T03:56:35.847Z · LW · GW

Thanks!

I intuit that what you mentioned as a feature might also be a bug. I think that practical forgetting/unlearning that might make us safer would probably involve subjects of expertise like biotech. And if so, then we would want benchmarks that measure a method's ability to forget/unlearn just the things key to that domain and nothing else. For example, if a method succeeds in unlearning biotech but makes the target LM also unlearn math and physics, then we should be concerned about that, and we probably want benchmarks to help us quantify that.

I could imagine an unlearning benchmark, for example, with textbooks and AP tests. Then, for each of several knowledge-recovery strategies, one could construct a grid of how well the model performs on each target test after unlearning each textbook.
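
Something like the following sketch, where the helper functions and scores are made-up placeholders meant only to show the shape of the grid, not a real API:

```python
from itertools import product

# All of these helpers and numbers are made-up stand-ins for a real unlearning method,
# recovery attack, and eval harness.
def unlearn_on(model, textbook):   return {"forgot": textbook}
def recover(model, strategy):      return {**model, "recovery": strategy}
def score_on(model, ap_test):      return 0.25 if model["forgot"] == ap_test else 0.85

base_model = {}
textbooks = ["biotech", "chemistry", "physics"]
ap_tests = ["biotech", "chemistry", "physics", "math"]
recovery_strategies = ["none", "few_shot_prompting", "light_finetuning"]

grid = {}
for book in textbooks:
    unlearned = unlearn_on(base_model, textbook=book)            # apply the unlearning method
    for strategy, test in product(recovery_strategies, ap_tests):
        probed = recover(unlearned, strategy)                    # try to elicit the knowledge back
        # Ideally, scores crater only where test == book and stay flat everywhere else.
        grid[(book, strategy, test)] = score_on(probed, ap_test=test)
```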

Comment by scasper on Deep Forgetting & Unlearning for Safely-Scoped LLMs · 2023-12-06T04:20:09.659Z · LW · GW

+1, I'll add this and credit you.

Comment by scasper on Deep Forgetting & Unlearning for Safely-Scoped LLMs · 2023-12-05T21:09:45.653Z · LW · GW

+1

That said, the NeurIPS challenge and the prior ML literature on forgetting and influence functions seem worth keeping on the radar because they're still closely related to the challenges here. 

Comment by scasper on Deep Forgetting & Unlearning for Safely-Scoped LLMs · 2023-12-05T20:10:43.438Z · LW · GW

Thanks! I edited the post to add a link to this.

Comment by scasper on The 6D effect: When companies take risks, one email can be very powerful. · 2023-11-04T23:59:56.386Z · LW · GW

A good critical paper about potentially risky industry norms is this one. 

Comment by scasper on Interpretability Externalities Case Study - Hungry Hungry Hippos · 2023-09-21T06:37:31.528Z · LW · GW

Thanks

To use your argument, what does MI actually do here?

The inspiration, I would suppose. Analogous to the type claimed in the HHH and hyena papers. 

 

And yes to your second point.

Comment by scasper on Interpretability Externalities Case Study - Hungry Hungry Hippos · 2023-09-20T16:23:53.149Z · LW · GW

Nice post. I think it can serve as a good example of how the hand-waviness about how interpretability can help us do good things with AI cuts both ways. 

I'm particularly worried about MI people studying instances of when LLMs do and don't express types of situational awareness and then someone using these insights to give LLMs much stronger situational awareness abilities.

Lastly,

On the other hand, interpretability research is probably crucial for AI alignment. 

I don't think this is true, and I especially hope it is not true, because (1) mechanistic interpretability has still failed to do impressive things by trying to reverse-engineer networks, and (2) from a safety standpoint it is entirely fungible with other techniques that often do better at various things.

Comment by scasper on Barriers to Mechanistic Interpretability for AGI Safety · 2023-08-29T21:50:10.764Z · LW · GW

Several people seem to be coming to similar conclusions recently (e.g., this recent post). 

I'll add that I have as well and wrote a sequence about it :)

Comment by scasper on Against Almost Every Theory of Impact of Interpretability · 2023-08-18T16:37:43.909Z · LW · GW

Thanks for the reply. This sounds reasonable to me. On the last point, I tried my best to do that here, and I think there is a relatively high ratio of solid explanations to shaky ones. 

Overall, I think that the hopes you have for interpretability research are good, and I hope it works out. One of my biggest concerns, though, is that people seem to have been offering similar takes, with little change, for 7+ years. And I just don't think the wins from this research have been commensurate with the effort put into it. I assume this is expected under your views, so probably not a crux.

Comment by scasper on Against Almost Every Theory of Impact of Interpretability · 2023-08-18T16:14:26.541Z · LW · GW

I get the impression of a certain motte-and-bailey in this comment and similar arguments. From a high level, the notion of better understanding what neural networks are doing sounds great. The problem, though, is that most of the SOTA interpretability research does not seem to be doing a good job of this in a way that will be useful for safety anytime soon. In that sense, I think this comment talks past the points that this post is trying to make. 

Comment by scasper on Mech Interp Challenge: August - Deciphering the First Unique Character Model · 2023-08-16T22:55:08.793Z · LW · GW

I think this is very exciting, and I'll look forward to seeing how it goes!

Comment by scasper on Open Problems and Fundamental Limitations of RLHF · 2023-08-01T05:00:07.098Z · LW · GW

Thanks, we will consider adding each of these. We appreciate that you took a look and took the time to help suggest these!

Comment by scasper on Open Problems and Fundamental Limitations of RLHF · 2023-08-01T03:16:50.855Z · LW · GW

No, I don't think the core advantages of transparency are really unique to RLHF, but in the paper, we list certain things that are specific to RLHF which we think should be disclosed. Thanks.

Comment by scasper on Open Problems and Fundamental Limitations of RLHF · 2023-07-31T21:41:28.906Z · LW · GW

Thanks, and +1 to adding the resources. Also, Charbel-Raphael, who authored the in-depth post, is one of the authors of this paper! That post in particular was something we paid attention to during the design of the paper. 

Comment by scasper on Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2) · 2023-07-29T22:58:50.121Z · LW · GW

This is exciting to see. I think this solution is impressive, and I think the case for the structure you find is compelling. It's also nice that this solution goes a little further than the previous one in one respect. The analysis with bars seems to get a little closer to a question I have had since the last solution:

My one critique of this solution is that I would have liked to see an understanding of why the transformer only seems to make mistakes near the parts of the domain where there are curved boundaries between regimes (see fig above with the colored curves). Meanwhile, the network did a great job of learning the periodic part of the solution that led to irregularly-spaced horizontal bars. Understanding why this is the case seems interesting but remains unsolved.

I think this work gives a somewhat more granular idea of what might be happening, and I think it's an interesting foil to the other solution: the two came up with fairly different pictures of the same process. The differences between these two projects seem like an interesting case study in MI. I'll probably refer to this a lot in the future. 

Overall, I think this is great, and although the challenge is over, I'm adding this to the GitHub readme. And if you let me know a high-impact charity you'd like to support, I'll send $500 to it as a similar prize for the challenge :)

Comment by scasper on Eight Strategies for Tackling the Hard Part of the Alignment Problem · 2023-07-22T19:11:00.191Z · LW · GW

Sounds right, but the problem seems to be semantic. If understanding is taken to mean a human's comprehension, then I think this is perfectly right. But since the method is mechanistic, it seems difficult nonetheless. 

Comment by scasper on Eight Strategies for Tackling the Hard Part of the Alignment Problem · 2023-07-21T16:58:26.034Z · LW · GW

Thanks -- I agree that this seems like an approach worth doing. I think that at CHAI and/or Redwood there is a little bit of work at least related to this, but don't quote me on that. In general, it seems like if you have a model and then a smaller distilled/otherwise-compressed version of it, there is a lot you can do with them from an alignment perspective. I am not sure how much work has been done in the anomaly detection literature that involves distillation/compression. 

Comment by scasper on Eight Strategies for Tackling the Hard Part of the Alignment Problem · 2023-07-13T17:03:24.584Z · LW · GW

I don't work on this, so grain of salt. 

But wouldn't this take the formal out of formal verification? If so, I am inclined to think about this as a form of ambitious mechanistic interpretability. 

Comment by scasper on Eight Strategies for Tackling the Hard Part of the Alignment Problem · 2023-07-12T15:44:14.580Z · LW · GW

I think this is a good point, thanks. 

Comment by scasper on Takeaways from the Mechanistic Interpretability Challenges · 2023-06-10T18:53:23.081Z · LW · GW

There are existing tools like lucid/lucent, captum, transformerlens, and many others that make it easy to use certain types of interpretability tools. But there is no standard, broad interpretability coding toolkit. Given the large number of interpretability tools and how quickly methods become obsolete, I don't expect one. 

Comment by scasper on Takeaways from the Mechanistic Interpretability Challenges · 2023-06-10T18:50:40.878Z · LW · GW

Thoughts of mine on this are here. In short, I have argued that toy problems, cherry-picking models/tasks, and a lack of scalability have contributed to mechanistic interpretability being relatively unproductive. 

Comment by scasper on Advice for Entering AI Safety Research · 2023-06-02T21:16:03.508Z · LW · GW

I think not. Maybe circuits-style mechanistic interpretability is though. I generally wouldn't try dissuading people from getting involved in research on most AIS things. 

Comment by scasper on EIS IX: Interpretability and Adversaries · 2023-05-25T16:26:08.930Z · LW · GW

We talked about this over DMs, but I'll post a quick reply for the rest of the world. Thanks for the comment. 

A lot of how this is interpreted depends on what exact definition of superposition one uses and whether it applies to entire networks or single layers. But a key thing I want to highlight is that if a layer represents a certain amount of information about an example, then the layer must have more information per neuron if it's thin than if it's wide. And that is the point I think the Huang paper helps to make. The fact that deep and thin networks tend to be more robust suggests that representing information more densely w.r.t. neurons in a layer does not make these networks less robust than wide, shallow nets. 

Comment by scasper on EIS IX: Interpretability and Adversaries · 2023-05-25T16:20:24.130Z · LW · GW

Thanks, +1 to the clarification value of this comment. I appreciate it. I did not have the tied weights in mind when writing this. 

Comment by scasper on EIS V: Blind Spots In AI Safety Interpretability Research · 2023-05-05T19:15:13.798Z · LW · GW

Thanks for the comment.

In general I think that having a deep understanding of small-scale mechanisms can pay off in many different and hard-to-predict ways.

This seems completely plausible to me. But I think that it's a little hand-wavy. In general, I perceive the interpretability agendas that don't involve applied work to be this way. Also, few people would dispute that basic insights, to the extent that they are truly explanatory, can be valuable. But I think it is at least very non-obvious that they would be differentially useful for safety. 

there are a huge number of cases in science where solving toy problems has led to theories that help solve real-world problems.

No qualms here. But (1) the point about program synthesis/induction/translation suggests that the toy problems are fundamentally more tractable than the real ones. Analogously, imagine claiming that having humans write and study simple algorithms for search, modular addition, etc. should be part of an agenda for program synthesis. (2) At some point, the toy work should lead to competitive engineering work, and I think there has not been a clear trend toward this in the past 6 years with the circuits agenda. 

I can kinda see the intuition here, but could you explain why we shouldn't expect this to generalize?

Thanks for the question. It might generalize. My intended point with the Ramanujan paper is that a subnetwork seeming to do something in isolation does not mean that it does that thing in context. Ramanujan et al. weren't interpreting networks; they were just training them. So the underlying subnetworks may generalize well, but in that case, this is not interpretability work any more than gradient-based training of a sparse network is. 

Comment by scasper on GPT-4 is easily controlled/exploited with tricky decision theoretic dilemmas. · 2023-04-14T20:03:48.140Z · LW · GW

I just went by what it said. But I agree with your point. It's probably best modeled as a predictor in this case -- not an agent. 

Comment by scasper on Latent Adversarial Training · 2023-04-12T00:31:55.582Z · LW · GW

In general, I think not. The agent could only make this happen to the extent that its internal activations were known to it and could be actively manipulated by it. This is not impossible, but gradient hacking is a significant challenge. In most learning formalisms, such as ERM or solving MDPs, the model's internals are not modeled as part of the actual algorithm. They're just implementational substrate. 

Comment by scasper on Penalize Model Complexity Via Self-Distillation · 2023-04-09T22:25:11.880Z · LW · GW

One idea that comes to mind is to see whether a chatbot that is vulnerable to DAN-type prompts could be made robust to them by self-distillation on non-DAN-type prompts. 

I'd also really like to see if self-distillation or similar could be used to more effectively scrub away undetectable trojans. https://arxiv.org/abs/2204.06974 
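
Here is a rough sketch of the kind of self-distillation loop I have in mind. The code is mine and only illustrative: it assumes HuggingFace-style models whose forward pass returns .logits, batches given as dicts of tensors, and a re-initialized student (my choice here, since a student that starts as an exact copy would already match the teacher on benign data and nothing would get scrubbed).

```python
import copy
import torch
import torch.nn.functional as F

def self_distill(model, benign_batches, lr=1e-5, temperature=2.0):
    # Sketch: train a re-initialized copy of the model to match the original's output
    # distribution on benign prompts only, hoping that behaviors never exercised by
    # that data (DAN-style responses, trojan triggers) are not re-learned.
    teacher = model.eval()
    student = copy.deepcopy(model)
    for module in student.modules():
        if hasattr(module, "reset_parameters"):
            module.reset_parameters()  # start fresh so unwanted behavior must be re-learned
    student.train()
    opt = torch.optim.AdamW(student.parameters(), lr=lr)

    for batch in benign_batches:  # each batch: dict of input tensors (e.g., input_ids)
        with torch.no_grad():
            t_logits = teacher(**batch).logits / temperature
        s_logits = student(**batch).logits / temperature
        loss = F.kl_div(
            F.log_softmax(s_logits, dim=-1),
            F.softmax(t_logits, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student
```

One could then probe the student with DAN-type prompts or candidate trojan triggers to see how much of the unwanted behavior survives.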

Comment by scasper on Penalize Model Complexity Via Self-Distillation · 2023-04-07T16:23:29.982Z · LW · GW

I agree with this take. In general, I would like to see self-distillation, distillation in general, or other network compression techniques be studied more thoroughly for de-agentifying, de-backdooring, and robustifying networks. I think this would work pretty well and would probably be pretty tractable to make progress on. 
 

Comment by scasper on EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety · 2023-03-29T19:03:43.270Z · LW · GW

I buy this value -- FV can augment exemplars. And I have never heard anyone say that FV is just better than exemplars. Instead, I have heard the point that FV should be used alongside exemplars. I think these two things make a good case for their value. But I still believe that more rigorous task-based evaluation and less reliance on intuition would have made for a much stronger approach than what happened.

Comment by scasper on EIS XI: Moving Forward · 2023-03-09T17:40:03.538Z · LW · GW

Thanks! Fixed. 
https://arxiv.org/abs/2210.04610 

Comment by scasper on Introducing Leap Labs, an AI interpretability startup · 2023-03-07T22:54:13.901Z · LW · GW

Thanks. 

Are you concerned about AI risk from narrow systems of this kind?

No. Am I concerned about risks from methods that work for this in narrow AI? Maybe. 

This seems quite possibly useful, and I think I see what you mean. My confusion is largely from my initial assumption that the focus of this specific point directly involved existential AI safety and from the word choice of "backbone" which I would not have used. I think we're on the same page. 
 

Comment by scasper on Introducing Leap Labs, an AI interpretability startup · 2023-03-07T18:08:59.195Z · LW · GW

Thanks for the post. I'll be excited to watch what happens. Feel free to keep me in the loop. Some reactions:

We must grow interpretability and AI safety in the real world.

Strong +1 to working on more real-world-relevant approaches to interpretability. 

Regulation is coming – let’s use it.

Strong +1 as well. Working on incorporating interpretability into regulatory frameworks seems neglected by the AI safety interpretability community in practice. This does not seem to be the focus of work on internal eval strategies, but AI safety seems unlikely to be something that has a once-and-for-all solution, so governance seems to matter a lot in the likely case of a future with highly-prolific TAI. And because of the pace of governance, work now to establish concern, offices, precedent, case law, etc. seems uniquely key. 

Speed potentially transformative narrow domain systems. AI for scientific progress is an important side quest. Interpretability is the backbone of knowledge discovery with deep learning, and has huge potential to advance basic science by making legible the complex patterns that machine learning models identify in huge datasets.

I do not see the reasoning or motivation for this, and it seems possibly harmful.

First, developing basic insights is clearly not just an AI safety goal. It's an alignment/capabilities goal. And as such, the effects of this kind of thing are not robustly good. They are heavy-tailed in both directions. This seems like possible safety washing. But to be fair, this is a critique I have of a ton of AI alignment work including some of my own. 

Second, I don't know of any examples of gaining particularly useful domain knowledge from interpretability-related work in deep learning, other than maybe the predictiveness of non-robust features. Another possible example could be using deep learning to find new algorithms for things like matrix multiplication, but that isn't really "interpretability". Do you have other examples in mind? Progress in the last 6 years on reverse-engineering nontrivial systems has seemed tenuous at best. 

So I'd be interested in hearing more about whether/how you expect this one type of work to be robustly good and what is meant by "Interpretability is the backbone of knowledge discovery with deep learning."
 

Comment by scasper on EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety · 2023-03-06T17:12:02.766Z · LW · GW

I do not worry a lot about this. It would be a problem. But some methods are model-agnostic and would transfer fine. Some other methods have close analogs for other architectures. For example, ROME is specific to transformers, but causal tracing and rank one editing are more general principles that are not. 
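
For concreteness, the rank-one editing principle can be written (in a simplified form that drops ROME's covariance weighting, so this is a sketch rather than ROME's actual update) as a rank-one change to a weight matrix $W$ that maps a chosen key $k_*$ to a chosen value $v_*$:

```latex
W' \;=\; W \;+\; \frac{(v_* - W k_*)\, k_*^\top}{k_*^\top k_*},
\qquad \text{so that} \qquad
W' k_* \;=\; W k_* + (v_* - W k_*) \;=\; v_* .
```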

Comment by scasper on EIS VIII: An Engineer’s Understanding of Deceptive Alignment · 2023-02-26T00:23:33.293Z · LW · GW

Thanks for the comment. I appreciate how thorough and clear it is. 

Knowing "what deception looks like" - the analogue of knowing the target class of a trojan in a classifier - is a problem.

Agreed. This totally might be the most important part of combatting deceptive alignment. I think of it as somewhat separate from what diagnostic approaches like MI are equipped to do. Knowing what deception looks like seems more of an outer alignment problem, while knowing what will make the model behave badly even if it seems to be aligned is more of an inner one. 

Training a lot of model with trojans and a lot of trojan-free models, then training a classifier, seems to work pretty well for detecting trojans.

+1, but this seems difficult to scale. 

Sometimes the features related to the trojan have different statistics than the features a NN uses to do normal classification.

+1, see https://arxiv.org/abs/2206.10673. It seems like trojans inserted crudely via data poisoning may be easy to detect using heuristics that may not be useful for other insidious flaws. 

(e.g. detecting an asteroid heading towards the earth)

This would be anomalous behavior triggered by a rare event, yes. I agree it shouldn't be called deceptive. I don't think my definition of deceptive alignment applies to this because my definition requires that the model does something we don't want it to. 

Rather than waiting for one specific trigger, a deceptive AI might use its model of the world more gradually.

Strong +1. This points out a difference between trojans and deception. I'll add this to the post. 

This seems like just asking for trouble, and I would much rather we went back to the drawing board, understood why we were getting deceptive behavior in the first place, and trained an AI that wasn't trying to do bad things.

+1

Thanks!

Comment by scasper on EIS V: Blind Spots In AI Safety Interpretability Research · 2023-02-22T02:17:42.222Z · LW · GW

Thanks. See also EIS VIII.

Could you give an example of a case of deception that is quite unlike a trojan? Maybe we have different definitions. Maybe I'm not accounting for something. Either way, it seems useful to figure out the disagreement.  

Comment by scasper on EIS X: Continual Learning, Modularity, Compression, and Biological Brains · 2023-02-21T18:11:43.973Z · LW · GW

Thanks! I am going to be glad to have this post around to refer to in the future. I'll probably do it a lot. Glad you have found some of it interesting. 

Comment by scasper on EIS VIII: An Engineer’s Understanding of Deceptive Alignment · 2023-02-19T20:35:53.972Z · LW · GW

Thanks.

Comment by scasper on EIS VII: A Challenge for Mechanists · 2023-02-19T00:12:23.034Z · LW · GW

Yes, it does show the ground truth.

The goal of the challenge is not to find the labels, but to find the program that explains them using MI tools. In the post, when I say labeling "function", I really mean labeling "program" in this case.