Thanks
To use your argument, what does MI actually do here?
The inspiration, I would suppose. Analogous to the type claimed in the HHH and hyena papers.
And yes to your second point.
Nice post. I think it can serve as a good example of how the hand-waviness about how interpretability can help us do good things with AI cuts both ways.
I'm particularly worried about MI people studying instances of when LLMs do and don't express types of situational awareness and then someone using these insights to give LLMs much stronger situational awareness abilities.
Lastly,
On the other hand, interpretability research is probably crucial for AI alignment.
I don't think this is true, and I especially hope it is not true, because (1) mechanistic interpretability has still failed to do impressive things by reverse engineering networks, and (2) it is entirely fungible from a safety standpoint with other techniques that often do better for various things.
Several people seem to be coming to similar conclusions recently (e.g., this recent post).
I'll add that I have as well and wrote a sequence about it :)
Thanks for the reply. This sounds reasonable to me. On the last point, I tried my best to do that here, and I think there is a relatively high ratio of solid explanations to weak ones.
Overall, I think that the hopes you have for interpretability research are good, and I hope it works out. One of my biggest concerns, though, is that people seem to have been making similar claims with little change for 7+ years, and I just don't think the wins from this research have been commensurate with the effort put into it. I assume this is expected under your views, so it is probably not a crux.
I get the impression of a certain motte-and-bailey quality in this comment and similar arguments. At a high level, the notion of better understanding what neural networks are doing would be great. The problem, though, is that most state-of-the-art interpretability research does not seem to be doing a good job of this in a way that will be useful for safety anytime soon. In that sense, I think this comment talks past the points that this post is trying to make.
I think this is very exciting, and I'll look forward to seeing how it goes!
Thanks, we will consider adding each of these. We appreciate that you took a look and took the time to help suggest these!
No, I don't think the core advantages of transparency are really unique to RLHF, but in the paper, we list certain things that are specific to RLHF which we think should be disclosed. Thanks.
Thanks, and +1 to adding the resources. Also, Charbel-Raphael, who authored the in-depth post, is one of the authors of this paper! That post in particular was something we paid attention to during the design of the paper.
This is exciting to see. I think this solution is impressive, and I think the case for the structure you find is compelling. It's also nice that this solution goes a little further in one aspect than the previous one. The analysis with the bars gets a little closer to a question I have had since the last solution:
My one critique of this solution is that I would have liked to see an understanding of why the transformer only seems to make mistakes near the parts of the domain where there are curved boundaries between regimes (see fig above with the colored curves). Meanwhile, the network did a great job of learning the periodic part of the solution that led to irregularly-spaced horizontal bars. Understanding why this is the case seems interesting but remains unsolved.
I think this work gives a bit more of a granular idea of what might be happening. And I think it's an interesting foil to the other one. Both came up with some fairly different pictures for the same process. The differences between these two projects seem like an interesting case study in MI. I'll probably refer to this a lot in the future.
Overall, I think this is great, and although the challenge is over, I'm adding this to the GitHub README. And if you let me know a high-impact charity you'd like to support, I'll send $500 to it as a similar prize for the challenge :)
Sounds right, but the problem seems to be semantic. If understanding is taken to mean a human's comprehension, then I think this is perfectly right. But since the method is mechanistic, it seems difficult nonetheless.
Thanks -- I agree that this seems like an approach worth doing. I think that at CHAI and/or Redwood there is a little bit of work at least related to this, but don't quote me on that. In general, it seems like if you have a model and then a smaller distilled/otherwise-compressed version of it, there is a lot you can do with them from an alignment perspective. I am not sure how much work has been done in the anomaly detection literature that involves distillation/compression.
I don't work on this, so grain of salt.
But wouldn't this take the formal out of formal verification? If so, I am inclined to think about this as a form of ambitious mechanistic interpretability.
I think this is a good point, thanks.
There are existing tools like lucid/lucent, captum, transformerlens, and many others that make it easy to use certain types of interpretability tools. But there is no standard, broad interpretability coding toolkit. Given the large number of interpretability tools and how quickly methods become obsolete, I don't expect one.
Thoughts of mine on this are here. In short, I have argued that toy problems, cherry-picking models/tasks, and a lack of scalability have contributed to mechanistic interpretability being relatively unproductive.
I think not. Maybe circuits-style mechanistic interpretability is though. I generally wouldn't try dissuading people from getting involved in research on most AIS things.
We talked about this over DMs, but I'll post a quick reply for the rest of the world. Thanks for the comment.
A lot of how this is interpreted depends on the exact definition of superposition one uses and whether it applies to entire networks or single layers. But a key thing I want to highlight is that if a layer represents a certain set amount of information about an example, then the layer must carry more information per neuron if it is thin than if it is wide. That, I think, is the point the Huang paper helps to make. The fact that deep, thin networks tend to be more robust suggests that representing information more densely with respect to the neurons in a layer does not make these networks less robust than wide, shallow ones.
Thanks, +1 to the clarification value of this comment. I appreciate it. I did not have the tied weights in mind when writing this.
Thanks for the comment.
In general I think that having a deep understanding of small-scale mechanisms can pay off in many different and hard-to-predict ways.
This seems completely plausible to me, but I think it's a little hand-wavy. In general, I perceive the interpretability agendas that don't involve applied work to be this way. Also, few people would dispute that basic insights, to the extent that they are truly explanatory, can be valuable. But I think it is at least very non-obvious that they would be differentially useful for safety.
there are a huge number of cases in science where solving toy problems has led to theories that help solve real-world problems.
No qualms here. But (1) the point about program synthesis/induction/translation suggests that the toy problems are fundamentally more tractable than real ones. Analogously, imagine claiming that having humans write and study simple algorithms for search, modular addition, etc. is part of an agenda for program synthesis. (2) At some point the toy work should lead to competitive engineering work, and I think there has not been a clear trend toward this in the past 6 years with the circuits agenda.
I can kinda see the intuition here, but could you explain why we shouldn't expect this to generalize?
Thanks for the question. It might generalize. My intended point with the Ramanujan paper is that a subnetwork seeming to do something in isolation does not mean that it does that thing in context. Ramanujan et al. weren't interpreting networks; they were just training them. So the underlying subnetworks may generalize well, but in that case, this is not interpretability work any more than gradient-based training of a sparse network is.
I just went by what it said. But I agree with your point. It's probably best modeled as a predictor in this case -- not an agent.
In general, I think not. The agent could only make this happen actively to the extent that its internal activations were known to it and could be actively manipulated by it. This is not impossible, but gradient hacking is a significant challenge. In most learning formalisms, such as ERM or solving MDPs, the model's internals are not modeled as part of the actual algorithm. They're just implementational substrate.
One idea that comes to mind is to see whether a chatbot that is vulnerable to DAN-type prompts could be made robust to them by self-distillation on non-DAN-type prompts.
I'd also really like to see if self-distillation or similar could be used to more effectively scrub away undetectable trojans. https://arxiv.org/abs/2204.06974
I agree with this take. In general, I would like to see self-distillation, distillation in general, and other network compression techniques studied more thoroughly for de-agentifying, de-backdooring, and robustifying networks. I think this would work pretty well and would probably be pretty tractable to make progress on.
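For concreteness, here is a minimal sketch of the kind of self-distillation I have in mind, written in PyTorch. Everything here (the teacher model, the clean data loader, the hyperparameters) is a placeholder, and this is just a bare-bones Hinton-style soft-label distillation loop, not a recipe I'm claiming works:

```python
# Minimal sketch of self-distillation on clean data.
# All names (teacher, clean_loader, etc.) are hypothetical placeholders.
import copy
import torch
import torch.nn.functional as F

def self_distill(teacher, clean_loader, epochs=1, lr=1e-4, temperature=2.0):
    """Train a copy of `teacher` to match its own soft outputs on clean data.

    The hope is that behaviors never exercised by the clean data (e.g. a trojan
    or a DAN-style jailbreak) are not faithfully transferred to the student.
    """
    student = copy.deepcopy(teacher)  # could also re-initialize for a stronger "scrub"
    teacher.eval()
    student.train()
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)

    for _ in range(epochs):
        for inputs in clean_loader:  # batches of clean (non-trigger) inputs
            with torch.no_grad():
                teacher_logits = teacher(inputs)
            student_logits = student(inputs)
            # Soft-label KL distillation loss.
            loss = F.kl_div(
                F.log_softmax(student_logits / temperature, dim=-1),
                F.softmax(teacher_logits / temperature, dim=-1),
                reduction="batchmean",
            ) * temperature ** 2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```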
I buy this value -- FV can augment exemplars. And I have never heard anyone say that FV is simply better than exemplars; instead, I have heard the point that FV should be used alongside exemplars. These two things make a good case for their value. But I still believe that more rigorous, task-based evaluation and less reliance on intuition would have made for a much stronger approach than what happened.
Thanks! Fixed.
https://arxiv.org/abs/2210.04610
Thanks.
Are you concerned about AI risk from narrow systems of this kind?
No. Am I concerned about risks from methods that work for this in narrow AI? Maybe.
This seems quite possibly useful, and I think I see what you mean. My confusion is largely from my initial assumption that the focus of this specific point directly involved existential AI safety and from the word choice of "backbone" which I would not have used. I think we're on the same page.
Thanks for the post. I'll be excited to watch what happens. Feel free to keep me in the loop. Some reactions:
We must grow interpretability and AI safety in the real world.
Strong +1 to working on more real-world-relevant approaches to interpretability.
Regulation is coming – let’s use it.
Strong +1 as well. Working on incorporating interpretability into regulatory frameworks seems neglected by the AI safety interpretability community in practice. This does not seem to be the focus of work on internal eval strategies, but AI safety seems unlikely to be something that has a once-and-for-all solution, so governance seems to matter a lot in the likely case of a future with highly-prolific TAI. And because of the pace of governance, work now to establish concern, offices, precedent, case law, etc. seems uniquely key.
Speed potentially transformative narrow domain systems. AI for scientific progress is an important side quest. Interpretability is the backbone of knowledge discovery with deep learning, and has huge potential to advance basic science by making legible the complex patterns that machine learning models identify in huge datasets.
I do not see the reasoning or motivation for this, and it seems possibly harmful.
First, developing basic insights is clearly not just an AI safety goal. It's an alignment/capabilities goal. And as such, the effects of this kind of thing are not robustly good. They are heavy-tailed in both directions. This seems like possible safety washing. But to be fair, this is a critique I have of a ton of AI alignment work including some of my own.
Second, I don't know of any examples of gaining particularly useful domain knowledge from interpretability-related work in deep learning, other than maybe the predictiveness of non-robust features. Another possible example could be using deep learning to find new algorithms for things like matrix multiplication, but this isn't really "interpretability". Do you have other examples in mind? Progress in the last 6 years on reverse-engineering nontrivial systems has seemed tenuous at best.
So I'd be interested in hearing more about whether/how you expect this one type of work to be robustly good and what is meant by "Interpretability is the backbone of knowledge discovery with deep learning."
I do not worry a lot about this. It would be a problem, but some methods are model-agnostic and would transfer fine, and other methods have close analogs for other architectures. For example, ROME is specific to transformers, but causal tracing and rank-one editing are more general principles that are not architecture-specific.
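As a rough illustration of why I think of rank-one editing as a general principle rather than a transformer-specific one, here is a minimal sketch of a rank-one update to an arbitrary linear layer. This is only the core linear-algebra idea, not the actual ROME procedure (which chooses the key and value vectors carefully and uses a covariance-weighted update), and all tensors here are random placeholders:

```python
# Minimal sketch of a rank-one edit to any linear layer: after the update,
# the layer maps the chosen key vector k to the desired value v*.
import torch

def rank_one_edit(weight: torch.Tensor, k: torch.Tensor, v_star: torch.Tensor) -> torch.Tensor:
    """Return an edited copy of `weight` (shape [d_out, d_in]) such that
    edited_weight @ k == v_star, via a single rank-one update."""
    residual = v_star - weight @ k               # what the current layer gets wrong on k
    update = torch.outer(residual, k) / (k @ k)  # rank-one correction
    return weight + update

# Usage sketch with placeholder tensors:
W = torch.randn(8, 16)
k = torch.randn(16)
v_star = torch.randn(8)
W_edited = rank_one_edit(W, k, v_star)
assert torch.allclose(W_edited @ k, v_star, atol=1e-4)
```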
Thanks for the comment. I appreciate how thorough and clear it is.
Knowing "what deception looks like" - the analogue of knowing the target class of a trojan in a classifier - is a problem.
Agreed. This might well be the most important part of combatting deceptive alignment. I think of it as somewhat separate from what diagnostic approaches like MI are equipped to do. Knowing what deception looks like seems more of an outer alignment problem, while knowing what will make the model behave badly even if it seems to be aligned is more of an inner one.
Training a lot of models with trojans and a lot of trojan-free models, then training a classifier, seems to work pretty well for detecting trojans.
+1, but this seems difficult to scale.
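For concreteness, here is a rough sketch of the meta-classifier idea from the quoted comment. The feature choice and the state-dict key are placeholders I made up; the point is just that you need a large population of trained models as training data, which is where the scaling problem comes in:

```python
# Sketch of a meta-classifier that tells trojaned from clean networks.
# The feature extraction and state-dict key are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def weight_features(state_dict):
    """Crude per-model features: simple statistics of one layer's weights."""
    w = state_dict["classifier.weight"].detach().cpu().numpy().ravel()  # placeholder key
    return np.array([w.mean(), w.std(), np.abs(w).max(), (w ** 2).sum()])

def train_meta_classifier(state_dicts, labels):
    """`labels[i]` is 1 if model i was trained with a trojan, else 0."""
    X = np.stack([weight_features(sd) for sd in state_dicts])
    X_train, X_test, y_train, y_test = train_test_split(X, np.asarray(labels), test_size=0.25)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))
    return clf
```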
Sometimes the features related to the trojan have different statistics than the features a NN uses to do normal classification.
+1, see https://arxiv.org/abs/2206.10673. It seems like trojans inserted crudely via data poisoning may be easy to detect using heuristics that may not be useful for other insidious flaws.
(e.g. detecting an asteroid heading towards the earth)
This would be anomalous behavior triggered by a rare event, yes. I agree it shouldn't be called deceptive. I don't think my definition of deceptive alignment applies to this because my definition requires that the model does something we don't want it to.
Rather than waiting for one specific trigger, a deceptive AI might use its model of the world more gradually.
Strong +1. This points out a difference between trojans and deception. I'll add this to the post.
This seems like just asking for trouble, and I would much rather we went back to the drawing board, understood why we were getting deceptive behavior in the first place, and trained an AI that wasn't trying to do bad things.
+1
Thanks!
Thanks. See also EIS VIII.
Could you give an example of a case of deception that is quite unlike a trojan? Maybe we have different definitions. Maybe I'm not accounting for something. Either way, it seems useful to figure out the disagreement.
Thanks! I am going to be glad to have this post around to refer to in the future. I'll probably do it a lot. Glad you have found some of it interesting.
thanks
Yes, it does show the ground truth.
The goal of the challenge is not to find the labels, but to find the program that explains them using MI tools. In the post, when I say labeling "function", I really mean labeling "program" in this case.
The MNIST CNN was trained only on the 50k training examples.
I did not guarantee that the models had perfect train accuracy. I don't believe they did.
I think that any interpretability tools are allowed. Saliency maps are fine. But to 'win,' a submission needs to come with a mechanistic explanation and sufficient evidence for it. It is possible to beat this challenge by using non-mechanistic techniques to figure out the labeling function and then using that knowledge to find the mechanisms by which the networks classify the data.
At the end of the day, I (and possibly Neel) will have the final say in things.
Thanks :)
Thanks for the comment and pointing these things out.
---
I don't see it as necessarily a bad thing if two groups of researchers are working on the same idea under different names.
Certainly it's not necessarily a good thing either. I would posit that isolation is usually not good. I can personally attest to being confused and limited by the difference in terminology here. And I think that when it comes to intrinsic interpretability work in particular, the disentanglement literature has produced a number of methods of value while TAISIC has not.
I don't know what we benefit from in this particular case with polysemanticity, superposition, and entanglement. Do you have a steelman for this more specific to these literatures?
---
In fact it's almost like a running joke in academia that there's always someone grumbling that you didn't cite the right things (their favourite work on this topic, their fellow countryman, them etc.)...
Good point. I would not say that the issue with the feature visualization and zoom-in papers was merely a failure to cite related work. The issue is that they started a line of research that is causing confusion and redundant work. My stance here is based on seeing the isolation between the two types of work as needless.
---
I understand that your take is that it is closer to program synthesis or program induction and that these aren't all the same thing, but in the first subsection of the "TAISIC has reinvented..." section, I'm a little confused why there's no mention of reverse engineering programs from compiled binaries. The analogy with reverse engineering programs is one that MI people have been actively thinking about, writing about, and trying to understand (see, e.g., Olah, and Nanda, in which he consults an expert).
Thanks for pointing out these posts. They discuss an idea similar to MI's dependency on programmatic hypothesis generation, but they don't act on it; they draw analogies instead of providing methods. The thing at the front of my mind when I talk about how TAISIC has not sufficiently engaged with neurosymbolic work is the kind of thing I mentioned in the paragraph about existing work outside of TAISIC. I pasted it below for convenience :)
If MI work is to be more engineering-relevant, we need automated ways of generating candidate programs to explain how neural networks work. The good news is that we don’t have to start from scratch. The program synthesis, induction, and language translation literatures have been around long enough that we have textbooks on them (Gulwani et al., 2017, Qiu, 1999). And there are also notable bodies of work in deep learning that focus on extracting decision trees from neural networks (e.g. Zhang et al., 2019), distilling networks into programs in domain specific languages (e.g. Verma et al., 2018; Verma et al., 2019; Trivedi et al., 2021), and translating neural network architectures into symbolic graphs that are mechanistically faithful (e.g. Ren et al., 2021). These are all automated ways of doing the type of MI work that people in TAISIC want to do. Currently, some of these works (and others in the neurosymbolic literature) seem to be outpacing TAISIC on its own goals.
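As a concrete illustration of the simplest version of one of these ideas (distilling a network into a decision tree that serves as a crude global "program" for its behavior), here is a rough sketch. It is not the method of any of the cited papers, and `net` and the data are placeholders:

```python
# Sketch: fit a small decision tree to a trained network's predictions and
# print its if-then structure. `net` and `X` are hypothetical placeholders.
import numpy as np
import torch
from sklearn.tree import DecisionTreeClassifier, export_text

def distill_to_tree(net, X: np.ndarray, max_depth: int = 4):
    """Fit a shallow decision tree to the network's hard predictions on inputs X."""
    net.eval()
    with torch.no_grad():
        preds = net(torch.as_tensor(X, dtype=torch.float32)).argmax(dim=-1).cpu().numpy()
    tree = DecisionTreeClassifier(max_depth=max_depth).fit(X, preds)
    fidelity = tree.score(X, preds)  # how well the tree imitates the network on these inputs
    print(f"fidelity to the network: {fidelity:.2%}")
    print(export_text(tree))         # human-readable if-then rules
    return tree
```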
I see the point of this post. No arguments with the existence of productive reframing. But I do not think this post makes a good case for reframing being robustly good. Obviously, it can be bad too. And for the specific cases discussed in the post, the post you linked doesn't make me think "Oh, these are reframed ideas, so good -- glad we are doing redundant work in isolation."
For example with polysemanticity/superposition I think that TAISIC's work has created generational confusion and insularity that are harmful. And I think TAISIC's failure to understand that MI means doing program synthesis/induction/language-translation has led to a lot of unproductive work on toy problems using methods that are unlikely to scale.
This seems interesting. I do not know of steelmen for isolation, renaming, reinventing, etc. What is yours?
Thanks. I'll talk in some depth about causal scrubbing in two of the upcoming posts which narrow down discussion specifically to AI safety work. I think it's a highly valuable way of measuring how well a hypothesis seems to explain a network, but there are some pitfalls with it to be aware of.
Thanks, but I'm asking more about why you chose to study this particular thing instead of something else entirely. For example, why not study "this" versus "that" completions or any number of other simple things in the language model?
How was the ' a' v. ' an' selection task selected? It seems quite convenient to probe for and also the kind of thing that could result from p-hacking over a set of similar simple tasks.
Correct. I intended the 3 paragraphs in that comment to be separate thoughts. Sorry.
There are not that many that I don't think are fungible with interpretability work :)
But I would describe most outer alignment work to be sufficiently different...
Interesting to know that about the plan. I had assumed that REMIX was in large part about getting more people into this type of work, but I'm interested in the conclusions and current views on it. Is there a post reflecting on how it went and what lessons were learned from it?
I think that my personal thoughts on capabilities externalities are reflected well in this post.
I'd also note that this concern isn't unique to interpretability work but applies to alignment work in general. And in comparison to other alignment techniques, I think the downside risks of interpretability tools are most likely lower than those of something like RLHF. Most theories of change for interpretability helping with AI safety involve engineering work at some point, so I would expect most interpretability researchers to have similar attitudes on dual-use concerns.
In general, a tool being engineering-relevant does not imply that it will be competitive for setting a new SOTA on something risky. So when I will talk about engineering relevance in this sequence, I don't have big advancements in mind so much as stuff like fairly simple debugging work.
I think that (1) is interesting. This sounds plausible, but I do not know of any examples of this perspective being fleshed out. Do you know of any posts on this?
Thanks! I discuss in the second post of the sequence why I lump ARC's work in with human-centered interpretability.
There seems to be high variance in the scope of the challenges that Katja has been tackling recently.
There are 4 points of disagreement I have about this post.
First, I think it's fundamentally based on a strawperson.
my fundamental objection is that their specific strategy for delaying AI is not well targeted.
This post provides an argument for not adopting the "neo-luddite" agenda or not directly empowering neo-luddites. It is not an argument against allying with neo-luddites for specific purposes, and I don't know of anyone who has actually advocated for the former. This is not how I would characterize Katja's post.
Second, I think there is an inner strawperson with the example about text-to-image models. From a bird's eye view, I agree with caring very little about these models mimicking humans' artistic styles. But this is not where the vast majority of tangible harm from text-to-image models is likely to come from. I think it will most likely come from non-consensual deepfakes being easy to use for targeted harassment, humiliation, and blackmail. I know you've seen the EA forum post about this because you commented on it. But I'd be interested in seeing a reply to my reply to your comment on the post.
Third, I think that this post fails to consider how certain (most?) regulations that neo-luddites would support could meaningfully slow risky things down. In general, any type of regulation that makes research and development for risky AI technologies harder or less incentivized will in fact slow risky AI progress down. I think that the one example you bring up -- text-to-image models -- is a counterexample to your point. Suppose we pass a bunch of restrictive IP laws that make it more painful to research, develop, and deploy text-to-image models. That would slow down a branch of research that could conceivably be useful for making riskier AI in the future (e.g., multimodal media generators), hinder revenue opportunities for companies that are speeding up risky AI progress, close off this revenue option to possible future companies who might do the same, and establish law, case law, and precedent around generative models that could be repurposed for other types of AI later.
Fourth, I also am not convinced by the specific argument about how indiscriminate regulation could make alignment harder.
Suppose the neo-luddites succeed, and the US congress overhauls copyright law. A plausible consequence is that commercial AI models will only be allowed to be trained on data that was licensed very permissively, such as data that's in the public domain...Right now, if an AI org needs some data that they think will help with alignment, they can generally obtain it, unless that data is private.
This is a nitpick, but I don't actually predict this scenario would pan out. I don't think we'd realistically overhaul copyright law and have the kind of regime with datasets that you describe. But this is probably a question for policy people. There are also neo-luddite solutions that your argument would not apply to--like having legal requirements for companies to make their models "forget" certain content upon request. This would only be a hindrance to the deployer.
Ultimately, though, what matters is not whether something makes certain alignment research harder. It matters how much it makes alignment research harder relative to how much it makes risky research harder. And I'm not convinced that alignment researchers are the ones who are differentially data-hungry. What's a concrete, conceivable story in which something like the hypothetical law you described makes things differentially harder for alignment researchers than for capabilities researchers?