"The Urgency of Interpretability" (Dario Amodei)
post by RobertM (T3t) · 2025-04-27T04:31:50.090Z · LW · GW · 2 comments
This is a link post for https://www.darioamodei.com/post/the-urgency-of-interpretability
Dario Amodei posted a new essay titled "The Urgency of Interpretability" a couple days ago.
Some excerpts I think are worth highlighting:
The nature of AI training makes it possible that AI systems will develop, on their own, an ability to deceive humans and an inclination to seek power in a way that ordinary deterministic software never will; this emergent nature also makes it difficult to detect and mitigate such developments[1]. But by the same token, we’ve never seen any solid evidence in truly real-world scenarios of deception and power-seeking[2] because we can’t “catch the models red-handed” thinking power-hungry, deceitful thoughts.
One might be forgiven for forgetting about Bing Sydney [LW · GW] as an obvious example of "power-seeking" AI behavior, given how long ago that was, but lying? Given the very recent releases of Sonnet 3.7 and OpenAI's o3, and their much-remarked-upon propensity for reward hacking and lying to users in self-aware ways, I think we are well past the point where we can credibly claim not to have seen "real-world scenarios of deception". I could try to imagine ways to make those sentences sound more reasonable, but I don't believe I'm omitting any context that would change how readers should understand them.
There are other more exotic consequences of opacity, such as that it inhibits our ability to judge whether AI systems are (or may someday be) sentient and may be deserving of important rights. This is a complex enough topic that I won’t get into it in detail, but I suspect it will be important in the future.
Interesting to note that Dario feels comfortable bringing up AI welfare concerns.
Recently, we did an experiment where we had a “red team” deliberately introduce an alignment issue into a model (say, a tendency for the model to exploit a loophole in a task) and gave various “blue teams” the task of figuring out what was wrong with it. Multiple blue teams succeeded; of particular relevance here, some of them productively applied interpretability tools during the investigation. We still need to scale these methods, but the exercise helped us gain some practical experience using interpretability techniques to find and address flaws in our models.
My understanding is that mech interp techniques didn't provide [LW · GW] any obvious advantages over other techniques in that experiment, and that SAEs underperform much simpler & cheaper techniques, to the point where GDM is explicitly deprioritizing [LW · GW] them as a research direction.
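(For readers who haven't followed this debate: a "simpler & cheaper technique" here typically means something like a supervised linear probe trained directly on activations, versus an unsupervised sparse autoencoder whose features you then inspect. The sketch below is a toy, fully synthetic illustration of that comparison — an assumed setup with made-up data and sizes, not the actual GDM or Anthropic evaluation.)

```python
# Toy sketch only: synthetic "activations" containing one known binary concept,
# compared on (a) a cheap supervised linear probe and (b) a small sparse
# autoencoder (SAE). All names and numbers are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_sae, n = 64, 256, 2000

# Synthetic activations: a hidden binary concept shifts them along a fixed direction.
concept = torch.randint(0, 2, (n,)).float()
direction = torch.randn(d_model)
acts = torch.randn(n, d_model) + concept[:, None] * direction

# (a) Cheap baseline: supervised linear probe trained to read the concept directly.
probe = nn.Linear(d_model, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
for _ in range(200):
    loss = nn.functional.binary_cross_entropy_with_logits(
        probe(acts).squeeze(-1), concept)
    opt.zero_grad(); loss.backward(); opt.step()

# (b) SAE: unsupervised sparse dictionary over the same activations.
class SAE(nn.Module):
    def __init__(self, d_in, d_hid):
        super().__init__()
        self.enc, self.dec = nn.Linear(d_in, d_hid), nn.Linear(d_hid, d_in)

    def forward(self, x):
        feats = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(feats), feats

sae = SAE(d_model, d_sae)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for _ in range(500):
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()  # reconstruction + L1 sparsity
    opt.zero_grad(); loss.backward(); opt.step()

# Crude comparison (different metrics, shown only for illustration): probe accuracy
# vs. how well the single best SAE feature correlates with the known concept.
with torch.no_grad():
    probe_acc = ((probe(acts).squeeze(-1) > 0).float() == concept).float().mean()
    _, feats = sae(acts)
    corrs = torch.stack([
        torch.corrcoef(torch.stack([feats[:, i], concept]))[0, 1].abs().nan_to_num()
        for i in range(d_sae)])
    print(f"probe accuracy: {probe_acc:.2f}; best single SAE feature |corr|: {corrs.max():.2f}")
```

The relevant point is just the cost asymmetry: the probe is a single linear layer trained with labels for the exact behavior you care about, while the SAE has to learn an entire feature dictionary and then hope one of its features happens to track that behavior.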
On one hand, recent progress—especially the results on circuits and on interpretability-based testing of models—has made me feel that we are on the verge of cracking interpretability in a big way. Although the task ahead of us is Herculean, I can see a realistic path towards interpretability being a sophisticated and reliable way to diagnose problems in even very advanced AI—a true “MRI for AI”. In fact, on its current trajectory I would bet strongly in favor of interpretability reaching this point within 5-10 years.
On the other hand, I worry that AI itself is advancing so quickly that we might not have even this much time. As I’ve written elsewhere, we could have AI systems equivalent to a “country of geniuses in a datacenter” as soon as 2026 or 2027. I am very concerned about deploying such systems without a better handle on interpretability. These systems will be absolutely central to the economy, technology, and national security, and will be capable of so much autonomy that I consider it basically unacceptable for humanity to be totally ignorant of how they work.
The timelines seem aggressive to me, but I think it's good to have the last sentence spelled out.
If it helps, Anthropic will be trying to apply interpretability commercially to create a unique advantage, especially in industries where the ability to provide an explanation for decisions is at a premium. If you are a competitor and you don’t want this to happen, you too should invest more in interpretability!
Some people predicted [LW · GW] that interpretability would be dual-use for capabilities (and that this was a downside that had to be accounted for). I think I was somewhat skeptical at the time, mostly because I didn't expect to see traction on interpretability techniques good enough to move the needle very much compared to more direct capabilities research. But to the extent that you trust Dario's research taste, this might be an update.
I don't want to sound too down on the essay; I think it's much better than the last one in a bunch of ways, not the least of which is that it directly acknowledges misalignment concerns (though still shies away from mentioning x-risk). There's also an interesting endorsement of a "conditional pause" (framed differently).
I recommend reading the essay in full.
- ^ Copied footnote: "You can of course try to detect these risks by simply interacting with the models, and we do this in practice. But because deceit is precisely the behavior we’re trying to find, external behavior is not reliable. It’s a bit like trying to determine if someone is a terrorist by asking them if they are a terrorist—not necessarily useless, and you can learn things by how they answer and what they say, but very obviously unreliable."
- ^ Copied footnote: "I’ll probably describe this in more detail in a future essay, but there are a lot of experiments (many of which were done by Anthropic) showing that models can lie or deceive under certain circumstances when their training is guided in a somewhat artificial way. There is also evidence of real-world behavior that looks vaguely like “cheating on the test”, though it’s more degenerate than it is dangerous or harmful. What there isn’t is evidence of dangerous behaviors emerging in a more naturalistic way, or of a general tendency or general intent to lie and deceive for the purposes of gaining power over the world. It is the latter point where seeing inside the models could help a lot."
2 comments
comment by Ben Pace (Benito) · 2025-04-27T04:53:24.113Z · LW(p) · GW(p)
I couldn’t get two sentences in without hitting propaganda, so I set it aside. But I’m sure it’s of great political relevance.
↑ comment by Davidmanheim · 2025-04-27T07:33:39.412Z · LW(p) · GW(p)
Quick take: it's focused on interpretability as a way to solve prosaic alignment, ignoring the fact that prosaic alignment is clearly not scalable to the types of systems they are actively planning to build. (And it seems to actively embrace the fact that interpretability is a capabilities advantage in the short term, but pretends that it is a safety thing, as if the two are not at odds with each other when engaged in racing dynamics.)