To be legible, evidence of misalignment probably has to be behavioral
post by ryan_greenblatt · 2025-04-15T18:14:53.022Z · LW · GW · 5 comments
One key hope for mitigating risk from misalignment is inspecting the AI's behavior, noticing that it did something egregiously bad [LW · GW], converting this into legible evidence that the AI is seriously misaligned, and then having this trigger some strong and useful response (like spending relatively more resources on safety or undeploying this misaligned AI).
You might hope that (fancy) internals-based techniques (e.g., ELK methods [? · GW] or interpretability) allow us to legibly incriminate misaligned AIs even in cases where the AI hasn't (yet) done any problematic actions despite behavioral red-teaming [LW · GW] (where we try to find inputs on which the AI might do something bad), or when the problematic actions the AI does are so subtle and/or complex that humans can't understand how the action is problematic[1]. That is, you might hope that internals-based methods allow us to legibly incriminate misaligned AIs even when we can't produce behavioral evidence that they are misaligned.
Unfortunately, I think it's unlikely that evidence which comes just from fancy internals-based methods (and can't be supported by human inspection of AI behavior) will be legible and convincing in practice. Suppose that we use some fancy method to train an ELK head to honestly predict what is going to happen or to honestly answer our questions. Then, suppose this head says/predicts "this action would yield very bad consequences in a way which is unlikely to be an accident", but humans can't understand how this action would result in these bad consequences (even after it is explained to them). Even if the internals-based method is extremely well supported theoretically and empirically (which seems quite unlikely), I don't think this would suffice to trigger a strong response by convincing relevant people[2]. This reasoning also applies to "our interpretability method decoded some very problematic deceptive reasoning" or "the deception probe fires in a very suspicious way (e.g., it fires more when we give the AI a fake opportunity to escape or ask it about its alignment research outputs)".
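To make the deception probe example concrete, here is a minimal sketch, assuming a simple linear probe trained on labeled activations and purely synthetic data (none of this is from the post; `d_model`, the firing-rate comparison, and all numbers are hypothetical stand-ins):

```python
# Minimal sketch of a linear "deception probe" (synthetic data, hypothetical setup).
# In practice the activations would come from hooks on a real model's forward pass.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 64  # stand-in for the model's hidden dimension

# Synthetic stand-ins for activations on honest vs. deceptive transcripts.
honest_acts = rng.normal(0.0, 1.0, size=(500, d_model))
deceptive_acts = rng.normal(0.3, 1.0, size=(500, d_model))  # shifted along some direction

X = np.vstack([honest_acts, deceptive_acts])
y = np.array([0] * 500 + [1] * 500)
probe = LogisticRegression(max_iter=1000).fit(X, y)

# Compare how strongly the probe "fires" on benign prompts vs. honeypot prompts
# (e.g., fake opportunities to escape).
benign_acts = rng.normal(0.0, 1.0, size=(200, d_model))
honeypot_acts = rng.normal(0.3, 1.0, size=(200, d_model))
print("mean probe score, benign prompts:  ", probe.predict_proba(benign_acts)[:, 1].mean())
print("mean probe score, honeypot prompts:", probe.predict_proba(honeypot_acts)[:, 1].mean())
```

Even a large gap between these two scores is exactly the kind of internals-only signal that, per the argument above, is unlikely to be legible enough on its own to trigger a strong response.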
More generally, I think that to convince people enough to take strong action, evidence will need to bottom out in some sort of behavioral evidence that (many) humans can understand.[3] (That's not to say solid behavioral evidence will necessarily suffice [LW · GW].)
This isn't to say that internals-based techniques for detecting scheming aren't useful. These techniques could be quite useful via two mechanisms:
- The technique could allow us to find behavioral evidence that would have otherwise been hard to find. For instance, maybe we can use the method to discover an input on which the AI does something egregiously bad (that we can understand). (However, it will be important to legibly avoid applying too much optimization pressure, since you might be able to get any model to look misaligned with enough optimization pressure [LW · GW].) Internals-based techniques could also help humans understand why an action is egregiously bad (e.g., we might be able to use internals-based techniques to effectively do better elicitation for debaters).
- Detection of scheming can be useful via mechanisms other than convincing people to take strong and expensive actions:
- Marginally more buy-in and support for handling misalignment could be helpful.
- If we can detect misalignment, we can study techniques for preventing or removing misalignment by iterating (or even training) against our detectors (see the sketch after this list). The more robust the detector is, the more useful this would be.
- The subset of people who buy this evidence could usefully alter their actions and strategy. For instance, AI company employees implementing countermeasures and utilizing AIs might adjust their approach now that they know the AIs are likely seriously misaligned (though presumably not in a way that greatly increases safety costs). In the case where you know your AI is seriously misaligned, you want to adjust your control protocol, even if you're unable to take much of a hit on (apparent) usefulness[4].
- We might be able to trigger well-enforced preexisting commitments. We could even imagine some sort of legally binding contract which triggers in this case (effectively a bet between people who are more worried about misalignment and those who are less worried).
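To illustrate the "iterating against our detectors" point above, here is a hedged sketch of what such a loop might look like; `detector_score`, `held_out_detector_score`, and `candidate_mitigations` are hypothetical stand-ins rather than any real API:

```python
# Hypothetical sketch: pick the mitigation that the development detector fires on
# least, then check a held-out detector to see whether the improvement transfers
# (rather than just overfitting to the detector we iterated against).
from typing import Callable, List, Tuple


def iterate_against_detector(
    candidate_mitigations: List[Callable[[], object]],   # e.g., tweaked training recipes
    detector_score: Callable[[object], float],           # dev detector firing rate (lower = better)
    held_out_detector_score: Callable[[object], float],  # detector we never optimized against
) -> Tuple[object, float, float]:
    best_model, best_dev = None, float("inf")
    for make_model in candidate_mitigations:
        model = make_model()
        dev = detector_score(model)
        if dev < best_dev:
            best_model, best_dev = model, dev
    # Report the held-out score so a drop on the dev detector isn't mistaken
    # for actually removing the misalignment.
    return best_model, best_dev, held_out_detector_score(best_model)
```

As the post notes, the more robust (and cheap) the detector is, the more useful this kind of iteration becomes.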
Given these difficulties in producing legible evidence, I think researchers working on internals-based techniques for detecting misalignment should spend some time thinking about the ultimate use case of their technique and then alter their approach as needed. These difficulties also generally make me somewhat less excited about internals-based methods for detecting misalignment (at least if these methods don't involve finding inputs on which the AI behaves poorly). I think techniques for detecting misalignment should focus on finding behavioral evidence or should aim to be robust (and cheap) enough to be iterated against.
We can push the frontier of what humans can understand using recursive oversight techniques like debate, which might allow for eliciting explanations that let humans understand how the action is problematic (e.g., pointing out exactly what the problematic behavior is and explaining how to test this in isolation). However, if the problematic nature of an action is sufficiently subtle and/or complex (perhaps due to the AI having superhuman understanding), then even well-elicited debate might be unable to provide a good enough explanation. We might have hoped that internals-based methods could push the frontier beyond debate (perhaps to indefinite worst-case scalability) while still producing legible and convincing evidence, but I'll argue that non-behavioral evidence is unlikely to be convincing. This suggests an upper bound on strategies which depend on legible evidence: once well-elicited debate (and other methods for improving human understanding) aren't viable, misaligned AIs can get away with doing bad actions without these actions resulting in legible evidence of misalignment. ↩︎
That is, relevant humans; this approach might work for eliciting a strong response from AIs once AIs make the important decisions. ↩︎
That is, it must at least partially bottom out behaviorally. Internals-based evidence which is substantially behaviorally supported (in a relatively direct way) could suffice. E.g., if we can read off the AI's internal reasoning using some fancy method, it looks quite malign, and the malign aspects of the reasoning are also quite predictive of the AI's behavior, this could suffice. For instance, Alignment Faking in Large Language Models [LW · GW] utilized a mixture of inspecting the Chain-of-Thought and behavioral evidence in a relatively effective way. ↩︎
Presumably, the actual level of usefulness would be substantially lower, as the AI is misaligned and trying to disempower humans. ↩︎
5 comments
comment by Curt Tigges (curt-tigges) · 2025-04-15T21:47:31.135Z · LW(p) · GW(p)
I'm not sure I entirely agree with the overall recommendation for researchers working on internals-based techniques. I do agree that findings will need to be behavioral initially in order to be legible and something that decision-makers find worth acting on.
My expectation is that internals-based techniques (including mech interp) and techniques that detect specific highly legible behaviors will ultimately converge. That is:
- Internals/mech interp researchers will, as they have so far (at least in model organisms), find examples of concerning cognition that will be largely ignored or not fully acted on
- Eventually, legible examples of misbehavior will be found, resulting in action or increased scrutiny
- This scrutiny will then propagate backwards to finding causes or indicators of that misbehavior, and, provided interp tools are indeed predictive, this parallel line of work will suddenly be much more worth paying attention to
Thus, I think it's worth progressing on these internals-based techniques even if their use isn't immediately apparent. When legible misbehaviors arrive, I expect internals-based detection or analysis to be more directly applicable.
comment by Dusto · 2025-04-16T02:02:51.418Z · LW(p) · GW(p)
Honestly, I think this can go one step further and require that the behavioural output be something of more substance than text of what the model says it would do, or chain-of-thought pondering. I'm thinking more along the lines of the honeypot [LW · GW] work. I suspect we are interested in convincing similar people about AI safety and will likely get the "cool, come back to me when the models start actually doing bad things" responses if we aren't able to provide concrete examples of model capability.
comment by Knight Lee (Max Lee) · 2025-04-16T07:01:31.388Z · LW(p) · GW(p)
When Gemini randomly told an innocent user to go kill himself, it made the news, but this news didn't really change much in the big picture.
It's possible that relevant decision-makers don't care that much about dramatic bad behaviours since the vibe is "oh yeah AI glitches up, oh well."
It's possible that relevant decision-makers do care more about what the top experts believe, and if the top experts are convinced that current models already want to kill you (but can't), it may have an effect. Imagine if many top experts agree that "the lie detectors start blaring like crazy when the AI is explaining how it won't kill all humans even if it can get away with it."
I'm not directly disagreeing with this post; I'm just saying there exists this possible world model where behavioural evidence isn't much stronger (than other misalignment evidence).