Putting up Bumpers
post by Sam Bowman (sbowman) · 2025-04-23T16:05:05.476Z · LW · GW · 6 comments
tl;dr: Even if we can't solve alignment, we can solve the problem of catching and fixing misalignment.
If a child is bowling for the first time, and they just aim at the pins and throw, they’re almost certain to miss. Their ball will fall into one of the gutters. But if there were beginners’ bumpers in place blocking much of the length of those gutters, their throw would be almost certain to hit at least a few pins. This essay describes an alignment strategy for early AGI systems I call ‘putting up bumpers’, in which we treat it as a top priority to implement and test safeguards that allow us to course-correct if we turn out to have built or deployed a misaligned model, in the same way that bowling bumpers allow a poorly aimed ball to reach its target.
To do this, we'd aim to build up many largely-independent lines of defense that allow us to catch and respond to early signs of misalignment. This would involve significantly improving and scaling up our use of tools like mechanistic interpretability audits, behavioral red-teaming, and early post-deployment monitoring. We believe that, even without further breakthroughs, this work can almost entirely mitigate the risk that we unwittingly put misaligned, circa-human-expert-level agents in a position where they can cause severe harm.
On this view, if we’re dealing with an early AGI system that has human-expert-level capabilities in at least some key domains, our approach to alignment might look something like this:
- Pretraining: We start with a pretrained base model.
- Finetuning: We attempt to fine-tune that model into a helpful, harmless, and honest agent using some combination of human preference data, Constitutional AI-style model-generated preference data, outcome rewards, and present-day forms of scalable oversight.
- Audits as our Primary Bumpers: At several points during this process, we perform alignment audits on the system under construction, using many methods in parallel. We’d draw on mechanistic interpretability, prompting outside the human/assistant frame, etc.
- Hitting the Bumpers: If we see signs of misalignment—perhaps warning signs for generalized reward-tampering or alignment-faking—we attempt to quickly, approximately identify the cause.
- Bouncing Off: We rewind our finetuning process as far as is needed to make another attempt at aligning the model, taking advantage of what we’ve learned in the previous step.
- Repeat: We repeat this process until we’ve developed a complete system that appears, as best we can tell, to be well aligned. Or, if we are repeatedly failing in consistent ways, we change plans and try to articulate as best we can why alignment doesn’t seem tractable.
- Post-Deployment Monitoring as Secondary Bumpers: We then begin rolling out the system in earnest, both for internal R&D and external use, but keep extensive monitoring measures in place that provide additional opportunities to catch misalignment. If we observe substantial warning signs for misalignment, we return to step 4. (A rough code sketch of this loop follows.)
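To make the shape of this loop concrete, here is a minimal Python sketch. Every function in it (pretrain, finetune, run_audits, diagnose) is a hypothetical placeholder for what would really be large training and auditing pipelines; none of these names or signatures come from the post, and the control flow is only meant to mirror the steps above.

```python
# Minimal sketch of the "bumpers" loop above. All functions are hypothetical
# placeholders; real versions would be large training/auditing pipelines.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Checkpoint:
    step: int
    lessons: list = field(default_factory=list)  # what earlier audit failures taught us


def pretrain() -> Checkpoint:
    return Checkpoint(step=0)  # step 1: stand-in for a pretrained base model


def finetune(base: Checkpoint, lessons: list) -> Checkpoint:
    # Step 2: preference data, Constitutional-AI-style data, outcome rewards,
    # scalable oversight -- adjusted using whatever earlier failures taught us.
    return Checkpoint(step=base.step + 1, lessons=list(lessons))


def run_audits(candidate: Checkpoint) -> list:
    # Step 3: many parallel bumpers (interpretability audits, behavioral
    # red-teaming, prompting outside the human/assistant frame, ...).
    # Returns a list of warning signs; empty means nothing was caught.
    return []


def diagnose(warnings: list) -> str:
    # Step 4: quick, approximate root-cause analysis of the warning signs.
    return f"working hypothesis about the cause of {warnings}"


def align_with_bumpers(max_attempts: int = 10) -> Optional[Checkpoint]:
    base = pretrain()
    lessons: list = []
    for _ in range(max_attempts):            # step 6: repeat
        candidate = finetune(base, lessons)
        warnings = run_audits(candidate)
        if not warnings:
            return candidate                 # step 7: deploy, with monitoring still in place
        lessons.append(diagnose(warnings))   # step 4: hitting the bumpers
        # Step 5: "bounce off" -- rewind (here, all the way to the base model) and retry.
    return None  # consistent failures: change plans and explain why alignment looks intractable


if __name__ == "__main__":
    model = align_with_bumpers()
    print("deployed with monitoring" if model else "alignment not tractable on this path")
```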
The hypothesis behind the Bumpers approach is that, on short timelines, (i) this process is fairly likely to converge, after a reasonably small number of iterations, on an end state in which we are no longer observing warning signs for misalignment, and (ii) if it converges, the resulting model is very likely to be aligned in the relevant sense.
We are not certain of this hypothesis, but we think it’s important to consider. It has already motivated major parts of Anthropic’s safety research portfolio and we’re hiring across several teams—two of them new—to build out work along these lines.
(Continued in linked post. Caveat: I'm jumping into a lot of direct work on this and may be very slow to engage with comments.)
6 comments
comment by habryka (habryka4) · 2025-04-23T17:31:55.958Z · LW(p) · GW(p)
Each time we go through the core loop of catching a warning sign for misalignment, adjusting our training strategy to try to avoid it, and training again, we are applying a bit of selection pressure against our bumpers. If we go through many such loops and only then, finally, see a model that can make it through without hitting our bumpers, we should worry that it’s still dangerously misaligned and that we have inadvertently selected for a model that can evade the bumpers.
How severe of a problem this is depends on the quality and diversity of the bumpers. (It also depends, unfortunately, on your prior beliefs about how likely misalignment is, which renders quantitative estimates here pretty uncertain.) If you’ve built excellent implementations of all of the bumpers listed above, it’s plausible that you can run this loop thousands of times without meaningfully undermining their effectiveness.[8] If you’ve only implemented two or three, and you’re unlucky, even a handful of iterations could lead to failure.
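As a toy illustration of why both the number of bumpers and your prior matter, here is a back-of-the-envelope sketch with made-up numbers (deliberately ignoring the adaptation/selection effect the quoted passage is worried about): suppose each training attempt is misaligned with prior probability m, each of k independent bumpers catches a misaligned model with probability p and never flags an aligned one, and we retrain until some model passes every bumper.

```python
# Toy model only; the numbers are arbitrary, and the independence assumption is
# exactly what repeated retraining against the same bumpers would undermine.
def p_deployed_misaligned(m: float, p: float, k: int) -> float:
    slip = m * (1 - p) ** k          # misaligned AND evades all k bumpers
    return slip / (slip + (1 - m))   # condition on having passed every bumper

for k in (2, 5, 10):
    for m in (0.1, 0.5):
        print(f"k={k:2d} bumpers, prior m={m}: "
              f"P(misaligned | deployed) ~= {p_deployed_misaligned(m, 0.6, k):.4f}")
```

Under these (unrealistically strong) independence assumptions, the number of retries doesn't change the answer at all; the concern above is precisely that retraining against the same bumpers breaks that independence.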
This seems like the central problem of this whole approach, and indeed it seems very unlikely to me that we would end up with a system that we feel comfortable scaling to superintelligence after 2-3 iterations on our training protocols. This plan really desperately needs a step that is something like "if the problem appears persistent, or we are seeing signs that the AI systems are modeling our training process in a way that suggests that upon further scaling they would end up looking aligned independently of their underlying alignment, we halt and advocate for much larger shifts in our training process, which likely requires some kind of coordinated pause or stop with other actors".
↑ comment by Sam Bowman (sbowman) · 2025-04-23T17:51:13.854Z · LW(p) · GW(p)
That's part of Step 6!
Or, if we are repeatedly failing in consistent ways, we change plans and try to articulate as best we can why alignment doesn’t seem tractable.
I think we probably do have different priors here on how much we'd be able to trust a pretty broad suite of measures, but I agree with the high-level take. Also relevant:
However, we expect it to also be valuable, to a lesser extent, in many plausible harder worlds where this work could provide the evidence we need about the dangers that lie ahead.
↑ comment by habryka (habryka4) · 2025-04-23T18:04:56.221Z · LW(p) · GW(p)
Ah, indeed! I think the "consistent" threw me off a bit there and so I misread it on first reading, but that's good.
Sorry for missing it on first read, I do think that is approximately the kind of clause I was imagining (of course I would phrase things differently and would put an explicit emphasis on coordinating with other actors in ways beyond "articulation", but your phrasing here is within my bounds of where objections feel more like nitpicking).
comment by Zach Stein-Perlman · 2025-04-23T22:03:12.089Z · LW(p) · GW(p)
A crucial step is bouncing off the bumpers.
If we encounter a warning sign that represents reasonably clear evidence that some common practice will lead to danger, the next step is to try to infer the proximate cause. These efforts need not result in a comprehensive theory of all of the misalignment risk factors that arose in the training run, but they should give us some signal about what sort of response would treat the cause of the misalignment rather than simply masking the first symptoms.
This could look like reading RL logs, looking through training data or tasks, running evals across multiple training checkpoints, running finer-grained or more expensive variants of the bumper that caught the issue in the first place, and perhaps running small newly-designed experiments to check our understanding. Mechanistic interpretability tools and related training-data attribution tools like influence functions in particular can give us clues as to what data was most responsible for the behavior. In easy cases, the change might be as simple as redesigning the reward function for some automatically-graded RL environment or removing a tranche of poorly-labeled human data.
Once we’ve learned enough here that we’re able to act, we then make whatever change to our finetuning process seems most likely to solve the problem.
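Concretely, I read "running evals across multiple training checkpoints" as something like the following minimal sketch, where every name is a hypothetical placeholder rather than anything from the post:

```python
# Hypothetical sketch: re-run the eval that fired across earlier checkpoints to
# localize roughly where the concerning behavior emerged. A binary search over
# checkpoints would do the same with fewer eval runs.
from typing import Callable, Optional

def first_flagged_checkpoint(checkpoints: list[str],
                             eval_fn: Callable[[str], float],
                             threshold: float) -> Optional[str]:
    for ckpt in checkpoints:  # assumed to be in training order
        score = eval_fn(ckpt)
        print(f"{ckpt}: warning-sign eval score = {score:.2f}")
        if score >= threshold:
            return ckpt  # inspect the data/rewards introduced around this point
    return None

if __name__ == "__main__":
    # Fake scores, purely for illustration.
    fake_scores = {"step-1000": 0.02, "step-2000": 0.03, "step-3000": 0.41, "step-4000": 0.57}
    culprit = first_flagged_checkpoint(list(fake_scores), lambda c: fake_scores[c], threshold=0.3)
    print("Look at training changes made around:", culprit)
```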
I'm surprised[1] that you're optimistic about this. I would have guessed that concerning audit results don't help you solve the problem much. Like, if you catch sandbagging, that doesn't let you solve sandbagging. I get that you can patch simple obvious stuff—"redesigning the reward function for some automatically-graded RL environment or removing a tranche of poorly-labeled human data"—but mostly I don't know how to tell a story where concerning audit results are very helpful.
- ^ (I'm actually ignorant on this topic; "surprised" mostly isn't a euphemism for "very skeptical.")
comment by NullSetOwl · 2025-04-24T00:07:42.897Z · LW(p) · GW(p)
I think it’s an important caveat that this is meant for early AGI with human-expert-level capabilities, which means we can detect misalignment as it manifests in small-scale problems. When capabilities are weak, the difference between alignment and alignment-faking is less relevant because the model’s options are more limited. But once we scale to more capable systems, the difference becomes critical.
Whether this approach helps in the long term depends on how much the model internalizes the corrections, as opposed to just updating its in-distribution behavior. It’s possible that the behavior we see is not a good indicator of the model’s internal state, so we would be improving its outward behavior without fixing the underlying misalignment. This is a question about how much overlap there is between visible misalignment and total misalignment. If most of the misalignment is invisible until late, this approach is less helpful in the long term.
comment by Dusto · 2025-04-23T23:15:14.735Z · LW(p) · GW(p)
Hitting the Bumpers: If we see signs of misalignment—perhaps warning signs for generalized reward-tampering or alignment-faking—we attempt to quickly, approximately identify the cause.
This seems sensible for some of these well-known, larger-scale incidents of misalignment, but what exactly counts as "hitting a bumper"? Currently it seems as though there is no distinction between shallow and deep examples of misaligned capability, and a deep capability is a product of many underlying subtle capabilities, so repeatedly steering the top-level behaviour isn't going to change the underlying ones. I think about this a lot from the perspective of deception: there are many core deceptive capabilities that are actually critical to being a "good" human (at least assuming we want these models to ever engage with morally grey situations, which I'm guessing we do).