Posts
Comments
Project proposal: No-holds-barred jailbreak attacks and defenses for control
Normally we’ve done control research under the assumption that the red team isn’t able to use techniques like GCG to make targeted jailbreaks for the trusted monitor. Can the blue team defend if we relax that assumption (e.g. because we’re worried the model might be able to start a rogue deployment in which it runs GCG)?
The blue team can try a variety of strategies here.
- The simplest strategy is randomizing the prompt to the trusted monitor. This forces the red team to find a jailbreak that works over multiple prompts.
- There are fancier approaches here, e.g. giving the trusted model a randomized prompt that says something like “if the code is safe, reply with the number 1385, otherwise reply with the number 9247”, which breaks jailbreaks which just convince the trusted model to say a particular fixed string.
- Blue team can modify the current input to contain a known-bad substring, and check that it gets flagged.
Project proposal: Control for auditing
(I'm sharing this here because I needed to link to it, and I thought I might as well make it public.)
When preventing high stakes failures, a crucial bottleneck is the proportion of actions that you can audit: that is, investigate the action further for the purpose of finding evidence that the AI is scheming. (See here for more on this. In particular, note that this might be different from the problem of investigating an action for the purpose of learning whether that action is dangerous.) If we need to rely on human labor to audit, we’ll only be able to audit a tiny fraction of actions that our AI agents are doing. It would be extremely helpful if we were able to use AI assistance for auditing, from both trusted and untrusted models. But obviously auditing with untrusted models leads to massive risk of collusion. So this setting seems potentially pretty hard and potentially crucial; I’m excited for a project investigating it.
This is pretty closely related to untrusted monitoring; you could also call this project “deep dive into untrusted monitoring”.
The simplest such setting I can think of
We’re going to take the APPS backdooring setting from AI Control: Improving Safety Despite Intentional Subversion, fix the main control technique to be trusted monitoring with 1% auditing budget and defer-to-trusted for the 10% most suspicious solutions, and focus on the problem of implementing an auditor. That is:
- Get a dataset of backdoor attempts.
- For statistical power, this dataset will need to be large (at least 100; it would be better for it to be more like 1000). This might require developing a better/cheaper backdoor generation pipeline.
- It might be good to explore backdooring in a harder setting than the interview subset of APPS. Maybe the advanced subset of APPS would be better.
- The red team will be allowed to modify these backdoor attempts to add collusion signals. (These modifications will probably be cheap and easy compared to generating the backdoor attempts.)
- Get a dataset of benign code, filter with a trusted monitor for the most suspicious 1%. (This should be pretty cheap and easy)
We want to implement a first-pass auditor (better name TBD). That is, we want the actual audit process to have two stages:
- Firstly, some AI scaffold (the first-pass auditor) carefully investigates the sketchiest 1% of model-proposed code.
- That scaffold needs to choose 1% of the code it reviews to pass along to a human reviewer.
The first-pass auditor is allowed to be somewhat expensive compared to generating the original code, because it’s only getting called 1% of the time. For our purposes, maybe let’s say it can cost up to 20x as much.
And we’re doing a control eval: the untrusted model will, as usual, be replaced by an attack policy.
Note that in this control eval there’s no notion of usefulness: our only metric is the probability that the red team is incriminated on a given backdoor attempt. (Actually this might be more confusing than I’ve written here, because we might want to do honeypotting of the untrusted model in the first-pass auditing setup, which leads to the game varying with scale for the usual reasons (see untrusted monitoring in our original control paper).)
Techniques
- Get the untrusted model to reimplement the function, compare the implementations.
- Ask the untrusted model to explain why various parts of the function are the way they are; make collusion hard.
- Show the untrusted model only a small amount of the context (e.g. the action but not the rest of the trajectory), have it ask a trusted model for information that it needs to know in order to assess whether the action was dangerous; hope that it can’t learn whether to collude given this restricted information channel.
There are many potential options here.
My guess is that this project would actually be much better with more complicated programming problems than APPS.
I also asked it whether it consented to being a coauthor, and it said yes.
It seems like that assumption runs throughout the post though, eg 'But it won’t know what it actually tried, unless it is able to guess very accurately what it would have said', 'the AI might not even be able to get direct evidence of whether its attempts worked, or what its attempts even were.'
Those sentences are talking in the context of defer to resample, where I think it's actually reasonable to think the model won't see the resampled actions.
I don't just mean showing the model its past CoT. Current models are often given lots of information about their past actions and interactions so that they can produce better output; why would you expect that to change?
I'm specifically talking about not showing it actions that were resampled. You're totally right that they'll need other information about their actions and interactions (though it might be a good safety-usefulness tradeoff to somewhat redact this information).
Is the assumption that labs will be willing to accept worse performance because they recognize the model can't be trusted?
Yes, I am assuming that the deployer is willing to spend some performance in order to reduce risk from the model scheming.
Hmmm, good call, it will probably see the history of its chain of thought. My dear coauthor Claude wrote that and I didn't second guess it, I agree it's wrong.
Yep! You might have to show it CoT from previous, non-resampled steps though
It would be crazy to hide that CoT from the monitor.
Why does it know its attempt was replaced? Because it can guess it wouldn't have produced R?
Yep.
If it can tell that, how come it can't guess it specifically would have produced A?
I'm just hypothesizing that the first guessing problem is easier than the second; I intuitively expect that it's a lot easier to notice that something isn't your style than to guess what exactly you would have done (among other things, there are many more bits in "what action did I take" than "did I take this action").
For the main body, I gave it a first draft, it wrote a much improved second draft, then I edited from there. It wrote the appendix basically itself (with some feedback from me)
Thanks, this is closely related!
FWIW, the there are some people around the AI safety space, especially people who work on safety cases, who have that experience. E.g. UK AISI works with some people who are experienced safety analysts from other industries.
I mostly mean "we are sure that it isn't egregiously unaligned and thus treating us adversarially". So models can be aligned but untrusted (if they're capable enough that we believe they could be schemers, but they aren't actually schemers). There shouldn't be models that are trusted but unaligned.
Everywhere I wrote "unaligned" here, I meant the fairly specific thing of "trying to defeat our safety measures so as to grab power", which is not the only way the word "aligned" is used.
I think your short definition should include the part about our epistemic status: "We are happy to assume the AI isn't adversarially trying to cause a bad outcome".
I don’t understand, can you say more?
An example snippet of a rogue internal deployment threat model that I find fairly compelling:
- The schemer, while deployed in an authorized scaffold, launches an unmonitored scaffold that has access to trusted model weights.
- It uses this scaffold to do a GCG attack on the trusted model, then writes the resulting GCG attack somewhere that the agent in the authorized scaffold can see. This takes a moderate amount of compute, but a small amount compared to the amount the untrusted model will regularly need to run experiments.
- The agent in the authorized scaffold uses that GCG attack to do something else that's more bad.
To defend against this, we should try to both prevent the untrusted model from ever getting rogue deployments where it can research jailbreaks on the trusted model, and also try to prevent unacceptable outcomes conditional on the untrusted agent having access to the best jailbreaks you can get with GCG etc.
I find this interesting because it seems kind of hard to stop either step 1 (because rogue deployments might be naturally possible given the affordances the model has) or step 3 (because it's hard to ensure safety against an adversary with white-box access to a model it wants to jailbreak). I'd be excited for more research on both of these.
That's a great way of saying it. I edited this into my original comment.
Two different meanings of “misuse”
The term "AI misuse" encompasses two fundamentally different threat models that deserve separate analysis and different mitigation strategies:
- Democratization of offense-dominant capabilities
- This involves currently weak actors gaining access to capabilities that dramatically amplify their ability to cause harm. That amplification of ability to cause harm is only a huge problem if access to AI didn’t also dramatically amplify the ability of others to defend against harm, which is why I refer to “offense-dominant” capabilities; this is discussed in The Vulnerable World Hypothesis.
- The canonical example is terrorists using AI to design bioweapons that would be beyond their current technical capacity (c.f. Aum Shinrikyo, which failed to produce bioweapons despite making a serious effort)
- Power Concentration Risk
- This involves AI systems giving already-powerful actors dramatically more power over others
- Examples could include:
- Government leaders using AI to stage a self-coup then install a permanent totalitarian regime, using AI to maintain a regime with currently impossible levels of surveillance.
- AI company CEOs using advanced AI systems to become world dictator.
- The key risk here is particular already-powerful people getting potentially unassailable advantages
These threats require different solutions:
- Misuse that involves offense-dominant capabilities can be addressed by preventing users of your AIs from doing catastrophically bad things, e.g. by training the models to robustly refuse requests that could lead to these catastrophic outcomes (which might require improvements in adversarial robustness), or by removing dangerous knowledge from the AI training data.
- Power concentration risks require different solutions. Technical measures to prevent users from using the AI for particular tasks don’t help against the threat of the lab CEO trying to use the AI for those harmful tasks, or the threat of the US government expropriating the AI system and using it for their own purposes. To resist against these threats, interventions include:
- Transparency interventions: making it so that more people know about the situation, so it's less likely a tiny conspiracy can grab lots of power. E.g. see 4 Ways to Advance Transparency in Frontier AI Development.
- And then there are some technical interventions, but all of these suffer from the problem that our main concern here is, as jbash put it in a comment, "By far the most important risk isn't that they'll steal them. It's that they will be fully authorized to misuse them. No security measure can prevent that."
- Improved computer security. This guards against the risks of third parties stealing the models.
- Security against insider threats, to make it harder for the AI to be misused internally. This is a classic insider threat problem; addressing it will require both technical interventions and workflow changes inside AI companies.
Many discussions of "AI misuse" focus primarily on interventions that only help with the first category, while using rhetoric that suggests they're addressing both. This creates a motte-and-bailey situation where:
- The "motte" (easily defensible position) is "we need to prevent terrorists from using AI for mass harm"
- The "bailey" (broader claim) is "our work on AI misuse prevention will solve the major misuse risks from AI, therefore we aren't causing huge risks through our work"
This conflation is dangerous because it may lead us to overinvest in technical solutions that only address the less concerning risk, and underinvest in countermeasures for power concentration risks.
If we can construct fake honeypots that are indistinguishable - to the policy - from real opportunities, and do this at scale, then that could suffice: we could drive the policy's P(deployment | great opportunity) very low. (This relies on how many high-stakes we can create and how many the model will see in deployment.)
Yes totally. (This is impractical if you think that the model has a very large number of opportunities to attack, e.g. if every time it's programming is an opportunity to attack.)
If we can construct fake attacks that are indistinguishable - to the untrusted monitor - from real attacks, then that possibly suffices: we could drive the monitor's prior P(deployment | attack) very low
Indeed; as you know, this is probably easier than driving the policy's P(deployment|great opportunity) sufficiently low, because untrusted monitors have to defect reliably.
If you are saying that we don't need either of these - i.e. that we can get by with honeypots/attacks that are noticeably fake - then that's news to me, and I don't see how you can do that.
What I was trying to say is that if you don't have the ability to generate tasks that look real, you might still be able to honeypot your monitor by taking real trajectories and editing them to contain fake attacks (that you hope the monitor will be unable to distinguish from real attacks). I discuss this here.
Another potential benefit of this is that Anthropic might get more experience deploying their models in high-security environments.
In terms of developing better misalignment risk countermeasures, I think the most important questions are probably:
- How to evaluate whether models should be trusted or untrusted: currently I don't have a good answer and this is bottlenecking the efforts to write concrete control proposals.
- How AI control should interact with AI security tools inside labs.
More generally:
- How can we get more evidence on whether scheming is plausible?
- How scary is underelicitation? How much should the results about password-locked models or arguments about being able to generate small numbers of high-quality labels or demonstrations affect this?
I don't think this makes any sense. How are you hoping to get the model to attack except by telling it that it's in a control evaluation and you want it to attack? It seems that you are definitely going to have to handle the sandbagging.
Note that you can make untrusted monitoring work without the synthetic inputs too, you just have to do the honeypots during deployment instead of before deployment.
I think the control safety case described here is moderately different from what I’d recommend. In particular, this post describes needing to generate situations that are indistinguishable from real situations:
The situations for the game were generated using Simulation Procedure S, which we are confident generates situations from a distribution that the model cannot distinguish from the genuine deployment distribution. (See below for validating the indistinguishability of this distribution.)
I don’t know why Roger thinks you need this. Control evaluations for the control protocols listed don’t require situations that the model can’t distinguish from the genuine deployment distribution; it’s fine if the situations are visibly fake. The control evaluation involves getting a red team to elicit the untrusted model to generate attacks, so it doesn’t matter if the model knows that the situation is fake. (There are techniques that require synthetic input generation, e.g. upfront auditing as we presented it in our paper, but trusted monitoring isn’t one of them.)
The condition we do require for the control-style safety case to be valid is that the situations in the evaluation are at least as easy to attack on as the real distribution. (For example, this distribution needs to have as many good opportunities to attack, it needs to be similarly confusing to the trusted model, etc…)
Roger writes:
The idea here is to achieve sufficient realism and sufficient coverage of high-stakes deployment scenarios that the deployed model could never (in a high-stakes scenario) be highly confident that it’s in deployment.
Note that if the model is deployed on many more examples than it’s evaluated on, it can be pretty confident that it’s in deployment on any particular example just on priors, even if there’s no distribution shift. See here for more.
What do you think of the arguments in this post that it's possible to make safety cases that don't rely on the model being unlikely to be a schemer?
You might not need privacy guarantees for internal deployments ofc
Oh yeah sorry, I misunderstood that
Thanks for writing this up. Some miscellaneous thoughts:
- My guess is that the safety property you'll want to ensure here is something like "the model is unlikely to succeed at a problematically large quantity of sabotage without us first getting a big (20:1) update that it's engaging in intentional sabotage".
I think your internal usage restrictions are wildly unaffordable; it seems extremely unlikely that a sensible AI lab should agree to not deploying this model as a coding agent, or not use it for day-to-day tasks. My guess is that companies should treat the untrusted model similarly to how they treat external contractors (or how they treat employees at organizations that take insider threat seriously).
We're doing some research related to control measures for this kind of threat model, some of which should be out soon.
What do you see as the main properties required for an organization to serve as such an evaluator?
Re a, there’s nothing more specific on this than what we wrote in “the case for ensuring”. But I do think that our answer there is pretty good.
Re b, no, we need to write some version of that up; I think our answer here is ok but not amazing, writing it up is on the list.
I'm curious @Dan Braun, why don't you work on computer security (assuming I correctly understand that you don't)?
FWIW, I don't think of myself as having a super strong opinion on whether working on control is better than working on preventing people from building catastrophically dangerous AI (and I've seriously considered doing the latter).
(I do feel more optimistic than you about being able to use controlled AIs for something useful.)
I think that even after your edit, your argument still applies more broadly than you're giving it credit for: if computer security is going to go poorly, then we're facing pretty serious AI risk even if the safety techniques require trivial effort during deployment.
If your AI is stolen, you face substantial risk even if you had been able to align it (e.g. because you might immediately get into an AI enabled war, and you might be forced to proceed with building more powerful and less-likely-to-be-aligned models because of the competitive pressure).
So I think your argument also pushes against working on alignment techniques.
I agree and regret focusing as much as we did 2 in the past; I’m excited for work on “white box control” (there's some under way, and I'm excited for more).
We focused on black box control because it’s much more obvious how you’d make a legible and conservative safety case for it. But in hindsight this was a mistake, I think: people probably aren’t going to actually make legible or conservative safety arguments for their deployments, so there’s less of an advantage than I’d thought.
I plan to write more about white box control soon (but I plan to write about a lot of stuff…)
I agree that I didn’t really have much of an effect on this community’s thinking about AIS until like 2021.
A lot of OP's funding to technical AI safety goes to people outside the main x-risk community (e.g. applications to Ajeya's RFPs).
not even ARC has been able to get OP funding (in that case because of COIs between Paul and Ajeya)
As context, note that OP funded ARC in March 2022.
Thanks for the discussion Nate, I think this ended up being productive.
Another straightforward problem with this argument is that the AI doesn't just have the sun in this hypothetical, it also has the rest of the reachable universe. So the proportion of its resources it would need to spend on leaving us the sunlight is actually dramatically lower than you estimate here, by a factor of 10^20 or something.
(I don't think this is anyone's crux, but I do think it's good to avoid straightforward errors in arguments.)
I agree. But people often refer to the cost of the solar output that goes to earth, and that particular cost doesn't get paid until late.
Actually the AI can use powerful sims here: if the AI holds off on killing us until it makes the powerful sims, then if the acausal trade proposed here doesn't work out, it can just kill us then. That lets it avoid the cost of letting us have the tiny share of sunlight, though not the costs of keeping us alive during its early capabilities explosion.
Yep fair point. Those AIs will plausibly have much more thought put into this stuff than we currently have, but I agree the asymmetry is smaller than I made it sound.
I think that we should indeed consider what the corvid-god wants at the same point in the future where we're considering building the simulations David describes in this post.
More directly: David isn't proposing we should do particularly different things now, he's just noting an argument that we might take actions later that affect whether unaligned AIs kill us.
Re "It's costly for AI to leave humans alive", I think the best thing written on this is Paul's comment here, the most relevant part of which is:
- First, I think you should talk quantitatively. How many more resources can an AI get by killing humans? I'd guess the answer is something like 1 in a billion to 1 in a trillion.
- If you develop as fast as possible you will wreck the human habitat and incidentally kill a lot of people. It's pretty complicated to figure out exactly how much "keep earth livable enough for human survival" will slow you down, since it depends a lot on the dynamics of the singularity. I would guess more like a month than a year, which results in a miniscule reduction in available resources. I think that (IMO implausible) MIRI-style views would suggest more like hours or days than months.
- Incidentally, I think "byproducts of rapid industrialization trash Earth's climate" is both much more important than the dyson sphere as well as much more intuitively plausible.
- You can get energy from harvesting the biosphere, and you can use it to develop slightly faster. This is a rounding error compared to the last factor though.
- Killing most humans might be the easiest way to prevail in conflict. I think this is especially plausible for weak AI. For very powerful AI, it also seems like a rounding error. Even a moderately advanced civilization could spend much less than 1 in a trillion of its resources to have much less than 1 in a billion chance of being seriously inconvenienced by humanity.
I don't think you should commit to doing this scheme; I think you should just commit to thinking carefully about this argument post-singularity and doing the scheme if you think it still seems good. Acausal trade is potentially really scary and I don't think you want to make unnecessarily strong commitments.
I agree with you that "slow takeoff" is often used, but I think "continuous" is self-evident enough and is enough of a better term that it seems pretty good to switch. This term has been used e.g. here.
Sorry, I don't understand the question, can you rephrase?
Maybe you should change the title of this post? It would also help if the post linked to the kinds of arguments he was refuting.
As is maybe obvious from my comment, I really disliked this essay and I'm dismayed that people are wasting their time on it. I strong downvoted. LessWrong isn't the place for this kind of sloppy rhetoric.
Yeah ok I weakened my positive statement.
I wish the title of this made it clear that the post is arguing that ASIs won't spare humanity because of trade, and isn't saying anything about whether ASIs will want to spare humanity for some other reason. This is confusing because lots of people around here (e.g. me and many other commenters on this post) think that ASIs are likely to not kill all humans for some other reason.
(I think the arguments in this post are a vaguely reasonable argument for "ASIs are pretty likely to be scope-sensitively-maximizing enough that it's a big problem for us", and respond to some extremely bad arguments for "ASI wouldn't spare humanity because of trade", though in neither case does the post particularly engage with the counterarguments that are most popular among the most reasonable people who disagree with Eliezer.)