Managing catastrophic misuse without robust AIs

post by ryan_greenblatt, Buck · 2024-01-16T17:27:31.112Z · LW · GW · 16 comments

Contents

  Mitigations for bioterrorism
    What reliability is acceptable?
  Mitigations for large-scale cybercrime
  Fine-tuning?
  Conclusion
16 comments

Many people worry about catastrophic misuse of future AIs with highly dangerous capabilities. For instance, powerful AIs might substantially lower the bar to building bioweapons or allow for massively scaling up cybercrime.

How could an AI lab serving AIs to customers manage catastrophic misuse? One approach would be to ensure that when future powerful AIs are asked to perform tasks in these problematic domains, the AIs always refuse. However, it might be a difficult technical problem to ensure these AIs refuse: current LLMs can be jailbroken into exhibiting arbitrary behavior, and the field of adversarial robustness, which studies these sorts of attacks, has made only slow progress in improving robustness over the past 10 years. If we can’t ensure that future powerful AIs are much more robust than current models[1], then malicious users might be able to jailbreak these models to allow for misuse. This is a serious concern, and it would be notably easier to prevent misuse if models were more robust [LW · GW] to these attacks. However, I think there are plausible approaches to effectively mitigating catastrophic misuse which don't require high levels of robustness on the part of individual AI models.

(In this post, I'll use "jailbreak" to refer to any adversarial attack.)

In this post, I'll discuss addressing bioterrorism and cybercrime misuse as examples of how I imagine mitigating catastrophic misuse[2] for a model deployed on an API. I'll do this as a nearcast where I suppose that scaling up LLMs results in powerful AIs that would present misuse risk in the absence of countermeasures. The approaches I discuss won't require better adversarial robustness than that exhibited by current LLMs like Claude 2 and GPT-4. I think that the easiest mitigations for bioterrorism and cybercrime are fairly different, because of the different roles that LLMs play in these two threat models.

The mitigations I'll describe are non-trivial, and it's unclear if they will happen by default. But regardless, this type of approach seems considerably easier to me than trying to achieve very high levels of adversarial robustness. I'm excited for work which investigates and red-teams methods like the ones I discuss.

Note that the approaches I discuss here don't help at all with catastrophic misuse of open source AIs. Distinct approaches would be required to address that problem, such as ensuring that powerful AIs which substantially lower the bar to building bioweapons aren't open sourced. (At least, not open sourced until sufficiently strong bioweapons defense systems exist or we ensure there are sufficiently large difficulties elsewhere in the bioweapon creation process.)

[Thanks to Fabien Roger, Ajeya Cotra, Nate Thomas, Max Nadeau, Aidan O'Gara, Nathan Helm-Burger, and Ethan Perez for comments or discussion. This post was originally posted as a comment in response to this post [LW · GW] by Aidan O'Gara; you can see the original comment here for reference [LW(p) · GW(p)]. Inside view, I think most research on preventing misuse over an API seems less leveraged (for most people) than preventing AI takeover caused by catastrophic misalignment; see here [LW(p) · GW(p)] for more discussion. However, I think the style of work I discuss here has good transfer with the AI control approach for avoiding AI takeover and I also think that AI labs should manage catastrophic misuse.]

Mitigations for bioterrorism

In this section, I’ll describe how I imagine handling bioterrorism risk for an AI lab deploying powerful models (e.g., ASL-3/ASL-4).

As I understand it, the main scenario by which LLMs cause bioterrorism risk is something like the following: there’s a team of relatively few people, who are not top experts in the relevant fields but who want to do bioterrorism for whatever reason. Without LLMs, these people would struggle to build bioweapons—they wouldn't be able to figure out various good ideas, and they’d get stuck while trying to manufacture their bioweapons (perhaps like Aum Shinrikyo). But with LLMs, they can get past those obstacles.

(I'm making the assumption here that the threat model is more like "the LLM gives the equivalent of many hours of advice" rather than "the LLM gives the equivalent of five minutes of advice". I'm not a biosecurity expert and so don’t know whether that's an appropriate assumption to make; it probably comes down to questions about what the hard steps in building catastrophic bioweapons are. I don’t see why LLMs would zero-shot be able to effectively answer extremely broad and difficult questions like “how to build a really bad bioweapon”, so it seems to me like the LLMs probably have to do at least a moderate amount of back-and-forth in order to be wildly helpful.)

And so in this case, the LLMs caused problems not by being superhuman, but by allowing aspiring bioterrorists to get a fairly large amount of expert advice that human experts would refuse to give them.

Thus, our countermeasures will heavily utilize the assumption that a large number of queries to the AI are likely required to substantially enable catastrophic bioterrorism. We’ll assume the number of required queries is likely at least hundreds, but possibly thousands[3].

First, we’ll remove biology data (particularly synthetic biology and virology) from the training corpus and verify that this results in a model with weak biology knowledge. This is sufficient if it's acceptable not to have any deployed model which can handle biology questions.
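To make this filtering step concrete, here's a minimal sketch. The keyword list, the `bio_classifier` scoring function, and the threshold are hypothetical stand-ins for illustration, not a description of any lab's actual pipeline:

```python
# Hypothetical sketch of filtering biology-heavy documents out of a pretraining corpus.
# `bio_classifier` is an assumed callable (e.g., a small fine-tuned model) that scores
# how much dual-use biology content a document contains.

BIO_KEYWORDS = {"virology", "pathogen", "synthetic biology", "plasmid", "culturing protocol"}

def looks_like_bio(doc: str, bio_classifier, threshold: float = 0.5) -> bool:
    """Cheap keyword prefilter, then a classifier score for documents that pass it."""
    text = doc.lower()
    if not any(keyword in text for keyword in BIO_KEYWORDS):
        return False
    return bio_classifier(doc) > threshold

def split_corpus(corpus, bio_classifier):
    """Split the corpus into (non-bio documents, held-out bio documents).

    The held-out bio documents can later be used to fine-tune the separate
    bio endpoint discussed below, rather than being discarded entirely.
    """
    filtered, held_out_bio = [], []
    for doc in corpus:
        (held_out_bio if looks_like_bio(doc, bio_classifier) else filtered).append(doc)
    return filtered, held_out_bio
```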

However, it might be important for some users to be able to ask bio questions. If so, we'll create an additional API endpoint (an interface via which users can make queries to the model) with a model created by fine-tuning our normal base model (without bio) on the excluded biology data. Now, we'll need countermeasures on this additional bio endpoint to prevent misuse.

For this additional endpoint, we'll use the following countermeasures:

- Require know-your-customer (KYC) checks so that users can be banned and, if needed, referred for criminal investigation or sued.
- Run an ensemble of automated monitoring models over every query and response to flag potentially dangerous bio activity.
- Have humans audit the most suspicious-seeming flagged queries (e.g., the top 0.1%).
- Ban users who appear to be attempting misuse and report serious cases to law enforcement.

This approach would fail if a user were able to jailbreak the model that responds to queries while also consistently jailbreaking (or otherwise bypassing) the monitoring models so that they don't report the issue. Our key hope is that it will be very hard to reliably jailbreak all models in the monitoring ensemble, given that users won't be able to try many times without first getting banned (in addition to potentially facing criminal investigation and lawsuits).
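As a rough illustration of how the query path for this endpoint could be wired together, here's a sketch. The `misuse_score`, `generate`, and `suspend` interfaces and the 0.8 threshold are assumptions made for the sketch, not a real API:

```python
from dataclasses import dataclass

@dataclass
class MonitorResult:
    flagged: bool
    max_score: float

def run_monitors(query: str, response: str, monitors, threshold: float = 0.8) -> MonitorResult:
    """Score the (query, response) pair with every monitor in the ensemble.

    A single monitor exceeding the threshold is enough to flag the interaction,
    so an attacker has to fool every monitor on every query to go undetected.
    """
    scores = [monitor.misuse_score(query, response) for monitor in monitors]
    return MonitorResult(flagged=max(scores) > threshold, max_score=max(scores))

def handle_bio_query(user, query, bio_model, monitors, audit_queue):
    """Serve a bio-endpoint query, withholding and escalating anything flagged."""
    response = bio_model.generate(query)
    result = run_monitors(query, response, monitors)
    if result.flagged:
        # Withhold the response, queue the interaction for human audit, and
        # suspend the account pending review (with possible referral to law enforcement).
        audit_queue.append((user.id, query, response, result.max_score))
        user.suspend()
        return "This request has been flagged for review."
    return response
```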

We verify this approach worked as expected by ensuring that a red team which contains both bio experts and LLM experts can't execute a moderate number (e.g., 50) of diverse bioterror queries without having a very high probability of being caught. The red team is familiar with the high-level countermeasures but isn't necessarily informed of non-public implementation details. The red team can try to find attacks by running reasonably well-resourced experiments on open-source models or by using other available APIs. (For instance, the red team could construct a mock-up of our setup using open-source models and their understanding of the countermeasures, then iterate against that mock-up in the hope of finding attacks which transfer sufficiently well.) We'd want to try this with several different red teams.

Here are some possible additional countermeasures:

What reliability is acceptable?

A nice property of this proposal is that our monitoring ensemble doesn't need to be that reliable to make bioterror hard: we only need to catch one query to ban and prosecute a user, while they need to make many queries for the AI to help much with bioweapons.

For instance, let's suppose that we can only afford to have a human audit the most suspicious-seeming 0.1% of queries for a bio-specific API endpoint. Equivalently, we can afford to have a 0.1% False Positive Rate (FPR) for our automated monitoring ensemble. Now, suppose (for simplicity) that catastrophic bio misuse requires 100 queries to the model and we're happy as long as we catch the user and report them to the FBI prior to them finishing all 100 queries. Then, if our monitoring ensemble at 0.1% FPR catches a given bio misuse query with just 5% probability (for a competent adversary), bio misuse is 99.4% likely to be caught prior to succeeding. So, we don't actually need very high reliability on the ensemble. (Though we might target more conservative bars than 99.4% in practice.)
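Spelling out the arithmetic with the illustrative numbers from this example:

```python
# Chance of catching the user at some point during their queries, assuming each
# query is caught independently with probability p_catch.
p_catch = 0.05     # per-query catch probability at 0.1% FPR, against a competent adversary
n_queries = 100    # queries assumed necessary for catastrophic bio misuse

p_caught = 1 - (1 - p_catch) ** n_queries
print(f"{p_caught:.3f}")  # ~0.994, i.e. the 99.4% figure above
```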

A reasonable objection here is that maybe we'll have a large number of people (e.g., >200) trying to create bioweapons using the API and thus one will get lucky. We can avoid this failure mode by having a policy such as "if there are more than 3 open FBI investigations into attempted bioweapon construction based on our models, we need to either shut down the API endpoint or further increase reliability by a substantial amount (e.g. by spending additional money on human auditing and increasing the cost of the API endpoint)". Then, if we do actually see a large amount of serious misuse, we can respond dynamically to handle this. Once we see examples, we should also be able to examine case studies to see how effective our safeguards were in practice.
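The same illustrative numbers show why many independent attempts are the weak point, and hence why some dynamic response like the policy above seems necessary:

```python
# Probability that at least one of many independent adversaries finishes all their
# queries without ever being caught (numbers carried over from the example above).
p_slip_per_user = (1 - 0.05) ** 100           # ~0.6% chance that a single user is never caught
n_users = 200
p_at_least_one_succeeds = 1 - (1 - p_slip_per_user) ** n_users
print(f"{p_at_least_one_succeeds:.2f}")       # ~0.7 in the absence of any dynamic response
```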

Mitigations for large-scale cybercrime

The threat model for LLM cybercrime has pretty different properties than the bioterrorism threat model. A bioterrorism attack might only require hundreds of queries to the LLM in order to cause a catastrophic outcome. In contrast, the risk associated with cybercrime is that we massively reduce the cost of the labor involved in hacking. This distinction is important because there are already thousands or tens of thousands of people who are employed as hackers for pay, and so if LLMs aren’t massively superhuman, they have to contribute by lowering costs. As such, if our goal is to ensure less than 100 million dollars in cybercrime damages, it might suffice to keep the number of answered cybercrime queries below 100,000 (unlike the bio case, where merely hundreds of queries could be unacceptable). Thus, we can prevent our model from doing unacceptable amounts of cybercrime without needing as high reliability as in the bioterrorism case.
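A back-of-the-envelope version of this comparison, where the per-query catch probability is an assumption and the KYC complications discussed below are ignored:

```python
# Illustrative only: with banning, each account gets roughly 1/p_catch answered
# queries before being caught, so even a fairly unreliable monitor caps the total
# volume of misuse an adversary can extract per account.
p_catch = 0.05                               # assumed per-query catch probability
queries_per_banned_account = 1 / p_catch     # ~20 answered queries per account
query_budget = 100_000                       # tolerable number of cybercrime queries (from above)
accounts_needed = query_budget / queries_per_banned_account
print(f"{accounts_needed:.0f}")              # ~5,000 accounts an adversary group would have to burn
```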

It likely won't be possible to avoid having our main model (as deployed on an API) be capable of cybercrime. So, we don't have the option of creating a separate endpoint as in the bio case. Otherwise, we'll use a somewhat similar set of countermeasures as with the bio approach discussed above. For the following list, I'm assuming that you've read the bio section above.

- Apply automated monitoring to queries and responses using an ensemble of models, as in the bio case but targeting cybercrime.
- Have humans audit the most suspicious-seeming flagged queries.
- Ban (and potentially report) users who are caught, to the extent that this is possible.

Due to the potential lack of KYC, we might not be able to ban users effectively (and we also lose the options of criminal investigation and lawsuits). But even without KYC, we can at least estimate the total number of cybercrime queries and make a rough guess at total damages. If likely damages reach some threshold, we can institute additional mitigations like actually requiring KYC.
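A sketch of what this estimate-and-escalate logic could look like. The `cybercrime_score` and `is_cybercrime` interfaces, the flagging threshold, and the per-query damage figure are all assumptions for illustration:

```python
import random

def estimate_cybercrime_queries(all_queries, monitor, auditor, sample_size=1000):
    """Estimate the number of cybercrime queries by human-auditing a random sample
    of monitor-flagged queries and extrapolating. (A real estimator would also need
    to account for queries that the monitor misses entirely.)"""
    flagged = [q for q in all_queries if monitor.cybercrime_score(q) > 0.8]
    if not flagged:
        return 0.0
    sample = random.sample(flagged, min(sample_size, len(flagged)))
    confirmed_rate = sum(auditor.is_cybercrime(q) for q in sample) / len(sample)
    return confirmed_rate * len(flagged)

def maybe_escalate(estimated_queries, damage_per_query=1_000, damage_threshold=100_000_000):
    """Trigger additional mitigations (e.g., mandatory KYC) once estimated damages
    cross the acceptable threshold."""
    if estimated_queries * damage_per_query > damage_threshold:
        return "institute additional mitigations (e.g., require KYC)"
    return "continue monitoring"
```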

Given that these mitigations have to be on our primary API, it's plausible that privacy concerns are a blocker. We can probably do some amount of user info redaction for auditing, but I broadly think we should just aim to set the expectation that queries made to powerful models aren't private. This is similar to (e.g.) emails sent to contractors.

Fine-tuning?

I think that allowing end-users to fine-tune these models makes the picture notably more complicated, as it might be possible to train knowledge into the model which isn't present in the monitoring models. There are still possible mitigations; for instance, we could only allow fine-tuning of powerful models under large corporate contracts.

Conclusion

It might be possible to prevent adversaries from catastrophically misusing models even without improvements in the robustness of individual AIs. To do this, we'd take advantage of the moderate to large number of queries needed to cause harm under various threat models, monitoring, and the ability to ban users. I'd be excited about empirical work which demonstrates the viability of this approach or shows that it won't work without fundamental advances.


  1. I think it's plausible that the first dangerously powerful AIs will by default be considerably more robust than the current state of the art LLMs, though I'm very uncertain overall. That is, it might be that our current methods for improving robustness suffice for future powerful models despite failing for current models due to increases in scale or various architectural changes. ↩︎

  2. For this post, I’m not discussing catastrophic misalignment concerns like scheming AIs and catastrophic reward hacking. These concerns would also need to be addressed for sufficiently powerful AIs. I expect that in practice, our solutions for mitigating misuse over an API and avoiding bad outcomes due to catastrophic misalignment will mostly be separate, though there could be some interactions. ↩︎

  3. There is also a threat model where AIs are very helpful for bioterror via the mechanism of occasionally giving really good answers and thus only a small number of bioterror queries (1-30) would be dangerous. This threat model could look like having really good high-level ideas for bioterror or acting like a considerably better version of Google which points you toward the ideal online resource. I think this sort of usage probably doesn't help that much with bioterror, but this might be a crux. ↩︎

16 comments

Comments sorted by top scores.

comment by AdamGleave · 2024-02-17T19:22:00.081Z · LW(p) · GW(p)

Thanks for the post Ryan -- I agree that, given the difficulty of making models actually meaningfully robust, the best near-term solution to misuse is going to be a defence-in-depth approach consisting of filtering the pre-training data, input filtering, output filtering, automatic and manual monitoring, KYC checks, etc.

At some point though we'll need to grapple with what to do about models that are superhuman in some domains related to WMD development, cybercrime or other potential misuses. There are glimmers of this already: e.g., my impression is that AlphaFold is better than human experts at protein folding. It does not seem far-fetched that automatic drug discovery AI systems in the near future might be better than human experts at finding toxic substances (Urbina et al., 2022 give a proof of concept). In this setting, a handful of queries that slip through a model's defences might be dangerous: "how to build a really bad bioweapon" might be something the system could make significant headway on zero-shot. Additionally, if the model is superhuman, then it starts becoming attractive for nation-state or other well-resourced adversaries to seek to attack it (whereas at human-level, they can just hire their own human experts). The combination of lower attack tolerance and increased sophistication of attacks makes me somewhat gloomy that this regime will hold up indefinitely.

Now I'm still excited to see the things you propose be implemented in the near-term: they're some easy wins, and lay foundations for a more rigorous regime later (e.g. KYC checks seem generally really helpful in mitigating misuse). But I do suspect that in the long-run we'll need a more principled solution to security, or simply refrain from training such dangerous models.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-02-17T19:58:37.090Z · LW(p) · GW(p)

In this setting, a handful of queries that slip through a model's defences might be dangerous [...] But I do suspect that in the long-run we'll need a more principled solution to security, or simply refrain from training such dangerous models.

This seems right to me, but it's worth noting that this point might occur after the world is already radically transformed by AI (e.g. all human labor is obsolete). So, it might not be a problem that humans really need to deal with.

The main case I can imagine where this happens prior to the world being radically transformed is the case where automatic drug/virus/protein outputting AIs (like you mentioned) can do a massive amount of the work end-to-end. I'd hope that for this case, the application is sufficiently narrow that there are additional precautions we can use, e.g. just have a human screen every request to the model. But this seems pretty scary overall.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-01-17T23:35:45.604Z · LW(p) · GW(p)

As someone currently quite involved in bioweapon risk evals of AI models... I appreciate this post, but also feel like the intro really ought to at least mention the hard part. The hard part is what to do about open source models. Given that the relevant dangerous knowledge for bioweapon construction is, unfortunately, already publicly available, any plan for dealing with misuse has to assume the bad actors have that info. 

The question then is, if a model has been trained without the bad knowledge, how much additional compute is required to fine-tune the knowledge in? How much effort/cost is that process compared to directly reading and understanding the academic papers?

My best guess as to the answers for these questions is 'only a small amount of extra compute, like <5% of the training compute' and 'much quicker and easier than reading and understanding thousands of academic papers'.

If these are both true, then to prevent the most likely path to misuse you must prevent the release of the open source model in the first place. Or else, come up with some sort of special training regime which greatly increases the cost of fine-tuning dangerous knowledge into the model.

What about as the state of the art in hardware and algorithms continues to advance? Open source models of sufficient capability to be dangerous will get cheaper and easier to produce. What kind of monitoring regime would be required to stop groups or wealthy individuals from training their own models on dangerous knowledge?

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-01-17T23:57:06.957Z · LW(p) · GW(p)

Yes, my hope here would be to prevent the release of these open source models. I'll add a note to the intro. The post is about "How could an AI lab serving AIs to customers manage catastrophic misuse?" and assumes that the AI lab has already properly contained their AI (made it hard to steal, and certainly not open sourced it).

Replies from: ryan_greenblatt, nathan-helm-burger
comment by ryan_greenblatt · 2024-01-18T00:05:11.047Z · LW(p) · GW(p)

I added:

Note that the approaches I discuss here don't help at all with catastrophic misuse of open source AIs. Distinct approaches would be required to address that problem, such as ensuring that powerful AIs which substantially lower the bar to building bioweapons aren't open sourced. (At least, not open sourced until sufficiently strong bioweapons defense systems exist or we ensure there are sufficiently large difficulties elsewhere in the bioweapon creation process.)

comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-01-18T00:09:11.842Z · LW(p) · GW(p)

Thanks Ryan. Although, it's unclear how powerful a model needs to be to be dangerous (which is why I'm working on evals to measure this). In my opinion, Llama2 and Mixtral are potentially already quite dangerous given the right fine-tuning regime. 

So, if I'm correct, we're already at the point of 'too late, the models have been released. The problem will only get worse as more powerful models are released. Only the effort of processing the raw data into a training set and running the fine-tuning is saving the world from having a seriously dangerous bioweapon-designer-tuned LLM in bad actors' hands.'

Of course, that's just like... my well-informed opinion, man. I deliberately created such a bioweapon-designer-tuned LLM from Llama 70B as part of my red teaming work on biorisk evals. It spits out much scarier information than a google search supplies. Much. I realize there's a lot of skepticism around my claim on this, so not much can be done until better objective evaluations of the riskiness of the bioweapon design capability are developed. For now, I can only say that this is my opinion from personal observations.

Replies from: Vladimir_Nesov, ryan_greenblatt
comment by Vladimir_Nesov · 2024-01-21T12:21:08.540Z · LW(p) · GW(p)

It spits out much scarier information than a google search supplies. Much.

I see a sense in which GPT-4 is completely useless for serious programming in the hands of a non-programmer who wouldn't be capable/inclined to become a programmer without LLMs, even as it's somewhat useful for programming (especially with unfamiliar but popular libraries/tools). So the way in which a chatbot helps needs qualification.

One possible measure is how much a chatbot increases the fraction of some demographic that's capable of some achievement within some amount of time. All these "changes the difficulty by 4x" or "by 1.25x" claims need to mean something specific, otherwise there is hopeless motte-and-bailey that allows credible reframing of any data as fearmongering. That is, even when it's only intuitive guesses, the intuitive guesses should be about a particular meaningful thing rather than a level of scariness. Something prediction-marketable.

Replies from: ryan_greenblatt, nathan-helm-burger
comment by ryan_greenblatt · 2024-01-21T17:16:17.335Z · LW(p) · GW(p)

I was trying to say "cost in time/money goes down by that factor for some group".

comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-01-21T20:00:56.641Z · LW(p) · GW(p)

Yes, I quite agree. Do you have suggestions for what a credible objective eval might consist of? What sort of test would seem convincing to you, if administered by a neutral party?

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-01-21T20:09:13.025Z · LW(p) · GW(p)

Here's my guess (which is maybe the obvious thing to do).

Take bio undergrads, have them do synthetic biology research projects (ideally using many of the things which seem required for bioweapons), randomize into two groups where one is allowed to use LLMs (e.g. GPT-4) and one isn't. The projects should ideally have a reasonable duration (at least >1 week, more ideally >4 weeks). Also, for both groups, provide high level research advice/training about how to use the research tools they are given (in the LLM case, advice about how to best use LLMs).

Then, have experts in the field assess the quality of projects.

For a weaker preliminary version, you could run 2-4 hour sessions in which participants do a quick synthetic biology lab task with approximately the same setup (but there are complications with the shortened duration).

comment by ryan_greenblatt · 2024-01-18T00:18:39.893Z · LW(p) · GW(p)

In my opinion, Llama2 and Mixtral are potentially already quite dangerous given the right fine-tuning regime.

Indeed, I think it seems pretty unlikely that these models (finetuned effectively using current methods) change the difficulty of making a bioweapon by more than a factor of 4x. (Though perhaps you think something like "these models (finetuned effectively) make it maybe 25% easier to make a bioweapon and that's pretty scary".)

Replies from: nathan-helm-burger
comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-01-18T00:26:03.146Z · LW(p) · GW(p)

Yes, I'm unsure about the multiplication factors on "likelihood of bad actor even trying to make a bioweapon" and on "likelihood of succeeding given the attempt". I think probably both are closer to 4x than 1.25x. But I think it's understandable that this claim on my part seems implausible. Hopefully at some point I'll have a more objective measure available.

comment by Charbel-Raphaël (charbel-raphael-segerie) · 2024-02-10T21:57:23.752Z · LW(p) · GW(p)

TLDR of the post: The solution against misuse is for labs to be careful and keep the model behind an API. Probably not that complicated.

My analysis: Safety culture is probably the only bottleneck.

comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-01-17T14:46:34.011Z · LW(p) · GW(p)

I think similar threat models and similar lines of reasoning might also be useful with respect to (potentially misaligned) ~human-level/not-strongly-superhuman AIs, especially since more complex tasks seem to require more intermediate outputs (that can be monitored).

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-01-17T16:16:50.974Z · LW(p) · GW(p)

We strongly agree, see our recent work [LW · GW]. As I state in the post "I think the style of work I discuss here has good transfer with the AI control approach". We have a forthcoming post explaining AI control in more detail.

comment by RogerDearnaley (roger-d-1) · 2024-01-17T02:17:10.245Z · LW(p) · GW(p)

This all seems very sensible, and I must admit, I had been basically assuming that things along these lines were going to occur once risks from frontier models became significant enough. Likely via a tiered series of filters: a cheap, weak filter passes the most suspicious X% (plus a random Y%) of its input to a stronger, more expensive filter, and so on, up to more routine/cheaper and finally more expensive/careful human oversight. Another obvious addition for the cybercrime level of risk would be IP address logging of particularly suspicious queries, and not being able to use the API via a VPN that hides your IP address if you're unpaid and unsigned-in, or seeing more refusals if you do.

I also wouldn't assume that typing a great many queries with clearly-seriously-criminal intent into a search engine in breach of its terms of use was an entirely risk-free thing to do, either — or if it were now, that it will remain so with NLP models becoming cheaper.

Obviously open-source models are a separate question here: about the best approach currently available for them is, as you suggest above, filtering really dangerous knowledge out of their training set.