Dan Braun's Shortform

post by Dan Braun (Daniel Braun) · 2024-10-05T12:26:46.329Z · LW · GW · 2 comments


comment by Dan Braun (Daniel Braun) · 2024-10-05T12:26:46.818Z · LW(p) · GW(p)

In which worlds would AI Control prevent significant harm?

When I bring up the issue of AI model security to people working in AI safety, I’m often met with something of the form “yes, this is a problem. It’s important that people work hard on securing AI models. But it doesn’t really affect my work”.

Using AI Control [LW · GW] (an area which has recently excited many [LW(p) · GW(p)] in the field) as an example, I lay out an argument for why it might not be as effective an agenda as one might think after considering the realities of our cyber security situation.

  1. AI Control concerns itself with models that intentionally try to subvert their developers.
  2. These models are likely to be highly capable in general, and capable of causing significant harm without countermeasures.
  3. Leading cyber-capable institutions would likely expend significant resources and political capital to steal these models or steal enough insights to reproduce such models.
  4. If the weights or insights are stolen, work on AI control will not prevent these models from causing significant harm.
  5. Current AI developers are not on track to be able to defend against high-priority operations from leading cyber-capable institutions in the coming years.
  6. Therefore, AI Control will only be useful in the coming years under one (or more) of these conditions:
    1. Models that scheme are unlikely to be generally capable/dangerous enough to be a high-priority target for leading cyber-capable institutions.
    2. Models that scheme are only developed by actors that can thwart high-priority operations from leading cyber-capable institutions (which precludes current AI developers for at least several years).
    3. AI Control won’t be directly useful in the coming years but it will be indirectly useful to progress the field for when models are developed by actors capable of thwarting top cyber operations.
    4. Even if the model were stolen and caused significant harm, there would still be less harm overall than if the model undergoing AI control measures also broke out. Of course, this argument is much weaker for models capable of existential harm.
    5. Actors that steal the model would also implement strong-enough AI control measures and would not misuse the AI to cause significant harm.

There are of course other arguments against working on AI Control. E.g. it may encourage the development and use of models that are capable of causing significant harm. This is an issue if the AI control methods fail or if the model is stolen. So when advocating for AI Control work, one must either be willing to eat this cost or argue that it's not a large one.

This isn’t to say that AI Control isn’t a promising agenda; I just think people need to carefully consider the cases in which their agenda falls down for reasons that aren’t technical arguments about the agenda itself.

I’m also interested to hear takes from those excited by AI Control on which of the conditions listed in #6 above they expect to hold (or to otherwise poke holes in the argument).

comment by tchauvin (timot.cool) · 2024-10-05T13:59:38.863Z · LW(p) · GW(p)

In general, the hacking capabilities of state actors, and the likely involvement of national security agencies as we get closer to AGI, feel like significant blind spots in LessWrong discourse.

(The Hacker and The State by Ben Buchanan is a great book to learn about the former.)