There's probably a tradeoff between AI capability and safety, and we should act like it

post by David Johnston (david-johnston) · 2022-06-09T00:17:24.722Z · LW · GW · 3 comments

I think something like the following is probably true under fairly weak assumptions:[1]

For almost any goal X, and almost any agent that achieves X safely, there's an agent that achieves even more X unsafely.

If it is true, it makes AI safety hard: the desire to achieve goals pushes us in the direction of unsafe AI, and the desire to be safe pushes us away from achieving goals. Managing this tradeoff could be really difficult, and we could fail.

However, the fact that the desired outcome - lots of good stuff, no disasters - might not be achieved does not mean that the best strategy is hard to work out. At a high level, the best strategy seems pretty obvious to me: improve our understanding of the line between "safe" and "unsafe", and do our best to ensure that AI experimentation remains on the safe side.

In the immediate term, this seems to imply:

  1. Research on the question of when AI experiments are unsafe
  2. Convincing the at-present relatively small number of key players that there is a tradeoff, and helping them understand how to tell the difference between safe and unsafe (both of these aims will be helped by 1)

This is already difficult, and it gets harder in the medium term as the number of key players grows. However, we also have more time in the medium term to work out a better strategy, and if a majority of today's key players in AI can be convinced that AI experimentation can be unsafe, and can develop a decent understanding of what kinds of experimentation are unsafe, then we're in a much better position to embark on more ambitious political projects.

It seems to me that right now there is some investment in 2 but not very much in 1, so I think we should be doing more of 1.

  1. ^

    When I say it is probably true under fairly weak assumptions, I mean that it is likely possible to give precise definitions of each quoted term below that capture most instances of what the words mean, and then show that the resulting claim is true:

    For "almost any" "goal" X, and "almost any" "agent" that "achieves" X "safely", there's an "agent" that "achieves even more" X "unsafely"

3 comments

comment by Anirandis · 2022-06-09T00:21:11.638Z · LW(p) · GW(p)

Related: alignment tax [? · GW]

Replies from: david-johnston
comment by David Johnston (david-johnston) · 2022-06-09T00:41:31.690Z · LW(p) · GW(p)

Indeed. If the idea of a tradeoff weren't widely considered plausible, I'd have spent more time defending it. I'd say my contribution here is the "and we should act like it" part.

comment by JBlack · 2022-06-09T07:06:01.617Z · LW(p) · GW(p)

While this is largely true, I don't think it captures the essence of the problem.

To me, the hardest part of the problem is that there is probably some threshold of capability below which every agent is essentially safe, and above which increasingly many agents are increasingly unsafe. That threshold is probably not far above an average human in the overall scheme of things.

Making it worse, increasingly many AI agents may self-improve or create successor agents that are vastly more capable (and therefore also vastly more likely to be unsafe) - a catastrophic combination.