If the alignment problem were unsolvable, would that avoid doom?

post by Kinrany · 2023-05-07T22:13:12.910Z · LW · GW · No comments

This is a question post.


Suppose there is a useful formulation of the alignment problem that is mathematically unsolvable. Suppose that, as a corollary, modifying your own mind while guaranteeing any non-trivial property of the resulting mind is also impossible.

Would that prevent a new AI from trying to modify itself?

Has this direction been explored before?

Answers

answer by JBlack · 2023-05-08T01:41:52.395Z · LW(p) · GW(p)

It has been explored (multiple times even on this site), and doesn't avoid doom. It does close off some specific paths that might otherwise lead to doom, but not all or even most of them.

Some remaining problems:

  • An AI may be perfectly capable of killing everyone without self-improvement;
  • An AI may be capable of some large self-improvement step, but not aware of this theorem;
  • Self-improving AIs might not care whether the result is aligned with their former selves, and indeed may not have any goals at all before self-improvement;
  • AIs may create smarter AIs without improving their own capabilities, knowing that the result won't be fully aligned but expecting that they can nevertheless keep it under control (and turn out to be wrong);
  • In a population with many AIs, those that don't self-improve may be out-competed by those that do - leading to selection for AIs that self-improve regardless of consequences;
  • It is extremely unlikely that a mere change of computing substrate would meet the conditions of such a theorem, so an AI can almost certainly upgrade its hardware (possibly by many orders of magnitude) to run faster without modifying its mind in any fundamental way.

At this point my 5-minute timer on "think up ways things can still go wrong" ran out, and I just threw out the dumbest ideas and listed the rest. I'm sure with more thought other objections could be found.

comment by Kinrany · 2023-05-08T08:51:56.539Z · LW(p) · GW(p)

Thanks!

It has been explored (multiple times even on this site), and doesn't avoid doom. It does close off some specific paths that might otherwise lead to doom, but not all or even most of them.

Do you have any specific posts in mind?

To be clear, I'm not suggesting that because of this possibility we can simply hope this is how things play out and that we'll get lucky.

If we could find a hard limit like this, however, it seems like it would make the problem more tractable. Such a limit doesn't have to exist simply because we want it to, but searching for it still seems like a good idea.

There are a hundred problems left to solve, but it seems like this could at least rule out the main bad scenario: an AI rapidly self-improving. Improving its hardware wouldn't be trivial for a human-level AI, and it wouldn't have the options present in other scenarios. Scaling beyond a single machine also seems likely to be a significant barrier.

It could still create millions of copies of itself. That's a problem, but a better problem to have than a single AI with no coordination overhead.

comment by Kinrany · 2023-05-08T08:58:39.172Z · LW(p) · GW(p)

it would make the problem more tractable

The problem of creating a strong AI and surviving, that is. We'd still get Hanson's billions of self-directed ems.
