AI safety: three human problems and one AI issue

post by Stuart_Armstrong · 2017-06-02T16:12:23.000Z · LW · GW · 4 comments

Contents

  How to classify methods and problems
  Further refinements of the framework

A putative new idea for AI control; index here.

There have been various attempts to classify the problems in AI safety research, ranging from our old Oracle paper, which classified then-theoretical methods of control, to more recent classifications that grow out of modern, more concrete problems.

These all serve their purpose, but I think a more enlightening classification of the AI safety problems is to look at what issues we are actually trying to solve or avoid. And most of these issues are problems about humans.


Specifically, I feel AI safety issues can be classified as three human problems and one central AI issue. The human problems are:

  1. Humans don't know their own values (sub-issue: humans know their values better in retrospect than in prediction).
  2. Humans are not (idealised) agents and don't have stable values (sub-issue: humanity itself is even less of an agent).
  3. Humans have poor predictions of an AI's behaviour.

And the central AI issue is:

  1. AIs could become extremely powerful.

Obviously, if humans were agents, knew their own values, and could predict whether a given AI would follow those values or not, there would be no problem. Conversely, if AIs were weak, then the human failings wouldn't matter so much.

The point about human values is relatively straightforward, but what's the problem with humans not being agents? Essentially, humans can be threatened, tricked, seduced, exhausted, drugged, modified, and so on, into seemingly acting against our own interests and values.

If humans were clearly defined agents, then what counts as a trick or a modification would be easy to define and exclude. But since this is not the case, we're reduced to trying to figure out the extent to which something like a heroin injection is a valid way to influence human preferences. This both makes humans susceptible to manipulation and makes human values hard to define.
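As a toy illustration of why this is hard (my own sketch, with invented numbers, not something from the post): if we try to read a person's values off their behaviour, the behaviour after something like a heroin injection is equally consistent with "their values were always like this" and "their values were just overwritten", so behaviour alone can't settle which influences on preferences are valid.

```python
# Toy sketch (invented numbers): revealed-preference inference cannot, by
# itself, distinguish a legitimate influence on values from manipulation.

# Observed choice frequencies before and after the intervention.
behaviour_before = {"take_drug": 0.05, "abstain": 0.95}
behaviour_after = {"take_drug": 0.99, "abstain": 0.01}
print("before:", behaviour_before, "after:", behaviour_after)

# Two rival explanations, both of which predict the same post-intervention
# behaviour.
hypotheses = {
    "values unchanged; the drug merely removed an inhibition": behaviour_after,
    "values rewritten by the drug itself": behaviour_after,
}

for name, predicted in hypotheses.items():
    consistent = predicted == behaviour_after
    print(f"{name}: consistent with observed behaviour -> {consistent}")
```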

Finally, the issue of humans having poor predictions of AI behaviour is more general than it seems. If you want to ensure that an AI has the same behaviour in the testing and training environments, then you're essentially trying to guarantee that you can predict that the testing environment behaviour will be the same as the (presumably safe) training environment behaviour.
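A minimal sketch of this kind of prediction failure (my own illustration in Python, not anything from the post): a model that latches onto a cue which happens to track the label in the training environment behaves very differently once that cue decouples in the deployment environment, even though nothing about the model itself has changed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Training environment: a "spurious" cue happens to track the label closely.
y_train = rng.integers(0, 2, size=n)
cue_train = y_train + rng.normal(0.0, 0.1, size=n)

# The "model" is just a threshold on the cue -- the easiest signal available.
def model(cue):
    return (cue > 0.5).astype(int)

# Deployment environment: the cue no longer has anything to do with the label.
y_deploy = rng.integers(0, 2, size=n)
cue_deploy = rng.normal(0.5, 0.1, size=n)

print("training accuracy:  ", np.mean(model(cue_train) == y_train))
print("deployment accuracy:", np.mean(model(cue_deploy) == y_deploy))
# Expected: near 1.0 in training, near 0.5 (chance) in deployment.
```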

How to classify methods and problems

That's all well and good, but how do various traditional AI methods or problems fit into this framework? This should give us an idea as to whether the framework is useful.

It seems to me that:

Putting this all in a table:

Further refinements of the framework

It seems to me that the third category -- poor predictions -- is the most likely to be expandable. For the moment, it just incorporates all our lack of understanding about how AIs would behave, but it might be more useful to subdivide this.

4 comments


comment by danieldewey · 2017-05-19T19:28:51.000Z · LW(p) · GW(p)

Thanks for writing this -- I think it's a helpful kind of reflection for people to do!

comment by Vanessa Kosoy (vanessa-kosoy) · 2017-07-05T16:23:54.000Z · LW(p) · GW(p)

It seems to me that "friendly AI" is a name for the entire field rather than a particular approach, otherwise I don't understand what you mean by "friendly AI"? More generally, it would be nice to provide a link for each of the approaches.

comment by IAFF-User-238 (Imported-IAFF-User-238) · 2017-06-28T09:17:13.000Z · LW(p) · GW(p)

How about making a transition in the brain to make its own behaviour like a human's, and making a program that can be analysed according to the brain?

comment by thetasafe · 2017-05-26T01:07:55.000Z · LW(p) · GW(p)

I would like to suggest that I do not identify the problems of "values" and "poor predictions" as potentially resolvable problems. This is because:

  1. Among humans there are infants, younger children, and growing adults, who (for the sake of brevity of construct) develop up to roughly 19 years of age towards their natural physical and mental potential. Given this, it is no longer logically valid to treat the "values problem" as a problem for developing an AI/Oracle AI, because before 19 years of age a person's values cannot be known, their development still being at its onset. Apart from being only theoretically ideal, it might prove dangerous to assign or align values to humans, for the sake of the natural development of human civilisation.

  2. Holding the current status quo of "Universal Basic Education" and the above (1.) "values development" argument, it is not logical to expect that humans would be able to predict AI/Oracle AI behaviour, at a time when not even AI researchers can predict with full guarantee the potential of an Oracle AI, or of an AI developing itself into an AGI (a remote case, but one that cannot be held to have no potential for now). Thus I hold the "poor predictions" case to be logically irresolvable as a problem.

But halting development because of the two mentioned cases, especially the "poor predictions" one, is not logical for academic purposes.