An idea for creating safe AI

post by IAFF-User-211 (Imported-IAFF-User-211) · 2017-02-04T11:16:35.000Z · LW · GW · 3 comments

This is a link post for https://medium.com/@huwjames81/an-idea-for-creating-safe-ai-3bda7fb336ac#.a15oaxpzt


comment by Vanessa Kosoy (vanessa-kosoy) · 2017-03-10T16:55:38.000Z · LW(p) · GW(p)

Designing an agent which is guaranteed to terminate is not, in itself, a solution to AI safety. Indeed, this desideratum is already satisfied by Minsky's ultimate machine. At the very least, we have to design an agent which will be powerful enough to permanently defend us against malicious AIs without adverse side effects. So, we can indeed have AIs that are incentivized to complete some task in a short amount of time, but it is not clear how to formulate the task of "defending against malicious AIs" for such an agent. The closest thing is probably Paul Christiano's approval-directed agents, where the AI generates some output (e.g. a plan of defense against malicious AIs) which a human has to approve. There are problems with this: for one thing, a plan which a human would approve might still be a bad plan (or even a dangerous memetic virus); for another, the module inside the AI responsible for modeling humans is susceptible to acausal attack.

Replies from: Imported-IAFF-User-111
comment by IAFF-User-111 (Imported-IAFF-User-111) · 2017-03-13T01:09:32.000Z · LW(p) · GW(p)

I agree it's not a complete solution, but it might be a good path towards creating a task-AI, which is a potentially important unsolved sub-problem.

comment by IAFF-User-111 (Imported-IAFF-User-111) · 2017-03-13T01:07:12.000Z · LW(p) · GW(p)

I spoke with Huw about this idea. I was thinking along similar lines at some point, but only for "safe-shutdown", e.g. if you had a self-driving car that anticipated encountering a dangerous situation and wanted to either:

  1. pull over immediately
  2. cede control to a human operator

It seems intuitive to give it a shutdown policy that triggers in such cases and that aims to minimize a combined objective of time-to-shutdown and risk-of-shutdown. (Of course, this doesn't deal with interrupting the agent, à la Armstrong and Orseau.)
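
For concreteness, here is a minimal sketch of what such a combined shutdown objective might look like. The option names, time and risk estimates, and the linear weighting are all illustrative assumptions, not anything from the post.

```python
# Minimal sketch of the combined shutdown objective described above.
# All names, options, and numbers are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class ShutdownOption:
    name: str
    expected_time: float  # expected seconds until the agent is safely shut down
    risk: float           # estimated probability of harm while shutting down

def shutdown_cost(option: ShutdownOption, risk_weight: float = 100.0) -> float:
    """Combined objective: trade off time-to-shutdown against risk-of-shutdown."""
    return option.expected_time + risk_weight * option.risk

options = [
    ShutdownOption("pull over immediately", expected_time=5.0, risk=0.02),
    ShutdownOption("cede control to human operator", expected_time=12.0, risk=0.005),
]

# Pick the shutdown option minimizing the combined objective.
best = min(options, key=shutdown_cost)
print(best.name)
```

How to weight risk against time is the substantive design choice here; the linear combination above is just the simplest possibility.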

Huw pointed out that a similar strategy can be used for any "genie"-style goal (i.e. you want an agent to do one thing as efficiently as possible, and then shut down until you give it another command), which made me substantially more interested in it.

This seems similar in spirit to giving your agent a short horizon, but now you also get regular terminations by default, which has some extra pros and cons.
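
For illustration, a minimal sketch of that genie-style control flow, assuming a hard per-task step budget; `await_command` and `do_step` are hypothetical stubs standing in for the operator interface and the task itself, not anything from the post.

```python
# Minimal sketch of the "genie"-style loop described above: do one task under
# a hard step budget, then terminate by default until the next command.
# await_command and do_step are hypothetical stubs.

def await_command() -> str | None:
    """Block until the operator issues a command; empty input means stop."""
    return input("command (blank to stop): ") or None

def do_step(command: str) -> bool:
    """Hypothetical placeholder for one step of task execution."""
    print(f"working on: {command}")
    return True  # pretend the task completes in a single step

def run_task(command: str, max_steps: int = 1000) -> None:
    """Execute one task, but never for more than max_steps steps."""
    for _ in range(max_steps):
        if do_step(command):
            break
    # Returning here is the default, regular termination after each task.

while (cmd := await_command()) is not None:
    run_task(cmd)  # each task ends in shutdown; the agent idles until commanded
```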