An Oracle standard trick

stuart_armstrong

post by Stuart_Armstrong · 2015-06-03T14:17:11.645Z · LW · GW · Legacy · 33 comments

A putative new idea for AI control; index here.

EDIT: To remind everyone, this method does not entail the Oracle having false beliefs, just behaving as if it did; see here and here.

An idea I thought I'd been mentioning to everyone, but a recent conversation reveals I haven't been assiduous about it.

It's quite simple: whenever designing an Oracle, you should, as a default, run its output channel through a probabilistic process akin to the false thermodynamic miracle, in order to make the Oracle act as if it believed its message will never be read.

This reduces the possibility of the Oracle manipulating us through message content, because it is acting as if that content will never be seen by anyone.

Now, some Oracle designs can't use that (e.g. if accuracy is defined in terms of the reaction of people that read its output). But in general, if your design allows such a precaution, there's no reason not to put it on, so it should be the default in Oracle design.

Even if the Oracle design precludes this directly, some version of it can often be used. For instance, if accuracy is defined in terms of the reaction of the first person to read the output, and that person is isolated from the rest of the world, then we can get the Oracle to act as if it believed a nuclear bomb was due to go off before the person could communicate with the rest of the world.
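
As a rough illustration of the default precaution described above, here is a minimal toy sketch, not an actual Oracle design. Every name and number in it (`ERASE_PROB`, `channel`, `manipulation_bonus`, the candidate messages) is a hypothetical stand-in invented for the example. It assumes the output channel erases the message with some tiny probability, and that the Oracle's choice of message is scored only in the branch where the erasure happens, so any payoff that depends on a reader cannot affect the choice.

```python
import random

ERASE_PROB = 1e-6  # assumed tiny chance that the channel destroys the message unread


def channel(message, rng):
    """Stochastic output channel: returns None in the rare case the message is erased."""
    return None if rng.random() < ERASE_PROB else message


def accuracy(message, truth):
    """Toy accuracy score that does not depend on anyone reading the message."""
    return -abs(message - truth)


def manipulation_bonus(message):
    """Toy stand-in for what a misbehaving Oracle could gain if a human reads the message."""
    return 1000.0 if message == 999 else 0.0


def utility(message, truth, was_read):
    # The reader-dependent term only exists in worlds where the message is read.
    bonus = manipulation_bonus(message) if was_read else 0.0
    return accuracy(message, truth) + bonus


def choose_naive(candidates, truth):
    """Maximize expected utility over both the 'read' and 'erased' branches."""
    def expected(m):
        return ((1 - ERASE_PROB) * utility(m, truth, True)
                + ERASE_PROB * utility(m, truth, False))
    return max(candidates, key=expected)


def choose_as_if_unread(candidates, truth):
    """Score messages only in the erased branch, i.e. act as if nobody will read them."""
    return max(candidates, key=lambda m: utility(m, truth, was_read=False))


if __name__ == "__main__":
    candidates, truth = [40, 42, 999], 42
    print(choose_naive(candidates, truth))         # picks the manipulative message
    print(choose_as_if_unread(candidates, truth))  # picks the accurate message
    # In practice the message almost always gets through to the humans.
    print(channel(choose_as_if_unread(candidates, truth), random.Random(0)))
```

Note that nothing in the sketch changes the probability estimates: `ERASE_PROB` stays tiny, and the message almost always reaches its readers. Only the scoring of candidate messages is restricted to the erased branch, which is the sense in which the Oracle acts as if it believed the message would go unread without actually believing it.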

33 comments

Comments sorted by top scores.

comment by SilentCal · 2015-06-04T17:47:31.834Z · LW(p) · GW(p)

Is there a danger that if you use an Oracle with this trick repeatedly it will conclude that there's an identical Oracle whose message is being read, and start trying to acausally cooperate with this other Oracle?

Replies from: Stuart_Armstrong

↑ comment by Stuart_Armstrong · 2015-06-05T11:49:06.689Z · LW(p) · GW(p)

The oracle does not have inaccurate beliefs, it knows everything there is to know (see http://lesswrong.com/lw/ltf/false_thermodynamic_miracles/ ).

And if there's an acausal trade issue, we can break it via: http://lesswrong.com/r/discussion/lw/luy/acaucal_trade_barriers/ .

comment by Epictetus · 2015-06-04T22:08:31.349Z · LW(p) · GW(p)

Would there be any unintended consequences? I'm worried that possessing an incorrect belief may lead the Oracle to lose accuracy in other areas.

For instance, if accuracy is defined in terms of the reaction of the first person to read the output, and that person is isolated from the rest of the world, then we can get the Oracle to act as if it believed a nuclear bomb was due to go off before the person could communicate with the rest of the world.

In this example, would the imminent nuclear threat affect the Oracle's reasoning process? I'm sure there are some questions whose answers could vary depending on the likelihood of a nuclear detonation in the near future.

Replies from: Stuart_Armstrong, roystgnr

↑ comment by Stuart_Armstrong · 2015-06-05T11:46:46.020Z · LW(p) · GW(p)

The Oracle does not possess inaccurate beliefs. Look at http://lesswrong.com/lw/ltf/false_thermodynamic_miracles/ and http://lesswrong.com/r/discussion/lw/lyh/utility_vs_probability_idea_synthesis/ . Note I've always very carefully phrased it as "act as if it believed" rather than "believed".

↑ comment by roystgnr · 2015-06-04T22:48:42.919Z · LW(p) · GW(p)

Regardless of the mechanism for misleading the oracle, its predictions for the future ought to become less accurate in proportion to how useful they have been in the past.

"What will the world look like when our source of super-accurate predictions suddenly disappears" is not usually the question we'd really want to ask. Suppose people normally make business decisions informed by oracle predictions: how would the stock market react to the announcement that companies and traders everywhere had been metaphorically lobotomized?

We might not even need to program in "imminent nuclear threat" manually. "What will our enemies do when our military defenses are suddenly in chaos due to a vanished oracle?"

comment by [deleted] · 2015-06-03T14:26:01.712Z · LW(p) · GW(p)

Talk about inferential distances... From the title, I was thinking this must be about the Oracle Financials software package, or at least the database.

Replies from: Stuart_Armstrong, Gondolinian

↑ comment by Stuart_Armstrong · 2015-06-03T14:32:16.254Z · LW(p) · GW(p)

Those are less frightening, currently.

↑ comment by Gondolinian · 2015-06-03T15:55:34.428Z · LW(p) · GW(p)

In the interest of helping to bridge the inferential distance of others reading this, here's a link to the wiki page for Oracle AI.

comment by HungryHobo · 2015-06-05T15:32:15.046Z · LW(p) · GW(p)

I'm not seeing the point when it could simply be disallowed from any reasoning chain involving references to its own output and has no goals.

comment by Silver_Swift · 2015-06-04T14:15:00.891Z · LW(p) · GW(p)

(eg if accuracy is defined in terms of the reaction of people that read its output).

I'm mostly ignorant about AI design beyond what I picked up on this site, but could you explain why you would define accuracy in terms of how people react to the answers? There doesn't seem to be an obvious difference between how I react to information that is true or (unbeknownst to me) false. Is it just for training questions?

Replies from: Stuart_Armstrong

↑ comment by Stuart_Armstrong · 2015-06-05T11:44:50.578Z · LW(p) · GW(p)

It might happen. "Accuracy" could involve the AI answering with the positions of trillions of atoms, which is not human-parsable. So someone might code "human-parsable" as "a human confirms the message is parsable".

comment by Lumifer · 2015-06-03T15:11:09.183Z · LW(p) · GW(p)

Why would the Oracle send any messages at all, then?

Replies from: ike

↑ comment by ike · 2015-06-03T21:44:26.324Z · LW(p) · GW(p)

You don't expect the Oracle to try to influence you; it answers questions. Just in case it does try something, you lead it to believe it can't do anything anyway.

It would send a message because it's programmed to answer questions, I'm assuming.

Replies from: Lumifer, Stuart_Armstrong

↑ comment by Lumifer · 2015-06-04T14:40:29.477Z · LW(p) · GW(p)

Are we talking about an AI which has recursively self-improved?

Replies from: ike

↑ comment by ike · 2015-06-04T14:43:49.125Z · LW(p) · GW(p)

I don't think that should matter, unless the improvement causes them to realize they do have an effect.

The point is that this is a failsafe in case something goes wrong.

(Or that's how I understood the proposal.)

Personally, I doubt it would work, because the AI should be able to see that you've programmed it that way. You need to outsmart the AI, which is similar to boxing it and telling it it's not boxed.

Replies from: Lumifer

↑ comment by Lumifer · 2015-06-04T15:13:30.356Z · LW(p) · GW(p)

The issue with the self-modifying AI is precisely that "it was programmed to do that" stops being a good answer.

Replies from: Stuart_Armstrong, ike

↑ comment by Stuart_Armstrong · 2015-06-05T11:51:36.423Z · LW(p) · GW(p)

The "act as if it doesn't believe its messages will be read" is part of its value function, not its decision theory. So we are only requiring the value function to be stable over self improvement.

Replies from: Lumifer

↑ comment by Lumifer · 2015-06-05T14:31:22.072Z · LW(p) · GW(p)

Why is that? The value function tells you what is important, but the "act" part requires decision theory.

Replies from: Stuart_Armstrong

↑ comment by Stuart_Armstrong · 2015-06-05T16:17:20.952Z · LW(p) · GW(p)

What I mean is that I haven't wired the decision theory to something odd (which might be removed by self-improvement), just chosen a particular value system (which has a much higher chance of being preserved by self-improvement).

↑ comment by ike · 2015-06-04T17:43:27.237Z · LW(p) · GW(p)

It's supposed to keep that part of its programming. If we could rely on that, we wouldn't need any control. But we're worried it has changed, so we build in data which makes the AI think it won't have any control on the world, so even if it messes up it should at least not try to manipulate us.

Replies from: Lumifer

↑ comment by Lumifer · 2015-06-04T18:19:39.264Z · LW(p) · GW(p)

Right, so we have an AI which (1) is no longer constrained by its original programming; and (2) believes no one ever reads its messages. And thus we get back to my question: why would such an AI bother to send any messages at all?

Replies from: Stuart_Armstrong, ike

↑ comment by Stuart_Armstrong · 2015-06-05T11:53:38.631Z · LW(p) · GW(p)

The design I had in mind is: utility u causes the AI to want to send messages. This is modified to u' so that it also acts as if it believed the message wasn't read (note this doesn't mean that it believes it!). Then if u' remains stable under self-improvement, we have the same behaviour after self-improvement.
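
One way to make the step from u to u' concrete is the following sketch. It is an assumption made for illustration, not something spelled out in this thread: E denotes the event that the message is erased before anyone reads it, P(E) > 0 is taken not to depend on the message m, and c is an arbitrary constant.

```latex
% Assumed construction: u' equals u in worlds where the erasure event E occurs,
% and is a constant c in every other world.
u'(w) =
\begin{cases}
  u(w) & \text{if } E \text{ holds in } w,\\
  c    & \text{otherwise.}
\end{cases}

% Expected value of u' for a candidate message m, using the agent's true probabilities:
\mathbb{E}[u' \mid m] = P(E)\,\mathbb{E}[u \mid m, E] + \bigl(1 - P(E)\bigr)\,c

% Since P(E) > 0 and neither P(E) nor c depends on m:
\arg\max_m \mathbb{E}[u' \mid m] = \arg\max_m \mathbb{E}[u \mid m, E]
```

Under these assumptions the probabilities are left untouched, so the agent's beliefs stay accurate; only the message choice is made as if E were certain.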

Replies from: Lumifer

↑ comment by Lumifer · 2015-06-05T14:35:44.949Z · LW(p) · GW(p)

it also acts as if it believed the message wasn't read (note this doesn't mean that it believes it!)

So... you want to introduce, as a feature, the ability to believe one thing but act as if you believe something else? That strikes me as a remarkably bad idea. For one thing, people with such a feature tend to end up in psychiatric wards.

Replies from: gjm, Stuart_Armstrong

↑ comment by gjm · 2015-06-05T16:28:40.077Z · LW(p) · GW(p)

I haven't thought hard about Stuart's ideas, so this may or may not have any relevance to them; but it's at least arguable that it's really common (even outside psychiatric wards) for explicit beliefs and actions to diverge. A standard example: many Christians overtly believe that when Christians die they enter into a state of eternal infinite bliss, and yet treat other people's deaths as tragic and try to avoid dying themselves.

↑ comment by Stuart_Armstrong · 2015-06-05T16:17:49.416Z · LW(p) · GW(p)

Have you read the two articles I linked to, explaining the general principle?

Replies from: Lumifer

↑ comment by Lumifer · 2015-06-05T16:53:01.375Z · LW(p) · GW(p)

Yes, though I have not thought deeply (hat tip to Jonah :-D) about them.

The idea of decoupling AI beliefs from AI actions looks bad to me on its face. I expect it to introduce a variety of unpleasant failure modes ("of course I fully believe in CEV, it's just that I'm going to act differently...") and general fragility. And even if one of the utility functions is "do not care about anything but miracles", I still think it's just going to lead to a catatonic state, is all.

↑ comment by ike · 2015-06-04T18:26:11.672Z · LW(p) · GW(p)

In that case, you expect it to send no messages.

This strategy is supposed to make it so that instead of failing by sending bad messages, its failure mode is just shutting down.

If all works well, it answers normally, and if it doesn't work, it doesn't do anything because it expects nobody will listen. As opposed to an oracle that, if it messes up its own programming, will try to manipulate people with its answers.

Replies from: Lumifer

↑ comment by Lumifer · 2015-06-04T19:19:35.020Z · LW(p) · GW(p)

Well, yes, except that you can have a perfectly good entirely Friendly AI which just shuts down because nobody listens, so why bother?

You're not testing for Friendliness, you're testing for the willingness to continue the irrational waste of bits and energy.

Replies from: Silver_Swift

↑ comment by Silver_Swift · 2015-06-05T13:03:02.917Z · LW(p) · GW(p)

False positives are vastly better than false negatives when testing for friendliness though. In the case of an oracle AI, friendliness includes a desire to answer questions truthfully regardless of the consequences to the outside world.

Replies from: Lumifer

↑ comment by Lumifer · 2015-06-05T14:37:36.314Z · LW(p) · GW(p)

friendliness includes a desire to answer questions

Which definition of Friendliness are you referring to? I have a feeling you're treating Friendliness as a sack into which you throw whatever you need at the moment...

Replies from: Silver_Swift

↑ comment by Silver_Swift · 2015-06-08T13:49:08.025Z · LW(p) · GW(p)

Fair enough, let me try to rephrase that without using the word friendliness:

We're trying to make a superintelligent AI that answers all of our questions accurately but does not otherwise influence the world and has no ulterior motives beyond correctly answering questions that we ask of it.

If we instead accidentally made an AI that decides that it is acceptable to (for instance) manipulate us into asking simpler questions so that it can answer more of them, it is preferable that it doesn't believe anyone is listening to the answers it gives, because that is one less way it has of interacting with the outside world.

It is a redundant safeguard. With it, you might end up with a perfectly functioning AI that does nothing; without it, you may end up with an AI that is optimizing the world in an uncontrolled manner.

Replies from: Lumifer

↑ comment by Lumifer · 2015-06-08T14:55:44.590Z · LW(p) · GW(p)

it is preferable that it doesn't believe anyone is listening to the answers it gives

I don't think so. As I mentioned in another subthread here, I consider separating what an AI believes (e.g. that no one is listening) from what it actually does (e.g. answer questions) to be a bad idea.

↑ comment by Stuart_Armstrong · 2015-06-04T09:13:37.005Z · LW(p) · GW(p)

It would send a message because it's programmed to answer questions, I'm assuming.

Yes. Most hypothetical designs are set up that way.
