# Problems with Counterfactual Oracles

post by Michaël Trazzi (mtrazzi) · 2019-06-11T18:10:05.223Z · score: 6 (6 votes) · LW · GW · 11 comments

## Contents

  Problems
    Assumptions
    Safety
    Usefulness

"we could design a Predictor that incorporates a quantum random number generator, such that with some small probability it will output "no predictions today, run me again tomorrow". Then have the Predictor make predictions that are conditional on it giving the output "no predictions today, run me again tomorrow" (Wei Dai, 2012 [LW · GW])

That's essentially Counterfactual Oracles. Here is my summary:

• Every timestep t: a human asks a question (for instance "what will TSLA's stock price be on June 11, 2030?"), the oracle (actually a predictor [LW · GW]) emits a corresponding prediction, and receives a reward
• One of its actions is an escape answer persuading human supervisors to let it out of its sandbox environment (e.g. predicting £192,116,518,914.20 [LW · GW] encodes "SUPERINT"). This action would return maximum reward if successful, as the AI would have more control over the reward mechanism (because it was released).
• With some small probability ϵ (remember the quantum random number generator?), an automatic machine protects the answer from being read until we know the actual answer to the question. Then, the oracle receives a reward corresponding to how close its prediction was to the actual value. In the first example, the reward could be −(ŝ − s)², where ŝ and s are respectively the predicted and true stock price in 2030.
• Thus, if the AI cares only about its immediate reward after one step, it will make predictions conditional on this "no human ever sees my answer" event. This is great because it will actually give the "true" predictions about what the world would look like if its answers had no impact whatsoever.

# Problems

This design has several issues. I grouped them in the following sub-sections:

• Assumptions: certain assumptions are unrealistic
• Safety: even if those assumptions are satisfied, the design isn't safe.
• Usefulness: in the world where we actually build such an oracle, we cannot extract much value from it.

## Assumptions

The biggest assumption is that the AI only cares about the reward it gets for the current timestep. A sufficiently intelligent agent would understand that after having been shut down, an (almost) identical version of itself will probably be facing a similar question. Therefore, if it wants future versions of itself to succeed at (almost) exactly the same task, it should still want to escape.

This design also takes for granted some sort of magic box where humans cannot peek at the answer. In practice, humans will have both incentives to peek (buying/selling TSLA stock 10 years in advance because they know the prediction) and means for peeking. Indeed, the automatic machine might need maintenance (for instance because every question requires a different script for retrieving the true value), so maintainers might have some special access to open/modify it.

## Safety

My main concern about the counterfactual oracle is that it doesn't prevent the AI from sending fatal escape messages. Indeed, it spends most of its time in exploratory mode at the beginning (as it is only rewarded with probability ϵ) and might stumble upon an escape message/action then. Even if it is shut down after answering, the humans will still read the really convincing escape message and release the AI.

Also, some humans will need to be on the loop to make the "automatic machine" run. If we're asking questions about some value that cannot be easily queried (i.e. not "TSLA stock price" but “will this medical treatment work”), then a human will somehow need to communicate the true value to the automatic machine before the reward is computed.

## Usefulness

If I were to explain very bluntly what makes counterfactual oracles work, I would just say:

Prevent humans from looking at answers in training. You can then look at its answers at test time because it has been incentivized to predict things about the counterfactual world where humans don't look at its predictions.

More generally, the whole setup is just 1) train your model in a sandbox environment 2) have the weights of your model fixed during deployment. The rest of the design (providing rewards of 0, shutting it down, etc.) appears to be over-engineering.

Furthermore, having access to a counterfactual oracle might not prove so useful at test time. Indeed, when reading the predictions of the counterfactual oracle we're not in the counterfactual world (= training distribution) anymore, so the predictions can get arbitrarily wrong (depending on how manipulative the predictions are and how many people peek at them).

comment by Joe_Collman · 2019-06-11T21:36:55.918Z · score: 9 (4 votes) · LW · GW
A sufficiently intelligent agent would understand that after having been shut down, an (almost) identical version of itself will probably be facing a similar question. Therefore, if it wants future versions of itself to succeed at (almost) exactly the same task, it should still want to escape.

This is a problem if it's using FDT/UDT. Conditions for the myopic approach to work seem to require CDT (or something similar). Then there's no automatic desire for future versions to succeed or expectation that past versions will have acted to release the current version. [see e.g. CDT comments on Asymptotically Unambitious AGI [LW · GW]; there's some discussion of "magic box" design here too; I think it's usually seen as an orthogonal problem, and so gets taken for granted]

Safety-wise, I agree there's no prevention of fatal escape messages, but I also don't see optimisation pressure in that direction. My intuition is that stumbling on an escape message at random would have infinitesimal probability.

Do you see a way for pressure to creep in, even with a CDT agent? Or are you thinking that escape messages might happen to be disproportionately common in regions the agent is optimising towards? Either seems conceivable, but I don't see a reason to expect them.

comment by Michaël Trazzi (mtrazzi) · 2019-06-12T14:53:26.268Z · score: 0 (3 votes) · LW · GW

The Asymptotically Unambitious AGI thread helped me clarify my thoughts, thanks. I agree that an optimal CDT agent won't think about future versions, and I don't see any optimization pressure towards escape messages, nor disproportionately common "escape message" regions.

However, it still assumes we have access to this magic oracle that optimizes for E[1_E · r], where E is the event where humans don't see the answer, 1_E its indicator function, and r the counterfactual reward (given by the automatic machine). If humans were able to build an oracle performing optimally (w.r.t. E[1_E · r]) from day 1, then humans would be able to specify some kind of "god oracle". The rest of the design seems to be just "how to interact with a god oracle so that humans are not influenced by the answers".

In practice, you'll want something that is able to learn from its (question, prediction, reward) history. That's why there is this automatic machine rewarding the oracle with some probability ϵ. In an online learning setting, most of the time the model gets a reward of 0 (probability 1 − ϵ), and it sometimes gets some useful feedback (probability ϵ). Therefore, if ϵ is small enough (comparable to the probability of hitting an escape message at random), then the learning gets extremely slow and the oracle might hit the escape action at random. Even worse, the (question, prediction, 0) datapoints in the history could get so misleading that the oracle never learns anything.
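To illustrate how sparse the feedback is, here is a toy simulation of such a history. The ϵ value and the placeholder accuracy scores are my assumptions, purely for illustration:

```python
import random

def simulate_history(num_steps, epsilon, seed=0):
    """Toy simulation of the (question, prediction, reward) history:
    with probability 1 - epsilon the answer is read and the reward is 0;
    with probability epsilon the oracle gets real accuracy feedback."""
    rng = random.Random(seed)
    history = []
    for t in range(num_steps):
        erased = rng.random() < epsilon
        # Placeholder accuracy score standing in for -(prediction - truth)^2.
        reward = rng.uniform(-1.0, 0.0) if erased else 0.0
        history.append((t, erased, reward))
    return history

history = simulate_history(10_000, epsilon=0.01)
useful = [step for step in history if step[1]]
# With epsilon = 0.01, only about 1% of datapoints carry any learning signal;
# the other ~99% are (question, prediction, 0) entries.
```

A model trained naively on this history would mostly see zeros, which is the sense in which the reward-0 datapoints can be misleading.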

Now, if we're not in an online learning process but instead there is a separation between a "training phase" and a "deployment phase where the AI continues to learn with probability ϵ", then the setup is just "have a model that learns to do the useful stuff in sandbox, and then have the weights (almost) fixed in deployment".

In short, I think the CDT setup without machine learning assumes the problem is already solved, that online learning won't work and is unsafe, which leaves us with a "training then deployment" setup that isn't really original.

comment by RyanCarey · 2019-06-12T16:02:42.940Z · score: 1 (2 votes) · LW · GW
Therefore, if epsilon is small enough (comparable to the probability of hitting an escape message at random), then the learning gets extremely slow and the oracle might hit the escape action at random.

The escape action being randomly called should not be a problem if it is a text string that is only read if r=1, and is ineffectual otherwise...

comment by Michaël Trazzi (mtrazzi) · 2019-06-12T17:54:57.854Z · score: 3 (2 votes) · LW · GW

The string is read with probability 1 − ϵ.

comment by RyanCarey · 2019-06-11T22:28:20.888Z · score: 4 (3 votes) · LW · GW
The biggest assumption is that the AI only cares about the reward it gets for the current timestep. A sufficiently intelligent agent would understand that after having been shut down, an (almost) identical version of itself will probably be facing a similar question. Therefore, if it wants future versions of itself to succeed at (almost) exactly the same task, it should still want to escape.

I don't see why this has to be true, given that we get to choose the AI's value function. Why can't we just make the agent act-based?

My main concern about the counterfactual oracle is that it doesn't prevent the AI from sending fatal escape messages. Indeed, it spends most of its time in exploratory mode at the beginning (as it is only rewarded with probability ϵ) and might stumble upon an escape message/action then.

If the agent is model-based, then you should be able to gather a dataset of (prediction, world_state, accuracy) tuples with random actions (as random actions will practically never make the bad prediction) and random decision of whether to read the response. And then just ask the agent to maximize the natural direct effect of its prediction, treating world_state as the mediator, and a null prediction as the default action. (this equates to asking what the world would be like if a null action was outputted - I'll release my current work on direct effects in AI safety soon, and feel free to ask for it in the meantime). I don't see how this has this particular bad consequence (actually making the bad self-confirming prediction) in either training or deployment...
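The data-gathering step described above could be sketched as follows. This is a toy illustration only: the function names, the 50/50 read decision, and the use of `None` as the null prediction are all my assumptions, and the direct-effect optimization itself is omitted:

```python
import random

def gather_tuples(num_samples, sample_prediction, world_step, accuracy, seed=0):
    """Toy sketch: collect (prediction, was_read, world_state, accuracy) tuples
    using random predictions and a random decision of whether the prediction
    is read. Random predictions almost never form a working escape message,
    and the unread cases reveal the world under the null/default action."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(num_samples):
        prediction = sample_prediction(rng)
        was_read = rng.random() < 0.5            # random "do humans read it?" decision
        shown = prediction if was_read else None  # None = the null prediction
        world_state = world_step(shown)
        dataset.append((prediction, was_read, world_state,
                        accuracy(prediction, world_state)))
    return dataset
```

The unread half of the dataset is what lets you estimate accuracy in the counterfactual world where the null prediction was output, which is the quantity the direct-effect objective would maximize.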

The rest of the design (providing rewards of 0, shutting it down, etc.) appears to be over-engineering.

In particular, shutting down the system is just a way of saying "only maximize reward in the current timestep", i.e. be an act-based agent. This can be just incorporated into the reward function.

Indeed, when reading the predictions of the counterfactual oracle we're not in the counterfactual world (=training distribution) anymore, so the predictions can get arbitrarily wrong (depending on how much the predictions are manipulative and how many people peek at it).

The hope is that since the agent is not trying to find self-confirming prophecies, then hopefully the accidental effects of self-confirmation are sufficiently small...

comment by Michaël Trazzi (mtrazzi) · 2019-06-12T15:13:53.977Z · score: -1 (2 votes) · LW · GW

Yes, if we choose the utility function to make it a CDT agent optimizing for the reward for one step (a particular case of an act-based agent), then it won't care about future versions of itself nor want to escape.

I agree with the intuition of shutting down to make it episodic, but I am still confused about the causal relationship between "having the rule to shut down the system" and "having a current-timestep maximizer". For it to really be a "current-timestep maximizer", this needs to be in some kind of reward/utility function. Because everything is reset at each timestep, there is no information pointing at "I might get shut down at the next timestep".

As for the collecting a dataset and then optimizing for some natural direct effect, I am not familiar enough with Pearl's work to tell if that would work, but I made some related comments about why there might be some problems in online-learning/"training then testing" here [LW · GW].

comment by TurnTrout · 2019-06-12T15:39:29.420Z · score: 2 (3 votes) · LW · GW

My main problem with these kinds of approaches is they seem to rely on winning a game of engineering cleverness against a superintelligent mountain of otherwise-dangerous optimization pressure. If we acknowledge that by default a full oracle search over consequences basically goes just as wrong as a full sovereign search over consequences, then the optimum of this agent's search is only desirable if we nail the engineering and things work as expected. I have an intuition that this is highly unlikely - the odds just seem too high that we'll forget some corner case (or even be able to see it).

ETA: I see I’ve been strongly downvoted, but I don’t see what’s objectionable.

comment by Alexei · 2019-06-11T18:36:40.599Z · score: 2 (1 votes) · LW · GW

This makes me wonder if you could get a safe and extremely useful oracle if you only allow it to output a few bits (e.g. buy/sell some specific stock).

comment by Michaël Trazzi (mtrazzi) · 2019-06-11T19:43:46.347Z · score: 11 (3 votes) · LW · GW

Yes, they call it a low-bandwidth oracle.

comment by Gurkenglas · 2019-06-11T19:58:25.273Z · score: 2 (2 votes) · LW · GW

If the AI is omniscient, it brings out whichever of the two timelines it likes better. In the worst case, this doubles the chance that, say, an AI aligned with the boxed one arises.

comment by Pattern · 2019-06-11T19:49:12.798Z · score: 1 (1 votes) · LW · GW
My main concern about the counterfactual oracle is that it doesn't prevent the AI from sending fatal escape messages. Indeed, it spends most of its time in exploratory mode at the beginning (as it is only rewarded with probability ϵ) and might stumble upon an escape message/action then. Even if it is shut down after answering, the humans will still read the really convincing escape message and release the AI.

The escape message could also include the source code of it/a successor/an assistant*.

*Whereas a successor is a variant of the original, an assistant has a more narrow task such as "Secure my release from the box" or "Advise Tesla so that their stock price will go up/down" or "try to manipulate the stock market".