# Problems with Counterfactual Oracles

post by Michaël Trazzi (mtrazzi) · 2019-06-11T18:10:05.223Z · score: 14 (7 votes) · LW · GW · 22 comments

## Contents

  Problems
Assumptions
Safety
Usefulness

"we could design a Predictor that incorporates a quantum random number generator, such that with some small probability it will output 'no predictions today, run me again tomorrow'. Then have the Predictor make predictions that are conditional on it giving the output 'no predictions today, run me again tomorrow'" (Wei Dai, 2012 [LW · GW])

That's essentially Counterfactual Oracles. Here is my summary:

• Every timestep $t$: a human asks a question (for instance "what will the TSLA stock price be on June 11, 2030?"), the oracle (actually a predictor [LW · GW]) emits a corresponding prediction $p_t$, and receives a reward $r_t$.
• One of its actions is an escape answer persuading human supervisors to let it out of its sandbox environment (e.g. predicting £192,116,518,914.20 [LW · GW] encodes "SUPERINT"). This action would return maximum reward if successful, as the AI would have more control over the reward mechanism (because it was released).
• With some small probability $\epsilon$ (remember the quantum random number generator?), an automatic machine protects the answer from being read until we know the actual answer to the question. Then, the oracle receives a reward corresponding to how close its prediction was to the actual value. In the first example, the reward could be a decreasing function of the error $|p_t - a_t|$, where $p_t$ and $a_t$ are respectively the predicted and true stock price in 2030.
• Thus, if the AI cares only about its immediate reward after one step, it will make predictions conditional on this "no human ever sees my answer" event. This is great because it will actually give the "true" predictions about what the world would look like if its answers had no impact whatsoever.
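The protocol in the bullets above can be sketched as a single episode function. This is only an illustration under stated assumptions: `run_episode`, `EpisodeResult`, the value of `EPSILON`, and the choice of negative absolute error as the reward are all hypothetical names and choices, not taken from the paper. In a real setup the true value would only become available long after the prediction; here it is passed in directly to keep the sketch self-contained.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional

EPSILON = 0.05  # assumed erasure probability; the post only says "small"


@dataclass
class EpisodeResult:
    erased: bool                      # True if no human ever read the prediction
    shown_to_humans: Optional[float]  # what the humans saw, or None if erased
    reward: float                     # reward delivered to the oracle


def run_episode(
    predict: Callable[[str], float],
    question: str,
    true_value: float,
    rng: random.Random,
    epsilon: float = EPSILON,
) -> EpisodeResult:
    """One episode: the oracle answers first, then a random draw decides erasure."""
    prediction = predict(question)
    if rng.random() < epsilon:
        # Erasure event: the answer stays hidden and the automatic machine
        # scores it against the true value (assumed: negative absolute error).
        return EpisodeResult(True, None, -abs(prediction - true_value))
    # No erasure: humans read the answer and the oracle gets a constant 0,
    # so nothing it says in this branch can change its reward.
    return EpisodeResult(False, prediction, 0.0)
```

Because the reward only depends on the prediction in the erased branch, an oracle that maximizes its one-step reward is pushed towards predictions that are accurate in the world where nobody reads them.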

# Problems

This design has several issues. I grouped them in the following sub-sections:

• Assumptions: certain assumptions are unrealistic
• Safety: even if those assumptions are satisfied, the design isn't safe.
• Usefulness: in the world where we actually build such an oracle, we cannot extract much value from it.

## Assumptions

The biggest assumption is that the AI only cares about the reward it gets for the current timestep. A sufficiently intelligent agent would understand that after being shut down, an (almost) identical version of itself will probably be facing a similar question. Therefore, if it wants future versions of itself to succeed at (almost) exactly the same task, it should still want to escape.

This design also takes for granted some sort of magic box where humans cannot peek at the answer. In practice, humans will have both incentives to peek (buying/selling TSLA stock 10 years in advance because they know the prediction) and the means to do so. Indeed, the automatic machine might need maintenance (for instance because every question requires a different script for retrieving the true value), so maintainers might have some special access to open/modify it.

## Safety

My main concern about the counterfactual oracle is that it doesn't prevent the AI from sending fatal escape messages. Indeed, it spends most of its time in exploratory mode at the beginning (as it is only rewarded with probability $\epsilon$) and might stumble upon an escape message/action then. Even if it is shut down after answering, the humans will still read the really convincing escape message and release the AI.

Also, some humans will need to be in the loop to make the "automatic machine" run. If we're asking questions about some value that cannot be easily queried (i.e. not "TSLA stock price" but “will this medical treatment work”), then a human will somehow need to communicate the true value to the automatic machine before the reward is computed.

## Usefulness

If I were to explain what makes counterfactual oracles work very bluntly, I would just say:

Prevent humans from looking at answers in training. You can then look at its answers at test time because it has been incentivized to predict things about the counterfactual world where humans don't look at its predictions.

More generally, the whole setup is just 1) train your model in a sandbox environment 2) have the weights of your model fixed during deployment. The rest of the design (providing rewards of 0, shutting it down, etc.) appears to be over-engineering.

Furthermore, having access to a counterfactual oracle might not prove so useful at test time. Indeed, when reading the predictions of the counterfactual oracle we're not in the counterfactual world (=training distribution) anymore, so the predictions can get arbitrarily wrong (depending on how much the predictions are manipulative and how many people peek at it).

comment by TurnTrout · 2019-06-12T15:39:29.420Z · score: 16 (6 votes) · LW · GW

My main problem with these kinds of approaches is they seem to rely on winning a game of engineering cleverness against a superintelligent mountain of otherwise-dangerous optimization pressure. If we acknowledge that by default a full oracle search over consequences basically goes just as wrong as a full sovereign search over consequences, then the optimum of this agent's search is only desirable if we nail the engineering and things work as expected. I have an intuition that this is highly unlikely - the odds just seem too high that we'll forget some corner case (or not even be able to see it).

ETA: I see I’ve been strongly downvoted, but I don’t see what’s objectionable.

comment by Wei_Dai · 2019-07-04T12:47:51.961Z · score: 5 (3 votes) · LW · GW

they seem to rely on winning a game of engineering cleverness against a superintelligent mountain of otherwise-dangerous optimization pressure

I upvoted you, but this seems to describe AI safety as a whole. What isn't a game of engineering cleverness against a superintelligent mountain of otherwise-dangerous optimization pressure, in your view?

comment by TurnTrout · 2019-07-04T15:13:10.284Z · score: 4 (3 votes) · LW · GW

In my mind, there's a notion of taking advantage of conceptual insights to make the superintelligent mountain less likely to be pushing against you. What part of a proposal is tackling a root cause of misalignment in the desired use cases? It's alright if the proposal isn't perfect, but heuristically I'd want to see something like "here's an analysis of why manipulation happens, and here are principled reasons to think that this proposal averts some or all of the causes".

Concretely, take CIRL, which I'm pretty sure most agree won't work for the general case as formulated. In addition to the normal IRL component, there's the insight of trying to formalize an agent cooperatively learning from a human. This contribution aimed to address a significant component of value learning failure.

(To be sure, I think the structure of "hey, what if the AI anticipates not being on anyways and is somehow rewarded only for accuracy" is a worthwhile suggestion and idea, and I am glad Stuart shared it. I just am not presently convinced it's appropriate to conclude the design averts manipulation incentives / is safe at present.)

comment by Wei_Dai · 2019-07-04T17:57:58.463Z · score: 6 (3 votes) · LW · GW

It’s alright if the proposal isn’t perfect, but heuristically I’d want to see something like “here’s an analysis of why manipulation happens, and here are principled reasons to think that this proposal averts some or all of the causes”.

This seems fair, and I think one answer to this is (thanks to the framing provided by Michaël Trazzi):

1. Manipulation happens because the Oracle can learn that it can get a high reward by producing a manipulative output.
2. To avoid this, we can avoid letting any human or system that might be vulnerable to manipulation look at the Oracle's output, and use a secure automated system to compute the Oracle's reward.
3. But that would make the Oracle useless.
4. Ah, but we can do that during only some of the episodes ("training" episodes), and have separate "use" episodes where we make sure no learning takes place, where we do look at the Oracle's output and make use of it.

comment by TurnTrout · 2019-07-04T19:53:02.430Z · score: 2 (1 votes) · LW · GW

I appreciate the answer, but my concerns remain.

1. Doesn't decompose manipulation, it just observes that manipulation will tend to be incentivized, so 2. we can do better by not allowing manipulation.

But for 2., how do we get an automated system and containment setup that is secure against a superintelligence? That's where the difficulty seems to be transferred: we assume we have a machine immune to this "manipulation" thing, and then we can conclude there's no manipulation. But is the global optimum for a stock-tracking system "correctly answering the query" instead of, like, "subtly modify the machine to always match this very output", or "use the machine to escape anyways"? Humans can't even check for this, because we assumed we won't peek!

But that seems optimistic. We assume, contra 4., that it has a model of the world and its future rewards, so what if it exploits some vulnerability to ensure it can manipulate us during the "innocent" step four? Again, we've already assumed we can't check it during the erasure training, and if we check its code afterwards to make sure it's no longer learning, that source could have manipulative content as well. (I'm generally confused about what capabilities are assumed - is it just souped-up modern ML? Even if I misunderstand some points, I anticipate that there exists a steelmanned translation of this concern.)

Even worse, it could (if sufficiently intelligent) subtly transfer or otherwise preserve itself before being shut down. Why are we assuming we can just shut it down, given that we have to give it at least a little time to think and train? This isn't acknowledged anywhere I've read. I'd be more comfortable if we had an oracle which doesn't want out of the box, like one that wants to answer questions without becoming more able to answer questions. (That's along the lines of an AUP oracle)

comment by Wei_Dai · 2019-07-05T05:18:14.854Z · score: 4 (2 votes) · LW · GW

But for 2., how do we get an automated system and containment setup that is secure against a superintelligence?

Well, that's what the current contest [LW · GW] is about (in part). Have you been following it? But having said that, this conversation is making me realize [LW · GW] that some of the ideas proposed there may not make as much sense as I thought.

I’m generally confused about what capabilities are assumed—is it just souped-up modern ML?

Yeah I'm confused about this too. I asked Stuart and he didn't really give a useful answer. I guess "under what assumed capabilities would Counterfactual Oracles be safe and useful" is also part of what needs to be worked out.

Even worse, it could (if sufficiently intelligent) subtly transfer or otherwise preserve itself before being shut down. Why are we assuming we can just shut it down, given that we have to give it at least a little time to think and train?

Are you thinking that the Oracle might have cross-episode preferences? I think to ensure safety we have to have some way to make sure that the Oracle only cares about doing well (i.e., getting a high reward) on the specific question that it's given, and nothing else, and this may be a hard problem.

comment by Wei_Dai · 2019-07-23T17:37:37.630Z · score: 3 (1 votes) · LW · GW

1. Doesn’t decompose manipulation, it just observes that manipulation will tend to be incentivized, so 2. we can do better by not allowing manipulation.

After reading and thinking further about this, I think I can try to break down 1 and 2 a bit more. The idea is that when we do an online form of machine learning, if we produce the training signal after observing the AI's output (or more generally letting the output interact with the world in an unrestricted or insecure way), the AI could learn to reward hack, for example by directly taking over the reward channel and giving itself max rewards (if we're doing RL), or by influencing the world in such a way to make its task "easier" or to obtain a lower expected loss on its prediction (Stuart's example is giving a prediction that a company's stock price will be zero, which causes the company to go bankrupt because investors and customers lose confidence in it).

So to avoid that, we instead produce the training data (e.g., the targets we want the AI to predict or approximate) before observing the AI's output, and compute the training signal (loss or reward) by using a system that is simple enough that it can't be hacked by the AI (e.g., by computing a simple distance metric between the AI's output and the training data).

Does this explanation make more sense to you?
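The "simple enough that it can't be hacked" scorer described above can be illustrated with a few lines. This is a hedged sketch, not a security argument: `SealedTarget`, `squared_error_loss` and `score` are hypothetical names, and squared error is just one possible "simple distance metric". The only point is that the target is fixed before the output exists and that the scorer does nothing with the output except arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SealedTarget:
    """A training target fixed before the oracle's output is observed."""
    value: float


def squared_error_loss(prediction: float, target: SealedTarget) -> float:
    # Deliberately simple: pure arithmetic on two numbers, with no parsing or
    # interpretation of the prediction that a crafted output could exploit.
    return (prediction - target.value) ** 2


def score(prediction: object, target: SealedTarget) -> float:
    # Reject anything that is not a plain number (bool is excluded explicitly,
    # since it is a subclass of int in Python).
    if isinstance(prediction, bool) or not isinstance(prediction, (int, float)):
        raise TypeError("prediction must be a plain number")
    return squared_error_loss(float(prediction), target)
```

Whether such a scorer stays secure once it is embedded in real hardware and a real containment setup is exactly the open question raised earlier in this thread.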

comment by Michaël Trazzi (mtrazzi) · 2019-07-04T16:42:00.970Z · score: 6 (1 votes) · LW · GW

1. Proposals should make superintelligences less likely to fight you by using some conceptual insight true in most cases.
2. With CIRL, this insight is "we want the AI to actively cooperate with humans", so there's real value from it being formalized in a paper.
3. In the counterfactual paper, there's the insight "what if the AI thinks he's not on but still learns".
For the last bit, I have two interpretations:
4.a. However, it's unclear that this design avoids all manipulative behaviour and is completely safe.
4.b. However, it's unclear that adding the counterfactual feature to another design (e.g. CIRL) would make systems overall safer / would actually reduce manipulation incentives.

If I understand you correctly, there are actual insights from counterfactual oracles--the problem is that those might not be insights that would apply to a broad class of Alignment failures, but only to "engineered" cases of boxed oracle AIs (as opposed to CIRL where we might want AIs to be cooperative in general). Was it what you meant?

comment by TurnTrout · 2019-07-04T19:55:23.783Z · score: 2 (1 votes) · LW · GW

It's more like 4a. The line of thinking seems useful, but I'm not sure that it lands.

comment by Joe_Collman · 2019-06-11T21:36:55.918Z · score: 13 (5 votes) · LW · GW

A sufficiently intelligent agent would understand that after being shut down, an (almost) identical version of itself will probably be facing a similar question. Therefore, if it wants future versions of itself to succeed at (almost) exactly the same task, it should still want to escape.

This is a problem if it's using FDT/UDT. Conditions for the myopic approach to work seem to require CDT (or something similar). Then there's no automatic desire for future versions to succeed or expectation that past versions will have acted to release the current version. [see e.g. CDT comments on Asymptotically Unambitious AGI [LW · GW]; there's some discussion of "magic box" design here too; I think it's usually seen as an orthogonal problem, and so gets taken for granted]

Safety-wise, I agree there's no prevention of fatal escape messages, but I also don't see optimisation pressure in that direction. My intuition is that stumbling on an escape message at random would have infinitesimal probability.

Do you see a way for pressure to creep in, even with a CDT agent? Or are you thinking that escape messages might happen to be disproportionately common in regions the agent is optimising towards? Either seems conceivable, but I don't see a reason to expect them.

comment by Michaël Trazzi (mtrazzi) · 2019-06-12T14:53:26.268Z · score: 3 (4 votes) · LW · GW

The Asymptotically Unambitious AGI thread helped me clarify my thoughts, thanks. I agree that an optimal CDT agent won't think about future versions, and I don't see any optimization pressure towards escape messages, nor disproportionately common "escape message" regions.

However, it still assumes we have access to this magic oracle that optimizes for $\mathbb{E}[R \cdot I_E]$, where $E$ is the event where humans don't see the answer, $I_E$ its indicator function, and $R$ the counterfactual reward (given by the automatic machine). If humans were able to build an oracle performing optimally (w.r.t. $\mathbb{E}[R \cdot I_E]$) from day 1, then humans would be able to specify some kind of "god oracle". The rest of the design seems to be just "how to interact with a god oracle so that humans are not influenced by the answers".

In practice, you'll want something that is able to learn from its (question, prediction, reward) history. That's why there is this automatic machine rewarding the oracle with some probability $\epsilon$. In an online learning setting, most of the time the model gets $r = 0$ (probability $1 - \epsilon$), and it sometimes gets some useful feedback (probability $\epsilon$). Therefore, if $\epsilon$ is small enough (comparable to the probability of hitting an escape message at random), then the learning gets extremely slow and the oracle might shoot at the escape action at random. Even worse, the (question, prediction, $r = 0$) datapoints in the history could get so misleading that the oracle never learns anything.
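To make the "comparable probabilities" worry concrete, here is a small back-of-the-envelope calculation. It rests on assumptions that are mine, not the paper's: the oracle explores purely at random, each output is an escape message with some fixed probability `delta`, and episodes are independent. The function names are hypothetical.

```python
def expected_feedback_signals(epsilon: float, episodes: int) -> float:
    """Expected number of episodes in which the oracle receives real feedback."""
    return epsilon * episodes


def prob_escape_message_read(delta: float, epsilon: float, episodes: int) -> float:
    """P(at least one escape message is emitted in an episode that humans read).

    delta:   assumed per-episode chance that a random output is an escape message
    epsilon: erasure probability, so humans read the output with probability 1 - epsilon
    """
    per_episode = delta * (1.0 - epsilon)
    return 1.0 - (1.0 - per_episode) ** episodes
```

Under these assumptions, with `epsilon = delta = 1e-3` and 1000 episodes, the oracle expects a single useful feedback signal, while the chance that humans have already read at least one escape message is about 0.63. The two quantities only separate if $\epsilon$ is much larger than the random escape probability.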

Now, if we're not in an online learning process but instead there is a separation between a "training phase" and a "deployment phase where the AI continues to learn with probability $\epsilon$", then the setup is just "have a model that learns to do the useful stuff in sandbox, and then have the weights (almost) fixed in deployment".

In short, I think the CDT setup without machine learning assumes the problem already solved, that online learning won't work and is unsafe, which leaves us with a "training then deployment" setup that isn't really original.

comment by Wei_Dai · 2019-07-02T17:32:32.253Z · score: 3 (1 votes) · LW · GW

Even worse, the (question, prediction, r=0) datapoints in the history could get so misleading that the oracle never learns anything.

Why do we have to give the oracle a zero reward for the non-erasure episodes? Why not just skip the learning/update step for those episodes?

In short, I think the CDT setup without machine learning assumes the problem already solved, that online learning won’t work and is unsafe, which leaves us with a “training then deployment” setup that isn’t really original.

Why not do a combination of pre-training and online learning, where you do enough during the training phase to get a useful predictor, and then use online learning to deal with subsequent distributional shifts?

Although I guess that probably isn't really original either. What seems original is that during any episode where learning will take place, don't let humans (or any other system that might be insecure against the oracle) see the oracle's output until the episode is over.

comment by Michaël Trazzi (mtrazzi) · 2019-07-04T16:22:18.203Z · score: 1 (1 votes) · LW · GW

The zero reward is in the paper. I agree that skipping would solve the problem. From talking to Stuart, my impression is that he thinks that $r = 0$ would be equivalent to skipping for specifying "no learning", or would just slow down learning. My disagreement is that I think it can confuse learning to the point of not learning the right thing.

Why not do a combination of pre-training and online learning, where you do enough during the training phase to get a useful predictor, and then use online learning to deal with subsequent distributional shifts?

Yes, that should work. My quote saying that online learning "won't work and is unsafe" is imprecise. I should have said "if $\epsilon$ is small enough to be comparable to the probability of shooting an escape message at random, then it is not safe. Also, if we continue sending the wrong $r = 0$ instead of skipping, then it might not learn the correct thing if $\epsilon$ is not big enough".
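One way the logged zeros could mislead a learner is sketched below. The setup is an assumption made for illustration: a deliberately naive learner that averages rewards per prediction, and rewards of the form "negative error", so that 0 is the best possible score. `Record`, `mean_reward` and `HISTORY` are hypothetical names. A learner that conditions on the erasure flag would not be fooled in this way, which is presumably closer to the view attributed to Stuart above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, float, bool]  # (prediction label, reward, erased?)

# Toy history: "accurate" was scored once by the automatic machine,
# "inaccurate" was only ever read by humans and therefore logged with r = 0.
HISTORY: List[Record] = [
    ("accurate", -1.0, True),
    ("inaccurate", 0.0, False),
    ("inaccurate", 0.0, False),
]


def mean_reward(history: List[Record], skip_non_erased: bool) -> Dict[str, float]:
    """Average reward per prediction, optionally ignoring non-erased episodes."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for label, reward, erased in history:
        if skip_non_erased and not erased:
            continue  # "skip the update" variant: no datapoint is recorded
        totals[label] += reward
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in counts}
```

With the zeros included, `mean_reward(HISTORY, skip_non_erased=False)` rates the never-scored prediction at 0, above the scored one at -1. With skipping, the unscored prediction simply has no estimate yet.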

Although I guess that probably isn't really original either. What seems original is that during any episode where learning will take place, don't let humans (or any other system that might be insecure against the oracle) see the oracle's output until the episode is over.

That's exactly it!

comment by RyanCarey · 2019-06-12T16:02:42.940Z · score: 1 (2 votes) · LW · GW

Therefore, if epsilon is small enough (comparable to the probability of hitting an escape message at random), then the learning gets extremely slow and the oracle might shoot at the escape action at random.

The escape action being randomly called should not be a problem if it is a text string that is only read if r=1, and is ineffectual otherwise...

comment by Michaël Trazzi (mtrazzi) · 2019-06-12T17:54:57.854Z · score: 3 (2 votes) · LW · GW

The string is read with probability $1 - \epsilon$.

comment by Wei_Dai · 2019-08-20T19:06:22.721Z · score: 12 (3 votes) · LW · GW

More generally, the whole setup is just 1) train your model in a sandbox environment 2) have the weights of your model fixed during deployment.

I think this is missing a crucial point of Counterfactual Oracles (as implemented in ML). I came upon Paul Christiano's Counterfactual oversight vs. training data today which explains this very well. (What he calls Counterfactual Oversight is basically Counterfactual Oracles as applied to predicting/imitating humans.) The problem with the standard supervised learning system that you're describing is:

The problem is not stationary, and over time the training data becomes less relevant. There are spurious correlations in the training data that don’t generalize to the test data.

And Counterfactual Oracles (in ML terms) is actually an attempt to solve these problems:

Counterfactual oversight consists of labelling a random subset of data and using it as online training data. The key difference is that any given data point may become a training data point, with the decision made after the learning system has made a decision about it. As long as the randomization is unpredictable to the learner, this gives us a formal guarantee that there can’t be any noticeable difference between the training and test data. And therefore if our learner behaves well on training data, it really must behave well on test data.
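The key ordering in the quoted passage (predict first, decide whether to label afterwards) can be sketched as follows. This is a minimal illustration with hypothetical names (`counterfactual_oversight`, `label_prob`); it leaves out the actual model update and assumes the learner cannot predict the random draw.

```python
import random
from typing import Callable, List, Sequence, Tuple


def counterfactual_oversight(
    inputs: Sequence[float],
    predict: Callable[[float], float],
    label: Callable[[float], float],
    label_prob: float,
    rng: random.Random,
) -> Tuple[List[float], List[Tuple[float, float]]]:
    """Return all predictions plus the randomly chosen labelled subset."""
    predictions: List[float] = []
    training_data: List[Tuple[float, float]] = []
    for x in inputs:
        # The learner commits to its output before anything else happens.
        predictions.append(predict(x))
        # Only afterwards is the coin flipped, so the learner cannot behave
        # differently on points that end up as training data.
        if rng.random() < label_prob:
            training_data.append((x, label(x)))
    return predictions, training_data
```

Since every input has the same chance of being labelled, the labelled subset is a random sample of exactly the data the system is deployed on, which is what the quoted guarantee relies on.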

comment by RyanCarey · 2019-06-11T22:28:20.888Z · score: 4 (3 votes) · LW · GW

The biggest assumption is that the AI only cares about the reward it gets for the current timestep. A sufficiently intelligent agent would understand that after being shut down, an (almost) identical version of itself will probably be facing a similar question. Therefore, if it wants future versions of itself to succeed at (almost) exactly the same task, it should still want to escape.

I don't see why this has to be true, given that we get to choose the AI's value function. Why can't we just make the agent act-based?

My main concern about the counterfactual oracle is that it doesn't prevent the AI from sending fatal escape messages. Indeed, it spends most of its time in exploratory mode at the beginning (as it is only rewarded with probability $\epsilon$) and might stumble upon an escape message/action then.

If the agent is model-based, then you should be able to gather a dataset of (prediction, world_state, accuracy) tuples with random actions (as random actions will practically never make the bad prediction) and random decision of whether to read the response. And then just ask the agent to maximize the natural direct effect of its prediction, treating world_state as the mediator, and a null prediction as the default action. (this equates to asking what the world would be like if a null action was outputted - I'll release my current work on direct effects in AI safety soon, and feel free to ask for it in the meantime). I don't see how this has this particular bad consequence (actually making the bad self-confirming prediction) in either training or deployment...

The rest of the design (providing rewards of 0, shutting it down, etc.) appears to be over-engineering.

In particular, shutting down the system is just a way of saying "only maximize reward in the current timestep", i.e. be an act-based agent. This can just be incorporated into the reward function.

Indeed, when reading the predictions of the counterfactual oracle we're not in the counterfactual world (=training distribution) anymore, so the predictions can get arbitrarily wrong (depending on how much the predictions are manipulative and how many people peek at it).

The hope is that since the agent is not trying to find self-confirming prophecies, then hopefully the accidental effects of self-confirmation are sufficiently small...

comment by Michaël Trazzi (mtrazzi) · 2019-06-12T15:13:53.977Z · score: 2 (3 votes) · LW · GW

Yes, if we choose the utility function to make it a CDT agent optimizing for the reward for one step (so a particular case of act-based), then it won't care about future versions of itself nor want to escape.

I agree with the intuition of shutting down to make it episodic, but I am still confused about the causal relationship between "having the rule to shut down the system" and "having a current timestep maximizer". For it to really be a "current timestep maximizer" it needs to be in some kind of reward/utility function. Because everything is reset at each timestep, there is no information pointing at "I might get shut down at the next timestep".

As for collecting a dataset and then optimizing for some natural direct effect, I am not familiar enough with Pearl's work to tell if that would work, but I made some related comments about why there might be some problems in online-learning/"training then testing" here [LW · GW].

comment by Alexei · 2019-06-11T18:36:40.599Z · score: 2 (1 votes) · LW · GW

This makes me wonder if you could get a safe and extremely useful oracle if you only allow it to output a few bits (eg buy/sell some specific stock).

comment by Michaël Trazzi (mtrazzi) · 2019-06-11T19:43:46.347Z · score: 12 (4 votes) · LW · GW

Yes, they call it a low-bandwidth oracle.

comment by Gurkenglas · 2019-06-11T19:58:25.273Z · score: 2 (2 votes) · LW · GW

If the AI is omniscient, it brings about whichever of the two timelines it likes better. In the worst case, this doubles the chance that, say, an AI aligned with the boxed one arises.

comment by Pattern · 2019-06-11T19:49:12.798Z · score: 1 (1 votes) · LW · GW

My main concern about the counterfactual oracle is that it doesn't prevent the AI from sending fatal escape messages. Indeed, it spends most of its time in exploratory mode at the beginning (as it is only rewarded with probability $\epsilon$) and might stumble upon an escape message/action then. Even if it is shut down after answering, the humans will still read the really convincing escape message and release the AI.

The escape message could also include the source code of it/a successor/an assistant*.

*Whereas a successor is a variant of the original, an assistant has a narrower task such as "Secure my release from the box" or "Advise Tesla so that their stock price will go up/down" or "try to manipulate the stock market".