Exploiting EDT

post by Benya_Fallenstein (Benja_Fallenstein) · 2014-11-10T19:59:51.000Z · LW · GW · 10 comments

The problem with EDT is, as David Lewis put it, its "irrational policy of managing the news" (Lewis, 1981): it chooses actions not only because of their effects on the world, but also because of what the fact that it's taking these actions tells it about events the agent can't affect at all. The canonical example is the smoking lesion problem.

I've long been uncomfortable with the smoking lesion problem as the case against EDT, because an AI system would know its own utility function, and would therefore know whether or not it values "smoking" (presumably in the AI case it would be a different goal); if it updates on this fact, it behaves correctly in the smoking lesion problem. (This is an AI-centric version of the "tickle defense" of EDT.) Nate and I have come up with a variant I find much more convincing: a way to get EDT agents to pay you for managing the news for them, which works by the same mechanism that makes these agents one-box in Newcomb's problem. (It's a variation of the thought experiment in my LessWrong post on "the sin of updating when you can change whether you exist".)


Suppose that there's this EDT agent around which plays the stock market. It's pretty good at doing so, and has amassed a substantial net worth, but, unsurprisingly, it's not perfect; there's always a small chance of its investments going south. Currently, for example, there's a bit of a hubbub around the CEO of one of the companies the agent has been investing in, and the agent assigns a 0.4% chance that there's a scandal about to be revealed which will force that CEO to resign; if that's the case, it expects to lose $150,000,000.

Along comes a clever AI researcher, who is known to be extremely good at predicting how different kinds of AI systems will react in simple thought experiments, who has access to the source code of our agent, and who happens to be able to access information about whether or not there's a scandal. The researcher could find out and sell the information to the agent, but since it's a low-probability scenario, the value of that information wouldn't be that large. Instead, the researcher has a better idea (sketched in code after the list below): they are going to

  1. find out whether or not there is a scandal;
  2. figure out whether or not the agent will pay them $100,000,000 if they ask it to in the next step;
  3. if either (a) there is no scandal and the researcher predicts that the agent will pay up, or (b) there is a scandal and they predict that the agent won't pay up, then they will send the agent a pre-drafted e-mail that explains this whole procedure, explains that either case (a) or (b) turned out to be true (but not which one), and asks the agent to immediately pay the researcher $100,000,000.
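
In code, the researcher's e-mail rule might look like the following minimal sketch (a Python rendering for concreteness; the names are only illustrative, not something the researcher actually runs):

```python
def researcher_sends_email(scandal: bool, predicted_to_pay: bool) -> bool:
    """Step 3 of the researcher's procedure, given the findings of steps 1 and 2.

    The e-mail is sent exactly when either (a) there is no scandal and the
    agent is predicted to pay up, or (b) there is a scandal and the agent
    is predicted not to pay up.
    """
    case_a = (not scandal) and predicted_to_pay
    case_b = scandal and (not predicted_to_pay)
    return case_a or case_b
```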

We assume that the agent has sufficient information about the researcher to know that they are trustworthy and a good predictor of the agent's behavior. Then, as an EDT agent, it'll calculate expected utility by conditioning on its observations and on the action it's considering taking, as follows:

  1. Conditional on receiving the e-mail and paying up, case (a) must hold: there is no scandal, and the agent's total loss is the $100,000,000 payment.
  2. Conditional on receiving the e-mail and refusing to pay, case (b) must hold: there is a scandal, and the agent expects to lose $150,000,000.

The first of these leads to higher conditional expected utility, so the AI pays up. The researcher decides to use the money to fund investigations into better decision theories.
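
Continuing the sketch above, the comparison can be written out as a small calculation. This is only illustrative: it reuses `researcher_sends_email` from the previous snippet and treats the researcher as a perfect predictor, so conditioning on an action also fixes the prediction.

```python
PRIOR_SCANDAL = 0.004          # the agent's 0.4% credence in a scandal
LOSS_IF_SCANDAL = 150_000_000  # expected loss if the CEO is forced to resign
PAYMENT = 100_000_000          # what the researcher asks for

def conditional_expected_utility(pays: bool) -> float:
    """E[U | e-mail received, agent pays or not], with a perfect predictor."""
    total_prob, total_utility = 0.0, 0.0
    for scandal in (True, False):
        p = PRIOR_SCANDAL if scandal else 1 - PRIOR_SCANDAL
        if not researcher_sends_email(scandal, predicted_to_pay=pays):
            continue  # conditioning on the e-mail rules these worlds out
        utility = -(LOSS_IF_SCANDAL if scandal else 0) - (PAYMENT if pays else 0)
        total_prob += p
        total_utility += p * utility
    return total_utility / total_prob

print(conditional_expected_utility(True))   # pay: loss of $100,000,000 (no scandal)
print(conditional_expected_utility(False))  # refuse: loss of $150,000,000 (scandal)
```

The worlds in which no e-mail arrives are conditioned away entirely; this is how the researcher's sending rule ends up "managing the news" from the agent's perspective.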

10 comments

Comments sorted by top scores.

comment by danieldewey · 2014-11-10T16:52:41.000Z · LW(p) · GW(p)

Nice example! I think I understood better why this picks out the particular weakness of EDT (and why it's not a general exploit that can be used against any DT) when I thought of it less as a money-pump and more as "Not only does EDT want to manage the news, you can get it to pay you a lot for the privilege".

Replies from: abramdemski, Benja_Fallenstein
comment by abramdemski · 2014-11-12T01:03:08.000Z · LW(p) · GW(p)

There is a nuance that needs to be mentioned here. If the EDT agent is aware of the researcher's ploys ahead of time, it will set things up so that emails from the researcher go straight to the spam folder, block the researcher's calls, and so on. It is not actually happy to pay the researcher for managing the news!

This is less pathological than listening to the researcher and paying up, but it's still an odd news-management strategy that results from EDT.

Replies from: Benja_Fallenstein
comment by Benya_Fallenstein (Benja_Fallenstein) · 2014-11-12T01:18:42.000Z · LW(p) · GW(p)

True. This looks to me like an effect of EDT not being stable under self-modification, although here the issue is handicapping itself through external means rather than self-modification---like, if you offer a CDT agent a potion that will make it unable to lift more than one box before it enters Newcomb's problem (i.e., before Omega makes its observation of the agent), then it'll cheerfully take it and pay you for the privilege.

comment by Benya_Fallenstein (Benja_Fallenstein) · 2014-11-10T20:00:13.000Z · LW(p) · GW(p)

Thanks! I didn't really think at all about whether or not "money-pump" was the appropriate word (I'm not sure what the exact definition is); have now changed "way to money-pump EDT agents" into "way to get EDT agents to pay you for managing the news for them".

Replies from: danieldewey
comment by danieldewey · 2014-11-10T22:00:21.000Z · LW(p) · GW(p)

Hm, I don't know what the definition is either. In my head, it means "can get an arbitrary amount of money from", e.g. by taking it around a preference loop as many times as you like. In any case, glad the feedback was helpful.

comment by abramdemski · 2014-11-12T00:55:43.000Z · LW(p) · GW(p)

I find this surprising, and quite interesting.

Here's what I'm getting when I try to translate the Tickle Defense:

"If this argument works, the AI should be able to recognize that, and predict the AI researcher's prediction. It knows that it is already the type of agent that will say yes, effectively screening-off its action from the AI researcher's prediction. When it conditions on refusing to pay, it still predicts that the AI researcher thought it would pay up, and expects the fiasco with the same probability as ever. Therefore, it refuses to pay. By way of contradiction, we conclude that the original argument doesn't work."

This is implausible, since it seems quite likely that conditioning on its "don't pay up" action causes the AI to consider a universe in which this whole argument doesn't work (and the AI researcher sent it a letter knowing that it wouldn't pay, following (b) in the strategy). However, it does highlight the importance of how the EDT agent is computing impossible possible worlds.

Replies from: abramdemski
comment by abramdemski · 2014-11-12T01:16:53.000Z · LW(p) · GW(p)

More technically, we might assume that the AI is using a good finite-time approximation of one of the logical priors that have been explored, conditioned on the description of the scenario. We include a logical description of its own source code and physical computer [making the agent unable to consider disruptions to its machine, but this isn't important]. The agent makes decisions by the ambient chicken rule: if the agent can prove what action it will take, it does something different from that. Otherwise, it takes the action with the highest expected utility (according to the Bayesian conditional).
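
A schematic rendering of that decision procedure, in Python (the proof search and the expected-utility estimate are stand-ins for the bounded reasoning described above, not a real implementation):

```python
def chicken_rule_decision(actions, can_prove_i_take, expected_utility):
    """Ambient chicken rule: if the agent can prove it takes some action,
    it does something else; otherwise it maximizes conditional expected utility."""
    for a in actions:
        if can_prove_i_take(a):
            # Chicken clause tripped: deviate from the provably-taken action.
            return next(b for b in actions if b != a)
    return max(actions, key=expected_utility)

# Stub example: nothing about the agent's choice is provable, so it just
# maximizes conditional expected utility and pays up.
print(chicken_rule_decision(
    ["pay", "refuse"],
    can_prove_i_take=lambda a: False,
    expected_utility={"pay": -1e8, "refuse": -1.5e8}.get,
))  # -> pay
```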

Then, the agent cannot predict that it will give the researcher money, because it doesn't know whether it will trip its chicken clause. However, it knows that the researcher will make a correct prediction. So, it seems that it will pay up.

The tickle defense fails as a result of the chicken rule.

Replies from: Imported-IAFF-User-310
comment by IAFF-User-310 (Imported-IAFF-User-310) · 2018-08-29T11:37:32.000Z · LW(p) · GW(p)

But isn't triggering a chicken clause impossible without a contradiction?

comment by orthonormal · 2014-11-10T00:21:10.000Z · LW(p) · GW(p)

I like this! You could also post it to Less Wrong without any modifications.

comment by danieldewey · 2014-11-10T22:06:47.000Z · LW(p) · GW(p)

I wonder if this example can be used to help pin down desiderata for decisions or decision counterfactuals. What axiom(s) for decisions would avoid this general class of exploits?