Disentangling four motivations for acting in accordance with UDT

post by Julian Stastny · 2023-11-05T21:26:22.514Z · LW · GW · 3 comments

Contents

  Introduction
  Four justifications for UDT-like policies
    Updatelessness as a meta-commitment
    Updatelessness as having intrinsic preferences about counterfactual worlds
    UDT-like behavior as a consequence of anthropic uncertainty
    UDT as a result of rejecting everything else
  Takeaways
  Acknowledgements
3 comments

Introduction

In this post, I examine a number of different arguments for acting like an updateless agent. Concretely, I try to answer the following question: what could move someone who wasn’t born a perfectly coherent UDT [LW · GW] (updateless decision theory) agent to adopt a UDT-like policy?

I take this perspective because I was not born as an idealized UDT agent, and I want the extent to which I’m willing to follow something like UDT (or endorse instructing an AI to do it) to be justified on non-tautological grounds. After all, doing ‘the updateless as opposed to the updateful thing’ by definition means doing something that, to a first approximation, is bad by one’s current lights.

To gain clarity, I consider four ways of climbing the mountain of updatelessness. At the base, I assume that we start with non-indexical preferences and some pre-existing decision-theoretical intuitions, for instance ‘I want to cause good outcomes’ for those who are inclined towards causal decision theory (CDT). The first three paths begin the journey from there: if we treat our current decision-theoretical intuitions as bedrock, will this lead us to act in UDT-like ways? The fourth approach I consider is to drop our previous decision-theoretical intuitions, attempting to jump up to UDT directly.

A caveat: none of the motivations I present here will receive the in-depth discussion they deserve. The role of this post is rather that of an opinionated guidebook: to make salient the fact that there exist different paths, the difficulties one is faced with by taking them, and that they end up at relevantly different UDT-like endpoints. That isn’t to say that the guide is complete: I’m still a wanderer myself, occasionally lost, and looking to diverge from the beaten paths.

Four justifications for UDT-like policies

To set the stage, here are some deliberately vague definitions:

So why would we want to adopt a UDT-like policy? Let us look at four possible reasons. 

Updatelessness as a meta-commitment

When thinking about a multitude of possible decision problems, we notice that in the future we might act differently from how we now wish we would act. For example, we know that committing to paying up in counterfactual mugging [LW · GW] now is cause/evidence for a higher expected payoff in case we encounter this situation in the future. Because we don’t know all the decision problems in advance, we can implement a meta-commitment [LW · GW] to always act as we would have liked to commit to acting, had we known about the decision problem at the time of making the meta-commitment: given a pre-existing updateful decision theory X, like evidential decision theory (EDT), we say that we self-modified into Son of X [LW · GW]. Another way to frame this reasoning is to say that an updateful decision theory does action selection in isolated decision situations, and Son of X represents the insight that by the lights of X, it is optimal to use the earliest opportunity to perform the action ‘commit to an optimal policy according to my current beliefs’.
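To make the gap between committing in advance and choosing on the spot concrete, here is a minimal sketch of counterfactual mugging. The numbers (a fair coin, a $100 payment, a $10,000 reward) and all function names are illustrative assumptions rather than anything fixed by the argument above, and the sketch assumes Omega predicts the policy perfectly.

```python
from fractions import Fraction

# Illustrative assumptions: a fair coin, a $100 payment when asked, and a
# $10,000 reward in the other branch if Omega predicts that we would pay.
P_ASKED = Fraction(1, 2)
COST = 100
REWARD = 10_000


def ex_ante_value(pays_when_asked: bool) -> Fraction:
    """Expected payoff of a policy, evaluated before the coin is flipped."""
    asked_branch = -COST if pays_when_asked else 0
    reward_branch = REWARD if pays_when_asked else 0
    return P_ASKED * asked_branch + (1 - P_ASKED) * reward_branch


def ex_post_value(pays_when_asked: bool) -> int:
    """Payoff as seen by an updateful agent that already knows it is being asked."""
    return -COST if pays_when_asked else 0


def best_policy(value_fn) -> bool:
    """Return True if 'pay when asked' maximizes the given value function."""
    return max([True, False], key=value_fn)


print(best_policy(ex_ante_value))  # the policy favored before the coin flip
print(best_policy(ex_post_value))  # the policy favored after updating on being asked
```

Under these assumptions the two evaluations disagree: paying is favored before the coin flip and refusing is favored afterwards. That disagreement is what a meta-commitment made at the earlier time is meant to resolve.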

Precisely which commitments are implied by your becoming Son of X depends on which commitments you would have endorsed by the lights of X at the time of self-modification. This gives rise to some subtleties:

Updatelessness as having intrinsic preferences about counterfactual worlds

When faced with the decision to pay up in counterfactual mugging, you notice that you would receive a large payoff in a counterfactual world if and only if you pay up in the world you observe. Even though you are not observing that counterfactual world, you might still have intrinsic preferences [LW · GW] about it. This could be because you think that the other world actually exists, such that what you do in the world you are observing has (logical or evidential) implications for what happens there.[4] But you could have preferences about counterfactuals for other reasons.

Again, we can observe some subtleties about this view:

UDT-like behavior as a consequence of anthropic uncertainty

If you find yourself in a counterfactual mugging, you might think “maybe Omega is simulating me in order to make its prediction”.[6] (Note that this subtly violates the initial assumption of an honest Omega in counterfactual mugging.) So paying up is cause/evidence for getting a large payoff in reality. This sounds simple and elegant [AF · GW], but is actually pretty hairy for many reasons. To give just three considerations:

Assuming this works, however, here are two observations related to the ones we made for other motivations:

UDT as a result of rejecting everything else

You reject CDT for two-boxing in Newcomb’s problem, then reject EDT for not smoking in Smoking Lesion, then embrace EDT again because the tickle defense takes care of Smoking Lesion, then (maybe) reject EDT again for paying in XOR blackmail [AF · GW]. TDT[7] seems reasonable (modulo some questions around similar-but-not-identical algorithms [LW · GW]), and becoming Son of X does too, but then you realize that all of the previously mentioned decision theories wouldn’t pay in a logical counterfactual mugging if they’re ‘born into’ it [LW · GW]. However, whether paying in such a counterfactual mugging means winning or losing is a matter of perspective: a natural one is to say “well, I already know that the world in which Omega just gives me a million bucks is impossible, so the way to win is not to pay”. Having priors that prima facie give nonzero weight to impossible hypotheses, in my opinion, requires a normative or epistemic justification from somewhere, such as that maybe I care about logically impossible counterfactual worlds – but why? – or maybe I believe that I’m in a simulation, or that there exists a world where Omega has bet on the opposite parity of the relevant digit of π. 
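One way to see how much hinges on the prior is a small worked inequality. As a sketch under assumed notation (none of these symbols appear in the original argument): let $w$ be the weight the prior gives to the world in which Omega pays out, $r$ the reward there, and $c$ the cost of paying in the world where Omega asks.

```latex
\begin{align*}
V(\text{pay})    &= w\,r - (1 - w)\,c, \\
V(\text{refuse}) &= 0, \\
V(\text{pay}) > V(\text{refuse}) &\iff w > \frac{c}{r + c}.
\end{align*}
```

An agent that has already updated on the relevant digit sets $w = 0$, so $V(\text{pay}) = -c < 0$ and it refuses. Paying is favored only if something justifies keeping $w$ above the threshold $c/(r+c)$, which is where the normative or epistemic justification has to come from.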

A more compelling reason [LW · GW] for rejecting updateful decision theories is their reliance on anthropics. Anthropic reasoning seems dubious[8] and enables various Dutch books, and perhaps other self-defeating maneuvers [LW · GW]. Addressing this cluster of issues is worth a post of its own, so I will not go into more detail here – for now I note that in my opinion, most of the argumentative force for why one would want to follow UDT comes from here.

Nevertheless, subscribing to UDT is not as straightforward as we would hope. There remain issues around how to formalize UDT in practice [LW · GW], how to select priors, and the question of whether (and to what extent) one should be logically updateless.

Takeaways

To summarize: there are different possible considerations that could pull someone towards considering UDT-like behavior, but different considerations yield qualitatively different flavors of it. In particular, this affects attitudes to logical updatelessness, or what one should do when one is ‘born into’ a decision problem without a previous opportunity to self-modify into Son of X.

Insofar as one thinks of UDT as an ideal, this might be unsatisfactory: for someone who has been ‘poisoned’ by previous decision-theoretic intuitions, it’s hard to arrive at what appears to be the very top of the mountain. Even with all of the first three approaches combined, it is likely that this doesn’t, for instance, lead to logical updatelessness with respect to facts one starts out knowing. But, to carry this analogy a bit too far, jumping straight to the very top of the mountain might deprive us of oxygen: UDT has a bunch of free parameters, we don’t know how to set them, and one way to ground UDT-like behavior is by asking why, and to what extent, we would endorse it by our current lights.

The bottom line:

Acknowledgements

Thanks to Caspar Oesterheld, Jesse Clifton, Sylvester Kollin, Anthony DiGiovanni, and Martín Soto for helpful comments and discussion.

  1. ^

    Functional Decision Theory (FDT) is often considered a version of UDT, but Wei Dai doesn’t endorse this as far as I can tell [LW(p) · GW(p)]. For more context and an overview of different versions of UDT, I recommend this post [LW · GW]. This exchange [LW · GW] between people at MIRI and OpenPhil also provides helpful context and framing.

  2. ^

    An alternative attempt to gesture at this without reference to UDT or updatelessness: My decision making is UDT-like if I sometimes do something only because it would be endorsed by a version of myself in a different location in spacetime (of an actual or hypothetical universe).

  3. ^

    As the post linked in this sentence explains, this might be problematic if done carelessly. We might want to consider open-minded updatelessness [AF · GW].

  4. ^

    One could argue about whether this really counts as ‘caring about counterfactual worlds’.

  5. ^

    Though one way to motivate being UDT-like in this particular case is that maybe there exist worlds where Omega performed the counterfactual mugging on us by betting that the trillionth digit of pi would be odd, and that our choice is logically entangled with what happens in that world. But in practice, many problems involving logical counterfactuals could involve logical facts that are chosen in some deterministic (but previously unknown to us) way.

  6. ^

    The post [LW · GW] by Paul Christiano that I linked in the above paragraph could also appropriately be linked under this one, since it seems to draw intuition from both motivations.

  7. ^
  8. ^

    Different anthropic theories partially rely on metaphysical intuitions/stories about how centered worlds or observer moments are 'sampled', and have counterintuitive implications (e.g., the Doomsday argument for SSA and the Presumptuous philosopher for SIA).

3 comments

Comments sorted by top scores.

comment by Vladimir_Nesov · 2023-11-06T10:43:40.980Z · LW(p) · GW(p)

A frame that helps with logical updatelessness is acausal trade. A UDT agent in a particular situation coordinates with its other instances in other situations by adopting a shared policy that doesn't depend on details of the current situation, and then carrying out what the policy prescribes for the current situation. The same generalizes to coordination with other agents, or very unusual variants of the same agent, such as those that have different states of logical uncertainty, or different priors, or different preferences.

With acausal trade, we shouldn't insist on a single global policy of mysterious origin, but consider how hypothetical agents negotiate many contracts of smaller scopes with each other. These contracts nudge the agents that subscribe to them and not others, with every agent weaving its actions out of relevant contracts, and negotiating contracts with relevant other hypothetical agents.

Thus an agent O knowing that 100th digit of pi is odd might hold a tripartite negotiation with both a hypothetical agent E that knows that 100th digit of pi is even, and an agent U that doesn't know parity of this digit. We might say that (a sane) E doesn't exist, so doesn't merit consideration, but it does merit consideration for U who doesn't know, and U does merit consideration for O. This gives a situation where O cares about E's position in negotiating a coordination policy/contract. The meaning of E's position for O is mediated by U's understanding of E, which is not much different from U's understanding of O.

comment by Ben (ben-lang) · 2023-11-07T11:47:14.279Z · LW(p) · GW(p)

This is an interesting post. I think it could be improved by defining some of the acronyms as they come up, (UDT, CDT, EDT, all the decision theories basically.) I think there is an audience on Lesswrong who have read about decision theory before, but not enough to instinctively remember the acronyms. There are plenty of useful links, but I think that a quick reminder is still useful.

comment by Anthony DiGiovanni (antimonyanthony) · 2023-11-16T14:43:45.867Z · LW(p) · GW(p)

I enjoyed this post and think it should help reduce confusion in many future discussions, thanks!

Some comments on your remarks about anthropics:

Different anthropic theories partially rely on metaphysical intuitions/stories about how centered worlds or observer moments are 'sampled', and have counterintuitive implications (e.g., the Doomsday argument for SSA and the Presumptuous philosopher for SIA).

I'm not sure why this is an indictment of "anthropic reasoning" per se, as if that's escapable. It seems like all anthropic theories are trying to answer a question that one needs to answer when forming credences, i.e., how do we form likelihoods P(I observe I exist | world W)? (Which we want in order to compute P(world W | I observe I exist).)

Indeed just failing to anthropically update at all has counterintuitive implications, like the verdict of minimal-reference-class SSA in Joe C's "God's coin toss with equal numbers." [LW · GW] [no longer endorsed] And mrcSSA relies on the metaphysical intuition that oneself was necessarily going to observe X, i.e., P(I observe I exist | world W) = P(I observe I exist | not-W) = 1 (which is quite implausible IMO). [I think endorsed, but I feel confused:] And mrcSSA relies on the metaphysical intuition that, given that someone observes X, oneself was necessarily going to observe X, which is quite implausible IMO.