Disentangling four motivations for acting in accordance with UDT
post by Julian Stastny · 2023-11-05T21:26:22.514Z · LW · GW
Introduction
In this post, I examine a number of different arguments for acting like an updateless agent. Concretely, I try to provide answers to the following question: What could move someone who isn’t already born a perfectly coherent UDT [LW · GW] agent to adopt a UDT-like policy?
I take this perspective because I was not born as an idealized UDT agent, and I want the extent to which I’m willing to follow something like UDT (or endorse instructing an AI to do it) to be justified on non-tautological grounds. After all, doing ‘the updateless as opposed to the updateful thing’ by definition means doing something that, to first approximation, is bad by one’s current lights.
To gain clarity, I consider four ways of climbing the mountain of updatelessness. At the base, I assume that we start with non-indexical preferences and some pre-existing decision-theoretical intuitions, for instance ‘I want to cause good outcomes’ for those who are inclined towards CDT. The first three paths begin the journey from there: if we treat our current decision-theoretical intuitions as bedrock, will this lead us to act in UDT-like ways? The fourth approach I consider is to drop our previous decision-theoretical intuitions, attempting to jump up to UDT directly.
A caveat: none of the motivations I present here will receive the in-depth discussion they deserve. The role of this post is rather that of an opinionated guidebook: to make salient the fact that there exist different paths, the difficulties one is faced with by taking them, and that they end up at relevantly different UDT-like endpoints. That isn’t to say that the guide is complete: I’m still a wanderer myself, occasionally lost, and looking to diverge from the beaten paths.
Four justifications for UDT-like policies
To set the stage, here are some deliberately vague definitions:
- Updatelessness is the maxim of choosing the action you would have wanted to commit to ex-ante, i.e. before you learned everything you know now. What exactly this means depends on various factors, such as what you consider your ex-ante perspective to be, or what sorts of things you would/should have wanted to allow yourself to learn and act upon.
- UDT [? · GW] is a particular decision theory that was designed to fulfill the desideratum of updatelessness. Different versions of UDT have been proposed.[1]
- A policy is a mapping from observations to actions. I call a policy UDT-like (in some decision problem) if it chooses the same actions as some updateless agent’s policy, though this doesn’t necessarily require that one is actually updateless.[2] (A toy numerical sketch of these definitions follows this list.)
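To make these definitions a bit more concrete, here is a minimal Python sketch of a counterfactual-mugging-style setup (my own illustration; the payoffs, probabilities, and names are assumptions rather than part of any canonical statement of the problem). A policy is just a mapping from observations to actions; the ex-ante perspective ranks whole policies, while the updateful perspective ranks actions after the observation has come in.

```python
# Minimal sketch: a policy as an observation -> action mapping, and the
# ex-ante ranking of policies in a counterfactual-mugging-style problem.
# All payoffs and probabilities are illustrative assumptions.

# A policy maps each possible observation to an action.
pay_policy = {"asked_to_pay": "pay", "rewarded_branch": "accept"}
refuse_policy = {"asked_to_pay": "refuse", "rewarded_branch": "accept"}

def ex_ante_expected_utility(policy, p_asked=0.5, cost=100, reward=10_000):
    """Expected utility evaluated before the coin flip, assuming Omega rewards
    (in the other branch) exactly those agents whose policy pays when asked."""
    payoff_if_asked = -cost if policy["asked_to_pay"] == "pay" else 0
    payoff_if_rewarded = reward if policy["asked_to_pay"] == "pay" else 0
    return p_asked * payoff_if_asked + (1 - p_asked) * payoff_if_rewarded

def updateful_utility_after_being_asked(action, cost=100):
    """Utility by the lights of an agent who has already observed 'asked_to_pay'
    and gives no weight to the (now counterfactual) reward branch."""
    return -cost if action == "pay" else 0

if __name__ == "__main__":
    for name, policy in [("pay", pay_policy), ("refuse", refuse_policy)]:
        print(f"ex-ante expected utility of the {name!r} policy: "
              f"{ex_ante_expected_utility(policy):8.1f}")
    for action in ["pay", "refuse"]:
        print(f"post-observation utility of {action!r}: "
              f"{updateful_utility_after_being_asked(action):6.1f}")
    # Ex ante, the paying policy wins (0.5 * 10000 - 0.5 * 100 = 4950 > 0);
    # after observing 'asked_to_pay', paying looks strictly worse (-100 < 0).
    # A UDT-like policy still pays in that situation.
```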
So why would we want to adopt a UDT-like policy? Let us look at four possible reasons.
Updatelessness as a meta-commitment
When thinking about a multitude of possible decision problems, we notice that in the future we might act differently from how we now wish we would act. For example, we know that committing now to pay up in counterfactual mugging [LW · GW] is a cause of (or evidence for) a higher expected payoff in case we encounter this situation in the future. Because we don’t know all the decision problems in advance, we can implement a meta-commitment [LW · GW] to always act as we would have wanted to commit to acting, had we known about the decision problem at the time of making the meta-commitment: given a pre-existing updateful decision theory X, like EDT, we say that we self-modified into Son of X [LW · GW]. Another way to frame this reasoning is to say that an updateful decision theory does action selection in isolated decision situations, whereas Son of X represents the insight that, by the lights of X, it is optimal to use the earliest opportunity to perform the action ‘commit to an optimal policy according to my current beliefs’.
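As a minimal sketch of this meta-commitment move (the class and field names are hypothetical, chosen only for illustration): at self-modification time the agent freezes whatever the ex-ante optimal policy is by the lights of X over the problems it can foresee, and from then on it simply executes that frozen policy. Problems the agent was already ‘born into’ before self-modifying fall outside the commitment, anticipating the first subtlety listed below.

```python
# Sketch of 'Son of X': at self-modification time, freeze the policy that the
# updateful theory X endorses ex ante, then execute it from that point on.
# Class/field names are hypothetical; the setup is illustrative.

from dataclasses import dataclass, field

@dataclass
class DecisionProblem:
    name: str
    encountered_at: int      # time at which the agent is 'born into' the problem
    committed_action: str    # what the ex-ante optimal policy prescribes
    updateful_action: str    # what plain updateful X prescribes once inside it

@dataclass
class SonOfX:
    commitment_time: int
    frozen_policy: dict = field(default_factory=dict)

    def self_modify(self, anticipated_problems):
        """At commitment time, adopt the ex-ante optimal action for every
        problem the agent can currently foresee (by the lights of X)."""
        for problem in anticipated_problems:
            self.frozen_policy[problem.name] = problem.committed_action

    def act(self, problem: DecisionProblem) -> str:
        covered = (problem.name in self.frozen_policy
                   and problem.encountered_at >= self.commitment_time)
        if covered:
            return self.frozen_policy[problem.name]
        # Problems the agent was 'born into' before self-modifying fall back
        # to plain updateful X.
        return problem.updateful_action

if __name__ == "__main__":
    mugging_later = DecisionProblem("counterfactual mugging", encountered_at=5,
                                    committed_action="pay", updateful_action="refuse")
    mugging_before = DecisionProblem("counterfactual mugging (born into)", encountered_at=0,
                                     committed_action="pay", updateful_action="refuse")
    agent = SonOfX(commitment_time=1)
    agent.self_modify([mugging_later, mugging_before])
    print(agent.act(mugging_later))   # 'pay': covered by the commitment
    print(agent.act(mugging_before))  # 'refuse': predates the commitment
```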
Precisely which commitments follow from your becoming Son of X depends on which commitments you would have endorsed by the lights of X at the time of self-modification. This gives rise to some subtleties:
- If this is the reason why you want to implement UDT-like behavior, you don’t (yet) have a reason to pay up if you’re ‘born into’ the problem. For example, if Omega approached you with a counterfactual mugging before you self-modified into Son of X, becoming Son of X wouldn’t justify paying up.
- There’s a difference between Son of CDT and Son of EDT: suppose Omega approaches you with a counterfactual mugging after you went updateless, but tells you that it made its prediction before you did so. Then Son of CDT would not pay up, because going updateless can’t have been a cause of Omega’s prediction, whereas Son of EDT can still reason itself into paying up, because going updateless provided evidence about what Omega had already predicted. (A toy calculation illustrating this contrast follows the list.)
- Son of X can implement logical updatelessness [LW · GW] by considering what logical priors it had at the time of going updateless.[3]
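Here is the toy calculation referred to in the second bullet, contrasting Son of CDT and Son of EDT when Omega's prediction predates the self-modification. The simple timestamp test for causation and the `correlation` parameter are crude illustrative assumptions, not a full model of either theory.

```python
# Toy contrast for the bullet above: Omega's prediction was made before the
# agent self-modified into Son of X. The payoffs, the 50/50 branch probability,
# and the timestamp test for 'can still be a cause' are illustrative assumptions.

P_ASKED = 0.5      # probability of the branch in which Omega asks for money
COST = 100
REWARD = 10_000

def son_of_cdt_pays(prediction_time: int, commitment_time: int) -> bool:
    """By CDT lights at commitment time, committing to pay is only worthwhile if
    the commitment can still causally influence Omega's prediction."""
    commitment_can_cause_prediction = prediction_time >= commitment_time
    causal_gain = (1 - P_ASKED) * REWARD if commitment_can_cause_prediction else 0
    return causal_gain - P_ASKED * COST > 0

def son_of_edt_pays(correlation: float = 0.99) -> bool:
    """By EDT lights, committing to pay is evidence about Omega's prediction even
    if the prediction came first; `correlation` is the assumed probability that
    Omega predicted 'pay' given that the agent commits to paying."""
    evidential_gain = (1 - P_ASKED) * correlation * REWARD
    return evidential_gain - P_ASKED * COST > 0

if __name__ == "__main__":
    print(son_of_cdt_pays(prediction_time=0, commitment_time=1))  # False: prediction predates commitment
    print(son_of_cdt_pays(prediction_time=2, commitment_time=1))  # True: commitment can still be a cause
    print(son_of_edt_pays())                                      # True: evidence works either way
```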
Updatelessness as having intrinsic preferences about counterfactual worlds
When faced with the decision to pay up in counterfactual mugging, you notice that you would receive a large payoff in a counterfactual world if and only if you pay up in the world you observe. Even though you are not observing that counterfactual world, you might still have intrinsic preferences [LW · GW] about it. This could be because you think that the other world actually exists, such that what you do in the world you are observing has (logical or evidential) implications for what happens there.[4] But you could have preferences about counterfactuals for other reasons.
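A minimal sketch of this motivation: the single weight parameter below stands in for however much one intrinsically cares about (or believes in the existence of) the counterfactual branch, and is not a proposal for how such preferences should actually be formalized.

```python
# Toy sketch: an agent whose utility function places intrinsic weight on the
# counterfactual branch of a counterfactual mugging. The payoffs and the single
# weight parameter are illustrative assumptions.

COST = 100
REWARD = 10_000

def utility_of_paying(weight_on_counterfactual_branch: float) -> float:
    """Utility of paying, by the lights of an agent who has already observed the
    'asked to pay' branch but also cares, with the given weight, about the reward
    that its paying secures in the unobserved branch."""
    observed_branch = -COST
    counterfactual_branch = REWARD   # secured if and only if the agent pays here
    return observed_branch + weight_on_counterfactual_branch * counterfactual_branch

if __name__ == "__main__":
    for weight in [0.0, 0.005, 0.02, 0.5]:
        verdict = "pay" if utility_of_paying(weight) > 0 else "refuse"
        print(f"weight {weight:>5}: utility of paying = {utility_of_paying(weight):>8.1f} -> {verdict}")
    # With zero weight on the counterfactual branch the agent refuses; any weight
    # above COST / REWARD = 0.01 is enough to make paying look good.
```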
Again, we can observe some subtleties about this view:
- In a sense this is a way in which UDT-like behavior follows from EDT or TDT without self-modification, when combined with preferences about counterfactual worlds. On the flip side, note that this doesn’t work for orthodox CDT agents.
- In contrast to Son of X, this reasoning works even for agents ‘born into’ the problem, provided they have sufficient credence in the existence of a world where they were ‘born into’ the other branch of the problem.
- It seems hard to be logically updateless using this motivation, because it is pretty counterintuitive – and in my opinion nonsensical – to have intrinsic preferences about impossible counterfactuals. Once Omega approaches you with a counterfactual mugging on the trillionth digit of π being even, you know that there is no world in which the trillionth digit of π is odd, and thus no possible world in which Omega pays you.[5]
UDT-like behavior as a consequence of anthropic uncertainty
If you find yourself in a counterfactual mugging, you might think “maybe Omega is simulating me in order to make its prediction”.[6] (Note that this subtly violates the initial assumption of an honest Omega in counterfactual mugging.) So paying up is cause/evidence for getting a large payoff in reality. This sounds simple and elegant [AF · GW], but is actually pretty hairy for many reasons (a toy expected-value sketch follows the list below). To give just three considerations:
- Coarse-grained simulations: Making simulations is costly, especially if the simulations are of 'big' agents like superintelligences. But a coarse-grained simulation of a superintelligence probably gives different outputs from the full-fidelity version, which itself is a relevant complicating factor when considering the possibility that one is a simulation.
- Isomorphic simulations: Suppose you care about maximizing paperclips. Instead of simulating an identical copy of you, Omega could simulate a version of you that cares about, say, staplers. That way, Omega finds out what 'your algorithm' would do in a counterfactual mugging, but neither instance of your algorithm cares about benefiting the other. (This problem mainly affects CDT agents.)
- Limited compatibility of anthropic theories and decision theories: For instance, there exist Dutch books against EDT with SIA, and CDT with SSA. Doing anthropic reasoning becomes complicated once decision making is involved.
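Here is the toy expected-value sketch referred to above. It captures only the bare simulation reasoning; the credence, the payoffs, and the assumption that the agent values the reward received by its non-simulated copy as its own are stipulated, and none of the three complications just listed are modelled.

```python
# Toy sketch of the 'maybe I am Omega's simulation' reasoning above. The credence
# and payoffs are illustrative, and the agent is assumed to value the reward
# received by its non-simulated copy as if it were its own.

COST = 100
REWARD = 10_000

def expected_value_of_paying(p_simulation: float) -> float:
    """Expected value of paying for an agent with credence p_simulation that it is
    the simulation Omega runs to make its prediction."""
    value_if_simulation = REWARD   # paying here leads Omega to reward the real copy
    value_if_real = -COST          # paying here just costs money in this branch
    return p_simulation * value_if_simulation + (1 - p_simulation) * value_if_real

if __name__ == "__main__":
    for p in [0.0, 0.005, 0.01, 0.5]:
        print(f"P(I am the simulation) = {p:>5}: EV(pay) = {expected_value_of_paying(p):>8.1f}")
    # Paying becomes positive once p exceeds COST / (COST + REWARD), about 0.0099.
```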
Assuming this works, however, here are two observations related to the ones we made for other motivations:
- This reasoning can work for cases where one is ‘born into’ the decision problem, including for orthodox CDT agents.
- It is unclear whether this is able to motivate acting as if one were logically updateless. On the one hand, presumably, one can’t simulate agents in impossible worlds (if such agents could be simulated, the worlds wouldn’t be impossible). On the other hand, in practice, our simulator could sometimes fool us into believing that an impossible world is possible – but this might not apply to all cases where logical updatelessness could be relevant.
UDT as a result of rejecting everything else
You reject CDT for two-boxing in Newcomb’s problem, then reject EDT for not smoking in Smoking Lesion, then embrace EDT again because the tickle defense takes care of Smoking Lesion, then (maybe) reject EDT again for paying in XOR blackmail [AF · GW]. TDT[7] seems reasonable (modulo some questions around similar-but-not-identical algorithms [LW · GW]), and becoming Son of X does too, but then you realize that all of the previously mentioned decision theories wouldn’t pay in a logical counterfactual mugging if they’re ‘born into’ it [LW · GW]. However, whether paying in such a counterfactual mugging means winning or losing is a matter of perspective: a natural one is to say “well, I already know that the world in which Omega just gives me a million bucks is impossible, so the way to win is not to pay”. Having priors that prima facie give nonzero weight to impossible hypotheses, in my opinion, requires a normative or epistemic justification from somewhere, such as that maybe I care about logically impossible counterfactual worlds – but why? – or maybe I believe that I’m in a simulation, or that there exists a world where Omega has bet on the opposite parity of the relevant digit of π.
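To make the two perspectives in this paragraph concrete, here is a toy contrast (with illustrative payoffs and a stipulated 50/50 logical prior): the logically updateless evaluation scores the paying policy against the prior over the digit's parity, while the updateful evaluation scores it after the digit has been computed.

```python
# Toy contrast for the logical counterfactual mugging discussed above: the payoffs
# and the 50/50 logical prior over the digit's parity are illustrative assumptions.

COST = 100
REWARD = 10_000

def ev_pay_under_logical_prior(p_reward_branch: float = 0.5) -> float:
    """Expected value of the paying policy, scored against a logical prior that has
    not yet been updated on the actual parity of the relevant digit of pi."""
    return p_reward_branch * REWARD - (1 - p_reward_branch) * COST

def ev_pay_after_computing_digit() -> float:
    """Once the digit has been computed and turned out to disfavour the reward
    branch, that branch is logically impossible and paying is a pure loss."""
    return -COST

if __name__ == "__main__":
    print(ev_pay_under_logical_prior())     # 4950.0: the logically updateless verdict
    print(ev_pay_after_computing_digit())   # -100: the 'that world is impossible' verdict
```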
A more compelling reason [LW · GW] for rejecting updateful decision theories is their reliance on anthropics. Anthropic reasoning seems dubious[8] and enables various Dutch books, and perhaps other self-defeating maneuvers [LW · GW]. Addressing this cluster of issues is worth a post of its own, so I will not go into more detail here – for now I note that in my opinion, most of the argumentative force for why one would want to follow UDT comes from here.
Nevertheless, subscribing to UDT is not as straightforward as we would hope. There remain issues around how to formalize UDT in practice [LW · GW], how to select priors, and the question of whether (and to what extent) one should be logically updateless.
Takeaways
To summarize: there are different possible considerations that could pull someone towards considering UDT-like behavior, but different considerations yield qualitatively different flavors of it. In particular, this affects attitudes to logical updatelessness, or what one should do when one is ‘born into’ a decision problem without a previous opportunity to self-modify into Son of X.
Insofar as one thinks of UDT as an ideal, this might be unsatisfactory: for someone who has been ‘poisoned’ by previous decision-theoretic intuitions, it’s hard to arrive at what appears to be the very top of the mountain. For instance, even all of the first three approaches combined likely don’t lead to logical updatelessness with respect to facts one starts out knowing. But, to carry the analogy a bit too far, jumping straight to the very top of the mountain might leave us without oxygen: UDT has a bunch of free parameters, we don’t know how to set them, and one way to ground UDT-like behavior is by asking why, and to what extent, we would endorse it by our current lights.
The bottom line:
- There might not be a compelling justification for being logically updateless with respect to logical facts one already knows when one first considers updatelessness.
- Son of X versions of UDT are relatively unproblematic, but can’t justify updatelessness with respect to any facts one starts out knowing.
- Having preferences about counterfactual worlds is one description of what it means to be updateless. Compared to Son of X, preferences about counterfactual worlds can potentially motivate updatelessness even with respect to empirical facts that one has always known. But such preferences do not motivate logical updatelessness, unless one has preferences about impossible worlds.
- Anthropic uncertainty leads one to consider UDT-like behavior, but is all in all a pretty big mess in terms of what other implications it has.
- It is tempting to reject updateful decision theories for their reliance on dubious anthropics. As far as I can tell, the questions of whether (and how) to be updateless after having rejected anthropics, and on the flip side, whether anthropics combined with updateful decision theories can be saved, are open questions.
Acknowledgements
Thanks to Caspar Oesterheld, Jesse Clifton, Sylvester Kollin, Anthony DiGiovanni, and Martín Soto for helpful comments and discussion.
1. ^ Functional Decision Theory (FDT) is often considered a version of UDT, but Wei Dai doesn’t endorse this as far as I can tell [LW(p) · GW(p)]. For more context and an overview of different versions of UDT, I recommend this post [LW · GW]. This exchange [LW · GW] between people at MIRI and OpenPhil also provides helpful context and framing.
2. ^ An alternative attempt to gesture at this without reference to UDT or updatelessness: My decision making is UDT-like if I sometimes do something only because it would be endorsed by a version of myself in a different location in spacetime (of an actual or hypothetical universe).
3. ^ As the post linked in this sentence explains, this might be problematic if done carelessly. We might want to consider open-minded updatelessness [AF · GW].
4. ^ One could argue about whether this really counts as ‘caring about counterfactual worlds’.
5. ^ Though one way to motivate being UDT-like in this particular case is that maybe there exist worlds where Omega performed the counterfactual mugging on us by betting that the trillionth digit of pi would be odd, and that our choice is logically entangled with what happens in that world. But in practice, many problems involving logical counterfactuals could involve logical facts that are chosen in some deterministic (but previously unknown to us) way.
6. ^
7. ^
8. ^ Different anthropic theories partially rely on metaphysical intuitions/stories about how centered worlds or observer moments are 'sampled', and have counterintuitive implications (e.g., the Doomsday argument for SSA and the Presumptuous philosopher for SIA).
Comments
comment by Vladimir_Nesov · 2023-11-06T10:43:40.980Z · LW(p) · GW(p)
A frame that helps with logical updatelessness is acausal trade. A UDT agent in a particular situation coordinates with its other instances in other situations by adopting a shared policy that doesn't depend on details of the current situation, and then carrying out what the policy prescribes for the current situation. The same generalizes to coordination with other agents, or very unusual variants of the same agent, such as those that have different states of logical uncertainty, or different priors, or different preferences.
With acausal trade, we shouldn't insist on a single global policy of mysterious origin, but consider how hypothetical agents negotiate many contracts of smaller scopes with each other. These contracts nudge the agents that subscribe to them and not others, with every agent weaving its actions out of relevant contracts, and negotiating contracts with relevant other hypothetical agents.
Thus an agent O knowing that the 100th digit of pi is odd might hold a tripartite negotiation with both a hypothetical agent E that knows that the 100th digit of pi is even, and an agent U that doesn't know the parity of this digit. We might say that (a sane) E doesn't exist, so doesn't merit consideration, but it does merit consideration for U who doesn't know, and U does merit consideration for O. This gives a situation where O cares about E's position in negotiating a coordination policy/contract. The meaning of E's position for O is mediated by U's understanding of E, which is not much different from U's understanding of O.
comment by Ben (ben-lang) · 2023-11-07T11:47:14.279Z · LW(p) · GW(p)
This is an interesting post. I think it could be improved by defining some of the acronyms as they come up (UDT, CDT, EDT, all the decision theories basically). I think there is an audience on LessWrong who have read about decision theory before, but not enough to instinctively remember the acronyms. There are plenty of useful links, but I think that a quick reminder is still useful.
comment by Anthony DiGiovanni (antimonyanthony) · 2023-11-16T14:43:45.867Z · LW(p) · GW(p)
I enjoyed this post and think it should help reduce confusion in many future discussions, thanks!
Some comments on your remarks about anthropics:
Different anthropic theories partially rely on metaphysical intuitions/stories about how centered worlds or observer moments are 'sampled', and have counterintuitive implications (e.g., the Doomsday argument for SSA and the Presumptuous philosopher for SIA).
I'm not sure why this is an indictment of "anthropic reasoning" per se, as if that's escapable. It seems like all anthropic theories are trying to answer a question that one needs to answer when forming credences, i.e., how do we form likelihoods P(I observe I exist | world W)? (Which we want in order to compute P(world W | I observe I exist).)
Indeed just failing to anthropically update at all has counterintuitive implications, like the verdict of minimal-reference-class SSA in Joe C's "God's coin toss with equal numbers." [LW · GW] [no longer endorsed] And mrcSSA relies on the metaphysical intuition that oneself was necessarily going to observe X, i.e., P(I observe I exist | world W) = P(I observe I exist | not-W) = 1 (which is quite implausible IMO). [I think endorsed, but I feel confused:] And mrcSSA relies on the metaphysical intuition that, given that someone observes X, oneself was necessarily going to observe X, which is quite implausible IMO.