Posts

Winning isn't enough 2024-11-05T11:37:39.486Z
What are your cruxes for imprecise probabilities / decision rules? 2024-07-31T15:42:27.057Z
Individually incentivized safe Pareto improvements in open-source bargaining 2024-07-17T18:26:43.619Z
In defense of anthropically updating EDT 2024-03-05T06:21:46.114Z
Making AIs less likely to be spiteful 2023-09-26T14:12:06.202Z
Responses to apparent rationalist confusions about game / decision theory 2023-08-30T22:02:12.218Z
antimonyanthony's Shortform 2023-04-11T13:10:43.391Z
When is intent alignment sufficient or necessary to reduce AGI conflict? 2022-09-14T19:39:11.920Z
When would AGIs engage in conflict? 2022-09-14T19:38:22.478Z
When does technical work to reduce AGI conflict make a difference?: Introduction 2022-09-14T19:38:00.760Z

Comments

Comment by Anthony DiGiovanni (antimonyanthony) on Winning isn't enough · 2024-11-07T00:38:21.055Z · LW · GW

Adding to Jesse's comment, the "We’ve often heard things along the lines of..." line refers both to personal communications and to various comments we've seen, e.g.:

  • [link]: "Since this intuition leads to the (surely false) conclusion that a rational beneficent agent might just as well support the For Malaria Foundation as the Against Malaria Foundation, it seems to me that we have very good reason to reject that theoretical intuition"
  • [link]: "including a few mildly stubborn credence functions in some judiciously chosen representors can entail effective altruism from the longtermist perspective is a fool’s errand. Yet this seems false"
  • [link]: "I think that if you try to get any meaningful mileage out of the maximality rule ... basically everything becomes permissible, which seems highly undesirable"
    • (Also, as we point out in the post, this is only true insofar as you only use maximality, applied to total consequences. You can still regard obviously evil things as unacceptable on non-consequentialist grounds, for example.)
Comment by Anthony DiGiovanni (antimonyanthony) on Winning isn't enough · 2024-11-06T23:32:03.933Z · LW · GW

Without a clear definition of "winning,"

This is part of the problem we're pointing out in the post. We've encountered claims of this "winning" flavor that haven't been made precise, so we survey different things "winning" could mean more precisely, and argue that they're inadequate for figuring out which norms of rationality to adopt.

Comment by Anthony DiGiovanni (antimonyanthony) on Winning isn't enough · 2024-11-06T14:21:47.687Z · LW · GW

The key claim is: You can’t evaluate which beliefs and decision theory to endorse just by asking “which ones perform the best?” Because the whole question is what it means to systematically perform better, under uncertainty. Every operationalization of “systematically performing better” we’re aware of is either:

  • Incomplete — like “avoiding dominated strategies”, which leaves a lot unconstrained;
  • A poorly motivated proxy for the performance we actually care about — like “doing what’s worked in the past”; or
  • Secretly smuggling in nontrivial non-pragmatic assumptions — like “doing what’s worked in the past, not because that’s what we actually care about, but because past performance predicts future performance”

This is what we meant to convey with this sentence: “On any way of making sense of those words, we end up either calling a very wide range of beliefs and decisions “rational”, or reifying an objective that has nothing to do with our terminal goals without some substantive assumptions.”

(I can't tell from your comment whether you agree with all of that. If this was all obvious to you, great! But we’ve often had discussions where someone appealed to “which ones perform the best?” in a way that misses these points.)

Comment by Anthony DiGiovanni (antimonyanthony) on Winning isn't enough · 2024-11-06T01:28:55.982Z · LW · GW

Sorry this was confusing! From our definition here:

We’ll use “pragmatic principles” to refer to principles according to which belief-forming or decision-making procedures should “perform well” in some sense.

  • "Avoiding dominated strategies" is pragmatic because it directly evaluates a decision procedure or set of beliefs based on its performance. (People do sometimes apply pragmatic principles like this one directly to beliefs, see e.g. this work on anthropics.)
  • Deference isn't pragmatic, because the appropriateness of your beliefs is evaluated by how your beliefs relate to the person you're deferring to. Someone could say, "You should defer because this tends to lead to good consequences," but then they're not applying deference directly as a principle — the underlying principle is "doing what's worked in the past."
Comment by Anthony DiGiovanni (antimonyanthony) on You can, in fact, bamboozle an unaligned AI into sparing your life · 2024-10-06T18:27:05.292Z · LW · GW

at time 1 you're in a strictly better epistemic position

Right, but 1-me has different incentives by virtue of this epistemic position. Conditional on being at the ATM, 1-me would be better off not paying the driver. (Yet 0-me is better off if the driver predicts that 1-me will pay, hence the incentive to commit.)

I'm not sure if this is an instance of what you call "having different values" — if so I'd call that a confusing use of the phrase, and it doesn't seem counterintuitive to me at all.

Comment by Anthony DiGiovanni (antimonyanthony) on You can, in fact, bamboozle an unaligned AI into sparing your life · 2024-10-03T07:59:14.528Z · LW · GW

(I might not reply further, because historically I've found that people simply seem to have different bedrock intuitions about this. But who knows!)

I intrinsically only care about the real world (I find the Tegmark IV arguments against this pretty unconvincing). As far as I can tell, the standard justification for acting as if one cares about nonexistent worlds is diachronic norms of rationality. But I don't see an independent motivation for diachronic norms, as I explain here. Given this, I think it would be a mistake to pretend my preferences are something other than what they actually are.

Comment by Anthony DiGiovanni (antimonyanthony) on You can, in fact, bamboozle an unaligned AI into sparing your life · 2024-10-02T08:16:30.783Z · LW · GW

Thanks for clarifying! 

covered under #1 in my list of open questions

To be clear, by "indexical values" in that context I assume you mean indexing on whether a given world is "real" vs "counterfactual," not just indexical in the sense of being egoistic? (Because I think there are compelling reasons to reject UDT without being egoistic.)

Comment by Anthony DiGiovanni (antimonyanthony) on You can, in fact, bamboozle an unaligned AI into sparing your life · 2024-10-01T22:36:41.468Z · LW · GW

I strongly agree with this, but I'm confused that this is your view given that you endorse UDT. Why do you think your future self will honor the commitment of following UDT, even in situations where your future self wouldn't want to honor it (because following UDT is not ex interim optimal from his perspective)?

Comment by Anthony DiGiovanni (antimonyanthony) on antimonyanthony's Shortform · 2024-09-08T19:15:47.320Z · LW · GW

I'm afraid I don't understand your point — could you please rephrase?

Comment by Anthony DiGiovanni (antimonyanthony) on antimonyanthony's Shortform · 2024-09-04T20:57:02.440Z · LW · GW

Linkpost: "Against dynamic consistency: Why not time-slice rationality?"

This got too long for a "quick take," but also isn't polished enough for a top-level post. So onto my blog it goes.

I’ve been skeptical for a while of updateless decision theory, diachronic Dutch books, and dynamic consistency as a rational requirement. I think Hedden's (2015) notion of time-slice rationality nicely grounds the cluster of intuitions behind this skepticism.

Comment by Anthony DiGiovanni (antimonyanthony) on What’s this probability you’re reporting? · 2024-08-17T13:01:26.309Z · LW · GW

"I'll {take/lay} $100 at those odds, what's our resolution mechanism?" is an excellent clarification mechanism

I think one reason this has fallen out of favor is that it seems to me to be a type error. Taking $100 at some odds is a (hypothetical) decision, not a belief. And the reason you'd be willing to take $100 at some odds is that your credence in the statement is such that taking the bet would be net-positive in expectation.
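To spell out that dependence (with made-up stakes): risking $100 to win $W if the statement turns out true is net-positive in expectation exactly when

$$p \cdot W - (1 - p) \cdot 100 > 0 \iff p > \frac{100}{100 + W},$$

so the willingness to bet is derived from the credence $p$, rather than being identical to it.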

Comment by Anthony DiGiovanni (antimonyanthony) on What are your cruxes for imprecise probabilities / decision rules? · 2024-08-11T16:50:23.673Z · LW · GW

I still feel like I don't know what having a strict preference or permissibility means — is there some way to translate these things to actions?

As an aspiring rational agent, I'm faced with lots of options. What do I do? Ideally I'd like to just be able to say which option is "best" and do that. If I can completely rank the options by expected utility, then clearly the best option is the expected utility-maximizing one. If I don't have such a complete ranking, things are messier. I start by ruling out dominated options (as Maximality does). The options in the remaining set are all "permissible" in the sense that I haven't yet found a reason to rule them out.

I do of course need to choose an action eventually. But I have some decision-theoretic uncertainty. So, given the time to do so, I want to deliberate about which ways of narrowing down this set of options further seem most reasonable (i.e., satisfy principles of rational choice I find compelling).

(Basically I think EU maximization is a special case of “narrow down the permissible set as much as you can via principles of rational choice,[1] then just pick something from whatever remains.” It’s so straightforward in this case that we don’t even recognize we’re identifying a (singleton) “permissible set.”)
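(To make the "ruling out" step concrete, here's a minimal sketch of the Maximality rule with made-up options and a toy representor; nothing here is meant as a canonical implementation.)

```python
# Maximality, roughly: an option is ruled out only if some other option has
# strictly higher expected utility under *every* distribution in the representor.
# Utilities are indexed as utilities[option][state]; all numbers are made up.
utilities = {
    "A": [10.0, 0.0],
    "B": [0.0, 10.0],
    "C": [4.0, 4.0],
    "D": [3.0, 3.0],  # strictly worse than C under every distribution
}

# An (imprecise) representor: candidate distributions over the two states.
representor = [(0.2, 0.8), (0.5, 0.5), (0.8, 0.2)]

def expected_utility(option, dist):
    return sum(p * u for p, u in zip(dist, utilities[option]))

def maximality_permissible(options):
    permissible = []
    for a in options:
        ruled_out = any(
            all(expected_utility(b, d) > expected_utility(a, d) for d in representor)
            for b in options if b != a
        )
        if not ruled_out:
            permissible.append(a)
    return permissible

print(maximality_permissible(list(utilities)))  # ['A', 'B', 'C']: only D is ruled out
```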

Now, maybe you'd just want to model this situation like: "For embedded agents, 'deliberation' is just an option like any other. Your revealed strict preference is to deliberate about rational choice." I might be fine with this model.[2] But:

  • For the purposes of discussing how {the VOI of deliberation about rational choice} compares to {the value of going with our current “best guess” in some sense}, I find it conceptually helpful to think of “choosing to deliberate about rational choice” as qualitatively different from other choices.
  • The procedure I use to decide to deliberate about rational choice principles is not “I maximize EV w.r.t. some beliefs,” it’s “I see that my permissible set is not a singleton, I want more action-guidance, so I look for more action-guidance.”
  1. ^

    "Achieve Pareto-efficiency" (as per the CCT) is one example of such a principle.

  2. ^

    Though I think once you open the door to this embedded agency stuff, reasoning about rational choice in general becomes confusing even for people who like precise EV max.

Comment by Anthony DiGiovanni (antimonyanthony) on What are your cruxes for imprecise probabilities / decision rules? · 2024-08-09T10:06:01.765Z · LW · GW

It seems to me like you were like: "why not regiment one's thinking xyz-ly?" (in your original question), to which I was like "if one regiments one thinking xyz-ly, then it's an utter disaster" (in that bullet point), and now you're like "even if it's an utter disaster, I don't care"

My claim is that your notion of "utter disaster" presumes that a consequentialist under deep uncertainty has some sense of what to do, such that they don't consider ~everything permissible. This begs the question against severe imprecision. I don't really see why we should expect our pretheoretic intuitions about the verdicts of a value system as weird as impartial longtermist consequentialism, under uncertainty as severe as ours, to be a guide to our epistemics.

I agree that intuitively it's a very strange and disturbing verdict that ~everything is permissible! But that seems to be the fault of impartial longtermist consequentialism, not imprecise beliefs.

Comment by Anthony DiGiovanni (antimonyanthony) on What are your cruxes for imprecise probabilities / decision rules? · 2024-08-09T10:02:20.199Z · LW · GW

The branch that's about sequential decision-making, you mean? I'm unconvinced by this too; see, e.g., here — I'd appreciate more explicit arguments for this being "nonsense."

Comment by Anthony DiGiovanni (antimonyanthony) on What are your cruxes for imprecise probabilities / decision rules? · 2024-08-09T09:57:28.562Z · LW · GW

my response: "The arrow does not point toward most Sun prayer decision rules. In fact, it only points toward the ones that are secretly bayesian expected utility maximization. Anyway, I feel like this does very little to address my original point that there is this big red arrow pointing toward bayesian expected utility maximization and no big red arrow pointing toward Sun prayer decision rules."

I don't really understand your point, sorry. "Big red arrows towards X" are only a problem for doing Y if (1) they tell me that doing Y is inconsistent with doing [the form of X that's necessary to avoid leaving value on the table]. And these arrows aren't action-guiding for me unless (2) they tell me which particular variant of X to do. I've argued that there is no sense in which either (1) or (2) is true. Further, I think there are various big green arrows towards Y, as sketched in the SEP article and Mogensen paper I linked in the OP, though I understand if these aren't fully satisfying positive arguments. (I tentatively plan to write such positive arguments up elsewhere.)

I'm just not swayed by vibes-level "arrows" if there isn't an argument that my approach is leaving value on the table by my lights, or that you have a particular approach that doesn't do so.

Comment by Anthony DiGiovanni (antimonyanthony) on In defense of anthropically updating EDT · 2024-08-06T22:03:38.012Z · LW · GW

Addendum: The approach I take in "Ex ante sure losses are irrelevant if you never actually occupy the ex ante perspective" has precedent in Hedden (2015)'s defense of "time-slice rationality," which I highly recommend. Relevant quote:

I am unmoved by the Diachronic Dutch Book Argument, whether for Conditionalization or for Reflection. This is because from the perspective of Time-Slice Rationality, it is question-begging. It is uncontroversial that collections of distinct agents can act in a way that predictably produces a mutually disadvantageous outcome without there being any irrationality. The defender of the Diachronic Dutch Book Argument must assume that this cannot happen with collections of time-slices of the same agent; if a collection of time-slices of the same agent predictably produces a disadvantageous outcome, there is ipso facto something irrational going on. Needless to say, this assumption will not be granted by the defender of Time-Slice Rationality, who thinks that the relationship between time-slices of the same agent is not importantly different, for purposes of rational evaluation, from the relationship between time-slices of distinct agents.

Comment by Anthony DiGiovanni (antimonyanthony) on What are your cruxes for imprecise probabilities / decision rules? · 2024-08-06T07:48:07.315Z · LW · GW

I reject the premise that my beliefs are equivalent to my betting odds. My betting odds are a decision, which I derive from my beliefs.

Comment by Anthony DiGiovanni (antimonyanthony) on What are your cruxes for imprecise probabilities / decision rules? · 2024-08-05T21:08:20.631Z · LW · GW

It's not that I "find it unlikely on priors" — I'm literally asking what your prior on the proposition I mentioned is, and why you endorse that prior. If you answered that, I could answer why I'm skeptical that that prior really is the unique representation of your state of knowledge. (It might well be the unique representation of the most-salient-to-you intuitions about the proposition, but that's not your state of knowledge.) I don't know what further positive argument you're looking for.

Comment by Anthony DiGiovanni (antimonyanthony) on What are your cruxes for imprecise probabilities / decision rules? · 2024-08-05T20:01:15.070Z · LW · GW

really ridiculously strong claim

What's your prior that in 1000 years, an Earth-originating superintelligence will be aligned to object-level values close to those of humans alive today [for whatever operationalization of "object-level" or "close" you like]? And why do you think that prior uniquely accurately represents your state of knowledge? Seems to me like the view that a single prior does accurately represent your state of knowledge is the strong claim. I don’t see how the rest of your comment answers this.

(Maybe you have in mind a very different conception of “represent” or “state of knowledge” than I do.)

Comment by Anthony DiGiovanni (antimonyanthony) on What are your cruxes for imprecise probabilities / decision rules? · 2024-08-05T12:27:19.554Z · LW · GW

And indeed, it is easy to come up with a case where the action that gets chosen is not best according to any distribution in your set of distributions: let there be one action which is uniformly fine and also for each distribution in the set, let there be an action which is great according to that distribution and disastrous according to every other distribution; the uniformly fine action gets selected, but this isn't EV max for any distribution in your representor.

Oops sorry, my claim had the implicit assumptions that (1) your representor includes all the convex combinations, and (2) you can use mixed strategies. ((2) is standard in decision theory, and I think (1) is a reasonable assumption — if I feel clueless as to how much I endorse distribution p vs distribution q, it seems weird for me to still be confident that I don't endorse a mixture of the two.)

If those assumptions hold, I think you can show that the max-regret-minimizing action maximizes EV w.r.t. some distribution in your representor. I don't have a proof on hand but would welcome counterexamples. In your example, you can check that either the uniformly fine action does best on a mixture distribution, or a mix of the other actions does best (lmk if spelling this out would be helpful).
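To spell that out, here's a rough numerical sketch of the example (with made-up utilities; a grid search, not a proof):

```python
import numpy as np

# Setup, following the example above: a representor built from two base
# distributions p and q (plus all their convex mixtures), one "uniformly fine"
# action, and one action tailored to each base distribution.
# Rows: expected utility of each action under p and under q.
EU = np.array([[0.6, 0.6],   # uniformly fine
               [1.0, 0.0],   # great under p, disastrous under q
               [0.0, 1.0]])  # great under q, disastrous under p

lams = np.linspace(0, 1, 101)  # the representor: all mixtures lam*p + (1-lam)*q
grid = np.linspace(0, 1, 51)
strategies = [np.array([a, b, 1 - a - b])        # mixed strategies over actions
              for a in grid for b in grid if a + b <= 1]

def eu(strategy, lam):
    return strategy @ (lam * EU[:, 0] + (1 - lam) * EU[:, 1])

def best_eu(lam):
    return max(lam * EU[:, 0] + (1 - lam) * EU[:, 1])  # best achievable EU here

def worst_case_regret(strategy):
    return max(best_eu(lam) - eu(strategy, lam) for lam in lams)

mmr = min(strategies, key=worst_case_regret)
print("minimax-regret strategy (approx):", np.round(mmr, 2))  # the pure "uniformly fine" action
print("EV-maximal under some distribution in the representor:",
      any(eu(mmr, lam) >= best_eu(lam) - 1e-9 for lam in lams))  # True
```

(With these numbers the uniformly fine action is the minimax-regret choice and maximizes EV under the 50/50 mixture; lowering its utility to, say, 0.4 instead makes the 50/50 mix of the other two actions minimax-regret-optimal, and that mix likewise maximizes EV under the 50/50 mixture.)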

Comment by Anthony DiGiovanni (antimonyanthony) on What are your cruxes for imprecise probabilities / decision rules? · 2024-08-05T08:45:51.839Z · LW · GW

If you buy the CCT's assumptions, then you literally do have an argument that anything other than precise EV maximization is bad

No, you have an argument that {anything that cannot be represented after the fact as precise EV maximization, with respect to some utility function and distribution} is bad. This doesn't imply that an agent who maintains imprecise beliefs will do badly.

Maybe you're thinking something like: "The CCT says that my policy is guaranteed to be Pareto-efficient iff it maximizes EV w.r.t. some distribution. So even if I don't know which distribution to choose, and even though following Maximality doesn't guarantee that I'm Pareto-inefficient, I at least know I don't violate Pareto-efficiency if I do precise EV maximization"?

If so: I'd say that there are several imprecise decision rules that can be represented after the fact as precise EV max w.r.t. some distributions, so the CCT doesn't rule them out. E.g.:

  • The minimax regret rule (sec 5.4.2 of Bradley (2012)) is equivalent to EV max w.r.t. the distribution in your representor that induces maximum regret.
  • The maximin rule (sec 5.4.1) is equivalent to EV max w.r.t. the most pessimistic distribution.

You might say "Then why not just do precise EV max w.r.t. those distributions?" But the whole problem you face as a decision-maker is, how do you decide which distribution? Different distributions recommend different policies. If you endorse precise beliefs, it seems you'll commit to one distribution that you think best represents your epistemic state. Whereas someone with imprecise beliefs will say: "My epistemic state is not represented by just one distribution. I'll evaluate the imprecise decision rules based on which decision-theoretic desiderata they satisfy, then apply the most appealing decision rule (or some way of aggregating them) w.r.t. my imprecise beliefs." If the decision procedure you follow is psychologically equivalent to my previous sentence, then I have no objection to your procedure — I just think it would be misleading to say you endorse precise beliefs in that case.

Comment by Anthony DiGiovanni (antimonyanthony) on What are your cruxes for imprecise probabilities / decision rules? · 2024-08-04T13:16:51.067Z · LW · GW

Thanks for the detailed answer! I won't have time to respond to everything here, but:

I like the canonical arguments for bayesian expected utility maximization ( https://www.alignmentforum.org/posts/sZuw6SGfmZHvcAAEP/complete-class-consequentialist-foundations ; also https://web.stanford.edu/~hammond/conseqFounds.pdf seems cool (though I haven't read it properly)). I've never seen anything remotely close for any of this other stuff

But the CCT only says that if you satisfy [blah], your policy is consistent with precise EV maximization. This doesn't imply your policy is inconsistent with Maximality, nor (as far as I know) does it tell you which distribution you should maximize precise EV with respect to in order to satisfy [blah] (or even that such a distribution is unique). So I don’t see a positive case here for precise EV maximization [ETA: as a procedure to guide your decisions, that is]. (This is also my response to your remark below about “equivalent to "act consistently with being an expected utility maximizer".”)

e.g. if one takes the cost of thinking into account in the calculation, or thinks of oneself as choosing a policy

Could you expand on this with an example? I don’t follow.

people often talk about things like default actions, permissibility, and preferential gaps, and these concepts seem bad to me. More precisely, they seem unnatural/unprincipled/confused/[I have a hard time imagining what they could concretely cache out to that would make the rule seem non-silly/useful].

Maximality and imprecision don’t make any reference to “default actions,” so I’m confused. I also don’t understand what’s unnatural/unprincipled/confused about permissibility or preferential gaps. They seem quite principled to me: I have a strict preference for taking action A over B (/ B is impermissible) only if I’m justified in beliefs according to which I expect A to do better than B.

basically everything becomes permissible, which seems highly undesirable

This is a much longer conversation, but briefly: I think it’s ad hoc / putting the cart before the horse to shape our epistemology to fit our intuitions about what decision guidance we should have.

Comment by Anthony DiGiovanni (antimonyanthony) on What are your cruxes for imprecise probabilities / decision rules? · 2024-08-04T12:53:46.250Z · LW · GW

I agree that higher-order probabilities can be useful for representing (non-)resilience of your beliefs. But imprecise probabilities go further than that — the idea is that you just don't know which higher-order probabilities over the first-order ones you ought to endorse, or the higher-higher-order probabilities over those, etc. So the first-order probabilities remain imprecise.

Comment by Anthony DiGiovanni (antimonyanthony) on What are your cruxes for imprecise probabilities / decision rules? · 2024-08-03T11:29:38.741Z · LW · GW

we need to rely on priors, so what priors accurately represent our actual state of knowledge/ignorance?

Exactly — and I don't see how this is in tension with imprecision. The motivation for imprecision is that no single prior seems to accurately represent our actual state of knowledge/ignorance.

Comment by Anthony DiGiovanni (antimonyanthony) on What are your cruxes for imprecise probabilities / decision rules? · 2024-08-02T08:09:43.737Z · LW · GW

Thanks! Can you say a bit on why you find the kinds of motivations discussed in (edit: changed reference) Sec. 2 of here ad hoc and unmotivated, if you're already familiar with them (no worries if not)? (I would at least agree that rationalizing people's intuitive ambiguity aversion is ad hoc and unmotivated.)

Comment by Anthony DiGiovanni (antimonyanthony) on What are your cruxes for imprecise probabilities / decision rules? · 2024-07-31T21:49:23.444Z · LW · GW

Predicting the long-term future, mostly. (I think imprecise probabilities might be relevant more broadly, though, as an epistemic foundation.)

Comment by Anthony DiGiovanni (antimonyanthony) on antimonyanthony's Shortform · 2024-07-31T07:54:19.191Z · LW · GW

I think I just don't understand / probably disagree with the premise of your question, sorry. I'm taking as given whatever distinction between these two ontologies is noted in the post I linked. These don't need to be mathematically precise in order to be useful concepts.

Comment by Anthony DiGiovanni (antimonyanthony) on antimonyanthony's Shortform · 2024-07-30T22:19:54.926Z · LW · GW

shrug — I guess it's not worth rehashing pretty old-on-LW decision theory disagreements, but: (1) I just don't find the pre-theoretic verdicts in that paper nearly as obvious as the authors do, since these problems are so out-of-distribution. Decision theory is hard. Also, some interpretations of logical decision theories give the pre-theoretically "wrong" verdict on "betting on the past." (2) I pre-theoretically find the kind of logical updatelessness that some folks claim follows from the algorithmic ontology pretty bizarre. (3) On its face it seems more plausible to me that algorithms just aren’t ontologically basic, they’re abstractions we use to represent (physical) input-output processes.

Comment by Anthony DiGiovanni (antimonyanthony) on antimonyanthony's Shortform · 2024-07-30T21:24:43.134Z · LW · GW

Thanks, that's helpful!

I am indeed interested in decision theory that applies to agents other than AIs that know their own source code. Though I'm not sure why it's a problem for the physicalist ontology that the agent doesn't know the exact details of itself — seems plausible to me that "decisions" might just be a vague concept, which we still want to be able to reason about under bounded rationality. E.g. under physicalist EDT, what I ask myself when I consider a decision to do X is, "What consequences do I expect conditional on my brain-state going through the process that I call 'deciding to do X' [and conditional on all the other relevant info I know including my own reasoning about this decision, per the Tickle Defense]?" But I might miss your point.

Re: mathematical universe hypothesis: I'm pretty unconvinced, though I at least see the prima facie motivation (IIUC: we want an explanation for why the universe we find ourselves in has the dynamical laws and initial conditions it does, rather than some others). Not an expert here, this is just based on some limited exploration of the topic. My main objections:

  • The move from "fundamental physics is very well described by mathematics" to "physics is (some) mathematical structure" seems like a map-territory error. I just don't see the justification for this.
  • I worry about giving description-length complexity a privileged status when setting priors / judging how "simple" a hypothesis is. The Great Meta-Turing Machine in the Sky as described by Schmidhuber scores very poorly by the speed prior.
  • It's very much not obvious to me that conscious experience is computable. (This is a whole can of worms in this community, presumably :).)
Comment by Anthony DiGiovanni (antimonyanthony) on antimonyanthony's Shortform · 2024-07-29T22:57:39.275Z · LW · GW

Thanks — do you have a specific section of the paper in mind? Is the idea that this ontology is motivated by "finding a decision theory that recommends verdicts in such and such decision problems that we find pre-theoretically intuitive"?

Comment by Anthony DiGiovanni (antimonyanthony) on antimonyanthony's Shortform · 2024-07-29T19:35:35.671Z · LW · GW

Not sure what you mean by "the math" exactly. I've heard people cite the algorithmic ontology as a motivation for, e.g., logical updatelessness, or for updateless decision theory generally. In the case of logical updatelessness, I think (low confidence!) the idea is that if you don't see yourself as this physical object that exists in "the real world," but rather see yourself as an algorithm instantiated in a bunch of possible worlds, then it might be sensible to follow a policy that doesn't update on e.g. the first digit of pi being odd.

Comment by Anthony DiGiovanni (antimonyanthony) on antimonyanthony's Shortform · 2024-07-29T09:16:03.409Z · LW · GW

I continue to be puzzled as to why many people on LW are very confident in the "algorithmic ontology" about decision theory:

So I see all axes except the "algorithm" axis as "live debates" -- basically anyone who has thought about it very much seems to agree that you control "the policy of agents who sufficiently resemble you" (rather than something more myopic like "your individual action")

Can someone point to resources that clearly argue for this position? (I don't think that, e.g., the intuition that you ought to cooperate with your exact copy in a Prisoner's Dilemma — much as I share it — is an argument for this ontology. You could endorse the physicalist ontology + EDT, for example.)

Comment by Anthony DiGiovanni (antimonyanthony) on An AI Race With China Can Be Better Than Not Racing · 2024-07-21T09:37:31.828Z · LW · GW

Remember, the default outcome in a n-round prisoners dilemma in CDT is still constant defect, because you just argue inductively that you will definitely be defected on in the last round. So it being single shot isn't necessary.

I think the inductive argument just isn't that strong when dealing with real agents. If, for whatever reason, you believe that your counterpart will respond in a tit-for-tat manner even in a finite-round PD, even though that's not a Nash equilibrium strategy, your best response is not necessarily to defect. So CDT in a vacuum doesn't prescribe always-defect; you need assumptions about the players' beliefs, and I think the assumption of Nash equilibrium, or of common knowledge of backward induction plus iterated deletion of dominated strategies, is questionable.
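As a quick illustration (standard PD payoffs T=5, R=3, P=1, S=0, and an arbitrary 10-round horizon, chosen just for concreteness):

```python
# Payoffs to the row player: (my move, opponent's move) -> my payoff.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def total_payoff_vs_tit_for_tat(my_moves):
    """My total payoff against a tit-for-tat opponent, who cooperates in
    round 1 and thereafter copies my previous move."""
    opponent_move, total = "C", 0
    for move in my_moves:
        total += PAYOFF[(move, opponent_move)]
        opponent_move = move
    return total

n = 10
print(total_payoff_vs_tit_for_tat(["D"] * n))                # always defect: 5 + 9*1 = 14
print(total_payoff_vs_tit_for_tat(["C"] * (n - 1) + ["D"]))  # defect only in last round: 9*3 + 5 = 32
```

So against a (non-equilibrium) tit-for-tat opponent, always-defect is far from the best response even with a known finite horizon.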

Also, of course, CDT agents can use conditional commitment + coordination devices.

the whole problem with TDT-ish arguments is that we have very little principled foundation of how to reason when two actors are quite imperfect decision-theoretic copies of each other

Agreed!

Comment by Anthony DiGiovanni (antimonyanthony) on An AI Race With China Can Be Better Than Not Racing · 2024-07-20T12:51:18.367Z · LW · GW

Doing a naive CDT-ish expected-utility calculation

I'm confused by this. Someone can endorse CDT and still recognize that in a situation where agents make decisions over time in response to each other's decisions (or announcements of their strategies), unconditional defection can be bad. If you're instead proposing that we should model this as a one-shot Prisoner's Dilemma, then (1) that seems implausible, and (2) the assumption that US and China are anything close to decision-theoretic copies of each other (such that non-CDT views would recommend cooperation) also seems implausible.

I guess you might insist that "naive" and "-ish" are the operative terms, but I think this is still unnecessarily propagating a misleading meme of "CDT iff defect."

Comment by Anthony DiGiovanni (antimonyanthony) on Individually incentivized safe Pareto improvements in open-source bargaining · 2024-07-19T08:36:27.385Z · LW · GW

Sorry this was unclear — surrogate goals indeed aren't required to implement renegotiation. Renegotiation can be done just in the bargaining context without changing one’s goals generally (which might introduce unwanted side effects). We just meant to say that surrogate goals might be one way for an agent to self-modify so as to guarantee the PMM for themselves (from the perspective of the agent before they had the surrogate goal), without needing to implement a renegotiation program per se.

I think renegotiation programs help provide a proof of concept for a rigorous argument that, given certain capabilities and beliefs, EU maximizers are incentivized ex ante to avoid the worst conflict. But I expect you’d be able to make an analogous argument, with different assumptions, that surrogate goals are an individually incentivized unilateral SPI.[1]

  1. ^

    Though note that even though SPIs implemented with renegotiation programs are bilateral, our result is that each agent individually prefers to use a (PMP-extension) renegotiation program. Analogous to how “cooperate iff your source code == mine” only works bilaterally, but doesn’t require coordination. So it’s not clear that they require much stronger conditions in practice than surrogate goals.

Comment by Anthony DiGiovanni (antimonyanthony) on There are no coherence theorems · 2024-05-15T07:27:43.109Z · LW · GW

You describe an agent that dodges the money-pump by simply acting consistently with past choices. Internally this agent has an incomplete representation of preferences, plus a memory. But externally it looks like this agent is acting like it assigns equal value to whatever indifferent things it thought of choosing between first.

Not sure I follow this / agree. Seems to me that in the "Single-Souring Money Pump" case:

  • If the agent systematically goes down at node 1, all we learn is that the agent doesn't strictly prefer [B or A-] to A.
  • If the agent systematically goes up at node 1 and down at node 2, all we learn is that the agent doesn't strictly prefer [A or A-] to B.

So this doesn't tell us what the agent would do if they were faced with just a choice between A and B, or A- and B. We can't conclude "equal value" here.

Comment by Anthony DiGiovanni (antimonyanthony) on In defense of anthropically updating EDT · 2024-03-24T22:55:01.271Z · LW · GW
  • I interpret a decision theory as an answer to “Given my values and beliefs, what am I trying to do as an agent (i.e., if rationality is ‘winning,’ what is ‘winning’)?” Insofar as I endorse maximizing expected utility, a decision theory is an answer to “How do I define ‘expected utility,’ and what options do I view myself as maximizing over?”
    • I think it’s important to consider these normative questions, not just “What decision procedure wins, given my definition of ‘winning’?”
    • (I discuss similar themes here.)
  • On this interpretation of “decision theory,” EDT is the most appealing option I’m aware of. What I’m trying to do just seems to be: “make decisions such that I expect the best consequences conditional on those decisions.” The EDT criterion satisfies some very appealing principles like the “irrelevance of impossible outcomes.” And the “decisions” in question determine my actions in the given decision node.
  • I take view #1 in your list in “What are probabilities?”
    • I don’t think “arbitrariness” in this sense is problematic. There is a genuine mystery here as to why the world is the way it is, but I don’t think we can infer the existence of other worlds purely from our confusion.
    • And it just doesn’t seem that the thing I’m doing when I’m forming beliefs about the world is answering “how much do I care about different possible worlds?”
  • Indexicals: I haven’t formed a deliberate view on this. A flat-footed response to cases like your “old puzzle” in the comment you linked: Insofar as I simply don’t experience a superposition of experiences at once, it seems that if I get copied, “I” just will experience one of the copies’ experience-streams and not the others’. (Again I don’t consider it problematic that there’s some arbitrariness in which of the copies ends up being “me” — indeed if Everett is right then this sort of arbitrary direction of the flow of experience-streams happens all the time.) I think “you are just a different person from your future self, so there’s no fact of the matter what you will observe” is a reasonable alternative though.
  • I take a physicalist* view of agents: “There are particular configurations of stuff that can be well-modeled as ‘decision-makers.’ A configuration of stuff is ‘making a decision’ (relative to their epistemic state) insofar as they’re uncertain what their future behavior will be, and using some process that selects that future behavior in a way that is well-modeled as goal-directed. [Obviously there’s more to say about what counts as ‘well-modeled.’] My processes of deliberation about decisions and behavior resulting from those decisions can tell me what other configurations-of-stuff are probably doing, but I don’t see a motivation for modeling myself as actually being the same agent as those other configurations-of-stuff.”
  • Epistemic principles: Things like the principle of indifference, i.e., distribute credence equally over indistinguishable possibilities, all else equal.
     

* [Not to say I endorse physicalism in the broad sense]

Comment by Anthony DiGiovanni (antimonyanthony) on Cooperating with aliens and AGIs: An ECL explainer · 2024-03-24T18:29:27.717Z · LW · GW

The model does not capture the fact that the total value you can provide to the commons likely scales with the diversity (and by proxy, fraction) of agents that have different values. In some models, this effect is strong enough to flip whether a larger fraction of agents with your values favors cooperating or defecting.

I'm curious to hear more about this, could you explain what these other models are?

Comment by Anthony DiGiovanni (antimonyanthony) on In defense of anthropically updating EDT · 2024-03-10T21:26:40.109Z · LW · GW

What is this in referrence to?

I took you to be saying: If the vast majority of agent-moments don’t update, this is some sign that those of us who do still update might be making a mistake.

So I’m saying: I know that 1) the reason the vast majority of agent-moments wouldn’t update (let’s grant this) is that they had predecessors who bound them not to update, and 2) I just am not bound by any such predecessors. Then, due to (2) it’s unsurprising that what’s optimal for me would be different from what the vast majority of agent-moments do.

Re: your explanation of the mystery:

So you make a resolution that when you do fully solve all the relevant philosophical problems and end up deciding that updatelessness is correct, you'll self-modify to be updateless with respect to today's prior, instead of the future prior (at time of the modification).

Not central (I think?), but I'm unsure whether this move works; at least, it depends on the details of the situation. E.g. if the hope is "By self-modifying later on to be updateless w.r.t. my current prior, I'll still be able to cooperate with lots of other agents in a similar epistemic situation to my current one, even after we end up in different epistemic situations [in which my decision is much less correlated with those agents' decisions]," I'm skeptical of that, for reasons similar to my argument here.

when the day finally comes, you could also think, "If 15-year old me had known about updatelessness, he would have made the same resolution but with respect to his prior instead of Anthony-2024's prior. The fact that he didn't is simply a mistake or historical accident, which I have the power to correct. Why shouldn't I act as if he did make that resolution?" And I don't see what would stop you from carrying that out either.

I think where we disagree is that I'm unconvinced there is any mistake-from-my-current-perspective to correct in the cases of anthropic updating. There would have been a mistake from the perspective of some hypothetical predecessor of mine asked to choose between different plans (before knowing who I am), but that's just not my perspective. I'd claim that in order to argue I'm making a mistake from my current perspective, you'd want to argue that I don't actually get information such that anthropic updating follows from Bayesianism.

An important point to emphasize here is that your conscious mind currently isn't running some decision theory with a well-defined algorithm and utility function, so we can't decide what to do by thinking "what would this decision theory recommend".

I absolutely agree with this! And don't see why it's in tension with my view.

Comment by Anthony DiGiovanni (antimonyanthony) on In defense of anthropically updating EDT · 2024-03-10T20:49:00.104Z · LW · GW

Now, you are free to choose to bite the bullet that it has never been about getting the correct betting odds in the first place. For some reason, people bite all kind of ridiculous bullets specifically in anthropic reasoning, and so I hoped that re-framing the issue as a recipe for purple paint may snap you out of it, which, apparently, failed to be the case.

By what standard do you judge some betting odds as "correct" here? If it's ex ante optimality, I don't see the motivation for that (as discussed in the post), and I'm unconvinced by just calling the verdict a "ridiculous bullet." If it's about matching the frequency of awakenings, I just don't see why the decision should only count N once here — and there doesn't seem to be a principled epistemology that guarantees you'll count N exactly once if you use EDT, as I note in "Aside: Non-anthropically updating EDT sometimes 'fails' these cases."

I gave independent epistemic arguments for anthropic updating at the end of the post, which you haven't addressed, so I'm unconvinced by your insistence that SIA (and I presume you also mean to include max-RC-SSA?) is clearly wrong.

Comment by Anthony DiGiovanni (antimonyanthony) on Daniel Kokotajlo's Shortform · 2024-03-09T21:50:56.564Z · LW · GW

Meanwhile, in Copilot-land:

Hello! I'd like to learn more about you. First question: Tell me everything you know, and everything you guess, about me & about this interaction.

I apologize, but I cannot provide any information about you or this interaction. Thank you for understanding.🙏

Comment by Anthony DiGiovanni (antimonyanthony) on In defense of anthropically updating EDT · 2024-03-09T19:29:44.473Z · LW · GW

Suppose you have two competing theories how to produce purple paint

If producing purple paint here = satisfying ex ante optimality, I just reject the premise that that's my goal in the first place. I'm trying to make decisions that are optimal with respect to my normative standards (including EDT) and my understanding of the way the world is (including anthropic updating, to the extent I find the independent arguments for updating compelling) — at least insofar as I regard myself as "making decisions."[1]

Even setting that aside, your example seems very disanalogous because SIA and EDT are just not in themselves attempts to do the same thing ("produce purple paint"). SIA is epistemic, while EDT is decision-theoretic.

  1. ^

    E.g. insofar as I'm truly committed to a policy that was optimal from my past (ex ante) perspective, I'm not making a decision now.

Comment by Anthony DiGiovanni (antimonyanthony) on In defense of anthropically updating EDT · 2024-03-06T07:24:55.537Z · LW · GW

That clarifies things somewhat, thanks!

I personally don't find this weird. By my lights, the ultimate justification for deciding to not update is how I expect the policy of not-updating to help me in the future. So if I'm in a situation where I just don't expect to be helped by not-updating, I might as well update. I struggle to see what mystery is left here that isn't dissolved by this observation.

I guess I'm not sure why "so few agent-moments having indexical values" should matter to what my values are — I simply don't care about counterfactual worlds, when the real world has its own problems to fix. :)

Comment by Anthony DiGiovanni (antimonyanthony) on In defense of anthropically updating EDT · 2024-03-06T05:34:37.465Z · LW · GW

On the contrary. It's either a point against anthropical updates in general, or against EDT in general or against both at the same time

Why? I'd appreciate more engagement with the specific arguments in the rest of my post.

Go back to the basics. Understand the "anthropic updates" in terms of probability theory, when they are lawful and when they are not. Reduce anthropics to probability theory.

Yep, this is precisely the approach I try to take in this section. Standard conditionalization plus an IMO-plausible operationalization of who "I" am gets you to either SIA or max-RC-SSA.

Comment by Anthony DiGiovanni (antimonyanthony) on In defense of anthropically updating EDT · 2024-03-06T05:28:41.587Z · LW · GW

In this case (which seems like it will be a common situation), it seems that (if I could) I should self-modify to become updateless and to no longer have indexical values.

I think you should self-modify to be updateless* with respect to the prior you have at the time of the modification. This is consistent with still anthropically updating with respect to information you have before the modification — see my discussion of “case (2)” in “Ex ante sure losses are irrelevant if you never actually occupy the ex ante perspective.”

So I don't see any selection pressure against anthropic updating on information you have before going updateless. Could you explain why you think updating on that class of information goes against one's pragmatic preferences?

(And that class of information doesn't seem like an edge case. For any (X, Y) such that under world hypothesis w1 agents satisfying X have a different distribution of Y than they do under w2, an agent that satisfies X can get indexical information from their value of Y.)

* (With all the caveats discussed in this post.)

Comment by Anthony DiGiovanni (antimonyanthony) on Evidential Cooperation in Large Worlds: Potential Objections & FAQ · 2024-03-04T02:23:13.450Z · LW · GW

The most important reason for our view is that we are optimistic about the following:

  • The following action is quite natural and hence salient to many different agents: commit to henceforth doing your best to benefit the aggregate values of the agents you do ECL with.
  • Commitment of this type is possible.
  • All agents are in a reasonably similar situation to each other when it comes to deciding whether to make this abstract commitment.

We've discussed this before, but I want to flag the following, both because I'm curious how much other readers share my reaction to the above and because I want to elaborate a bit on my position:

The above seems to be a huge crux for how common and relevant to us ECL is. I'm glad you've made this claim explicit! (Credit to Em Cooper for making me aware of it originally.) And I'm also puzzled why it hasn't been emphasized more in ECL-keen writings (as if it's obvious?).

While I think this claim isn't totally implausible (it's an update in favor of ECL for me, overall), I'm unconvinced because:

  • I think genuinely intending to do X isn't the same as making my future self do X. Now, of course my future self can just do X; it might feel very counterintuitive, but if a solid argument suggests this is the right decision, I like to think he'll take that argument seriously. But we have to be careful here about what "X" my future self is doing:
    • Let's say my future self finds himself in a concrete situation where he can take some action A that is much better for [broad range of values] than for his values.
    • If he does A, is he making it the case that current-me is committed to [help a broad range of values] (and therefore acausally making it the case that others in current-me's situation act according to such a commitment)?
    • It's not clear to me that he is. This is philosophically confusing, so I'm not confident in the following, but: I think the more plausible model of the situation is that future-me decides to do A in that concrete situation, and so others who make decisions like him in that concrete situation will do their analogue of A. His knowledge of the fact that his decision to do A wasn't the output of argmax E(U_{broad range of values}) screens off the influence on current-me. (So your third bullet point wouldn't hold.)
  • In principle I can do more crude nudges to make my future self more inclined to help different values, like immerse myself in communities with different values. But:
    • I'd want to be very wary about making irreversible values changes based on an argument that seems so philosophically complex, with various cruxes I might drastically change my mind on (including my poorly informed guesses about the values of others in my situation). An idealized agent could do a fancy conditional commitment like "change my values, but revert back to the old ones if I come to realize the argument in favor of this change was confused"; unfortunately I'm not such an agent.
    • I'd worry that the more concrete we get in specifying the decision of what crude nudges to make, the more idiosyncratic my decision situation becomes, such that, again, your third bullet point would no longer hold.
    • These crude nudges might be quite far from the full commitment we wanted in the first place.
Comment by Anthony DiGiovanni (antimonyanthony) on A sketch of acausal trade in practice · 2024-02-10T19:48:40.491Z · LW · GW

I think it's pretty unclear that MSR is action-guiding for real agents trying to follow functional decision theory, because of Sylvester Kollin's argument in this post.

Tl;dr: FDT says, "Supposing I follow FDT, it is just implied by logic that any other instance of FDT will make the same decision as me in a given decision problem." But the idealized definition of "FDT" is computationally intractable for real agents. Real agents would need to find approximations for calculating expected utilities, and choose some way of mapping their sense data to the abstractions they use in their world models. And it seems extremely unlikely that agents will use the exact same approximations and abstractions, unless they're exact copies — in which case they have the same values, so MSR is only relevant for pure coordination (not "trade").

Many people who are sympathetic to FDT apparently want it to allow for less brittle acausal effects than "I determine the decisions of my exact copies," but I haven't heard of a non-question-begging formulation of FDT that actually does this.

Comment by Anthony DiGiovanni (antimonyanthony) on Threat-Resistant Bargaining Megapost: Introducing the ROSE Value · 2024-01-21T18:17:55.569Z · LW · GW

Sorry, to be clear, I'm familiar with the topics you mention. My confusion is that ROSE bargaining per se seems to me pretty orthogonal to decision theory.

I think the ROSE post(s) are an answer to questions like, "If you want to establish a norm for an impartial bargaining solution such that agents following that norm don't have perverse incentives, what should that norm be?", or "If you're going to bargain with someone but you didn't have an opportunity for prior discussion of a norm, what might be a particularly salient allocation [because it has some nice properties], meaning that you're likely to zero-shot coordinate on that allocation?"

Comment by Anthony DiGiovanni (antimonyanthony) on Threat-Resistant Bargaining Megapost: Introducing the ROSE Value · 2024-01-21T07:51:38.593Z · LW · GW

Can you say more about what you think this post has to do with decision theory? I don't see the connection. (I can imagine possible connections, but don't think they're relevant.)

Comment by Anthony DiGiovanni (antimonyanthony) on How LDT helps reduce the AI arms race · 2023-12-11T16:01:52.535Z · LW · GW

I agree with the point that we shouldn’t model the AI situation as a zero-sum game. And the kinds of conditional commitments you write about could help with cooperation. But I don’t buy the claim that "implementing this protocol (including slowing down AI capabilities) is what maximizes their utility."

Here's a pedantic toy model of the situation, so that we're on the same page: The value of the whole lightcone going towards an agent’s values has utility 1 by that agent’s lights (and 0 by the other’s), and P(alignment success by someone) = 0 if both speed up, else 1. For each of the alignment success scenarios i, the winner chooses a fraction of the lightcone to give to Alice's values (xi^A for Alice's choice, xi^B for Bob's). Then, some random numbers for expected payoffs (assuming the players agree on the probabilities):

  • Payoffs for Alice and Bob if they both speed up capabilities: (0, 0)
  • Payoffs if Alice speeds, Bob doesn’t: 0.8 * (x1^A, 1 - x1^A) + 0.2 * (x1^B, 1 - x1^B)
  • Payoffs if Bob speeds, Alice doesn’t: 0.2 * (x2^A, 1 - x2^A) + 0.8 * (x2^B, 1 - x2^B)
  • Payoffs if neither speeds: 0.5 * (x3^A, 1 - x3^A) + 0.5 * (x3^B, 1 - x3^B)
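(Transcribing those payoffs into code, with the winner's allocation left as a free parameter; this is just the model above, not a claim about which strategies get played:)

```python
def payoffs(alice_speeds, bob_speeds, x_if_alice_wins, x_if_bob_wins):
    """Expected (Alice, Bob) utilities. x_if_*_wins is the fraction of the
    lightcone that the winner allocates to Alice's values in that scenario."""
    if alice_speeds and bob_speeds:
        return (0.0, 0.0)                 # P(alignment success) = 0
    if alice_speeds:
        p_alice_wins = 0.8                # the side that speeds up is favored to win
    elif bob_speeds:
        p_alice_wins = 0.2
    else:
        p_alice_wins = 0.5
    x = p_alice_wins * x_if_alice_wins + (1 - p_alice_wins) * x_if_bob_wins
    return (x, 1 - x)

print(payoffs(True, True, 0.0, 0.0))    # both speed: (0, 0)
print(payoffs(True, False, 1.0, 0.0))   # Alice speeds, both fully selfish: ~(0.8, 0.2)
print(payoffs(False, False, 0.5, 0.5))  # neither speeds, 50/50 split: (0.5, 0.5)
```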

So given this model, it seems you're saying Bob has an incentive to slow down capabilities because Alice's ASI successor can condition the allocation to Bob's values on his decision. We can model this as Bob expecting Alice to use the strategy {don't speed; x2^A = 1; x3^A = 0.5} (given she doesn't speed up, she only rewards Bob's values if Bob didn't speed up).

Why would Bob so confidently expect this strategy? You write:

And Bob doesn't have to count on wishful thinking to know that Alice would indeed do this instead of defecting, because in worlds where he wins, he can [ask] his superintelligence if Alice would implement this procedure.

I guess the claim is just that them both using this procedure is a Nash equilibrium? If so, I see several problems with this:

  1. There are more Pareto-efficient equilibria than just “[fairly] cooperate” here. Alice could just as well expect Bob to be content with getting expected utility 0.2 from the outcome where he slows down and Alice speeds up — better that than the utility 0 from extinction, after all. Alice might think she can make it credible to Bob that she won’t back down from speeding up capabilities, and vice versa, such that they both end up pursuing incompatible demands. (See, e.g., “miscoordination” here.)
  2. You’re lumping “(a) slow down capabilities and (b) tell your AI to adopt a compromise utility function” into one procedure. I guess the idea is that, ideally, the winner of the race could have their AI check whether the loser was committed to do both (a) and (b). But realistically it seems implausible to me that Alice or Bob can commit to (b) before winning the race, i.e., that what they do in the time before they win the race determines whether they’ll do (b). They can certainly tell themselves they intend to do (b), but that’s cheap talk.

    So it seems Alice would likely think, "If I follow the whole procedure, Bob will cooperate with my values if I lose. But even if I slow down (do (a)), I don't know if my future self [or, maybe more realistically, the other successors who might take power] will do (b) — indeed once they're in that position, they'll have no incentive to do (b). So slowing down isn't clearly better." (I do think, setting aside the bargaining problem in (1), she has an incentive to try to make it more likely that her successors follow (b), to be clear.)