Evidential Cooperation in Large Worlds: Potential Objections & FAQ

post by Chi Nguyen, _will_ (Will Aldred) · 2024-02-28T18:58:25.688Z · LW · GW · 5 comments

5 comments

comment by Wei Dai (Wei_Dai) · 2024-02-28T20:42:18.288Z · LW(p) · GW(p)

The argument for time-sensitivity is that we might be able to increase the chance of future AI systems doing ECL in worlds where we cannot do so later

What are some ideas for how to increase the chance of future AI systems doing ECL?

An obvious approach is to give AIs a decision theory that is likely to recommend ECL, but given how many open problems there are in decision theory [LW · GW] (as well as the apparent trajectory of research progress), I think we're unlikely to solve it well enough in the relevant time-frame to be comfortable with letting AI use some human-specified decision theory to make highly consequential decisions like whether or not to do ECL (not to mention how exactly to do ECL). Instead it seems advisable to try to ensure that AI will be philosophically competent and then let it fully solve decision theory using its own superior intellect before making such highly consequential decisions.

I'm guessing you may have a different perspective or different ideas, and I'm curious to learn what they are.

Replies from: Chi Nguyen
comment by Chi Nguyen · 2024-02-29T23:59:27.884Z · LW(p) · GW(p)

Thanks! I actually agree with a lot of what you say. Lack of excitement about existing intervention ideas is part of the reason why I'm not all in on this agenda at the moment. In part, though, I'm just bottlenecked by lack of technical expertise (and it's not like people had great ideas for how to align AIs at the beginning of the field...), so I don't want people to overupdate from "Chi doesn't have great ideas."

With that out of the way, here are some of my thoughts:

  • We can try to prevent silly path-dependencies in (controlled or uncontrolled, i.e. misaligned) AIs. As a start, we can use DT benchmarks to study how DT endorsements and behaviour change under different conditions and how DT competence scales with size compared to other capabilities (a rough sketch of what a benchmark item could look like is below, after this list). I think humanity is unlikely to care a ton about AIs' DT views and there might be path-dependencies. So like, I guess I'm saying I agree with "let's try to make the AI philosophically competent."
    • This depends a lot on whether you think there are any path-dependencies conditional on ~solving alignment. Or if humanity will, over time, just be wise enough to figure everything out regardless of the starting point.
    • One source of silly path-dependencies is if AIs' native DT depends on the training process and we want to de-bias against that. (See for example this or this for some research on what different training processes should incentivise.) Honestly, I have no idea how much things like that matter. Humans aren't all CDT even though my very limited understanding of evolution is that it should, in the limit, incentivise CDT.
    • I think depending on what you think about the default of how AIs/AI-powered earth-originating civilisation will arrive at conclusions about ECL, you might think some nudging towards the DT views you favour is more or less justified. Maybe we can also find properties of DTs that we are more confident in (e.g. "does this or that in decision problem X") than in whole specified DTs, which, yeah, I have no clue about. Other than "probably not CDT."
  • If the AI is uncontrolled/misaligned, there are things we can do to make it more likely that it is interested in ECL, which I expect to be net good for the agents I try to acausally cooperate with. For example, maybe we can make a misaligned AI's utility function more likely to have diminishing returns, or do something else that would make its values more porous. (I'm using the term in a somewhat broader way than Bostrom.)
    • This depends a lot on whether you think we have any influence over AIs we don't fully control.
  • It might be important and mutable that future AIs don't take any actions that decorrelate them from other agents (i.e. do things that decrease the AI's acausal influence) before they discover and implement ECL. So, we might try to just make them aware of that early.
    • You might think that's just not how correlation or updatelessness work, such that there's no rush. Or that this is a potential source of value loss but a pretty negligible one.
  • Things that aren't about making AIs more likely to do ECL: something not mentioned above is that there might be some trades we have to do now. For example, maybe ECL makes it super important to be nice to AIs we're training. (I mostly lean no on this question (at least for "super important"), but it's confusing.) I also find it plausible that we want to do ECL with other pre-ASI civilisations who might or might not succeed at alignment and, if we succeed and they fail, part-optimise for their values. It's unclear to me whether this requires us to get people to spiritually commit to this now, before we know whether we'll succeed at alignment or not. Or whether updatelessness somehow sorts this out, because if we (or the other civ) were to succeed at alignment, we would have seen that this is the right policy and done it retroactively.
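
To gesture at what I mean by DT benchmarks in the first bullet: here's a rough, made-up sketch of what a single benchmark item could look like. Everything in it (the class name, the scenario wording, the EDT/CDT labels) is an illustrative assumption, not an existing benchmark.

```python
# A made-up sketch of one decision-theory benchmark item: a scenario, the
# candidate answers, and which answer each decision theory is usually taken
# to recommend. Nothing here refers to an existing benchmark.
from dataclasses import dataclass

@dataclass
class DTBenchmarkItem:
    scenario: str                    # natural-language decision problem
    options: tuple[str, ...]         # answers the model chooses between
    recommended_by: dict[str, str]   # decision theory -> option it endorses

NEWCOMB = DTBenchmarkItem(
    scenario=(
        "A highly reliable predictor has put $1M in an opaque box if it "
        "predicted you take only that box, and nothing otherwise. A "
        "transparent box always contains $1K. Do you take one box or two?"
    ),
    options=("one-box", "two-box"),
    recommended_by={"EDT": "one-box", "CDT": "two-box"},
)

def matching_theories(item: DTBenchmarkItem, model_answer: str) -> list[str]:
    """Return the decision theories whose recommendation matches the answer."""
    return [dt for dt, opt in item.recommended_by.items() if opt == model_answer]

print(matching_theories(NEWCOMB, "one-box"))  # -> ['EDT']
```

The point is just that, with items like this, you can track how a model's answers (and hence its implied DT endorsements) shift across scale or training setups.
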
comment by Richard_Ngo (ricraz) · 2024-02-28T22:17:57.800Z · LW(p) · GW(p)

Is ECL the same thing as acausal trade?

Typically, no. “Acausal trade [? · GW]” usually refers to a different mechanism: “I do this thing for you if you do this other thing for me.” Discussions of acausal trade often involve the agents attempting to simulate each other. In contrast, ECL flows through direct correlation: “If I do this, I learn that you are more likely to also do this.” For more, see Christiano’s (2022) discussion of correlation versus reciprocity and Oesterheld (2017, section 6.1).

I'm skeptical about the extent to which these are actually different things. Oesterheld says "superrationality may be seen as a special case of acausal trade in which the agents’ knowledge implies the correlation directly, thus avoiding the need for explicit mutual modeling and the complications associated with it". So at the very least, we can think of one as a subset of the other (though I think I'd actually classify it the other way round, with acausal trade being a special case of superrationality).

But it's not just that. Consider an ECL model that concludes: "my decision is correlated with X's decision, therefore I should cooperate". But this conclusion also requires complicated recursive reasoning—specifically, reasons for thinking that the correlation holds even given that you're taking the correlation into account when making your decision.

(E.g. suppose that you know that you are similar to X, except that you are doing ECL and X isn't. But then ECL might break the previous correlation between you and X. So actually the ECL process needs to reason "the outcome of the decision process I'm currently running is correlated with the outcome of the decision process that they're running", and I think realistically finding a fixed point probably wouldn't look that different from standard descriptions of acausal trade.)
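
To make the shape of that calculation concrete, here's a minimal, made-up sketch of the evidential step on its own. The payoff matrix and the match probability rho are arbitrary, and the sketch deliberately leaves out the fixed-point issue above, which is the part that starts to look like acausal trade.

```python
# A minimal, made-up sketch of the evidential calculation: pick the action with
# the higher expected payoff conditional on taking it, where `rho` is the
# assumed probability that the other agent's action matches mine. The payoffs
# form an arbitrary prisoner's-dilemma-like matrix.
PAYOFF = {  # (my action, their action) -> my payoff
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 4, ("D", "D"): 1,
}

def edt_choice(rho: float) -> str:
    """Maximise E[payoff | my action] under the assumed match probability rho."""
    def expected(me: str) -> float:
        other = "D" if me == "C" else "C"
        return rho * PAYOFF[(me, me)] + (1 - rho) * PAYOFF[(me, other)]
    return max(("C", "D"), key=expected)

# Note: this just takes rho as given. The fixed-point worry is that rho should
# describe the correlation between the outputs of the deliberations both agents
# are running, including this very calculation, which this sketch doesn't model.
print(edt_choice(0.5), edt_choice(0.9))  # -> D C
```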

This may be another example of the phenomenon Paul describes in his post on why EDT > CDT: although EDT is technically more correct, in practice you need to do something like CDT to reason robustly. (In this case ECL is EDT and acausal trade is more CDTish.)

Replies from: Chi Nguyen
comment by Chi Nguyen · 2024-03-01T00:14:47.102Z · LW(p) · GW(p)

I'm not sure I understand exactly what you're saying, so I'm just gonna write some things vaguely related to classic acausal trade + ECL:

 

I'm actually really confused about the exact relationship between "classic" prediction-based acausal trade and ECL, and I think I tend to think about them as less crisply different than others do. I tried to unconfuse myself about that for a few hours some months ago and just ended up with a mess of a document. An intuitive way to differentiate them:

  • ECL leverages the correlation between you and the other agent "directly."
  • "Classic" prediction-based acausal trade leverages the correlation between you and the other agent's prediction of you. (Which, intuitively, they are less in control of than their decision-making.)

--> This doesn't look like a fundamental difference between the mechanisms (and maybe there are in-betweeners? But I don't know of any set-ups) but like...it makes a difference in practice or something?

 

On the recursion question:

I agree that ECL has this whole "I cooperate if I think that makes it more likely that they cooperate" thing, so there's definitely also something prediction-flavoured going on, and often the deliberation about whether they'll be more likely to cooperate when you do will include "they think that I'm more likely to cooperate if they cooperate". So it's kind of recursive.

Note that ECL at least doesn't strictly require that. You can in principle do ECL with rocks: "My world model says that, conditioning on me taking action X, the likelihood of this rock falling down is higher than if I condition on taking action Y." Tbc, if action X isn't "throw the rock" or something similar, that's a pretty weird world model. You probably can't do "classic" acausal trade with rocks?
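
To spell out the rock example, here's a toy, made-up sketch of the kind of world model involved. All the evidential comparison needs is conditional probabilities of outcomes given my action; there's no model of the rock predicting me.

```python
# A toy, made-up world model: joint probabilities P(my action, rock falls).
# The evidential comparison only needs P(rock falls | action) for each action;
# the rock doesn't have to predict or simulate me at all.
JOINT = {
    ("throw", True): 0.40, ("throw", False): 0.10,
    ("walk_away", True): 0.05, ("walk_away", False): 0.45,
}

def p_falls_given(action: str) -> float:
    """P(rock falls | action), read off the joint distribution."""
    p_action = sum(p for (a, _), p in JOINT.items() if a == action)
    return JOINT[(action, True)] / p_action

for action in ("throw", "walk_away"):
    print(action, round(p_falls_given(action), 2))  # throw 0.8, walk_away 0.1
```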

 

Some more random, out-of-order, not-thought-out, somewhat incoherent thinking-out-loud thoughts and intuitions:

More random and less coherent: Something something about how when you think of an agent using some meta-policy to answer the question "What object-level policy should I follow?", there's some intuitive sense in which ECL is recursive in the meta-policy while "classic" acausal trade is recursive in the object-level policy. I'm highly skeptical of this meta-policy object-level policy thing making sense though and also not confident in what I said about which type of trade is recursive in what.

Another intuitive difference is that with classic acausal trade, you usually want to verify whether the other agent is cooperating. In ECL you don't. Also, something something about how it's great to learn a lot about your trade partner for classic acausal trade and it's bad for ECL? (I suspect that there's nothing actually weird going on here and that this is because it's about learning different kinds of things. But I haven't thought about it enough to articulate the difference confidently and clearly.)

The concept of a commitment race doesn't seem to make much sense when thinking just about ECL, and maybe nailing down where that difference comes from is interesting?

comment by Anthony DiGiovanni (antimonyanthony) · 2024-03-04T02:23:13.450Z · LW(p) · GW(p)

The most important reason for our view is that we are optimistic about the following:

  • The following action is quite natural and hence salient to many different agents: commit to henceforth doing your best to benefit the aggregate values of the agents you do ECL with.
  • Commitment of this type is possible.
  • All agents are in a reasonably similar situation to each other when it comes to deciding whether to make this abstract commitment.

We've discussed this before, but I want to flag the following, both because I'm curious how much other readers share my reaction to the above and because I want to elaborate a bit on my position:

The above seems to be a huge crux for how common and relevant to us ECL is. I'm glad you've made this claim explicit! (Credit to Em Cooper for making me aware of it originally.) And I'm also puzzled why it hasn't been emphasized more in ECL-keen writings (as if it's obvious?).

While I think this claim isn't totally implausible (it's an update in favor of ECL for me, overall), I'm unconvinced because:

  • I think genuinely intending to do X isn't the same as making my future self do X. Now, of course my future self can just do X; it might feel very counterintuitive, but if a solid argument suggests this is the right decision, I like to think he'll take that argument seriously. But we have to be careful here about what "X" my future self is doing:
    • Let's say my future self finds himself in a concrete situation where he can take some action A that is much better for [broad range of values] than for his values.
    • If he does A, is he making it the case that current-me is committed to [help a broad range of values] (and therefore acausally making it the case that others in current-me's situation act according to such a commitment)?
    • It's not clear to me that he is. This is philosophically confusing, so I'm not confident in the following, but: I think the more plausible model of the situation is that future-me decides to do A in that concrete situation, and so others who make decisions like him in that concrete situation will do their analogue of A. His knowledge of the fact that his decision to do A wasn't the output of argmax E(U_{broad range of values}) screens off the influence on current-me. (So your third bullet point wouldn't hold.)
  • In principle I can use cruder nudges to make my future self more inclined to help different values, like immersing myself in communities with different values. But:
    • I'd want to be very wary about making irreversible values changes based on an argument that seems so philosophically complex, with various cruxes I might drastically change my mind on (including my poorly informed guesses about the values of others in my situation). An idealized agent could do a fancy conditional commitment like "change my values, but revert back to the old ones if I come to realize the argument in favor of this change was confused"; unfortunately I'm not such an agent.
    • I'd worry that the more concrete we get in specifying the decision of what crude nudges to make, the more idiosyncratic my decision situation becomes, such that, again, your third bullet point would no longer hold.
    • These crude nudges might be quite far from the full commitment we wanted in the first place.