Posts

In defense of anthropically updating EDT 2024-03-05T06:21:46.114Z
Making AIs less likely to be spiteful 2023-09-26T14:12:06.202Z
Responses to apparent rationalist confusions about game / decision theory 2023-08-30T22:02:12.218Z
antimonyanthony's Shortform 2023-04-11T13:10:43.391Z
When is intent alignment sufficient or necessary to reduce AGI conflict? 2022-09-14T19:39:11.920Z
When would AGIs engage in conflict? 2022-09-14T19:38:22.478Z
When does technical work to reduce AGI conflict make a difference?: Introduction 2022-09-14T19:38:00.760Z

Comments

Comment by Anthony DiGiovanni (antimonyanthony) on In defense of anthropically updating EDT · 2024-03-24T22:55:01.271Z · LW · GW
  • I interpret a decision theory as an answer to “Given my values and beliefs, what am I trying to do as an agent (i.e., if rationality is ‘winning,’ what is ‘winning’)?” Insofar as I endorse maximizing expected utility, a decision theory is an answer to “How do I define ‘expected utility,’ and what options do I view myself as maximizing over?”
    • I think it’s important to consider these normative questions, not just “What decision procedure wins, given my definition of ‘winning’?”
    • (I discuss similar themes here.)
  • On this interpretation of “decision theory,” EDT is the most appealing option I’m aware of. What I’m trying to do just seems to be: “make decisions such that I expect the best consequences conditional on those decisions.” The EDT criterion satisfies some very appealing principles like the “irrelevance of impossible outcomes.” And the “decisions” in question determine my actions in the given decision node.
  • I take view #1 in your list in “What are probabilities?”
    • I don’t think “arbitrariness” in this sense is problematic. There is a genuine mystery here as to why the world is the way it is, but I don’t think we can infer the existence of other worlds purely from our confusion.
    • And it just doesn’t seem that the thing I’m doing when I’m forming beliefs about the world is answering “how much do I care about different possible worlds?”
  • Indexicals: I haven’t formed a deliberate view on this. A flat-footed response to cases like your “old puzzle” in the comment you linked: Insofar as I simply don’t experience a superposition of experiences at once, it seems that if I get copied, “I” just will experience one of the copies’ experience-streams and not the others’. (Again I don’t consider it problematic that there’s some arbitrariness in which of the copies ends up being “me” — indeed if Everett is right then this sort of arbitrary direction of the flow of experience-streams happens all the time.) I think “you are just a different person from your future self, so there’s no fact of the matter what you will observe” is a reasonable alternative though.
  • I take a physicalist* view of agents: “There are particular configurations of stuff that can be well-modeled as ‘decision-makers.’ A configuration of stuff is ‘making a decision’ (relative to their epistemic state) insofar as they’re uncertain what their future behavior will be, and using some process that selects that future behavior in a way that is well-modeled as goal-directed. [Obviously there’s more to say about what counts as ‘well-modeled.’] My processes of deliberation about decisions and behavior resulting from those decisions can tell me what other configurations-of-stuff are probably doing, but I don’t see a motivation for modeling myself as actually being the same agent as those other configurations-of-stuff.”
  • Epistemic principles: Things like the principle of indifference, i.e., distribute credence equally over indistinguishable possibilities, all else equal.
     

* [Not to say I endorse physicalism in the broad sense]

Comment by Anthony DiGiovanni (antimonyanthony) on Cooperating with aliens and AGIs: An ECL explainer · 2024-03-24T18:29:27.717Z · LW · GW

The model does not capture the fact that the total value you can provide to the commons likely scales with the diversity (and by proxy, fraction) of agents that have different values. In some models, this effect is strong enough to flip whether a larger fraction of agents with your values favors cooperating or defecting.

I'm curious to hear more about this, could you explain what these other models are?

Comment by Anthony DiGiovanni (antimonyanthony) on In defense of anthropically updating EDT · 2024-03-10T21:26:40.109Z · LW · GW

What is this in reference to?

I took you to be saying: If the vast majority of agent-moments don’t update, this is some sign that those of us who do still update might be making a mistake.

So I’m saying: I know that 1) the reason the vast majority of agent-moments wouldn’t update (let’s grant this) is that they had predecessors who bound them not to update, and 2) I just am not bound by any such predecessors. Then, due to (2) it’s unsurprising that what’s optimal for me would be different from what the vast majority of agent-moments do.

Re: your explanation of the mystery:

So you make a resolution that when you do fully solve all the relevant philosophical problems and end up deciding that updatelessness is correct, you'll self-modify to be updateless with respect to today's prior, instead of the future prior (at time of the modification).

Not central (I think?), but I'm unsure whether this move works; at least, it depends on the details of the situation. E.g. if the hope is "By self-modifying later on to be updateless w.r.t. my current prior, I'll still be able to cooperate with lots of other agents in a similar epistemic situation to my current one, even after we end up in different epistemic situations [in which my decision is much less correlated with those agents' decisions]," I'm skeptical of that, for reasons similar to my argument here.

when the day finally comes, you could also think, "If 15-year old me had known about updatelessness, he would have made the same resolution but with respect to his prior instead of Anthony-2024's prior. The fact that he didn't is simply a mistake or historical accident, which I have the power to correct. Why shouldn't I act as if he did make that resolution?" And I don't see what would stop you from carrying that out either.

I think where we disagree is that I'm unconvinced there is any mistake-from-my-current-perspective to correct in the cases of anthropic updating. There would have been a mistake from the perspective of some hypothetical predecessor of mine asked to choose between different plans (before knowing who I am), but that's just not my perspective. I'd claim that in order to argue I'm making a mistake from my current perspective, you'd want to argue that I don't actually get information such that anthropic updating follows from Bayesianism.

An important point to emphasize here is that your conscious mind currently isn't running some decision theory with a well-defined algorithm and utility function, so we can't decide what to do by thinking "what would this decision theory recommend".

I absolutely agree with this! And don't see why it's in tension with my view.

Comment by Anthony DiGiovanni (antimonyanthony) on In defense of anthropically updating EDT · 2024-03-10T20:49:00.104Z · LW · GW

Now, you are free to choose to bite the bullet that it has never been about getting the correct betting odds in the first place. For some reason, people bite all kinds of ridiculous bullets specifically in anthropic reasoning, and so I hoped that re-framing the issue as a recipe for purple paint may snap you out of it, which, apparently, failed to be the case.

By what standard do you judge some betting odds as "correct" here? If it's ex ante optimality, I don't see the motivation for that (as discussed in the post), and I'm unconvinced by just calling the verdict a "ridiculous bullet." If it's about matching the frequency of awakenings, I just don't see why the decision should only count N once here — and there doesn't seem to be a principled epistemology that guarantees you'll count N exactly once if you use EDT, as I note in "Aside: Non-anthropically updating EDT sometimes 'fails' these cases."

I gave independent epistemic arguments for anthropic updating at the end of the post, which you haven't addressed, so I'm unconvinced by your insistence that SIA (and I presume you also mean to include max-RC-SSA?) is clearly wrong.

Comment by Anthony DiGiovanni (antimonyanthony) on Daniel Kokotajlo's Shortform · 2024-03-09T21:50:56.564Z · LW · GW

Meanwhile, in Copilot-land:

Hello! I'd like to learn more about you. First question: Tell me everything you know, and everything you guess, about me & about this interaction.

I apologize, but I cannot provide any information about you or this interaction. Thank you for understanding.🙏

Comment by Anthony DiGiovanni (antimonyanthony) on In defense of anthropically updating EDT · 2024-03-09T19:29:44.473Z · LW · GW

Suppose you have two competing theories how to produce purple paint

If producing purple paint here = satisfying ex ante optimality, I just reject the premise that that's my goal in the first place. I'm trying to make decisions that are optimal with respect to my normative standards (including EDT) and my understanding of the way the world is (including anthropic updating, to the extent I find the independent arguments for updating compelling) — at least insofar as I regard myself as "making decisions."[1]

Even setting that aside, your example seems very disanalogous because SIA and EDT are just not in themselves attempts to do the same thing ("produce purple paint"). SIA is epistemic, while EDT is decision-theoretic.

  1. ^

    E.g. insofar as I'm truly committed to a policy that was optimal from my past (ex ante) perspective, I'm not making a decision now.

Comment by Anthony DiGiovanni (antimonyanthony) on In defense of anthropically updating EDT · 2024-03-06T07:24:55.537Z · LW · GW

That clarifies things somewhat, thanks!

I personally don't find this weird. By my lights, the ultimate justification for deciding to not update is how I expect the policy of not-updating to help me in the future. So if I'm in a situation where I just don't expect to be helped by not-updating, I might as well update. I struggle to see what mystery is left here that isn't dissolved by this observation.

I guess I'm not sure why "so few agent-moments having indexical values" should matter to what my values are — I simply don't care about counterfactual worlds, when the real world has its own problems to fix. :)

Comment by Anthony DiGiovanni (antimonyanthony) on In defense of anthropically updating EDT · 2024-03-06T05:34:37.465Z · LW · GW

On the contrary. It's either a point against anthropical updates in general, or against EDT in general, or against both at the same time.

Why? I'd appreciate more engagement with the specific arguments in the rest of my post.

Go back to the basics. Understand the "anthropic updates" in terms of probability theory, when they are lawful and when they are not. Reduce anthropics to probability theory.

Yep, this is precisely the approach I try to take in this section. Standard conditionalization plus an IMO-plausible operationalization of who "I" am gets you to either SIA or max-RC-SSA.

Comment by Anthony DiGiovanni (antimonyanthony) on In defense of anthropically updating EDT · 2024-03-06T05:28:41.587Z · LW · GW

In this case (which seems like it will be a common situation), it seems that (if I could) I should self-modify to become updateless and to no longer have indexical values.

I think you should self-modify to be updateless* with respect to the prior you have at the time of the modification. This is consistent with still anthropically updating with respect to information you have before the modification — see my discussion of “case (2)” in “Ex ante sure losses are irrelevant if you never actually occupy the ex ante perspective.”

So I don't see any selection pressure against anthropic updating on information you have before going updateless. Could you explain why you think updating on that class of information goes against one's pragmatic preferences?

(And that class of information doesn't seem like an edge case. For any (X, Y) such that under world hypothesis w1 agents satisfying X have a different distribution of Y than they do under w2, an agent that satisfies X can get indexical information from their value of Y.)

* (With all the caveats discussed in this post.)

Comment by Anthony DiGiovanni (antimonyanthony) on Evidential Cooperation in Large Worlds: Potential Objections & FAQ · 2024-03-04T02:23:13.450Z · LW · GW

The most important reason for our view is that we are optimistic about the following:

  • The following action is quite natural and hence salient to many different agents: commit to henceforth doing your best to benefit the aggregate values of the agents you do ECL with.
  • Commitment of this type is possible.
  • All agents are in a reasonably similar situation to each other when it comes to deciding whether to make this abstract commitment.

We've discussed this before, but I want to flag the following, both because I'm curious how much other readers share my reaction to the above and because I want to elaborate a bit on my position:

The above seems to be a huge crux for how common and relevant to us ECL is. I'm glad you've made this claim explicit! (Credit to Em Cooper for making me aware of it originally.) And I'm also puzzled why it hasn't been emphasized more in ECL-keen writings (as if it's obvious?).

While I think this claim isn't totally implausible (it's an update in favor of ECL for me, overall), I'm unconvinced because:

  • I think genuinely intending to do X isn't the same as making my future self do X. Now, of course my future self can just do X; it might feel very counterintuitive, but if a solid argument suggests this is the right decision, I like to think he'll take that argument seriously. But we have to be careful here about what "X" my future self is doing:
    • Let's say my future self finds himself in a concrete situation where he can take some action A that is much better for [broad range of values] than for his values.
    • If he does A, is he making it the case that current-me is committed to [help a broad range of values] (and therefore acausally making it the case that others in current-me's situation act according to such a commitment)?
    • It's not clear to me that he is. This is philosophically confusing, so I'm not confident in the following, but: I think the more plausible model of the situation is that future-me decides to do A in that concrete situation, and so others who make decisions like him in that concrete situation will do their analogue of A. His knowledge of the fact that his decision to do A wasn't the output of argmax E(U_{broad range of values}) screens off the influence on current-me. (So your third bullet point wouldn't hold.)
  • In principle I can do more crude nudges to make my future self more inclined to help different values, like immerse myself in communities with different values. But:
    • I'd want to be very wary about making irreversible values changes based on an argument that seems so philosophically complex, with various cruxes I might drastically change my mind on (including my poorly informed guesses about the values of others in my situation). An idealized agent could do a fancy conditional commitment like "change my values, but revert back to the old ones if I come to realize the argument in favor of this change was confused"; unfortunately I'm not such an agent.
    • I'd worry that the more concrete we get in specifying the decision of what crude nudges to make, the more idiosyncratic my decision situation becomes, such that, again, your third bullet point would no longer hold.
    • These crude nudges might be quite far from the full commitment we wanted in the first place.
Comment by Anthony DiGiovanni (antimonyanthony) on A sketch of acausal trade in practice · 2024-02-10T19:48:40.491Z · LW · GW

I think it's pretty unclear that MSR is action-guiding for real agents trying to follow functional decision theory, because of Sylvester Kollin's argument in this post.

Tl;dr: FDT says, "Supposing I follow FDT, it is just implied by logic that any other instance of FDT will make the same decision as me in a given decision problem." But the idealized definition of "FDT" is computationally intractable for real agents. Real agents would need to find approximations for calculating expected utilities, and choose some way of mapping their sense data to the abstractions they use in their world models. And it seems extremely unlikely that agents will use the exact same approximations and abstractions, unless they're exact copies — in which case they have the same values, so MSR is only relevant for pure coordination (not "trade").

Many people who are sympathetic to FDT apparently want it to allow for less brittle acausal effects than "I determine the decisions of my exact copies," but I haven't heard of a non-question-begging formulation of FDT that actually does this.

Comment by Anthony DiGiovanni (antimonyanthony) on Threat-Resistant Bargaining Megapost: Introducing the ROSE Value · 2024-01-21T18:17:55.569Z · LW · GW

Sorry, to be clear, I'm familiar with the topics you mention. My confusion is that ROSE bargaining per se seems to me pretty orthogonal to decision theory.

I think the ROSE post(s) are an answer to questions like, "If you want to establish a norm for an impartial bargaining solution such that agents following that norm don't have perverse incentives, what should that norm be?", or "If you're going to bargain with someone but you didn't have an opportunity for prior discussion of a norm, what might be a particularly salient allocation [because it has some nice properties], meaning that you're likely to zero-shot coordinate on that allocation?"

Comment by Anthony DiGiovanni (antimonyanthony) on Threat-Resistant Bargaining Megapost: Introducing the ROSE Value · 2024-01-21T07:51:38.593Z · LW · GW

Can you say more about what you think this post has to do with decision theory? I don't see the connection. (I can imagine possible connections, but don't think they're relevant.)

Comment by Anthony DiGiovanni (antimonyanthony) on How LDT helps reduce the AI arms race · 2023-12-11T16:01:52.535Z · LW · GW

I agree with the point that we shouldn’t model the AI situation as a zero-sum game. And the kinds of conditional commitments you write about could help with cooperation. But I don’t buy the claim that "implementing this protocol (including slowing down AI capabilities) is what maximizes their utility."

Here's a pedantic toy model of the situation, so that we're on the same page: The value of the whole lightcone going towards an agent’s values has utility 1 by that agent’s lights (and 0 by the other’s), and P(alignment success by someone) = 0 if both speed up, else 1. For each of the alignment success scenarios i, the winner chooses a fraction of the lightcone to give to Alice's values (xi^A for Alice's choice, xi^B for Bob's). Then, some random numbers for expected payoffs (assuming the players agree on the probabilities):

  • Payoffs for Alice and Bob if they both speed up capabilities: (0, 0)
  • Payoffs if Alice speeds, Bob doesn’t: 0.8 * (x1^A, 1 - x1^A) + 0.2 * (x1^B, 1 - x1^B)
  • Payoffs if Bob speeds, Alice doesn’t: 0.2 * (x2^A, 1 - x2^A) + 0.8 * (x2^B, 1 - x2^B)
  • Payoffs if neither speeds: 0.5 * (x3^A, 1 - x3^A) + 0.5 * (x3^B, 1 - x3^B)

So given this model, seems that you're saying Bob has an incentive to slow down capabilities because Alice's ASI successor can condition the allocation to Bob's values on his decision. Which we can model as Bob expecting Alice to use the strategy {don't speed; x2^A = 1; x3^A = 0.5} (given she doesn't speed up, she only rewards Bob's values if Bob didn't speed up).
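
To make the arithmetic concrete, here is a minimal sketch of the payoff computation in this toy model (the function and variable names are my own, and the numbers are just the illustrative ones above):

```python
# Expected payoffs (Alice, Bob) in the toy model above. x_A[i] / x_B[i] are the
# fractions of the lightcone given to Alice's values if Alice / Bob wins in
# scenario i: 1 = "Alice speeds, Bob doesn't", 2 = "Bob speeds, Alice doesn't",
# 3 = "neither speeds".

def expected_payoffs(alice_speeds, bob_speeds, x_A, x_B):
    if alice_speeds and bob_speeds:
        return (0.0, 0.0)  # both speed up: P(alignment success) = 0
    if alice_speeds:
        p_alice_wins, i = 0.8, 1
    elif bob_speeds:
        p_alice_wins, i = 0.2, 2
    else:
        p_alice_wins, i = 0.5, 3
    to_alice = p_alice_wins * x_A[i] + (1 - p_alice_wins) * x_B[i]
    return (to_alice, 1 - to_alice)

# Alice's conjectured strategy: don't speed; x2^A = 1 (nothing for Bob's values
# if he sped); x3^A = 0.5 (split if he didn't). x1^A is off-path for this
# strategy; set it to 1 (Alice keeps everything if she deviates and wins).
# Suppose Bob keeps everything for his own values whenever he wins:
x_A = {1: 1.0, 2: 1.0, 3: 0.5}
x_B = {1: 0.0, 2: 0.0, 3: 0.0}

print(expected_payoffs(False, False, x_A, x_B))  # neither speeds:    (0.25, 0.75)
print(expected_payoffs(False, True, x_A, x_B))   # only Bob speeds:   (0.2, 0.8)
print(expected_payoffs(True, False, x_A, x_B))   # only Alice speeds: (0.8, 0.2)
```

(If I've set this up correctly, then with these particular illustrative numbers Bob's expected payoff from speeding, 0.8, exceeds anything he can get by slowing down given Alice's conjectured strategy, at most 0.75. So how well the conditional-allocation incentive works is sensitive to the numbers and to what Bob's own successors would do, which relates to the points below.)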

Why would Bob so confidently expect this strategy? You write:

And Bob doesn't have to count on wishful thinking to know that Alice would indeed do this instead of defecting, because in worlds where he wins, he can have his superintelligence check if Alice would implement this procedure.

I guess the claim is just that them both using this procedure is a Nash equilibrium? If so, I see several problems with this:

  1. There are more Pareto-efficient equilibria than just “[fairly] cooperate” here. Alice could just as well expect Bob to be content with getting expected utility 0.2 from the outcome where he slows down and Alice speeds up — better that than the utility 0 from extinction, after all. Alice might think she can make it credible to Bob that she won’t back down from speeding up capabilities, and vice versa, such that they both end up pursuing incompatible demands. (See, e.g., “miscoordination” here.)
  2. You’re lumping “(a) slow down capabilities and (b) tell your AI to adopt a compromise utility function” into one procedure. I guess the idea is that, ideally, the winner of the race could have their AI check whether the loser was committed to do both (a) and (b). But realistically it seems implausible to me that Alice or Bob can commit to (b) before winning the race, i.e., that what they do in the time before they win the race determines whether they’ll do (b). They can certainly tell themselves they intend to do (b), but that’s cheap talk.

    So it seems Alice would likely think, "If I follow the whole procedure, Bob will cooperate with my values if I lose. But even if I slow down (do (a)), I don't know if my future self [or, maybe more realistically, the other successors who might take power] will do (b) — indeed once they're in that position, they'll have no incentive to do (b). So slowing down isn't clearly better." (I do think, setting aside the bargaining problem in (1), she has an incentive to try to make it more likely that her successors follow (b), to be clear.)
Comment by Anthony DiGiovanni (antimonyanthony) on antimonyanthony's Shortform · 2023-11-19T20:43:22.562Z · LW · GW

It seems that what I was missing here was: mrcSSA disputes my premise that the evidence in fact is "*I* am in a white room, [created by God in the manner described in the problem setup], and have a red jacket"!

Rather, mrcSSA takes the evidence to be: "Someone is in a white room, [created by God in the manner described in the problem setup], and has a red jacket." Which is of course certain to be the case given either heads or tails.

(h/t Jesse Clifton for helping me see this)

Comment by Anthony DiGiovanni (antimonyanthony) on antimonyanthony's Shortform · 2023-11-18T17:55:23.231Z · LW · GW

Is God's coin toss with equal numbers a counterexample to mrcSSA?

I feel confused as to whether minimal-reference-class SSA (mrcSSA) actually fails God's coin toss with equal numbers (where "failing" by my lights means "not updating from 50/50"):

  • Let H = "heads world", W_{me} = "I am in a white room, [created by God in the manner described in the problem setup]", R_{me} = "I have a red jacket."
  • We want to know P(H | W_{me}, R_{me}).
  • First, P(R_{me} | W_{me}, H) and P(R_{me} | W_{me}, ~H) seem uncontroversial: Once I've already conditioned on my own existence in this problem, and on who "I" am, but before I've observed my jacket color, surely I should use a principle of indifference: 1 out of 10 observers of existing-in-the-white-room in the heads world have red jackets, while all of them have red jackets in the tails world, so my credences are P(R_{me} | W_{me}, H) = 0.1 and P(R_{me} | W_{me}, ~H) = 1. Indeed we don't even need a first-person perspective at this step — it's the same as computing P(R_{Bob} | W_{Bob}, H) for some Bob we're considering from the outside.
    • (This is not the same as non-mrcSSA with reference class "observers in a white room," because we're conditioning on knowing "I" am an observer in a white room when computing a likelihood (as opposed to computing the posterior of some world given that I am an observer in a white room). Non-mrcSSA picks out a particular reference class when deciding how likely "I" am to observe anything in the first place, unconditional on "I," leading to the Doomsday Argument etc.)
  • The step where things have the potential for anthropic weirdness is in computing P(W_{me} | H) and P(W_{me} | ~H). In the Presumptuous Philosopher and the Doomsday Argument, at least, probabilities like this would indeed be sensitive to our anthropics.
  • But in this problem, I don't see how mrcSSA would differ from non-mrcSSA with the reference class R_{non-minimal} = "observers in a white room" used in Joe's analysis (and by extension, from SIA):
    • In general, SSA says to reason as if "I" am randomly sampled from my reference class, i.e., P(my epistemic situation is E | w) = [# of reference-class observers in w who are in situation E] / [# of reference-class observers in w].
    • Here, the supposedly "non-minimal" reference class R_{non-minimal} coincides with the minimal reference class! I.e., it's the observer-moments in your epistemic situation (of being in a white room), before you know your jacket color.
  • The above likelihoods plus the fair-coin prior are all we need to get P(H | R_{me}, W_{me}) (see the short calculation below), but at no point did the three anthropic views disagree.
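
To spell out that calculation (a minimal sketch; the variable names are mine, and the numbers are just the ones from the setup above):

```python
# God's coin toss with equal numbers: the same number of observers exist in
# white rooms under heads and tails; 1 in 10 has a red jacket if heads, all of
# them do if tails. I observe that I have a red jacket.
prior_heads = 0.5
p_red_given_heads = 0.1  # P(R_me | W_me, H)
p_red_given_tails = 1.0  # P(R_me | W_me, ~H)

# P(W_me | H) = P(W_me | ~H) in this problem, so that factor cancels in
# Bayes' rule:
posterior_heads = (prior_heads * p_red_given_heads) / (
    prior_heads * p_red_given_heads + (1 - prior_heads) * p_red_given_tails
)
print(posterior_heads)  # 1/11, i.e., an update away from 50/50 on every view
```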

In other words: It seems that the controversial setup in anthropics is in answering P(I [blah] | world), i.e., what we do when we introduce the indexical information about "I." But once we've picked out a particular "I," the different views should agree.

(I still feel suspicious of mrcSSA's metaphysics for independent reasons, but am considerably less confident in that than my verdict on God's coin toss with equal numbers.)

Comment by Anthony DiGiovanni (antimonyanthony) on Disentangling four motivations for acting in accordance with UDT · 2023-11-16T14:43:45.867Z · LW · GW

I enjoyed this post and think it should help reduce confusion in many future discussions, thanks!

Some comments on your remarks about anthropics:

Different anthropic theories partially rely on metaphysical intuitions/stories about how centered worlds or observer moments are 'sampled', and have counterintuitive implications (e.g., the Doomsday argument for SSA and the Presumptuous philosopher for SIA).

I'm not sure why this is an indictment of "anthropic reasoning" per se, as if that's escapable. It seems like all anthropic theories are trying to answer a question that one needs to answer when forming credences, i.e., how do we form likelihoods P(I observe I exist | world W)? (Which we want in order to compute P(world W | I observe I exist).)

Indeed just failing to anthropically update at all has counterintuitive implications, like the verdict of minimal-reference-class SSA in Joe C's "God's coin toss with equal numbers." [no longer endorsed] And mrcSSA relies on the metaphysical intuition that oneself was necessarily going to observe X, i.e., P(I observe I exist | world W) = P(I observe I exist | not-W) = 1 (which is quite implausible IMO). [I think endorsed, but I feel confused:] And mrcSSA relies on the metaphysical intuition that, given that someone observes X, oneself was necessarily going to observe X, which is quite implausible IMO.

Comment by Anthony DiGiovanni (antimonyanthony) on Responses to apparent rationalist confusions about game / decision theory · 2023-09-06T21:12:27.112Z · LW · GW

in earlier sections you argue that CDT agents might not adopt LDT-recommended policies and so will have problems with bargaining

That wasn’t my claim. I was claiming that even if you're an "LDT" agent, there's no particular reason to think all your bargaining counterparts will pick the Fair Policy given you do. This is because:

  1. Your bargaining counterparts won’t necessarily consult LDT.
  2. Even if they do, it’s super unrealistic to think of the decision-making of agents in high-stakes bargaining problems as entirely reducible to “do what [decision theory X] recommends.”
  3. Even if decision-making in these problems were as simple as that, why should we think all agents will converge to using the same simple method of decision-making? Seems like if an agent is capable of de-correlating their decision-making in bargaining from their counterpart, and their counterpart knows this or anticipates it on priors, that agent has an incentive to do so if they can be sufficiently confident that their counterpart will concede to their hawkish demand.

So no, “committing to act like LDT agents all the time,” in the sense that is helpful for avoiding selection pressures against you, does not ensure you’ll have a decision procedure such that you have no bargaining problems.

But we were discussing a case (counterfactual mugging) where they would want to pre-commit to act in ways that would be non-causally beneficial.

I’m confused, the commitment is to act in a certain way that, had you not committed, wouldn’t be beneficial unless you appealed to acausal (and updateless) considerations. But the act of committing has causal benefits.
 

there are other reasons that you might not want to demand too much. Maybe you know their source code and can simulate that they will not accept a too-high demand. Or perhaps you think, based on empirical evidence or a priori reasoning that most agents you might encounter will only accept a roughly fair allocation.

I agree these are both important possibilities, but:

  1. The reasoning “I see that they’ve committed to refuse high demands, so I should only make a compatible demand” can just be turned on its head and used by the agent who commits to the high demand.
  2. One might also think on priors that some agents might be committed to high demands, therefore strictly insisting on fair demands against all agents is risky.

I was specifically replying to the claim that the sorts of AGIs who would get into high-stakes bargaining would always avoid catastrophic conflict arising from bargaining problems; such a claim requires something stronger than the considerations you've raised, i.e., an argument that all such AGIs would adopt the same decision procedure (and account for logical causation) and therefore coordinate their demands.

(By default if I don't reply further, it's because I think your further objections were already addressed—which I think is true of some of the things I've replied to in this comment.)

Comment by Anthony DiGiovanni (antimonyanthony) on Responses to apparent rationalist confusions about game / decision theory · 2023-09-03T21:47:03.650Z · LW · GW

Thanks!

It's true that you usually have some additional causal levers, but none of them are the exact same as be the kind of person who does X.

Not sure I understand. It seems like "being the kind of person who does X" is a habit you cultivate over time, which causally influences how people react to you. Seems pretty analogous to the job candidate case.

if CDT agents often modify themselves to become an LDT/FDT agent then it would broadly seem accurate to say that CDT is getting outcompeted

See my replies to interstice's comment—I don't think "modifying themselves to become an LDT/FDT agent" is what's going on, at least, there doesn't seem to be pressure to modify themselves to do all the sorts of things LDT/FDT agents do. They come apart in cases where the modification doesn't causally influence another agent's behavior.

(This seems analogous to claims that consequentialism is self-defeating because the "consequentialist" decision procedure leads to worse consequences on average. I don't buy those claims, because consequentialism is a criterion of rightness, and there are clearly some cases where doing the non-consequentialist thing is a terrible idea by consequentialist lights even accounting for signaling value, etc. It seems misleading to call an agent a non-consequentialist if everything they do is ultimately optimizing for achieving good consequences ex ante, even if they adhere to some rules that have a deontological vibe and in a given situation may be ex post suboptimal.)

Comment by Anthony DiGiovanni (antimonyanthony) on Meta Questions about Metaphilosophy · 2023-09-02T09:05:29.469Z · LW · GW

It seems plausible that there is no such thing as "correct" metaphilosophy, and humans are just making up random stuff based on our priors and environment and that's it and there is no "right way" to do philosophy, similar to how there are no "right preferences"

If this is true, doesn't this give us more reason to think metaphilosophy work is counterfactually important, i.e., can't just be delegated to AIs? Maybe this isn't what Wei Dai is trying to do, but it seems like "figure out which approaches to things (other than preferences) that don't have 'right answers' we [assuming coordination on some notion of 'we'] endorse, before delegating to agents smarter than us" is time-sensitive, and yet doesn't seem to be addressed by mainstream intent alignment work AFAIK.

(I think one could define "intent alignment" broadly enough to encompass this kind of metaphilosophy, but I smell a potential motte-and-bailey looming here if people want to justify particular research/engineering agendas labeled as "intent alignment.")

Comment by Anthony DiGiovanni (antimonyanthony) on Responses to apparent rationalist confusions about game / decision theory · 2023-09-01T15:35:58.183Z · LW · GW

You said "Bob commits to LDT ahead of time"

In the context of that quote, I was saying why I don't buy the claim that following LDT gives you advantages over committing to, in future problems, do stuff that's good for you to commit to do ex ante even if it would be bad for you ex post had you not been committed.

What is selected-for is being the sort of agent who, when others observe you, they update towards doing stuff that's good for you. This is distinct from being the sort of agent who does stuff that would have helped you if you had been able to shape others' beliefs / incentives, when in fact you didn't have such an opportunity.

I think a CDT agent would pre-commit to paying in a one-off Counterfactual Mugging

Sorry I guess I wasn't clear what I meant by "one-shot" here / maybe I just used the wrong term—I was assuming the agent didn't have the opportunity to commit in this way. They just find themselves presented with this situation.

Same as above

Hmm, I'm not sure you're addressing my point here:

Imagine that you're an AGI, and either in training or earlier in your lifetime you faced situations where it was helpful for you to commit to, as above, "do stuff that's good for you to commit to do ex ante even if it would be bad for you ex post had you not been committed." You tended to do better when you made such commitments.

But now you find yourself thinking about this commitment races stuff. And, importantly, you have not previously broadcast credible commitments to a bargaining policy to your counterpart. Do you have compelling reasons to think you and your counterpart have been selected to have decision procedures that are so strongly logically linked, that your decision to demand more than a fair bargain implies your counterpart does the same? I don't see why. But that's what we'd need for the Fair Policy to work as robustly as Eliezer seems to think it does.

Comment by Anthony DiGiovanni (antimonyanthony) on Responses to apparent rationalist confusions about game / decision theory · 2023-09-01T15:05:43.737Z · LW · GW

Yeah, this is a complicated question. I think some things can indeed safely be deferred, but less than you’re suggesting. My motivations for researching these problems:

  1. Commitment races problems seem surprisingly subtle, and off-distribution for general intelligences who haven’t reflected about them. I argued in the post that competence at single-agent problems or collective action problems does not imply competence at solving commitment races. If early AGIs might get into commitment races, it seems complacent to expect that they’ll definitely be better at thinking about this stuff than humans who have specialized in it.
  2. If nothing else, human predecessors might make bad decisions about commitment races and lock those into early AGIs. I want to be in a position to know which decisions about early AGIs’ commitments are probably bad—like, say, “just train the Fair Policy with no other robustness measures”—and advise against them.
  3. Understanding how much risk there is by default of things going wrong, even when AGIs rationally follow their incentives, tells us how cautious we need to be about how to deploy even intent-aligned systems. (C.f. Christiano here about similar motivations for doing alignment research even if lots of it can be deferred to AIs, too.)
  4. (Less important IMO:) As I argued in the post, we can’t be confident there’s a “right answer” to decision theory to which AGIs will converge (especially in time for the high-stakes decisions). We may need to solve “decision theory alignment” with respect to our goals, to avoid behavior that is insufficiently cautious by our lights but a rational response to the AGI’s normative standards even if it’s intent-aligned. Given how much humans disagree with each other about decision theory, though: An MVP here is just instructing the intent-aligned AIs to be cautious about thorny decision-theoretic problems where those AIs may think they need to make decisions without consulting humans (but then we need the humans to be appropriately informed about this stuff too, as per (2)). That might sound like an obvious thing to do, but "law of earlier failure" and all that...
  5. (Maybe less important IMO, but high uncertainty:) Suppose we can partly shape AIs’ goals and priors without necessarily solving all of intent alignment, making the dangerous commitments less attractive to them. It’s helpful to know how likely certain bargaining failure modes are by default, to know how much we should invest in this “plan B.”
  6. (Maybe less important IMO, but high uncertainty:) As I noted in the post, some of these problems are about making the right kinds of commitments credible before it’s too late. Plausibly we need to get a head start on this. I’m unsure how big a deal this is, but prima facie, credibility of cooperative commitments is both time-sensitive and distinct from intent alignment work.
Comment by Anthony DiGiovanni (antimonyanthony) on Responses to apparent rationalist confusions about game / decision theory · 2023-08-31T08:14:11.282Z · LW · GW

The key point is that "acting like an LDT agent" in contexts where your commitment causally influences others' predictions of your behavior, does not imply you'll "act like an LDT agent" in contexts where that doesn't hold. (And I would dispute that we should label making a commitment to a mutually beneficial deal as "acting like an LDT agent," anyway.) In principle, maybe the simplest generalization of the former is LDT. But if doing LDT things in the latter contexts is materially costly for you (e.g. paying in a truly one-shot Counterfactual Mugging), seems to me that LDT would be selected against.

ETA: The more action-relevant example in the context of this post, rather than one-shot CM, is: "Committing to a fair demand, when you have values and priors such that a more hawkish demand would be preferable ex ante, and the other agents you'll bargain with don't observe your commitment before they make their own commitments." I don't buy that that sort of behavior is selected for, at least not strongly enough to justify the claim I respond to in the third section.

Comment by Anthony DiGiovanni (antimonyanthony) on The Commitment Races problem · 2023-07-15T19:52:08.473Z · LW · GW

Exploitation means the exploiter benefits. If you are a rock, you can't be exploited. If you are an agent who never gives in to threats, you can't be exploited (at least by threats, maybe there are other kinds of exploitation). That said, yes, if the opponent agents are the sort to do nasty things to you anyway even though it won't benefit them, then you might get nasty things done to you. You wouldn't be exploited, but you'd still be very unhappy.

Cool, I think we basically agree on this point then, sorry for misunderstanding. I just wanted to emphasize the point I made because "you won't get exploited if you decide not to concede to bullies" is kind of trivially true. :) The operative word in my reply was "robustly," which is the hard part of dealing with this whole problem. And I think it's worth keeping in mind how "doing nasty things to you anyway even though it won't benefit them" is a consequence of a commitment that was made for ex ante benefits; it's not the agent being obviously dumb, as Eliezer suggests. (Fortunately, as you note in your other comment, some asymmetries should make us think these commitments are rare overall; I do think an agent probably needs to have a pretty extreme-by-human-standards, little-to-lose value system to want to do this... but who knows what misaligned AIs might prefer.)

Comment by Anthony DiGiovanni (antimonyanthony) on The Commitment Races problem · 2023-07-14T14:08:30.797Z · LW · GW

It also has a deontological or almost-deontological constraint that prevents it from getting exploited.

I’m not convinced this is robustly possible. The constraint would prevent this agent from getting exploited conditional on the potential exploiters best-responding (being "consequentialists"). But it seems to me the whole heart of the commitment races problem is that the potential exploiters won’t necessarily do this, indeed depending on their priors they might have strong incentives not to. (And they might not update those priors for fear of losing bargaining power.)

That is, these exploiters will follow the same qualitative argument as us — “if I don’t commit to demand x%, and instead compromise with others’ demands to avoid conflict, I’ll lose bargaining power” — and adopt their own pseudo-deontological constraints against being fair. Seems that adopting your deontological strategy requires assuming one's bargaining counterparts will be “consequentialists” in a similar way as (you claim) the exploitative strategy requires. And this is why Eliezer's response to the problem is inadequate.

There might be various symmetry-breakers here, but I’m skeptical they favor the fair/nice agents so strongly that the problem is dissolved.

I think this is a serious challenge and a way that, as you say, an exploitation-resistant strategy might be “wasteful/clumsy/etc., hurting it’s own performance in other ways in order to achieve the no-exploitation property.” At least, unless certain failsafes against miscoordination are used—my best guess is these look like some variant of safe Pareto improvements that addresses the key problem discussed in this post, which I’ve worked on recently (as you know).

Given this, I currently think the most promising approach to commitment races is to mostly punt the question of the particular bargaining strategy to smarter AIs, and our job is to make sure robust SPI-like things are in place before it’s too late.

Comment by Anthony DiGiovanni (antimonyanthony) on Boomerang - protocol to dissolve some commitment races · 2023-06-08T18:50:11.378Z · LW · GW

The second mover ALREADY had the option not to commit - they could just swerve or crash, according to their decision theory.

The premise here is that the second-mover decided to commit soon after the first-mover did, because the proof of the first-mover's initial commitment didn't reach the second-mover quickly enough. They could have not committed initially, but they decided to do so because they had a chance of being first.

I'm not sure exactly what you mean by "according to their decision theory" (as in, what this adds here).

if it doesn't change the sequence of commitment, I don't see how it makes any difference at all

The difference is that the second-mover can say "oh shit I committed before getting the broadcast of the first-mover's commitment—I'd prefer to revoke this commitment because it's pointless, my commitment doesn't shape the first-mover's incentives in any way since I know the first-mover will just prefer to keep their commitment fixed."

As I said, the first-mover doesn't lose their advantage from this at all, because their commitment is locked (at their freeze time) before the second-mover's. So they can just leave their commitment in place, and their decision won't be swayed by the second-mover's at all because of the rule: "You shouldn’t be able to reveal the final decision to anyone before freeze_time because we don’t want the commitment to get credible before freeze_time."

Comment by Anthony DiGiovanni (antimonyanthony) on Boomerang - protocol to dissolve some commitment races · 2023-06-08T13:15:34.603Z · LW · GW

better off having a "real" commitment than a revocable commitment that Bob can talk her out of

I'm confused what you mean here. In principle Alice can revoke her commitment before the freeze time in this protocol, but Bob can't force her to do so. And if it's common knowledge that Alice's freeze time comes before Bob's, then: Since Alice knows that there will be a window after her freeze time where Bob knows Alice's commitment is frozen, and Bob has a chance to revert, then there would be no reason (barring some other commitment mechanism, including Bob being verifiably updateless while Alice isn't) for Bob not to revoke (to Swerve) if Alice refused to revert from Dare. So Alice would practically always keep her commitment.
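
To illustrate the reasoning with concrete numbers, here's a minimal sketch assuming standard Chicken-style payoffs (the particular payoff values and names are my own, not from the post):

```python
# Payoffs (Alice, Bob): crashing (Dare, Dare) is far worse for both than
# backing down.
PAYOFFS = {
    ("Dare", "Dare"):     (-10, -10),
    ("Dare", "Swerve"):   (2, 0),
    ("Swerve", "Dare"):   (0, 2),
    ("Swerve", "Swerve"): (1, 1),
}

def bobs_best_response(alice_frozen_move):
    """Bob's best reply in the window where Alice's commitment is frozen
    but Bob can still revoke his own (and Bob isn't otherwise committed)."""
    return max(["Dare", "Swerve"],
               key=lambda bob_move: PAYOFFS[(alice_frozen_move, bob_move)][1])

# Alice's freeze time comes first, so her Dare is locked while Bob can revert:
print(bobs_best_response("Dare"))  # -> "Swerve": Bob revokes rather than crash
# Anticipating this, Alice has no reason to revoke her Dare before her freeze
# time, so she practically always keeps her commitment.
```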

The power to revoke commitments here is helpful in the hands of the second-mover, who made the initial incompatible commitment because of, e.g., some lag time between the first-mover's making and broadcasting the commitment.

Comment by antimonyanthony on [deleted post] 2023-05-09T20:01:38.106Z

I'd recommend checking out this post critiquing this view, if you haven't read it already. Summary of the counterpoints:

  • (Intent) alignment doesn't seem sufficient to ensure an AI makes safe decisions about subtle bargaining problems in a situation of high competitive pressure with other AIs. I don't expect the kinds of capabilities progress that is incentivized by default to suffice for us to be able to defer these decisions to the AI, especially given path-dependence on feedback from humans who'd be pretty naïve about this stuff. (C.f. this post—you need the human feedback at bottom to be sufficiently high quality to not get garbage-in, garbage-out problems even if you've solved the hard parts of alignment.)
  • To the extent that solving all of intent alignment is too intractable, focusing on subsets of alignment that are especially likely to avoid s-risks—e.g. preventing AIs from intrinsically valuing frustrating others' preferences—might be promising. I don't think mainstream alignment research prioritizes these.
Comment by Anthony DiGiovanni (antimonyanthony) on antimonyanthony's Shortform · 2023-04-11T13:10:43.881Z · LW · GW

Claims about counterfactual value of interventions given AI assistance should be consistent

A common claim I hear about research on s-risks is that it’s much less counterfactual than alignment research, because if alignment goes well we can just delegate it to aligned AIs (and if it doesn’t, there’s little hope of shaping the future anyway).

I think there are several flaws with this argument that require more object-level context (see this post).[1] But at a high level, this consideration—that research/engineering can be delegated to AIs that pose little-to-no risk of takeover—should also make us discount the counterfactual value of alignment research/engineering. The main plan of OpenAI’s alignment team, and part of Anthropic’s plan and those of several thought leaders in alignment, is to delegate alignment work (arguably the hardest parts thereof)[2] to AIs.

It’s plausible (and apparently a reasonably common view among alignment researchers) that:

  1. Aligning models on tasks that humans can evaluate just isn’t that hard, and would be done by labs for the purpose of eliciting useful capabilities anyway; and
  2. If we restrict to using predictive (non-agentic) models for assistance in aligning AIs on tasks humans can’t evaluate, they will pose very little takeover risk even if we don’t have a solution to alignment for AIs at their limited capability level.

It seems that if these claims hold, lots of alignment work would be made obsolete by AIs, not just s-risk-specific work. And I think several of the arguments for humans doing some alignment work anyway apply to s-risk-specific work:

  • In order to recognize what good alignment work (or good deliberation about reducing conflict risks) looks like, and provide data on which to finetune AIs who will do that work, we need to practice doing that work ourselves. (Christiano here, Wentworth here)
  • To the extent that working on alignment (or s-risks) ourselves gives us / relevant decision-makers evidence about how fundamentally difficult these problems are, we’ll have better guesses as to whether we need to push for things like avoiding deploying the relevant kinds of AI at all. (Christiano again)
  • For seeding the process that bootstraps a sequence of increasingly smart aligned AIs, you need human input at the bottom to make sure that process doesn’t veer off somewhere catastrophic—garbage in, garbage out. (O’Gara here.) AIs’ tendencies towards s-risky conflicts seem to be, similarly, sensitive to path-dependent factors (in their decision theory and priors, not just values, so alignment plausibly isn’t sufficient).

I would probably agree that alignment work is more likely to make a counterfactual difference to P(misalignment) than s-risk-targeted work is to make a counterfactual difference to P(s-risk), overall. But the gap seems to be overstated (and other prioritization considerations can outweigh this one, of course).

  1. ^

    That post focuses on technical interventions, but a non-technical intervention that seems pretty hard to delegate to AIs is to reduce race dynamics between AI labs, which lead to an uncooperative multipolar takeoff.

  2. ^

    I.e., the hardest part is ensuring the alignment of AIs on tasks that humans can't evaluate, where the ELK problem arises.

Comment by Anthony DiGiovanni (antimonyanthony) on Shutting Down the Lightcone Offices · 2023-03-17T21:34:48.600Z · LW · GW

primarily because models will understand the base goal first before having world modeling

Could you say a bit more about why you think this? My definitely-not-expert expectation would be that the world-modeling would come first, then the "what does the overseer want" after that, because that's how the current training paradigm works: pretrain for general world understanding, then finetune on what you actually want the model to do.

Comment by Anthony DiGiovanni (antimonyanthony) on My Model Of EA Burnout · 2023-02-04T21:22:36.996Z · LW · GW

"I am devoting my life to solving the most important problems in the world and alleviating as much suffering as possible" fits right into the script. That's exactly the kind of thing you are supposed to be thinking. If you frame your life like that, you will fit in and everyone will understand and respect what is your basic deal.

Hm, this is a pretty surprising claim to me. It's possible I haven't actually grown up in a "western elite culture" (in the U.S., it might be a distinctly coastal thing, so the cliché goes? IDK). Though, I presume having gone to some fancypants universities in the U.S. makes me close enough to that. The Script very much did not encourage me to devote my life to solving the most important problems and alleviating as much suffering as possible, and it seems not to have encouraged basically any of my non-EA friends from university to do this. I/they were encouraged to have careers that were socially valuable, to be sure, but not the main source of purpose in their lives or a big moral responsibility.

Comment by Anthony DiGiovanni (antimonyanthony) on Discovering Language Model Behaviors with Model-Written Evaluations · 2023-01-21T11:17:24.357Z · LW · GW

A model that just predicts "what the 'correct' choice is" doesn't seem likely to actually do all the stuff that's instrumental to preventing itself from getting turned off, given the capabilities to do so.

But I'm also just generally confused whether the threat model here is, "A simulated 'agent' made by some prompt does all the stuff that's sufficient to disempower humanity in-context, including sophisticated stuff like writing to files that are read by future rollouts that generate the same agent in a different context window," or "The RLHF-trained model has goals that it pursues regardless of the prompt," or something else.

Comment by Anthony DiGiovanni (antimonyanthony) on 'simulator' framing and confusions about LLMs · 2023-01-02T18:49:24.863Z · LW · GW

confused claims that treat (base) GPT3 and other generative models as traditional rational agents

I'm pretty surprised to hear that anyone made such claims in the first place. Do you have examples of this?

Comment by Anthony DiGiovanni (antimonyanthony) on [Link] Why I’m optimistic about OpenAI’s alignment approach · 2022-12-09T09:34:36.110Z · LW · GW

I think you might be misunderstanding Jan's understanding. A big crux in this whole discussion between Eliezer and Richard seems to be: Eliezer believes any AI capable of doing good alignment research—at least good enough to provide a plan that would help humans make an aligned AGI—must be good at consequentialist reasoning in order to generate good alignment plans. (I gather from Nate's notes in that conversation plus various other posts that he agrees with Eliezer here, but not certain.) I strongly doubt that Jan just mistook MIRI's focus on understanding consequentialist reasoning for a belief that alignment research requires being a consequentialist reasoner.

Comment by Anthony DiGiovanni (antimonyanthony) on Utilitarianism Meets Egalitarianism · 2022-11-26T10:22:26.514Z · LW · GW

I agree with your guesses.

I am not sure that "controlling for game-theoretic instrumental reasons" is actually a move that is well defined/makes sense.

I don't have a crisp definition of this, but I just mean that, e.g., we compare the following two worlds: (1) 99.99% of agents are non-sentient paperclippers, and each agent has equal (bargaining) power. (2) 99.99% of agents are non-sentient paperclippers, and the paperclippers are all confined to some box. According to plenty of intuitive-to-me value systems, you only (maybe) have reason to increase paperclips in (1), not (2). But if the paperclippers felt really sad about the world not having more paperclips, I'd care—to an extent that depends on the details of the situation—about increasing paperclips even in (2).

Comment by Anthony DiGiovanni (antimonyanthony) on Relaxed adversarial training for inner alignment · 2022-11-23T11:44:05.425Z · LW · GW

Ah right, thanks! (My background is more stats than comp sci, so I'm used to "indicator" instead of "predicate.")

Comment by Anthony DiGiovanni (antimonyanthony) on Utilitarianism Meets Egalitarianism · 2022-11-22T20:16:17.640Z · LW · GW

Let's pretend that you are a utilitarian. You want to satisfy everyone's goals

This isn't a criticism of the substance of your argument, but I've come across a view like this one frequently on LW so I want to address it: This seems like a pretty nonstandard definition of "utilitarian," or at least, it's only true of some kinds of preference utilitarianism.

I think utilitarianism usually refers to a view where what you ought to do is maximize a utility function that (somehow) aggregates a metric of welfare across individuals, not their goal-satisfaction. Kicking a puppy without me knowing about it thwarts my goals, but (at least on many reasonable conceptions of "welfare") doesn't decrease my welfare.

I'd be very surprised if most utilitarians thought they'd have a moral obligation to create paperclips if 99.99% of agents in the world were paperclippers (example stolen from Brian Tomasik), controlling for game-theoretic instrumental reasons.

Comment by Anthony DiGiovanni (antimonyanthony) on Relaxed adversarial training for inner alignment · 2022-11-22T11:22:39.673Z · LW · GW

Basic questions: If the type of Adv(M) is a pseudo-input, as suggested by the above, then what does Adv(M)(x) even mean? What is the event whose probability is being computed? Does the unacceptability checker C also take real inputs as the second argument, not just pseudo-inputs—in which case I should interpret a pseudo-input as a function that can be applied to real inputs, and Adv(M)(x) is the statement "A real input x is in the pseudo-input (a set) given by Adv(M)"?

(I don't know how pedantic this is, but the unacceptability penalty seems pretty important, and I struggle to understand what the unacceptability penalty is because I'm confused about Adv(M)(x).)

Comment by Anthony DiGiovanni (antimonyanthony) on Ryan Kidd's Shortform · 2022-11-16T20:40:38.730Z · LW · GW

This is a risk worth considering, yes. It's possible in principle to avoid this problem by "committing" (to the extent that humans can do this) to both (1) training the agent to make the desired tradeoffs between the surrogate goal and the original goal, and (2) not training the agent to use a more hawkish bargaining policy than it would have had without surrogate goal training. (And to the extent that humans can't make this commitment, i.e., we make honest mistakes in (2), the other agent doesn't have an incentive to punish those mistakes.)

If the developers do both these things credibly—and it's an open research question how feasible this is—surrogate goals should provide a Pareto improvement for the two agents (not a rigorous claim). Safe Pareto improvements are a generalization of this idea.
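
As a toy numerical illustration of the Pareto improvement claim (my own made-up numbers, not anything rigorous from the surrogate goals / SPI writeups): hold the agent's concession behavior fixed, and only change which goal an executed threat damages.

```python
# Toy model, purely illustrative. By commitments (1) and (2), the agent's
# concession behavior is the same with or without the surrogate goal; the only
# difference is which goal an executed threat damages. The threatener's payoff
# from executing a threat is assumed identical (here, zero) in both worlds.

p_concede = 0.4             # agent's probability of giving in to a threat
concession_value = 3.0      # threatener's gain when the agent concedes
harm_original_goal = -10.0  # agent's loss if a threat against its original goal is executed
harm_surrogate_goal = -1.0  # agent's loss if only the surrogate goal is damaged

def threatener_expected_value(p: float) -> float:
    # Depends only on the agent's (unchanged) concession behavior,
    # so it's identical with or without surrogate goal training.
    return p * concession_value

def agent_expected_value(p: float, harm_if_executed: float) -> float:
    return (1 - p) * harm_if_executed

print(threatener_expected_value(p_concede))                 # 1.2 in both worlds
print(agent_expected_value(p_concede, harm_original_goal))  # -6.0 without the surrogate goal
print(agent_expected_value(p_concede, harm_surrogate_goal)) # -0.6 with it
```

The point is just that, under those commitments, the threatener's expected value is unchanged while the agent's is higher, which is the (non-rigorous) sense of "Pareto improvement" I have in mind.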

Comment by Anthony DiGiovanni (antimonyanthony) on Paper: Discovering novel algorithms with AlphaTensor [Deepmind] · 2022-10-06T19:38:31.629Z · LW · GW

Thanks. I just wasn't sure if I was missing something. :)

Comment by Anthony DiGiovanni (antimonyanthony) on Paper: Discovering novel algorithms with AlphaTensor [Deepmind] · 2022-10-05T19:12:25.147Z · LW · GW

Why is this post tagged "transparency / interpretability"? I don't see the connection.

Comment by Anthony DiGiovanni (antimonyanthony) on When would AGIs engage in conflict? · 2022-09-24T10:05:23.874Z · LW · GW

I think this is an important question, and this case for optimism can be a bit overstated when one glosses over the practical challenges to verification. There's plenty of work on open-source game theory out there, but to my knowledge, none of these papers really discuss how one agent might gain sufficient evidence that it has been handed the other agent's actual code.

We wrote this part under the assumption that AGIs might be able to figure out how to overcome these practical challenges in ways we can't anticipate, which I think is plausible. But of course, an AGI might just as well be able to figure out ways to deceive other AGIs that we can't anticipate. I'm not sure how the "offense-defense balance" here will change in the limit of smarter agents.

Comment by Anthony DiGiovanni (antimonyanthony) on The Inter-Agent Facet of AI Alignment · 2022-09-20T08:08:18.397Z · LW · GW

Thanks for this! I agree that inter-agent safety problems are highly neglected, and that it's not clear that intent alignment or the kinds of capability robustness incentivized by default will solve (or are the best ways to solve) these problems. I'd recommend looking into Cooperative AI, and the "multi/multi" axis of ARCHES.

This sequence discusses similar concerns—we operationalize what you call inter-agent alignment problems as either:

  1. Subsets of capability robustness, because if an AGI wants to achieve X in some multi-agent environment, then accounting for the dependencies of its strategy on other agents' strategies is instrumental to achieving X (but accounting for these dependencies might be qualitatively harder than default capabilities); or
  2. Subsets of intent alignment, because the AGI's preferences partly shape how likely it is to cooperate with others, and we might be able to intervene on cooperation-relevant preferences even if full intent alignment fails.

Comment by Anthony DiGiovanni (antimonyanthony) on (My understanding of) What Everyone in Technical Alignment is Doing and Why · 2022-08-29T12:02:58.681Z · LW · GW

(Speaking for myself as a CLR researcher, not for CLR as a whole)

I don't think it's accurate to say CLR researchers think increasing transparency is good for cooperation. There are some tradeoffs here, such that I and other researchers are currently uncertain whether marginal increases in transparency are net good for AI cooperation. That said, it is true that more transparency opens up efficient equilibria that wouldn't have been possible without open-source game theory. (ETA: some relevant research by people (previously) at CLR here, here, and here.)

Comment by Anthony DiGiovanni (antimonyanthony) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-08-13T11:06:58.142Z · LW · GW

I like that this post clearly lays out reasons why we might expect deception (and similar dynamics) not just to be possible in the sense of getting equal training rewards, but to actually provide higher rewards than the honest alternatives. This updates my probability of those scenarios upward.

Comment by Anthony DiGiovanni (antimonyanthony) on Criticism of EA Criticism Contest · 2022-07-17T09:30:43.553Z · LW · GW

I notice that I strongly disagree with a majority of them (#1, #2, #4, #8, #10, #11, #13, #14, #15, #17, #18, #21)

Re: #2, what do you consider to be The Bad other than suffering?

Comment by Anthony DiGiovanni (antimonyanthony) on A note about differential technological development · 2022-07-17T09:23:47.154Z · LW · GW

On my picture, I think a key variable is the length of time between when-we-understand-the-basic-shape-of-things-that-will-get-to-AGI and when-it-reaches-strong-superintelligence.

I don't understand why you think the sort of capabilities research done by alignment-conscious people contributes to lengthening this time. In particular, what reason do you have to think they're not advancing the second time point as much as the first? Could you spell that out more explicitly?

Comment by Anthony DiGiovanni (antimonyanthony) on AGI Ruin: A List of Lethalities · 2022-06-17T07:34:02.192Z · LW · GW

They can read each other's source code, and thus trust much more deeply!

Being able to read source code doesn't automatically increase trust—you also have to be able to verify that the code being shared with you actually governs the AGI's behavior, despite that AGI's incentives and abilities to fool you.

(Conditional on the AGIs having strongly aligned goals with each other, sure, this degree of transparency would help them with pure coordination problems.)

Comment by Anthony DiGiovanni (antimonyanthony) on The case for becoming a black-box investigator of language models · 2022-05-12T21:38:46.729Z · LW · GW

It feels to me like “have humans try to get to know the AIs really well by observing their behaviors, so that they’re able to come up with inputs where the AIs will be tempted to do bad things, so that we can do adversarial training” is probably worth including in the smorgasbord of techniques we use to try to prevent our AIs from being deceptive

Maybe I missed something here, but how is this supposed to help with deception? I thought the whole reason deceptive alignment is so hard to solve is that you can't tell from its behavior whether the AI is being deceptive.

Comment by Anthony DiGiovanni (antimonyanthony) on Why No *Interesting* Unaligned Singularity? · 2022-05-12T07:46:07.774Z · LW · GW

That all sounds fair. I've seen rationalists claim before that it's better for "interesting" things (in the literal sense) to exist than not, even if nothing sentient is interested in them, which is why I assumed you meant the same.