On “first critical tries” in AI alignment

post by Joe Carlsmith (joekc) · 2024-06-05T00:19:02.814Z · LW · GW · 8 comments

Contents

  Some conceptual points
  Unilateral DSAs
  Coordination DSAs
  Correlation DSAs
  A few final thoughts

People sometimes say that AI alignment is scary partly (or perhaps: centrally) because you have to get it right on the “first critical try,” and can’t learn from failures.[1] What does this mean? Is it true? Does there need to be a “first critical try” in the relevant sense? I’ve sometimes felt confused about this, so I wrote up a few thoughts to clarify.

I start with a few miscellaneous conceptual points. I then focus on a notion of “first critical try” tied to the first point (if there is one) at which AIs get a “decisive strategic advantage [? · GW]” (DSA) over humanity – that is, roughly, the ability to kill/disempower all humans if they try.[2] I further distinguish between four different types of DSA: unilateral DSAs, coordination DSAs, short-term correlation DSAs, and long-term correlation DSAs.

I also offer some takes on our prospects for just not ever having “first critical tries” from each type of DSA (via routes other than just not building superhuman AI systems at all). In some cases, just not having a “first critical try” in the relevant sense seems to me both plausible and worth working towards. In particular, I think we should try to make it the case that no single AI system is ever in a position to kill all humans and take over the world. In other cases, I think avoiding “first critical tries,” while still deploying superhuman AI agents throughout the economy, is more difficult (though how difficult it would be to get the relevant “tries” right is a separate question).

Here’s a chart summarizing my takes in more detail. For each type of DSA, I give its definition, the prospects for avoiding AIs ever getting that type of DSA (i.e., for not having a “first critical try” for such a situation), and what’s required for it to lead to doom.

Unilateral DSA
Definition: Some AI agent could take over if it tried, even without the cooperation of other AI agents.
Prospects for avoiding: Can be avoided by making the world sufficiently empowered relative to each AI system. We should work towards this – e.g., aim to make it the case that no single AI system could kill/disempower all humans if it tried.
Required for doom: Requires only that this one agent tries to take over.

Coordination DSA
Definition: If some set of AI agents coordinated to try to take over, they would succeed; and they are able to so coordinate.
Prospects for avoiding: Harder to avoid than unilateral DSAs, due to the likely role of other AI agents in preventing unilateral DSAs. But could still be avoided/delayed by (a) reducing reliance on other AI agents for preventing unilateral DSAs, and (b) preventing coordination between AI agents.
Required for doom: Requires that all these agents try to take over, and that they coordinate.

Short-term correlation DSA
Definition: If some set of AI agents all sought power in problematic ways within a relatively short period of time, even without coordinating, then ~all humans would be disempowered.
Prospects for avoiding: Even harder to avoid than coordination DSAs, because it doesn’t require that the AI agents in question be able to coordinate.
Required for doom: Requires that, within a relatively short period of time, all these agents choose to seek power in problematic ways, potentially without the ability to coordinate.

Long-term correlation DSA
Definition: If some set of AI agents all sought power in problematic ways within a relatively long period of time, even without coordinating, then ~all humans would be disempowered.
Prospects for avoiding: Easier to avoid than short-term correlation DSAs, because the longer time period gives more time to notice and correct any given instance of power-seeking.
Required for doom: Requires that, within a relatively long period of time, all these agents choose to seek power in problematic ways, potentially without the ability to coordinate.

Some conceptual points

The notion of “needing to get things right on the first critical try” can be a bit slippery in its meaning and scope. For example: does it apply uniquely to AI risk, or is it a much more common problem? Let's start with a few points of conceptual clarification:

Unilateral DSAs

OK, with those conceptual clarifications out of the way, let’s ask more directly: in what sense, if any, will there be a “first critical try” with respect to AI alignment?

I think the most standard version of the thought goes roughly like this:[9] 

1. At some point, you’ll be building an AI powerful enough to get a “decisive strategic advantage” (DSA). That is, this AI will be such that, if it chose to try to kill all humans and take over the world, it would succeed.[10]

2. So, at that point, you need that AI to be such that it doesn’t choose to kill all humans and take over the world, even though it could.

So the first point where (1) is true, here, is the “first critical try.” And (2), roughly, is the alignment problem. That is, if (1) is true, then whether or not this AI kills everyone depends on how it makes choices, rather than on what it can choose to do. And alignment is about getting the “how the AI makes choices” part sufficiently right.

I think that focusing on the notion of a decisive strategic advantage usefully zeroes in on the first point where we start banking on AI motivations, in particular, for avoiding doom – rather than, e.g., AIs not being able to cause doom if they tried. So I’ll generally follow that model here.

If (1) is true, then I think it is indeed appropriate to say that there will be a “first critical try” that we need to get right in some sense (though note that we haven’t yet said anything about how hard this will be; and it could be that the default path is objectively safe, even if subjectively risky). What’s more: we won’t necessarily know when this “first critical try” is occurring. And even if we get the first one right, there might be others to follow. For example: you might then build an even more powerful AI, which also has (or can get) a decisive strategic advantage.

Is (1) true? I won’t dive in deep here. But I think it’s not obvious, and that we should try to make it false. That is: I think we should try to make it the case that no AI system is ever in a position to kill everyone and take over the world.[11]

How? Well, roughly speaking, by trying to make sure that “the world” stays sufficiently empowered relative to any AI agent that might try to take it over. Of course, if single AI agents can gain sufficiently large amounts of relative power sufficiently fast (including: by copying themselves, modifying/improving themselves, etc), or if we should expect some such agent to start out sufficiently “ahead,” this could be challenging. Indeed, this is a core reason why certain types of “intelligence explosions” are so scary. But in principle, at least, you can imagine AI “take-offs” in which power (including: AI-driven power) remains sufficiently distributed, and defensive technology sufficiently robust and continually improved, that no single AI agent would ever succeed in “taking over the world” if it tried. And we can work to make things more like that.[12] 

Coordination DSAs

I think that in practice, a lot of the “first critical try” discourse comes down to (1) – i.e., the idea that some AI agent will at some point be in a position to kill everyone and take over the world. However, suppose that we don’t assume this. Is there still a sense in which there will be a “first critical try” on alignment?

Consider the following variant of the reasoning above:

3. At some point, some set of AI agents will be such that:

  • they will all be able to coordinate with each other to try to kill all humans and take over the world; and
  • if they choose to do this, their takeover attempt will succeed.[13]

4. So at that point, you need that set of AI agents to be such that they don’t all choose to coordinate with each other to kill all humans and take over the world, even though they could.

Let’s say that an AI has a “unilateral DSA” if it’s in a position to take over without the cooperation of any other AI agents. Various AI doom stories feature systems with this sort of DSA,[14] and it's the central reading I have in mind for (1) above. But the sort of DSA at stake in (3) is broader, and includes cases where AI systems need to coordinate in order for takeover to succeed. Let’s call the sort of DSA at stake in (3) a “coordination DSA.”

Coordination DSAs, on the part of AI agents, are harder to avoid than unilateral DSAs. In particular: in a world with many different superintelligent AI agents – and especially, in worlds where such agents have been broadly integrated into crucial economic and military functions – it seems plausible that an increasing share of power will in some sense run “via” such agents. For example:

So even if no single AI agent ever gets a decisive strategic advantage, the power held by superintelligent AI agents collectively can easily grow to dominate the power that would oppose them if they all coordinated. And we might worry, on grounds of their superintelligence, that they will be able to coordinate if they want to.

Indeed, we can try to argue that the only plausible scenarios in which (1) is false – i.e., no superintelligence ever gets a unilateral DSA – are scenarios where (3) is true. In particular, we can argue that:

5. At some point, you will build a sufficiently powerful AI agent (call this Agent A) such that the only way to prevent this agent from having a decisive strategic advantage is to use powerful AI agents (Agents B, C, D, etc) to oppose/constrain it.

And we can try to argue, from (5), that at that point, (3) will be true. In particular: if, per 5, you need to rely on Agents B, C, D etc to oppose/constrain Agent A, then the collection of all those agents might well satisfy (3).

If AI capability development and deployment continues unabated, will (5) be true?[15] I think it’s more likely than (1), and likely overall. Still, it’s not totally obvious. For example:

But overall, (5) seems to me worryingly hard to avoid.

Note, though, that even if we grant (5), (3) doesn’t strictly follow. In particular: (3) specifically says that the AIs in question are able to coordinate – that is, that coordination is an option for them. And the fact that Agents B, C, D etc are functioning to oppose/constrain Agent A doesn’t imply this. For example, maybe adequate coordination between all these agents would require suitably unmonitored/opaque channels of interaction/communication, and they don’t have access to such channels.

So one option, for preventing the existence of a set of AI systems with a coordination DSA, is to try to prevent AI systems from being in a position to coordinate. Indeed, I generally think research into the dynamics of AI coordination is a neglected area, and that preventing coordination in only-somewhat-superhuman AIs may be an important line of defense.[17] For highly superintelligent agents, though – especially ones that are operating and interacting in contexts that humans can’t understand – it seems difficult.

So overall, if AI development and deployment continues unabated, it seems likely to me that some set of AI agents will eventually have a coordination DSA in the sense at stake in (3). And so we can view the first such point as a different type of “first critical try.”

Of course, as with unilateral DSAs, there’s still a question of how hard it will be, by the time (3) is true, to be confident that the relevant AIs won’t try to coordinate to kill all humans and take over the world, even though they could. I won’t try to assess this here.

Correlation DSAs

So far, I’ve talked about scenarios where a single AI takes over the world, and scenarios where a set of AIs all coordinate to take over the world. But these don’t exhaust the scenarios in which misaligned AI systems disempower all humans. In particular: humans could end up disempowered via the collective activity of many different AI systems seeking power in an uncoordinated way.

Is there a “first critical try” for this sort of scenario? One can imagine a line of reasoning like:

6. At some point, it will be the case that if some set of AI agents all chose to seek power in problematic ways, even without coordinating, then humanity would end up extinct/disempowered.

7. So at that point, it needs to be the case that these AI systems don’t all choose to seek power in those ways.

In a sense, the first point where (6) holds is its own type of “first critical try.” But setting aside unilateral DSAs and coordination DSAs, is the point at stake in (6) additionally worrying? Not necessarily. In particular, it depends on how much correlation between the AI agents’ choices to power-seek is necessary for doom to occur, over what timescales, without humans being able to correct the situation.

So to better tease out possible worries, here, perhaps we should distinguish between two versions of (6):

6-short: At some point, it will be the case that if some set of AI agents all chose to seek power in problematic ways within a relatively short window of time, even without coordinating, then humanity would end up extinct/disempowered.

6-long: At some point, it will be the case that if some set of AI agents all chose to seek power in problematic ways within a relatively long window of time, even without coordinating, then humanity would end up extinct/disempowered.

I’ll call the sort of DSA at stake in 6-short a “short-term correlation DSA,” and the sort at stake in 6-long a “long-term correlation DSA.”[19]

The degree to which 6-short is additionally worrying (once we’ve taken into account (1) and (3)) depends on the probability that the relevant agents will all choose to seek power in problematic ways within the relevant short period of time, without coordinating. If the “short period” is “the exact same moment,” the relevant sort of correlation seems unlikely. But as it gets chunkier, one should get more worried.
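To see why the width of the window matters, here is a minimal toy calculation of my own (not from the original post), under the deliberately unrealistic assumption that the agents act independently: the chance that every agent in a set initiates problematic power-seeking within a given window rises sharply as that window widens.

```python
# Toy model (illustrative assumption, not from the post): each of n_agents has an
# independent per-day probability `daily_rate` of initiating problematic power-seeking.
# We ask how likely it is that *all* of them do so at least once within one window.

def p_all_within_window(n_agents: int, daily_rate: float, window_days: int) -> float:
    """Probability that every agent independently initiates power-seeking at least
    once during a single window of `window_days` days."""
    p_one = 1 - (1 - daily_rate) ** window_days  # chance a single agent acts in the window
    return p_one ** n_agents                     # all agents act within that same window

# Example numbers are purely hypothetical.
for window in (1, 30, 365):
    print(f"{window:>3} days: {p_all_within_window(5, 0.001, window):.2e}")
```

Even in this independence-only toy model, widening the window from a day to a year raises the all-agents probability by many orders of magnitude; any correlation between the agents’ choices would push it higher still.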

In particular, there are sources of correlation here other than AI agents actively “working together.” For example:

So I do think that 6-short creates its own kind of additional concern, on top of (3). That is, if we get to the point where we couldn’t recover from some set of AI systems all choosing to seek power in a fairly correlated way over a fairly short time-period, this does seem like a notable type of vulnerability, even if we are confident that the AI systems in question aren’t coordinating to engage in this type of behavior. And for reasons similar to those at stake with (3) and (5), I do think that 6-short is plausibly hard to avoid hitting eventually. Indeed, it’s actively harder to avoid than (3), because (3) can be avoided by cutting off the possibility of coordination between AI agents, whereas this doesn’t suffice to avoid 6-short.

What about 6-long? 6-long, if true, is more worrying than 6-short, because it provides a longer time period for a correlated alignment failure to occur, thereby allowing looser forms of correlated power-seeking to cause doom. But 6-long is also easier to make false than 6-short. In particular: the longer time window allows for more time to notice and correct any given instance of power-seeking. Thus, in the example above, if the actions of Agent A and Agent B take place six months apart rather than a few days apart, the humans have more time to deal with the Agent A situation, and to recover full control, before the Agent B situation gets going.

A few final thoughts

Ok, those were four different types of “first critical tries,” corresponding to four different types of DSAs, plus a few takes on each. I’ll close with a few other notes:

I work at Open Philanthropy but I’m here speaking only for myself and not for my employer.

  1. ^

     See e.g. Yudkowsky’s 3 here [LW · GW]: 

    “We need to get alignment right on the 'first critical try' at operating at a 'dangerous' level of intelligence, where unaligned operation at a dangerous level of intelligence kills everybody on Earth and then we don't get to try again. This includes, for example: (a) something smart enough to build a nanosystem which has been explicitly authorized to build a nanosystem; or (b) something smart enough to build a nanosystem and also smart enough to gain unauthorized access to the Internet and pay a human to put together the ingredients for a nanosystem; or (c) something smart enough to get unauthorized access to the Internet and build something smarter than itself on the number of machines it can hack; or (d) something smart enough to treat humans as manipulable machinery and which has any authorized or unauthorized two-way causal channel with humans; or (e) something smart enough to improve itself enough to do (b) or (d); etcetera.  We can gather all sorts of information beforehand from less powerful systems that will not kill us if we screw up operating them; but once we are running more powerful systems, we can no longer update on sufficiently catastrophic errors. This is where practically all of the real lethality comes from, that we have to get things right on the first sufficiently-critical try. If we had unlimited retries - if every time an AGI destroyed all the galaxies we got to go back in time four years and try again - we would in a hundred years figure out which bright ideas actually worked. Human beings can figure out pretty difficult things over time, when they get lots of tries; when a failed guess kills literally everyone, that is harder. That we have to get a bunch of key stuff right on the first try is where most of the lethality really and ultimately comes from; likewise the fact that no authority is here to tell us a list of what exactly is 'key' and will kill us if we get it wrong.”

    And see also Soares here [LW · GW].

  2. ^

    This reflects how the term is already used by Yudkowsky and Soares.

  3. ^

    I haven't pinned this down in detail, but roughly, I tend to think of a set of AI instances as a "single agent" if they are (a) working towards the same impartially-specified consequences in the world and (b) if they are part of the same "lineage"/causal history. So this would include copies of the same weights (with similar impartial goals), updates to those weights that preserve those goals, and new agents trained by old agents to have the same goals. But it wouldn't include AIs trained by different AI labs that happen to have similar goals; or different copies of an AI where the fact that they're different copies puts their goals at cross-purposes (e.g., they each care about what happens to their specific instance).

As an analogy: if you're selfish, then your clones aren't "you" on this story. But if you're altruistic, they are. But even if you and your friend Bob both have the same altruistic values, you're still different people.

    That said, the discussion in the post will generally apply to many different ways of individuating AI agents.

  4. ^

    Obviously AI risk is vastly higher stakes. But I'm here making the conceptual point that needing to get the first try (and all the other tries) right comes definitionally from having to avoid ever failing.

  5. ^

    See Christiano here [LW · GW]. Yudkowsky also acknowledges this.

  6. ^

    See, for example, the discourse about “warning shots,” and about catching AIs red-handed [AF · GW].

  7. ^

     See e.g. Karnofsky here, Soares here [LW · GW], and Yudkowsky here. The reason I’m most worried about is “scheming.”

  8. ^

    Sixth: “Needing to get things right” can imply that if you don’t do the relevant “try” in some particular way (e.g., with the right level of technical competence), then doom will ensue. But even in contexts where you have significant subjective uncertainty about whether the relevant “try” will cause doom, you don’t necessarily need to “get things right” in the sense of “execute with a specific level of competence” in order to avoid doom. In particular: your uncertainty may be coming from uncertainty about some underlying objective parameter your execution doesn’t influence.

    Thus: suppose that the evidence were more ambiguous about whether your volcano science experiment was going to cause doom, so you assign it a 10% subjective probability. This doesn’t mean that you have to do the experiment in a particular way – e.g., “get the experiment right” – otherwise doom will ensue. Rather, the objective facts might just be that any way of proceeding is safe; even if subjectively, some/all ways are unacceptably risky.

    I think some AI alignment “tries” might be like this. Thus, suppose that you’re faced with a decision about whether to deploy an AI system that seems aligned, and you’re unsure whether or not it’s “scheming” – i.e., faking alignment in order to get power later. It’s not necessarily the case that at that point, you need to have “figured out how to eliminate scheming,” else doom. Rather, it could be that scheming just doesn’t show up by default – for example, because SGD’s inductive biases don’t favor it [LW · GW].

    That said, of course, proceeding with a “try” that involves a significant subjective risk of doom is itself extremely scary. And insofar as you are banking on some assumption X holding in order to avoid doom, you do need to “get things right” with respect to whether or not assumption X is true.

  9. ^

    Here I’m mostly thinking of Yudkowsky’s usage, which focuses on the first point where an AI is “operating at a ‘dangerous’ level of intelligence, where unaligned operation at a dangerous level of intelligence kills everybody on Earth and then we don't get to try again.” The usage in Soares here [LW · GW] is similar, but the notion of “most theories don’t work on the first real try” could also apply more broadly, to scenarios where you’re using your scientific theory to assess an AI’s capabilities in addition to its alignment.

  10. ^

Really, whether or not an agent “can” do something like take over the world isn’t a binary, at least from that agent’s subjective perspective. Rather, a given attempt will succeed with a given probability. I’m skipping over this for now, but in practice, the likelihood of success, for a given AI system, is indeed relevant to whether attempting a takeover is worth it. And it means that there might not be a specific point at which some AI system “gets a DSA.” Rather, there might be a succession of AI systems, each increasingly likely to succeed at takeover if they went for it.

  11. ^

    I also think we should do this with human agents – but I’ll focus on AI agents here.

  12. ^

    We can also try to avoid building “agents” of the relevant kind at all, and focus on getting the benefits of AI in other ways. But for the reasons I describe in section 3 here, I do expect humans to build lots of AI agents, so I won’t focus on this.

  13. ^

    We can think of (1) as a special instance of (3) – e.g., a case where the set in question has only a single agent.

  14. ^

    See e.g. here.

  15. ^

    As ever, you could just not build superintelligent AI agents like agent A at all, and try to get most of the benefits of AI some other way.

  16. ^

    I’m counting high-fidelity human brain emulations as “human” for present purposes.

  17. ^

    I wrote a bit more about this here [LW · GW].

  18. ^

There’s a case for expecting sufficiently superintelligent agents to succeed in coordinating to avoid zero-sum forms of conflict like actual war; but this doesn’t mean that the relevant agents, in this sort of scenario, will be smart enough and in a position to do this.

  19. ^

    This is stretching the notion of a “DSA” somewhat, because the uncoordinated AIs in question won’t necessarily be executing a coherent “strategy,” but so it goes.

  20. ^

    See related discussion from Christiano here [AF · GW]: 

“Eventually we reach the point where we could not recover from a correlated automation failure. Under these conditions influence-seeking systems stop behaving in the intended way, since their incentives have changed---they are now more interested in controlling influence after the resulting catastrophe than continuing to play nice with existing institutions and incentives.

    An unrecoverable catastrophe would probably occur during some period of heightened vulnerability---a conflict between states, a natural disaster, a serious cyberattack, etc.---since that would be the first moment that recovery is impossible and would create local shocks that could precipitate catastrophe. The catastrophe might look like a rapidly cascading series of automation failures: A few automated systems go off the rails in response to some local shock. As those systems go off the rails, the local shock is compounded into a larger disturbance; more and more automated systems move further from their training distribution and start failing. Realistically this would probably be compounded by widespread human failures in response to fear and breakdown of existing incentive systems---many things start breaking as you move off distribution, not just ML.”

  21. ^

    Another example might be: a version of the Trinity Test where Bethe was more uncertain about his calculations re: igniting the atmosphere.

  22. ^

    I haven't pinned this down in detail, but roughly, I tend to think of it as single AI if it's working towards the same impartially-specified consequences in the world and if it has a unified causal history. So this would include copies of the same weights (with similar impartial goals), updates to those weights that preserve those goals, and new agents trained by old agents to have the same goals. But it wouldn't include AIs trained by different AI labs that happen to have similar goals; or different copies of an AI where the fact that they're different copies puts their goals at cross-purposes (e.g., they each care about what happens to their specific instance).

  23. ^

    Though standard discussions of DSAs don't t

  24. ^

     

8 comments


comment by Joe Collman (Joe_Collman) · 2024-06-06T05:23:52.677Z · LW(p) · GW(p)

I think the DSA framing is in keeping with the spirit of "first critical try" discourse.
(With that in mind, the below is more "this too seems very important", rather than "omitting this is an error".)

However, I think it's important to consider scenarios where humans lose meaningful control without any AI or group of AIs necessarily gaining a DSA. I think "loss of control" is the threat to think about, not "AI(s) take(s) control". Admittedly this gets into Moloch-related grey areas - but this may indicate that [humans do/don't have control] is too coarse-grained a framing.

I'd say that the key properties of "first critical try" are:

  • We have the option to trigger some novel process.
  • We're unlikely to stop the process once it starts, even if it's not going well.
    • Includes both [we can't stop it] and [we won't stop it].
  • If the process goes badly, the odds of doom greatly increase.
  • There's a significant chance the process goes badly.

My guess is that the most likely near-term failure mode doesn't start out as [some set of AIs gets a DSA], but rather [AI capability increase selects against meaningful human control] - and the DSA stuff is downstream of that.

This is a possibility with the [individually controllable powerful AI assistants] approach - whether or not this immediately takes things to transformational AI territory. Suppose we get the hoped-for >10x research speedup. Do we have a principled strategy for controlling the collective system this produces? I haven't heard one. I wouldn't say we're doing a good job of controlling the current collective system.

I've heard cases for [this will speed things up], and [here are some good things this would make easier] but not for [overall, such a process should be expected to take things in a less doomy direction].

For such cases "you can’t learn enough from analogous but lower-stakes contexts" ought not to apply. However, I'd certainly expect "we won’t learn enough from analogous but lower-stakes contexts" (without huge efforts to avoid this).

comment by faul_sname · 2024-07-27T05:29:35.477Z · LW(p) · GW(p)

Does any specific human or group of humans currently have "control" in the sense of "that which is lost in a loss-of-control scenario"? If not, that indicates to me that it may be useful to frame the risk as "failure to gain control".

comment by Joe Collman (Joe_Collman) · 2024-07-29T09:54:12.462Z · LW(p) · GW(p)

It may be better to think about it that way, yes - in some cases, at least.

Probably it makes sense to throw in some more variables.
Something like:

  • To stand x chance of property p applying to system s, we'd need to apply resources r.

In these terms, [loss of control] is something like [ensuring important properties becomes much more expensive (or impossible)].

comment by faul_sname · 2024-07-29T10:22:41.081Z · LW(p) · GW(p)

I think the most important part of your "To stand x chance of property p applying to system s, we'd need to apply resources r" model is the word "we".

Currently, there exists no "we" in the world that can ensure that nobody in the world does some form of research, or at least no "we" that can do that in a non-cataclysmic way. The International Atomic Energy Agency comes the closest of any group I'm aware of, but the scope is limited and also it does its thing mainly by controlling access to specific physical resources rather than by trying to prevent a bunch of people from doing a thing with resources they already possess.

If "gain a DSA (or cause some trusted other group to gain a DSA) over everyone who could plausibly gain a DSA in the future" is a required part of your threat mitigation strategy, I am not optimistic about the chances for success but I'm even less optimistic about the chances of that working if you don't realize that's the game you're trying to play.

comment by Joe Collman (Joe_Collman) · 2024-07-29T10:53:26.932Z · LW(p) · GW(p)

I don't think [gain a DSA] is the central path here.
It's much closer to [persuade some broad group that already has a lot of power collectively].

I.e. the likely mechanism is not: [add the property [has DSA] to [group that will do the right thing]].
But closer to: [add the property [will do the right thing] to [group that has DSA]].

comment by Logan Zoellner (logan-zoellner) · 2024-06-05T13:14:46.645Z · LW(p) · GW(p)

3. At some point, some set of AI agents will be such that:

  • they will all be able to coordinate with each other to try to kill all humans and take over the world; and
  • if they choose to do this, their takeover attempt will succeed.[13]

 

There are way too many assumptions about what "AI" is baked into this.  Suppose you went back 50 years and told people  "in the year 2024, everyone will have an AI agent built into their phone that they rely on for critical-to-life tasks they do (such as finding directions to the grocery store)."

The 1950s observer would probably say something like "that sounds like a dangerous AI system that could easily take control of the world".  But in fact, no one worries about Siri "coordinating" to suddenly give us all wrong directions to the grocery store, because that's not remotely how assistants work.

Trying to reason about what future AI agents will look like is basically equally fraught.  

Second: for any failure you don't want to ever happen, you always need to avoid that failure on the first try (and the second, the third, etc). 

I think this is the crux of my concern.  Obviously if AI kills us all, there will be some moment when that was inevitable, but merely stating that fact doesn't add any additional information.  I think any attempt to predict what AI agents will do from "pure reasoning" as opposed to careful empirical study of the capabilities of existing AI models is basically doomed to failure.

comment by Joe Carlsmith (joekc) · 2024-06-05T21:32:05.587Z · LW(p) · GW(p)

in fact, no one worries about Siri "coordinating" to suddenly give us all wrong directions to the grocery store, because that's not remotely how assistants work.

 

Note that Siri is not capable of threatening types of coordination. But I do think that by the time we actually face a situation where AIs are capable of coordinating to successfully disempower humanity, we may well indeed know enough about "how they work" that we aren't worried about it.

comment by JBlack · 2024-06-05T05:42:27.450Z · LW(p) · GW(p)

The degree to which 6-short is additionally worrying (once we’ve taken into account (1) and (3)) depends on the probability that the relevant agents will all choose to seek power in problematic ways within the relevant short period of time, without coordinating. If the “short period” is “the exact same moment,” the relevant sort of correlation seems unlikely.

Is this really true? It seems plausible that some external event (which could be practically anything) could alert a sufficient subset of agents to all start trying to seek power as soon as they notice that event, and not before.