New report: A review of the empirical evidence for existential risk from AI via misaligned power-seeking

post by Harlan, rosehadshar · 2024-04-04T23:41:26.439Z · LW · GW · 5 comments

This is a link post for https://blog.aiimpacts.org/p/new-report-a-review-of-the-empirical

Visiting researcher Rose Hadshar recently published a review of some evidence for existential risk from AI, focused on empirical evidence for misalignment and power-seeking. (Previously from this project: a blog post outlining some of the key claims that are often made about AI risk, a series of interviews of AI researchers, and a database of empirical evidence for misalignment and power-seeking.)

In this report, Rose looks into evidence for:

- Misalignment,^[1] where AI systems develop goals which are misaligned with human goals; and
- Power-seeking,^[2] where misaligned AI systems seek power to achieve their goals.

Rose found the current state of this evidence for existential risk from misaligned power-seeking to be concerning but inconclusive:

There is empirical evidence of AI systems developing misaligned goals (via specification gaming^[3] and via goal misgeneralization^[4]), including in deployment (via specification gaming), but it's not clear to Rose whether these problems will scale far enough to pose an existential risk.

Rose considers the conceptual arguments for power-seeking behavior from AI systems to be strong, but notes that she could not find any clear examples of power-seeking AI so far.

With these considerations, Rose thinks that it’s hard to be very confident either that misaligned power-seeking poses a large existential risk, or that it poses no existential risk. She finds this uncertainty to be concerning, given the severity of the potential risks in question. Rose also expressed that it would be good to have more reviews of evidence, including evidence for other claims about AI risks^[5] and evidence against AI risks.^[6]

[1] “An AI is misaligned whenever it chooses behaviors based on a reward function that is different from the true welfare of relevant humans.” (Hadfield-Menell & Hadfield, 2019)
[2] Rose follows (Carlsmith, 2022) and defines power-seeking as “active efforts by an AI system to gain and maintain power in ways that designers didn’t intend, arising from problems with that system’s objectives.”
[3] "Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome." (Krakovna et al., 2020)
[4] "Goal misgeneralization is a specific form of robustness failure for learning algorithms in which the learned program competently pursues an undesired goal that leads to good performance in training situations but bad performance in novel test situations." (Shah et al., 2022a)
[5] Joseph Carlsmith’s report Is Power-Seeking AI an Existential Risk? reviews some evidence for most of the claims that are central to the argument that AI will pose an existential risk.
[6] Last year, Katja wrote Counterarguments to the basic AI x-risk case, which outlines some arguments against existential risk from AI.

5 comments

Comments sorted by top scores.

comment by Matthew Barnett (matthew-barnett) · 2024-04-05T03:44:50.601Z · LW(p) · GW(p)

It's totally possible I missed it, but does this report touch on the question of whether power-seeking AIs are an existential risk, or does it just touch on the questions of whether future AIs will have misaligned goals and will be power-seeking in the first place?

In my opinion, there's quite a big leap from "Misaligned AIs will seek power" to "Misaligned AI is an existential risk". Let me give an analogy to help explain what I mean.

Suppose we were asking whether genetically engineered humans are an existential risk. We can ask:

Will some genetically engineered humans have misaligned goals? The answer here is almost certainly yes.
- If by "misaligned" all we mean is that some of them have goals that are not identical to the goals of the rest of humanity, then the answer is obviously yes. Individuals routinely have indexical goals (such as money for themselves, status for themselves, taking care of family) that are not what the rest of humanity wants.
- If by "misaligned" what we mean is that some of them are "evil" i.e., they want to cause destruction or suffering on purpose, and not merely as a means to an end, then the answer here is presumably also yes, although it's less certain.
Will some genetically engineered humans seek power? Presumably, also yes.

After answering these questions, did we answer the original question of "Are genetically engineered humans an existential risk?" I'd argue no, because even if some genetically engineered humans have misaligned goals, and seek power, and even if they're smarter, more well-coordinated than non-genetically engineered humans, it's still highly questionable whether they'd kill all the non-genetically engineered humans in pursuit of these goals. This premise needs to be justified, and in my opinion, it's what holds up ~the entire argument here.

Replies from: daniel-kokotajlo, rosehadshar, quetzal_rainbow

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-04-05T11:24:29.832Z · LW(p) · GW(p)

I'd argue no, because even if some genetically engineered humans have misaligned goals, and seek power, and even if they're smarter, more well-coordinated than non-genetically engineered humans, it's still highly questionable whether they'd kill all the non-genetically engineered humans in pursuit of these goals.

1. Wanna spell out the reasons why? (a) They'd be resisted by good gengineered humans, (b) they might be misaligned but not in ways that make them want to kill everyone, (c) they might not be THAT much smarter, such that they can evade the system of laws and power-distribution meant to stop small groups from killing everyone. Anything I missed?

2. Existential risk =/= everyone dead. That's just the central example. Permanent dystopia is also an existential risk, as is sufficiently big (and unjustified, and irreversible) value drift.

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-04-05T16:12:15.380Z · LW(p) · GW(p)

Wanna spell out the reasons why?

I think Matthew's view is mostly spelled out in this comment [LW(p) · GW(p)] and also in a few more comments on his shortform on the EA forum.

TLDR: his view is that very powerful (and even coordinated) misaligned entities that want resources would end up with almost all the resources (e.g. >99%), but this likely wouldn't involve violence.

Existential risk =/= everyone dead. That's just the central example. Permanent dystopia is also an existential risk, as is sufficiently big (and unjustified, and irreversible) value drift.

I think the above situation I described (no violence but >99% of resources owned by misaligned entities) would still count as existential risk from a conventional longtermist perspective, but awkwardly the definition of existential risk depends on a notion of value, in particular what counts as "substantially curtailing potential goodness".

Whether or not you think that humanity only getting 0.1% of resources is "substantially curtailing total goodness" depends on other philosophical views.

I think it's worth tabooing this word in this context for this reason.

(I disagree with Matthew about the chance of violence and also about how bad it is to cede 99% of resources.)

↑ comment by rosehadshar · 2024-04-07T09:29:50.635Z · LW(p) · GW(p)

In answer to "It's totally possible I missed it, but does this report touch on the question of whether power-seeking AIs are an existential risk, or does it just touch on the questions of whether future AIs will have misaligned goals and will be power-seeking in the first place?":

- No, the report doesn't directly explore whether power-seeking = existential risk
- I wrote the report more in the mode of 'many arguments for existential risk depend on power-seeking (and also other things). Let's see what the empirical evidence for power-seeking is like (as it's one, though not the only, prereq for a class of existential risk arguments).'
- Basically the report has a reasonably limited scope (but I think it's still worth gathering the evidence for this more constrained thing)

↑ comment by quetzal_rainbow · 2024-04-05T13:51:47.492Z · LW(p) · GW(p)

Will some genetically engineered humans have misaligned goals? The answer here is almost certainly yes.
If by "misaligned" all we mean is that some of them have goals that are not identical to the goals of the rest of humanity, then the answer is obviously yes. Individuals routinely have indexical goals (such as money for themselves, status for themselves, taking care of family) that are not what the rest of humanity wants.
If by "misaligned" what we mean is that some of them are "evil" i.e., they want to cause destruction or suffering on purpose, and not merely as a means to an end, then the answer here is presumably also yes, although it's less certain.

This is very strange reasoning. Misaligned goals mean that the entity basically doesn't care about our existence or well-being: it doesn't gain anything from us being alive and well relative to us being turned into paperclips. For genetically engineered humans, the reverse is very likely to be true: they are going to love other humans, be friends with them, or take pride in their position in the human social hierarchy, even if they are selfish by human standards, and it is not clear why they should be selfish.
