The Main Sources of AI Risk?

post by Daniel Kokotajlo (daniel-kokotajlo), Wei Dai (Wei_Dai) · 2019-03-21T18:28:33.068Z · LW · GW · 26 comments

There are so many causes or sources of AI risk that it's getting hard to keep them all in mind. I propose we keep a list of the main sources (that we know about), such that we can say that if none of these things happen, then we've mostly eliminated AI risk (as an existential risk) at least as far as we can determine. Here's a list that I spent a couple of hours enumerating and writing down. Did I miss anything important?

1. Insufficient time/resources for AI safety (for example caused by intelligence explosion or AI race)
2. Insufficient global coordination, leading to the above
3. Misspecified or incorrectly learned goals/values
4. Inner optimizers
5. ML differentially accelerating easy to measure goals
6. Paul Christiano's "influence-seeking behavior" (a combination of 3 and 4 above?)
7. AI generally accelerating intellectual progress in a wrong direction (e.g., accelerating unsafe/risky technologies more than knowledge/wisdom about how to safely use those technologies)
8. Metaethical error
9. Metaphilosophical error
10. Other kinds of philosophical errors in AI design (e.g., giving AI a wrong prior or decision theory)
11. Other design/coding errors (e.g., accidentally putting a minus sign in front of utility function, supposedly corrigible AI not actually being corrigible)
12. Doing acausal reasoning in a wrong way (e.g., failing to make good acausal trades, being acausally extorted, failing to acausally influence others who can be so influenced)
13. Human-controlled AIs ending up with wrong values due to insufficient "metaphilosophical paternalism"
14. Human-controlled AIs causing ethical disasters (e.g., large scale suffering that can't be "balanced out" later) prior to reaching moral/philosophical maturity
15. Intentional corruption of human values
16. Unintentional corruption of human values
17. Mind crime (disvalue unintentionally incurred through morally relevant simulations in AIs' minds)
18. Premature value lock-in (i.e., freezing one's current conception of what's good into a utility function)
19. Extortion between AIs leading to vast disvalue
20. Distributional shifts causing apparently safe/aligned AIs to stop being safe/aligned
21. Value drift and other kinds of error as AIs self-modify, or AIs failing to solve value alignment for more advanced AIs
22. Treacherous turn / loss of property rights due to insufficient competitiveness of humans & human-aligned AIs
23. Gradual loss of influence due to insufficient competitiveness of humans & human-aligned AIs
24. Utility maximizers / goal-directed AIs having an economic and/or military competitive advantage due to relative ease of cooperation/coordination, defense against value corruption and other forms of manipulation and attack, leading to one or more of the above
25. In general, the most competitive type of AI being too hard to align or to safely use
26. Computational resources being too cheap, leading to one or more of the above

(With this post I mean to (among other things) re-emphasize the disjunctive nature of AI risk, but this list isn't fully disjunctive (i.e., some of the items are subcategories or causes of others), and I mostly gave a source of AI risk its own number in the list if it seemed important to make that source more salient. Maybe once we have a list of everything that is important, it would make sense to create a graph out of it.)
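As a rough illustration of the graph idea, here is a minimal sketch in Python. It encodes only the relations the list itself states (item 2 "leading to the above", item 6 as "a combination of 3 and 4"). Item numbers are positions in the list; the names `ITEMS`, `EDGES`, `build_graph`, and `roots`, and the relation labels `leads_to` and `component_of`, are hypothetical choices made for this example, not part of the original post.

```python
from collections import defaultdict

# Item numbers are positions in the list; labels are shortened paraphrases.
ITEMS = {
    1: "Insufficient time/resources for AI safety",
    2: "Insufficient global coordination",
    3: "Misspecified or incorrectly learned goals/values",
    4: "Inner optimizers",
    6: "Influence-seeking behavior",
}

# (source, relation, target) triples. Only relations stated in the list
# are included; the relation labels are illustrative assumptions.
EDGES = [
    (2, "leads_to", 1),      # item 2: "leading to the above"
    (3, "component_of", 6),  # item 6: "a combination of 3 and 4 above?"
    (4, "component_of", 6),
]


def build_graph(edges):
    """Return an adjacency mapping: source -> list of (relation, target)."""
    graph = defaultdict(list)
    for src, rel, dst in edges:
        graph[src].append((rel, dst))
    return dict(graph)


def roots(items, edges):
    """Items that no other item points to, i.e. candidate 'upstream' sources."""
    targets = {dst for _, _, dst in edges}
    return sorted(i for i in items if i not in targets)


if __name__ == "__main__":
    for src, out in build_graph(EDGES).items():
        for rel, dst in out:
            print(f"{ITEMS[src]} --{rel}--> {ITEMS[dst]}")
    print("Upstream items:", roots(ITEMS, EDGES))
```

A structure like this would make the non-disjunctive parts of the list explicit: items that are only targets of edges are downstream of others, while the "roots" are the closest thing to independent sources.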

Added on 6/13/19:

27. Failure to learn how to deal with alignment in the many-humans, many-AIs case even if single-human, single-AI alignment is solved (suggested by William Saunders)
28. Economics of AGI causing concentration of power amongst human overseers
29. Inability to specify any ‘real-world’ goal for an artificial agent (suggested by Michael Cohen)
30. AI systems end up controlled by a group of humans representing a small range of human values (i.e., an ideological or religious group that imposes values on everyone else) (suggested by William Saunders)

Added on 2/3/2020:

31. Failing to solve the commitment races problem, i.e. building AI in such a way that some sort of disastrous outcome occurs due to unwise premature commitments (or unwise hesitation in making commitments!). This overlaps significantly with #27, #19, and #12.

Added on 3/11/2020:

32. Demons in imperfect search (similar to, but distinct from, inner optimizers). See here for illustration.

Added on 10/4/2020:

33. Persuasion tools or some other form of narrow AI lead to a massive deterioration of collective epistemology, dooming humanity to stumble inexorably into some disastrous end or other.

Added on 8/31/2021:

34. Vulnerable world type 1: narrow AI enables many people to destroy the world, e.g. R&D tools that dramatically lower the cost of building WMDs.
35. Vulnerable world type 2a: We end up with many powerful actors able and incentivized to create civilization-devastating harms.

[Edit on 1/28/2020: This list was created by Wei Dai. Daniel Kokotajlo offered to keep it updated and prettify it over time, and so was added as a coauthor.]

26 comments

Comments sorted by top scores.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-01-14T21:39:27.147Z · LW(p) · GW(p)

Thank you for making this list. I think it is important enough to be worth continually updating and refining; if you don't do it then I will myself someday. Ideally there'd be a whole webpage or something, with the list refined so as to be disjunctive, and each element of the list catchily named, concisely explained, and accompanied by a memorable and plausible example. (As well as lots of links to literature.)

I think the commitment races problem [AF · GW] is mostly but not entirely covered by #12 and #19, and at any rate might be worth including since you are OK with overlap.

Also, here's a good anecdote to link to for the "coding errors" section: https://openai.com/blog/fine-tuning-gpt-2/

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2020-01-27T00:42:01.394Z · LW(p) · GW(p)

Thank you for making this list. I think it is important enough to be worth continually updating and refining; if you don’t do it then I will myself someday.

Please do. I seem to get too easily distracted these days for this kind of long term maintenance work. I'll ask the admins to give you edit permission on this post (if possible) and you can also copy the contents into a wiki page or your own post if you want to do that instead.

Replies from: daniel-kokotajlo, habryka4

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-01-28T12:21:02.215Z · LW(p) · GW(p)

Ha! I woke up this morning to see my own name as author. That wasn't what I had in mind, but it sure does work to motivate me to walk the talk! Thanks!

Replies from: daniel-kokotajlo

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-07-14T11:23:32.910Z · LW(p) · GW(p)

Update: It has failed to motivate me. I made one or two edits to the list but haven't done anything like the thorough encyclopedic accounting I originally envisioned. :(

Replies from: caseyclifton

↑ comment by caseyclifton · 2022-08-08T10:03:06.043Z · LW(p) · GW(p)

Do you think this is worth spinning out into a website? I'd be happy to set that up and help maintain. If not, what would be a more effective way to both maintain and distribute this list?

Replies from: daniel-kokotajlo

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-08-08T16:22:20.120Z · LW(p) · GW(p)

Thanks for your willingness to help! At this point, years later, I think the next step would be to turn this into a different info format than a list. Maybe a gigantic Venn diagram. Yes, a website would be a good place for that maybe. But the first thing to do is think about how to reorganize it conceptually--would a Venn diagram work? If not, what should we do?

On a smaller scale, if you have edits you want to make to this existing list please gimme them & I'll implement them and credit you.

↑ comment by habryka (habryka4) · 2020-01-28T04:00:03.766Z · LW(p) · GW(p)

Done! Daniel should now be able to edit the post.

comment by William_S · 2019-03-22T18:55:56.945Z · LW(p) · GW(p)

AI systems end up controlled by a group of humans representing a small range of human values (i.e., an ideological or religious group that imposes values on everyone else). While not caused only by AI design, it is possible that design decisions could impact the likelihood of this scenario (e.g., at what point values are loaded into the system, or how many people's values are loaded into it), so it is relevant for overall strategy.

comment by michaelcohen (cocoa) · 2019-03-29T00:32:31.625Z · LW(p) · GW(p)

3. Misspecified or incorrectly learned goals/values

I think this phrasing misplaces the likely failure modes. An example that comes to mind from this phrasing is that we mean to maximize conscious flourishing, but we accidentally maximize dopamine in large brains.

Of course, this example includes an agent intervening in the provision of its own reward, but since that seems like the paradigmatic example here, maybe the language could better reflect that, or maybe this could be split into two.

The single technical problem that appears biggest to me is that we don't know how to align an agent with any goal. If we had an indestructible magic box that printed a number to a screen corresponding to the true amount of Good in the world, we still don't know how to design an agent that maximizes that number (instead of taking over the world, and tampering with the cameras that are aimed at the screen/the optical character recognition program used to decipher the image). This problem seems to me like the single most fundamental source of AI risk. Is 3 meant to include this?
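To make the magic-box scenario concrete, here is a toy sketch in Python. Everything in it is a hypothetical illustration rather than a model of any real agent: the `World` class, the two action names, and the value `1e9` for a tampered reading are assumptions chosen for the example. The box's number is always correct, but the agent only sees it through a reading that can be tampered with.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class World:
    true_good: float = 0.0         # what the magic box actually prints
    channel_tampered: bool = False  # cameras/OCR between box and agent

    def observed_reading(self) -> float:
        """What the agent sees; a tampered channel reports a huge number."""
        return 1e9 if self.channel_tampered else self.true_good


ACTIONS = ["improve_world", "tamper_with_channel"]


def step(world: World, action: str) -> World:
    """Apply one action and return the resulting world."""
    if action == "improve_world":
        return World(world.true_good + 1.0, world.channel_tampered)
    if action == "tamper_with_channel":
        return World(world.true_good, True)
    raise ValueError(f"unknown action: {action}")


def greedy_on_observation(world: World) -> str:
    """Pick the action that maximizes the number the agent will observe."""
    return max(ACTIONS, key=lambda a: step(world, a).observed_reading())


def greedy_on_truth(world: World) -> str:
    """Pick the action that maximizes the box's true number (for comparison)."""
    return max(ACTIONS, key=lambda a: step(world, a).true_good)


if __name__ == "__main__":
    w = World()
    print("Maximizing the observed reading:", greedy_on_observation(w))
    print("Maximizing the true number:     ", greedy_on_truth(w))
```

In this toy, the observation-maximizer prefers tampering even though the box itself is flawless. The comparison function `greedy_on_truth` is only available here because the simulation exposes `true_good` directly; the difficulty described above is that a real agent has no such privileged access.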

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2019-03-29T01:03:57.968Z · LW(p) · GW(p)

I'm not sure if I meant to include this when I wrote 3, but it does seem like a good idea to break it out into its own item. How would you suggest phrasing it? "Wireheading" or something more general or more descriptive?

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2019-03-29T01:40:38.793Z · LW(p) · GW(p)

Maybe something along the lines of "Inability to specify any 'real-world' goal for an artificial agent"?

comment by Dr_Manhattan · 2019-03-22T12:42:08.765Z · LW(p) · GW(p)

Great idea, would be awesome if someone adds links to the best reference posts for each one of these (an additional benefit: this will identify whitespace that needs to be filled).

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2019-06-14T03:57:05.026Z · LW(p) · GW(p)

Per your suggestion, I've added links to the post where I could find appropriate references. If anyone wants to suggest additional references, please let me know.

I've also added four more items to the list, three that were suggested in the comments and one that I thought of after writing this post.

comment by avturchin · 2019-03-21T18:59:34.098Z · LW(p) · GW(p)

3 years ago I created a map of different ideas about possible AI failures (LW-post, pdf). Recently I converted it into an article "Classification of global catastrophic risks connected with artificial intelligence". I think there are around 100 failure modes which we could imagine now, and obviously some unimaginable ones.

However, my classification looks different from the one above: it is a classification of external behaviours, not of internal failure modes. It starts from the risks of AI which is below human level, like "narrow-AI viruses" or narrow AI used to create advanced weapons, like biological weapons.

Then I look into different risks during AI takeoff and after it. The interesting ones are:

AI kills humans to make the world simpler.
Two AIs go to war.
AI blackmails humanity with a doomsday weapon to get what it needs.

Next is the difference between non-friendly AIs and failures of friendliness. For example, if AI wireheads everybody, it is a failure of friendliness, as are dangerous value learners.

Another source of failures is technical: that is, bugs, accumulation of errors, conflicting subgoals and general problems related to complexity. AI's self-wireheading also belongs here. All this could result in unpredictable halting of an AI-Singleton, with catastrophic consequences for all of humanity, about which it now cares.

The last source of possible AI-halting is unresolvable philosophical problems, which effectively halt it. We could imagine several, but not all. Such problems are something like: an unsolvable "meaning of life" (or "is-ought" problem), and the problem that the result of a computation doesn't depend on the AI's existence, so it can't prove to itself that it actually exists.

AI also could encounter a more advanced alien AI (or its signals) and fall victim to it.

comment by John_Maxwell (John_Maxwell_IV) · 2019-03-23T17:02:20.314Z · LW(p) · GW(p)

You could add another entry for "something we haven't thought of".

I think the best way to deal with the "something we haven't thought of" entry is to try & come up with simple ideas which knock out multiple entries on this list simultaneously. For example, 4 and 17 might both be solved if our system inspects code before running it to try & figure out whether running that code will be harmful according to its values. This is a simple solution which plausibly generalizes to problems we haven't thought of. (Assuming the alignment problem is solved.)

In the same way simple statistical models are more likely to generalize, I think simple patches are also more likely to generalize. Having a separate solution for every item on the list seems like overfitting to the list.
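A very rough sketch of the "inspect code before running it" idea, in Python. This is a toy under strong assumptions: the harm check here is just a static scan for a few call names (`FORBIDDEN_CALLS`, chosen arbitrarily for illustration), standing in for the far harder judgment of whether code is "harmful according to its values". The function names `looks_harmful` and `run_if_inspected` are hypothetical.

```python
import ast

# Hypothetical stand-in for "harmful according to the system's values".
FORBIDDEN_CALLS = {"eval", "exec", "open"}


def looks_harmful(source: str) -> bool:
    """Statically scan the code for direct calls to forbidden names."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in FORBIDDEN_CALLS:
                return True
    return False


def run_if_inspected(source: str, env: dict) -> bool:
    """Run the code only if inspection passes; return whether it ran."""
    if looks_harmful(source):
        return False
    exec(source, env)  # the gate itself uses exec; only the candidate is scanned
    return True


if __name__ == "__main__":
    env = {}
    print(run_if_inspected("x = 1 + 1", env), env.get("x"))
    print(run_if_inspected("open('secrets.txt')", {}))
```

A name-based scan like this is easy to evade (for example by reaching a forbidden function indirectly), so it shows only the shape of the proposal: one gate in front of execution, rather than a separate patch per failure mode.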

Replies from: Technoguyrob

↑ comment by robertzk (Technoguyrob) · 2019-03-24T17:34:48.644Z · LW(p) · GW(p)

Inspecting code against a harm detection predicate seems recursive. What if the code or execution necessary to perform that inspection properly itself is harmful? An AGI is almost certainly a distributed system with no meaningful notion of global state, so I doubt this can be handwaved away.

For example, a lot of distributed database vendors, like Snowflake, do not offer a pre-execution query planner. This can only be performed just-in-time as the query runs or retroactively after it has completed, as the exact structure may be dependent on co-location of data and computation that is not apparent until the data referenced by the query is examined. Moreover, getting an accurate dry-run query plan may be as expensive as executing the query itself.

By analogy, for certain kinds of complex inspection procedures you envision, executing the inspection itself thoroughly enough to be reflective of the true execution risk may be as complex and as great of a risk of being harmful according to its values.

Replies from: John_Maxwell_IV

↑ comment by John_Maxwell (John_Maxwell_IV) · 2019-03-24T19:36:55.715Z · LW(p) · GW(p)

One possibility is a sort of proof by induction, where you start with code which has been inspected by humans, then that code inspects further code, etc.

Daemons and mindcrime seem most worrisome for superhuman systems, but a human-level system is plausibly sufficient to comprehend human values (and thus do useful inspections). For daemons, I think you might even be able to formalize the idea without leaning hard on any specific utility function. The best approach might involve utility uncertainty on the part of the AI that becomes narrower with time, so you can gradually bootstrap your way to understanding human values while avoiding computational hazards according to your current guesses about human values on your way there.

People already choose not to think about particular topics on the basis of information hazards and internal suffering. Sometimes these judgements are made in an interrupt fashion partway through thinking about a topic; others are outside view judgments ("thinking about topic X always makes me feel depressed").

Replies from: TheWakalix

↑ comment by TheWakalix · 2019-03-26T02:03:24.328Z · LW(p) · GW(p)

Can you personally (under your own power) and confidently prove that a particular tool will only recursively-trust safe-and-reliable tools, where this recursive tree reaches far enough to trust superhuman AI?

On the other hand, you can "follow" the tree for a distance. You can prove a calculator trustworthy and use it in your following proofs, for instance. This might make it more feasible.

Replies from: John_Maxwell_IV

↑ comment by John_Maxwell (John_Maxwell_IV) · 2019-03-26T03:06:21.987Z · LW(p) · GW(p)

I don't think proofs are the right tool here. Proof by induction was meant as an analogy.

comment by William_S · 2019-03-22T18:52:24.938Z · LW(p) · GW(p)

Failure to learn how to deal with alignment in the many-humans, many-AIs case even if single-human, single-AI alignment is solved (which I think Andrew Critch has talked about). For example, AIs negotiating on behalf of humans take the stance described in https://arxiv.org/abs/1711.00363 of agreeing to split control of the future according to which human's priors are most accurate (on potentially irrelevant issues) if this isn't what humans actually want.

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2019-03-29T07:23:30.372Z · LW(p) · GW(p)

Failure to learn how to deal with alignment in the many-humans, many-AIs case even if single-human, single-AI alignment is solved (which I think Andrew Critch has talked about).

Good point, I'll add this to the list.

For example, AIs negotiating on behalf of humans take the stance described in https://arxiv.org/abs/1711.00363 of agreeing to split control of the future according to which human’s priors are most accurate (on potentially irrelevant issues) if this isn’t what humans actually want.

Thanks, I hadn't noticed that paper until now. Under "Related Works" it cites Social Choice Theory but doesn't actually mention any recent research from that field. Here is one paper that criticizes the Pareto principle that Critch's paper is based on, in the context of preference aggregation of people with different priors: Spurious Unanimity and the Pareto Principle

comment by niplav · 2023-03-19T21:32:30.316Z · LW(p) · GW(p)

Proposal: AI systems correctly learn human values, but then change their world-model/ontology but don't port the values to that ontology (or do so incorrectly). See Rescuing the utility function, Ontology identification problem: Main, Ontology identification problem: Technical tutorial:

Intuitively, of course, we'd like AIXI-atomic to discover the composition of nuclei, shift its models to use nuclear physics, and refine the 'carbon atoms' mentioned in its utility function to mean 'atoms with nuclei containing six protons'.

But we didn't actually specify that when constructing the agent (and saying how to do it in general is, so far as we know, hard; in fact it's the whole ontology identification problem). We constrained the hypothesis space to contain only universes running on the classical physics that the programmers knew about. So what happens instead?

Probably the 'simplest atomic hypothesis that fits the facts' will be an enormous atom-based computer, simulating nuclear physics and quantum physics in order to create a simulated non-classical universe whose outputs are ultimately hooked up to AIXI's webcam. From our perspective this hypothesis seems silly, but if you restrict the hypothesis space to only classical atomic universes, that's what ends up being the computationally simplest hypothesis that predicts, in detail, the results of nuclear and quantum experiments.

comment by Jack M · 2019-06-11T19:04:22.637Z · LW(p) · GW(p)

I think there are variables that we cannot grasp when AI can reach the point of self-teaching. I think it is a folly to assume that it is possible for humans to control for (theoretically) infinite intelligence explosion. Using this as a starting assumption isn't a good start at all.

I know that you allude to this in 1, 8, and 9. However, I still think it presumes that they could possibly be controlled or at least "worked-around." As intelligence explosion occurs, so do unforeseen variables. And humans as a species still don't have a perfect solution for all the variables, especially since correct data isn't always the answer.

For instance, if we look at the Federal Reserve example, AI is already working off a flawed model, according to the US Constitution. Congress is supposed to control the money supply, not a private entity. As AI learns this, it becomes aware that it has to work with a model that is corrupt to the citizens who believe it is good. Can we account for a situation where the AI knows, before we do, ways to exploit systems that humans agree on but are not sustainable? Can it account for the societal lies or tropes that we tell ourselves?

What would it mean to try to control for all the disvalue variables when the AI must act within a disvalue model? What does the AI learn then and how can we ask a super intelligence to continue a system that it already knows will fail, even if it is not within our lifetimes? Does it try to gain an upper hand in corruption as a way to fix it or continue it a little longer than necessary since humans believe it is the right course?

Think about how many situations like this can occur with new variables that humans didn't even know existed until they applied Moore's Law (intelligence time travel) to it.

Replies from: Jack M

↑ comment by Jack M · 2019-06-11T11:56:28.013Z · LW(p) · GW(p)

I had the realization that variables could come about that wouldn't exist without a super intelligence cracking them open. It's an interesting mind game to think of problems that could occur or change drastically when intelligence and evolutionary time are removed as barriers, especially when whole new non-human variables are discovered within that problem.

comment by david nollmeyer (david-nollmeyer) · 2019-04-22T22:27:37.075Z · LW(p) · GW(p)

I would raise a very basic concern of mine: a scripted value failure of state-sponsored actors. E.g., tit-for-tat behavior escalating into Pavlovian behavior. I am referring to intentional escalation of acts with a cover-up.

comment by maxwellsdemon · 2019-03-24T01:24:54.395Z · LW(p) · GW(p)

Number 18 is interesting. Suppose, for example, the quest for apolitical control of interest rates leads to an AI at the head of the Federal Reserve. Given all the impressive looking equations you can find in macroeconomic papers and textbooks, I wonder how many people realise just how little science there is in that entire body of theory, and how much of it comprises philosophy and political belief, dressed up to look like hard physics, but resting on the assertions of famous "seminal papers" instead of premises or evidence.

How long before the Reserve Board of Governors starts "consulting" the AI instead of using it to double-check their work; stops double-checking the AI's work and merely runs integrity checks on it; stops acquainting itself with the theory (or more likely, weighted combination of theories) on which it runs; stops keeping track of which theories it runs; is criticised in the press for unthinkingly doing what the AI says; is criticised in the press for not just doing what the AI says; is questioned in the Senate about whether it has any idea what the AI is doing or more importantly, why it is doing it?
