Clarifying some key hypotheses in AI alignment

post by Ben Cottier (ben-cottier), Rohin Shah (rohinmshah) · 2019-08-15T21:29:06.564Z · LW · GW · 12 comments

Contents

  Diagram
  Caveats
  Background
  Supplementary information
    Definitions
    Agentive AGI?
    Incentive for agentive AGI?
    Modularity over integration?
    Current AI R&D extrapolates to AI services?
    Incidental agentive AGI?
    Convergent rationality?
    Mesa-optimisation?
    Discontinuity to AGI?
    Recursive self improvement?
    Discontinuity from AGI?
    ML scales to AGI?
    Deep insights needed?
    Broad basin for corrigibility?
    Inconspicuous failure?
    Creeping failure?
12 comments

We've created a diagram mapping out important and controversial hypotheses for AI alignment. We hope that this will help researchers identify and more productively discuss their disagreements.

Diagram

A part of the diagram. Click through to see the full version.


Caveats

  1. This does not decompose arguments exhaustively. It does not include every reason to favour or disfavour ideas. Rather, it is a set of key hypotheses and relationships with other hypotheses, problems, solutions, models, etc. Some examples of important but apparently uncontroversial premises within the AI safety community: orthogonality, complexity of value [LW · GW], Goodhart's Curse, AI being deployed in a catastrophe-sensitive context.
  2. This is not a comprehensive collection of key hypotheses across the whole space of AI alignment. It focuses on a subspace that we find interesting and that is relevant to more recent discussions we have encountered, but where key hypotheses seem relatively less illuminated. This includes rational agency and goal-directedness, CAIS, corrigibility, and the rationale of foundational and practical research. In hindsight, the selection criteria were something like:
    1. The idea is closely connected to the problem of artificial systems optimizing adversarially against humans.
    2. The idea must be explained sufficiently well that we believe it is plausible.
  3. Arrows in the diagram indicate flows of evidence or soft relations, not absolute logical implications — please read the "interpretation" box in the diagram. Also pay attention to any reasoning written next to a Yes/No/Defer arrow — you may disagree with it, so don't blindly follow the arrow!

Background

Much has been written in the way of arguments for AI risk. Recently there have been some talks [EA · GW] and posts [AF · GW] that clarify different arguments, point to open questions, and highlight the need for further clarification and analysis. We largely share their assessments and echo their recommendations.

One aspect of the discourse that seems to be lacking clarification and analysis is the reasons to favour one argument over another — in particular, the key hypotheses or cruxes that underlie the different arguments. Understanding this better will make discourse more productive and help people reason about their beliefs.

This work aims to collate and clarify hypotheses that seem key to AI alignment in particular (by "alignment" we mean the problem of getting an AI system to reliably do what an overseer intends, or try [? · GW] to do so, depending on which part of the diagram you are in). We point to which hypotheses, arguments, approaches, and scenarios are favoured and disfavoured by each other. It is neither comprehensive nor sufficiently nuanced to capture everyone's views, but we expect it to reduce confusion and encourage further analysis.

You can digest this post through the diagram or the supplementary information, which have their respective strengths and limitations. However, we recommend starting with the diagram, then if you are interested in related reading or our comments about a particular hypothesis, you can click the link on the box title in the diagram, or look it up below.

Supplementary information

The sections here list the hypotheses in the diagram, along with related readings and our more opinion-based comments, for lack of software to neatly embed this information (however, boxes in the diagram do link back to the headings here). Note that the diagram is the best way to understand relationships and high-level meaning, while this section offers more depth and resources for each hypothesis. Phrases in italics with the first letter capitalised refer to a box in the diagram.

Definitions

Agentive AGI?

Will the first AGI be most effectively modelled as a unitary, unbounded, goal-directed agent?

Incentive for agentive AGI?

Are there features of systems built like unitary goal-directed agents that offer a worthwhile advantage over other broadly superintelligent systems?

Modularity over integration?

In general and holding resources constant, is a collection of modular AI systems with distinct interfaces more competent than a single integrated AI system?

Current AI R&D extrapolates to AI services?

AI systems so far generally lack some key qualities traditionally ascribed to AGI, namely pursuing cross-domain long-term goals, having broad capabilities, and being persistent and unitary. Will this absence persist as AI R&D is increasingly automated and a broad collection of superintelligent services emerges?

Incidental agentive AGI?

Will systems built like unitary goal-directed agents develop incidentally from something humans or other AI systems build?

Convergent rationality?

Given sufficient capacity, does an AI system converge on rational agency and consequentialism to achieve its objective?

Mesa-optimisation?

Will there be optimisation processes that, in turn, develop powerful optimisers of their own to achieve their objective? A historical example is natural selection optimising for reproductive fitness and producing humans. Humans may have good reproductive fitness, but optimise for other things, such as pleasure, even when this diverges from fitness.
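
To make the two levels of optimisation concrete, here is a toy sketch of our own in Python (an illustration only, not a formalism from the mesa-optimisation literature): an outer hill-climbing loop selects a parameter that defines an inner search objective, and the selected artifact is itself an optimiser pursuing that proxy objective rather than the base objective.

```python
import random

def base_objective(x):
    # What the outer ("base") process selects for: closeness to 10.
    return -abs(x - 10)

def make_mesa_optimiser(proxy_target):
    # The artifact produced by the outer search is itself an optimiser:
    # it searches for an x close to its *own* proxy target.
    def mesa_optimiser():
        best_x = 0.0
        for _ in range(200):
            candidate = best_x + random.uniform(-1.0, 1.0)
            if abs(candidate - proxy_target) < abs(best_x - proxy_target):
                best_x = candidate
        return best_x
    return mesa_optimiser

# Outer loop: hill-climb over the proxy target the inner optimiser will pursue,
# scoring each candidate by the base objective applied to the inner search's output.
proxy = 0.0
for _ in range(500):
    candidate_proxy = proxy + random.uniform(-1.0, 1.0)
    if base_objective(make_mesa_optimiser(candidate_proxy)()) > base_objective(make_mesa_optimiser(proxy)()):
        proxy = candidate_proxy

print(round(proxy, 2), round(make_mesa_optimiser(proxy)(), 2))
```

In this toy, the selected proxy tracks the base objective on the "training distribution", but the inner optimiser only ever pursues its proxy; if circumstances shift, the two can come apart, analogous to humans optimising for pleasure rather than fitness.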

Discontinuity to AGI?

Will there be discontinuous, explosive growth in AI capabilities to reach the first agentive AGI? A discontinuity reduces the opportunity to correct course. Before AGI, it seems most likely to result from a qualitative change in the learning curve, due to an algorithmic insight, an architectural change, or a scale-up in resource utilisation.

Recursive self improvement?

Is an AI system that improves through its own AI R&D and self-modification capabilities more likely than distributed AI R&D automation? Recursive improvement would give some form of explosive growth, and so could result in unprecedented gains in intelligence.
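
As a rough intuition pump for why recursion matters (a toy sketch of our own, not a model from this post), compare a fixed external improvement rate with one that scales with current capability:

```python
# Toy comparison: "external" R&D adds a constant increment per step,
# while recursive self-improvement adds an increment proportional to
# current capability, giving geometric growth.
external = 1.0
recursive = 1.0
for step in range(1, 21):
    external += 0.5               # constant outside effort
    recursive += 0.5 * recursive  # effort scales with capability
    print(step, round(external, 2), round(recursive, 2))
```

Even with the same initial rate, the recursive curve grows by a constant factor per step, which is the kind of dynamic that could produce explosive, unprecedented gains in intelligence.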

Discontinuity from AGI?

Will there be discontinuous, explosive growth in AI capabilities after agentive AGI? A discontinuity reduces the opportunity to correct course. After AGI it seems most likely to result from a recursive improvement capability.

ML scales to AGI?

Do contemporary machine learning techniques scale to general human level (and beyond)? The state-of-the-art experimental research aiming towards AGI is characterised by a set of theoretical assumptions, such as reinforcement learning and probabilistic inference. Does this paradigm readily scale to general human-level capabilities without fundamental changes in the assumptions or methods?

Deep insights needed?

Do we need a much deeper understanding of intelligence to build an aligned AI?

Broad basin for corrigibility?

Do corrigible [AF · GW] AI systems have a broad basin of attraction to intent alignment [? · GW]? Corrigible AI tries to help an overseer. It acts to improve its model of the overseer's preferences, and is incentivised to make sure any subsystems it creates are aligned — perhaps even more so than itself. In this way, perturbations or errors in alignment tend to be corrected, and it takes a large perturbation to move out of this "basin" of corrigibility.

Inconspicuous failure?

Will a concrete, catastrophic AI failure be overwhelmingly hard to recognise or anticipate? For certain kinds of advanced AI systems (namely the goal-directed type), it seems that, short of near proof-level assurances, all safeguards are thwarted by the nearest unblocked strategy. Such AI may also be incentivised to deceive and manipulate its overseers, building towards a treacherous turn [LW · GW]. Or, in a machine learning framing, it would be very difficult to make such AI robust to distributional shift.

Creeping failure?

Would gradual gains in the influence of AI allow small problems to accumulate to catastrophe? The gradual aspect affords opportunity to recognise failures and think about solutions. Yet for any given incremental change in the use of AI, the economic incentives could outweigh the problems, such that we become more entangled in, and reliant on, a complex system that can collapse suddenly or drift from our values.

Thanks to Stuart Armstrong, Wei Dai, Daniel Dewey, Eric Drexler, Scott Emmons, Ben Garfinkel, Richard Ngo and Cody Wild for helpful feedback on drafts of this work. Ben especially thanks Rohin for his generous feedback and assistance throughout its development.

12 comments

Comments sorted by top scores.

comment by Ofer (ofer) · 2019-08-16T05:39:53.012Z · LW(p) · GW(p)

Meta: I think there's an attempt to deprecate the term "inner optimizer" in favor of "mesa-optimizer" (which I think makes sense when the discussion is not restricted to a subsystem within an optimized system).

Replies from: ben-cottier
comment by Ben Cottier (ben-cottier) · 2019-08-18T01:24:17.034Z · LW(p) · GW(p)

Noted and updated.

comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2019-12-02T20:52:38.052Z · LW(p) · GW(p)

Nice chart!

A few questions and comments:

  • Why the arrow from "agentive AI" to "humans are economically outcompeted"? The explanation makes it sound like it should point to "target loading fails"??
  • Suggestion: make the blue boxes without parents more apparent? e.g. a different shade of blue? Or all sitting above the other ones? (e.g. "broad basin of corrigibility" could be moved up and left).
Replies from: ben-cottier
comment by Ben Cottier (ben-cottier) · 2019-12-21T15:13:06.561Z · LW(p) · GW(p)

Thanks! Comments are much appreciated.

Why the arrow from "agentive AI" to "humans are economically outcompeted"? The explanation makes it sound like it should point to "target loading fails"??

It's been a few months and I didn't write in detail why that arrow is there, so I can't be certain of the original reason. My understanding now: humans getting economically outcompeted means AI systems are competing with humans, and therefore optimising against humans on some level. Goal-directedness enables/worsens this.

Looking back at the linked explanation [AF · GW] of the target loading problem, I understand it as more "at the source": coming up with a procedure that makes AI actually behave as intended. As Richard said there, one can think of it as a more general version of the inner-optimiser (mesa-optimiser) problem. This is why e.g. there's an arrow from "incidental agentive AGI" to "target loading fails". Pointing this arrow to it might make sense, but to me the connection isn't strong enough to be within the "clutter budget" of the diagram.

Suggestion: make the blue boxes without parents more apparent? e.g. a different shade of blue? Or all sitting above the other ones? (e.g. "broad basin of corrigibility" could be moved up and left).

Changing the design of those boxes sounds good. I don't want to move them because the arrows would get more cluttered.

comment by MichaelA · 2021-01-02T07:38:51.308Z · LW(p) · GW(p)

It occurs to me that all of the hypotheses, arguments, and approaches mentioned here (though not necessarily the scenarios) seem to be about the “technical” side of things. There are two main things I mean by that statement:

First, this post seems to be limited to explaining something along the lines of “x-risks from AI accidents”, rather than “x-risks from misuse of AI”, or “x-risk from AI as a risk factor” (e.g., how AI could potentially increase risks of nuclear war). 

I do think it makes sense to limit the scope that way, because: 

  • no one post can cover everything
  • you don’t want to make the diagram overwhelming
  • there’s a relatively clear boundary between what you’re covering and what you’re not
  • what you’re covering seems like the most relevant thing for technical AI safety researchers, whereas the other parts are perhaps more relevant for people working on AI strategy/governance/policy

And the fact that this post's scope is limited in that way seems somewhat highlighted by saying this is about AI alignment (whereas misuse could occur even with a system aligned to some human’s goals), and by saying “The idea is closely connected to the problem of artificial systems optimizing adversarially against humans.” 

But I think misuse and “risk factor”/“structural risk” issues are also quite important, that they should be on technical AI safety researchers’ radars to some extent, and that they probably interact in some ways with technical AI safety/alignment. So, personally, I think I’d have made that choice of scope even more explicit.

I’d also be really excited to see a post that takes the same approach as this one, but for those other classes of AI risks. 

---

The second thing I mean by the above statement is that this post seems to exclude non-technical factors that seem like they'd also impact the technical side or AI accident risks.

One crux of this type would be “AI researchers will be cautious/sensible/competent “by default””. Here are some indications that that’s an “important and controversial hypothes[is] for AI alignment”:

  • AI Impacts summarised some of Rohin’s comments as “AI researchers will in fact correct safety issues rather than hacking around them and redeploying. Shah thinks that institutions developing AI are likely to be careful because human extinction would be just as bad for them as for everyone else.” 
  • But my impression is that many people at MIRI would disagree with that, and are worried that people will merely “patch” issues in ways that don’t adequately address the risks. 
  • And I think many would argue that institutions won’t be careful enough, because they only pay a portion of the price of extinction; reducing extinction risk is a transgenerational global public good (see Todd and this comment [LW(p) · GW(p)]).
  • And I think views on these matters influence how much researchers would be happy with the approach of “Use feedback loops to course correct as we go”. I think the technical things influence how easily we theoretically could do that, while the non-technical things influence how much we realistically can rely on people to do that. 

So it seems to me that a crux like that could perhaps fit well in the scope of this post. And I thus think it’d be cool if someone could either (1) expand this post to include cruxes like that, or (2) make another post with a similar approach, but covering non-technical cruxes relevant to AI safety.

Replies from: ben-cottier
comment by Ben Cottier (ben-cottier) · 2021-01-24T15:38:43.086Z · LW(p) · GW(p)

To your first point - I agree both with why we limited the scope (but also, it was partly just personal interests), and that there should be more of this kind of work on other classes of risk. However, my impression is the literature and "public" engagement (e.g. EA forum, LessWrong) on catastrophic AI misuse/structural risk is too small to even get traction on work like this. We might first need more work to lay out the best arguments. Having said that, I'm aware of a fair amount of writing which I haven't got around to reading. So I am probably misjudging the state of the field.

To your second point - that seems like a real crux and I agree it would be good to expand in that direction. I know some people working on expanded and more in-depth models like this post. It would be great to get your thoughts when they're ready.

Replies from: MichaelA
comment by MichaelA · 2021-01-25T00:27:28.899Z · LW(p) · GW(p)

To your first point...

My impression is that there is indeed substantially less literature on misuse risk and structural risk, compared to accident risk, in relation to AI x-risk. (I'm less confident when it comes to a broader set of negative outcomes, not just x-risks, but that's also less relevant here and less important to me.) I do think that that might make the sort of work this post does less interesting if done in relation to those less-discussed types of risks, since fewer disagreements have been revealed there, so there's less to analyse and summarise. 

That said, I still expect interesting stuff along these lines could be done on those topics. It just might be a quicker job with a smaller output than this post. 

I collected a handful of relevant sources and ideas here. I think someone reading those things and providing a sort of summary, analysis, and/or mapping could be pretty handy, and might even be doable in just a day or so of work. It might also be relatively easy to provide more “novel ideas” in the course of that work than it would've been for your post, since misuse/structural risks seem like less charted territory. 

(Unfortunately I'm unlikely to do this myself, as I'm currently focused on nuclear war risk.)

---

A separate point is that I'd guess that one reason why there's less work on misuse/structural AI x-risk than on accidental AI x-risk is that a lot of people aren't aware of those other categories of risks, or rarely think about them, or assume the risks are much smaller. And I think one reason for that is that people often write or talk about "AI x-risk" while actually only mentioning accidental AI x-risk. That's part of why I say "So, personally, I think I’d have made that choice of scope even more explicit." 

(But again, I do very much like this post overall. And as a target of this quibble of mine, you're in good company - I have the same quibble with The Precipice [EA · GW]. I think one of the quibbles I most often have with posts I like is "This post seems to imply, or could be interpreted as implying, that it covers [topic]. But really it covers [some subset of that topic]. That's fair enough and still very useful, but I think it'd be good to be clearer about what the scope is.")

---

I know some people working on expanded and more in-depth models like this post. It would be great to get your thoughts when they're ready.

Sounds very cool! Yeah, I'd be happy to have a look at that work when it's ready.

comment by JenniferRM · 2021-10-22T23:32:24.884Z · LW(p) · GW(p)

Late arriving comment here! :-)

I started working with this as a rubric for analyzing tech companies... then (trying to number and rename in a useful way so that the diagram's contents could be quickly cited in writing) I noticed that the node positions at the bottom did not seem to have been optimized for avoiding crossed lines and easy reading.

Also "Creeping Failure" and "Inconspicuous Failure" have strong overlaps but are far from each other, and "ML Scales to AGI" (at the top right) has no arrow to "Many Powerful AIs" (at the lower left) which it seems like it obviously should have?

Another quirk: if NOT-"Agentive AGI" (in the middle near the top), then maybe "Comprehensive AI Services" (lower right) happens instead, but then the only arrow from there is a positive one to its next door neighbor "Context For AGI More Secure". However, if you think about it, humans having more really good tools seems to me like it would be an obviously useful input to "Use Feedback Loops To Correct Course As We Go" in the lower left, to make that work better? But again I find no such arrow.

A hypothesis that explains most of this is that your tools didn't allow fast iteration or easy validity checking and/or perhaps you didn't do a first draft in a spreadsheet and then convert to this for display purposes.

I started using an actual belief network tool to regenerate things, preparatory to assigning numbers and then letting "calculemus" determine my beliefs... and then noticed a Practice-Level-"smell", on my part, related to refactoring someone's old work without talking to them first.

Is this graph from August 2019 still relevant to anyone else's live models or active plans in October of 2021?

Also, if this document still connects to a living practice, is there a most-recently-updated version that would be a better jumping off point for refinement?

comment by Noa Nabeshima (noa-nabeshima) · 2021-01-25T21:27:38.472Z · LW(p) · GW(p)

What software did you use to produce this diagram?

comment by MichaelA · 2021-01-02T07:35:09.462Z · LW(p) · GW(p)

Thanks for this post! This seems like a really great way of visually representing how these different hypotheses, arguments, approaches, and scenarios interconnect. (I also think it’d be cool to see posts on other topics which use a similar approach!)

It seems that AGI timelines aren’t explicitly discussed here. (“Discontinuity to AGI” is mentioned, but I believe that's a somewhat distinct matter.) Was that a deliberate choice?

It does seem like several of the hypotheses/arguments mentioned here would feed into or relate to beliefs about timelines - in particular, Discontinuity to AGI, Discontinuity from AGI, Recursive self-improvement, ML scales to AGI, and Deep insights needed (or maybe not that last one, as that means “needed” for alignment in particular). But I don’t think beliefs about timelines would be fully accounted for by those hypotheses/arguments - beliefs about timelines could also involve cruxes like whether “Intelligence is a huge collection of specific things” or whether “There’ll be another AI winter before AGI”.

I’m not sure to what extent beliefs about timelines (aside from beliefs about discontinuity) would influence which of the approaches people should/would take, out of the approaches you list. But I imagine that beliefs that timelines are quite short might motivate work on ML or prosaic alignment rather than (Near) proof-level assurance of alignment or Foundational or “deconfusion” research. This would be because people might then think the latter approaches would take too long, such that our only shot (given these people’s beliefs) is doing ML or prosaic alignment and hoping that’s enough. (See also [EA · GW].)

And it seems like beliefs about timelines would feed into decisions about other approaches you don’t mention, like opting for investment or movement-building [EA · GW] rather than direct, technical work. (That said, it seems reasonable for this post’s scope to just be what a person should do once they have decided to work on AI alignment now.)

Replies from: ben-cottier
comment by Ben Cottier (ben-cottier) · 2021-01-24T15:01:52.609Z · LW(p) · GW(p)

It's great to hear your thoughts on the post!

I'd also like to see more posts that do this sort of "mapping". I think that mapping AI risk arguments is too neglected - more discussion and examples in this post by Gyrodiot [LW · GW]. I'm continuing to work collaboratively in this area in my spare time, and I'm excited that more people are getting involved.

We weren't trying to fully account for AGI timelines - our choice of scope was based on a mix of personal interest and importance. I know people currently working on posts similar to this that will go in-depth on timelines, discontinuity, paths to AGI, the nature of intelligence, etc. which I'm excited about!

I agree with all your points. You're right that this post's scope does not include broader alternatives for reducing AI risk. It was not even designed to guide what people should work on, though it can serve that purpose. We were really just trying to clearly map out some of the discourse, as a starting point and example for future work.