Nearcast-based "deployment problem" analysis

post by HoldenKarnofsky · 2022-09-21T18:52:22.674Z · LW · GW · 2 comments

Contents

  The roles of Magma and IAIA in the scenario
  Phase 1: before transformative AI that can safely help with “inaction risk”
    Magma’s goals: alignment, security, deals with other companies, producing public goods. 
  Phase 2: as aligned-and-transformative AI systems become available
  Drastic measures
  Phase 3: low-misalignment-risk period
  Implications 
  Notes
2 comments

When thinking about how to make the best of the most important century, two “problems” loom large in my mind: the alignment problem and the deployment problem.

This piece is part of a series in which I discuss what both problems might look like under a nearcast [AF · GW]: trying to answer key strategic questions about transformative AI, under the assumption that key events (e.g., the development of transformative AI) will happen in a world that is otherwise relatively similar to today's.

A previous piece [AF · GW] discussed the alignment problem; this one discusses the deployment problem.

I’m using the scenario laid out in the previous post, in which a major AI company (“Magma,” following Ajeya’s [LW · GW] terminology) has good reason to think that it can develop transformative AI very soon (within a year), using what Ajeya calls “human feedback on diverse tasks” (HFDT) - and has some time (more than 6 months, but less than 2 years[1]) to set up special measures to reduce the risks of misaligned AI before there’s much chance of someone else deploying transformative AI. I discuss what Magma would ideally do in this situation.

I’m also introducing another hypothetical actor in this scenario, “IAIA”[2]: an organization, which could range from a private nonprofit to a treaty-backed international agency, that tracks[3] transformative AI projects and takes actions to censure or shut down dangerous ones, as well as doing other things where a central, neutral body (as opposed to an AI company) can be especially useful. (More on IAIA below.)

I’m going to discuss what Magma’s and IAIA’s major goals and priorities should be in the “nearcast” situation I’m contemplating; a future piece will go through what a few stylized success stories might look like. I’ll be bracketing discussion of the details of how Magma can reduce the risk that its own AI systems are misaligned (since I discussed that previously [AF · GW]), and focusing instead on what Magma and IAIA should be looking to do before and after they achieve some level of confidence in Magma’s systems’ alignment.

I focus on Magma and IAIA for concreteness and simplicity (not because I expect there to be only two important actors, but because my takes on what most actors should be doing can be mostly inferred from how I discuss these two). I sometimes give more detail on Magma, because IAIA is a bit more speculative and unlike actors that exist today.

My discussion will be very high-level and abstract. It leaves a lot of room for variation in the details, and it doesn’t pin down how Magma and IAIA should prioritize between possible key activities - this is too sensitive to details of the situation. Nonetheless, I think this is more specific than previous discussions of the deployment problem, and for one who accepts this broad picture, it implies a number of things about what we should be doing today. I’ll discuss these briefly in the final section, and more in a future post.

Summary of the post (bearing in mind that within the nearcast, I’m using present tense and not heavily flagging uncertainty):

One more note before I go into more detail: this post generally focuses on an end goal of advanced technology being safe and broadly available - by default leaving the world’s governance relations mostly as they are (e.g., same governments overseeing the same populations), and figuring that improving on those is a task for the world as a whole rather than for Magma or AI systems specifically. This is a way of avoiding a number of possible distractions, and hopefully laying out a vision that a large number of parties can agree would be acceptable, even if they don’t find it ideal.

The roles of Magma and IAIA in the scenario

This post mostly uses the same scenario laid out previously [AF · GW], in which a major AI company (“Magma,” following Ajeya’s [LW · GW] terminology) has good reason to think that it can develop transformative AI very soon (within a year), using what Ajeya calls “human feedback on diverse tasks” (HFDT) - and has some time (more than 6 months, but less than 2 years) to set up special measures to reduce the risks of misaligned AI before there’s much chance of someone else deploying transformative AI.

I’m also introducing another hypothetical actor, “IAIA”[5]: an organization, which could range from a private nonprofit to a treaty-backed international agency, that tracks[6] transformative AI projects and takes actions to censure or shut down dangerous ones, as well as doing other things where a central, neutral body (as opposed to an AI company) can be especially useful. (Some more details on the specifics of what IAIA can do, and what sort of “power” it might have, in a footnote.[7])

I assume throughout this post that both Magma and IAIA are “good actors” - doing what they can (in actuality, not just in intention) to achieve a positive long-run outcome for humanity - and that they see each other this way. I think assuming this relatively rosy setup is the right way to go for purposes of elucidating lots of potential strategies that might be possible. But Magma-IAIA relations could end up being less trusting and more complicated than what I portray here. In that case, Magma may end up being cautious about how it approaches IAIA, and sticking more (where necessary) to the actions that don’t require IAIA’s cooperation; conversely, IAIA may end up acting adversarially toward Magma.

As noted above, I focus on Magma and IAIA for concreteness and simplicity (not because I expect there to be only two important actors, but because my takes on what most actors should be doing can be mostly inferred from how I discuss these two).

Phase 1: before transformative AI that can safely help with “inaction risk”

I previously [AF · GW] wrote about Magma’s “predicament” as it becomes clear that transformative AI could be developed shortly:

Magma is essentially navigating action risk vs. inaction risk:

Action risk. Say that Magma trains extremely powerful AI systems … The risk here is that (per the previous section) Magma might unwittingly train the systems to pursue some unintended goal(s), such that once the systems are able to find a path to disempowering humans and taking control of all of their resources, they do so.

So by developing and deploying transformative AI, Magma may bring about an existential catastrophe for humanity …

Inaction risk. Say that Magma’s leadership decides: “We don’t want to cause an existential catastrophe; let’s just not build AI advanced enough to pose that kind of risk.” In this case, they should worry that someone else will develop and deploy transformative AI, posing a similar risk (or arguably a greater risk -- any company/coalition that chooses to deploy powerful AI when Magma doesn’t may be less careful than Magma overall).

Magma’s goals: alignment, security, deals with other companies, producing public goods.

IAIA’s goals: monitoring, encouraging good safety practices, encouraging good information sharing practices (including prioritizing security), sharing/disseminating public goods.

If IAIA suspects that someone could imminently deploy systems leading to a global catastrophe, it should consider drastic actions, discussed below.

A crucial theme for Magma and IAIA: selective information sharing. Certain classes of information sharing can increase risks (e.g., if an incautious actor got access to the weights for powerful but unaligned Magma models, such that they could deploy them themselves without much further effort); others can decrease risk (e.g., insights about where likely security holes and misalignment risks come from, which could cause even incautious actors to change their training setup and patch holes).

Both Magma and IAIA should be deliberate about the pros and cons of sharing information with different parties. For example, they might want frameworks for sharing more information with cautious actors than with incautious (or otherwise dangerous[10]) ones. I expect that in many cases, “information about how to avoid misaligned AI” and “information about how to produce powerful AI” will overlap; so will “information about the size of the risk” and “information about how powerful AI systems are getting.” It would be best if Magma and IAIA could sometimes (based on case-by-case analysis) share this sort of “dual-use” information with other AI labs that are more “cautious” (in the sense that they’re more likely to make major efforts [AF · GW] to reduce the risk of misaligned AI) without necessarily making it public. This leaves open how Magma and IAIA are to define and determine which actors count as sufficiently “cautious.”

Phase 2: as aligned-and-transformative AI systems become available

Hopefully, at some point it becomes possible to be confident that some of Magma’s AI systems are both very powerful and unlikely to cause catastrophe via misalignment. (A previous piece [AF · GW] discussed what this might look like.)

This could open up new ways of reducing “inaction risk” (the risk that others deploy powerful, misaligned systems), in addition to the key actions from Phase 1 (which Magma and IAIA should be continuing and perhaps intensifying).

Magma and IAIA should both be working to deploy AI systems that can reduce the risk that other actors cause a catastrophe.

AI systems could be deployed toward the following (these were briefly mentioned previously [AF · GW]):

Another important aspect of Phase 2 is trying to prevent misuse of aligned-and-transformative AI systems. I think this topic is relatively unexplored compared to reducing alignment risk, and without much to draw from, I’ll only discuss it briefly. Two key paths to reducing misuse risk could be:

A key overall hope here is that actors such as Magma can (a) roll out powerful but safe AI systems before more dangerous[13] actors can deploy comparably advanced (and potentially less safe) systems; (b) build a substantial advantage for these systems, as the fact that they’re seen as non-dangerous leads to wide rollouts and a lot of resources for them; (c) use such systems to help with research toward still-more-advanced systems, maintaining the aggregate advantage of relatively cautious actors and their safe AI systems over more dangerous actors and their more dangerous AI systems.

That hope might not work out, for a number of reasons:

If it looks like things are going this way, IAIA and Magma should pursue more drastic measures, as discussed next.

Drastic measures

In either Phase 1 or Phase 2, Magma and/or IAIA might come to believe that it’s imperative to suppress deployment of dangerous AI systems worldwide - and that the measures described above can’t accomplish this.

In this case, IAIA might recommend (or authorize/mandate, depending on the full scope of its authority[14]) that governments around the world suppress AI development and/or deployment (as they have in the past for dangerous technologies such as nuclear, chemical and biological weapons, as well as e.g. chlorofluorocarbons) - by any means necessary, including via credible threat of military intervention, by cyberattack, etc.[15] Magma might also advocate for such things, though presumably with less sway than an effective version of IAIA would have.

For this kind of scenario, it seems important that whoever is approaching key governments for this kind of intervention should have a good sense - ideally informed by strong pre-existing relationships - of how to approach them in a way that will lead to good outcomes (and not merely to some reaction like “We’d better deploy the most advanced AI systems possible before our rivals do”).

If advanced AI systems are capable of developing powerful advanced technologies - such as highly advanced surveillance, weapons, persuasion techniques, etc. - they could be used to help governments suppress deployment of dangerous systems. This would hopefully involve deploying systems that are highly likely to be safe, but it’s imaginable that Magma and IAIA should advocate for taking a real risk of deploying catastrophically misaligned AI rather than stand by as other actors deploy their systems.[16] I think the latter should be a true last resort.

Phase 3: low-misalignment-risk period

A major goal of phases 1 and 2 was to get to the point where there’s no longer significant worry about a world run by misaligned AI.

By this time, it’s possible that the world is already very unfamiliar from today’s vantage point. There may have been several rounds of “using powerful aligned AI systems to help build even more powerful aligned systems”; there may now be many (very powerful by today’s standards) AI systems in wide use, and developed by a number of actors; drastic government action may have been taken.

The world of phase 3 faces enormous challenges. Among other things, I expect some people (and some governments) could be looking to deploy advanced technologies to seize power from others and perhaps lock in bad worlds.

It’s possible that technology will advance rapidly enough, and be destabilizing enough, that there are multiple actors making credible attempts to become an entrenched global hegemon. In such a situation:

Alternatively, it’s also possible that this phase-3 world will be largely like today’s in key respects: a number of powerful governments, mostly respecting each other’s borders and governing on their own. In this world, I think the ideal focus of most AI-involved actors is to push toward AI systems being used to help humans become broadly more capable and more inclined to prioritize the good of all beings across the world and across time. This could include:

Implications

Future pieces will go into detail about what I think this whole picture implies for what the most helpful actions are today.

Here, I want to briefly run through some implications that seem to follow if my above picture is accepted, largely to highlight the ways in which (despite being vague in many respects) my picture is “sticking its neck out” and implying nontrivial things.

If we figure that something like the above guidelines (and “success stories” I’ll outline in a future piece) give us our best shot at a good outcome, this implies that:

Overall, my picture stands in significant contrast to what I perceive as a somewhat common view that alignment is purely a technical problem, “solvable” by independent researchers. In my picture, there are a lot of moving parts and a lot of room for important variation in how leading AI labs behave, beyond just the quality of the schemes they generate for reducing misalignment risk. My picture also stands in significant contrast to another common view, which I might summarize as “We should push forward with exciting AI applications as fast as possible and deploy them as widely as possible.” In my view, the range of possible outcomes is wide - as is the range of important inputs along technical, strategic, corporate and political dimensions.

Thanks to Paul Christiano, Allan Dafoe, Daniel Kokotajlo, Jade Leung, Cullen O'Keefe, Carl Shulman, Nate Soares and especially Luke Muehlhauser for particularly in-depth comments on drafts.

Notes


  1. This doesn’t mean the whole situation discussed in this post plays out in a span of 6 months to 2 years. It just means that there isn’t much chance of someone deploying comparably transformative systems to Magma’s first transformative systems within that amount of time. Much of this piece has Magma making attempts to “stay ahead” of others, such that the scenario could take longer to play out. 

  2. A hypothetical International AI Agency (name inspired by IAEA). Pronunciation guide here

  3. Monitoring would be with permission and assistance in the case where IAIA is a private nonprofit, i.e., in this case AI companies would be voluntarily agreeing to be monitored.  

  4. I don’t like the framing of “solving” “the” alignment problem. I picture something like “Taking as many measures as we can (see previous post [AF · GW]) to make catastrophic misalignment as unlikely as we can for the specific systems we’re deploying in the specific contexts we’re deploying them in, then using those systems as part of an ongoing effort to further improve alignment measures that can be applied to more-capable systems.” In other words, I don’t think there is a single point where the alignment problem is “solved”; instead I think we will face a number of “alignment problems” for systems with different capabilities. (And I think there could be some systems that are very easy to align, but just not very powerful.) So I tend to talk about whether we have “systems that are both aligned and transformative” rather than whether the “alignment problem is solved.” 

  5. A hypothetical International AI Agency (name inspired by IAEA). Pronunciation guide here

  6. Monitoring would be with permission and assistance in the case where IAIA is a private nonprofit, i.e., in this case AI companies would be voluntarily agreeing to be monitored.  

  7. There’s a wide variety of possible powers for IAIA. For most of this post, I tend to assume that it is an agency designed for flexibility and adaptiveness, not required or enabled to execute any particular formal scheme along the lines of “If verifiable event X happens, IAIA may/must take pre-specified action Y.”

    Instead, IAIA’s central tool is its informal legitimacy. It has attracted top talent and expertise, and when it issues recommendations, the recommendations are well-informed, well-argued, and commonly seen as something governments should follow by default.

    In the case where IAIA has official recognition from governments or international bodies, there may be various formal provisions that make it easier for governments to quickly take IAIA’s recommendations (e.g., Congressional pre-authorizations for the executive branch to act on formal IAIA recommendations). 

  8. E.g., it’s imaginable that large compute providers could preferentially provide compute to IAIA-endorsed organizations. 

  9. One example (also mentioned in a later footnote): key AI-relevant chips could have features to enable others to monitor their utilization, or even to shut them down in some circumstances, and parties could make deals giving each other access to these mechanisms (somewhat in the spirit of the Treaty on Open Skies). 

  10. E.g., actors who seem likely to use any aligned AI systems for dangerous purposes. 

  11. E.g., aligned AI systems that people are trying to use for illegal and/or highly dangerous activities. 

  12. For example, if the “offense-defense balance” is such that an individual might be able to ask an AI system to design powerful weapons with which they could successfully threaten governments, AI systems might be trained not to help with this sort of goal. There is a nonobvious line to be drawn here, because AI systems shouldn’t necessarily e.g. refuse to help individuals work on developing better clean energy technology, which could be relevant for weapons development.

    This line doesn’t have to be drawn algorithmically - it could be based on human judgments about what sorts of AI assistance constitute “helping with illegal activity or excessive power gain” - but who gets to make those judgments, and how they make them, is still a hairy area with a lot of room for debate and judgment calls.  

  13. Whether due to less caution about alignment, or for other reasons.

  14. See previous section for some discussion of how exactly IAIA’s authority might work. 

  15. One speculative possibility could be for IAIA and others to push for key AI-relevant chips to have features enabling others to monitor their utilization, or even to shut them down in some circumstances. 

  16. The basic reasoning might be: “The systems we have carry a real chance of causing global catastrophe, but if we stand by, others will deploy systems that are even more likely to.” I think it’s worth having a high bar for making such a call, as a given actor might naturally be biased toward thinking the world is better off with them acting first.

  17. I think there are quite a few things one can do to “lay the groundwork” for future policy changes; some of them are gestured at in this Open Philanthropy blog post from 2013. I expect a given policy change to be much easier if many of the pros and cons have already been analyzed, the details have already been worked out, and there are a number of experts working in governments and at government-advising organizations (e.g., think tanks) who can give good advice on it; all of these are things that can be worked on in advance. 

  18. One casual conversation I had with an AI researcher implied that training AI systems to refuse dangerous requests could be relatively easy, but it also seems relatively easy (by default) for others to train this behavior back out via fine-tuning. It might be interesting to explore AI system designs that would be hard to use in unintended ways without hugely expensive re-training. 

2 comments

comment by Zach Stein-Perlman · 2022-09-22T16:53:05.158Z
  • Deals with other companies. Magma might be able to reduce some of the pressure to “race” by making explicit deals with other companies doing similar work on developing AI systems, up to and including mergers and acquisitions (but also including more limited collaboration and information sharing agreements).
    • Benefits of such deals might include (a) enabling freer information sharing and collaboration; (b) being able to prioritize alignment with less worry that other companies are incautiously racing ahead; (c) creating incentives (e.g., other labs’ holding equity in Magma) to cooperate rather than compete; and thus (d) helping Magma get more done (more alignment work, more robustly staying ahead of other key actors in terms of the state of its AI systems).

It's often said that there exists "pressure to race." We can break down this pressure, e.g. into

  1. The extent to which Magma's preferences are better satisfied if it has lots of power, relative to a competitor having lots of power (or: the extent to which the universe would look better-according-to-Magma if Magma was in charge, relative to a competitor being in charge)
  2. Factors like wanting to make progress quickly or beat others or be first, independent of how being first would make the universe better-according-to-Magma.

For 1, it's not clear that there exists any real force here-- I don't know what (e.g.) DeepMind or OpenAI would do long-term with lots of power, and I don't think they do either, much less that they each believe that they would do something better than the other. (Perhaps causing them to think more carefully about what they'd do with power would cause them to realize that there are attractors like "do CEV" such that they would very likely do something very similar to the other, and thus see no reason to race... but existing race-y-ness doesn't seem to be due to either specific beliefs or uncertainty about what others would do with lots of power...) For 2, these factors feel closer to mere-psychological than deep-strategic, and so can plausibly be overcome with other psychological factors or incentives that are tiny relative to the cosmic endowment...

We should think carefully about what actually causes racing (and how to negate or counterbalance those factors).

(Of course if a particular lab doesn't believe that safety is a problem, it won't slow down for safety-- but we can try to solve the futures where all leading labs actually care about safety, solve the futures where they don't in other ways, and try to nudge the latter toward the former.)

comment by Nathan Helm-Burger (nathan-helm-burger) · 2022-09-24T15:18:27.023Z

I think there's another danger worth considering here, which is that AI power can be expected to increase as we approach AGI. Depending on how long we are stuck in the pre-AGI regime, and how powerful AI has become in that time, there is a danger that unethical humans substantially empowered by AI tools may attempt to gain and maintain power over society, and that part of this action may be interfering with labs attempting to develop aligned AGI. I feel like this possibility casts the potential race dynamics in a different light. Here's a post discussing the potential wide-reaching effects of social manipulation via powerful but pre-AGI tech: https://www.lesswrong.com/posts/3broJA5XpBwDbjsYb/agency-engineering-is-ai-alignment-to-human-intent-enough [LW · GW]