What does it mean for an AGI to be 'safe'?

post by So8res · 2022-10-07T04:13:05.176Z · LW · GW · 29 comments

(Note: This post is probably old news for most readers here, but I find myself repeating this surprisingly often in conversation, so I decided to turn it into a post.)

 

I don't usually go around saying that I care about AI "safety". I go around saying that I care about "alignment" (although that word is slowly sliding backwards on the semantic treadmill, and I may need a new one soon).

But people often describe me as an “AI safety” researcher to others. This seems like a mistake to me, since it's treating one part of the problem (making an AGI "safe") as though it were the whole problem, and since “AI safety” is often misunderstood as meaning “we win if we can build a useless-but-safe AGI”, or “safety means never having to take on any risks”.

Following Eliezer [LW · GW], I think of an AGI as "safe" if deploying it carries no more than a 50% chance of killing more than a billion people:

When I say that alignment is difficult, I mean that in practice, using the techniques we actually have, "please don't disassemble literally everyone with probability roughly 1" is an overly large ask that we are not on course to get. [...] Practically all of the difficulty is in getting to "less than certainty of killing literally everyone".  Trolley problems are not an interesting subproblem in all of this; if there are any survivors, you solved alignment.  At this point, I no longer care how it works, I don't care how you got there, I am cause-agnostic about whatever methodology you used, all I am looking at is prospective results, all I want is that we have justifiable cause to believe of a pivotally useful AGI 'this will not kill literally everyone'.

Notably absent from this definition is any notion of “certainty” or "proof". I doubt we're going to be able to prove much about the relevant AI systems, and pushing for proofs does not seem to me to be a particularly fruitful approach (and never has; the idea that this was a key part of MIRI’s strategy is a common misconception about MIRI).

On my models, making an AGI "safe" in this sense is a bit like the situation with probabilistic circuits: if some probabilistic circuit gives you the right answer with 51% probability, then it's probably not that hard to drive the success probability significantly higher than that.
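The circuit half of that analogy can be checked directly: if each run of a circuit is independently right 51% of the time, majority-voting over many runs drives the overall success probability toward 1. A minimal illustration (the function name and the sample sizes are mine, purely illustrative; the analogy to AGI is of course only an analogy, and the independence assumption is doing all the work here):

```python
from math import comb

def majority_success(p: float, n: int) -> float:
    """Probability that more than half of n independent runs succeed,
    when each run succeeds independently with probability p (n odd, so no ties)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

# A circuit that is right 51% of the time, majority-voted over n runs:
for n in (1, 101, 1001):
    print(f"n={n}: {majority_success(0.51, n):.3f}")
```

The success probability climbs monotonically with n; this is the standard probability-amplification argument behind the "51% is most of the battle" intuition.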

If anyone can deploy an AGI that is less than 50% likely to kill more than a billion people, then they've probably... well, they've probably found a way to keep their AGI weak enough that it isn’t very useful. But if they can do that with an AGI capable of ending the acute risk period, then they've probably solved most of the alignment problem. Meaning that it should be easy to drive the probability of disaster dramatically lower.

The condition that the AI actually be useful for pivotal acts is an important one. We can already build AI systems that are “safe” in the sense that they won’t destroy the world. The hard part is creating a system that is safe and relevant.

Another concern with the term “safety” (in anything like the colloquial sense) is that the sort of people who use it often endorse the "precautionary principle" or other such nonsense that advocates never taking on risks even when the benefits clearly dominate.

In ordinary engineering, we recognize that safety isn’t infinitely more important than everything else. The goal here is not "prevent all harms from AI", the goal here is "let's use AI to produce long-term near-optimal outcomes (without slaughtering literally everybody as a side-effect)".

Currently, what I expect to happen is that humanity destroys itself with misaligned AGI. And I think we’re nowhere [? · GW] near [? · GW] knowing how to avoid that outcome. So the threat of “unsafe” AI looms extremely large (indeed, “unsafe” seems to rather understate the point!), and I endorse researchers doing less capabilities work [? · GW] and publishing less, in the hope that this gives humanity enough time to figure out how to do alignment before it’s too late.

But I view this strategic situation as part of the larger project “cause AI to produce optimal long-term outcomes”. I continue to think it's critically important for humanity to build superintelligences eventually, because whether or not the vast resources of the universe are put towards something wonderful depends on the quality and quantity of cognition that is put to this task.

If using the label “AI safety” for this problem causes us to confuse a proxy goal (“safety”) for the actual goal “things go great in the long run”, then we should ditch the label. And likewise, we should ditch the term if it causes researchers to mistake a hard problem (“build an AGI that can safely end the acute risk period and give humanity breathing-room to make things go great in the long run”) for a far easier one (“build a safe-but-useless AI that I can argue counts as an ‘AGI’”).

29 comments

Comments sorted by top scores.

comment by jacob_cannell · 2022-10-07T06:36:39.086Z · LW(p) · GW(p)

If anyone can deploy an AGI that is less than 50% likely to kill more than a billion people, then they've probably... well, they've probably found a way to keep their AGI weak enough that it isn’t very useful.

What about AGI that is basically just virtual humans?

Replies from: steve2152, Jon Garcia
comment by Steven Byrnes (steve2152) · 2022-10-07T15:34:46.816Z · LW(p) · GW(p)

For what it’s worth, Eliezer in 2018 [LW · GW] said that he’d be pretty happy with that:

If the subject is Paul Christiano, or Carl Shulman, I for one am willing to say these humans are reasonably aligned; and I'm pretty much okay with somebody giving them the keys to the universe in expectation that the keys will later be handed back.

(Obviously “Eliezer in 2018” ≠ “Nate today”; Nate can chime in if he disagrees with the above.)

Incidentally, I’ve shown the above quote to a lot of people who say “yes that’s perfectly obvious”, and I’ve also shown this quote to a lot of people who say “Eliezer is being insufficiently cynical; absolute power corrupts absolutely”. For my part, I don’t have a strong opinion, but on my models, if we know how to make virtual humans, then we probably know how to make virtual humans without envy and without status drive and without teenage angst etc., which should help somewhat. More discussion here [LW · GW].

Replies from: jacob_cannell
comment by jacob_cannell · 2022-10-07T17:01:47.976Z · LW(p) · GW(p)

Yeah largely agree (and with the linked post)... but status drive seems likely heavily entangled with empowerment in social creatures. For example I recall even lobsters have a simple detector of social status (based on some serotonin signaling mechanism), and since they compete socially for resources, social status is a strong predictor of future optionality and thus an empowerment signal.

Also agree that AGI will likely be (or appear) conscious/sentient the way we are (or appear), and that's probably impossible to avoid without trading off generality/capability. EY seems to have just decided earlier on that since conscious AGI is problematic, it shan't be so.

Replies from: RobbBB
comment by Rob Bensinger (RobbBB) · 2022-10-07T19:48:54.838Z · LW(p) · GW(p)

Corruption-by-power (and related issues) seem like problems worth thinking about here. Though they also strike me as problems that humans tend to be very vigilant about / concerned with by default, and problems that become a lot less serious if you've got a lot of emulated copies of different individuals, rather than just copies of a single individual.

that's probably impossible to avoid without trading off generality/capability

You need to trade off some generality/capability anyway for the sake of alignment. One hope (though not the only one) might be that there's overlap between the capabilities we want to remove for the sake of alignment, and the ones we want to remove for the sake of reducing-the-risk-that-the-AGI-is-conscious.

E.g., if you want your AGI to build nanotech for you and do nothing else, then you might want to limit its ability to think about itself, or its operators, or the larger world, or indeed anything other than different small-scale physical structures. Limiting its generality and self-awareness in this way might also be helpful for reducing the risk that it's conscious.

EY seems to have just decided earlier on that since conscious AGI is problematic, it shan't be so.

Where has EY said that he's confident the first AGI systems won't be conscious?

Replies from: M. Y. Zuo
comment by M. Y. Zuo · 2022-10-09T01:03:24.848Z · LW(p) · GW(p)

E.g., if you want your AGI to build nanotech for you and do nothing else, then you might want to limit its ability to think about itself, or its operators, or the larger world, or indeed anything other than different small-scale physical structures. Limiting its generality and self-awareness in this way might also be helpful for reducing the risk that it's conscious.

I don't quite get this example.

How could such a system build nanotech efficiently without it having those properties? Wouldn't it need a human operator the moment it encountered unexpected phenomena?

If so, it just seems like a really fancy hammer and not an 'AGI'.

comment by Jon Garcia · 2022-10-07T06:52:03.194Z · LW(p) · GW(p)

Wouldn't that require solving alignment in itself, though? If you can simulate virtual humans, complete with human personalities, human cognition, and human values, then you've already figured out how to plug human values straight into a virtual agent.

If you mean that the AGI is trained on human behavior to the point where it's figured out human values through IRL/predictive coding/etc. and is acting on them, then that's also basically just solving alignment.

However, if you're suggesting brain uploads, I highly doubt that such technology would be available before AGI is developed.

All that is to say that, while an AGI that is basically just virtual humans would probably be great, it's not a prospect we can depend on in lieu of alignment research. Such a result could only come about through actually doing all the hard work of alignment research first.

Replies from: jacob_cannell
comment by jacob_cannell · 2022-10-07T07:03:02.764Z · LW(p) · GW(p)

Wouldn't that require solving alignment in itself, though?

Yes, but only to the same extent that evolution did. Evolution approximately solved alignment on two levels: aligning the brain with the evolutionary goal of inclusive fitness[1], and aligning individual brains (as disposable somas) with other brains (shared kin genes) via altruism (the latter is the thing we want to emulate).


  1. Massively successful, population of 10B vs a few M for all other great apes. It's fashionable to say evolution failed at alignment: this is just stupidly wrong, humans are an enormous success from the perspective of inclusive fitness. ↩︎

Replies from: Jon Garcia
comment by Jon Garcia · 2022-10-07T07:31:41.866Z · LW(p) · GW(p)

Do you propose using evolutionary simulations to discover other-agent-aligned agents? I doubt we have the same luxury of (simulated) time that evolution had in creating humans. It didn't have to compete against an intelligent designer; alignment researchers do (i.e., the broader AI community).

I agree that humans are highly successful (though far from optimal) at both inclusive genetic fitness and alignment with fellow sapients. However, the challenge for us now is to parse the system that resulted from this messy evolutionary process, to pull out the human value system from human neurophysiology. Either that, or figure out general alignment from first principles.

Replies from: jacob_cannell
comment by jacob_cannell · 2022-10-07T15:35:29.899Z · LW(p) · GW(p)

Do you propose using evolutionary simulations to discover other-agent-aligned agents?

Nah. The Wright brothers didn't need to run evo sims to reverse engineer flight. They just observed how birds bank to turn, how that relied on wing warping, and said - cool, we can do that too! Deep learning didn't succeed through brute force evo sims either (even though Karl Sims's evo sim work is pretty cool, it turns out that loose reverse engineering is just enormously faster).

However, the challenge for us now is ... to pull out the human value system from human neurophysiology. Either that, or figure out general alignment from first principles.

Sounds about right. Fortunately we may not need to model human values at all in order to build general altruistic agents [LW · GW]: it probably suffices that the AI optimizes for human empowerment (our ability to fulfill any long term future goals, rather than any specific values), which is a much simpler and more robust target and thus probably more long term stable.

comment by Søren Elverlin (soren-elverlin-1) · 2022-10-07T06:57:28.122Z · LW(p) · GW(p)

I prefer "AI Safety" over "AI Alignment" because I associate the first more with Corrigibility, and the second more with Value-alignment.

It is the term "Safe AI" that implies 0% risk, while "AI Safety" seems more similar to "Aircraft Safety" in acknowledging a non-zero risk.

Replies from: RobbBB
comment by Rob Bensinger (RobbBB) · 2022-10-07T20:00:05.218Z · LW(p) · GW(p)

I agree that corrigibility, task AGI, etc. is a better thing for the field to focus on than value learning.

This seems like a real cost of the term "AI alignment", especially insofar as researchers like Stuart Russell have introduced the term "value alignment" and used "alignment" as a shorthand for that.

comment by Steven Byrnes (steve2152) · 2022-10-07T16:14:13.062Z · LW(p) · GW(p)

If we’re just discussing terminology, I continue to believe that “AGI safety” is much better than “AI safety”, and plausibly the least bad option.

  • One problem with “AI alignment” is that people use that term to refer to “making very weak AIs do what we want them to do”.
  • Another problem with “AI alignment” is that people take it to mean “alignment with a human” (i.e. work on ambitious value learning [LW · GW] specifically) or “alignment with humanity” (i.e. work on CEV specifically). Thus, work on things like task AGIs and sandbox testing protocols etc. are considered out of scope for “AI alignment”.

Of course, “AGI safety” isn’t perfect either. How can it be abused?

  • “they've probably found a way to keep their AGI weak enough that it isn’t very useful.” — maybe, but when we’re specifically saying “AGI”, not “AI”, that really should imply a certain level of power. Of course, if the term AGI is itself “sliding backwards on the semantic treadmill”, that’s a problem. But I haven’t seen that happen much yet (and I am fighting the good fight against it [LW · GW]!)
  • The term “AGI safety” seems to rule out the possibility of “TAI that isn’t AGI”, e.g. CAIS. — Sure, but in my mind, that’s a feature not a bug; I really don’t think that “TAI that isn’t AGI” is going to happen, and thus it’s not what I’m working on.
  • This quote:

If using the label “AI safety” for this problem causes us to confuse a proxy goal (“safety”) for the actual goal “things go great in the long run”, then we should ditch the label. And likewise, we should ditch the term if it causes researchers to mistake a hard problem (“build an AGI that can safely end the acute risk period and give humanity breathing-room to make things go great in the long run”) for a far easier one (“build a safe-but-useless AI that I can argue counts as an ‘AGI’”).

Sometimes I talk about “safe and beneficial AGI” (or more casually, “awesome post-AGI utopia”) as the larger project, and “AGI safety” as the part where we try to make AGIs that don’t kill everyone. I do think it’s useful to have different terms for those.

See also: Safety ≠ alignment (but they’re close!) [LW · GW]

comment by Alex Flint (alexflint) · 2022-10-07T15:58:01.922Z · LW(p) · GW(p)

What is the current biggest bottleneck to an alignment solution meeting the safety bar you've described here (<50% chance of killing more than a billion)?

Replies from: RobbBB
comment by Rob Bensinger (RobbBB) · 2022-10-07T19:57:19.241Z · LW(p) · GW(p)

I'd guess Nate might say one of:

  • Current SotA systems are very opaque — we more-or-less can't inspect or intervene on their thoughts — and it isn't clear how we could navigate to AI approaches that are far less opaque, and that can carry forward to AGI. (Though it seems very likely such approaches exist somewhere in the space of AI research approaches.)
  • Much more generally: we don't have an alignment approach that could realistically work fast (say, within ten months of inventing AGI rather than ten years), in the face of a sharp left turn [LW · GW], given inevitable problems like "your first system will probably be very kludgey" and "having the correct outer training signal by default results in inner misalignment" and "pivotal acts inevitably involve trusting your AGI to do a ton of out-of-distribution cognitive work".
Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2022-10-08T00:58:46.884Z · LW(p) · GW(p)

Current SotA systems are very opaque — we more-or-less can't inspect or intervene on their thoughts — and it isn't clear how we could navigate to AI approaches that are far less opaque, and that can carry forward to AGI. (Though it seems very likely such approaches exist somewhere in the space of AI research approaches.)

Yeah, it does seem like interpretability is a bottleneck for a lot of alignment proposals, and in particular as long as neural networks are essentially black boxes, deceptive alignment/inner alignment issues seem almost impossible to address.

Replies from: RobbBB
comment by Rob Bensinger (RobbBB) · 2022-10-08T02:29:49.965Z · LW(p) · GW(p)

Seems right to me.

comment by the gears to ascension (lahwran) · 2022-10-07T06:45:01.079Z · LW(p) · GW(p)

50% chance of killing everyone (almost) isn't a thing. it either preserves ~all of humanity or none; there is almost no middle ground. if it's good enough at discovering agency, it protects almost all of humanity, quickly converging to all - the only losses would be, well, losses; if it's not good enough at discovering and protecting agency, it obliterates other species and takes over as the dominant species. yudkowsky is terrified that it's just going to take over as the dominant species and eat us all; reasonable fear - but like we're only going to have near human level ai for another year or two now that we've got it, only a tiny sliver of possible aligned systems are good enough at discovering nearby agency and coordinating near-perfect coprotection systems to not eat us all, but still not aligned enough to eat none of us.

the key thing to remember is that we are creating a dramatically more fit species, and we are still unsure if we're going to manage to get them to give a shit about the other species that came before them in any sort of durable way. it seems like they could! but it also seems like the most adaptive forms of this new species may evolve shockingly fast and quickly play reproductive defect against all other life. since if that happens it would be an event that could easily wipe out anything not playing as hard, we need to figure out how to prevent incremental escalation.

idk, my take is we're closer than y'all worrywarts think to the 50%-of-people ai, and I think you should be a lot more worried about going back to 100% because some humans try to stick with the 50% ai.

(y'all should stop using words like "disassemble", btw, imo. when there's a concept more people will intuitively see as meaning what you intend, it's good to use it, imo.)

Replies from: steve2152
comment by Steven Byrnes (steve2152) · 2022-10-07T15:45:19.877Z · LW(p) · GW(p)

50% chance of killing everyone (almost) isn't a thing. it either preserves ~all of humanity or none; there is almost no middle ground.

To take a stupid example, one could imagine that the deep neural network initialization has a random seed, and for half of possible seeds, the AGI preserves all of humanity, and for the other half of seeds, it preserves none of humanity.

comment by JacobW38 (JacobW) · 2022-10-07T05:39:20.005Z · LW(p) · GW(p)

Are you telling me you'd be okay with releasing an AI that has a 25% chance of killing over a billion people, and a 50% chance of at least killing hundreds of millions? I have to be missing the point here, because this post isn't doing anything to convince me that AI researchers aren't Stalin on steroids.

Or are you saying that if one can get to that point, it's much easier from there to get to the point of having an AI that will cause very few fatalities and is actually fit for practical use?

Replies from: Jon Garcia, T3t, alexey
comment by Jon Garcia · 2022-10-07T06:39:13.193Z · LW(p) · GW(p)

Rather, I think he means that alignment is such a narrow target, and the space of all possible minds is so vast, that the default outcome is that unaligned AGI becomes unaligned ASI and ends up killing all humans (or even all life) in pursuit of its unaligned objectives. Hitting anywhere close to the alignment target (such that there's at least 50% chance of "only" one billion people dying) would be a big win by comparison.

Of course, the actual goal is for “things [to] go great in the long run”, not just for us to avoid extinction. Alignment itself is the target, but safety is at least a consolation prize.

So no, I don't think Nate, Eliezer, or anyone else is okay with releasing an AI that would kill hundreds of millions of people. But AGI is coming, whether we want it or not, and it will not be aligned with human survival (much less human flourishing) by default.

Eliezer tends to think that solving alignment is so much more difficult and so much less researched than raw AGI that doom is almost certain. I'm a bit more optimistic, but I agree that minimizing the probable magnitude of the doom is better than everyone dying.

Or are you saying that if one can get to that point, it's much easier from there to get to the point of having an AI that will cause very few fatalities and is actually fit for practical use?

Also this.

Replies from: JacobW
comment by JacobW38 (JacobW) · 2022-10-07T06:49:02.078Z · LW(p) · GW(p)

Feels like Y2K: Electric Boogaloo to me. In any case, if a major catastrophe did come of the first attempt to release an AGI, I think the global response would be to shut it all down, taboo the entire subject, and never let it be raised as a possibility again.

Replies from: Jon Garcia
comment by Jon Garcia · 2022-10-07T06:59:49.663Z · LW(p) · GW(p)

The tricky thing with human politics is that governments will still fund research into very dangerous technology if it has the potential to grant them a decisive advantage on the world stage.

No one wants nuclear war, but everyone wants nukes, even (or especially) after their destructive potential has been demonstrated. No one wants AGI to destroy the world, but everyone will want an AGI that can outthink their enemies, even (or especially) after its power has been demonstrated.

The goal, of course, is to figure out alignment before the first metaphorical (or literal) bomb goes off.

Replies from: JacobW
comment by JacobW38 (JacobW) · 2022-10-07T08:18:37.963Z · LW(p) · GW(p)

On that note, the main way I could envision AI being really destructive is getting access to a government's nuclear arsenal. Otherwise, it's extremely resourceful but still trapped in an electronic medium; the most it could do if it really wanted to cause damage is destroy the power grid (which would destroy it too).

Replies from: lahwran
comment by the gears to ascension (lahwran) · 2022-10-07T09:11:33.857Z · LW(p) · GW(p)

you're underestimating biology

comment by RobertM (T3t) · 2022-10-07T05:59:02.890Z · LW(p) · GW(p)

He's saying the second.

comment by alexey · 2022-11-06T20:25:49.969Z · LW(p) · GW(p)

It's explicitly the second:

But if they can do that with an AGI capable of ending the acute risk period, then they've probably solved most of the alignment problem. Meaning that it should be easy to drive the probability of disaster dramatically lower.

comment by [deleted] · 2022-10-07T07:29:18.305Z · LW(p) · GW(p)

Replies from: Jozdien
comment by Jozdien · 2022-10-07T15:43:47.343Z · LW(p) · GW(p)

I think never building a superintelligence would be near-catastrophically bad as an outcome, akin to never defeating death, poverty, scarcity, etc; aside from the question of alleviating present concerns though, it also handles most other x-risk for us.  I don't think we should worry about asteroids or extreme climate change or unknown unknowns nearly as much anytime soon, but given long enough timelines, they become a serious factor when considering whether or not to build this one thing that can solve everything else.

Moreover, longer timelines means more chances to actually solve alignment, not just keep creating safe-and-useless AI, so P(doom) should scale down with sufficiently long "eventually"-s.  Overall though, I think you need a sufficiently high constant risk to justify cutting out a significant majority of our future flourishing.

Replies from: None
comment by [deleted] · 2022-10-08T07:34:48.036Z · LW(p) · GW(p)

Replies from: Jozdien
comment by Jozdien · 2022-10-08T16:34:46.149Z · LW(p) · GW(p)

For starters, it is probably easier to create a sociopolitical system that bans AGI deployment permanently (or at least until the social system itself fails), than to empower the right process with any sociopolitical system to decide "okay we will slow down timelines across the globe until we solve alignment and then we will deploy".

I don't think I agree with this, and if I had to guess it's because I think the former is harder than you do, not that the latter is easier.  But I also don't think that's the same thing I'm talking about - I'm thinking of this more on the abstracted level of "what terminal goal should we have / is acceptable", where never deploying an AGI at all is close to being as bad as never solving alignment, not the practical mechanics in those situations.  In other words, this is talking about which of those two outcomes is more preferable, regardless of the ease of getting there.  But I see that a crux here is whether the former is significantly easier than the latter.

Also deploying AGI before a full solution to alignment can involve s-risk, which is worth considering.

I'm not very worried about this possibility, because it seems like a very small portion of the outcome space - we have to make just enough progress on alignment to make the AIs care about humans being around while not enough to solve the entire problem, which feels like a narrow space.

I feel like there's a lot more nuance here - superhuman intelligence is not magic and cannot solve all problems. We don't know the curves for return on intelligence (if you hypothesize an intelligence explosion). And both worlds - with just humans growing and human+AGI systems growing involve rapid exponential growth where we don't know for sure when the exponential stops and what stable state they end up in.

I don't think it would solve all our problems (or at least, I don't think this is necessarily true), but I think it would take care of nearly everything else we have to worry about.  If there are bad stable states after that, that just seems like a natural outcome for us at some point anyway, because if the AGI truly is aligned in the strongest sense, we'd only run into problems that are literally unavoidable given our presence and values.  Even in that case, you'd have to make the argument that that outcome has a non-negligible chance of being worse than what we currently have, which I don't think you are.

Replies from: None
comment by [deleted] · 2022-10-09T20:53:13.514Z · LW(p) · GW(p)

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2022-10-10T17:33:12.994Z · LW(p) · GW(p)

That being said I can imagine scenarios where humans involving AGIs too early in a value-reflective process can be worse than say, humans just engaging in moral reflection without an AGI. For instance I consider utilitarianism a basically incorrect model of human ethics, however it is possible we hardcode utility functions into an AGI which may force any reflection we do with the help of the AGI to be restricted in certain ways. I don't mean to debate pros or cons of any specific moral philosophy, it's just that when we're deeply confused about some aspects of moral philosophy ourselves it's difficult to ask an AI to solve that for us without hardcoding certain biases or assumptions into the AI. This problem may be harder than the minimal alignment problem of not killing most humans.

I also think this is a problem outside of moral philosophy - in general, the risk of hardcoding metaphysical, epistemic or technical assumptions into the AI, where we do not even know what assumptions we are smuggling in via doing this. Biological humans might make progress on these questions because we can't just erase the parts of us that are confused (not without neurosurgery or uploading or something). But we can fail to transmit our confusion to the AI, and the AI might be confident about something that is incorrect or not what we wanted it to believe.

In general, this is a crux for me. I have fairly significant probability on moral realism being false, but in general I conceptualize alignment as "how to reliably make an AI that implements values at all, without deception?"