Thoughts on 'List of Lethalities'

post by Alex Lawsen (alex-lszn) · 2022-08-17

Contents

    Section A (shorthand: "strategic challenges")
      #1. Human level is nothing special / data efficiency
      #2. Unaligned superintelligence could easily take over
      #3. Can't iterate on dangerous domains
      #4. Can't cooperate to avoid AGI
      #5. Narrow AI is insufficient
      #6. Pivotal act is necessary
      #7. There are no weak pivotal acts because a pivotal act requires power
      #8. Capabilities generalize out of desired scope
      #9. A pivotal act is a dangerous regime
    Section B.1: The distributional leap
      #10. Large distributional shift to dangerous domains
      #11. Sim to real is hard
      #12. High intelligence is a large shift
      #13. Some problems only occur above an intelligence threshold
      #14. Some problems only occur in dangerous domains
      #15. Capability gains from intelligence are correlated
    Section B.2: Central difficulties of outer and inner alignment.
      #16. Inner misalignment
      #17. Can't control inner properties
      #18. No ground truth
      #19. Pointers problem
      #20. Flawed human feedback
      #21. Capabilities go further
      #22. No simple alignment core
      #23. Corrigibility is anti-natural.
      #24. Sovereign vs corrigibility
    Section B.3:  Central difficulties of sufficiently good and useful transparency / interpretability.
      #25. Real interpretability is out of reach
      #26. Interpretability is insufficient
      #27. Selecting for undetectability
      #28. Large option space 
      #29. Real world is an opaque domain
      #30. Powerful vs understandable
      #31. Hidden deception
      #32. Language is insufficient or unsafe
      #33. Alien concepts
    Section B.4:  Miscellaneous unworkable schemes. 
      #34. Multipolar collusion
      #35. Multi-agent is single-agent
      #36. Human flaws make containment difficult 
    Section C (shorthand: "civilizational inadequacy")
  Additional thoughts after writing:

I read through Eliezer’s “AGI Ruin: A List of Lethalities” post [LW · GW], and wrote down my reactions* as I went.

I tried to track my internal responses, *before* adjusting for my perception of the overall views of the field. In order to get this written and posted at all, I wrote the responses roughly the way I write tweets: with minimal editing, keeping each one short. I’m also not an alignment researcher, so I expect this to be useful mostly as encouragement for people to try similar things themselves, rather than as a direct addition to the conversation. If you’re interested in reactions from people who actually know what they’re talking about, there’s this [LW · GW] post from Paul Christiano detailing areas of agreement and disagreement, and a point-by-point list of responses from some DeepMind alignment researchers here [LW · GW].

After drafting my responses, I used the DeepMind post as a template to format them, since it seemed like a reasonable way of laying things out. Please note that I’m responding to Eliezer’s phrasing throughout, though, not to the DeepMind summaries (which I don’t always exactly endorse, but which still seem more useful to include than not). Here’s a blank template I made if you want to try writing down your own responses.

*I’d seen an early draft of the post a couple of months earlier and written responses, so this isn’t quite a ‘first look’ set of reactions, but because that draft had the points in a different order, I decided to start from scratch rather than try to edit all of those into the correct places.

Section A (shorthand: "strategic challenges")

#1. Human level is nothing special / data efficiency

Summary: AGI will not be upper-bounded by human ability or human learning speed (similarly to AlphaGo).  Things much smarter than human would be able to learn from less evidence than humans require.

#2. Unaligned superintelligence could easily take over

Summary: A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure. 

#3. Can't iterate on dangerous domains

Summary: At some point there will be a 'first critical try' at operating at a 'dangerous' level of intelligence, and on this 'first critical try', we need to get alignment right. 

#4. Can't cooperate to avoid AGI

Summary: The world can't just decide not to build AGI.

#5. Narrow AI is insufficient

Summary: We can't just build a very weak system.

#6. Pivotal act is necessary

Summary: We need to align the performance of some large task, a 'pivotal act' that prevents other people from building an unaligned AGI that destroys the world.

#7. There are no weak pivotal acts because a pivotal act requires power

Summary: It takes a lot of power to do something to the current world that prevents any other AGI from coming into existence; nothing which can do that is passively safe in virtue of its weakness. 

#8. Capabilities generalize out of desired scope

Summary: The best and easiest-found-by-optimization algorithms for solving problems we want an AI to solve, readily generalize to problems we'd rather the AI not solve.

#9. A pivotal act is a dangerous regime

Summary: The builders of a safe system would need to operate their system in a regime where it has the capability to kill everybody or make itself even more dangerous, but has been successfully designed to not do that.

Section B.1: The distributional leap

#10. Large distributional shift to dangerous domains

Summary: On anything like the standard ML paradigm, you would need to somehow generalize optimization-for-alignment you did in safe conditions, across a big distributional shift to dangerous conditions.
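
(An aside of mine, not part of Eliezer’s post or the DeepMind summaries: for anyone who wants a concrete reminder of what “distributional shift” means in ordinary ML terms, here’s a minimal toy sketch, assuming numpy and scikit-learn are available. The data, the features, and the shift are all invented for illustration; the only point is that a model trained entirely under one distribution can fail badly under another, without anything about the task definition changing.)

```python
# Toy illustration (mine, not from the post): a classifier trained under one
# input distribution degrades badly once that distribution shifts.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, mean):
    # Two features drawn around `mean`; the true label depends on whether the
    # feature sum exceeds the domain-specific threshold 2 * mean.
    X = rng.normal(loc=mean, scale=1.0, size=(n, 2))
    y = (X.sum(axis=1) > 2 * mean).astype(int)
    return X, y

X_train, y_train = make_data(2000, mean=0.0)   # the "safe" training conditions
X_shift, y_shift = make_data(2000, mean=4.0)   # shifted "deployment" conditions

clf = LogisticRegression().fit(X_train, y_train)
print("accuracy on the training distribution:", clf.score(X_train, y_train))
print("accuracy after the distributional shift:", clf.score(X_shift, y_shift))
```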

#11. Sim to real is hard

Summary: There's no known case where you can entrain a safe level of ability on a safe environment where you can cheaply do millions of runs, and deploy that capability to save the world.

#12. High intelligence is a large shift

Summary: Operating at a highly intelligent level is a drastic shift in distribution from operating at a less intelligent level.

#13. Some problems only occur above an intelligence threshold

Summary: Many alignment problems of superintelligence will not naturally appear at pre-dangerous, passively-safe levels of capability. 

#14. Some problems only occur in dangerous domains

Summary: Some problems seem like their natural order of appearance could be that they first appear only in fully dangerous domains.

#15. Capability gains from intelligence are correlated

Summary: Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously. 

Section B.2: Central difficulties of outer and inner alignment.

#16. Inner misalignment

Summary: Outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction.

#17. Can't control inner properties

Summary: On the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they're there, rather than just observable outer ones you can run a loss function over.

#18. No ground truth

Summary: There's no reliable Cartesian-sensory ground truth (reliable loss-function-calculator) about whether an output is 'aligned'.

#19. Pointers problem

Summary: There is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment.

#20. Flawed human feedback

Summary: Human raters make systematic errors - regular, compactly describable, predictable errors. 
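
(Again an aside of mine rather than anything from the post: a toy numpy sketch of why *systematic* rater errors are worse than noisy ones. The “length” bias, the weights, and the linear reward model are all made up for illustration; the point is just that unbiased noise mostly averages out of a learned reward model, while a predictable bias gets encoded in it, where an optimizer can then exploit it.)

```python
# Toy illustration (mine): if raters systematically prefer longer answers
# regardless of quality, a reward model fit to those ratings learns the bias.
import numpy as np

rng = np.random.default_rng(0)
n = 5000

quality = rng.normal(size=n)   # the property we actually want rewarded
length = rng.normal(size=n)    # an irrelevant, easy-to-game property

# Hypothetical systematic rater error: ratings partly reward length,
# plus some unbiased noise on top.
ratings = 1.0 * quality + 0.8 * length + rng.normal(scale=0.5, size=n)

# Fit a linear "reward model" to the human ratings.
X = np.column_stack([quality, length])
w, *_ = np.linalg.lstsq(X, ratings, rcond=None)
print("learned reward weights [quality, length]:", np.round(w, 2))
# The unbiased noise barely moves the weights, but the systematic bias shows
# up almost exactly (≈ 0.8), so maximizing the learned reward trades quality
# away for length.
```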

#21. Capabilities go further

Summary: Capabilities generalize further than alignment once capabilities start to generalize far.

#22. No simple alignment core

Summary: There is a simple core of general intelligence but there is no analogous simple core of alignment.

#23. Corrigibility is anti-natural.

Summary: Corrigibility is anti-natural to consequentialist reasoning.

#24. Sovereign vs corrigibility

Summary: There are two fundamentally different approaches you can potentially take to alignment [a sovereign optimizing CEV or a corrigible agent], which are unsolvable for two different sets of reasons. Therefore by ambiguating between the two approaches, you can confuse yourself about whether alignment is necessarily difficult.

Section B.3:  Central difficulties of sufficiently good and useful transparency / interpretability.

#25. Real interpretability is out of reach

Summary: We've got no idea what's actually going on inside the giant inscrutable matrices and tensors of floating-point numbers. 
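
(Another aside of mine, not from the post: even at toy scale this is easy to see for yourself. The sketch below, assuming scikit-learn, trains a small MLP on a standard 2-D toy dataset and prints its first weight matrix; those numbers are a complete description of what the first layer does and still tell you essentially nothing, by inspection, about what the network has learned. The point above is about models billions of times larger.)

```python
# A tiny network on a toy dataset: the learned weights are fully visible,
# and reading them still tells you almost nothing about the learned behaviour.
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
net = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000,
                    random_state=0).fit(X, y)

print("training accuracy:", net.score(X, y))
print("first-layer weight matrix (2 inputs x 16 hidden units):")
print(net.coefs_[0].round(2))
```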

#26. Interpretability is insufficient

Summary: Knowing that a medium-strength system of inscrutable matrices is planning to kill us, does not thereby let us build a high-strength system that isn't planning to kill us.

(This comment doesn’t make much sense without looking at the original phrasing)

#27. Selecting for undetectability

Summary: Optimizing against an interpreted thought optimizes against interpretability.

#28. Large option space 

Summary: A powerful AI searches parts of the option space we don't, and we can't foresee all its options.

#29. Real world is an opaque domain

Summary: AGI outputs go through a huge opaque domain before they have their real consequences, so we cannot evaluate consequences based on outputs. 

#30. Powerful vs understandable

Summary: No humanly checkable output is powerful enough to save the world.

#31. Hidden deception

Summary: You can't rely on behavioral inspection to determine facts about an AI which that AI might want to deceive you about.

#32. Language is insufficient or unsafe

Summary: Imitating human text can only be powerful enough if it spawns an inner non-imitative intelligence.

#33. Alien concepts

Summary: The AI does not think like you do, it is utterly alien on a staggering scale.

Section B.4:  Miscellaneous unworkable schemes. 

#34. Multipolar collusion

Summary: Humans cannot participate in coordination schemes between superintelligences.

#35. Multi-agent is single-agent

Summary: Any system of sufficiently intelligent agents can probably behave as a single agent, even if you imagine you're playing them against each other.

#36. Human flaws make containment difficult 

Summary: Only relatively weak AGIs can be contained; the human operators are not secure systems.

Section C (shorthand: "civilizational inadequacy")

I don’t think it would be productive to write out my point-by-point responses to this section: most of the top comments on the original post are discussing parts of it, and most of the things I think are somewhere in there.

Additional thoughts after writing:

Below are some half-formed thoughts which didn’t make sense as responses to specific points. They were already on my longlist to write up into full posts, but they may as well get the same very brief treatment as my reactions to individual points, especially as I was reminded of them while reading.

