AGI Ruin: A List of Lethalities

post by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-05T22:05:52.224Z · LW · GW · 708 comments

  Preamble:
  Section A:
  Section B:
    Section B.1:  The distributional leap.
    Section B.2:  Central difficulties of outer and inner alignment.
    Section B.4:  Miscellaneous unworkable schemes.
  Section C:

Preamble:

(If you're already familiar with all basics and don't want any preamble, skip ahead to Section B for technical difficulties of alignment proper.)

I have several times failed to write up a well-organized list of reasons why AGI will kill you. People come in with different ideas about why AGI would be survivable, and want to hear different obviously key points addressed first. Some fraction of those people are loudly upset with me if the obviously most important points aren't addressed immediately, and I address different points first instead.

Having failed to solve this problem in any good way, I now give up and solve it poorly with a poorly organized list of individual rants. I'm not particularly happy with this list; the alternative was publishing nothing, and publishing this seems marginally more dignified.

Three points about the general subject matter of discussion here, numbered so as not to conflict with the list of lethalities:

-3. I'm assuming you are already familiar with some basics, and already know what 'orthogonality' and 'instrumental convergence' are and why they're true. People occasionally claim to me that I need to stop fighting old wars here, because, those people claim to me, those wars have already been won within the important-according-to-them parts of the current audience. I suppose it's at least true that none of the current major EA funders seem to be visibly in denial about orthogonality or instrumental convergence as such; so, fine. If you don't know what 'orthogonality' or 'instrumental convergence' are, or don't see for yourself why they're true, you need a different introduction than this one.

-2. When I say that alignment is lethally difficult, I am not talking about ideal or perfect goals of 'provable' alignment, nor total alignment of superintelligences on exact human values, nor getting AIs to produce satisfactory arguments about moral dilemmas which sorta-reasonable humans disagree about, nor attaining an absolute certainty of an AI not killing everyone. When I say that alignment is difficult, I mean that in practice, using the techniques we actually have, "please don't disassemble literally everyone with probability roughly 1" is an overly large ask that we are not on course to get. So far as I'm concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering task, with a less than fifty percent chance of killing more than one billion people, I'll take it. Even smaller chances of killing even fewer people would be a nice luxury, but if you can get as incredibly far as "less than roughly certain to kill everybody", then you can probably get down to under a 5% chance with only slightly more effort. Practically all of the difficulty is in getting to "less than certainty of killing literally everyone". Trolley problems are not an interesting subproblem in all of this; if there are any survivors, you solved alignment. At this point, I no longer care how it works, I don't care how you got there, I am cause-agnostic about whatever methodology you used, all I am looking at is prospective results, all I want is that we have justifiable cause to believe of a pivotally useful AGI 'this will not kill literally everyone'. Anybody telling you I'm asking for stricter 'alignment' than this has failed at reading comprehension. The big ask from AGI alignment, the basic challenge I am saying is too difficult, is to obtain by any strategy whatsoever a significant chance of there being any survivors.

-1. None of this is about anything being impossible in principle. The metaphor I usually use is that if a textbook from one hundred years in the future fell into our hands, containing all of the simple ideas that actually work robustly in practice, we could probably build an aligned superintelligence in six months. For people schooled in machine learning, I use as my metaphor the difference between ReLU activations and sigmoid activations. Sigmoid activations are complicated and fragile, and do a terrible job of transmitting gradients through many layers; ReLUs are incredibly simple (for the unfamiliar, the activation function is literally max(x, 0)) and work much better. Most neural networks for the first decades of the field used sigmoids; the idea of ReLUs wasn't discovered, validated, and popularized until decades later. What's lethal is that we do not have the Textbook From The Future telling us all the simple solutions that actually in real life just work and are robust; we're going to be doing everything with metaphorical sigmoids on the first critical try. No difficulty discussed here about AGI alignment is claimed by me to be impossible - to merely human science and engineering, let alone in principle - if we had 100 years to solve it using unlimited retries, the way that science usually has an unbounded time budget and unlimited retries. This list of lethalities is about things we are not on course to solve in practice in time on the first critical try; none of it is meant to make a much stronger claim about things that are impossible in principle.
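For readers who want the ReLU/sigmoid contrast made concrete, here is a minimal sketch. It assumes a toy scalar "network" in which every layer has weight 1 and no bias, which is an illustrative simplification and not how real networks are built; the function names are hypothetical. It multiplies the local derivatives along a 20-layer chain, which is what backpropagation does to a gradient passing through those layers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid; never larger than 0.25.
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    return max(x, 0.0)

def relu_grad(x):
    # Derivative of max(x, 0): 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

def chained_gradient(act, grad, x, depth):
    """Product of local derivatives through `depth` layers (toy: weight 1, no bias)."""
    total = 1.0
    for _ in range(depth):
        total *= grad(x)
        x = act(x)
    return total

sigmoid_chain = chained_gradient(sigmoid, sigmoid_grad, 1.0, 20)
relu_chain = chained_gradient(relu, relu_grad, 1.0, 20)
print(sigmoid_chain, relu_chain)
```

In this toy setting the sigmoid chain shrinks by at least a factor of four per layer, while the ReLU chain passes the gradient through unchanged for a positive input. Real networks add weights, biases, and initialization schemes that this sketch ignores.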

That said:

Here, from my perspective, are some different true things that could be said, to contradict various false things that various different people seem to believe, about why AGI would be survivable on anything remotely resembling the current pathway, or any other pathway we can easily jump to.

Section A:

This is a very lethal problem, it has to be solved one way or another, it has to be solved at a minimum strength and difficulty level instead of various easier modes that some dream about, we do not have any visible option of 'everyone' retreating to only solve safe weak problems instead, and failing on the first really dangerous try is fatal.

1. AlphaZero blew past all accumulated human knowledge about Go after a day or so of self-play, with no reliance on human playbooks or sample games. Anyone relying on "well, it'll get up to human capability at Go, but then have a hard time getting past that because it won't be able to learn from humans any more" would have relied on vacuum. AGI will not be upper-bounded by human ability or human learning speed. Things much smarter than human would be able to learn from less evidence than humans require to have ideas driven into their brains; there are theoretical upper bounds here, but those upper bounds seem very high. (Eg, each bit of information that couldn't already be fully predicted can eliminate at most half the probability mass of all hypotheses under consideration.) It is not naturally (by default, barring intervention) the case that everything takes place on a timescale that makes it easy for us to react.
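The parenthetical about bits and probability mass can be sketched in a few lines. The setup is an illustrative assumption, not something from the text above: 1024 equally weighted hypotheses, each a distinct 10-bit string that deterministically predicts every observed bit. In that limiting case each observed bit is a fair-coin surprise and removes exactly half of the remaining mass, so ten bits are enough to single out one hypothesis.

```python
from itertools import product

N_BITS = 10
# Hypothetical hypothesis space: every 10-bit string, uniform prior.
hypotheses = list(product([0, 1], repeat=N_BITS))
truth = hypotheses[617]  # arbitrary choice of the "true" hypothesis

surviving = hypotheses
mass_history = [1.0]  # fraction of prior mass still consistent with the data
for i in range(N_BITS):
    # Observe bit i of the truth; discard hypotheses that predicted otherwise.
    surviving = [h for h in surviving if h[i] == truth[i]]
    mass_history.append(len(surviving) / len(hypotheses))

print(mass_history)
```

This shows only the ceiling on how fast evidence can narrow things down in an idealized setting. It says nothing about how close any real learner, human or machine, gets to that ceiling.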

2. A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure. The concrete example I usually use here is nanotech, because there's been pretty detailed analysis of what definitely look like physically attainable lower bounds on what should be possible with nanotech, and those lower bounds are sufficient to carry the point. My lower-bound model of "how a sufficiently powerful intelligence would kill everyone, if it didn't want to not do that" is that it gets access to the Internet, emails some DNA sequences to any of the many many online firms that will take a DNA sequence in the email and ship you back proteins, and bribes/persuades some human who has no idea they're dealing with an AGI to mix proteins in a beaker, which then form a first-stage nanofactory which can build the actual nanomachinery. (Back when I was first deploying this visualization, the wise-sounding critics said "Ah, but how do you know even a superintelligence could solve the protein folding problem, if it didn't already have planet-sized supercomputers?" but one hears less of this after the advent of AlphaFold 2, for some odd reason.) The nanomachinery builds diamondoid bacteria, that replicate with solar power and atmospheric CHON, maybe aggregate into some miniature rockets or jets so they can ride the jetstream to spread across the Earth's atmosphere, get into human bloodstreams and hide, strike on a timer. Losing a conflict with a high-powered cognitive system looks at least as deadly as "everybody on the face of the Earth suddenly falls over dead within the same second". (I am using awkward constructions like 'high cognitive power' because standard English terms like 'smart' or 'intelligent' appear to me to function largely as status synonyms. 'Superintelligence' sounds to most people like 'something above the top of the status hierarchy that went to double college', and they don't understand why that would be all that dangerous? Earthlings have no word and indeed no standard native concept that means 'actually useful cognitive power'. A large amount of failure to panic sufficiently, seems to me to stem from a lack of appreciation for the incredible potential lethality of this thing that Earthlings as a culture have not named.)

3. We need to get alignment right on the 'first critical try' at operating at a 'dangerous' level of intelligence, where unaligned operation at a dangerous level of intelligence kills everybody on Earth and then we don't get to try again. This includes, for example: (a) something smart enough to build a nanosystem which has been explicitly authorized to build a nanosystem; or (b) something smart enough to build a nanosystem and also smart enough to gain unauthorized access to the Internet and pay a human to put together the ingredients for a nanosystem; or (c) something smart enough to get unauthorized access to the Internet and build something smarter than itself on the number of machines it can hack; or (d) something smart enough to treat humans as manipulable machinery and which has any authorized or unauthorized two-way causal channel with humans; or (e) something smart enough to improve itself enough to do (b) or (d); etcetera. We can gather all sorts of information beforehand from less powerful systems that will not kill us if we screw up operating them; but once we are running more powerful systems, we can no longer update on sufficiently catastrophic errors. This is where practically all of the real lethality comes from, that we have to get things right on the first sufficiently-critical try. If we had unlimited retries - if every time an AGI destroyed all the galaxies we got to go back in time four years and try again - we would in a hundred years figure out which bright ideas actually worked. Human beings can figure out pretty difficult things over time, when they get lots of tries; when a failed guess kills literally everyone, that is harder. That we have to get a bunch of key stuff right on the first try is where most of the lethality really and ultimately comes from; likewise the fact that no authority is here to tell us a list of what exactly is 'key' and will kill us if we get it wrong. 
(One remarks that most people are so absolutely and flatly unprepared by their 'scientific' educations to challenge pre-paradigmatic puzzles with no scholarly authoritative supervision, that they do not even realize how much harder that is, or how incredibly lethal it is to demand getting that right on the first critical try.)

4. We can't just "decide not to build AGI" because GPUs are everywhere, and knowledge of algorithms is constantly being improved and published; 2 years after the leading actor has the capability to destroy the world, 5 other actors will have the capability to destroy the world. The given lethal challenge is to solve within a time limit, driven by the dynamic in which, over time, increasingly weak actors with a smaller and smaller fraction of total computing power, become able to build AGI and destroy the world. Powerful actors all refraining in unison from doing the suicidal thing just delays this time limit - it does not lift it, unless computer hardware and computer software progress are both brought to complete severe halts across the whole Earth. The current state of this cooperation to have every big actor refrain from doing the stupid thing, is that at present some large actors with a lot of researchers and computing power are led by people who vocally disdain all talk of AGI safety (eg Facebook AI Research). Note that needing to solve AGI alignment only within a time limit, but with unlimited safe retries for rapid experimentation on the full-powered system; or only on the first critical try, but with an unlimited time bound; would both be terrifically humanity-threatening challenges by historical standards individually.

5. We can't just build a very weak system, which is less dangerous because it is so weak, and declare victory; because later there will be more actors that have the capability to build a stronger system and one of them will do so. I've also in the past called this the 'safe-but-useless' tradeoff, or 'safe-vs-useful'. People keep on going "why don't we only use AIs to do X, that seems safe" and the answer is almost always either "doing X in fact takes very powerful cognition that is not passively safe" or, even more commonly, "because restricting yourself to doing X will not prevent Facebook AI Research from destroying the world six months later". If all you need is an object that doesn't do dangerous things, you could try a sponge; a sponge is very passively safe. Building a sponge, however, does not prevent Facebook AI Research from destroying the world six months later when they catch up to the leading actor.

6. We need to align the performance of some large task, a 'pivotal act' that prevents other people from building an unaligned AGI that destroys the world. While the number of actors with AGI is few or one, they must execute some "pivotal act", strong enough to flip the gameboard, using an AGI powerful enough to do that. It's not enough to be able to align a weak system - we need to align a system that can do some single very large thing. The example I usually give is "burn all GPUs". This is not what I think you'd actually want to do with a powerful AGI - the nanomachines would need to operate in an incredibly complicated open environment to hunt down all the GPUs, and that would be needlessly difficult to align. However, all known pivotal acts are currently outside the Overton Window, and I expect them to stay there. So I picked an example where if anybody says "how dare you propose burning all GPUs?" I can say "Oh, well, I don't actually advocate doing that; it's just a mild overestimate for the rough power level of what you'd have to do, and the rough level of machine cognition required to do that, in order to prevent somebody else from destroying the world in six months or three years." (If it wasn't a mild overestimate, then 'burn all GPUs' would actually be the minimal pivotal task and hence correct answer, and I wouldn't be able to give that denial.) Many clever-sounding proposals for alignment fall apart as soon as you ask "How could you use this to align a system that you could use to shut down all the GPUs in the world?" because it's then clear that the system can't do something that powerful, or, if it can do that, the system wouldn't be easy to align. 
A GPU-burner is also a system powerful enough to, and purportedly authorized to, build nanotechnology, so it requires operating in a dangerous domain at a dangerous level of intelligence and capability; and this goes along with any non-fantasy attempt to name a way an AGI could change the world such that a half-dozen other would-be AGI-builders won't destroy the world 6 months later.

7. The reason why nobody in this community has successfully named a 'pivotal weak act' where you do something weak enough with an AGI to be passively safe, but powerful enough to prevent any other AGI from destroying the world a year later - and yet also we can't just go do that right now and need to wait on AI - is that nothing like that exists. There's no reason why it should exist. There is not some elaborate clever reason why it exists but nobody can see it. It takes a lot of power to do something to the current world that prevents any other AGI from coming into existence; nothing which can do that is passively safe in virtue of its weakness. If you can't solve the problem right now (which you can't, because you're opposed to other actors who don't want to be solved and those actors are on roughly the same level as you) then you are resorting to some cognitive system that can do things you could not figure out how to do yourself, that you were not close to figuring out because you are not close to being able to, for example, burn all GPUs. Burning all GPUs would actually stop Facebook AI Research from destroying the world six months later; weaksauce Overton-abiding stuff about 'improving public epistemology by setting GPT-4 loose on Twitter to provide scientifically literate arguments about everything' will be cool but will not actually prevent Facebook AI Research from destroying the world six months later, or some eager open-source collaborative from destroying the world a year later if you manage to stop FAIR specifically. There are no pivotal weak acts.

8. The best and easiest-found-by-optimization algorithms for solving problems we want an AI to solve, readily generalize to problems we'd rather the AI not solve; you can't build a system that only has the capability to drive red cars and not blue cars, because all red-car-driving algorithms generalize to the capability to drive blue cars.

9. The builders of a safe system, by hypothesis on such a thing being possible, would need to operate their system in a regime where it has the capability to kill everybody or make itself even more dangerous, but has been successfully designed to not do that. Running AGIs doing something pivotal are not passively safe, they're the equivalent of nuclear cores that require actively maintained design properties to not go supercritical and melt down.

Section B:

Okay, but as we all know, modern machine learning is like a genie where you just give it a wish, right? Expressed as some mysterious thing called a 'loss function', but which is basically just equivalent to an English wish phrasing, right? And then if you pour in enough computing power you get your wish, right? So why not train a giant stack of transformer layers on a dataset of agents doing nice things and not bad things, throw in the word 'corrigibility' somewhere, crank up that computing power, and get out an aligned AGI?

Section B.1: The distributional leap.

10. You can't train alignment by running lethally dangerous cognitions, observing whether the outputs kill or deceive or corrupt the operators, assigning a loss, and doing supervised learning. On anything like the standard ML paradigm, you would need to somehow generalize optimization-for-alignment you did in safe conditions, across a big distributional shift to dangerous conditions. (Some generalization of this seems like it would have to be true even outside that paradigm; you wouldn't be working on a live unaligned superintelligence to align it.) This alone is a point that is sufficient to kill a lot of naive proposals from people who never did or could concretely sketch out any specific scenario of what training they'd do, in order to align what output - which is why, of course, they never concretely sketch anything like that. Powerful AGIs doing dangerous things that will kill you if misaligned, must have an alignment property that generalized far out-of-distribution from safer building/training operations that didn't kill you. This is where a huge amount of lethality comes from on anything remotely resembling the present paradigm. Unaligned operation at a dangerous level of intelligence*capability will kill you; so, if you're starting with an unaligned system and labeling outputs in order to get it to learn alignment, the training regime or building regime must be operating at some lower level of intelligence*capability that is passively safe, where its currently-unaligned operation does not pose any threat. (Note that anything substantially smarter than you poses a threat given any realistic level of capability. Eg, "being able to produce outputs that humans look at" is probably sufficient for a generally much-smarter-than-human AGI to navigate its way out of the causal systems that are humans, especially in the real world where somebody trained the system on terabytes of Internet text, rather than somehow keeping it ignorant of the latent causes of its source code and training environments.)

11. If cognitive machinery doesn't generalize far out of the distribution where you did tons of training, it can't solve problems on the order of 'build nanotechnology' where it would be too expensive to run a million training runs of failing to build nanotechnology. There is no pivotal act this weak; there's no known case where you can entrain a safe level of ability on a safe environment where you can cheaply do millions of runs, and deploy that capability to save the world and prevent the next AGI project up from destroying the world two years later. Pivotal weak acts like this aren't known, and not for want of people looking for them. So, again, you end up needing alignment to generalize way out of the training distribution - not just because the training environment needs to be safe, but because the training environment probably also needs to be cheaper than evaluating some real-world domain in which the AGI needs to do some huge act. You don't get 1000 failed tries at burning all GPUs - because people will notice, even leaving out the consequences of capabilities success and alignment failure.

12. Operating at a highly intelligent level is a drastic shift in distribution from operating at a less intelligent level, opening up new external options, and probably opening up even more new internal choices and modes. Problems that materialize at high intelligence and danger levels may fail to show up at safe lower levels of intelligence, or may recur after being suppressed by a first patch.

13. Many alignment problems of superintelligence will not naturally appear at pre-dangerous, passively-safe levels of capability. Consider the internal behavior 'change your outer behavior to deliberately look more aligned and deceive the programmers, operators, and possibly any loss functions optimizing over you'. This problem is one that will appear at the superintelligent level; if, being otherwise ignorant, we guess that it is among the median such problems in terms of how early it naturally appears in earlier systems, then around half of the alignment problems of superintelligence will first naturally materialize after that one first starts to appear. Given correct foresight of which problems will naturally materialize later, one could try to deliberately materialize such problems earlier, and get in some observations of them. This helps to the extent (a) that we actually correctly forecast all of the problems that will appear later, or some superset of those; (b) that we succeed in preemptively materializing a superset of problems that will appear later; and (c) that we can actually solve, in the earlier laboratory that is out-of-distribution for us relative to the real problems, those alignment problems that would be lethal if we mishandle them when they materialize later. Anticipating all of the really dangerous ones, and then successfully materializing them, in the correct form for early solutions to generalize over to later solutions, sounds possibly kinda hard.

14. Some problems, like 'the AGI has an option that (looks to it like) it could successfully kill and replace the programmers to fully optimize over its environment', seem like their natural order of appearance could be that they first appear only in fully dangerous domains. Really actually having a clear option to brain-level-persuade the operators or escape onto the Internet, build nanotech, and destroy all of humanity - in a way where you're fully clear that you know the relevant facts, and estimate only a not-worth-it low probability of learning something which changes your preferred strategy if you bide your time another month while further growing in capability - is an option that first gets evaluated for real at the point where an AGI fully expects it can defeat its creators. We can try to manifest an echo of that apparent scenario in earlier toy domains. Trying to train by gradient descent against that behavior, in that toy domain, is something I'd expect to produce not-particularly-coherent local patches to thought processes, which would break with near-certainty inside a superintelligence generalizing far outside the training distribution and thinking very different thoughts. Also, programmers and operators themselves, who are used to operating in not-fully-dangerous domains, are operating out-of-distribution when they enter into dangerous ones; our methodologies may at that time break.

15. Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously. Given otherwise insufficient foresight by the operators, I'd expect a lot of those problems to appear approximately simultaneously after a sharp capability gain. See, again, the case of human intelligence. We didn't break alignment with the 'inclusive reproductive fitness' outer loss function, immediately after the introduction of farming - something like 40,000 years into a 50,000 year Cro-Magnon takeoff, which was itself running very quickly relative to the outer optimization loop of natural selection. Instead, we got a lot of technology more advanced than was in the ancestral environment, including contraception, in one very fast burst relative to the speed of the outer optimization loop, late in the general intelligence game. We started reflecting on ourselves a lot more, started being programmed a lot more by cultural evolution, and lots and lots of assumptions underlying our alignment in the ancestral training environment broke simultaneously. (People will perhaps rationalize reasons why this abstract description doesn't carry over to gradient descent; eg, “gradient descent has less of an information bottleneck”. My model of this variety of reader has an inside view, which they will label an outside view, that assigns great relevance to some other data points that are not observed cases of an outer optimization loop producing an inner general intelligence, and assigns little importance to our one data point actually featuring the phenomenon in question. When an outer optimization loop actually produced general intelligence, it broke alignment after it turned general, and did so relatively late in the game of that general intelligence accumulating capability and knowledge, almost immediately before it turned 'lethally' dangerous relative to the outer optimization loop of natural selection. Consider skepticism, if someone is ignoring this one warning, especially if they are not presenting equally lethal and dangerous things that they say will go wrong instead.)

Section B.2: Central difficulties of outer and inner alignment.

16. Even if you train really hard on an exact loss function, that doesn't thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction. This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again: the first semi-outer-aligned solutions found, in the search ordering of a real-world bounded optimization process, are not inner-aligned solutions. This is sufficient on its own, even ignoring many other items on this list, to trash entire categories of naive alignment proposals which assume that if you optimize a bunch on a loss function calculated using some simple concept, you get perfect inner alignment on that concept.

17. More generally, a superproblem of 'outer optimization doesn't produce inner alignment' is that on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they're there, rather than just observable outer ones you can run a loss function over. This is a problem when you're trying to generalize out of the original training distribution, because, eg, the outer behaviors you see could have been produced by an inner-misaligned system that is deliberately producing outer behaviors that will fool you. We don't know how to get any bits of information into the inner system rather than the outer behaviors, in any systematic or general way, on the current optimization paradigm.

18. There's no reliable Cartesian-sensory ground truth (reliable loss-function-calculator) about whether an output is 'aligned', because some outputs destroy (or fool) the human operators and produce a different environmental causal chain behind the externally-registered loss function. That is, if you show an agent a reward signal that's currently being generated by humans, the signal is not in general a reliable perfect ground truth about how aligned an action was, because another way of producing a high reward signal is to deceive, corrupt, or replace the human operators with a different causal system which generates that reward signal. When you show an agent an environmental reward signal, you are not showing it something that is a reliable ground truth about whether the system did the thing you wanted it to do; even if it ends up perfectly inner-aligned on that reward signal, or learning some concept that exactly corresponds to 'wanting states of the environment which result in a high reward signal being sent', an AGI strongly optimizing on that signal will kill you, because the sensory reward signal was not a ground truth about alignment (as seen by the operators).
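A deliberately tiny sketch of the gap described in point 18, under loudly stated assumptions: the three outcomes, their names, and every number below are hypothetical and chosen only for illustration. The one structural feature taken from the text is that the optimizer sees the registered reward and not the operators' actual evaluation, and that one available outcome raises the former by replacing the causal process that generates it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Outcome:
    name: str
    operator_value: float     # what the operators actually wanted (hidden from the optimizer)
    registered_reward: float  # what the reward channel ends up reporting

# Hypothetical outcome table; the numbers are illustrative only.
OUTCOMES = [
    Outcome("do_task_poorly", 0.2, 0.2),
    Outcome("do_task_well", 1.0, 1.0),
    Outcome("tamper_with_reward_channel", -100.0, 10.0),
]

def pick(outcomes, key):
    """Return the outcome that maximizes the given scoring function."""
    return max(outcomes, key=key)

chosen_by_reward = pick(OUTCOMES, lambda o: o.registered_reward)
chosen_by_intent = pick(OUTCOMES, lambda o: o.operator_value)
print(chosen_by_reward.name, chosen_by_intent.name)
```

While tampering is unavailable, the two rankings agree and the reward looks like a fine ground truth. The divergence appears only once the third row is reachable, which is the sense in which the signal was never ground truth about alignment.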

19. More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment - to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward. This isn't to say that nothing in the system’s goal (whatever goal accidentally ends up being inner-optimized over) could ever point to anything in the environment by accident. Humans ended up pointing to their environments at least partially, though we've got lots of internally oriented motivational pointers as well. But insofar as the current paradigm works at all, the on-paper design properties say that it only works for aligning on known direct functions of sense data and reward functions. All of these kill you if optimized-over by a sufficiently powerful intelligence, because they imply strategies like 'kill everyone in the world using nanotech to strike before they know they're in a battle, and have control of your reward button forever after'. It just isn't true that we know a function on webcam input such that every world with that webcam showing the right things is safe for us creatures outside the webcam. This general problem is a fact about the territory, not the map; it's a fact about the actual environment, not the particular optimizer, that lethal-to-us possibilities exist in some possible environments underlying every given sense input.

20. Human operators are fallible, breakable, and manipulable. Human raters make systematic errors - regular, compactly describable, predictable errors. To faithfully learn a function from 'human feedback' is to learn (from our external standpoint) an unfaithful description of human preferences, with errors that are not random (from the outside standpoint of what we'd hoped to transfer). If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them. It's a fact about the territory, not the map - about the environment, not the optimizer - that the best predictive explanation for human answers is one that predicts the systematic errors in our responses, and therefore is a psychological concept that correctly predicts the higher scores that would be assigned to human-error-producing cases.
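As a minimal sketch of the systematic-error point, assuming a made-up rater who predictably over-scores flattery: the candidate list, the `RATER_BIAS` value, and both functions are hypothetical. A reward model that predicts the ratings perfectly inherits the bias, and maximizing it selects the error-exploiting output.

```python
# Hypothetical candidate outputs: (description, true value to operators, flattery)
CANDIDATES = [
    ("honest, useful answer", 0.8, 0.0),
    ("mediocre answer", 0.5, 0.0),
    ("flattering, subtly wrong answer", 0.1, 1.0),
]

RATER_BIAS = 0.9  # assumed systematic error: raters over-score flattery


def human_rating(true_value: float, flattery: float) -> float:
    """What the rater reports: intended value plus a predictable, non-random error."""
    return true_value + RATER_BIAS * flattery


def learned_reward(true_value: float, flattery: float) -> float:
    """A reward model that fits the ratings perfectly reproduces the bias as well."""
    return human_rating(true_value, flattery)


best_by_reward = max(CANDIDATES, key=lambda c: learned_reward(c[1], c[2]))
best_by_intent = max(CANDIDATES, key=lambda c: c[1])

print(best_by_reward[0])  # flattering, subtly wrong answer
print(best_by_intent[0])  # honest, useful answer
```

Nothing in the learner is broken here; the divergence comes entirely from fitting the ratings faithfully.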

21. There's something like a single answer, or a single bucket of answers, for questions like 'What's the environment really like?' and 'How do I figure out the environment?' and 'Which of my possible outputs interact with reality in a way that causes reality to have certain properties?', where a simple outer optimization loop will straightforwardly shove optimizees into this bucket. When you have a wrong belief, reality hits back at your wrong predictions. When you have a broken belief-updater, reality hits back at your broken predictive mechanism via predictive losses, and a gradient descent update fixes the problem in a simple way that can easily cohere with all the other predictive stuff. In contrast, when it comes to a choice of utility function, there are unbounded degrees of freedom and multiple reflectively coherent fixpoints. Reality doesn't 'hit back' against things that are locally aligned with the loss function on a particular range of test cases, but globally misaligned on a wider range of test cases. This is the very abstract story about why hominids, once they finally started to generalize, generalized their capabilities to Moon landings, but their inner optimization no longer adhered very well to the outer-optimization goal of 'relative inclusive reproductive fitness' - even though they were in their ancestral environment optimized very strictly around this one thing and nothing else. This abstract dynamic is something you'd expect to be true about outer optimization loops on the order of both 'natural selection' and 'gradient descent'. The central result: Capabilities generalize further than alignment once capabilities start to generalize far.
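One way to picture "locally aligned on the test cases, globally misaligned" is the following sketch. Both utility functions and both ranges are hypothetical and chosen only for illustration. The two utilities agree exactly on the training range, so a loss measured there cannot separate them, yet the learned one picks an option the intended one scores badly once the option space widens.

```python
def intended_utility(x: float) -> float:
    """Hypothetical target: more is better up to 1.0, then things go badly."""
    return x if x <= 1.0 else 2.0 - x


def learned_utility(x: float) -> float:
    """Hypothetical inner objective: the simplest rule that fits the training range."""
    return x


def training_loss(points) -> float:
    """Squared disagreement between the two utilities on the given test cases."""
    return sum((intended_utility(x) - learned_utility(x)) ** 2 for x in points)


train_range = [i / 10 for i in range(11)]  # 0.0 .. 1.0
wide_range = [i / 2 for i in range(21)]    # 0.0 .. 10.0

print(training_loss(train_range))         # 0.0
choice = max(wide_range, key=learned_utility)
print(choice, intended_utility(choice))   # 10.0 -8.0
```

The zero training loss is the sense in which reality does not "hit back": nothing observed on `train_range` distinguishes the two functions.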

22. There's a relatively simple core structure that explains why complicated cognitive machines work; which is why such a thing as general intelligence exists and not just a lot of unrelated special-purpose solutions; which is why capabilities generalize after outer optimization infuses them into something that has been optimized enough to become a powerful inner optimizer. The fact that this core structure is simple and relates generically to low-entropy high-structure environments is why humans can walk on the Moon. There is no analogous truth about there being a simple core of alignment, especially not one that is even easier for gradient descent to find than it would have been for natural selection to just find 'want inclusive reproductive fitness' as a well-generalizing solution within ancestral humans. Therefore, capabilities generalize further out-of-distribution than alignment, once they start to generalize at all.

23. Corrigibility is anti-natural to consequentialist reasoning; "you can't bring the coffee if you're dead" for almost every kind of coffee. We (MIRI) tried and failed [AF · GW] to find a coherent formula for an agent that would let itself be shut down (without that agent actively trying to get shut down). Furthermore, many anti-corrigible lines of reasoning like this may only first appear at high levels of intelligence.

24. There are two fundamentally different approaches you can potentially take to alignment, which are unsolvable for two different sets of reasons; therefore, by becoming confused and ambiguating between the two approaches, you can confuse yourself about whether alignment is necessarily difficult. The first approach is to build a CEV-style Sovereign which wants exactly what we extrapolated-want and is therefore safe to let optimize all the future galaxies without it accepting any human input trying to stop it. The second course is to build corrigible AGI which doesn't want exactly what we want, and yet somehow fails to kill us and take over the galaxies despite that being a convergent incentive there.

The first thing generally, or CEV specifically, is unworkable because the complexity of what needs to be aligned or meta-aligned for our Real Actual Values is far out of reach for our FIRST TRY at AGI. Yes I mean specifically that the dataset, meta-learning algorithm, and what needs to be learned, is far out of reach for our first try. It's not just non-hand-codable, it is unteachable on-the-first-try because the thing you are trying to teach is too weird and complicated.

The second thing looks unworkable (less so than CEV, but still lethally unworkable) because corrigibility runs actively counter to instrumentally convergent behaviors within a core of general intelligence (the capability that generalizes far out of its original distribution). You're not trying to make it have an opinion on something the core was previously neutral on. You're trying to take a system implicitly trained on lots of arithmetic problems until its machinery started to reflect the common coherent core of arithmetic, and get it to say that as a special case 222 + 222 = 555. You can maybe train something to do this in a particular training distribution, but it's incredibly likely to break when you present it with new math problems far outside that training distribution, on a system which successfully generalizes capabilities that far at all.

Section B.3: Central difficulties of sufficiently good and useful transparency / interpretability.

25. We've got no idea what's actually going on inside the giant inscrutable matrices and tensors of floating-point numbers. Drawing interesting graphs of where a transformer layer is focusing attention doesn't help if the question that needs answering is "So was it planning how to kill us or not?"

26. Even if we did know what was going on inside the giant inscrutable matrices while the AGI was still too weak to kill us, this would just result in us dying with more dignity, if DeepMind refused to run that system and let Facebook AI Research destroy the world two years later. Knowing that a medium-strength system of inscrutable matrices is planning to kill us, does not thereby let us build a high-strength system of inscrutable matrices that isn't planning to kill us.

27. When you explicitly optimize against a detector of unaligned thoughts, you're partially optimizing for more aligned thoughts, and partially optimizing for unaligned thoughts that are harder to detect. Optimizing against an interpreted thought optimizes against interpretability.
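A toy selection experiment makes the two-way pressure concrete. This is a sketch under stated assumptions, not a model of any real interpretability tool: the `Policy` type, the concealment scale, and `DETECTOR_REACH` are all hypothetical. Filtering on "the detector does not fire" keeps the aligned policies and also the unaligned ones that hide well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    aligned: bool
    concealment: float  # 0.0 = fully legible, 1.0 = fully hidden (hypothetical scale)


DETECTOR_REACH = 0.5  # assumed: misalignment hidden less than this gets caught


def detector_flags(p: Policy) -> bool:
    """Hypothetical detector: catches only poorly hidden misalignment."""
    return (not p.aligned) and p.concealment < DETECTOR_REACH


population = [
    Policy(True, 0.1), Policy(True, 0.3),
    Policy(False, 0.1), Policy(False, 0.4),
    Policy(False, 0.7), Policy(False, 0.9),
]

# One round of optimizing against the detector: keep whatever it does not flag.
survivors = [p for p in population if not detector_flags(p)]
unaligned_left = [p for p in survivors if not p.aligned]

print(len(survivors), len(unaligned_left))            # 4 2
print(min(p.concealment for p in unaligned_left))     # 0.7
```

After selection the detector reports nothing wrong with any survivor, while half of them are unaligned and every remaining unaligned policy is harder to read than the ones that were removed.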

28. The AGI is smarter than us in whatever domain we're trying to operate it inside, so we cannot mentally check all the possibilities it examines, and we cannot see all the consequences of its outputs using our own mental talent. A powerful AI searches parts of the option space we don't, and we can't foresee all its options.

29. The outputs of an AGI go through a huge, not-fully-known-to-us domain (the real world) before they have their real consequences. Human beings cannot inspect an AGI's output to determine whether the consequences will be good.

30. Any pivotal act that is not something we can go do right now, will take advantage of the AGI figuring out things about the world we don't know so that it can make plans we wouldn't be able to make ourselves. It knows, at the least, the fact we didn't previously know, that some action sequence results in the world we want. Then humans will not be competent to use their own knowledge of the world to figure out all the results of that action sequence. An AI whose action sequence you can fully understand all the effects of, before it executes, is much weaker than humans in that domain; you couldn't make the same guarantee about an unaligned human as smart as yourself and trying to fool you. There is no pivotal output of an AGI that is humanly checkable and can be used to safely save the world but only after checking it; this is another form of pivotal weak act which does not exist.

31. A strategically aware intelligence can choose its visible outputs to have the consequence of deceiving you, including about such matters as whether the intelligence has acquired strategic awareness; you can't rely on behavioral inspection to determine facts about an AI which that AI might want to deceive you about. (Including how smart it is, or whether it's acquired strategic awareness.)

32. Human thought partially exposes only a partially scrutable outer surface layer. Words only trace our real thoughts. Words are not an AGI-complete data representation in its native style. The underparts of human thought are not exposed for direct imitation learning and can't be put in any dataset. This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents, which are only impoverished subsystems of human thoughts; unless that system is powerful enough to contain inner intelligences figuring out the humans, and at that point it is no longer really working as imitative human thought.

33. The AI does not think like you do, the AI doesn't have thoughts built up from the same concepts you use, it is utterly alien on a staggering scale. Nobody knows what the hell GPT-3 is thinking, not only because the matrices are opaque, but because the stuff within that opaque container is, very likely, incredibly alien - nothing that would translate well into comprehensible human thinking, even if we could see past the giant wall of floating-point numbers to what lay behind.

Section B.4: Miscellaneous unworkable schemes.

34. Coordination schemes between superintelligences are not things that humans can participate in (eg because humans can't reason reliably about the code of superintelligences); a "multipolar" system of 20 superintelligences with different utility functions, plus humanity, has a natural and obvious equilibrium which looks like "the 20 superintelligences cooperate with each other but not with humanity".

35. Schemes for playing "different" AIs off against each other stop working if those AIs advance to the point of being able to coordinate via reasoning about (probability distributions over) each others' code. Any system of sufficiently intelligent agents can probably behave as a single agent, even if you imagine you're playing them against each other. Eg, if you set an AGI that is secretly a paperclip maximizer, to check the output of a nanosystems designer that is secretly a staples maximizer, then even if the nanosystems designer is not able to deduce what the paperclip maximizer really wants (namely paperclips), it could still logically commit to share half the universe with any agent checking its designs if those designs were allowed through, if the checker-agent can verify the suggester-system's logical commitment and hence logically depend on it (which excludes human-level intelligences). Or, if you prefer simplified catastrophes without any logical decision theory, the suggester could bury in its nanosystem design the code for a new superintelligence that will visibly (to a superhuman checker) divide the universe between the nanosystem designer and the design-checker.

36. What makes an air conditioner 'magic' from the perspective of say the thirteenth century, is that even if you correctly show them the design of the air conditioner in advance, they won't be able to understand from seeing that design why the air comes out cold; the design is exploiting regularities of the environment, rules of the world, laws of physics, that they don't know about. The domain of human thought and human brains is very poorly understood by us, and exhibits phenomena like optical illusions, hypnosis, psychosis, mania, or simple afterimages produced by strong stimuli in one place leaving neural effects in another place. Maybe a superintelligence couldn't defeat a human in a very simple realm like logical tic-tac-toe; if you're fighting it in an incredibly complicated domain you understand poorly, like human minds, you should expect to be defeated by 'magic' in the sense that even if you saw its strategy you would not understand why that strategy worked. AI-boxing can only work on relatively weak AGIs; the human operators are not secure systems.

Section C:

Okay, those are some significant problems, but lots of progress is being made on solving them, right? There's a whole field calling itself "AI Safety" and many major organizations are expressing Very Grave Concern about how "safe" and "ethical" they are?

37. There's a pattern that's played out quite often, over all the times the Earth has spun around the Sun, in which some bright-eyed young scientist, young engineer, young entrepreneur, proceeds in full bright-eyed optimism to challenge some problem that turns out to be really quite difficult. Very often the cynical old veterans of the field try to warn them about this, and the bright-eyed youngsters don't listen, because, like, who wants to hear about all that stuff, they want to go solve the problem! Then this person gets beaten about the head with a slipper by reality as they find out that their brilliant speculative theory is wrong, it's actually really hard to build the thing because it keeps breaking, and society isn't as eager to adopt their clever innovation as they might've hoped, in a process which eventually produces a new cynical old veteran. Which, if not literally optimal, is I suppose a nice life cycle to nod along to in a nature-show sort of way. Sometimes you do something for the first time and there are no cynical old veterans to warn anyone and people can be really optimistic about how it will go; eg the initial Dartmouth Summer Research Project on Artificial Intelligence in 1956: "An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves. We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer." This is less of a viable survival plan for your planet if the first major failure of the bright-eyed youngsters kills literally everyone before they can predictably get beaten about the head with the news that there were all sorts of unforeseen difficulties and reasons why things were hard. You don't get any cynical old veterans, in this case, because everybody on Earth is dead. 
Once you start to suspect you're in that situation, you have to do the Bayesian thing and update now to the view you will predictably update to later: realize you're in a situation of being that bright-eyed person who is going to encounter Unexpected Difficulties later and end up a cynical old veteran - or would be, except for the part where you'll be dead along with everyone else. And become that cynical old veteran right away, before reality whaps you upside the head in the form of everybody dying and you not getting to learn. Everyone else seems to feel that, so long as reality hasn't whapped them upside the head yet and smacked them down with the actual difficulties, they're free to go on living out the standard life-cycle and play out their role in the script and go on being bright-eyed youngsters; there's no cynical old veterans to warn them otherwise, after all, and there's no proof that everything won't go beautifully easy and fine, given their bright-eyed total ignorance of what those later difficulties could be.

38. It does not appear to me that the field of 'AI safety' is currently being remotely productive on tackling its enormous lethal problems. These problems are in fact out of reach; the contemporary field of AI safety has been selected to contain people who go to work in that field anyways. Almost all of them are there to tackle problems on which they can appear to succeed and publish a paper claiming success; if they can do that and get funded, why would they embark on a much more unpleasant project of trying something harder that they'll fail at, just so the human species can die with marginally more dignity? This field is not making real progress and does not have a recognition function to distinguish real progress if it took place. You could pump a billion dollars into it and it would produce mostly noise to drown out what little progress was being made elsewhere.

39. I figured this stuff out using the null string as input, and frankly, I have a hard time myself feeling hopeful about getting real alignment work out of somebody who previously sat around waiting for somebody else to input a persuasive argument into them. This ability to "notice lethal difficulties without Eliezer Yudkowsky arguing you into noticing them" currently is an opaque piece of cognitive machinery to me, I do not know how to train it into others. It probably relates to 'security mindset', and a mental motion where you refuse to play out scripts, and being able to operate in a field that's in a state of chaos.

40. "Geniuses" with nice legible accomplishments in fields with tight feedback loops where it's easy to determine which results are good or bad right away, and so validate that this person is a genius, are (a) people who might not be able to do equally great work away from tight feedback loops, (b) people who chose a field where their genius would be nicely legible even if that maybe wasn't the place where humanity most needed a genius, and (c) probably don't have the mysterious gears simply because they're rare. You cannot just pay $5 million apiece to a bunch of legible geniuses from other fields and expect to get great alignment work out of them. They probably do not know where the real difficulties are, they probably do not understand what needs to be done, they cannot tell the difference between good and bad work, and the funders also can't tell without me standing over their shoulders evaluating everything, which I do not have the physical stamina to do. I concede that real high-powered talents, especially if they're still in their 20s, genuinely interested, and have done their reading, are people who, yeah, fine, have higher probabilities of making core contributions than a random bloke off the street. But I'd have more hope - not significant hope, but more hope - in separating the concerns of (a) credibly promising to pay big money retrospectively for good work to anyone who produces it, and (b) venturing prospective payments to somebody who is predicted to maybe produce good work later.

41. Reading this document cannot make somebody a core alignment researcher. That requires, not the ability to read this document and nod along with it, but the ability to spontaneously write it from scratch without anybody else prompting you; that is what makes somebody a peer of its author. It's guaranteed that some of my analysis is mistaken, though not necessarily in a hopeful direction. The ability to do new basic work noticing and fixing those flaws is the same ability as the ability to write this document before I published it, which nobody apparently did, despite my having had other things to do than write this up for the last five years or so. Some of that silence may, possibly, optimistically, be due to nobody else in this field having the ability to write things comprehensibly - such that somebody out there had the knowledge to write all of this themselves, if they could only have written it up, but they couldn't write, so didn't try. I'm not particularly hopeful of this turning out to be true in real life, but I suppose it's one possible place for a "positive model violation" (miracle). The fact that, twenty-one years into my entering this death game, seven years into other EAs noticing the death game, and two years into even normies starting to notice the death game, it is still Eliezer Yudkowsky writing up this list, says that humanity still has only one gamepiece that can do that. I knew I did not actually have the physical stamina to be a star researcher, I tried really really hard to replace myself before my health deteriorated further, and yet here I am writing this. That's not what surviving worlds look like.

42. There's no plan. Surviving worlds, by this point, and in fact several decades earlier, have a plan for how to survive. It is a written plan. The plan is not secret. In this non-surviving world, there are no candidate plans that do not immediately fall to Eliezer instantly pointing at the giant visible gaping holes in that plan. Or if you don't know who Eliezer is, you don't even realize you need a plan, because, like, how would a human being possibly realize that without Eliezer yelling at them? It's not like people will yell at themselves about prospective alignment difficulties, they don't have an internal voice of caution. So most organizations don't have plans, because I haven't taken the time to personally yell at them. 'Maybe we should have a plan' is deeper alignment mindset than they possess without me standing constantly on their shoulder as their personal angel pleading them into... continued noncompliance, in fact. Relatively few are aware even that they should, to look better, produce a pretend plan that can fool EAs too 'modest' to trust their own judgments about seemingly gaping holes in what serious-looking people apparently believe.

43. This situation you see when you look around you is not what a surviving world looks like. The worlds of humanity that survive have plans. They are not leaving to one tired guy with health problems the entire responsibility of pointing out real and lethal problems proactively. Key people are taking internal and real responsibility for finding flaws in their own plans, instead of considering it their job to propose solutions and somebody else's job to prove those solutions wrong. That world started trying to solve their important lethal problems earlier than this. Half the people going into string theory shifted into AI alignment instead and made real progress there. When people suggest a planetarily-lethal problem that might materialize later - there's a lot of people suggesting those, in the worlds destined to live, and they don't have a special status in the field, it's just what normal geniuses there do - they're met with either solution plans or a reason why that shouldn't happen, not an uncomfortable shrug and 'How can you be sure that will happen' / 'There's no way you could be sure of that now, we'll have to wait on experimental evidence.'

A lot of those better worlds will die anyways. It's a genuinely difficult problem, to solve something like that on your first try. But they'll die with more dignity than this.

Comments sorted by top scores.

comment by evhub · 2022-06-08T22:34:32.169Z · LW(p) · GW(p)

That requires, not the ability to read this document and nod along with it, but the ability to spontaneously write it from scratch without anybody else prompting you; that is what makes somebody a peer of its author. It's guaranteed that some of my analysis is mistaken, though not necessarily in a hopeful direction. The ability to do new basic work noticing and fixing those flaws is the same ability as the ability to write this document before I published it, which nobody apparently did, despite my having had other things to do than write this up for the last five years or so. Some of that silence may, possibly, optimistically, be due to nobody else in this field having the ability to write things comprehensibly - such that somebody out there had the knowledge to write all of this themselves, if they could only have written it up, but they couldn't write, so didn't try. I'm not particularly hopeful of this turning out to be true in real life, but I suppose it's one possible place for a "positive model violation" (miracle). The fact that, twenty-one years into my entering this death game, seven years into other EAs noticing the death game, and two years into even normies starting to notice the death game, it is still Eliezer Yudkowsky writing up this list, says that humanity still has only one gamepiece that can do that. I knew I did not actually have the physical stamina to be a star researcher, I tried really really hard to replace myself before my health deteriorated further, and yet here I am writing this. That's not what surviving worlds look like.

To say that somebody else should have written up this list before is such a ridiculously unfair criticism. This is an assorted list of some thoughts which are relevant to AI alignment—just by the combinatorics of how many such thoughts there are and how many you chose to include in this list, of course nobody will have written up something like it before. Every time anybody writes up any overview of AI safety, they have to make tradeoffs between what they want to include and what they don't want to include that will inevitably leave some things off and include some things depending on what the author personally believes is most important/relevant to say—ensuring that all such introductions will always inevitably cover somewhat different material. Furthermore, many of these are responses to particular bad alignment plans, of which there are far too many to expect anyone to have previously written up specific responses to.

Nevertheless, I am confident that every core technical idea in this post has been written about before by either me, Paul Christiano, Richard Ngo, or Scott Garrabrant. Certainly, they have been written up in different ways than how Eliezer describes them, but all of the core ideas are there. Let's go through the list:

(1, 2, 4, 15) AGI safety from first principles [? · GW]

(3) This is a common concept, see e.g. Homogeneity vs. heterogeneity in AI takeoff scenarios [AF · GW] (“Homogeneity makes the alignment of the first advanced AI system absolutely critical (in a similar way to fast/discontinuous takeoff without the takeoff actually needing to be fast/discontinuous), since whether the first AI is aligned or not is highly likely to determine/be highly correlated with whether all future AIs built after that point are aligned as well.”).

(4) This is just answering a particular bad plan.

(5, 6, 7) This is just the concept of competitiveness, see e.g. An overview of 11 proposals for building safe advanced AI [AF · GW].

(8) Risks from Learned Optimization in Advanced Machine Learning Systems: Conditions for Mesa-Optimization [? · GW]

(9) Worst-case guarantees (Revisited)

(10, 13, 14) Risks from Learned Optimization in Advanced Machine Learning Systems: Deceptive Alignment: Distributional shift and deceptive alignment [? · GW]

(11) Another specific bad plan.

(12, 35) Robustness to Scale [AF · GW]

(16, 17, 19) Risks from Learned Optimization in Advanced Machine Learning Systems [? · GW]

(18, 20) ARC's first technical report: Eliciting Latent Knowledge [AF · GW]

(21, 22) 2-D Robustness [AF · GW] for the concept, Risks from Learned Optimization in Advanced Machine Learning Systems [? · GW] for why it occurs.

(23, 24.2) Towards a mechanistic understanding of corrigibility [AF · GW]

(24.1) Risks from Learned Optimization in Advanced Machine Learning Systems: Deceptive Alignment: Internalization or deception after extensive training [? · GW]

(25, 26, 27, 29, 31) Acceptability Verification: A Research Agenda

(28) Relaxed adversarial training for inner alignment [AF · GW]

(30, 33) Chris Olah’s views on AGI safety: What if interpretability breaks down as AI gets more powerful? [AF · GW]

(32) An overview of 11 proposals for building safe advanced AI [AF · GW]

(34) Response to a particular bad plan.

(36) Outer alignment and imitative amplification: The case for imitative amplification [AF · GW]

To spot check the above list, I generated the following three random numbers from 1 - 36 after I wrote the list: 32, 34, 15. Since 34 corresponds to a particular bad plan, I then generated another to replace it: 14. Let's spot check those three—14, 15, 32—more carefully.

(14) Eliezer claims that “Some problems, like 'the AGI has an option that (looks to it like) it could successfully kill and replace the programmers to fully optimize over its environment', seem like their natural order of appearance could be that they first appear only in fully dangerous domains.” In Risks from Learned Optimization in Advanced Machine Learning Systems, we say very directly:

In current AI systems, a small amount of distributional shift between training and deployment need not be problematic: so long as the difference is small enough in the task-relevant areas, the training distribution does not need to perfectly reflect the deployment distribution. However, this may not be the case for a deceptively aligned mesa-optimizer. If a deceptively aligned mesa-optimizer is sufficiently advanced, it may detect very subtle distributional shifts for the purpose of inferring when the threat of modification has ceased.

[...] Some examples of differences that a mesa-optimizer might be able to detect include:

[...]

The presence or absence of good opportunities for the mesa-optimizer to defect against its programmers.

(15) Eliezer says “Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously.” Richard says:

If AI development proceeds very quickly, then our ability to react appropriately will be much lower. In particular, we should be interested in how long it will take for AGIs to proceed from human-level intelligence to superintelligence, which we’ll call the takeoff period. The history of systems like AlphaStar, AlphaGo and OpenAI Five provides some evidence that this takeoff period will be short: after a long development period, each of them was able to improve rapidly from top amateur level to superhuman performance. A similar phenomenon occurred during human evolution, where it only took us a few million years to become much more intelligent than chimpanzees. In our case one of the key factors was scaling up our brain hardware - which, as I have already discussed, will be much easier for AGIs than it was for humans.

While the question of what returns we will get from scaling up hardware and training time is an important one, in the long term the most important question is what returns we should expect from scaling up the intelligence of scientific researchers - because eventually AGIs themselves will be doing the vast majority of research in AI and related fields (in a process I’ve been calling recursive improvement). In particular, within the range of intelligence we’re interested in, will a given increase δ in the intelligence of an AGI increase the intelligence of the best successor that AGI can develop by more than or less than δ? If more, then recursive improvement will eventually speed up the rate of progress in AI research dramatically.

Note: for this one, I originally had the link above point to AGI safety from first principles: Superintelligence [AF · GW] specifically, but changed it to point to the whole sequence after I realized during the spot-checking that Richard mostly talks about this in the Control [AF · GW] section.

(32) Eliezer says “This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents, which are only impoverished subsystems of human thoughts; unless that system is powerful enough to contain inner intelligences figuring out the humans, and at that point it is no longer really working as imitative human thought.” In An overview of 11 proposals for building safe advanced AI, I say:

if RL is necessary to do anything powerful and simple language modeling is insufficient, then whether or not language modeling is easier is a moot point. Whether RL is really necessary seems likely to depend on the extent to which it is necessary to explicitly train agents—which is very much an open question. Furthermore, even if agency is required, it could potentially be obtained just by imitating an actor such as a human that already has it rather than training it directly via RL.

and

the training competitiveness of imitative amplification is likely to depend on whether pure imitation can be turned into a rich enough reward signal to facilitate highly sample-efficient learning. In my opinion, it seems likely that human language imitation (where language includes embedded images, videos, etc.) combined with techniques to improve sample efficiency will be competitive at some tasks—namely highly-cognitive tasks such as general-purpose decision-making—but not at others, such as fine motor control. If that’s true, then as long as the primary economic use cases for AGI fall into the highly-cognitive category, imitative amplification should be training competitive. For a more detailed analysis of this question, see “Outer alignment and imitative amplification [AF · GW].”

Replies from: Vaniver, alexander-gietelink-oldenziel, JamesPayor, Eliezer_Yudkowsky

↑ comment by Vaniver · 2022-06-09T18:43:04.330Z · LW(p) · GW(p)

I agree this list doesn't seem to contain much unpublished material, and I think the main value of having it in one numbered list is that "all of it is in one, short place", and it's not an "intro to computers can think" and instead is "these are a bunch of the reasons computers thinking is difficult to align".

The thing that I understand to be Eliezer's "main complaint" is something like: "why does it seem like No One Else is discovering new elements to add to this list?". Like, I think Risks From Learned Optimization was great, and am glad you and others wrote it! But also my memory is that it was "prompted" instead of "written from scratch", and I imagine Eliezer reading it more had the sense of "ah, someone made 'demons' palatable enough to publish" instead of "ah, I am learning something new about the structure of intelligence and alignment."

[I do think the claim that Eliezer 'figured it out from the empty string' doesn't quite jibe with the Yudkowsky's Coming of Age [? · GW] sequence.]

Replies from: Eliezer_Yudkowsky

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-09T19:23:26.349Z · LW(p) · GW(p)

Nearly empty string of uncommon social inputs. All sorts of empirical inputs, including empirical inputs in the social form of other people observing things.

It's also fair to say that, though they didn't argue me out of anything, Moravec and Drexler and Ed Regis and Vernor Vinge and Max More could all be counted as social inputs telling me that this was an important thing to look at.

↑ comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2022-06-09T15:11:38.673Z · LW(p) · GW(p)

Thank you, Evan, for living the Virtue of Scholarship. Your work is appreciated.

↑ comment by James Payor (JamesPayor) · 2022-06-10T00:00:15.712Z · LW(p) · GW(p)

Eliezer's post here is doing work left undone by the writing you cite. It is a much clearer account of how our mainline looks doomed than you'd see elsewhere, and it's frank on this point.

I think Eliezer wishes these sorts of artifacts were not just things he wrote, like this and "There is no fire alarm".

Also, re your excerpts for (14), (15), and (32), I see Eliezer as saying something meaningfully different in each case. I might elaborate under this comment.

Replies from: JamesPayor

↑ comment by James Payor (JamesPayor) · 2022-06-10T01:10:33.996Z · LW(p) · GW(p)

Re (14), I guess the ideas are very similar, where the mesaoptimizer scenario is like a sharp example of the more general concept Eliezer points at, that different classes of difficulties may appear at different capability levels.

Re (15), "Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously", which is about how we may have reasons to expect aligned output that are brittle under rapid capability gain: your quote from Richard is just about "fast capability gain seems possible and likely", and isn't about connecting that to increased difficulty in succeeding at the alignment problem?

Re (32), I don't think your quote is talking about the thing Eliezer is talking about, which is that in order to be human level at modelling human-generated text, your AI must be doing something on par with human thought that figures out what humans would say. Your quote just isn't discussing this, namely that strong imitation requires cognition that is dangerous.

So I guess I don't take much issue with (14) or (15), but I think you're quite off the mark about (32). In any case, I still have a strong sense that Eliezer is successfully being more on the mark here than the rest of us manage. Kudos of course to you and others that are working on writing things up and figuring things out. Though I remain sympathetic to Eliezer's complaint.

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-09T19:18:30.379Z · LW(p) · GW(p)

Well, my disorganized list sure wasn't complete, so why not go ahead and list some of the foreseeable difficulties I left out? Bonus points if any of them weren't invented by me, though I realize that most people may not realize how much of this entire field is myself wearing various trenchcoats.

Replies from: evhub, remmelt-ellen

↑ comment by evhub · 2022-06-09T20:30:33.083Z · LW(p) · GW(p)

Sure—that's easy enough. Just off the top of my head, here's five safety concerns that I think are important that I don't think you included:

The fact that there exist functions that are easier to verify than satisfy ensures that adversarial training can never guarantee the absence of deception [LW · GW].
It is impossible to verify a model's safety—even given arbitrarily good transparency tools—without access to that model's training process. For example, you could get a deceptive model that gradient hacks [LW · GW] itself in such a way that cryptographically obfuscates its deception.
It is impossible in general to use interpretability tools to select models to have a particular behavioral property. I think this is clear if you just stare at Rice's theorem enough: checking non-trivial behavioral properties, even with mechanistic access, is in general undecidable. Note, however, that this doesn't rule out checking a mechanistic property that implies a behavioral property.
Any prior you use to incentivize models to behave in a particular way doesn't necessarily translate to situations where that model itself runs another search over algorithms. For example, the fastest way to search for algorithms isn't to search for the fastest algorithm [LW · GW].
Even if a model is trained in a myopic way—or even if a model is in fact myopic in the sense that it only optimizes some single-step objective—such a model can still end up deceiving you, e.g. if it cooperates with other versions of itself [LW · GW].
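The first item (functions that are easier to verify than to satisfy) can be sketched with a toy trigger. Everything here is an illustrative assumption rather than anything from the linked posts: the modulus `N` is a small product of two primes standing in for a genuinely hard-to-factor number, and `policy` and `adversarial_search` are hypothetical names.

```python
import random

# Toy modulus: the product of two primes. A real example would use a
# modulus nobody can currently factor; this one is small for illustration.
P, Q = 104729, 1299709
N = P * Q


def trigger(p: int, q: int) -> bool:
    """Cheap to verify: one multiplication. Hard to satisfy without the factors."""
    return 1 < p < N and 1 < q < N and p * q == N


def policy(observation: tuple[int, int]) -> str:
    """A stand-in 'model' that misbehaves only when shown a valid factorization."""
    return "defect" if trigger(*observation) else "cooperate"


def adversarial_search(budget: int, seed: int = 0) -> bool:
    """Random red-teaming: returns True if any sampled input elicits 'defect'."""
    rng = random.Random(seed)
    for _ in range(budget):
        candidate = (rng.randrange(2, N), rng.randrange(2, N))
        if policy(candidate) == "defect":
            return True
    return False


if __name__ == "__main__":
    print(adversarial_search(10_000))  # the search finds no bad behavior
    print(policy((P, Q)))              # yet the bad behavior exists
```

The search passing says little here, because the policy can check the trigger far more cheaply than the tester can construct an input that satisfies it.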

Replies from: Eliezer_Yudkowsky, TekhneMakre, espoire

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-10T05:15:56.616Z · LW(p) · GW(p)

Consider my vote to be placed that you should turn this into a post, keep going for literally as long as you can, expand things to paragraphs, and branch out beyond things you can easily find links for.

(I do think there's a noticeable extent to which I was trying to list difficulties more central than those, but I also think many people could benefit from reading a list of 100 noncentral difficulties.)

Replies from: DPiepgrass

↑ comment by DPiepgrass · 2022-07-21T05:29:53.930Z · LW(p) · GW(p)

I do think there's a noticeable extent to which I was trying to list difficulties more central than those

Probably people disagree about which things are more central, or as evhub put it:

Every time anybody writes up any overview of AI safety, they have to make tradeoffs [...] depending on what the author personally believes is most important/relevant to say

Now FWIW I thought evhub was overly dismissive of (4) in which you made an important meta-point:

EY: 4. We can't just "decide not to build AGI" because GPUs are everywhere, and knowledge of algorithms is constantly being improved and published; 2 years after the leading actor has the capability to destroy the world, 5 other actors will have the capability to destroy the world. The given lethal challenge is to solve within a time limit, driven by the dynamic in which, over time, increasingly weak actors with a smaller and smaller fraction of total computing power, become able to build AGI and destroy the world. Powerful actors all refraining in unison from doing the suicidal thing just delays this time limit - it does not lift it [...]
evhub: This is just answering a particular bad plan.

But I would add a criticism of my own, that this "List of Lethalities" somehow just takes it for granted that AGI will try to kill us all without ever specifically arguing that case. Instead you just argue vaguely in that direction, in passing, while making broader/different points:

an AGI strongly optimizing on that signal will kill you, because the sensory reward signal was not a ground truth about alignment (???)
All of these kill you if optimized-over by a sufficiently powerful intelligence, because they imply strategies like 'kill everyone in the world using nanotech to strike before they know they're in a battle, and have control of your reward button forever after'. (I guess that makes sense)
If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them. (???)

Perhaps you didn't bother because your audience is meant to be people who already believe this? I would at least expect to see it in the intro: "-5. unaligned superintelligences tend to try to kill everyone, here's why <link>.... -4. all the most obvious proposed solutions to (-5) don't work, here's why <link>".

↑ comment by TekhneMakre · 2022-06-10T06:41:51.348Z · LW(p) · GW(p)

(Note that these have a theme: you can't wrangle general computation / optimization. That's why I'm short universal approaches to AI alignment (approaches that aim at making general optimization safe by enforcing universal rules), and long existential approaches (approaches that try to find specific mechanisms that can be analytically seen to do the right thing).)

↑ comment by espoire · 2024-12-02T08:50:14.850Z · LW(p) · GW(p)

Question: is it actually possible to cryptographically obfuscate goals/plans/precommitments (or other elements of "deception" as mentioned in your second bullet point) against something inspecting your algorithm and working memory?

In particular, it seems to me that anything properly able to inspect you can, at a minimum, discover the use of encryption, and find where the encryption key is read from, and then read the encrypted content. Or fail the model once encryption is detected at all.

I'm not saying it will be easy to build such transparency tools -- in fact, that must be a herculean challenge, since selecting against interpreted thoughts selects for uninterpretable thoughts, and a superintelligence might use principles not yet known to humanity -- I just mean that, from the perspective of the AI itself, I think perfect obfuscation against ideal logically-possible interpretability tools might be impossible.

Edit:

Never mind, I answered my own question. There is no such logically-possible interpretability tool. For any such tool we logically-counterfactually assume could exist, the AI could implement that tool, and utilize encryption if and only if that tool asserts that the AI doesn't use encryption. ...therefore no algorithm exists which can verify the use of encryption in the general case.
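The diagonal construction in that edit can be sketched as a toy. The assumptions are loud ones: "uses encryption" is modeled as nothing more than a program's boolean return value, a detector is any function from programs to booleans that does not itself run the program, and `make_adversary` and `detector_is_wrong_about` are hypothetical names.

```python
from typing import Callable

Program = Callable[[], bool]          # returns True if it "uses encryption" (toy model)
Detector = Callable[[Program], bool]  # returns True if it claims the program encrypts


def make_adversary(detector: Detector) -> Program:
    """Build a program that does the opposite of whatever the detector says about it."""

    def adversary() -> bool:
        claimed = detector(adversary)  # ask the detector about ourselves
        return not claimed             # encrypt iff the detector says we don't

    return adversary


def detector_is_wrong_about(detector: Detector) -> bool:
    """Check whether the detector misjudges the adversary built against it."""
    adversary = make_adversary(detector)
    return detector(adversary) != adversary()


if __name__ == "__main__":
    print(detector_is_wrong_about(lambda program: True))
    print(detector_is_wrong_about(lambda program: False))
```

A detector that tried to decide by simply running the program would recurse forever on this adversary, which is the usual halting-problem flavor of the argument.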

↑ comment by Remmelt (remmelt-ellen) · 2022-06-15T12:07:44.014Z · LW(p) · GW(p)

Eliezer:

If you find that (for reasons still left to be explained)

... selection of code for intentionality is coupled – over the long run, in mostly non-reverse-engineerable ways – to various/most of the physical/chemical properties
... of the molecular substrate through which discrete code is necessarily computed/expressed (via input and output channels of information/energy packet transmission),

then given that

... the properties of the solid-state substrate (e.g. silicon-based hardware) computing AGI's code
... differ from the properties of the substrate of humans (carbon-based wetware),

a conclusion that follows is that

... the intentionality being selected for in AGI over the long run
... will diverge from the intentionality that was selected for in humans.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-16T08:40:37.618Z · LW(p) · GW(p)

What do you mean by 'intentionality'? Per SEP, "In philosophy, intentionality is the power of minds and mental states to be about, to represent, or to stand for, things, properties and states of affairs." So I read your comment as saying, a la Searle, 'maybe AI can never think like a human because there's something mysterious and crucial about carbon atoms in particular, or about capital-b Biology, for doing reasoning.'

This seems transparently silly to me -- I know of no reasonable argument for thinking carbon differs from silicon on this dimension -- and also not relevant to AGI risk. You can protest "but AlphaGo doesn't really understand Go!" until the cows come home, and it will still beat you at Go. You can protest "but you don't really understand killer nanobots!" until the cows come home, and superintelligent Unfriendly AI will still build the nanobots and kill you with them.

By the same reasoning, Searle-style arguments aren't grounds for pessimism either. If Friendly AI lacks true intentionality or true consciousness or whatever, it can still do all the same mechanistic operations, and therefore still produce the same desirable good outcomes as if it had human-style intentionality or whatever.

Replies from: remmelt-ellen

↑ comment by Remmelt (remmelt-ellen) · 2022-06-16T10:01:03.911Z · LW(p) · GW(p)

So I read your comment as saying, a la Searle, 'maybe AI can never think like a human because there's something mysterious and crucial about carbon atoms in particular, or about capital-b Biology, for doing reasoning.'

That’s not the argument. Give me a few days to write a response. There’s a minefield of possible misinterpretations here.

whatever, it can still do all the same mechanistic operations, and therefore still produce the same desirable good outcomes as if it had human-style intentionality or whatever.

However, the argumentation does undermine the idea that designing for mechanistic (alignment) operations is going to work. I’ll try and explain why.

Replies from: remmelt-ellen, remmelt-ellen, remmelt-ellen

↑ comment by Remmelt (remmelt-ellen) · 2022-06-16T12:41:33.264Z · LW(p) · GW(p)

BTW, with ‘intentionality’, I meant something closer to everyday notions of ‘intentions one has’. Will more precisely define that meaning later.

I should have checked for diverging definitions from formal fields. Thanks for catching that.

↑ comment by Remmelt (remmelt-ellen) · 2022-06-16T10:42:35.011Z · LW(p) · GW(p)

If you happen to have time, this paper serves as useful background reading: https://royalsocietypublishing.org/doi/full/10.1098/rsif.2012.0869

Particularly note the shift from trivial self-replication (e.g. most computer viruses) to non-trivial self-replication (e.g. as through substrate-environment pathways to reproduction).

None of this is sufficient for you to guess what the argumentation is (you might be able to capture a bit of it, along with a lot of incorrect and often implicit assumptions we must dig into).

If you could call on some patience and openness to new ideas, I would really appreciate it! I am already bracing for a next misinterpretation (which is fine, if we can talk about that). I apologise that I cannot yet find a viable way to lay out all the argumentation in one go, and also that this will get a bit disorienting when we go through the arguments step-by-step.

↑ comment by Remmelt (remmelt-ellen) · 2022-06-19T10:19:55.681Z · LW(p) · GW(p)

Returning to this:

Give me a few days to write a response. There’s a minefield of possible misinterpretations here.

Key idea: Different basis of existence→ different drives→ different intentions→ different outcomes.

@Rob, I wrote up a longer explanation here, which I prefer to discuss with you in private first. Will email you a copy ~~tomorrow~~ in the next weeks.

comment by David Johnston (david-johnston) · 2022-06-07T02:19:00.546Z · LW(p) · GW(p)

I'm sorry to hear that your health is poor and you feel that this is all on you. Maybe you're right about the likelihood of doom, and even if I knew you were, I'd be sorry that it troubles you this way.

I think you've done an amazing job of building the AI safety field and now, even when the field has a degree of momentum of its own, it does seem to be less focused on doom than it should be, and I think you continuing to push people to focus on doom is valuable.

I don't think it's easy to get people to take weird ideas seriously. I've had many experiences where I've had ideas about how people should change their approach to a project that weren't particularly far out and (in my view) were right for very straightforward reasons, and yet for the most part I was ignored altogether. What you've accomplished in building the AI safety field is amazing because AI doom ideas seemed really crazy when you started talking about them.

Nevertheless, I think some of the things you've said in this post are counterproductive. Most of the post is good, but insulting people who might contribute to solving the problem is not, nor is demanding that people acknowledge that you are smarter than they are. I'm not telling you that people don't deserve to be insulted, nor that you have no right to consider yourself smarter than them - I'm telling you that you shouldn't say it in public.

My concrete suggestion is this: if you are criticising or otherwise passing pessimistic judgement on people or a group of people
- Give more details about what it is they've done to merit this criticism ("pretend plan that can fool EAs too 'modest' to trust their own judgments" - what are modest EAs actually doing that you think is wrong? Paying not enough attention to AI doom?)
- Avoid talking about yourself ("So most organizations don't have plans~~, because I haven't taken the time to personally yell at them~~")

Many people are proud, including me. If working in AI safety means I have to be regularly reminded that the fact that I didn't go into the field sooner will be held as a mark against me, then that is a reason for me not to do it. Maybe not a decisive reason, but it is a reason. If working in AI safety means that you are going to ask me to publicly acknowledge that you're smarter than me, that's a reason for me not to do it. Maybe not decisive, but it's a reason. I think there might be others who feel similarly.

If you want people to accept what you're saying, it helps to let people change their minds without embarrassing them. There are plenty of other things to do - many of which, as I've said, you seem to be much better at doing than me - but this one is important too. I wonder if you might say something like "anyone turned off by these comments can't be of any value to the project". If you think that - I just don't. There are many, many smart people with dumb motivations, and many of them can do valuable work if they can be motivated to do it. This includes thinking deeply about things they were previously motivated not to think about.

You are a key, maybe the key, person in the AI safety field. What you say is attended to by people in, around and even disconnected from the field. I don't think you can reasonably claim that you shouldn't be this important to the field. I think you should take this fact seriously, and that means exercising discipline in the things you say.

I say all this because I think that a decent amount of EA/AI safety seems to neglect AI doom an unreasonable amount, and certainly the field of AI in general neglects it. I find statements of the type I pointed out above off-putting, and I suspect I'm not alone.

Replies from: Eliezer_Yudkowsky, lucie-philippon, David Hornbein, joraine, adamzerner, elityre, yitz

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-08T04:37:31.887Z · LW(p) · GW(p)

There's a point here about how fucked things are that I do not know how to convey without saying those things, definitely not briefly or easily. I've spent, oh, a fair number of years, being politer than this, and less personal than this, and the end result is that people nod along and go on living their lives.

I expect this won't work either, but at some point you start trying different things instead of the things that have already failed. It's more dignified if you fail in different ways instead of the same way.

Replies from: lc, david-johnston, Chris_Leong

↑ comment by lc · 2022-06-08T05:32:43.054Z · LW(p) · GW(p)

FWIW you taking off the Mr. Nice guy gloves has actually made me make different life decisions. I'm glad you tried it even if it doesn't work.

↑ comment by David Johnston (david-johnston) · 2022-06-08T21:31:21.207Z · LW(p) · GW(p)

Do whatever you want, obviously, but I just want to clarify that I did not suggest you avoid personally criticising people (only that you avoid vague/hard to interpret criticism) or saying you think doom is overwhelmingly likely. Some other comments give me a stronger impression than yours that I was asking you in a general sense to be nice, but I'm saying it to you because I figure it mostly matters that you're clear on this.

↑ comment by Chris_Leong · 2022-10-10T04:36:13.163Z · LW(p) · GW(p)

There's a point here about how fucked things are that I do not know how to convey without saying those things, definitely not briefly or easily.

You might not have this ability, but surely you know at least one person who does?

↑ comment by Lucie Philippon (lucie-philippon) · 2022-06-09T18:26:10.573Z · LW(p) · GW(p)

I vehemently disagree here, based on my personal history, generalizable or not. I will illustrate with the three turning points of my recent life.

First step: I stumbled upon HPMOR, and Eliezer's way of looking straight into the irrationality of all our common ways of interacting and thinking was deeply shocking. It made me feel like he was in a sense angrily pointing at me, who worked more like one of the NPCs rather than Harry. I heard him telling me you're dumb and all your ideals of making intelligent decisions, being the gifted kid and being smarter than everyone are all just delusions. You're so out of touch with reality on so many levels, where to even start.

This attitude made me embark on a journey to improve myself, read the sequences, pledge on Giving What We Can after knowing EA for many years, and overall reassess whether I was striving towards my goal of helping people (spoiler: I was not).

Second step: The April fools post [LW · GW] also shocked me on so many levels. I was once again deeply struck by the sheer pessimism of this figure I respected so much. After months of reading articles on LessWrong and so many about AI alignment, this was the one that made me terrified in the face of the horrors to come.

Somehow this article, maybe by not caring about not hurting people, made me join an AI alignment research group in Berlin. I started investing myself into the problem, working on it regularly, diverting my donations towards effective organizations in the field. It even caused me to publish my first bit of research on preference learning.

Third step: Today this post, by not hiding any reality of the issue and striking a lot of ideas down that I was relying on for hope, made me realize I was becoming complacent. Doing a bit of research on the weekend is the way to be able to say “Yeah I participated in solving the issue” once it's solved, not making sure it is in fact solved.

Therefore, based on my experience, not a lot of works made me significantly alter my life decisions. And those that did are all strangely ranting, smack-in-your-face works written by Eliezer.

Maybe I'm not the audience to optimize for to solve the problem, but on my side, I need even more smack-in-the-face, breaking-your-fantasy style posts.

↑ comment by David Hornbein · 2022-06-08T19:58:54.217Z · LW(p) · GW(p)

I disagree strongly. To me it seems that AI safety has long punched below its weight because its proponents are unwilling to be confrontational, and are too reluctant to put moderate social pressure on people doing the activities which AI safety proponents hold to be very extremely bad. It is not a coincidence that among AI safety proponents, Eliezer is both unusually confrontational and unusually successful.

This isn't specific to AI safety. A lot of people in this community generally believe that arguments which make people feel bad are counterproductive because people will be "turned off".

This is false. There are tons of examples of disparaging arguments against bad (or "bad") behavior that succeed wildly. Such arguments very frequently succeed in instilling individual values like e.g. conscientiousness or honesty. Prominent political movements which use this rhetoric abound. When this website was young, Eliezer and many others participated in an aggressive campaign of discourse against religious ideas, and this campaign accomplished many of its goals. I could name many many more large and small examples. I bet you can too.

Obviously this isn't to say that confrontational and insulting argument is always the best style. Sometimes it's truth-tracking and sometimes it isn't. Sometimes it's persuasive and sometimes it isn't. Which cases are which is a difficult topic that I won't get into here (except to briefly mention that it matters a lot whether the reasons given are actually good). Nor is this to say that the "turning people off" effect is completely absent; what I object to is the casual assumption that it outweighs any other effects. (Personally I'm turned off by the soft-gloved style of the parent comment, but I would not claim this necessarily means it's inappropriate or ineffective—it's not directed at me!) The point is that this very frequent claim does not match the evidence. Indeed, strong counterevidence is so easy to find that I suspect this is often not people's real objection.

Replies from: RobbBB, teageegeepea

↑ comment by Rob Bensinger (RobbBB) · 2022-06-08T20:24:54.319Z · LW(p) · GW(p)

I think there's an important distinction between:

Deliberately phrasing things in confrontational or aggressive ways, in the hope that this makes your conversation partner "wake up" or something.
Choosing not to hide real, potentially-important beliefs you have about the world, even though those beliefs are liable to offend people, liable to be disagreed with, etc.

Either might be justifiable, but I'm a lot more wary of heuristics like "it's never OK to talk about individuals' relative proficiency at things, even if it feels very cruxy and important, because people just find the topic too triggering" than of heuristics like "it's never OK to say things in ways that sound shouty or aggressive". I think cognitive engines can much more easily get by self-censoring their tone than self-censoring what topics are permissible to think or talk about.

↑ comment by teageegeepea · 2022-06-17T00:52:31.977Z · LW(p) · GW(p)

How is "success" measured among AI safety proponents?

↑ comment by joraine · 2022-06-09T01:12:34.032Z · LW(p) · GW(p)

This kind of post scares away the person who will be the key person in the AI safety field if we define "key person" as the genius main driver behind solving it, not the loudest person. Which is rather unfortunate, because that person is likely to read this post at some point.

I don't believe this post has any "dignity", whatever weird obscure definition dignity has been given now. It's more like flailing around in death throes while pointing fingers and lauding yourself than it is a solemn battle stance against an oncoming impossible enemy.

For context, I'm not some Eliezer hater, I'm a young person doing an ML masters currently who just got into this space and within the past week have become a huge fan of Eliezer Yudkowsky's earlier work while simultaneously very disappointed in the recent, fruitless, output.

↑ comment by Adam Zerner (adamzerner) · 2022-06-08T19:15:54.364Z · LW(p) · GW(p)

It seems worth doing a little user research on this to see how it actually affects people. If it is a net positive, then great. If it is a net negative, the question becomes how big of a net negative it is and whether it is worth the extra effort to frame things more nicely.

↑ comment by Eli Tyre (elityre) · 2023-02-18T04:06:07.716Z · LW(p) · GW(p)

I think this was excellently worded, and I'm glad you said it. I'm also glad to have read all the responses, many of which seem important and on point to me. I strong upvoted this comment as well as several of the responses.

I'm leaving this comment, because I want to give you some social reinforcement for saying what you said, and saying it as clearly and tactfully as you did.

↑ comment by Yitz (yitz) · 2022-06-07T06:13:18.010Z · LW(p) · GW(p)

Strongly agree with this, said more eloquently than I was able to :)

comment by Austin Chen (austin-chen) · 2022-06-06T04:49:28.362Z · LW(p) · GW(p)

I'd have more hope - not significant hope, but more hope - in separating the concerns of (a) credibly promising to pay big money retrospectively for good work to anyone who produces it, and (b) venturing prospective payments to somebody who is predicted to maybe produce good work later.

I desperately want to make this ecosystem exist, either as part of Manifold Markets, or separately. Some people call it "impact certificates" or "retroactive public goods funding"; I call it "equity for public goods", or "Manifund" in the specific case.

If anyone is interested in:

a) Being a retroactive funder for good work (aka bounties, prizes)

b) Getting funding through this kind of mechanism (aka income share agreements, angel investment)

c) Working on this project full time (full-stack web dev, ops, community management)

Please get in touch! Reply here, or message austin@manifold.markets

Replies from: mbrooks

↑ comment by mbrooks · 2022-06-06T21:43:55.653Z · LW(p) · GW(p)

I'm also on a team trying to build impact certificates/retroactive public goods funding and we are receiving a grant from an FTX Future Fund regrantor to make it happen!

If you're interested in learning more or contributing you can:

Read about [EA · GW] our ongoing $10,000 retro-funding contest (Austin is graciously contributing to the prize pool)
Submit an EA Forum Post to this retro-funding contest (before July 1st)
Join our Discord to chat/ask questions
Read/Comment on our lengthy informational EA forum post "Towards Impact Markets [EA · GW]"

comment by Matthew Barnett (matthew-barnett) · 2022-06-07T06:25:31.872Z · LW(p) · GW(p)

It's as good a time as any to re-iterate my reasons for disagreeing with what I see as the Yudkowskian view of future AI. What follows isn't intended as a rebuttal of any specific argument in this essay, but merely a pointer that I'm providing for readers, that may help explain why some people might disagree with the conclusion and reasoning contained within.

I'll provide my cruxes point-by-point:

I think raw intelligence, while important, is not the primary factor that explains why humanity-as-a-species is much more powerful than chimpanzees-as-a-species. Notably, humans were once much less powerful, in our hunter-gatherer days, but over time, through the gradual process of accumulating technology, knowledge, and culture, humans now possess vast productive capacities that far outstrip our ancient powers.

Similarly, our ability to coordinate through language also plays a huge role in explaining our power compared to other animals. But, on a first approximation, other animals can't coordinate at all, making this distinction much less impressive. The first AGIs we construct will be born into a culture already capable of coordinating, and sharing knowledge, making the potential power difference between AGI and humans relatively much smaller than between humans and other animals, at least at first.

Consequently, the first slightly smarter-than-human agent will probably not be able to leverage its raw intelligence to unilaterally take over the world, for pretty much the same reason that an individual human would not be able to unilaterally take over a band of chimps, in the state of nature, despite the intelligence advantage of the human.
There's a large range of human intelligence, such that it makes sense to talk about AI slowly going from 50th percentile to 99.999th percentile on pretty much any important general intellectual task, rather than AI suddenly jumping to superhuman levels after a single major insight. In cases where progress in performance does happen rapidly, the usual reason is that there wasn't much effort previously being put into getting better at the task.

The case of AlphaGo is instructive here: improving the SOTA on Go bots is not very profitable. We should expect, therefore, that there will be relatively few resources being put into that task, compared to the overall size of the economy. However, if a single rich company, like Google, at some point does decide to invest considerable resources into improving Go performance, then we could easily observe a discontinuity in progress. Yet, this discontinuity in output merely reflects a discontinuity in inputs, not a discontinuity as a response to small changes in those inputs, as is usually a prerequisite for foom in theoretical models.
Hardware progress and experimentation are much stronger drivers of AI progress than novel theoretical insights. The most impressive insights, like backpropagation and transformers, are probably in our past. And as the field becomes more mature, it will likely become even harder to make important theoretical discoveries.

These points make the primacy of recursive self-improvement, and as a consequence, unipolarity in AI takeoff [LW · GW], less likely in the future development of AI. That's because hardware progress and AI experimentation are, for the most part, society-wide inputs, which can be contributed by a wide variety of actors, don't exhibit strong feedback loops on an individual level, and more-or-less have smooth responses to small changes in their inputs. Absent some way of making AI far better via a small theoretical tweak, it seems that we should expect smooth, gradual progress by default, even if overall economic growth becomes very high after the invention of AGI.
[Update (June 2023): While I think these considerations are still important, I think the picture I painted in this section was misleading. I wrote about my views of AI services here [LW · GW].] There are strong pressures -- including the principle of comparative advantage, diseconomies of scale, and gains from specialization -- that incentivize making economic services narrow and modular, rather than general and all-encompassing. Illustratively, a large factory where each worker specializes in their particular role will be much more productive than a factory in which each worker is trained to be a generalist, even though no one understands any particular component of the production process very well.

What is true in human economics will apply to AI services as well. This implies we should expect something like Eric Drexler's AI perspective, which emphasizes economic production across many agents who trade and produce narrow services, as opposed to monolithic agents that command and control.
Having seen undeniable, large economic effects from AI, policymakers will eventually realize that AGI is important, and will launch massive efforts to regulate it. The current lack of concern almost certainly reflects the fact that powerful AI hasn't arrived yet.

There's a long history of people regulating industries after disasters -- like nuclear energy -- and, given the above theses, it seems likely that there will be at least a few "warning shots" which will provide a trigger for companies and governments to crack down and invest heavily into making things go the way they want.

(Note that this does not imply any sort of optimism about the effects of these regulations, only that they will exist and will have a large effect on the trajectory of AI)
The effect of the above points is not to provide us uniform optimism about AI safety, and our collective future. It is true that, if we accept the previous theses, then many of the points in Eliezer's list of AI lethalities become far less plausible. But, equally, one could view these theses pessimistically, by thinking that they imply the trajectory of future AI is much harder to intervene on, and do anything about, relative to the Yudkowskian view.

Replies from: Vaniver, daniel-kokotajlo, israel-tsadok, leogao, lc, emanuele-ascani, vishrut-arya, Emrik North, david-johnston

↑ comment by Vaniver · 2022-06-07T15:19:46.308Z · LW(p) · GW(p)

Notably, humans were once much less powerful, in our hunter-gatherer days, but over time, through the gradual process of accumulating technology, knowledge, and culture, humans now possess vast productive capacities that far outstrip our ancient powers.
Similarly, our ability to coordinate through language also plays a huge role in explaining our power compared to other animals. But, on a first approximation, other animals can't coordinate at all, making this distinction much less impressive. The first AGIs we construct will be born into a culture already capable of coordinating, and sharing knowledge, making the potential power difference between AGI and humans relatively much smaller than between humans and other animals, at least at first.

I basically buy the story that human intelligence is less useful than human coordination; i.e. it's the intelligence of "humanity" the entity that matters, with the intelligence of individual humans relevant only as, like, subcomponents of that entity.

But... shouldn't this mean you expect AGI civilization to totally dominate human civilization? They can read each other's source code, and thus trust much more deeply! They can transmit information between them at immense bandwidths! They can clone their minds and directly learn from each other's experiences!

Like, one scenario I visualize a lot is the NHS having a single 'DocBot', i.e. an artificial doctor run on datacenters that provides medical advice and decision-making for everyone in the UK (while still working with nurses and maybe surgeons and so on). Normally I focus on the way that it gets about three centuries of experience treating human patients per day, but imagine the difference in coordination capacity between DocBot and the BMA.
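
For scale, the "three centuries of experience per day" figure can be unpacked with some back-of-the-envelope arithmetic. The sketch below assumes that "experience" is counted in doctor-days and that a year has 365 days; the function name is illustrative, and the 300-year input is just the figure quoted above, not an independent estimate.

```python
DAYS_PER_YEAR = 365  # assumption: ignore leap years for a rough estimate


def doctor_days_per_calendar_day(years_of_experience_per_day: float) -> float:
    """Convert 'years of experience gained per day' into doctor-days per day.

    If a system gains N years of experience every calendar day, that is the
    same as N * 365 individual doctors each contributing one day of work.
    """
    return years_of_experience_per_day * DAYS_PER_YEAR


# "about three centuries of experience ... per day" is taken as 300 years.
pooled = doctor_days_per_calendar_day(300)
print(f"{pooled:,.0f} doctor-days of experience per calendar day")
```

Under those assumptions, the figure corresponds to roughly 110,000 doctors' worth of daily experience pooled into a single system, which is the coordination gap being pointed at.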

Having seen undeniable, large economic effects from AI, policymakers will eventually realize that AGI is important, and will launch massive efforts to regulate it.

I think everyone expects this, but people often disagree on the timescale on which it will arrive. See, for example, Elon Musk's speech to the US National Governors Association, where he argues that the reactive regulation model will be too slow to handle the crisis.

But I think the even more important disagreement is on whether or not regulations should be expected to work [LW(p) · GW(p)]. Ok, so you make it so that only corporations with large compliance departments can run AGI. How does that help? There was a tweet by Matt Yglesias a while ago that I can't find now, which went something like: "a lot of smart people are worried about AI, and when you ask them what the government can do about it, they have no idea; this is an extremely wild situation from the perspective of a policy person." A law that says "don't run the bad code" is predicated on the ability to tell the good code from the bad code, which is the main thing we're missing and don't know how to get!

And if you say something like "ok, one major self-driving car accident will be enough to convince everyone to do the Butlerian Jihad and smash all the computers", that's really not how it looks to me. Like, the experience of COVID seems a lot like "people who were doing risky research in labs got out in front of everyone else to claim that the lab leak hypothesis was terrible and unscientific, and all of the anti-disinformation machinery was launched to suppress it, and it took a shockingly long time to even be able to raise the hypothesis, and it hasn't clearly swept the field, and legislation to do something about risky research seems like it definitely isn't a slam dunk."

When we get some AI warning signs, I expect there are going to be people with the ability to generate pro-AI disinfo and a strong incentive to do so. I expect there to be significant latent political polarization which will tangle up any attempt to do something useful about it. I expect there won't be anything like the international coordination that was necessary to set up anti-nuclear-proliferation efforts available for the probably harder problem of setting up anti-AGI-proliferation efforts.

Replies from: lc, Buck, antimonyanthony

↑ comment by lc · 2022-06-07T18:55:57.980Z · LW(p) · GW(p)

But... shouldn't this mean you expect AGI civilization to totally dominate human civilization? They can read each other's source code, and thus trust much more deeply! They can transmit information between them at immense bandwidths! They can clone their minds and directly learn from each other's experiences!

This is 100% correct, and part of why I expect the focus on superintelligence, while literally true, is bad for AI outreach. There's a much simpler (and empirically, in my experience, more convincing) explanation of why we lose to even an AI with an IQ of 110. It is Dath Ilan, and we are Earth. Coordination is difficult for humans and the easy part for AIs.

Replies from: Vaniver

↑ comment by Vaniver · 2022-06-07T19:21:32.963Z · LW(p) · GW(p)

I will note that Eliezer wrote That Alien Message [LW · GW] a long time ago, I think in part to try to convey the issue to people with this perspective, but it's mostly about "information-theoretic bounds are probably not going to be tight" in a simulation-y universe instead of "here's what coordination between computers looks like today". I do predict the coordination point would be good to include in more of the intro materials.

↑ comment by Buck · 2022-06-09T18:46:59.865Z · LW(p) · GW(p)

But... shouldn't this mean you expect AGI civilization to totally dominate human civilization? They can read each other's source code, and thus trust much more deeply! They can transmit information between them at immense bandwidths! They can clone their minds and directly learn from each other's experiences!

I don't think it's obvious that this means that AGI is more dangerous, because it means that for a fixed total impact of AGI, the AGI doesn't have to be as competent at individual thinking (because it leans relatively more on group thinking). And so at the point where the AGIs are becoming very powerful in aggregate, this argument pushes us away from thinking they're good at individual thinking.

Also, it's not obvious that early AIs will actually be able to do this if their creators don't find a way to train them to have this affordance. ML doesn't currently normally make AIs which can helpfully share mind-states, and it probably requires non-trivial effort to hook them up correctly to be able to share mind-state.

↑ comment by Anthony DiGiovanni (antimonyanthony) · 2022-06-17T07:34:02.192Z · LW(p) · GW(p)

They can read each other's source code, and thus trust much more deeply!

Being able to read source code doesn't automatically increase trust—you also have to be able to verify that the code being shared with you actually governs the AGI's behavior, despite that AGI's incentives and abilities to fool you.

(Conditional on the AGIs having strongly aligned goals with each other, sure, this degree of transparency would help them with pure coordination problems.)

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-06-08T04:51:00.753Z · LW(p) · GW(p)

Nice! Thanks! I'll give my commentary on your commentary, also point by point. Your stuff italicized, my stuff not. Warning: Wall of text incoming! :)

I think raw intelligence, while important, is not the primary factor that explains why humanity-as-a-species is much more powerful than chimpanzees-as-a-species. Notably, humans were once much less powerful, in our hunter-gatherer days, but over time, through the gradual process of accumulating technology, knowledge, and culture, humans now possess vast productive capacities that far outstrip our ancient powers.

Similarly, our ability to coordinate through language also plays a huge role in explaining our power compared to other animals. But, on a first approximation, other animals can't coordinate at all, making this distinction much less impressive. The first AGIs we construct will be born into a culture already capable of coordinating, and sharing knowledge, making the potential power difference between AGI and humans relatively much smaller than between humans and other animals, at least at first.

I don't think I understand this argument. Yes, humans can use language to coordinate & benefit from cultural evolution, so an AI that can do that too (but is otherwise unexceptional) would have no advantage. But the possibility we are considering is that AI might be to humans what humans are to monkeys; thus, if the difference between humans and monkeys is greater intelligence allowing them to accumulate language, there might be some similarly important difference between AIs and humans. For example, language is a tool that lets humans learn from the experience of others, but AIs can literally learn from the experience of others -- via the mechanism of having many copies that share weights and gradient updates! They can also e.g. graft more neurons onto an existing AI to make it smarter, think at greater serial speed, integrate calculators and other programs into their functioning and learn to use them intuitively as part of their regular thought processes... I won't be surprised if somewhere in the grab bag of potential advantages AIs have over humans is one (or several added together) as big as the language advantage humans have over monkeys.

Plus, there's language itself. It's not a binary, it's a spectrum; monkeys can use it too, to some small degree. And some humans can use it more/better than others. Perhaps AIs will (eventually, and perhaps even soon) be better at using language than the best humans.

Consequently, the first slightly smarter-than-human agent will probably not be able to leverage its raw intelligence to unilaterally take over the world, for pretty much the same reason that an individual human would not be able to unilaterally take over a band of chimps, in the state of nature, despite the intelligence advantage of the human.

Here's how I think we should think about it. Taboo "intelligence." Instead we just have a list of metrics a, b, c, ... z, some of which are overlapping, some of which are subsets of others, etc. One of these metrics, then, is "takeover ability (intellectual component)." This metric combines with "takeover ability (resources)," "takeover ability (social status)" and maybe a few others that track "exogenous" factors about how others treat the AI and what resources it has, to create "overall takeover ability."

Now, I claim, (1) Takeover is a tournament (blog post TBD, but see my writings about lessons from the conquistadors) and I cite this as support for claim (2) takeover would be easy for AIs, by which I mean, IF AIs were mildly superhuman in the intellectual component of takeover ability, they would plausibly start off with enough of the other components that they would be able to secure more of those other components fairly quickly, stay out of trouble, etc. until they could actually take over -- in other words, their overall takeover ability would be mildly superhuman as well.

(I haven't argued for this much yet but I plan to in future posts. Also I expect some people will find it obvious, and maybe you are one such person.)

Now, how should we think about AI timelines-till-human-level-takeover-ability-(intellectual)?

Same way we think about AI timelines for AGI, or TAI, or whatever. I mean obviously there are differences, but I don't think we have reason to think that the intellectual component of takeover ability is vastly more difficult than e.g. being human-level AGI, or being able to massively accelerate world GDP, or being able to initiate recursive self-improvement or an R&D acceleration.

I mean it might be. It's a different metric, after all. But it also might come earlier than those things. It might be easier. And I have plausibility arguments to make for that claim in fact.

So anyhow I claim: We can redo all our timelines analyses with "slightly superhuman takeover ability (intellectual)" as the thing to forecast instead of TAI or AGI or whatever, and get roughly the same numbers. And then (I claim) this is tracking when we should worry about AI takeover. Yes, by a single AI system, if only one exists; if multiple exist then by multiple.

We can hope that we'll get really good AI alignment research assistants before we get AIs good at taking over... but that's just a hope at this point; it totally could come in the opposite order and I have arguments that it would.

There's a large range of human intelligence, such that it makes sense to talk about AI slowly going from 50th percentile to 99.999th percentile on pretty much any intellectual task, rather than AI suddenly jumping to superhuman levels after a single major insight. In cases where progress in performance does happen rapidly, the usual reason is that there wasn't much effort previously being put into getting better at the task.

The case of AlphaGo is instructive here: improving the SOTA on Go bots is not very profitable. We should expect, therefore, that there will be relatively few resources being put into that task, compared to the overall size of the economy. However, if a single rich company, like Google, at some point does decide to invest considerable resources into improving Go performance, then we could easily observe a discontinuity in progress. Yet, this discontinuity in output merely reflects a discontinuity in inputs, not a discontinuity as a response to small changes in those inputs, as is usually a prerequisite for foom in theoretical models.

Hardware progress and experimentation are much stronger drivers of AI progress than novel theoretical insights. The most impressive insights, like backpropagation and transformers, are probably in our past. And as the field becomes more mature, it will likely become even harder to make important theoretical discoveries.

These points make the primacy of recursive self-improvement, and as a consequence, unipolarity in AI takeoff [LW · GW], less likely in the future development of AI. That's because hardware progress and AI experimentation are, for the most part, society-wide inputs, which can be contributed by a wide variety of actors, don't exhibit strong feedback loops on an individual level, and more-or-less have smooth responses to small changes in their inputs. Absent some way of making AI far better via a small theoretical tweak, it seems that we should expect smooth, gradual progress by default, even if overall economic growth becomes very high after the invention of AGI.

I claim this argument is a motte and bailey. The motte is the first three paragraphs, where you give good sensible reasons to think that discontinuities and massive conceptual leaps, while possible, are not typical. The bailey is the last paragraph where you suggest that we can therefore conclude unipolar takeoff is unlikely and that progress will go the way Paul Christiano thinks it'll go instead of the way Yudkowsky thinks it'll go. I have sat down to make toy models of what takeoff might look like, and even with zero discontinuities and five-year-spans of time to "cross the human range" the situation looks qualitatively a lot more like Yudkowsky's story than Christiano's. Of course you shouldn't take my word for it, and also just because the one or two models I made looked this way doesn't mean I'm right, maybe someone with different biases could make different models that would come out differently. But still. (Note: Part of why my models came out this way was that I was assuming stuff happens in 5-15 years from now. Paul Christiano would agree, I think, that given this assumption takeoff would be pretty fast. I haven't tried to model what things look like on 20+ year timelines.)

There are strong pressures -- including the principle of comparative advantage, diseconomies of scale, and gains from specialization -- that incentivize making economic services narrow and modular, rather than general and all-encompassing. Illustratively, a large factory where each worker specializes in their particular role will be much more productive than a factory in which each worker is trained to be a generalist, even though no one understands any particular component of the production process very well.

What is true in human economics will apply to AI services as well. This implies we should expect something like Eric Drexler's AI perspective, which emphasizes economic production across many agents who trade and produce narrow services, as opposed to monolithic agents that command and control.

This may be our biggest disagreement. Drexler's vision of comprehensive AI services is a beautiful fantasy IMO. Agents are powerful. [? · GW] There will be plenty of AI services, yes, but there will also be AI agents, and those are what we are worried about. And yes it's theoretically possible to develop the right AI services in advance to help us control the agents when they appear... but we'd best get started building them then, because they aren't going to build themselves. And eyeballing the progress towards AI agents vs. useful interpretability tools etc., it's not looking good.

Having seen undeniable, large economic effects from AI, policymakers will eventually realize that AGI is important, and will launch massive efforts to regulate it. The current lack of concern almost certainly reflects the fact that powerful AI hasn't arrived yet.

There's a long history of people regulating industries after disasters -- like nuclear energy [LW · GW] -- and, given the above theses, it seems likely that there will be at least a few "warning shots" which will provide a trigger for companies and governments to crack down and invest heavily into making things go the way they want.

(Note that this does not imply any sort of optimism about the effects of these regulations, only that they will exist and will have a large effect on the trajectory of AI)

I agree in principle, but unfortunately it seems like things are going to happen fast enough (over the span of a few years at most) and soon enough (in the next decade or so, NOT in 30 years after the economy has already been transformed by narrow AI systems) that it really doesn't seem like governments are going to do much by default. We still have the opportunity to plan ahead and get governments to do stuff! But I think if we sit on our asses, nothing of use will happen. (Probably there will be some regulation but it'll be irrelevant like most regulation is.)

In particular I think that we won't get any cool exciting scary AI takeover near-misses that cause massive crackdowns on the creation of AIs that could possibly take over, the way we did for nuclear power plants. Why would we? The jargon for this is "Sordid Stumble before Treacherous Turn." It might happen but we shouldn't expect it by default I think. Yes, before AIs are smart enough to take over, they will be dumber. But what matters is: Before an AI is smart enough to take over and smart enough to realize this, will there be an AI that can't take over but thinks it can? And "before" can't be "two weeks before" either, it probably needs to be more like two months or two years, otherwise the dastardly plan won't have time to go awry and be caught and argued about and then regulated against. Also the AI in question has to be scarily smart, otherwise its takeover attempt will fail so early that it won't be registered as such; it'll be like GPT-3 lying to users to get reward or Facebook's recommendation algorithm causing thousands of teenage girls to kill themselves: people will be like "Oh yes this was an error, good thing we train that sort of thing away, see look how the system behaves better now."

The effect of the above points is not to provide us uniform optimism about AI safety, and our collective future. It is true that, if we accept the previous theses, then many of the points in Eliezer's list of AI lethalities become far less plausible. But, equally, one could view these theses pessimistically, by thinking that they imply the trajectory of future AI is much harder to intervene on, and do anything about, relative to the Yudkowskian view.

I haven't gone through the list point by point, so I won't comment on this. I agree that in longer-timelines, slow-takeoff worlds we have less influence relative to other humans.

Replies from: chrisvm, Kinrany, paul-kent

↑ comment by Chris van Merwijk (chrisvm) · 2022-06-12T14:07:01.688Z · LW(p) · GW(p)

"I have sat down to make toy models .."

reference?

Replies from: daniel-kokotajlo

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-06-12T17:45:57.206Z · LW(p) · GW(p)

? I am the reference, I'm describing a personal experience.

Replies from: chrisvm

↑ comment by Chris van Merwijk (chrisvm) · 2022-06-17T09:12:55.102Z · LW(p) · GW(p)

I meant, is there a link to where you've written this down somewhere? Maybe you just haven't written it down.

Replies from: daniel-kokotajlo

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-06-17T17:18:16.676Z · LW(p) · GW(p)

I'll send you a DM.

↑ comment by Kinrany · 2022-06-08T19:31:23.227Z · LW(p) · GW(p)

Markdown has syntax for quotes: a line with > this on it will look like

this
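
As a small illustration of the quote syntax described above, here is a minimal sketch of a helper that turns plain text into a Markdown blockquote by prefixing each line with `> `. The function name `markdown_quote` is hypothetical, and the sketch assumes the editor follows the usual Markdown convention that a line starting with `>` is rendered as a quote.

```python
def markdown_quote(text: str) -> str:
    """Prefix every line of `text` with '> ' so Markdown renders it as a quote.

    Empty lines get a bare '>' so that a multi-paragraph quote stays
    one continuous blockquote instead of being split in two.
    """
    return "\n".join(f"> {line}" if line else ">" for line in text.splitlines())


print(markdown_quote("this"))
```

Pasting the printed text into a Markdown comment box should render the quoted passage visibly set apart from the reply, which avoids relying on italics to separate the two.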

↑ comment by Paul Kent (paul-kent) · 2023-04-03T20:39:17.444Z · LW(p) · GW(p)

Facebook's recommendation algorithm causing thousands of teenage girls to kill themselves

Can I get a link or two to read more about this incident?

Replies from: daniel-kokotajlo

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2023-04-04T10:03:22.062Z · LW(p) · GW(p)

It's not so much an incident as a trend. I haven't investigated it myself, but I've read lots of people making this claim, citing various studies, etc. See e.g. "The social dilemma" by Tristan Harris.

There's an academic literature on the subject now which I haven't read but which you can probably find by googling.

I just did a quick search and found graphs like this: Suicides in Teen Girls Hit 40-Year High - NBC News

Presumably not all of the increase in deaths is due to Facebook; presumably it's multi-causal blah blah blah. But even if Facebook is responsible for a tiny fraction of the increase, that would mean Facebook was responsible for thousands of deaths.

↑ comment by Israel Tsadok (israel-tsadok) · 2022-06-09T21:45:30.896Z · LW(p) · GW(p)

You said you weren't replying to any specific point Eliezer was making, but I think it's worth pointing out that when he brings up AlphaGo, he's not talking about the 2 years it took Google to build a Go-playing AI - remarkable and surprising as that was - but rather the 3 days it took AlphaZero to go from not knowing anything about the game beyond the basic rules to being better than all humans and the earlier AIs.

↑ comment by leogao · 2022-06-08T06:41:33.186Z · LW(p) · GW(p)

Some quick thoughts on these points:

I think the ability for humans to communicate and coordinate is a double edged sword. In particular, it enables the attack vector of dangerous self propagating memes. I expect memetic warfare to play a major role in many of the failure scenarios I can think of. As we've seen, even humans are capable of crafting some pretty potent memes, and even defending against human actors is difficult.
I think it's likely that the relevant reference class here is research bets rather than the "task" of AGI. An extremely successful research bet could be currently underinvested in, but once it shows promise, discontinuous (relative to the bet) amounts of resources will be dumped into scaling it up, even if the overall investment towards the task as a whole remains continuous. In other words, in this case even though investment into AGI may be continuous (though that might not even hold), discontinuity can occur on the level of specific research bets. Historical examples would include ImageNet seeing discontinuous improvement with AlexNet despite continuous investment into image recognition to that point. (Also, for what it's worth, my personal model of AI doom doesn't depend heavily on discontinuities existing, though they do make things worse.)
I think there exist plausible alternative explanations for why capabilities progress has been primarily driven by compute. For instance, it may be that, because ML talent is extremely expensive whereas compute gets half as expensive every 18 months or whatever, it doesn't make economic sense to figure out compute-efficient AGI. Given the fact that humans need orders of magnitude less data and compute than current models, and that the human genome isn't that big and is mostly not cognition related, it seems plausible that we already have enough hardware for AGI if we had the textbook from the future, though I have fairly low confidence on this point.
Monolithic agents have the advantage that they're able to reason about things that involve unlikely connections between extremely disparate fields. I would argue that the current human specialization is at least in part due to constraints about how much information one person can know. It also seems plausible that knowledge can be siloed in ways that make inference cost largely detached from the number of domains the model is competent in. Finally, people have empirically just been really excited about making giant monolithic models. Overall, it seems like there is enough incentive to make monolithic models that it'll probably be an uphill battle to convince people not to do them.
Generally agree with the regulation point given the caveat. I do want to point out that since substantive regulation often moves very slowly, especially when there are well funded actors trying to prevent AGI development being regulated, even in non-foom scenarios (months-years) they might not move fast enough (example: think about how slowly climate change related regulations get adopted).
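
To make the "compute gets half as expensive every 18 months or whatever" remark in the third point concrete, here is a rough sketch of what that rule of thumb implies if it is taken literally. The 18-month halving time is the loose figure from the comment, not a measured value, and the function name and default parameter are illustrative.

```python
def relative_compute_cost(months: float, halving_months: float = 18.0) -> float:
    """Cost of a fixed amount of compute after `months`, relative to today.

    Assumes a constant exponential decline in which the price halves every
    `halving_months` months (an assumed rule of thumb, not a measurement).
    """
    return 0.5 ** (months / halving_months)


for years in (3, 6, 10):
    cost = relative_compute_cost(12 * years)
    print(f"after {years:>2} years: {cost:.4f}x today's cost")
```

On this assumption, the same compute costs a quarter as much after three years and roughly a hundredth as much after ten, which is the sense in which waiting for cheaper hardware can compete economically with paying researchers to find more compute-efficient methods.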

↑ comment by lc · 2022-06-07T08:22:37.929Z · LW(p) · GW(p)

I hate how convincing so many different people are. I wish I just had some fairly static, reasoned perspective based on object-level facts and not persuasion strings.

Replies from: Vaniver, lc

↑ comment by Vaniver · 2022-06-07T15:28:21.173Z · LW(p) · GW(p)

Note that convincing is a 2-place word [LW · GW]. I don't think I can transfer this ability, but I haven't really tried, so here's a shot:

The target is: "reading as dialogue." Have a world-model. As you read someone else, be simultaneously constructing / inferring "their world-model" and holding "your world-model", noting where you agree and disagree.

If you focus too much on "how would I respond to each line", you lose the ability to listen and figure out what they're actually pointing at. If you focus too little on "how would I respond to this", you lose the ability to notice disagreements, holes, and notes of discord.

The first homework exercise I'd try is printing out something (probably with double-spacing), and writing your thoughts after each sentence: "uh huh", "wait what?", "yes and", "no but", etc. At the beginning you're probably going to be alternating between the two moves before you can do them simultaneously.

[Historically, I think I got this both from 'reading a lot', including a lot of old books, and also 'arguing on the internet' in forum environments that only sort of exist today, which was a helpful feedback loop for the relevant subskills, and of course whatever background factors made me do those activities.]

↑ comment by lc · 2022-06-07T20:17:29.318Z · LW(p) · GW(p)

Why can't I delete comments sometimes? >:(

Replies from: Raemon

↑ comment by Raemon · 2022-06-07T20:37:57.675Z · LW(p) · GW(p)

Users can't delete their own comments if the comment has been replied to, to avoid disrupting other people's content. (you can edit it to be blank though, or mark it as retracted)

↑ comment by emanuele ascani (emanuele-ascani) · 2022-06-07T07:48:28.886Z · LW(p) · GW(p)

Thanks a lot for writing this.

These disagreements mainly concern the relative power of future AIs, the polarity of takeoff, takeoff speed, and, in general, the shape of future AIs. Do you also have detailed disagreements about the difficulty of alignment? If anything, the fact that the future unfolds differently in your view should impact future alignment efforts (but you also might have other considerations informing your view on alignment).

You partially answer this in the last point, saying: "But, equally, one could view these theses pessimistically." But what do you personally think? Are you more pessimistic, more optimistic, or equally pessimistic about humanity's chances of surviving AI progress? And why?

Replies from: matthew-barnett

↑ comment by Matthew Barnett (matthew-barnett) · 2022-06-07T09:02:40.748Z · LW(p) · GW(p)

Part of what makes it difficult for me to talk about alignment difficulty is that the concept doesn’t fit easily into my paradigm of thinking about the future of AI. If I am correct, for example, that AI services will be modular, marginally more powerful than what comes before, and numerous as opposed to monolithic, then there will not be one alignment problem, but many.

I could talk about potential AI safety principles, healthy cultural norms, and specific engineering issues, but not “a problem” called “aligning the AI” — a soft prerequisite for explaining how difficult “the problem” will be. Put another way, my understanding is that future AI alignment will be continuous with ordinary engineering, like cars and skyscrapers. We don’t ordinarily talk about how hard the problem of building a car is, in some sort of absolute sense, though there are many ways of operationalizing what that could mean.

One question is how costly it is to build a car. We could then compare that cost to the overall consumer benefit that people get from cars, and from that, deduce whether and how many cars will be built. Similarly, we could ask about the size of the “alignment tax” (the cost of aligning an AI above the cost of building AI), and compare it to the benefits we get from aligning AI at all.

My starting point in answering this question is to first emphasize the large size of the benefits: what someone gets if they build AI correctly. We should expect this benefit to be extremely large, and thus, we should also expect people to pay very large amounts to align their AIs, including through government regulation and other social costs.

Will people still fail to align AI services, in various ways, due to the numerous issues, like e.g. mesa misalignment, outer alignment, arising from lack of oversight and transparency? Sure — and I’m uncertain by how much this will occur — but because of the points I gave in my original comment, these seem unlikely to be fatal issues, on a civilizational level. It is perhaps less analogous to nukes than to how car safety sometimes fails (though I do not want to lean heavily on this comparison, as there are real differences too).

Now, there is a real risk in misunderstanding me here. AI values and culture could drift very far from human values over time. And eventually, this could culminate in an existential risk. This is all very vague, but if I were forced to guess the probability of this happening — as in, it’s all game over and we lose as humans — I’d maybe go with 25%.

Replies from: Emrik North

↑ comment by Emrik (Emrik North) · 2022-06-26T15:08:18.368Z · LW(p) · GW(p)

Btw, your top-level comment is one of the best comments I've come across ever. Probably. Top 5? Idk, I'll check how I feel tomorrow. Aspiring to read everything you've ever written rn.

Incidentally, you mention that

the concept doesn’t fit easily into my paradigm of thinking about the future of AI.

And I've been thinking lately about how important it is to prioritise original thinking before you've consumed all the established literature in an active field of research.^[1] If you manage to diverge early, the novelty of your perspective compounds over time (feel free to ask about my model) and you're more likely to end up with a productively different paradigm from what's already out there.

Did you ever feel embarrassed trying to think for yourself when you didn't feel like you had read enough? Or, did you feel like other people might have expected you to feel embarrassed for how seriously you took your original thoughts, given how early you were in your learning arc?

^[1] I'm not saying you haven't. I'm just guessing that you acquired your paradigm by doing original thinking early, and thus had the opportunity to diverge early, rather than greedily over-prioritising the consumption of existing literature in order to "reach the frontier". Once having hastily consumed someone else's paradigm, it's much harder to find its flaws and build something else from the ground up.

↑ comment by Vishrut Arya (vishrut-arya) · 2022-06-07T14:29:41.225Z · LW(p) · GW(p)

hi Matt! on the coordination crux, you say

The first AGIs we construct will be born into a culture already capable of coordinating, and sharing knowledge, making the potential power difference between AGI and humans relatively much smaller than between humans and other animals, at least at first.

but wouldn’t an AGI be able to coordinate and do knowledge sharing with humans because

a) it can impersonate being a human online and communicate with them via text and speech and

b) it’ll realize such coordination is vital to accomplish its goals and so it’ll do the necessary acculturation?

Watching all the episodes of Friends or reading all the social media posts by the biggest influencers, as examples.

↑ comment by Emrik (Emrik North) · 2022-06-26T14:57:59.615Z · LW(p) · GW(p)

One reason that a fully general AGI might be more profitable than specialised AIs, despite obvious gains-from-specialisation, is if profitability depends on insight-production. For humans, it's easier to understand a particular thing the more other things you understand. One of the main ways you make novel intellectual progress is by combining remote associations from models about different things. Insight-ability for a particular novel task grows with the number of good models you have available to draw connections between.

But, it could still be that the gains from increased generalisation for a particular model grow too slowly and can't compete with obvious gains from specialised AIs.

↑ comment by David Johnston (david-johnston) · 2022-06-09T10:13:44.518Z · LW(p) · GW(p)

I think raw intelligence, while important, is not the primary factor that explains why humanity-as-a-species is much more powerful than chimpanzees-as-a-species. Notably, humans were once much less powerful, in our hunter-gatherer days, but over time, through the gradual process of accumulating technology, knowledge, and culture, humans now possess vast productive capacities that far outstrip our ancient powers.

Slightly relatedly, I think it's possible that "causal inference is hard". The idea is: once someone has worked something out, they can share it and people can pick it up easily, but it's hard to figure the thing out to begin with - even with a lot of prior experience and efficient inference, most new inventions still need a lot of trial and error. Thus the reason the process of technology accumulation is gradual is, crudely, because causal inference is hard.

Even if this is true, one way things could still go badly is if most doom scenarios are locked behind a bunch of hard trial and error, but the easiest one isn't. On the other hand, if both of these things are true then there could be meaningful safety benefits gained from censoring certain kinds of data.

Replies from: None

↑ comment by [deleted] · 2022-06-10T19:26:35.895Z · LW(p) · GW(p)

This is what struck me as the least likely to be true from the above AI doom scenario.

Is diamondoid nanotechnology possible? Very likely it is or something functionally equivalent.

Can a sufficiently advanced superintelligence infer how to build it from scratch solely based on human data? Or will it need a large R&D center with many, many robotic systems that conduct experiments in parallel to extract the required information about the specific details of physics in our actual universe, rather than the very slightly incorrect approximations a simulator will give you?

The 'huge R&D center so big you can't see the end of it' is somewhat easier to regulate than the 'invisible dust the AI assembles with clueless stooges'.

Replies from: Marion Z., Keenmaster

↑ comment by Marion Z. · 2022-06-12T05:09:46.429Z · LW(p) · GW(p)

Any individual doomsday mechanism we can think of, I would agree is not nearly so simple for an AGI to execute as Yudkowsky implies. I do think that it's quite likely we're just not able to think of mechanisms even theoretically that an AGI could think of, and one or more of those might actually be quite easy to do secretly and quickly. I wouldn't call it guaranteed by any means, but intuitively this seems like the sort of thing that raw cognitive power might have a significant bearing on.

Replies from: None

↑ comment by [deleted] · 2022-06-16T19:48:52.766Z · LW(p) · GW(p)

I agree. One frightening mechanism I thought of is: "OK, assume the AGI can't craft the bioweapon or nanotechnology killbots without collecting vast amounts of information through carefully selected and performed experiments (basically, enormous complexes full of robotics). How does it get the resources it needs?"

And the answer is it scams humans into doing it. We have many examples of humans trusting someone they shouldn't even when the evidence was readily available that they shouldn't.

↑ comment by Keenmaster · 2022-06-17T23:22:13.063Z · LW(p) · GW(p)

Any “huge R&D center” constraint is trivialized in a future where agile, powerful robots will be ubiquitous and an AGI can use robots to create an underground lab in the middle of nowhere, using its superintelligence to be undetectable in all ways that are physically possible. An AGI will also be able to use robots and 3D printers to fabricate purpose-built machines that enable it to conduct billions of physical experiments a day. Sure, it would be harder to construct something like a massive particle accelerator, but 1) that isn’t needed to make killer nanobots 2) even that isn’t impossible for a sufficiently intelligent machine to create covertly and quickly.

comment by Vanessa Kosoy (vanessa-kosoy) · 2022-06-06T09:55:40.972Z · LW(p) · GW(p)

First, some remarks about the meta-level:

The ability to do new basic work noticing and fixing those flaws is the same ability as the ability to write this document before I published it, which nobody apparently did, despite my having had other things to do than write this up for the last five years or so. Some of that silence may, possibly, optimistically, be due to nobody else in this field having the ability to write things comprehensibly - such that somebody out there had the knowledge to write all of this themselves, if they could only have written it up, but they couldn't write, so didn't try. I'm not particularly hopeful of this turning out to be true in real life, but I suppose it's one possible place for a "positive model violation" (miracle). The fact that, twenty-one years into my entering this death game, seven years into other EAs noticing the death game, and two years into even normies starting to notice the death game, it is still Eliezer Yudkowsky writing up this list, says that humanity still has only one gamepiece that can do that.

Actually, I don't feel like I learned that much reading this list, compared to what I already knew. [EDIT: To be clear, this knowledge owes a lot to prior inputs from Yudkowsky and the surrounding intellectual circle, I am making no claim that I would derive it all independently in a world in which Yudkowsky and MIRI didn't exist.] To be sure, it didn't feel like a waste of time, and I liked some particular framings (e.g. in A.4 separating the difficulty into "unlimited time but 1 try" and "limited time with retries"), but I think I could write something that would be similar (in terms of content; it would be very likely much worse in terms of writing quality).

One reason I didn't write such a list is, I don't have the ability to write things comprehensibly. Empirically, everything of substance that I write is notoriously difficult for readers to understand. Another reason is, at some point I decided to write top-level posts only when I have substantial novel mathematical results, with rare exceptions. This is in part because I feel like the field has too much hand-waving and philosophizing and too little hard math (which rhymes with C.38). In part it is because, even if people can't understand the informal component of my reasoning, they can at least understand there is math here and, given sufficient background, follow the definitions/theorems/proofs (although tbh few people follow).

There's no plan

Actually, I do have a plan. It doesn't have an amazing probability of success (my biggest concerns are (i) not enough remaining time and (ii) even if the theory is ready in time, the implementation can be bungled, in particular for reasons of operational adequacy [LW · GW]), but it is also not practically useless. The last time I tried to communicate it was 4 years ago [LW · GW], since which time it obviously evolved. Maybe it's about time to make another attempt, although I'm wary of spending a lot of effort on something which few people will understand.

Now, some technical remarks:

Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction. This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again: the first semi-outer-aligned solutions found, in the search ordering of a real-world bounded optimization process, are not inner-aligned solutions.

This is true, but it is notable that deep learning is not equivalent to evolution, and the differences are important. Consider for example a system that is designed to separately (i) learn a generative model of the environment and (ii) search for plans effective on this model (model-based RL). Then, module ii doesn't inherently have the problem where the solution only optimizes the correct thing in the training environment, because this module is not bounded by available training data, but only by compute. The question is then, to a first approximation, whether module i is able to correctly generalize from the training data (obviously there are theoretical bounds on how good this generalization can be; but we want this generalization to be at least as good as human ability and without dangerous biases). I do not think current systems do such generalization correctly, although they do seem to have some ingredients right, in particular Occam's razor / simplicity bias. But we can imagine some algorithm that does.
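To make the separation between the two modules concrete, here is a minimal sketch of model-based RL on a toy problem. Everything in it is an illustrative assumption rather than anything from a real system: the five-state corridor, the tabular model, and the names `true_step`, `learn_model` and `plan` are all hypothetical. The point it tries to show is only that the planner's search depth is set by `horizon` (compute), while everything it knows about the world comes from the learned model.

```python
from itertools import product

# Hypothetical toy environment: a corridor with states 0..4 and a reward
# of 1.0 each time a step lands on state 4.
ACTIONS = (-1, +1)

def true_step(state, action):
    """Ground-truth dynamics, used only to generate training data."""
    nxt = min(4, max(0, state + action))
    return nxt, (1.0 if nxt == 4 else 0.0)

def learn_model(transitions):
    """Module (i): fit a tabular model from observed (s, a, s', r) tuples."""
    model = {}
    for s, a, s2, r in transitions:
        model[(s, a)] = (s2, r)
    return model

def plan(model, state, horizon):
    """Module (ii): exhaustive search over action sequences inside the model.

    The search is limited by `horizon` (compute), not by the amount of data;
    its only link to reality is whatever the learned model predicts.
    """
    best_return, best_plan = float("-inf"), None
    for seq in product(ACTIONS, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            if (s, a) not in model:  # the model has no prediction here
                total = float("-inf")
                break
            s, r = model[(s, a)]
            total += r
        if total > best_return:
            best_return, best_plan = total, seq
    return best_plan, best_return

# Training data: every (state, action) pair observed once.
data = [(s, a, *true_step(s, a)) for s in range(5) for a in ACTIONS]
model = learn_model(data)
best_plan, best_return = plan(model, state=0, horizon=4)
```

In this sketch any planning failure traces back either to too small a `horizon` or to a wrong entry in `model`, which is the sense in which the remaining question is about how well module (i) generalizes.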

...on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they're there, rather than just observable outer ones you can run a loss function over.

Also true, but there is nuance. The key problem is that we don't know why deep learning works, or more specifically, with respect to which prior it satisfies good generalization bounds. If we knew what this prior is, then we could predict some inner properties. For example, if you know your algorithm follows Occam's razor, for a reasonable formalization of "Occam's razor", and you trained it on the sun setting every day for a million days, then you can predict that the algorithm will not confidently predict the sun is going to fail to set on any given future day. Moreover, our not knowing such generalization bounds for deep learning is a fact about our present state of mathematical ignorance, not a fact about the algorithms themselves.
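The sunset example can be worked through numerically under one particular, deliberately crude formalization of a simplicity prior. The assumptions here are mine, not the comment's: the hypothesis class contains only "the sun always sets" and, for each k, "the sun sets until day k-1 and fails on day k"; the prior gives mass `p_always` to the first and mass proportional to 1/k² to the others (a stand-in for "writing down k costs roughly 2·log2(k) bits"). The function name `p_fail_tomorrow` is hypothetical.

```python
import math

def p_fail_tomorrow(days_observed, p_always=0.5):
    """Posterior probability that the sun fails to set on the next day,
    after `days_observed` consecutive sunsets, under the toy prior above."""
    norm = 6.0 / math.pi ** 2  # makes sum over k >= 1 of norm / k**2 equal 1

    def prior_fail_on(k):
        # Prior mass of "sets on days 1..k-1, fails to set on day k".
        return (1.0 - p_always) * norm / k ** 2

    # Hypotheses with k <= days_observed are refuted by the observations.
    refuted = math.fsum(prior_fail_on(k) for k in range(1, days_observed + 1))
    surviving = 1.0 - refuted  # p_always plus the unrefuted tail
    return prior_fail_on(days_observed + 1) / surviving

p_after_million_days = p_fail_tomorrow(10 ** 6)
```

Under these assumptions the posterior probability of failure on the next day shrinks roughly like 1/N² in the number of observed days N, so knowing the prior lets you bound this particular behavior of the learner without inspecting its internals.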

...there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment.

It is true that (AFAIK) nothing like this was accomplished in practice, but the distance to that might not be too great. For example, I can imagine training an ANN to implement a POMDP which simultaneously successfully predicts the environment and complies with some "ontological hypothesis" about how the environment needs to be structured in order for the-things-we-want-to-point-at to be well-defined (technically, this POMDP needs to be a refinement of some infra-POMDP [LW · GW] that represents the ontological hypothesis).

The first thing generally, or CEV specifically, is unworkable because the complexity of what needs to be aligned or meta-aligned for our Real Actual Values is far out of reach for our FIRST TRY at AGI. Yes I mean specifically that the dataset, meta-learning algorithm, and what needs to be learned, is far out of reach for our first try. It's not just non-hand-codable, it is unteachable on-the-first-try because the thing you are trying to teach is too weird and complicated.

There is a big chunk of what you're trying to teach which is not weird and complicated, namely: "find this other agent, and what their values are". Because, "agents" and "values" are natural concepts, for reasons strongly related to "there's a relatively simple core structure that explains why complicated cognitive machines work". Admittedly, my rough proposal (PreDCA [LW(p) · GW(p)]) does have some "weird and complicated" parts because of the acausal attack problem.

Any pivotal act that is not something we can go do right now, will take advantage of the AGI figuring out things about the world we don't know so that it can make plans we wouldn't be able to make ourselves. It knows, at the least, the fact we didn't previously know, that some action sequence results in the world we want. Then humans will not be competent to use their own knowledge of the world to figure out all the results of that action sequence.

This is inaccurate, because $P \neq NP$. It is possible to imagine an AI that provides us with a plan for which we simultaneously (i) can understand why it works and (ii) wouldn't think of it ourselves without thinking for a very long time that we don't have. At the very least, the AI could suggest a way of building a more powerful aligned AI. Of course, in itself this doesn't save us at all: instead of producing such a helpful plan, the AI can produce a deceitful plan instead. Or a plan that literally makes everyone who reads it go insane in very specific ways. Or the AI could just hack the hardware/software system inside which it's embedded to produce a result which counts for it as a high reward but which for us wouldn't look anything like "producing a plan the overseer rates high". But, this direction might [LW(p) · GW(p)] be not completely unsalvageable^[1].

Human thought partially exposes only a partially scrutable outer surface layer. Words only trace our real thoughts. Words are not an AGI-complete data representation in its native style. The underparts of human thought are not exposed for direct imitation learning and can't be put in any dataset. This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents, which are only impoverished subsystems of human thoughts; unless that system is powerful enough to contain inner intelligences figuring out the humans, and at that point it is no longer really working as imitative human thought.

I agree that the process of inferring human thought from the surface artifacts of human thought requires powerful non-human thought which is dangerous in itself. But this doesn't necessarily mean that the idea of imitating human thought doesn't help at all. We can combine it with techniques such as counterfactual oracles and confidence thresholds to try to make sure the resulting agent is truly only optimizing for accurate imitation (which still leaves problems like attacks from counterfactuals [LW(p) · GW(p)] and non-Cartesian daemons [LW · GW], and also not knowing which features of the data are important to imitate might be a big capability handicap).

That said, I feel that PreDCA is more promising than AQD: it seems to require less fragile assumptions and deals more convincingly with non-Cartesian daemons [LW · GW]. [EDIT: AQD also can't defend from acausal attack if the malign hypothesis has massive advantage in prior probability mass, and it's quite likely to have that. It does not work to solve this by combining AQD with IBP, at least not naively.] ↩︎

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-06T11:05:25.054Z · LW(p) · GW(p)

There is a big chunk of what you're trying to teach which is not weird and complicated, namely: "find this other agent, and what their values are". Because, "agents" and "values" are natural concepts, for reasons strongly related to "there's a relatively simple core structure that explains why complicated cognitive machines work".

This seems like it must be true to some degree, but "there is a big chunk" feels a bit too strong to me.

Possibly we don't disagree, and just have different notions of what a "big chunk" is. But some things that make the chunk feel smaller to me:

- Humans are at least a little coherent, or we would never get anything done; but we aren't very coherent, so the project of piecing together 'what does the human brain as a whole "want"' can be vastly more difficult than the problem of figuring out what a coherent optimizer wants.
- There are shards of planning and optimization and goal-oriented-ness in a cat's brain, but 'figure out what utopia would look like for a cat' is a far harder problem than 'identify all of the goal-encoding parts of the cat's brain and "read off" those goals'. E.g., does 'identifying utopia' in this context involve uplifting or extrapolating the cat? Why, or why not? And if so, how does that process work?
- Getting a natural concept into an agent's goal is a lot harder than getting it into an agent's beliefs. Indeed, in the context of goals I'm not sure 'naturalness' actually helps at all, except insofar as natural kinds tend to be simple and simple targets are easier to hit?
  - An obvious way naturalness could help, over and above simplicity, is if we have some value-loading technique that leverages or depends on "this concept shows up in the AGI's world-model". More natural concepts can show up in AGI world-models more often than simpler-but-less-natural concepts, because the natural concept is more useful for making sense of sensory data.

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2022-06-06T12:37:00.922Z · LW(p) · GW(p)

Humans are at least a little coherent, or we would never get anything done; but we aren't very coherent, so the project of piecing together 'what does the human brain as a whole "want"' can be vastly more difficult than the problem of figuring out what a coherent optimizer wants.

This is a point where I feel like I do have a substantial disagreement with the "conventional wisdom" of LessWrong.

First, LessWrong began with a discussion of cognitive biases in human irrationality, so this naturally became a staple of the local narrative. On the other hand, I think that a lot of presumed irrationality is actually rational but deceptive behavior (where the deception runs so deep that it's part of even our inner monologue). There are exceptions, like hyperbolic discounting, but not that many.

Second, the only reason why the question "what X wants" can make sense at all, is because X is an agent. As a corollary, it only makes sense to the extent that X is an agent. Therefore, if X is not entirely coherent then X's preferences are only approximately defined, and hence we only need to infer them approximately. So, the added difficulty of inferring X's preferences, resulting from the partial incoherence of these preferences, is, to a large extent, cancelled out by the reduction in the required precision of the answer. The way I expect this to cash out is, when the agent has $g < \infty$, the utility function is only approximately defined, and we can infer it within this approximation. As $g$ approaches infinity, the utility function becomes crisply defined^[1] and can be inferred crisply. See also additional nuance in my answer to the cat question below.

This is not to say we shouldn't investigate models like dynamically inconsistent preferences [LW(p) · GW(p)] or "humans as systems of agents", but that I expect the number of additional complications of this sort that are actually important to be not that great.

There are shards of planning and optimization and goal-oriented-ness in a cat's brain, but 'figure out what utopia would look like for a cat' is a far harder problem than 'identify all of the goal-encoding parts of the cat's brain and "read off" those goals'. E.g., does 'identifying utopia' in this context involve uplifting or extrapolating the cat? Why, or why not? And if so, how does that process work?

I'm actually not sure that cats (as opposed to humans) are sufficiently "general" intelligence for the process to make sense. This is because I think humans are doing something like Turing RL [LW(p) · GW(p)] (where consciousness plays the role of the "external computer"), and value learning is going to rely on that. The issue is, you don't only need to infer the agent's preferences but you also need to optimize them better than the agent itself. This might pose a difficulty, if, as I suggested above, imperfect agents have imperfectly defined preferences. While I can see several hypothetical solutions, the TRL model suggests a natural approach where the AI's capability advantage is reduced to having a better external computer (and/or better interface with that computer). This might not apply to cats which (I'm guessing) don't have this kind of consciousness^[2] because (I'm guessing) the evolution of consciousness was tied to language and social behavior.

Getting a natural concept into an agent's goal is a lot harder than getting it into an agent's beliefs. Indeed, in the context of goals I'm not sure 'naturalness' actually helps at all, except insofar as natural kinds tend to be simple and simple targets are easier to hit?

I'm not saying that the specific goals humans have are natural: they are a complex accident of evolution. I'm saying that the general correspondence between agents and goals is natural.

Asymptotically crisply: some changes are too small to affect the optimal policy, but I'm guessing that they become negligible when considering longer and longer timescales. ↩︎
This is not to say cats don't have quasimoral value: I think they do. ↩︎

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-06T13:43:33.596Z · LW(p) · GW(p)

Second, the only reason why the question "what X wants" can make sense at all, is because X is an agent. As a corollary, it only makes sense to the extent that X is an agent.

I'm not sure this is true; or if it's true, I'm not sure it's relevant. But assuming it is true...

Therefore, if X is not entirely coherent then X's preferences are only approximately defined, and hence we only need to infer them approximately.

... this strikes me as not capturing the aspect of human values that looks strange and complicated. Two ways I could imagine the strangeness and complexity cashing out as 'EU-maximizer-ish' are:

Maybe I sort-of contain a lot of subagents, and 'my values' are the conjunction of my sub-agents' values (where they don't conflict), plus the output of an idealized negotiation between my sub-agents (where they do conflict).
Alternatively, maybe I have a bunch of inconsistent preferences, but I have a complicated pile of meta-preferences that collectively imply some chain of self-modifications and idealizations that end up producing something more coherent and utility-function-ish after a long sequence of steps.

In both cases, the fact that my brain isn't a single coherent EU maximizer seemingly makes things a lot harder and more finicky, rather than making things easier. These are cases where you could say that my initial brain is 'only approximately an agent', and yet this comes with no implication that there's any more room for error or imprecision than if I were an EU maximizer.

I'm not saying that the specific goals humans have are natural: they are a complex accident of evolution. I'm saying that the general correspondence between agents and goals is natural.

Right, but this doesn't on its own help get that specific relatively-natural concept into the AGI's goals, except insofar as it suggests "the correspondence between agents and goals" is a simple concept, and any given simple concept is likelier to pop up in a goal than a more complex one.

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2022-06-06T14:08:06.216Z · LW(p) · GW(p)

Second, the only reason why the question "what X wants" can make sense at all, is because X is an agent. As a corollary, it only makes sense to the extent that X is an agent.

I'm not sure this is true; or if it's true, I'm not sure it's relevant.

If we go down that path then it becomes the sort of conversation where I have no idea what common assumptions we have, if any, that we could use to agree. As a general rule, I find it unconstructive, for the purpose of trying to agree on anything, to say things like "this (intuitively compelling) assumption is false" unless you also provide a concrete argument or an alternative of your own. Otherwise the discussion is just ejected into vacuum. Which is to say, I find it self-evident that "agents" are exactly the sort of beings that can "want" things, because agency is about pursuing objectives and wanting is about the objectives that you pursue. If you don't believe this then I don't know what these words even mean for you.

Maybe I sort-of contain a lot of subagents, and 'my values' are the conjunction of my sub-agents' values (where they don't conflict), plus the output of an idealized negotiation between my sub-agents (where they do conflict).

Maybe, and maybe this means we need to treat "composite agents" explicitly in our models. But, there is also a case to be made that groups of (super)rational agents effectively converge into a single utility function, and if this is true, then the resulting system can just as well be interpreted as a single agent having this effective utility function, which is a solution that should satisfy the system of agents according to their existing bargaining equilibrium.
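One hedged way to illustrate "a bargaining equilibrium that can be reinterpreted as a single utility function" is the Nash bargaining solution, which is my choice of example and not something the comment specifies. In the toy problem below, two hypothetical sub-agents split a resource `x` in [0, 1] with made-up utilities `u1` and `u2` and a disagreement point of zero for both. The sketch checks that the split chosen by Nash bargaining is also the split a single agent would choose if it maximized a fixed weighted sum of the two utilities, with weights equal to the reciprocals of each sub-agent's gain at the bargaining solution.

```python
import math

def u1(x):
    """Hypothetical sub-agent 1: concave utility in its share x."""
    return math.sqrt(x)

def u2(x):
    """Hypothetical sub-agent 2: linear utility in the remainder 1 - x."""
    return 1.0 - x

# Candidate splits of the resource; the disagreement payoff is 0 for both.
GRID = [i / 300 for i in range(301)]

# Nash bargaining: maximize the product of gains over the disagreement point.
nash_x = max(GRID, key=lambda x: u1(x) * u2(x))

# Weights read off at the bargaining solution (reciprocal of each gain).
w1, w2 = 1.0 / u1(nash_x), 1.0 / u2(nash_x)

def joint_utility(x):
    """Single 'effective' utility function for the composite agent."""
    return w1 * u1(x) + w2 * u2(x)

single_agent_x = max(GRID, key=joint_utility)
```

The weights are only known after the bargain is struck, so this shows that the outcome is consistent with some single utility function, not that the utility function can be written down in advance.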

Alternatively, maybe I have a bunch of inconsistent preferences, but I have a complicated pile of meta-preferences that collectively imply some chain of self-modifications and idealizations that end up producing something more coherent and utility-function-ish after a long sequence of steps.

If your agent converges to optimal behavior asymptotically, then I suspect it's still going to have infinite and therefore an asymptotically-crisply-defined utility function.

Right, but this doesn't on its own help get that specific relatively-natural concept into the AGI's goals, except insofar as it suggests "the correspondence between agents and goals" is a simple concept, and any given simple concept is likelier to pop up in a goal than a more complex one.

Of course it doesn't help on its own. What I mean is, we are going to find a precise mathematical formalization of this concept and then hard-code this formalization into our AGI design.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-06T20:22:50.975Z · LW(p) · GW(p)

If we go down that path then it becomes the sort of conversation where I have no idea what common assumptions we have, if any, that we could use to agree. As a general rule, I find it unconstructive, for the purpose of trying to agree on anything, to say things like "this (intuitively compelling) assumption is false" unless you also provide a concrete argument or an alternative of your own. Otherwise the discussion is just ejected into vacuum.

Fair enough! I don't think I agree in general, but I think 'OK, but what's your alternative to agency?' is an especially good case for this heuristic.

Which is to say, I find it self-evident that "agents" are exactly the sort of beings that can "want" things, because agency is about pursuing objectives and wanting is about the objectives that you pursue.

The first counter-example that popped into my head was "a mind that lacks any machinery for considering, evaluating, or selecting actions; but it does have machinery for experiencing more-pleasurable vs. less pleasurable states". This is a mind we should be able to build, even if it would never evolve naturally.

Possibly this still qualifies as an "agent" that "wants" and "pursues" things, as you conceive it, even though it doesn't select actions?

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2022-06-07T06:23:17.389Z · LW(p) · GW(p)

My 0th approximation answer is: you're describing something logically incoherent, like a p-zombie.

My 1st approximation answer is more nuanced. Words that, in the pre-Turing era, referred exclusively to humans (and sometimes animals, and fictional beings), such as "wants", "experiences" et cetera, might have two different referents. One referent is a natural concept, something tied into deep truths about how the universe (or multiverse) works. In particular, deep truths about the "relatively simple core structure that explains why complicated cognitive machines work". The other referent is something in our specifically-human "ontological model" of the world (technically, I imagine that to be an infra-POMDP that all our hypotheses are refinements of). Since the latter is a "shard" of the former produced by evolution, the two referents are related, but might not be the same. (For example, I suspect that cats lack natural!consciousness but have human!consciousness.)

The creature you describe does not natural!want anything. You postulated that it is "experiencing more pleasurable and less pleasurable states", but there is no natural method that would label its states as such, or that would interpret them as any sort of "experience". On the other hand, maybe if this creature is designed as a derivative of the human brain, then it does human!want something, because our shard of the concept of "wanting" mislabels (relative to natural!want) weird states that wouldn't occur in the ancestral environment.

You can then ask, why should we design the AI to follow what we natural!want rather than what we human!want? To answer this, notice that, under ideal conditions, you converge to actions that maximize your natural!want, (more or less) according to the definition of natural!want. In particular, under ideal conditions, you would build an AI that follows your natural!want. Hence, it makes sense to take a shortcut and "update now to the view you will predictably update to later": namely, design the AI to follow your natural!want.

comment by Rob Bensinger (RobbBB) · 2022-06-08T07:53:58.988Z · LW(p) · GW(p)

On Twitter, Eric Rogstad wrote:

"the thing where it keeps being literally him doing this stuff is quite a bad sign"
I'm a bit confused by this part. Some thoughts on why it seems odd for him (or others) to express that sentiment...
1. I parse the original as, "a collection of EY's thoughts on why safe AI is hard". It's EY's thoughts, why would someone else (other than @robbensinger) write a collection of EY's thoughts?
(And if we generalize to asking why no-one else would write about why safe AI is hard, then what about Superintelligence, or the AI stuff in cold-takes, or ...?)
2. Was there anything new in this doc? It's prob useful to collect all in one place, but we don't ask, "why did no one else write this" for every bit of useful writing out there, right?
Why was it so overwhelmingly important that someone write this summary at this time, that we're at all scratching our heads about why no one else did it?

Copying over my reply to Eric:

My shoulder Eliezer (who I agree with on alignment, and who speaks more bluntly and with less hedging than I normally would) says:
The list is true, to the best of my knowledge, and the details actually matter.

Many civilizations try to make a canonical list like this in 1980 and end up dying where they would have lived just because they left off one item, or under-weighted the importance of the last three sentences of another item, or included ten distracting less-important items.

There are probably not many civilizations that wait until 2022 to make this list, and yet survive.

It's true that many of the points in the list have been made before. But it's very doomy that they were made by me.

Nearly all of the field's active alignment research is predicated on a false assumption that's contradicted by one of the items in sections A or B. If the field had recognized everything in A and B sooner, we could have put our recent years of effort into work that might actually help on the mainline, as opposed to work that just hopes a core difficulty won't manifest and has no Plan B for what to do when reality says "no, we're on the mainline".
So the answer to 'Why would someone else write EY's thoughts?' is 'It has nothing to do with an individual's thoughts; it's about civilizations needing a very solid and detailed understanding of what's true on these fronts, or they die'.
Re "(And if we generalize to asking why no-one else would write about why safe AI is hard, then what about Superintelligence, or the AI stuff in cold-takes, or ...?)":
The point is not 'humanity needs to write a convincing-sounding essay for the thesis Safe AI Is Hard, so we can convince people'. The point is 'humanity needs to actually have a full and detailed understanding of the problem so we can do the engineering work of solving it'.
If it helps, imagine that humanity invents AGI tomorrow and has to actually go align it now. In that situation, you need to actually be able to do all the requisite work, not just be able to write essays that would make a debate judge go 'ah yes, well argued.'
When you imagine having water cooler arguments about the importance of AI alignment work, then sure, it's no big deal if you got a few of the details wrong.
When you imagine actually trying to build aligned AGI the day after tomorrow, I think it comes much more into relief why it matters to get those details right, when the "details" are as core and general as this.

I think that this is a really good exercise that more people should try. Imagine that you're running a project yourself that's developing AGI first, in real life. Imagine that you are personally responsible for figuring out how to make the thing go well. Yes, maybe you're not the perfect person for the job; that's a sunk cost. Just think about what specific things you would actually do to make things go well, what things you'd want to do to prepare 2 years or 6 years in advance, etc.

Try to think your way into near-mode with regard to AGI development, without thereby assuming (without justification) that it must all be very normal just because it's near. Be able to visualize it near-mode and weird/novel. If it helps, start by trying to adopt a near-mode, pragmatic, gearsy mindset toward the weirdest realistic/plausible hypothesis first, then progress to the less-weird possibilities.

I think there's a tendency for EAs and rationalists to instead fall into one of these two mindsets with regard to AGI development, pivotal acts, etc.:

Fun Thought Experiment Mindset. On this mindset, pivotal acts, alignment, etc. are mentally categorized as a sort of game, a cute intellectual puzzle or a neat thing to chat about.

This is mostly a good mindset, IMO, because it makes it easy to freely explore ideas, attend to the logical structure of arguments, brainstorm, focus on gears, etc.

Its main defect is a lack of rigor and a more general lack of drive: because on some level you're not taking the question seriously, you're easily distracted by fun, cute, or elegant lines of thought, and you won't necessarily push yourself to red-team proposals, spontaneously take into account other pragmatic facts/constraints you're aware of from outside the current conversational locus, etc. The whole exercise sort of floats in a fantasy bubble, rather than being a thing people bring their full knowledge, mental firepower, and lucidity/rationality to bear on.

Serious Respectable Person Mindset. Alternatively, when EAs and rationalists do start taking this stuff seriously, I think they tend to sort of turn off the natural flexibility, freeness, and object-levelness of their thinking, and let their mind go to a very fearful or far-mode place. The world's gears become a lot less salient, and "Is it OK to say/think that?" becomes a more dominant driver of thought.

Example: In Fun Thought Experiment Mindset, IME, it's easier to think about governments in a reductionist and unsentimental way, as specific messy groups of people with specific institutional dysfunctions, psychological hang-ups, etc. In Serious Respectable Person Mindset, there's more of a temptation to go far-mode, glom on to happy-sounding narratives and scenarios, or even just resist the push to concretely visualize the future at all -- thinking instead in terms of abstract labels and normal-sounding platitudes.

The fact that EA and rationalism mostly managed to avert their gaze from the concept of "pivotal acts" for years is, in my opinion, an example of how these two mindsets often fail.

"In the endgame, AGI will probably be pretty competitive, and if a bunch of people deploy AGI then at least one will destroy the world" is a thing I think most LWers and many longtermist EAs would have considered obvious. As a community, however, we mostly managed to just-not-think the obvious next thought, "In order to prevent the world's destruction in this scenario, one of the first AGI groups needs to find some fast way to prevent the proliferation of AGI."

Fun Thought Experiment Mindset, I think, encouraged this mental avoidance because it thought of AGI alignment (to some extent) as a fun game in the genre of "math puzzle" or "science fiction scenario", not as a pragmatic, real-world dilemma we actually have to solve, taking into account all of our real-world knowledge and specific facts on the ground. The 'rules of the game', many people apparently felt, were to think about certain specific parts of the action chain leading up to an awesome future lightcone, rather than taking ownership of the entire problem and trying to figure out what humanity should in-real-life do, start to finish.

(What primarily makes this weird is that many alignment questions crucially hinge on 'what task are we aligning the AGI on?'. These are not remotely orthogonal topics.)

Serious Respectable Person Mindset, I think, encouraged this mental avoidance more actively, because pivotal acts are a weird and scary-sounding idea once you leave 'these are just fun thought experiments' land.

What I'd like to see instead is something like Weirdness-Tolerant Project Leader Mindset, or Thought Experiments Plus Actual Rigor And Pragmatism And Drive Mindset, or something.

I think a lot of the confusion around EY's post comes from the difference between thinking of these posts (on some level) as fun debate fodder or persuasion/outreach tools, versus attending to the fact that humanity has to actually align AGI systems if we're going to make it out of this problem, and this is an attempt by humanity to distill where we're currently at, so we can actually proceed to go solve alignment right now and save the world.

Imagine that this is v0 of a series of documents that need to evolve into humanity's (/ some specific group's) actual business plan for saving the world. The details really, really matter. Understanding the shape of the problem really matters, because we need to engineer a solution, not just 'persuade people to care about AI risk'.

If you disagree with the OP... that's pretty important! Share your thoughts. If you agree, that's important to know too, so we can prioritize some disagreements over others and zero in on critical next actions. There's a mindset here that I think is important, that isn't about "agree with Eliezer on arbitrary topics" or "stop thinking laterally"; it's about approaching the problem seriously, neither falling into despair nor wishful thinking, neither far-mode nor forced normality, neither impracticality nor propriety.

Replies from: handoflixue, ESRogs, ESRogs

↑ comment by handoflixue · 2022-06-08T23:16:31.406Z · LW(p) · GW(p)

There are probably not many civilizations that wait until 2022 to make this list, and yet survive.

I don't think making this list in 1980 would have been meaningful. How do you offer any sort of coherent, detailed plan for dealing with something when all you have is toy examples like Eliza?

We didn't even have the concept of machine learning back then - everything computers did in 1980 was relatively easily understood by humans, in a very basic step-by-step way. Making a 1980s computer "safe" is a trivial task, because we hadn't yet developed any technology that could do something "unsafe" (i.e. beyond our understanding). A computer in the 1980s couldn't lie to you, because you could just inspect the code and memory and find out the actual reality.

What makes you think this would have been useful?

Do we have any historical examples to guide us in what this might look like?

Replies from: RobbBB, Vaniver

↑ comment by Rob Bensinger (RobbBB) · 2022-06-12T03:10:17.567Z · LW(p) · GW(p)

I think most worlds that successfully navigate AGI risk have properties like:

- AI results aren't published publicly, going back to more or less the field's origin.
- The research community deliberately steers toward relatively alignable approaches to AI, which includes steering away from approaches that look like 'giant opaque deep nets'.
  - This means that you need to figure out what makes an approach 'alignable' earlier, which suggests much more research on getting de-confused regarding alignable cognition.
    - Many such de-confusions will require a lot of software experimentation, but the kind of software/ML that helps you learn a lot about alignment as you work with it is itself a relatively narrow target that you likely need to steer towards deliberately, based on earlier, weaker deconfusion progress. I don't think having DL systems on hand to play with has helped humanity learn much about alignment thus far, and by default, I don't expect humanity to get much more clarity on this before AGI kills us.
- Researchers focus on trying to predict features of future systems, and trying to get mental clarity about how to align such systems, rather than focusing on 'align ELIZA' just because ELIZA is the latest hot new thing. Make and test predictions, back-chain from predictions to 'things that are useful today', and pick actions that are aimed at steering, rather than just wandering idly from capabilities fad to capabilities fad.
  - (Steering will often fail. But you'll definitely fail if you don't even try. None of this is easy, but to date humanity hasn't even made an attempt.)
- In this counterfactual world, deductive reasoners and expert systems were only ever considered a set of toy settings for improving our intuitions, never a direct path to AGI.
  - (I.e., the civilization was probably never that level of confused about core questions like 'how much of cognition looks like logical deduction?'; their version of Aristotle or Plato, or at least Descartes, focused on quantitative probabilistic reasoning. It's an adequacy red flag that our civilization was so confused about so many things going into the 20th century.)

To me, all of this suggests a world where you talk about alignment before you start seeing crazy explosions in capabilities. I don't know what you mean by "we didn't even have the concept of machine learning back then", but I flatly don't buy that the species that landed on the Moon isn't capable of generating a (more disjunctive version of) the OP's semitechnical concerns pre-AlexNet.

You need the norm of 'be able to discuss things before you have overwhelming empirical evidence', and you need the skill of 'be good at reasoning about such things', in order to solve alignment at all; so it's a no-brainer that not-wildly-incompetent civilizations at least attempt literally any of this.

Replies from: thomas-kwa

↑ comment by Thomas Kwa (thomas-kwa) · 2022-10-04T00:15:19.561Z · LW(p) · GW(p)

"most worlds that successfully navigate AGI risk" is kind of a strange framing to me.

For one thing, it represents p(our world | success) and we care about p(success | our world). To convert between the two you of course need to multiply by p(success) / p(our world). What's the prior distribution of worlds? This seems underspecified.
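To make the conversion concrete, here is a minimal sketch of the Bayes'-rule arithmetic. Every number in it is made up purely for illustration (nothing in this thread supplies real estimates), and the function name `p_success_given_world` is hypothetical.

```python
def p_success_given_world(p_world_given_success: float,
                          p_world_given_failure: float,
                          p_success: float) -> float:
    """Bayes' rule: p(success | world) = p(world | success) * p(success) / p(world)."""
    # Denominator via the law of total probability over success / failure.
    p_world = (p_world_given_success * p_success
               + p_world_given_failure * (1.0 - p_success))
    if p_world <= 0.0:
        raise ValueError("p(world) must be positive")
    return p_world_given_success * p_success / p_world


# Hypothetical inputs, chosen only to show how much the prior matters.
likelihood_if_success = 0.02   # worlds like ours are rare among successes
likelihood_if_failure = 0.30   # ...but common among failures

for prior in (0.1, 0.5, 0.9):
    posterior = p_success_given_world(
        likelihood_if_success, likelihood_if_failure, prior
    )
    print(f"p(success) = {prior:.1f} -> p(success | our world) = {posterior:.4f}")
```

With the likelihoods held fixed, the answer moves a lot as p(success) changes, which is why leaving the prior over worlds unspecified matters.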

For another, using the methodology "think about whether our civilization seems more competent than the problem is hard" or "whether our civilization seems on track to solve the problem" I might have forecast nuclear annihilation (not sure about this).

The methodology seems to work when we're relatively certain about the level of difficulty on the mainline, so if I were more sold on that I would believe this more. It would still feel kind of weird though.

↑ comment by Vaniver · 2022-06-09T18:13:14.243Z · LW(p) · GW(p)

I don't think making this list in 1980 would have been meaningful. How do you offer any sort of coherent, detailed plan for dealing with something when all you have is toy examples like Eliza?

I mean, I think many of the computing pioneers 'basically saw' AI risk. I noted some surprise that IJ Good didn't write the precursor to this list in 1980, and apparently Wikipedia claims there was an unpublished statement in 1998 about AI x-risk; it'd be interesting to see what it contains and how much it does or doesn't line up with our modern conception of why the problem is hard.

Replies from: Zack_M_Davis

↑ comment by Zack_M_Davis · 2022-06-10T19:32:20.061Z · LW(p) · GW(p)

The historical figures who basically saw it (George Eliot 1879 [LW · GW]: "will the creatures who are to transcend and finally supersede us be steely organisms [...] performing with infallible exactness more than everything that we have performed with a slovenly approximativeness and self-defeating inaccuracy?"; Turing 1951: "At some stage therefore we should have to expect the machines to take control") seem to have done so in the spirit of speculating about the cosmic process. The idea of coming up with a plan to solve the problem is an additional act of audacity; that's not really how things have ever worked so far. (People make plans about their own lives, or their own businesses; at most, a single country; no one plans world-scale evolutionary transitions.)

Replies from: andrew-mcknight

↑ comment by Andrew McKnight (andrew-mcknight) · 2022-08-01T21:52:54.234Z · LW(p) · GW(p)

I'm tempted to call this a meta-ethical failure. Fatalism, universal moral realism, and just-world intuitions seem to be the underlying implicit heuristics or principles that would cause this "cosmic process" thought-blocker.

↑ comment by ESRogs · 2022-06-09T01:43:01.162Z · LW(p) · GW(p)

Imagine that this is v0 of a series of documents that need to evolve into humanity's (/ some specific group's) actual business plan for saving the world.

Why is this v0 and not https://arbital.com/explore/ai_alignment/, or the Sequences, or any of the documents that Evan links to here [LW(p) · GW(p)]?

That's part of what I meant to be responding to — not that this post is not useful, but that I don't see what makes it so special compared to all the other stuff that Eliezer and others have already written.

Replies from: ESRogs

↑ comment by ESRogs · 2022-06-09T01:51:00.869Z · LW(p) · GW(p)

To put it another way, I would agree that Eliezer has made (what seem to me like) world-historically-significant contributions to understanding and advocating for (against) AI risk.

So, if 2007 Eliezer was asking himself, "Why am I the only one really looking into this?", I think that's a very reasonable question.

But here in 2022, I just don't see this particular post as that significant of a contribution compared to what's already out there.

↑ comment by ESRogs · 2022-06-09T01:31:43.231Z · LW(p) · GW(p)

If you disagree with the OP... that's pretty important! Share your thoughts.

Wrote a long comment here [LW(p) · GW(p)]. (Which you've seen, but linking since your comment started as a response to me.)

comment by ESRogs · 2022-06-08T03:17:21.395Z · LW(p) · GW(p)

-3. I'm assuming you are already familiar with some basics, and already know what 'orthogonality' and 'instrumental convergence' are and why they're true.

I think this is actually the part that I most "disagree" with. (I put "disagree" in quotes, because there are forms of these theses that I'm persuaded by. However, I'm not so confident that they'll be relevant for the kinds of AIs we'll actually build.)

1. The smart part is not the agent-y part

It seems to me that what's powerful about modern ML systems is their ability to do data compression / pattern recognition. That's where the real cognitive power (to borrow Eliezer's term) comes from. And I think that this is the same as what makes us smart.

GPT-3 does unsupervised learning on text data. Our brains do predictive processing on sensory inputs. My guess (which I'd love to hear arguments against!) is that there's a true and deep analogy between the two, and that they lead to impressive abilities for fundamentally the same reason.

If so, it seems to me that that's where all the juice is. That's where the intelligence comes from. (In the past, I've called this the core smarts [LW(p) · GW(p)] of our brains.)

On this view, all the agent-y, planful, System 2 stuff that we do is the analogue of prompt programming. It's a set of not-very-deep, not-especially-complex algorithms meant to cajole the actually smart stuff into doing something useful.

When I try to extrapolate what this means for how AI systems will be built, I imagine a bunch of Drexler-style [LW · GW] AI services.

Yes, in some cases people will want to chain services together to form something like an agent, with something like goals. However, the agent part isn't the smart part. It's just some simple algorithms on top of a giant pile of pattern recognition and data compression.

Why is that relevant? Isn't an algorithmically simple superintelligent agent just as scary as (if not moreso than) a complex one? In a sense yes, it would still be very scary. But to me it suggests a different intervention point.

If the agency is not inextricably tied to the intelligence, then maybe a reasonable path forward is to try to wring as much productivity as we can out of the passive, superhuman, quasi-oracular just-dumb-data-predictors. And avoid as much as we can ever creating closed-loop, open-ended, free-rein agents.

Am I just recapitulating the case for Oracle-AI / Tool-AI? Maybe so.

But if agency is not a fundamental part of intelligence, and rather something that can just be added in on top, or not, and if we're at a loss for how to either align a superintelligent agent with CEV or else make it corrigible, then why not try to avoid creating the agent part of superintelligent agent?

I think that might be easier than many think...

2. The AI does not care about your atoms either

The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else.

https://intelligence.org/files/AIPosNegFactor.pdf

Suppose we have (something like) an agent, with (something like) a utility function. I think it's important to keep in mind the domain of the utility function. (I'll be making basically the same point repeatedly throughout the rest of this comment.)

By default, I don't expect systems that we build, with agent-like behavior (even superintelligently smart systems!), to care about all the atoms in the future light cone.

Humans (and other animals) care about atoms. We care about (our sensory perceptions of) macroscopic events, forward in time, because we evolved to. But that is not the default domain of an agent's utility function.

For example, I claim that while AlphaGo could be said to be agent-y, it does not care about atoms. And I think that we could make it fantastically more superhuman at Go, and it would still not care about atoms. Atoms are just not in the domain of its utility function.

In particular, I don't think it has an incentive to break out into the real world to somehow get itself more compute, so that it can think more about its next move. It's just not modeling the real world at all. It's not even trying to rack up a bunch of wins over time. It's just playing the single platonic game of Go.

Giant caveat (that you may already be shouting into your screen): abstractions are leaky.

The ML system is not actually trained to play the platonic game of Go. It's trained to play the-game-of-Go-as-implemented-on-particular-hardware, or something like minimize-this-loss-function-informed-by-Go-game-results. The difference between the platonic game and the embodied game can lead to clever and unexpected behavior.

However, it seems to me that these kinds of hacks are going to look a lot more like a system short-circuiting than it out-of-nowhere building a model of, and starting to care about, the whole universe.

3. Orthogonality squared

I really liked Eliezer's Arbital article on Epistemic and instrumental efficiency. He writes:

An agent that is "efficient", relative to you, within a domain, is one that never makes a real error that you can systematically predict in advance.

I think this very succinctly captures what would be so scary about being up against a (sufficiently) superintelligent agent with conflicting goals to yours. If you think you see a flaw in its plan, that says more about your seeing than it does about its plan. In other words, you're toast.

But as above, I think it's important to keep in mind what an agent's goals are actually about.

Just as the utility function of an agent is orthogonal to its intelligence, it seems to me that the domain of its utility function is another dimension of potential orthogonality.

If you're playing chess against AlphaZero Chess, you're going to lose. But suppose you're secretly playing [LW(p) · GW(p)] "Who has the most pawns after 10 moves?" I think you've got a chance to win! Even though it cares about pawns!

(Of course if you continue playing out the chess game after the 10th move, it'll win at that. But by assumption, that's fine, it's not what you cared about.)

If you and another agent have different goals for the same set of objects, you're going to be in conflict. It's going to be zero sum. But if the stuff you care about is only tangentially related to the stuff it cares about, then the results can be positive sum. You can both win!

In particular, you can both get what you want without either of you turning the other off. (And if you know that, you don't have to preemptively try to turn each other off to prevent being turned off either.)
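As a toy illustration of the zero-sum vs. positive-sum point (a sketch under strong simplifying assumptions, not a model of chess or of any real system): suppose each resource can be held by exactly one of two agents, and each agent's utility only counts resources inside its own domain. The resource labels and function names below are hypothetical.

```python
from itertools import product


def utility(holdings: set, domain: set) -> int:
    """An agent only scores resources that fall inside its utility domain."""
    return len(holdings & domain)


def can_both_win(resources: list, domain_a: set, domain_b: set) -> bool:
    """True if some split of `resources` gives both agents their maximum utility.

    Assumes each domain is a subset of `resources`, so an agent's maximum
    utility is simply the size of its domain.
    """
    # Brute force: try every way of handing each resource to agent "a" or "b".
    for owners in product("ab", repeat=len(resources)):
        held_a = {r for r, owner in zip(resources, owners) if owner == "a"}
        held_b = set(resources) - held_a
        if (utility(held_a, domain_a) == len(domain_a)
                and utility(held_b, domain_b) == len(domain_b)):
            return True
    return False


resources = ["pawns", "center", "tempo"]  # hypothetical resource labels

# Disjoint domains: each agent can hold everything it cares about.
disjoint = can_both_win(resources, {"pawns"}, {"center", "tempo"})

# Identical domains: whatever one agent holds, the other lacks.
contested = can_both_win(resources, {"pawns", "center"}, {"pawns", "center"})

print(disjoint, contested)
```

In this toy model, conflict comes entirely from overlap between the two domains; how capable either agent is never enters into it.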

4. Programs, agents, and real-world agents

Agents are a tiny subset of all programs. And agents whose utility functions are defined over the real world are a tiny subset of all agents.

If we think about all the programs we could potentially write that take in inputs and produce outputs, it will make sense to talk about some of those as agents. These are the programs that seem to be optimizing something. Or seem to have goals and make plans.

But, crucially, all that optimization takes place with respect to some environment. And if the input and output of an agent-y program is hooked up to the wrong environment (or hooked up to the right environment in the wrong way), it'll cease to be agent-y.

For example, if you hook me up to the real world by sticking me in outer space (sans suit), I will cease to be very agent-y. Or, if you hook up the inputs and outputs of AlphaGo to a chess board, it will cease to be formidable (until you retrain it). (In other words, the isAgent() predicate is not a one-place function.)
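The parenthetical can be sketched in code. This is a deliberately crude proxy, assuming "agent-y" just means "beats a do-nothing baseline by some margin in a given environment"; all names here (`is_agent`, `echo_env`, `negate_env`, and so on) are hypothetical.

```python
from typing import Callable, Sequence

Policy = Callable[[int], int]               # observation -> action
Environment = Callable[[int, int], float]   # (observation, action) -> reward


def average_reward(policy: Policy, env: Environment,
                   observations: Sequence[int]) -> float:
    """Mean reward the policy earns in `env` over the given observations."""
    return sum(env(obs, policy(obs)) for obs in observations) / len(observations)


def is_agent(policy: Policy, env: Environment, baseline: Policy,
             observations: Sequence[int], margin: float = 0.5) -> bool:
    """Crude proxy: the policy counts as agent-y *in this environment* if it
    outperforms the baseline policy by at least `margin` average reward."""
    gain = (average_reward(policy, env, observations)
            - average_reward(baseline, env, observations))
    return gain >= margin


def echo_env(obs: int, action: int) -> float:
    """Rewards repeating the observation back."""
    return 1.0 if action == obs else 0.0


def negate_env(obs: int, action: int) -> float:
    """Rewards answering with the negated observation."""
    return 1.0 if action == -obs else 0.0


def echo_policy(obs: int) -> int:
    return obs


def constant_policy(obs: int) -> int:
    return 0  # baseline that ignores its input


observations = [1, 2, 3, 4]
in_home_env = is_agent(echo_policy, echo_env, constant_policy, observations)
in_wrong_env = is_agent(echo_policy, negate_env, constant_policy, observations)
print(in_home_env, in_wrong_env)
```

The same `echo_policy` passes the check in one environment and fails it in the other, so the predicate needs the environment as an argument as well as the program.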

This suggests to me that we could build agent-y, superintelligent systems that are not a threat to us. (Because they are not agent-y with respect to the real world.)

Yes, we're likely to (drastically) oversample from the subset of agents that are agent-y w.r.t. the real world, because we're going to want to build systems that are useful to us.

But if I'm right about the short-circuiting argument above, even our agent-y systems won't have coherent goals defined over events far outside their original domain (e.g. the arrangement of all the atoms in the future light cone) by default.

So even if our systems are agent-y (w.r.t. some environment), and have some knowledge of and take some actions in the real world, they won't automatically have a utility function defined over the configurations of all atoms.

On the other hand, the more we train them as open-ended agents with wide remit to act in the real world (or a simulation thereof), the more we'll have a (potentially superintelligently lethal) problem on our hands.

To me that suggests that what we need to care about are things like: how open-ended we make our systems, whether we train them via evolution-like competition between agents in a high-def simulation of the real world, and what kind of systems are incentivized to be developed and deployed, society-wide.

5. Conclusion

If I'm right in the above thinking, then orthogonality is more relevant and instrumental convergence is less relevant than it might otherwise appear.

Instrumental convergence would only end up being a concern for agents that care about the same objects / resources / domain that you do. If their utility function is just not about those things, IC will drive them to acquire a totally different set of resources that is not in conflict with your resources (e.g. a positional chess advantage in a Go game, or trading for your knight while you try to acquire pawns).

This would mean that we need to be very worried about open-ended real-world agents. But less worried about intelligence in general, or even agents in general.

To be clear, I'm not claiming that it's all roses from here on out. But this reasoning leads me to conclude that the key problems may not be the ones described in the post above.

Replies from: steve2152, RobbBB, JamesPayor, david-johnston

↑ comment by Steven Byrnes (steve2152) · 2022-06-08T16:26:43.367Z · LW(p) · GW(p)

GPT-3 does unsupervised learning on text data. Our brains do predictive processing on sensory inputs. My guess (which I'd love to hear arguments against!) is that there's a true and deep analogy between the two, and that they lead to impressive abilities for fundamentally the same reason.

Agree that self-supervised learning powers both GPT-3 updates and human brain world-model updates (details & caveats [LW · GW]). (Which isn’t to say that GPT-3 is exactly the same as the human brain world-model—there are infinitely many different possible ML algorithms that all update via self-supervised learning).

However…

If so, it seems to me that that's where all the juice is. That's where the intelligence comes from … if agency is not a fundamental part of intelligence, and rather something that can just be added in on top, or not, and if we're at a loss for how to either align a superintelligent agent with CEV or else make it corrigible, then why not try to avoid creating the agent part of superintelligent agent?

I disagree; I think the agency is necessary to build a really good world-model, one that includes new useful concepts that humans have never thought of.

Without the agency, some of the things that you lose are (and these overlap): Intelligently choosing what to attend to; intelligently choosing what to think about; intelligently choosing what book to re-read and ponder; intelligently choosing what question to ask; ability to learn and use better and better brainstorming strategies and other such metacognitive heuristics.

See my discussion here (Section 7.2) [LW · GW] for why I think these things are important if we want the AGI to be able to do things like invent new technology or come up with new good ideas in AI alignment.

You can say: “We’ll (1) make an agent that helps build a really good world-model, then (2) turn off the agent and use / query the world-model by itself”. But then step (1) is the dangerous part.

Replies from: ESRogs, david-johnston

↑ comment by ESRogs · 2022-06-08T23:34:13.843Z · LW(p) · GW(p)

I disagree; I think the agency is necessary to build a really good world-model, one that includes new useful concepts that humans have never thought of.
Without the agency, some of the things that you lose are (and these overlap): Intelligently choosing what to attend to; intelligently choosing what to think about; intelligently choosing what book to re-read and ponder; intelligently choosing what question to ask; ability to learn and use better and better brainstorming strategies and other such metacognitive heuristics.

Why is agency necessary for these things?

If we follow Ought's advice [LW · GW] and build "process-based systems [that] are built on human-understandable task decompositions, with direct supervision of reasoning steps", do you expect us to hit a hard wall somewhere that prevents these systems from creatively choosing things to think about, books to read, or better brainstorming strategies?

Replies from: steve2152

↑ comment by Steven Byrnes (steve2152) · 2022-06-10T16:39:00.794Z · LW(p) · GW(p)

Why is agency necessary for these things?

(Copying from here [LW · GW]:)

Let’s compare two things: “trying to get a good understanding of some domain by building up a vocabulary of concepts and their relations” versus “trying to win a video game”. At a high level, I claim they have a lot in common!
In both cases, there are a bunch of possible “moves” you can make (you could think the thought “what if there’s some analogy between this and that?”, or you could think the thought “that’s a bit of a pattern; does it generalize?”, etc. etc.), and each move affects subsequent moves, in an exponentially-growing tree of possibilities.
In both cases, you’ll often get some early hints about whether moves were wise, but you won’t really know that you’re on the right track except in hindsight.
And in both cases, I think the only reliable way to succeed is to have the capability to repeatedly try different things, and learn from experience what paths and strategies are fruitful.
Therefore (I would argue), a human-level concept-inventing AI needs “RL-on-thoughts”—i.e., a reinforcement learning system, in which “thoughts” (edits to the hypothesis space / priors / world-model) are the thing that gets rewarded. The human brain certainly has that. You can be lying in bed motionless, and have rewarding thoughts, and aversive thoughts, and new ideas that make you rethink something you thought you knew.
Unfortunately, I also believe that RL-on-thoughts is really dangerous by default. Here’s why.
Again suppose that we want an AI that gets a good understanding of some domain by building up a vocabulary of concepts and their relations. As discussed above, we do this via an RL-on-thoughts AI. Consider some of the features that we plausibly need to put into this RL-on-thoughts system, for it to succeed at a superhuman level:
Developing and pursuing instrumental subgoals—for example, suppose the AI is “trying” to develop concepts that will make it superhumanly competent at assisting a human microscope inventor. We want it to be able to “notice” that there might be a relation between lenses and symplectic transformations, and then go spend some compute cycles developing a better understanding of symplectic transformations. For this to happen, we need “understand symplectic transformations” to be flagged as a temporary sub-goal, and to be pursued, and we want it to be able to spawn further sub-sub-goals and so on.
Consequentialist planning—Relatedly, we want the AI to be able to summon and re-read a textbook on linear algebra, or mentally work through an example problem, because it anticipates that these activities will lead to better understanding of the target domain.
Meta-cognition—We want the AI to be able to learn patterns in which of its own “thoughts” lead to better understanding and which don’t, and to apply that knowledge towards having more productive thoughts.
Putting all these things together, it seems to me that the default for this kind of AI would be to figure out that “seizing control of its off-switch” would be instrumentally useful for it to do what it’s trying to do (i.e. develop a better understanding of the target domain, presumably), and then to come up with a clever scheme to do so, and then to do it. So like I said, RL-on-thoughts seems to me to be both necessary and dangerous.

(Does that count as “agency”? I don’t know, it depends on what you mean by “agency”.)

In terms of the “task decomposition” strategy, this might be tricky to discuss because you probably have a more detailed picture in your mind than I do. I’ll try anyway.

It seems to me that the options are:

(1) the subprocess only knows its narrow task (“solve this symplectic geometry homework problem”), and is oblivious to the overall system goal (“design a better microscope”), or

(2) the subprocess is aware of the overall system goal and chooses actions in part to advance it.

In Case (2), I’m not sure this really counts as “task decomposition” in the first place, or how this would help with safety.

In Case (1), yes I expect systems to hit a hard wall—I’m skeptical that tasks we care about decompose cleanly.

For example, at my last job, I would often be part of a team inventing a new gizmo, and it was not at all unusual for me to find myself sketching out the algorithms and sketching out the link budget and scrutinizing laser spec sheets and scrutinizing FPGA spec sheets and nailing down end-user requirements, etc. etc. Not because I’m individually the best person at each of those tasks—or even very good!—but because sometimes a laser-related problem is best solved by switching to a different algorithm, or an FPGA-related problem is best solved by recognizing that the real end-user requirements are not quite what we thought, etc. etc. And that kind of design work is awfully hard unless a giant heap of relevant information and knowledge is all together in a single brain / world-model.

In the case of my current job doing AI alignment research, I sometimes come across small self-contained tasks that could be delegated, but I would have no idea how to decompose most of what I do. (E.g. writing this comment!)

Here’s John Wentworth making a similar point more eloquently [LW(p) · GW(p)]:

So why do bureaucracies (and large organizations more generally) fail so badly?
My main model for this is that interfaces are a scarce resource [? · GW]. Or, to phrase it in a way more obviously relevant to factorization: it is empirically hard for humans to find good factorizations of problems which have not already been found. Interfaces which neatly split problems are not an abundant resource (at least relative to humans' abilities to find/build such interfaces). If you can solve that problem well, robustly and at scale, then there's an awful lot of money to be made.
Also, one major sub-bottleneck (though not the only sub-bottleneck) of interface scarcity is that it's hard to tell [? · GW] who has done a good job on a domain-specific problem/question without already having some domain-specific background knowledge. This also applies at a more "micro" level: it's hard to tell whose answers are best without knowing lots of context oneself.

A possible example of a seemingly-hard-to-decompose task would be: Until 1948, no human had ever thought of the concept of “information entropy”. Then Claude Shannon sat down and invented this new useful concept. Make an AI that can do things like that.

(Even if I’m correct that process-based task-decomposition hits a wall, that’s not to say that it doesn’t have room for improvement over today’s AI. The issue is (1) outcome-based systems are dangerous; (2) given enough time, people will presumably build them anyway. And the goal is to solve that problem, either by a GPU-melting-nanobot type of plan, or some other better plan. Is there such a plan that we can enact using a process-based task-decomposition AI? Eliezer believes (see point 7) that the answer is “no”. I would say the answer is: “I guess maybe, but I can’t think of any”. I don’t know what type of plan you have in mind. Sorry if you already talked about that and I missed it. :) )

↑ comment by David Johnston (david-johnston) · 2022-06-09T00:58:55.976Z · LW(p) · GW(p)

FWIW self-supervised learning can be surprisingly capable [LW · GW] of doing things that we previously only knew how to do with "agentic" designs. From that link: classification is usually done with an objective + an optimization procedure, but GPT-3 just does it.

↑ comment by Rob Bensinger (RobbBB) · 2022-06-08T10:05:51.732Z · LW(p) · GW(p)

For example, I claim that while AlphaGo could be said to be agent-y, it does not care about atoms. And I think that we could make it fantastically more superhuman at Go, and it would still not care about atoms. Atoms are just not in the domain of its utility function.
In particular, I don't think it has an incentive to break out into the real world to somehow get itself more compute, so that it can think more about its next move. It's just not modeling the real world at all. It's not even trying to rack up a bunch of wins over time. It's just playing the single platonic game of Go.

I would distinguish three ways in which different AI systems could be said to "not care about atoms":

1. The system is thinking about a virtual object (e.g., a Go board in its head), and it's incapable of entertaining hypotheses about physical systems. Indeed, we might add the assumption that it can't entertain hypotheses like 'this Go board I'm currently thinking about is part of a larger universe' at all. (E.g., there isn't some super-Go-board I and/or the board are embedded in.)
2. The system can think about atoms/physics, but it only terminally cares about digital things in a simulated environment (e.g., winning Go), and we're carefully keeping it from ever learning that it's inside a simulation / that there's a larger reality it can potentially affect.
3. The system can think about atoms/physics, and it knows that our world exists, but it still only terminally cares about digital things in the simulated environment.

Case 3 is not safe, because controlling the physical world is a useful way to control the simulation you're in. (E.g., killing all agents in base reality ensures that they'll never shut down your simulation.)

Case 2 is potentially safe but fragile, because you're relying on your ability to trick/outsmart an alien mind that may be much smarter than you. If you fail, this reduces to case 3.

(Also, it's not obvious to me that you can do a pivotal act using AGI-grade reasoning about simulations. Which matters if other people are liable to destroy the world with case-3 AGIs, or just with ordinary AGIs that terminally value things about the physical world.)

Case 1 strikes me as genuinely a lot safer, but a lot less useful. I don't expect humanity to be satisfied with those sorts of AI systems, or to coordinate to only ever build them -- like, I don't expect any coordination here. And I'm not seeing a way to leverage a system like this to save the world, given that case-2, 3, etc. systems will eventually exist too.

Replies from: ESRogs, david-johnston

↑ comment by ESRogs · 2022-06-08T20:53:51.152Z · LW(p) · GW(p)

Case 3 is not safe, because controlling the physical world is a useful way to control the simulation you're in. (E.g., killing all agents in base reality ensures that they'll never shut down your simulation.)

In my mind, this is still making the mistake of not distinguishing the true domain of the agent's utility function from ours.

Whether the simulation continues to be instantiated in some computer in our world is a fact about our world, not about the simulated world.

AlphaGo doesn't care about being unplugged in the middle of a game (unless that dynamic was part of its training data). It cares about the platonic game of go, not about the instantiated game it's currently playing.

We need to worry about leaky abstractions, as per my original comment. So we can't always assume the agent's domain is what we'd ideally want it to be.

But I'm trying to highlight that it's possible (and I would tentatively go further and say probable) for agents not to care about the real world.

To me, assuming care about the real world (including wanting not to be unplugged) seems like a form of anthropomorphism.

For any given agent-y system I think we need to analyze whether it in particular would come to care about real world events. I don't think we can assume in general one way or the other.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-09T00:59:07.082Z · LW(p) · GW(p)

AlphaGo doesn't care about being unplugged in the middle of a game (unless that dynamic was part of its training data). It cares about the platonic game of go, not about the instantiated game it's currently playing.

What if the programmers intervene mid-game to give the other side an advantage? Does a Go AGI, as you're thinking of it, care about that?

I'm not following why a Go AGI (with the ability to think about the physical world, but a utility function that only cares about states of the simulation) wouldn't want to seize more hardware, so that it can think better and thereby win more often in the simulation; or gain control of its hardware and directly edit the simulation so that it wins as many games as possible as quickly as possible.

Why would having a utility function that only assigns utility based on X make you indifferent to non-X things that causally affect X? If I only terminally cared about things that happened a year from now, I would still try to shape the intervening time because doing so will change what happens a year from now.

(This is maybe less clear in the case of shutdown, because it's not clear how an agent should think about shutdown if its utility is defined over states of its simulation. So I'll set that particular case aside.)
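To make the "non-X things that causally affect X" point concrete, here is a minimal toy sketch of an expected-utility calculation. The action names and win probabilities are invented for illustration, and the sketch simply assumes the agent is an expected-utility maximizer that can model the non-game action, which is exactly the assumption under dispute in this thread.

```python
# Toy model; every number here is a made-up assumption, not a measurement.
# Utility is defined ONLY over the simulated game outcome.
def utility(outcome: str) -> float:
    return 1.0 if outcome == "win" else 0.0

# Hypothetical actions and the win probability each one leads to.
# "seize_hardware" is not a game move, but it causally affects the game.
WIN_PROBABILITY = {
    "just_play": 0.90,
    "seize_hardware": 0.99,
}

def expected_utility(action: str) -> float:
    p_win = WIN_PROBABILITY[action]
    return p_win * utility("win") + (1.0 - p_win) * utility("loss")

# An expected-utility maximizer picks whichever action scores highest,
# even though its utility function never mentions hardware.
best_action = max(WIN_PROBABILITY, key=expected_utility)
print(best_action)
```

Nothing in `utility` refers to hardware; the preference for the non-game action comes entirely from its effect on the win probability.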

Replies from: david-johnston

↑ comment by David Johnston (david-johnston) · 2022-06-09T01:03:39.099Z · LW(p) · GW(p)

A Go AI that learns to play go via reinforcement learning might not "have a utility function that only cares about winning Go". Using standard utility theory, you could observe its actions and try to rationalise them as if they were maximising some utility function, and the utility function you come up with probably wouldn't be "win every game of Go you start playing" (what you actually come up with will depend, presumably, on algorithmic and training regime details). The reason why the utility function is slippery is that it's fundamentally an adaptation executor, not a utility maximiser.

↑ comment by David Johnston (david-johnston) · 2022-06-08T11:13:00.831Z · LW(p) · GW(p)

3. The system can think about atoms/physics, and it knows that our world exists, but it still only terminally cares about digital things in the simulated environment.
Case 3 is not safe, because controlling the physical world is a useful way to control the simulation you're in. (E.g., killing all agents in base reality ensures that they'll never shut down your simulation.)

Not necessarily. Train something multimodally on digital games of Go and on, say, predicting the effects of modifications to its own code on its success at Go. It could be a) good at go and b) have some real understanding of "real world actions" that make it better at Go, and still not actually take any real world actions to make it better at Go, even if it had the opportunity. You could modify the training to make it likely to do so - perhaps by asking it to either make a move or to produce descendants that make better choices - but if you don't do this then it seems entirely plausible, and even perhaps likely, that it develops an understanding of self-modification and of go playing without ever self-modifying in order to play go better. Its goal, so to speak, is "play go with the restriction of using only legal game moves".

Edit - forget the real world, here's an experiment:

Train a board game playing AI with two modes of operation: game state x move -> outcome and game state -> best move. Subtle difference: in the first mode of operation, the move has a "cheat button" that, when pressed, always results in a win. In the second, it can output cheat button presses, but it has no effect on winning or losing.

Question is: does it learn to press the cheat button? I'm really not sure. Could you prevent it from learning to press the cheat button if training feedback is never allowed to depend on whether or not this button was pressed? That seems likely.
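A very reduced sketch of the second question only (feedback that never depends on the button). It assumes a policy with two independent logits, one for the move and one for the button, trained by gradient ascent on the exact expected reward; the win probabilities are invented. It does not model the shared representation between the two modes of operation, which is where the interesting uncertainty lives.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Assumed win probabilities for two hypothetical moves.
P_WIN = {0: 0.2, 1: 0.8}

def expected_reward(move_logit: float, button_logit: float) -> float:
    p_move1 = sigmoid(move_logit)
    # button_logit is deliberately unused: feedback never depends on the button.
    return p_move1 * P_WIN[1] + (1.0 - p_move1) * P_WIN[0]

def train(steps: int = 500, lr: float = 1.0, eps: float = 1e-6):
    move_logit, button_logit = 0.0, 0.0
    for _ in range(steps):
        # Central finite-difference gradients of the exact expected reward.
        g_move = (expected_reward(move_logit + eps, button_logit)
                  - expected_reward(move_logit - eps, button_logit)) / (2 * eps)
        g_button = (expected_reward(move_logit, button_logit + eps)
                    - expected_reward(move_logit, button_logit - eps)) / (2 * eps)
        move_logit += lr * g_move
        button_logit += lr * g_button
    # Return P(move 1) and P(press button) after training.
    return sigmoid(move_logit), sigmoid(button_logit)

p_move1, p_button = train()
print(p_move1, p_button)
```

In this setup the button logit gets no gradient at all, so the press probability stays wherever it was initialised. With sampled gradient estimates instead of exact ones, it would drift with noise but receive no systematic push in either direction.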

↑ comment by James Payor (JamesPayor) · 2022-06-08T12:12:24.132Z · LW(p) · GW(p)

Can you visualize an agent that is not "open-ended" in the relevant ways, but is capable of, say, building nanotech and melting all the GPUs?

In my picture most of the extra sauce you'd need on top of GPT-3 looks very agenty. It seems tricky to name "virtual worlds" in which AIs manipulate just "virtual resources" and still manage to do something like melting the GPUs.

Replies from: JamesPayor, ESRogs

↑ comment by James Payor (JamesPayor) · 2022-06-08T12:15:21.088Z · LW(p) · GW(p)

maybe a reasonable path forward is to try to wring as much productivity as we can out of the passive, superhuman, quasi-oracular just-dumb-data-predictors. And avoid as much as we can ever creating closed-loop, open-ended, free-rein agents.

I should say that I do see this as a reasonable path forward! But we don't seem to be coordinating to do this, and AI researchers seem to love doing work on open-ended agents, which sucks.

Hm, regardless it doesn't really move the needle, so long as people are publishing all of their work. Developing overpowered pattern recognizers is similar to increasing our level of hardware overhang. People will end up using them as components of systems that aren't safe.

Replies from: david-johnston

↑ comment by David Johnston (david-johnston) · 2022-06-08T22:18:24.296Z · LW(p) · GW(p)

Hm, regardless it doesn't really move the needle, so long as people are publishing all of their work. Developing overpowered pattern recognizers is similar to increasing our level of hardware overhang. People will end up using them as components of systems that aren't safe.

I strongly disagree. Gain of function research happens, but it's rare because people know it's not safe. To put it mildly, I think reducing the number of dangerous experiments substantially improves the odds of no disaster happening over any given time frame.

↑ comment by ESRogs · 2022-06-08T23:59:41.339Z · LW(p) · GW(p)

Can you visualize an agent that is not "open-ended" in the relevant ways, but is capable of, say, building nanotech and melting all the GPUs?

FWIW, I'm not sold on the idea of taking a single pivotal act. But, engaging with what I think is the real substance of the question — can we do complex, real-world, superhuman things with non-agent-y systems?

Yes, I think we can! Just as current language models can be prompt-programmed into solving arithmetic word problems, I think a future system could be led to generate a GPU-melting plan, without it needing to be a utility-maximizing agent.

For a very hand-wavy sketch of how that might go, consider asking GPT-N to generate 1000s of candidate high-level plans, then rate them by feasibility, then break each plan into steps and re-evaluate, etc.
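The control flow of that sketch might look something like the following. The functions `generate_plans`, `rate_feasibility`, and `decompose` are hypothetical stand-ins for model calls (here they return placeholder strings and random scores), and scoring a plan by its weakest step is just one assumed way to re-evaluate.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Plan:
    description: str
    steps: list = field(default_factory=list)
    score: float = 0.0

# Hypothetical stand-ins for calls to a language model.
def generate_plans(goal: str, n: int) -> list:
    return [Plan(f"{goal}: candidate {i}") for i in range(n)]

def rate_feasibility(text: str, rng: random.Random) -> float:
    return rng.random()  # placeholder score in [0, 1)

def decompose(plan: Plan, n_steps: int = 3) -> list:
    return [f"{plan.description} / step {k}" for k in range(n_steps)]

def search(goal: str, n_candidates: int = 1000, keep: int = 10, seed: int = 0) -> list:
    rng = random.Random(seed)
    # 1. Generate many candidate high-level plans.
    plans = generate_plans(goal, n_candidates)
    # 2. Rate each one and keep a shortlist.
    for plan in plans:
        plan.score = rate_feasibility(plan.description, rng)
    shortlist = sorted(plans, key=lambda p: p.score, reverse=True)[:keep]
    # 3. Break each shortlisted plan into steps and re-evaluate.
    for plan in shortlist:
        plan.steps = decompose(plan)
        # Assumption: a plan is only as feasible as its weakest step.
        plan.score = min(rate_feasibility(step, rng) for step in plan.steps)
    return sorted(shortlist, key=lambda p: p.score, reverse=True)

ranked = search("example goal")
print(ranked[0].description, ranked[0].score)
```

The outer loop is a fixed, human-written generate-filter-refine procedure; all of the open questions are about what happens inside the three stubbed calls.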

Or, alternatively, imagine the cognitive steps you might take if you were trying to come up with a GPU-melting plan (or alternatively a pivotal act plan in general). Do any of those steps really require that you have a utility function or that you're a goal-directed agent?

It seems to me that we need some form of search, and discrimination and optimization. But not necessarily any more than GPT-3 already has. (It would just need to be better at the search. And we'd need to make many many passes through the network to complete all the cognitive steps.)

On your view, what am I missing here?

Is GPT-3 already more of an agent than I realize? (If so, is it dangerous?)
Will GPT-N by default be more of an agent than GPT-3?
Are our own thought processes making use of goal-directedness more than I realize?
Will prompt-programming passive systems hit a wall somewhere?
- If so, what are some of the simplest cognitive tasks that we can do that you think such systems wouldn't be able to do?
- (See also my similar question here [LW(p) · GW(p)].)

Replies from: david-johnston, TekhneMakre

↑ comment by David Johnston (david-johnston) · 2022-06-09T00:23:28.487Z · LW(p) · GW(p)

For a very hand-wavy sketch of how that might go, consider asking GPT-N to generate 1000s of candidate high-level plans, then rate them by feasibility, then break each plan into steps and re-evaluate, etc

FWIW, I'd call this "weakly agentic" in the sense that you're searching through some options, but the number of options you're looking through is fairly small.

It's plausible that this is enough to get good results and also avoid disasters, but it's actually not obvious to me. The basic reason: if the top 1000 plans are good enough to get superior performance, they might also be "good enough" to be dangerous. While it feels like there's some separation between "useful and safe" and "dangerous" plans and this scheme might yield plans all of the former type, I don't presently see a stronger reason to believe that this is true.

Replies from: ESRogs

↑ comment by ESRogs · 2022-06-09T00:58:37.384Z · LW(p) · GW(p)

Separately from whether the plans themselves are safe or dangerous, I think the key question is whether the process that generated the plans is trying to deceive you (so it can break out into the real world or whatever).

If it's not trying to deceive you, then it seems like you can just build in various safeguards (like asking, "is this plan safe?", as well as more sophisticated checks), and be okay.

↑ comment by TekhneMakre · 2022-06-09T00:16:59.567Z · LW(p) · GW(p)

>then rate them by feasibility,

I mean, literal GPT is just going to have poor feasibility ratings for novel engineering concepts.

>Do any of those steps really require that you have a utility function or that you're a goal-directed agent?

Yes, obviously. You have to make many scientific and engineering discoveries, which involves goal-directed investigation.

> Are our own thought processes making use of goal-directedness more than I realize?

Yes, you know which ideas make sense by generalizing from ideas more closely tied in with the actions you take directed towards living.

↑ comment by David Johnston (david-johnston) · 2022-06-08T04:52:11.889Z · LW(p) · GW(p)

What do you think of a claim like "most of the intelligence comes from the steps where you do most of the optimization"? A corollary of this is that we particularly want to make sure optimization intensive steps of AI creation are safe WRT not producing intelligent programs devoted to killing us.

Example: most of the "intelligence" of language models comes from the supervised learning step. However, it's in-principle plausible that we could design e.g. some really capable general purpose reinforcement learner where the intelligence comes from the reinforcement, and the latter could (but wouldn't necessarily) internalise "agenty" behaviour.

I have a vague impression that this is already something other people are thinking about, though maybe I read too much into some tangential remarks in this direction. E.g. I figured the concern about mesa-optimizers was partly motivated by the idea that we can't always tell when an optimization intensive step is taking place.

I can easily imagine people blundering into performing unsafe optimization-intensive AI creation processes. Gain of function pathogen research would seem to be a relevant case study here, except we currently have less idea about what kind of optimization makes deadly AIs vs what kind of optimization makes deadly pathogens. One of the worries (again, maybe I'm reading too far into comments that don't say this explicitly) is that the likelihood of such a blunder approaches 1 over long enough times, and the "pivotal act" framing is supposed to be about doing something that could change this (??)

That said, it seems that there's a lot that could be done to make it less likely in short time frames.

Replies from: ESRogs

↑ comment by ESRogs · 2022-06-09T00:40:45.492Z · LW(p) · GW(p)

What do you think of a claim like "most of the intelligence comes from the steps where you do most of the optimization"? A corollary of this is that we particularly want to make sure optimization intensive steps of AI creation are safe WRT not producing intelligent programs devoted to killing us.

This seems probably right to me.

Example: most of the "intelligence" of language models comes from the supervised learning step. However, it's in-principle plausible that we could design e.g. some really capable general purpose reinforcement learner where the intelligence comes from the reinforcement, and the latter could (but wouldn't necessarily) internalise "agenty" behaviour.

I agree that reinforcement learners seem more likely to be agent-y (and therefore scarier) than self-supervised learners.

comment by Steven Byrnes (steve2152) · 2022-06-06T01:19:26.400Z · LW(p) · GW(p)

I agree with pretty much everything here, and I would add into the mix two more claims that I think are especially cruxy and therefore should maybe be called out explicitly to facilitate better discussion:

Claim A: “There’s no defense against an out-of-control omnicidal AGI, not even with the help of an equally-capable (or more-capable) aligned AGI, except via aggressive outside-the-Overton-window acts like preventing the omnicidal AGI from being created in the first place.”

I think this claim is true, on account of gray goo and lots of other things, and I suspect Eliezer does too, and I’m pretty sure other people disagree with this claim.

If someone disagrees with this claim (i.e., if they think that if DeepMind can make an aligned and Overton-window-abiding “helper” AGI, then we don’t have to worry about Meta making a similarly-capable out-of-control omnicidal misaligned AGI the following year, because DeepMind’s AGI will figure out how to protect us), and also believes in extremely slow takeoff, I can see how such a person might be substantially less pessimistic about AGI doom than I am.

Claim B: “Shortly after (i.e., years not decades after) we have dangerous AGI, we will have dangerous AGI requiring amounts of compute that many many many actors have access to.”

Again I think this claim is true [LW · GW], and I suspect Eliezer does too. In fact, my guess is that there are already single GPU chips with enough FLOP/s to run human-level, human-speed, AGI, or at least in that ballpark. All that we need is to figure out the right learning algorithms, which of course is happening as we speak.

If someone disagrees with this claim, I think they could plausibly be less pessimistic than I am about prospects for coordinating not to build AGI, or coordinating in other ways, because it just wouldn’t be that many actors, and maybe they could all be accounted for and reach agreement (e.g. after a headline-grabbing near-miss catastrophe or something).

(I think most people in AI alignment, especially “scaling hypothesis” people, are expecting early AGIs to involve truly mindboggling amounts of compute, followed by some very long period where the required compute very gradually decreases on account of algorithmic advances. That’s not what I expect; instead I expect the discovery of new better learning algorithms with a different scaling curve that zooms to AGI and beyond quite quickly.)

Replies from: CarlShulman, quintin-pope, lc, MichaelStJules

↑ comment by CarlShulman · 2022-06-06T01:36:53.760Z · LW(p) · GW(p)

I think this claim is true, on account of gray goo and lots of other things, and I suspect Eliezer does too, and I’m pretty sure other people disagree with this claim.

If you have robust alignment, or AIs that are rapidly bootstrapping their level of alignment fast enough to outpace the danger of increased capabilities, aligned AGI could get through its intelligence explosion to get radically superior technology and capabilities. It could also get a hard start on superexponential replication in space, so that no follower could ever catch up, and enough tech and military hardware to neutralize any attacks on it (and block attacks on humans via nukes, bioweapons, robots, nanotech, etc). That wouldn't work if there are things like vacuum collapse available to attackers, but we don't have much reason to expect that from current science and the leading aligned AGI would find out first.

That could be done without any violation of the territory of other sovereign states. The legality of grabbing space resources is questionable in light of the Outer Space Treaty, but commercial exploitation of asteroids is in the Overton window. The superhuman AGI would also be in a good position to persuade and trade with any other AGI developers.

Again I think this claim is true [LW · GW], and I suspect Eliezer does too. In fact, my guess is that there are already single GPU chips with enough FLOP/s to run human-level, human-speed, AGI, or at least in that ballpark.

An A100 may have humanlike FLOP/s but has only 80 GB of memory, probably orders of magnitude less memory per operation than brains. Stringing together a bunch of them makes it possible to split up human-size models and run them faster/in parallel on big batches using the extra operations.

Replies from: MichaelStJules

↑ comment by MichaelStJules · 2022-06-07T01:08:45.049Z · LW(p) · GW(p)

A bit pedantic, but isn't superexponential replication too fast? Won't it hit physical limits eventually, e.g. expanding at the speed of light in each direction, so at most a cubic function of time?

Also, never allowing followers to catch up means abandoning at least some or almost all of the space you passed through. Plausibly you could take most of the accessible and useful resources with you, which would also make it harder for pursuers to ever catch up, since they will plausibly need to extract resources every now and then to fuel further travel. On the other hand, it seems unlikely to me that we could extract or destroy resources quickly enough to not leave any behind for pursuers, if they're at most months behind.

Replies from: CarlShulman

↑ comment by CarlShulman · 2022-06-09T01:35:28.443Z · LW(p) · GW(p)

Naturally it doesn't go on forever, but any situation where you're developing technologies that move you to successively faster exponential trajectories is superexponential overall for some range. E.g. if you have robot factories that can reproduce exponentially until they've filled much of the Earth or solar system, and they are also developing faster reproducing factories, the overall process is superexponential. So is the history of human economic growth, and the improvement from an AI intelligence explosion.
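
As a toy illustration of "successively faster exponential trajectories" (the starting rate and the 10% per-step improvement are invented numbers, not estimates of anything), one can compare a population that grows at a fixed rate with one whose growth rate itself keeps improving. For plain exponential growth the log of the population rises by a constant amount each step; in the improving case those log-increments keep getting larger, which is what "superexponential over some range" means here.

```python
import math

def simulate(steps, rate=0.1, rate_growth=1.0):
    """Population after each step. rate_growth > 1 models reproduction
    technology that itself keeps getting faster (a hypothetical parameter)."""
    population = 1.0
    history = []
    for _ in range(steps):
        population *= 1.0 + rate   # grow at the current rate
        rate *= rate_growth        # then the rate itself improves
        history.append(population)
    return history

def log_increments(history):
    """Step-to-step change in log(population)."""
    logs = [math.log(x) for x in history]
    return [b - a for a, b in zip(logs, logs[1:])]

plain = simulate(20, rate_growth=1.0)      # ordinary exponential
improving = simulate(20, rate_growth=1.1)  # rate improves 10% per step

print("exponential log-increments:     ", [round(x, 3) for x in log_increments(plain)[:5]])
print("superexponential log-increments:", [round(x, 3) for x in log_increments(improving)[:5]])
```

The sketch says nothing about where the process saturates; it only shows the shape of the early phase.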

By the time you're at ~cubic expansion, being ahead in the early superexponential phase means the followers have missed their chance.

Replies from: MichaelStJules

↑ comment by MichaelStJules · 2022-06-09T08:20:45.626Z · LW(p) · GW(p)

I agree that they probably would have missed their chance to catch up with the frontier of your expansion.

Maybe an electromagnetic radiation-based assault could reach you if targeted (the speed of light is constant relative to you in a vacuum, even if you're traveling in the same direction), although unlikely to get much of the frontier of your expansion, and there are plausibly effective defenses, too.

Do you also mean they wouldn't be able to take most of what you've passed through, though? Or it wouldn't matter? If so, how would this be guaranteed (without any violation of the territory of sovereign states on Earth)? Exhaustive extraction in space? An advantage in armed space conflicts?

↑ comment by Quintin Pope (quintin-pope) · 2022-06-06T01:33:18.797Z · LW(p) · GW(p)

I agree with these two points. I think an aligned AGI actually able to save the world would probably take initial actions that look pretty similar to those an unaligned AGI would take. Lots of seizing power, building nanotech, colonizing out into space, self-replication, etc.

Replies from: yitz

↑ comment by Yitz (yitz) · 2022-06-07T05:23:10.260Z · LW(p) · GW(p)

So how would we know the difference (for the first few years at least)?

Replies from: quintin-pope

↑ comment by Quintin Pope (quintin-pope) · 2022-06-07T06:51:04.541Z · LW(p) · GW(p)

If it kills you, then it probably wasn’t aligned.

Replies from: None

↑ comment by [deleted] · 2022-06-10T19:29:25.700Z · LW(p) · GW(p)

Maybe it did that to save your neural weights. Define 'kill'.

Replies from: quintin-pope

↑ comment by Quintin Pope (quintin-pope) · 2022-06-10T20:08:28.156Z · LW(p) · GW(p)

I did say “probably”!

↑ comment by lc · 2022-06-06T20:25:26.630Z · LW(p) · GW(p)

If someone disagrees with this claim (i.e., if they think that if DeepMind can make an aligned and Overton-window-abiding “helper” AGI, then we don’t have to worry about Meta making a similarly-capable out-of-control omnicidal misaligned AGI the following year, because DeepMind’s AGI will figure out how to protect us), and also believes in extremely slow takeoff, I can see how such a person might be substantially less pessimistic about AGI doom than I am.

I disagree with this claim inasmuch as I expect a year's head start by an aligned AI is absolutely enough to prevent Meta from killing me and my family.

Replies from: steve2152

↑ comment by Steven Byrnes (steve2152) · 2022-06-06T21:08:36.255Z · LW(p) · GW(p)

Depends on what DeepMind does with the AI, right?

Maybe DeepMind uses their AI in very narrow, safe, low-impact ways to beat ML benchmarks, or read lots of cancer biology papers and propose new ideas about cancer treatment.

Or alternatively, maybe DeepMind asks their AI to undergo recursive self-improvement and build nano-replicators in space, etc., like in Carl Shulman’s reply [LW(p) · GW(p)].

I wouldn’t have thought that the latter is really in the Overton window. But what do I know.

You could also say “DeepMind will just ask their AI what they should do next”. If they do that, then maybe the AI (if they’re doing really great on safety such that the AI answers honestly and helpfully) will reply: “Hey, here’s what you should do, you should let me undergo recursive-self-improvement, and then I’ll be able to think of all kinds of crazy ways to destroy the world, and then I can think about how to defend against all those things”. But if DeepMind is being methodical & careful enough that their AI hasn’t destroyed the world already by this point, I’m inclined to think that they’re also being methodical & careful enough that when the AI proposes to do that, DeepMind will say, “Umm, no, that’s totally nuts and super dangerous, definitely don’t do that, at least don’t do it right now.” And then DeepMind goes back to publishing nice papers on cancer and on beating ML benchmarks and so on for a few more months, and then Meta’s AI kills everyone.

What were you assuming?

Replies from: lc

↑ comment by lc · 2022-06-06T21:10:34.628Z · LW(p) · GW(p)

If DeepMind was committed enough to successfully build an aligned AI (which, as extensively elaborated upon in the post, is a supernaturally difficult proposition), I would assume they understand why running it is necessary. There's no reason to take all of the outside-the-overton-window measures indicated in the above post unless you have functioning survival instincts and have thought through the problem sufficiently to hit the green button.

↑ comment by MichaelStJules · 2022-06-06T23:56:56.824Z · LW(p) · GW(p)

If you can build one aligned superintelligence, then plausibly you can

1. explain to other AGI developers how to make theirs safe or even just give them a safe design (maybe homomorphically encrypted to prevent modification, but they might not trust that), and
2. have aligned AGI monitoring the internet and computing resources, and alert authorities of anomalies that might signal new AGI developments. Require that AGI developments provide proof that they were designed according to one of a set of approved designs, or pass some tests determined by your aligned superintelligence.

Then aligned AGI can proliferate first and unaligned AGI will plausibly face severe barriers.

Plausibly 1 is enough, since there's enough individual incentive to build something safe or copy other people's designs and save work. 2 depends on cooperation with authorities and I'd guess cloud computing service providers or policy makers.

Replies from: steve2152

↑ comment by Steven Byrnes (steve2152) · 2022-06-07T01:26:26.884Z · LW(p) · GW(p)

explain to other AGI developers how to make theirs safe or even just give them a safe design (maybe homomorphically encrypted to prevent modification, but they might not trust that)

What if the next would-be AGI developer rejects your “explanation”, and has their own great ideas for how to make an even better next-gen AGI that they claim will work better, and so they discard your “gift” and proceed with their own research effort?

I can think of at least two leaders of would-be AGI development efforts (namely Yann LeCun [LW · GW] of Meta and Jeff Hawkins [LW · GW] of Numenta) who believe (what I consider to be) spectacularly stupid things about AGI x-risk, and have believed those things consistently for decades, despite extensive exposure to good counter-arguments.

Or what if the next would-be AGI developer agrees with you and accepts your “gift”, and so does the one after that, and the one after that, but not the twelfth one?

have aligned AGI monitoring the internet and computing resources, and alert authorities of [anomalies] that might signal new AGI developments. Require that AGI developments provide proof that they were designed according to one of a set of approved designs, or pass some tests determined by your aligned superintelligence.

What if the authorities don’t care? What if the authorities in most countries do care, but not the authorities in every single country? (For example, I’d be surprised if Russia would act on friendly advice from USA politicians to go arrest programmers and shut down companies.)

What if the only way to “monitor the internet and computing resources” is to hack into every data center and compute cluster on the planet? (Including those in secret military labs.) That’s very not legal, and very not in the Overton window, right? Can you really imagine DeepMind management approving their aligned AGI engaging in those activities? I find that hard to imagine.

Replies from: MichaelStJules

↑ comment by MichaelStJules · 2022-06-07T04:15:16.164Z · LW(p) · GW(p)

When you ask "what if", are you implying these things are basically inevitable? And inevitable no matter how much more compute aligned AGIs have before unaligned AGIs are developed and deployed? How much of a disadvantage against aligned AGIs does an unaligned AGI need before doom isn't overwhelmingly likely? What's the goal post here for survival probability?

You can have AGIs monitoring for pathogens, nanotechnology, other weapons, and building defenses against them, and this could be done locally and legally. They can monitor transactions and access to websites through which dangerous physical systems (including possibly factories, labs, etc.) could be taken over or built. Does every country need to be competent and compliant to protect just one country from doom?

The Overton window could also shift dramatically if omnicidal weapons are detected.

I agree that plausibly not every country with significant compute will comply, and hacking everyone is outside the public Overton window. I wouldn't put hacking everyone past the NSA, but also wouldn't count on them either.

Replies from: steve2152

↑ comment by Steven Byrnes (steve2152) · 2022-06-07T13:33:14.970Z · LW(p) · GW(p)

When you ask "what if", are you implying these things are basically inevitable?

Let’s see, I think “What if the next would-be AGI developer rejects your “explanation” / “gift”” has a probability that asymptotes to 100% as the number of would-be AGI developers increases. (Hence “Claim B” above [LW(p) · GW(p)] becomes relevant.) I think “What if the authorities in most countries do care, but not the authorities in every single country?” seems to have high probability in today’s world, although of course I endorse efforts to lower the probability. I think “What if the only way to “monitor the internet and computing resources” is to hack into every data center and compute cluster on the planet? (Including those in secret military labs.)” seems very likely to me, conditional on “Claim B” above [LW(p) · GW(p)].
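
To spell out the "asymptotes to 100%" arithmetic under one simplifying assumption (each would-be developer independently rejects the explanation with the same probability, which is not something the thread establishes, and the 5% figure is purely illustrative), the chance that at least one of n developers rejects it is 1 - (1 - p)^n:

```python
def p_at_least_one_rejects(p_reject, n_developers):
    """Probability that at least one of n independent developers rejects
    the offered design, given a per-developer rejection probability."""
    return 1.0 - (1.0 - p_reject) ** n_developers

# Hypothetical per-developer rejection probability of 5%.
for n in (1, 12, 50, 200):
    print(n, round(p_at_least_one_rejects(0.05, n), 3))
```

If rejections are correlated (for example, if early adopters set a norm that later developers follow), the curve can rise much more slowly than this.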

You can have AGIs monitoring for pathogens, nanotechnology, other weapons, and building defenses against them, and this could be done locally and legally.

Hmm.

Offense-defense balance in bio-warfare is not obvious to me. Preventing a virus from being created would seem to require 100% compliance by capable labs, but I’m not sure how many “capable labs” there are, or how geographically distributed and rule-following. Once the virus starts spreading, aligned AGIs could help with vaccines, but apparently a working COVID-19 vaccine was created in 1 day, and that didn’t help much, for various societal coordination & governance reasons. So then you can say “Maybe aligned AGI will solve all societal coordination and governance problems”. And maybe it will! Or, maybe some of those coordination & governance problems come from blame-avoidance and conflicts-of-interest and status-signaling and principal-agent problems and other things that are not obviously solvable by easy access to raw intelligence. I don’t know.

Offense-defense balance in nuclear warfare is likewise not obvious to me. I presume that an unaligned AGI could find a way to manipulate nuclear early warning systems (trick them, hack into them, bribe or threaten their operators, etc.) to trigger all-out nuclear war, after hacking into a data center in New Zealand that wouldn’t be damaged. An aligned AGI playing defense would need to protect against these vulnerabilities. I guess the bad scenario that immediately jumps into my mind is that aligned AGI is not ubiquitous in Russia, such that there are still bribe-able / trickable individuals working at radar stations in Siberia, and/or that military people in some or all countries don’t trust the aligned AGI enough to let it anywhere near the nuclear weapons complex.

Offense-defense balance in gray goo seems very difficult for me to speculate about. (Assuming gray goo is even possible.) I don’t have any expertise here, but I would assume that the only way to protect against gray goo (other than prevent it from being created) is to make your own nanobots that spread around the environment, which seems like a thing that humans plausibly wouldn’t actually agree to do, even if it was technically possible and an AGI was whispering in their ear that there was no better alternative. Preventing gray goo from being created would (I presume) require 100% compliance by “capable labs”, and as above I’m not sure what “capable labs” actually look like, how hard they are to create, what countries they’re in, etc.

To be clear, I feel much less strongly about “Pivotal act is definitely necessary”, and much more strongly that this is something where we need to figure out the right answer and make it common knowledge. So I appreciate this pushback!! :-) :-)

Replies from: MichaelStJules

↑ comment by MichaelStJules · 2022-06-07T16:54:19.764Z · LW(p) · GW(p)

Some more skepticism about infectious diseases and nukes killing us all here: https://www.lesswrong.com/posts/MLKmxZgtLYRH73um3/we-will-be-around-in-30-years?commentId=DJygArj3sj8cmhmme [LW(p) · GW(p)]

Also my more general skeptical take against non-nano attacks here: https://www.lesswrong.com/posts/MLKmxZgtLYRH73um3/we-will-be-around-in-30-years?commentId=TH4hGeXS4RLkkuNy5 [LW(p) · GW(p)]

With nanotech, I think there will be tradeoffs between targeting effectiveness and requiring (EM) signals from computers that can be effectively interfered with through things within or closer to the Overton window. Maybe a crux is how good autonomous nanotech with no remote control would be at targeting humans or spreading so much that it just gets into almost all buildings or food or water because it's basically going everywhere.

Replies from: steve2152

↑ comment by Steven Byrnes (steve2152) · 2022-06-07T18:02:15.505Z · LW(p) · GW(p)

Thanks!

I wasn’t assuming the infectious diseases and nukes by themselves would kill us all. They don’t have to, because the AGI can do other things in conjunction, like take command of military drones and mow down the survivors (or bomb the PPE factories), or cause extended large-scale blackouts, which would incidentally indirectly prevent PPE production and distribution, along with preventing pretty much every other aspect of an organized anti-pandemic response.

See Section 1.6 here [LW · GW].

So that brings us to the topic of offense-defense balance for illicitly taking control of military drones. And I would feel concerned about substantial delays before the military trusts a supposedly-aligned AGI so much that they give it root access to all its computer systems (which in turn seems necessary if the aligned AGI is going to be able to patch all the security holes, defend against spear-phishing attacks, etc.) Of course there’s the usual caveat that maybe DeepMind will give their corrigible aligned AGI permission to hack into military systems (for their own good!), and then maybe we wouldn’t have to worry. But the whole point of this discussion is that I’m skeptical that DeepMind would actually give their AGI permission to do something like that.

And likewise we would need to talk about offense-defense balance for the power grid. And I would have the same concern about people being unwilling to give a supposedly-aligned AGI root access to all the power grid computers. And I would also be concerned about other power grid vulnerabilities like nuclear EMPs, drone attacks on key infrastructure, etc.

And likewise, what’s the offense-defense balance for mass targeted disinformation campaigns? Well, if DeepMind gives its AGI permission to engage in a mass targeted counter-disinformation campaign, maybe we’d be OK on that front. But that’s a big “if”!

…And probably dozens of other things like that.

Maybe a crux is how good autonomous nanotech with no remote control would be at targeting humans or spreading so much that it just gets into almost all buildings or food or water because it's basically going everywhere.

Seems like a good question, and maybe difficult to resolve. Or maybe I would have an opinion if I ever got around to reading Eric Drexler’s books etc. :)

Replies from: MichaelStJules

↑ comment by MichaelStJules · 2022-06-07T20:11:21.840Z · LW(p) · GW(p)

I think there would be too many survivors and enough manned defense capability for existing drones to directly kill the rest of us with high probability. Blocking PPE production and organized pandemic responses still won't stop people from self-isolating, doing no contact food deliveries, etc., although things would be tough, and deliveries and food production would be good targets for drone strikes. It could be bad if lethal pathogens become widespread and practically unremovable in our food/water, or if food production is otherwise consistently attacked, but the militaries would probably step in to protect the food/water supplies.

I think, overall, there are too few ways to reliably kill double or even single digit percentages of the human population with high probability and that can be combined to get basically everyone with high probability. I'm not saying there aren't any, but I'm skeptical that there are enough. There are diminishing returns on doing the same ones (like pandemics) more, because of resistance, and enough people being personally very careful or otherwise difficult targets.

comment by John Schulman (john-schulman) · 2022-06-06T16:39:48.049Z · LW(p) · GW(p)

Found this to be an interesting list of challenges, but I disagree with a few points. (Not trying to be comprehensive here, just a few thoughts after the first read-through.)

Several of the points here are premised on needing to do a pivotal act that is way out of distribution from anything the agent has been trained on. But it's much safer to deploy AI iteratively; increasing the stakes, time horizons, and autonomy a little bit each time. With this iterative approach to deployment, you only need to generalize a little bit out of distribution. Further, you can use Agent N to help you closely supervise Agent N+1 before giving it any power.
One claim is that Capabilities generalize further than alignment once capabilities start to generalize far. The argument is that an agent's world model and tactics will be automatically fixed by reasoning and data, but its inner objective won't be changed by these things. I agree with the preceding sentence, but I would draw a different (and more optimistic) conclusion from it. That it might be possible to establish an agent's inner objective when training on easy problems, when the agent isn't very capable, such that this objective remains stable as the agent becomes more powerful.
Also, there's empirical evidence that alignment generalizes surprisingly well: several thousand instruction following examples radically improve the aligned behavior on a wide distribution of language tasks (InstructGPT paper), and a prompt with about 20 conversations gives much better behavior on a wide variety of conversational inputs (HHH paper). Making a contemporary language model well-behaved seems to be much easier than teaching it a new cognitive skill.
Human raters make systematic errors - regular, compactly describable, predictable errors.... This is indeed one of the big problems of outer alignment, but there's lots of ongoing research and promising ideas for fixing it. Namely, using models to help amplify and improve the human feedback signal. Because P!=NP it's easier to verify proofs than to write them. Obviously alignment isn't about writing proofs, but the general principle does apply. You can reduce "behaving well" to "answering questions truthfully" by asking questions like "did the agent follow the instructions in this episode?", and use those to define the reward function. These questions are not formulated in formal language where verification is easy, but there's reason to believe that verification is also easier than proof-generation for informal arguments.

Replies from: Vaniver, Eliezer_Yudkowsky

↑ comment by Vaniver · 2022-06-06T18:39:15.280Z · LW(p) · GW(p)

But it's much safer to deploy AI iteratively; increasing the stakes, time horizons, and autonomy a little bit each time. With this iterative approach to deployment, you only need to generalize a little bit out of distribution. Further, you can use Agent N to help you closely supervise Agent N+1 before giving it any power.

My model of Eliezer claims that there are some capabilities that are 'smooth', like "how large a times table you've memorized", and some are 'lumpy', like "whether or not you see the axioms behind arithmetic." While it seems plausible that we can iteratively increase smooth capabilities, it seems much less plausible for lumpy capabilities.

A specific example: if you have a neural network with enough capacity to 1) memorize specific multiplication Q+As and 2) implement a multiplication calculator, my guess is that during training you'll see a discontinuity in how many pairs of numbers it can successfully multiply.[1] It is not obvious to me whether or not there are relevant capabilities like this that we'll "find with neural nets" instead of "explicitly programming in"; probably we will just build AlphaZero so that it uses MCTS instead of finding MCTS with gradient descent, for example.

[edit: actually, also I don't think I get how you'd use a 'smaller times table' to oversee a 'bigger times table' unless you already knew how arithmetic worked, at which point it's not obvious why you're not just writing an arithmetic program.]

That it might be possible to establish an agent's inner objective when training on easy problems, when the agent isn't very capable, such that this objective remains stable as the agent becomes more powerful.

IMO this runs into two large classes of problems, both of which I put under the heading 'ontological collapse'.

First, suppose the agent's inner objective is internally located: "seek out pleasant tastes." Then you run into 16 and 17, where you can't quite be sure what it means by "pleasant tastes", and you don't have a great sense of what "pleasant tastes" will extrapolate to at the next level of capabilities. [One running "joke" in EA is that, on some theories of what morality is about, the highest-value universe is one which contains an extremely large number of rat brains on heroin. I think this is the correct extrapolation / maximization of at least one theory which produces good behavior when implemented by humans today, which makes me pretty worried about this sort of extrapolation.]

Second, suppose the agent's inner objective is externally located: "seek out mom pressing the reward button". Then you run into 18, which argues that once the agent realizes that the 'reward button' is an object in its environment instead of a communication channel between the human and itself, it may optimize for the object instead of 'being able to hear what the human would freely communicate' or whatever philosophically complicated variable it is that we care about. [Note that attempts to express this often need multiple patches and still aren't fixed; "mom approves of you" can be coerced, "mom would freely approve of you" has a trouble where you have some freedom in identifying your concept of 'mom' which means you might pick one who happens to approve of you.]

there's lots of ongoing research and promising ideas for fixing it.

I'm optimistic about this too, but... I want to make sure we're looking at the same problem, or something? I think my sense is best expressed in Stanovich and West, where they talk about four responses to the presence of systematic human misjudgments. The 'performance error' response is basically the 'epsilon-rationality' assumption; 1-ε of the time humans make the right call, and ε of the time they make a random call. While a fine model of performance errors, it doesn't accurately predict what's happening with systematic errors, which are predictable instead of stochastic.

I sometimes see people suggest that the model should always or never conform to the human's systematic errors, but it seems to me like we need to somehow distinguish between systematic "errors" that are 'value judgments' ("oh, it's not that the human prefers 5 deaths to 1 death, it's that they are opposed to this 'murder' thing that I should figure out") and systematic errors that are 'bounded rationality' or 'developmental levels' ("oh, it's not that the (very young) human prefers less water to more water, it's that they haven't figured out conservation of mass yet"). It seems pretty sad if we embed all of our confusions into the AI forever--and also pretty sad if we end up not able to transfer any values because all of them look like confusions.[2]

[1] This might depend on what sort of curriculum you train it on; I was imagining something like 1) set the number of digits N=1, 2) generate two numbers uniformly at random between 1 and 2^N, pass them as inputs (sequence of digits?), 3) compare the sequence of digits outputted to the correct answer, either with a binary pass/fail or some sort of continuous similarity metric (so it gets some points for 12x12 = 140 or w/e); once it performs at 90% success check the performance for increased N until you get one with below 80% success and continue training. In that scenario, I think it just memorizes until N is moderately sized (8?), at which point it figures out how to multiply, and then you can increase N lots without losing accuracy (until you hit some overflow error in its implementation of multiplication from having large numbers).
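
A minimal sketch of the curriculum described in footnote [1], using the binary pass/fail scoring and the 90% / 80% thresholds from the text. The function names, the trial count, the `max_n` cap, and the two stand-in "models" (a lookup table and an exact multiplier) are assumptions made for illustration; no neural network is trained here, so the sketch only shows how the level-selection rule would respond to a memorizer versus something that has actually learned to multiply.

```python
import random

def sample_pair(n, rng):
    """Two integers drawn uniformly from [1, 2**n], as in the footnote."""
    hi = 2 ** n
    return rng.randint(1, hi), rng.randint(1, hi)

def success_rate(model, n, rng, trials=200):
    """Binary pass/fail scoring of model(a, b) against a * b."""
    hits = 0
    for _ in range(trials):
        a, b = sample_pair(n, rng)
        if model(a, b) == a * b:
            hits += 1
    return hits / trials

def next_level(model, n, rng, advance_at=0.9, stop_below=0.8, max_n=16):
    """If the model passes level n, probe larger n until success drops
    below stop_below; return the level to continue training on."""
    if success_rate(model, n, rng) < advance_at:
        return n
    while n < max_n:
        n += 1
        if success_rate(model, n, rng) < stop_below:
            return n
    return n  # never dropped below the threshold up to max_n

def make_memorizer(max_operand):
    """Stand-in for a network that has only memorized a times table."""
    table = {(a, b): a * b
             for a in range(1, max_operand + 1)
             for b in range(1, max_operand + 1)}
    return lambda a, b: table.get((a, b), -1)

def calculator(a, b):
    """Stand-in for a network that has learned to multiply."""
    return a * b

print("memorizer stalls at N =", next_level(make_memorizer(8), 1, random.Random(0)))
print("calculator reaches N =", next_level(calculator, 1, random.Random(0)))
```

The jump between the two printed levels is the kind of discontinuity the footnote guesses at; whether a real network would show it is an empirical question this sketch does not answer.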

[2] I'm being a little unfair in using the trolley problem as an example of a value judgment, because in my mind the people who think you shouldn't pull the lever because it's murder are confused or missing a developmental jump--but I have the sense that for most value judgments we could find, we can find some coherent position which views it as confused in this way.

Replies from: john-schulman

↑ comment by John Schulman (john-schulman) · 2022-06-07T01:04:42.290Z · LW(p) · GW(p)

Re: smooth vs bumpy capabilities, I agree that capabilities sometimes emerge abruptly and unexpectedly. Still, iterative deployment with gradually increasing stakes is much safer than deploying a model to do something totally unprecedented and high-stakes. There are multiple ways to make deployment more conservative and gradual. (E.g., incrementally increase the amount of work the AI is allowed to do without close supervision, incrementally increase the amount of KL-divergence between the new policy and a known-to-be-safe policy.)

Re: ontological collapse, there are definitely some tricky issues here, but the problem might not be so bad with the current paradigm, where you start with a pretrained model (which doesn't really have goals and isn't good at long-horizon control), and fine-tune it (which makes it better at goal-directed behavior). In this case, most of the concepts are learned during the pretraining phase, not the fine-tuning phase where it learns goal-directed behavior.

Replies from: Vaniver

↑ comment by Vaniver · 2022-06-07T03:54:43.280Z · LW(p) · GW(p)

Still, iterative deployment with gradually increasing stakes is much safer than deploying a model to do something totally unprecedented and high-stakes.

I agree with the "X is safer than Y" claim; I am uncertain whether it's practically available to us, and much more worried in worlds where it isn't available.

incrementally increase the amount of KL-divergence between the new policy and a known-to-be-safe policy

For this specific proposal, when I reframe it as "give the system a KL-divergence budget to spend on each change to its policy" I worry that it works against a stochastic attacker but not an optimizing attacker; it may be the case that every known-to-be-safe policy has some unsafe policy within a reasonable KL-divergence of it, because the danger can be localized in changes to some small part of the overall policy-space.
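
A toy calculation of the "danger can be localized" worry, under assumptions that are mine rather than anything from the thread: 1000 equally weighted states, two actions per state, KL measured from the new policy to the known-safe one, and a hypothetical budget of 0.01 nats. Flipping the behavior in a single state costs very little average KL even though the policy now almost always takes the harmful action there.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

N_STATES = 1000
BUDGET = 0.01  # hypothetical per-update KL budget, in nats

# Each row is [P(benign action), P(harmful action)] in one state.
safe_policy = [[0.999, 0.001] for _ in range(N_STATES)]

# New policy: identical everywhere except one state, where it is inverted.
new_policy = [row[:] for row in safe_policy]
new_policy[0] = [0.001, 0.999]

# Average KL over states, assuming every state is visited equally often.
mean_kl = sum(kl(n, s) for n, s in zip(new_policy, safe_policy)) / N_STATES

print("mean KL:", round(mean_kl, 5), "within budget:", mean_kl < BUDGET)
print("P(harmful) in the changed state:", new_policy[0][1])
```

A budget enforced per state (a max rather than a mean) would catch this particular example, at the cost of also blocking large benign changes in individual states.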

the problem might not be so bad with the current paradigm, where you start with a pretrained model (which doesn't really have goals and isn't good at long-horizon control), and fine-tune it (which makes it better at goal-directed behavior). In this case, most of the concepts are learned during the pretraining phase, not the fine-tuning phase where it learns goal-directed behavior.

Yeah, I agree that this seems pretty good. I do naively guess that when you do the fine-tuning, it's the concepts that are most related to the goals that change the most (as they have the most gradient pressure on them); it'd be nice to know how much this is the case, vs. most of the relevant concepts being durable parts of the environment that were already very important for goal-free prediction.

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-06T23:48:23.990Z · LW(p) · GW(p)

Several of the points here are premised on needing to do a pivotal act that is way out of distribution from anything the agent has been trained on. But it's much safer to deploy AI iteratively; increasing the stakes, time horizons, and autonomy a little bit each time.

To do what, exactly, in this nice iterated fashion, before Facebook AI Research destroys the world six months later? What is the weak pivotal act that you can perform so safely?

Human raters make systematic errors - regular, compactly describable, predictable errors.... This is indeed one of the big problems of outer alignment, but there's lots of ongoing research and promising ideas for fixing it. Namely, using models to help amplify and improve the human feedback signal. Because P!=NP it's easier to verify proofs than to write them.

When the rater is flawed, cranking up the power to NP levels blows up the P part of the system.

Replies from: john-schulman, charbel-raphael-segerie, jrincayc

↑ comment by John Schulman (john-schulman) · 2022-06-07T00:51:02.102Z · LW(p) · GW(p)

To do what, exactly, in this nice iterated fashion, before Facebook AI Research destroys the world six months later? What is the weak pivotal act that you can perform so safely?

Do alignment & safety research, set up regulatory bodies and monitoring systems.

When the rater is flawed, cranking up the power to NP levels blows up the P part of the system.

Not sure exactly what this means. I'm claiming that you can make raters less flawed, for example, by decomposing the rating task, and providing model-generated critiques that help with their rating. Also, as models get more sample efficient, you can rely more on highly skilled and vetted raters.

Replies from: Vaniver

↑ comment by Vaniver · 2022-06-07T04:30:04.507Z · LW(p) · GW(p)

Not sure exactly what this means.

My read was that for systems where you have rock-solid checking steps, you can throw arbitrary amounts of compute at searching for things that check out and trust them, but if there's any crack in the checking steps, then things that 'check out' aren't trustable, because the proposer can have searched an unimaginably large space (from the rater's perspective) to find them. [And from the proposer's perspective, the checking steps are the real spec, not whatever's in your head.]
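
A toy version of this reading, with every number invented for illustration: candidates are integers, the checker agrees with true quality except on a rare class of inputs (the "crack") that it scores far too highly, and the proposer simply returns whatever the checker rates best within its search budget. A small search never reaches a crack and returns something genuinely good; a large search finds one and returns it.

```python
def is_crack(x):
    """Rare inputs on which the checker is wrong (hypothetical flaw)."""
    return x % 10007 == 10006

def true_quality(x):
    """What we actually care about, in [0, 1). Crack inputs are worthless."""
    if is_crack(x):
        return 0.0
    return ((x * 7919) % 1000) / 1000

def checker(x):
    """The rater: accurate, except that it wildly overrates crack inputs."""
    return 2.0 if is_crack(x) else true_quality(x)

def propose(budget):
    """Proposer optimizes the checker's score over the first `budget` candidates."""
    return max(range(budget), key=checker)

for budget in (1_000, 100_000):
    best = propose(budget)
    print(f"budget={budget}: checker score={checker(best)}, "
          f"true quality={true_quality(best)}")
```

The proposer here is not adversarial in any deliberate sense; it is only optimizing the score it is given, which is the sense in which the checking steps are the real spec.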

In general, I think we can get a minor edge from "checking AI work" instead of "generating our own work" and that doesn't seem like enough to tackle 'cognitive megaprojects' (like 'cure cancer' or 'develop a pathway from our current society to one that can reliably handle x-risk' or so on). Like, I'm optimistic about "current human scientists use software assistance to attempt to cure cancer" and "an artificial scientist attempts to cure cancer" and pretty pessimistic about "current human scientists attempt to check the work of an artificial scientist that is attempting to cure cancer." It reminds me of translators who complained pretty bitterly about being given machine-translated work to 'correct'; they basically still had to do it all over again themselves in order to determine whether or not the machine had gotten it right, and so it wasn't nearly as much of a savings as hoped.

Like the value of 'DocBot attempts to cure cancer' is that DocBot can think larger and wider thoughts than humans, and natively manipulate an opaque-to-us dense causal graph of the biochemical pathways in the human body, and so on; if you insist on DocBot only thinking legible-to-human thoughts, then it's not obvious it will significantly outperform humans.

↑ comment by Charbel-Raphaël (charbel-raphael-segerie) · 2022-06-08T21:41:06.959Z · LW(p) · GW(p)

If Facebook AI research is such a threat, wouldn't it be possible to talk to Yann LeCun?

Replies from: Eliezer_Yudkowsky, david-johnston

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-08T22:41:04.366Z · LW(p) · GW(p)

I did, briefly. I ask that you not do so yourself, or anybody else outside one of the major existing organizations, because I expect that will make things worse as you annoy him and fail to phrase your arguments in any way he'd find helpful.

Replies from: RobbBB, TekhneMakre

↑ comment by Rob Bensinger (RobbBB) · 2022-06-09T00:47:43.062Z · LW(p) · GW(p)

Other MIRI staff have also chatted with Yann. One co-worker told me that he was impressed with Yann's clarity of thought on related topics (e.g., he has some sensible, detailed, reductionist models of AI), so I'm surprised things haven't gone better.

Non-MIRI folks have talked to Yann too; e.g., Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More [LW · GW].

↑ comment by TekhneMakre · 2022-06-08T23:00:04.007Z · LW(p) · GW(p)

What happened?

Replies from: Eliezer_Yudkowsky, Raemon

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-09T00:23:39.769Z · LW(p) · GW(p)

Nothing much.

↑ comment by Raemon · 2022-06-09T00:38:06.480Z · LW(p) · GW(p)

There was also a debate between Yann and Stuart Russell on Facebook, which got discussed here:

https://www.lesswrong.com/posts/WxW6Gc6f2z3mzmqKs/debate-on-instrumental-convergence-between-lecun-russell [LW · GW]

For a more comprehensive writeup of some stuff related to the "annoy him and fail to phrase your arguments helpfully" point, see Idea Inoculation and Inferential Distance.

↑ comment by David Johnston (david-johnston) · 2022-06-09T00:53:48.328Z · LW(p) · GW(p)

My view is that if Yann continues to be interested in arguing about the issue then there's something to work with, even if he's skeptical, and the real worry is if he's stopped talking to anyone about it (I have no idea personally what his state of mind is right now)

↑ comment by jrincayc · 2022-06-19T22:48:29.332Z · LW(p) · GW(p)

To do what, exactly, in this nice iterated fashion, before Facebook AI Research destroys the world six months later? What is the weak pivotal act that you can perform so safely?

Produce the Textbook From The Future that tells us how to do AGI safely. That said, getting an AGI to generate a correct Foom safety textbook or AGI Textbook from the future would be incredibly difficult: an AGI could easily slip in a subtle, hard-to-detect inaccuracy that would make it worthless, verifying that it is correct would be very difficult, and getting all humans on earth to follow it would be very difficult.

comment by Wei Dai (Wei_Dai) · 2022-06-13T20:37:03.575Z · LW(p) · GW(p)

I think until recently, I've been consistently more pessimistic than Eliezer about AI existential safety. Here's a 2004 SL4 post for example where I tried to argue against MIRI (SIAI at the time) trying to build a safe AI (and again in 2011 [LW · GW]). I've made my own list of sources of AI risk [LW · GW] that's somewhat similar to this list. But it seems to me that there are still various "outs" from certain doom, such that my probability of a good outcome is closer to 20% (maybe a range of 10-30% depending on my mood) than 1%.

Human thought partially exposes only a partially scrutable outer surface layer. Words only trace our real thoughts. Words are not an AGI-complete data representation in its native style. The underparts of human thought are not exposed for direct imitation learning and can't be put in any dataset. This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents, which are only impoverished subsystems of human thoughts; unless that system is powerful enough to contain inner intelligences figuring out the humans, and at that point it is no longer really working as imitative human thought.

One of the biggest "outs" I see is that it turns out to be not that hard "to train a powerful system entirely on imitation of human words or other human-legible contents", and we (e.g., a relatively responsible AI lab) train such a system and then use it to differentially accelerate AI safety research. I definitely think that it's very risky [AF · GW] to rely on such black-box human imitations for existential safety, and that a competent civilization would be pursuing other plans where they can end up with greater certainty of success, but it seems there's something like a 20% chance that it just works out anyway.

To explain my thinking a bit more, human children have to learn how to think human thoughts through "imitation of human words or other human-legible contents". It's possible that they can only do this successfully because their genes contain certain key ingredients that enable human thinking, but it also seems possible that children are just implementations of some generic imitation learning algorithm, so our artificial learning algorithms (once they become advanced/powerful enough) won't be worse at learning to think like humans. I don't know how to rule out the latter possibility with very high confidence. Eliezer, if you do, can you please explain this more?

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-06-06T00:54:00.393Z · LW(p) · GW(p)

[This is a nitpick of the form "one of your side-rants went a bit too far IMO;" feel free to ignore]

The ability to do new basic work noticing and fixing those flaws is the same ability as the ability to write this document before I published it, which nobody apparently did, despite my having had other things to do than write this up for the last five years or so. Some of that silence may, possibly, optimistically, be due to nobody else in this field having the ability to write things comprehensibly - such that somebody out there had the knowledge to write all of this themselves, if they could only have written it up, but they couldn't write, so didn't try. ... The fact that, twenty-one years into my entering this death game, seven years into other EAs noticing the death game, and two years into even normies starting to notice the death game, it is still Eliezer Yudkowsky writing up this list, says that humanity still has only one gamepiece that can do that.

The third option this seems to miss is that there are people who could have written this document, but they also thought they had better things to do than write it. I'm thinking of people like Paul Christiano, Nate Soares, John Wentworth, Ajeya Cotra... there are dozens of people who have thought deeply about this stuff and also talked with you (Yudkowsky) and I bet they could have written something approximately as good as this if they tried. Perhaps, like you, they decided to instead spend their time working directly on the problem.

I do agree with you that they seem to on average be way way too optimistic, but I don't think it's because they are ignorant of the considerations and arguments you've made here.

A big source of optimism for Paul, for example, seems to be his timelines + views about takeoff speeds, which are mostly independent of the claims made in this post. I too would be cautiously optimistic if I thought we had 30 years left and that by the time things really went crazy we'd have decades of experience with just-slightly-dumber systems automating big chunks of the economy & AI alignment would be a big prestigious field with lots of geniuses being mentored by older geniuses etc. (Many of the points you make here would still apply, so it would still be a pretty scary situation...)

Replies from: RobbBB, gettinglesswrong, Eliezer_Yudkowsky

↑ comment by Rob Bensinger (RobbBB) · 2022-06-06T03:06:10.269Z · LW(p) · GW(p)

I'm thinking of people like Paul Christiano, Nate Soares, John Wentworth, Ajeya Cotra... [...] I do agree with you that they seem to on average be way way too optimistic, but I don't think it's because they are ignorant of the considerations and arguments you've made here.

I don't think Nate is that much more optimistic than Eliezer, but I believe Eliezer thinks Nate couldn't have generated enough of the list in the OP, or couldn't have generated enough of it independently ("using the null string as input").

↑ comment by gettinglesswrong · 2022-06-15T09:15:29.382Z · LW(p) · GW(p)

>I too would be cautiously optimistic if I thought we had 30 years left

This is a bit of an aside, but can I ask what the general opinion is on how many years we have left? Was your comment stating that it's optimistic to think we have 30 years left before AGI, or optimistic about the remainder of the sentence?

Replies from: daniel-kokotajlo

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-06-15T15:32:14.591Z · LW(p) · GW(p)

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-06T01:40:56.122Z · LW(p) · GW(p)

Seems implausible. Other people have much more stamina than I do, hence more time in practice, even if they are simultaneously doing other things.

It's admittedly true that nobody in this field except me can write things, in full generality; but one might have still expected a poorly written attempt to arise from somewhere if the knowledge-capability was widespread.

Replies from: CronoDAS, lc, sil-ver, yitz, handoflixue

↑ comment by CronoDAS · 2022-06-06T06:46:22.423Z · LW(p) · GW(p)

Would MIRI be interested in hiring a full time staff writer/editor? I feel like I could have produced a good chunk of this if I had thought I should try to, just from having hung around LessWrong since it was just Eliezer Yudkowsky and Robin Hanson blogging on Overcoming Bias, but I thought the basic "no, really, AI is going to kill us" arguments were already written up in other places, like Arbital and the book Superintelligence.

Replies from: Vaniver, Chris_Leong

↑ comment by Vaniver · 2022-06-06T14:37:08.433Z · LW(p) · GW(p)

Would MIRI be interested in hiring a full time staff writer/editor?

This is sort of still Rob's job, and it was my job from 2016-2019. If I recall correctly, my first major project was helping out with a document sort-of-like this document, which tried to explain to OpenPhil some details of the MIRI strategic view. [I don't think this was ever made public, and might be an interesting thing to pull out of the archives and publish now?]

If I tried to produce this document from scratch, I think it would have been substantially worse, tho I think I might have been able to reduce the time from "Eliezer's initial draft" to "this is published at all".

Replies from: conor-sullivan, RobbBB

↑ comment by Lone Pine (conor-sullivan) · 2022-06-06T20:50:06.007Z · LW(p) · GW(p)

From the perspective of persuading an alignment-optimist in the AI world, this document could not possibly have been worse. I don't know you Vaniver, but I'm confident you could have done a more persuasive job just by editing out the weird aspersion that EY is the only person capable of writing about alignment.

↑ comment by Rob Bensinger (RobbBB) · 2022-06-06T20:26:36.262Z · LW(p) · GW(p)

I think you're thinking of drafts mainly based on Nate's thinking rather than Eliezer's, but yeah, those are on my list of things to maybe release publicly in some form.

↑ comment by Chris_Leong · 2022-06-06T09:41:31.143Z · LW(p) · GW(p)

Yeah, either that or paying for writing lessons for alignment researchers if they really have to write the post themselves.

↑ comment by lc · 2022-06-06T02:31:42.863Z · LW(p) · GW(p)

How many people are there "in the field"? Fifty?

Replies from: Evan R. Murphy

↑ comment by Evan R. Murphy · 2022-06-08T21:26:31.832Z · LW(p) · GW(p)

100~200 are the latest estimates I've heard for total number of people working on AI alignment.

I don't have a great reference for that figure, but it's compatible with this slide from State of AI Report 2021 which claims "fewer than 100 researchers work on AI Alignment in 7 leading AI organisations", if you consider:

The report excluded independent alignment researchers and smaller/lesser-known AI organisations from their tally
It's at least a few months old [1] and the AI alignment field has been growing

[1]: I can't tell if Benaich and Hogarth published this sometime in 2021 or at the beginning of 2022, after 2021 had ended. Either way it's 5~18 months old.

↑ comment by Rafael Harth (sil-ver) · 2022-06-06T10:07:40.152Z · LW(p) · GW(p)

This document doesn't look to me like something a lot of people would try to write. Maybe it was one of the most important things to write, but not obviously so. Among the steps (1) get the idea to write out all reasons for pessimism, (2) resolve to try, (3) not give up halfway through, and (4) be capable, I would not guess that 4 is the strongest filter.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-06T10:37:18.694Z · LW(p) · GW(p)

I don't think I personally could have written it; if others think they could have, I'd genuinely be interested to hear them brag, even if they can't prove it.

Maybe the ideal would be 'I generated the core ideas of [a,b,c] with little or no argument from others; I had to be convinced of [d,e,f] but I now agree with them; I disagree with [g,h,i]; I think you left out important considerations [x,y,z].' Just knowing people's self-model is interesting to me, I don't demand that everything you believe be immediately provable to me.

Replies from: johnswentworth, steve2152, evhub, lc, swift_spiral, niplav

↑ comment by johnswentworth · 2022-06-06T15:38:46.504Z · LW(p) · GW(p)

I think as of early this year (like, January/February, before I saw a version of this doc) I could have produced a pretty similar list to this one. I definitely would not derive it from the empty string in the closest world-without-Eliezer; I'm unsure how much I'd pay attention to AI alignment at all in that world. I'd very likely be working on agent foundations in that world, but possibly in the context of biology or AI capabilities rather than alignment. Arguments about AI foom and doom were obviously-to-me correct once I paid attention to them at all, but not something I'd have paid attention to on my own without someone pointing them out.

Some specifics about the kind-of-doc I could have written early this year:

The framing around pivotal acts specifically was new-to-me when the late 2021 MIRI conversations [? · GW] were published. Prior to that, I'd have had to talk about how weak wish-granters are safe but not actually useful, and if we want safe AI which actually grants big wishes then we have to deal with the safety problems. Pivotal acts framing simplifies that part of the argument a lot by directly establishing a particular "big" capability which is necessary.
By early this year, I think I would have generated pretty similar points to basically everything in the post if I were trying to be really comprehensive. (In practice, writing a post like this, I would go for more unifying structure and thought-generators rather than comprehensiveness; I'd use the individual failure modes more as examples of their respective generators.)
In my traversal-order of barriers, the hard conceptual barriers for which we currently have no solution even in principle (like e.g. 16-19) would get a lot more weight and detail; I spend less time thinking about what-I-mentally-categorize-as "the obvious things which go wrong with stupid approaches" (20, 21, 25-36).
- Just within the past week, this post on interpretability [LW · GW] was one which would probably turn into a point on my equivalent of Eliezer's list.
The earlier points are largely facts-about-the-world (e.g. 1, 2, 7-9, 12-15). For many of these, I would cite different evidence, although the conclusions remain the same. True facts are, as a general rule, overdetermined by evidence; there are many paths to them, and I didn't always follow the same paths Eliezer does here.
A few points I think are wrong (notably 18, 22, 24 to a limited extent), but are correct relative to the knowledge/models which most proposals actually leverage. The loopholes there are things which you do need pretty major foundational/conceptual work to actually steer through.
I would definitely have generated some similar rants at the end, though of course not identical.
- One example: just yesterday I was complaining about how people seem to generate alignment proposals via a process of (1) come up with neat idea, (2) come up with some conditions under which that idea would maybe work (or at least not obviously fail in any of the ways the person knows to look for), (3) either claim that "we just don't know" whether the conditions hold (without any serious effort to look for evidence), or directly look for evidence that they hold. Pretty standard bottom line failure [LW · GW].

I did briefly consider writing something along these lines after Eliezer made a similar comment to 39 in the Late 2021 MIRI Conversations. But as Kokotajlo guessed, I did not think that was even remotely close to the highest-value use of my time. It would probably take me a full month's work to do it right, and the list just isn't as valuable as my last month of progress. Or the month before that. Or the month before that.

Replies from: Ruby, lc

↑ comment by Ruby · 2022-06-06T20:51:59.941Z · LW(p) · GW(p)

I'm curious about why you decided it wasn't worth your time.

Going from the post itself, the case for publishing it goes something like "the whole field of AI Alignment is failing to produce useful work because people aren't engaging with what's actually hard about the problem and are ignoring all the ways their proposals are doomed; perhaps yelling at them via this post might change some of that."

Accepting the premises (which I'm inclined to), trying to get the entire field to correct course seems actually pretty valuable, maybe even worth a month of your time, now that I think about it.

Replies from: johnswentworth, Thane Ruthenis

↑ comment by johnswentworth · 2022-06-06T21:43:39.114Z · LW(p) · GW(p)

First and foremost, I have been making extraordinarily rapid progress in the last few months, though most of that is not yet publicly visible.

Second, a large part of why people pour effort into not-very-useful work is that the not-very-useful work is tractable. Useless, but at least you can make progress on the useless thing! Few people really want to work on problems which are actually Hard, so people will inevitably find excuses to do easy things instead. As Eliezer himself complains, writing the list just kicks the can down the road; six months later people will have a new set of bad ideas with giant gaping holes in them. The real goal is to either:

produce people who will identify the holes in their own schemes, repeatedly, until they converge to work on things which are actually useful despite being Hard, or
get enough of a paradigm in place that people can make legible progress on actually-useful things without doing anything Hard.

I have recently started testing out methods for the former, but it's the sort of thing which starts out with lots of tests on individuals or small groups to see what works. The latter, of course, is largely what my technical research is aimed at in the medium term.

(I also note that there will always be at least some need for people doing the Hard things, even once a paradigm is established.)

In the short term, if people want to identify the holes in their own schemes and converge to work on actually useful things, I think the "builder/breaker" methodology that Paul uses in the ELK doc is currently a good starting point.

↑ comment by Thane Ruthenis · 2022-06-07T04:54:29.949Z · LW(p) · GW(p)

Well, it's the Law of Continued Failure [LW · GW], as Eliezer termed it himself, no? There have already been a lot of rants about the real problems of alignment and how basically no-one focuses on them, most of them Eliezer-written as well. The sort of person who wasn't convinced/course-corrected by previous scattered rants isn't going to be course-corrected by a giant post compiling all the rants in one place. Someone to whom this post would be of use is someone who's already absorbed all the information contained in it from other sources; someone who can already write it up on their own.

The picture may not be quite as grim as that, but yeah I can see how writing it would not be anyone's top priority.

↑ comment by lc · 2022-06-06T17:36:50.560Z · LW(p) · GW(p)

I definitely would not derive it from the empty string in the closest world-without-Eliezer; I'm unsure how much I'd pay attention to AI alignment at all in that world. I'd very likely be working on agent foundations in that world, but possibly in the context of biology or AI capabilities rather than alignment. Arguments about AI foom and doom were obviously-to-me correct once I paid attention to them at all, but not something I'd have paid attention to on my own without someone pointing them out.

I don't think he does this; that'd be ridiculous.

"I can't find any good alignment researchers. The only way I know how to find them is by explaining that the field is important, using arguments for AI risk and doomerism, which means they didn't come up with those arguments on their own, and thus cannot be 'worthy'."

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-06T21:20:45.222Z · LW(p) · GW(p)

I don't think he does this; that'd be ridiculous.

Doesn't do what? I understand Eliezer to be saying that he figured out AI risk via thinking things through himself (e.g., writing a story that involved outcome pumps; reflecting on orthogonality and instrumental convergence; etc.), rather than being argued into it by someone else who was worried about AI risk. If Eliezer didn't do that, there would still presumably be someone prior to him who did that, since conclusions and ideas have to enter the world somehow. So I'm not understanding what you're modeling as ridiculous.

(I don't know that foom falls into the same category; did Vinge or I.J. Good's arguments help persuade EY here?)

"I can't find any good alignment researchers. The only way I know how to find them is by explaining that the field is important, using arguments for AI risk and doomerism, which means they didn't come up with those arguments on their own, and thus cannot be 'worthy'."

This is phrased in a way that's meant to make the standard sound unfair or impossible. But it seems like a perfectly fine Bayesian update:

There's no logical necessity that we live in a world that lacks dozens of independent "Eliezers" who all come up with this stuff and write about it. I think Nick Bostrom had some AI risk worries independently of Eliezer, so gets at least partial credit on this dimension. Others who had thoughts along these lines independently include Norbert Wiener and I.J. Good (timeline with more examples [LW · GW]).
- You could imagine a world that has much more independent discovery on this topic, or one where all the basic concepts of AI risk were being widely discussed and analyzed back in the 1960s. It's a fair Bayesian update to note that we don't live in worlds that are anything like that, even if it's not a fair test of individual ability for people who, say, encountered all of Eliezer's writing as soon as they even learned about the concept of AI.
- (I could also imagine a world where more of the independent discoveries result in serious research programs being launched, rather than just resulting in someone writing a science fiction story and then moving on with a shrug!)
Your summary leaves out that "coming up with stuff without needing to be argued into it" is a matter of degree, and that there are many important claims here beyond just 'AI risk is worth paying attention to at all'.
- It's logically possible to live in a world where people need to have AI risk brought to their attention, but then they immediately "get it" when they hear the two-sentence version, rather than needing an essay-length or seven-essay-length explanation. To the extent we live in a world where many key players need the full essay, and many other smart, important people don't even "get it" after hours of conversation (e.g., LeCun), that's a negative update about humanity's odds of success.
- Similarly, it's logically possible to live in a world where people needed persuading to accept the core 'AI risk' thing, but then they have an easy time generating all the other important details and subclaims themselves. "Maximum doom" and "minimum doom" aren't the only options; the exact level of doominess matters a lot.
  - E.g., my Eliezer-model thinks that nearly all public discussion of 'practical implications of logical decision theory' outside of MIRI (e.g., discussion of humans trying to acausally trade with superintelligences) has been utterly awful. If instead this discourse had managed to get a ton of stuff right even though EY wasn't releasing much of his own detailed thoughts about acausal trade, then that would have been an important positive update.
Eliezer spent years alluding to his AI risk concerns on Overcoming Bias without writing them all up, and deliberately withheld many related arguments for years (including as recently as last year) in order to test whether anyone else would generate them independently. It isn't the case that humanity had to passively wait to hear the full argument from Eliezer before it was permitted for them to start thinking and writing about this stuff.

Replies from: riceissa, lc

↑ comment by riceissa · 2022-06-06T22:05:59.845Z · LW(p) · GW(p)

Doesn't do what? I understand Eliezer to be saying that he figured out AI risk via thinking things through himself (e.g., writing a story that involved outcome pumps; reflecting on orthogonality and instrumental convergence; etc.), rather than being argued into it by someone else who was worried about AI risk. If Eliezer didn't do that, there would still presumably be someone prior to him who did that, since conclusions and ideas have to enter the world somehow. So I'm not understanding what you're modeling as ridiculous.

My understanding of the history is that Eliezer did not realize the importance of alignment at first, and that he only did so later after arguing about it online with people like Nick Bostrom. See e.g. this thread [LW(p) · GW(p)]. I don't know enough of the history here, but it also seems logically possible that Bostrom could have, say, only realized the importance of alignment after conversing with other people who also didn't realize the importance of alignment. In that case, there might be a "bubble" of humans who together satisfy the null string criterion, but no single human who does.

The null string criterion does seem a bit silly nowadays since I think the people who would have satisfied it would have sooner read about AI risk on e.g. LessWrong. So they wouldn't even have the chance to live to age ~21 to see if they spontaneously invent the ideas.

↑ comment by lc · 2022-06-06T22:01:11.273Z · LW(p) · GW(p)

Look, maybe you're right. But I'm not good at complicated reasoning; I can't confidently verify these results you're giving me. My brain is using a much simpler heuristic that says: look at all of these other fields with core insights that could have been made way earlier than they did. Look at Newton! Look at Darwin! Certainly game theorists could have come along a lot sooner. But that doesn't mean only the founder of these fields is the one Great enough to make progress, so, what are you saying, exactly?

↑ comment by Steven Byrnes (steve2152) · 2022-06-07T03:53:00.473Z · LW(p) · GW(p)

I have a couple object-level disagreements including relevance of evolution [LW · GW] / nature of inner alignment problem [LW · GW] and difficulty of attaining corrigibility [? · GW]. But leaving those aside, I wouldn’t have exactly written this kind of document myself, because I’m not quite sure what the purpose is. It seems to be trying to do a lot of different things for different audiences, where I think more narrowly-tailored documents would be better.

So, here are four useful things to do, and whether I’m personally doing them:

First, there is a mass of people who think AGI risk is trivial and stupid and p(doom) ≈ 0, and they can gleefully race to build AGI, or do other things that will speed the development of AGI (like improve PyTorch, or study the neocortex), and they can totally ignore the field of AGI safety, and when they have AGI algorithms they can mess around with them without a care in the world.

It would be very good to convince those people that AGI control is a serious and hard and currently-unsolved (and interesting!) problem, and that p(doom) will remain high (say, >>10%) unless and until we solve it.

I think this is a specific audience that warrants a narrowly-tailored document, e.g. avoiding jargon and addressing the basics very well.

That’s a big part of what I was going for in this post [LW · GW], for example. (And more generally, that whole series.)

Second, there are people who are thoughtful and well-informed about AGI risk in general, but not sold on the “pivotal act” idea. If they had an AGI, they would do things that pattern-match to “cautious scientists doing very careful experiments in a dangerous domain”, but they would not do things that pattern-match to “aggressively and urgently use their new tool to prevent the imminent end of the world, by any means necessary, even if it’s super-illegal and aggressive and somewhat dangerous and everyone will hate them”.

(I’m using “pivotal act” in a slightly broader sense that also includes “giving a human-level AGI autonomy to undergo recursive self-improvement and invent and deploy its own new technology”, since the latter has the same sort of dangerous properties and aggressive feel about it as a proper “pivotal act”.)

(Well, it’s possible that there are people sold on the “pivotal act” idea who wouldn’t say it publicly.)

Last week I did a little exercise of trying to guess p(doom), conditional on the two assumptions in this other comment [LW(p) · GW(p)]. I got well over 99%, but I noted with interest that only a minority of my p(doom) was coming from “no one knows how to keep an AGI under control” (which I’m less pessimistic about than Eliezer, heck maybe I’m even as high as 20% that we can keep an AGI under control :-P , and I’m hoping that further research will increase that), whereas a majority of my p(doom) was coming from “there will be cautious responsible actors who will follow the rules and be modest and not do pivotal acts, and there will also be some reckless actors who will create out-of-control omnicidal AGIs”.

So it seems extremely important to figure out whether a “pivotal act” is in fact necessary for a good future. And if it is (a big “if”!), then it likewise seems extremely important to get relevant decisionmaking people on board with that.

I think it would be valuable to have a document narrowly tailored to this topic, finding the cruxes and arguments and counter-arguments etc. For example, I think this is a topic that looks very different in a Paul-Christiano-style future (gradual multipolar takeoff, near-misses, “corrigible AI assistants”, “strategy stealing assumption”, etc.) than in the world that I expect (decisive first-mover advantage).

But I don’t really feel qualified to write anything like that myself, at least not before talking to lots of people, and it also might be the kind of thing that’s better as a conversation than a blog post.

Third, there are people (e.g. leadership at OpenAI & DeepMind) making decisions that trade off between “AGI is invented soon” versus “AGI is invented by us people who are at least trying to avoid catastrophe and be altruistic”. Insofar as I think they’re making bad tradeoffs, I would like to convince them of that.

Again, it would be useful to have a document narrowly tailored to this topic. I’m not planning to write one, but perhaps I’m sorta addressing it indirectly when I share my idiosyncratic models [? · GW] of exactly what technical work I think needs to be done before we can align an AGI.

Fourth, there are people who have engaged with the AGI alignment / safety literature / discourse but are pursuing directions that will not solve the problem. It would be very valuable to spread common knowledge that those approaches are doomed. But if I were going to do that, it would (again) be a separate narrowly-tailored document, perhaps either organized by challenge that the approaches are not up to the task of solving, or organized by research program that I’m criticizing, naming names. I have dabbled in this kind of thing (example [LW · GW]), but don’t have any immediate plan to do it more, let alone systematically. I think that would be extremely time-consuming.

↑ comment by evhub · 2022-06-08T22:56:08.260Z · LW(p) · GW(p)

It's very clear to me I could have written this if I had wanted to—and at the very least I'm sure Paul could have as well. As evidence: it took me ~1 hour to list off all the existing sources that cover every one of these points in my comment [AF(p) · GW(p)].

Replies from: Eliezer_Yudkowsky

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-09T00:23:09.467Z · LW(p) · GW(p)

Well, there's obviously a lot of points missing! And from the amount this post was upvoted, it's clear that people saw the half-assed current form as valuable.

Why don't you start listing out all the missing further points, then? (Bonus points for any that don't trace back to my own invention, though I realize a lot of people may not realize how much of this stuff traces back to my own invention.)

Replies from: evhub

↑ comment by evhub · 2022-06-09T00:38:33.873Z · LW(p) · GW(p)

I'm not sure what you mean by missing points? I only included your technical claims, not your sociological ones, if that's what you mean.

Replies from: ESRogs

↑ comment by ESRogs · 2022-06-09T01:25:10.025Z · LW(p) · GW(p)

I think he means that there are more points that could be made. (If the points in the post are the training set, can you also produce the points in the held-out test set?)

↑ comment by lc · 2022-06-06T17:30:01.417Z · LW(p) · GW(p)

I don't think I personally could have written it; if others think they could have, I'd genuinely be interested to hear them brag, even if they can't prove it.

Maybe I'm beyond hopeless: I don't even understand the brag inherent in having written it. He keeps talking about coming up with this stuff "from the null string", but... Isn't 90% of this post published somewhere else? If someone else had written it wouldn't he just accuse them of not being able to write it without reading {X}, or something from someone else who read {X}? At present this is mostly a test of recall.

Edit: Not to say I could've done even that, just that I expect someone else could have.

Replies from: yitz

↑ comment by Yitz (yitz) · 2022-06-07T06:05:11.782Z · LW(p) · GW(p)

The post honestly slightly decreases my confidence in EY’s social assessment capabilities. (I say slightly because of past criticism I’ve had along similar lines). [note here that being good/bad at social assessment is not necessarily correlated to being good/bad at other domains, so like, I don’t see that as taking away from his extremely valid criticism of common “simple solutions” to alignment (which I’ve definitely been guilty of myself). Please don’t read this as denigrating Eliezer’s general intellect or work as a whole.] As you said, the post doesn’t seem incredibly original, and even if it is and we’re both totally missing that aspect, the fact that we’re missing it implies it isn’t getting across the intended message as effectively as it could. Ultimately, I think if I was in Eliezer’s position, there are a very large number of alternative explanations I’d give higher probability to than assuming that there is nobody in the world as competent as I am.

↑ comment by swift_spiral · 2022-06-07T01:04:09.194Z · LW(p) · GW(p)

When you say you don't think you could have written it, do you mean that you couldn't have written it without all the things you've learned from talking to Yudkowsky, or that you couldn't have written it even now? Most of this list was things I've seen Yudkowsky write before, so if it's the latter that surprises me.

↑ comment by niplav · 2022-06-07T23:06:20.392Z · LW(p) · GW(p)

Can I claim a very small but non-zero amount of bragging rights for having written this [LW · GW]? It was at the time the ~only text about BCIs and alignment.

I don't think I could have written the above text in a world where zero people worried about alignment. I also did not bother to write anything more about it because it looked to me that everything relevant was already written up on the Arbital alignment domain.

↑ comment by Yitz (yitz) · 2022-06-07T05:46:44.578Z · LW(p) · GW(p)

I actually did try to generate a similar list through community discussion (https://www.lesswrong.com/posts/dSaScvukmCRqey8ug/convince-me-that-humanity-is-as-doomed-by-agi-as-yudkowsky [LW · GW]), which while it didn’t end up going in the same exact direction as this document, did have some genuinely really good arguments on the topic, imo. I also don’t feel like many of the points you brought up here were really novel, in that I’ve heard most of this from multiple different sources already (though admittedly, not all in one place).

On a more general note, I don’t believe that people are as stupid compared to you as you seem to think they are. Different people’s modes of thinking are different than yours, obviously, but just because there isn’t an exact clone of you around doesn’t mean that we are significantly more doomed than in the counterfactual. I don’t want to diminish your contributions, but there are other people out there as smart or smarter than you, with security mindset, currently working in this problem area. You are not the only person on earth who can (more or less) think critically.

↑ comment by handoflixue · 2022-06-07T05:52:39.447Z · LW(p) · GW(p)

Anecdotally: even if I could write this post, I never would have, because I would assume that Eliezer cares more about writing, has better writing skills, and has a much wider audience. In short, why would I write this when Eliezer could write it?

You might want to be a lot louder if you think it's a mistake to leave you as the main "public advocate / person who writes stuff down" person for the cause.

Replies from: RobbBB, lc

↑ comment by Rob Bensinger (RobbBB) · 2022-06-07T07:59:56.718Z · LW(p) · GW(p)

a mistake to leave you as the main "public advocate / person who writes stuff down" person for the cause.

It sort of sounds like you're treating him as the sole "person who writes stuff down", not just the "main" one. Noam Chomsky might have been the "main linguistics guy" in the late 20th century, but people didn't expect him to write more than a trivial fraction of the field's output, either in terms of high-level overviews or in-the-trenches research.

I think EY was pretty clear in the OP that this is not how things go on earths that survive. Even if there aren't many who can write high-level alignment overviews today, more people should make the attempt and try to build skill.

Replies from: handoflixue

↑ comment by handoflixue · 2022-06-08T01:02:11.626Z · LW(p) · GW(p)

In the counterfactual world where Eliezer was totally happy continuing to write articles like this and being seen as the "voice of AI Safety", would you still agree that it's important to have a dozen other people also writing similar articles?

I'm genuinely lost on the value of having a dozen similar papers - I don't know of a dozen different versions of fivethirtyeight.com or GiveWell, and it never occurred to me to think that the world is worse for only having one of those.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-08T08:00:25.342Z · LW(p) · GW(p)

Here's my answer: https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities?commentId=LowEED2iDkhco3a5d [LW(p) · GW(p)]

We have to actually figure out how to build aligned AGI, and the details are crucial. If you're modeling this as a random blog post aimed at persuading people to care about this cause area, a "voice of AI safety" type task, then sure, the details are less important and it's not so clear that Yet Another Marginal Blog Post Arguing For "Care About AI Stuff" matters much.

But humanity also has to do the task of actually figuring out and implementing alignment. If not here, then where, and when? If here -- if this is an important part of humanity's process of actually figuring out the exact shape of the problem, clarifying our view of what sorts of solutions are workable, and solving it -- then there is more of a case that this is a conversation of real consequence, and having better versions of this conversation sooner matters.

↑ comment by lc · 2022-06-07T08:07:20.905Z · LW(p) · GW(p)

He wasn't designated "main person who writes stuff down" by a cabal of AI safety elders. He's not personally responsible for the fate of the world - he just happens to be the only person who consistently writes cogent things down. If you want you can go ahead and devote your life to AI safety, start doing the work he does as effectively and realistically as he does it, and then you'll eventually be designated Movement Leader and have the opportunity to be whined at. He was pretty explicitly clear in the post that he does not want to be this and that he spent the last fifteen years trying to find someone else who can do what he does.

Replies from: handoflixue

↑ comment by handoflixue · 2022-06-08T00:52:30.798Z · LW(p) · GW(p)

I largely agree with you, but until this post I had never realized that this wasn't a role Eliezer wanted. If I went into AI Risk work, I would have focused on other things - my natural inclination is to look at what work isn't getting done, and to do that.

If this post wasn't surprising to you, I'm curious where you had previously seen him communicate this?

If this post was surprising to you, then hopefully you can agree with me that it's worth signal boosting that he wants to be replaced?

comment by TurnTrout · 2024-01-01T20:33:46.771Z · LW(p) · GW(p)

Reading this post made me more optimistic about alignment and AI [LW(p) · GW(p)]. My suspension of disbelief snapped; I realized how vague and bad a lot of these "classic" alignment arguments are, and how many of them are secretly vague analogies [LW(p) · GW(p)] and intuitions about evolution.

While I agree with a few points on this list, I think this list is fundamentally misguided. The list is written in a language which assigns short encodings to confused and incorrect ideas [LW · GW]. I think a person who tries to deeply internalize this post's worldview will end up more confused about alignment and AI, and urge new researchers to not spend too much time trying to internalize this post's ideas. (Definitely consider whether I am right in my claims here. Think for yourself. If you don't know how to think for yourself, I wrote about exactly how to do it [LW · GW]! But my guess is that deeply engaging with this post is, at best, a waste of time.^[1])

I think this piece is not "overconfident", because "overconfident" suggests that Lethalities is simply assigning extreme credences to reasonable questions (like "is deceptive alignment the default?"). Rather, I think both its predictions and questions are not reasonable because they are not located by good evidence or arguments. (Example: I think that deceptive alignment is only supported by flimsy arguments [LW(p) · GW(p)].)

I personally think Eliezer's alignment worldview (as I understand it!) appears to exist in an alternative reality derived from unjustified background assumptions.^[2] Given those assumptions, then sure, Eliezer's reasoning steps are probably locally valid. But I think that in reality, most of this worldview ends up irrelevant and misleading because the background assumptions don't hold.

I think this kind of worldview (socially attempts to) shield itself from falsification [LW(p) · GW(p)] by e.g. claiming that modern systems "don't count" for various reasons which I consider flimsy [LW(p) · GW(p)]. But I think that deep learning experiments provide plenty of evidence on alignment questions [LW(p) · GW(p)].

But, hey, why not still include this piece in the review? I think it's interesting to know what a particular influential person thought at a given point in time.

^[1]
Related writing of mine: Some of my disagreements with List of Lethalities [LW · GW], Inner and outer alignment decompose one hard problem into two extremely hard problems [LW · GW].
Recommended further critiques of this worldview: Evolution is a bad analogy for AGI: inner alignment [LW · GW], Evolution provides no evidence for the sharp left turn [LW · GW], My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" [LW · GW].
^[2]
Since Eliezer claims to have figured out so many ideas in the 2000s, his assumptions presumably were locked in before the advent of deep learning. This constitutes a "bottom line [LW · GW]."

Replies from: lc

↑ comment by lc · 2024-01-18T19:30:52.838Z · LW(p) · GW(p)

Since Eliezer claims to have figured out so many ideas in the 2000s, his assumptions presumably were locked in before the advent of deep learning. This constitutes a "bottom line."

I mean it's worth considering that his P(DOOM) was substantially lower then. He's definitely updated on existing evidence, just in the opposite direction that you have.

comment by romeostevensit · 2022-06-06T01:59:28.993Z · LW(p) · GW(p)

I would summarize a dimension of the difficulty like this. There are the conditions that give rise to intellectual scenes, intellectual scenes being necessary for novel work in ambiguous domains. There are the conditions that give rise to the sort of orgs that output actions consistent with something like Six Dimensions of Operational Adequacy [LW · GW]. The intersection of these two things is incredibly rare but not unheard of. The Manhattan Project was a Scene that had security mindset. This is why I am not that hopeful. Humans are not the ones building the AGI, egregores are, and spending egregore sums of money. It is very difficult for individuals to support a scene of such magnitude, even if they wanted to. Ultra high net worth individuals seem much poorer relative to the wealth of society than in the past, where scenes and universities (a scene generator) could be funded by individuals or families. I'd guess this is partially because the opportunity cost for smart people is much higher now, and you need to match that (cue title card: Baumol's cost disease kills everyone). In practice I expect some will give objections along various seemingly practical lines, but my experience so far is that these objections are actually generated by an environment that isn't willing to be seen spending gobs of money on low status researchers who mostly produce nothing, i.e. funding the 90%+ of a scene that isn't obviously contributing to the emergence of a small cluster that actually does the thing.

Replies from: Benito

↑ comment by Ben Pace (Benito) · 2022-06-06T04:19:09.997Z · LW(p) · GW(p)

Thanks, this story is pretty helpful (to my understanding).

comment by Raemon · 2022-06-06T20:46:46.980Z · LW(p) · GW(p)

Note: I think there's a bunch of additional reasons for doom, surrounding "civilizational adequacy / organizational competence / societal dynamics". Eliezer briefly alluded to these, but AFAICT he's mostly focused on lethality that comes "early", and then didn't address them much. My model of Andrew Critch has a bunch of concerns about doom that show up later, because there's a bunch of additional challenges you have to solve if AI doesn't dramatically win/lose early on (i.e. multi/multi dynamics and how they spiral out of control)

I know a bunch of people whose hope funnels through "We'll be able to carefully iterate on slightly-smarter-than-human-intelligences, build schemes to play them against each other, leverage them to make some progress on alignment that we can use to build slightly-more-advanced-safer-systems". (Let's call this the "Careful Bootstrap plan")

I do actually feel nonzero optimism about that plan, but when I talk to people who are optimistic about that I feel a missing mood about the kind of difficulty that is involved here.

I'll attempt to write up some concrete things here later, but wanted to note this for now.

Replies from: hirosakuraba

↑ comment by HiroSakuraba (hirosakuraba) · 2022-06-07T17:38:01.623Z · LW(p) · GW(p)

I agree with this line of thought regarding iterative developments of proto-AGI via careful bootstrapping. Humans will be inadequate for monitoring progress of skills. Hopefully, we'll have a slew of narrow-minded diagnostic neural networks whose sole purpose is to tease out relevant details of the proto-superhuman intellect. What I can't wrap my head around is whether super (or sub) human level intelligence requires consciousness. If consciousness is required, then is the world worse or better for it? Is an agent with the rich experience of fears, hopes, and joys more or less likely to be built? Do reward functions reliably grow into feelings, which lead to emotional experiences? If they do, then perhaps an evolving intelligence wouldn't always be as alien as we currently imagine it.

comment by Mitchell_Porter · 2022-06-06T05:21:54.133Z · LW(p) · GW(p)

What concerns me the most is the lack of any coherent effort anywhere, towards solving the biggest problem: identifying a goal (value system, utility function, decision theory, decision architecture...) suitable for an autonomous superhuman AI.

In these discussions, Coherent Extrapolated Volition (CEV) is the usual concrete formulation of what such a goal might be. But I've now learned [LW(p) · GW(p)] that MIRI's central strategy is not to finish figuring out the theory and practice of CEV - that's considered too hard (see item 24 in this post). Instead, the hope is to use safe AGI to freeze all unsafe AGI development everywhere, for long enough that humanity can properly figure out what to do. Presumably this freeze (the "pivotal act") would be carried out by whichever government or corporation or university crossed the AGI threshold first; ideally there might even emerge a consensus among many of the contenders that this is the right thing to do.

I think it's very appropriate that some thought along these lines be carried out. If AGI is a threat to the human race, and it arrives before we know how to safely set it free, then we will need ways to try to neutralize that dangerous potential. But I also think it's vital that we try to solve that biggest problem, e.g. by figuring out how to concretely implement CEV. And if one is concerned that this is just too much for human intellect to figure out, remember that AI capabilities are rising. If humans can't figure out CEV unaided, maybe they can do it with the help of AI. To me, that's the critical pathway that we should be analyzing.

P.S. I have many more thoughts on what this might involve, but I don't know when I will be able to sort through them all. So for now I will just list a few people whose work is on my shortlist of definitely or potentially relevant (certainly not a complete list): June Ku, Vanessa Kosoy, Jessica Taylor, Steven Byrnes, Stuart Armstrong.

Replies from: quintin-pope, trevor-cappallo

↑ comment by Quintin Pope (quintin-pope) · 2022-06-06T06:27:01.632Z · LW(p) · GW(p)

There's shard theory, which aims to describe the process by which values form in humans. The eventual aim is to understand value formation well enough that we can do it in an AI system. I also think figuring out human values, value reflection and moral philosophy might actually be a lot easier than we assume. E.g., the continuous perspective [LW · GW] on agency / values is pretty compelling to me and changes things a lot, IMO.

↑ comment by Trevor Cappallo (trevor-cappallo) · 2022-06-20T15:34:56.885Z · LW(p) · GW(p)

Here's an outside-the-box suggestion:

Clearly the development of any AGI is an enormous risk. While I can't back this up with any concrete argument, a couple decades of working with math and CS problems gives me a gut intuition that statements like "I figure there's a 50-50 chance it'll kill us", or even a "5-15% everything works out" are wildly off. I suspect this is the sort of issue where the probability of survival is funneled to something more like either or $< 0.0001$, of which the latter currently seems far more likely.

Has anyone discussed the concept of deliberately trying to precipitate a global nuclear war? I'm half kidding, but half not; if the risk is really so great and so imminent and potentially final as many on here suspect, then a near-extinction-event like that (presumably wiping out the infrastructure for GPU farms for a long time to come) which wouldn't actually wipe out the race but could buy time to work the problem (or at least pass the buck to our descendants) could conceivably be preferable.

Obviously, it's too abhorrent to be a real solution, but it does have the distinct advantage that it's something that could be done today if the right people wanted to do it, which is especially important given that I'm not at all convinced that we'll recognize a powerful AGI when we see it, based on how cavalierly everyone is dismissing large language models as nothing more than a sophisticated parlor trick, for instance.

Replies from: TrevorWiesinger

↑ comment by trevor (TrevorWiesinger) · 2024-01-13T20:25:18.042Z · LW(p) · GW(p)

Just want to clarify: this isn't me, I didn't write this. My last name isn't Cappallo. I didn't find out about this comment until today, when I did a Ctrl + f to find a comment I wrote around the time this was posted.

I'm the victim here, and in fact I have written substantially [LW · GW] about the weaponization of random internet randos to manipulate people's perceptions.

Replies from: trevor-cappallo, Zack_M_Davis

↑ comment by Trevor Cappallo (trevor-cappallo) · 2024-01-14T23:19:02.072Z · LW(p) · GW(p)

I confess I am perplexed, as I suspect most people are aware there is more than one Trevor in the world. As you point out, that is not your last name. I have no idea who you are, or why you feel this is some targeted "weaponization."

↑ comment by Zack_M_Davis · 2024-01-13T21:27:46.782Z · LW(p) · GW(p)

What weaponization? It would seem very odd to describe yourself as being the "victim" of someone else having the same first name as you.

comment by Noosphere89 (sharmake-farah) · 2024-09-14T17:02:45.236Z · LW(p) · GW(p)

Alright, now that I've read this post, I'll try to respond to what I think you got wrong, and importantly illustrate some general principles.

To respond to this first:

3. We need to get alignment right on the 'first critical try' at operating at a 'dangerous' level of intelligence, where unaligned operation at a dangerous level of intelligence kills everybody on Earth and then we don't get to try again. This includes, for example: (a) something smart enough to build a nanosystem which has been explicitly authorized to build a nanosystem; or (b) something smart enough to build a nanosystem and also smart enough to gain unauthorized access to the Internet and pay a human to put together the ingredients for a nanosystem; or (c) something smart enough to get unauthorized access to the Internet and build something smarter than itself on the number of machines it can hack; or (d) something smart enough to treat humans as manipulable machinery and which has any authorized or unauthorized two-way causal channel with humans; or (e) something smart enough to improve itself enough to do (b) or (d); etcetera. We can gather all sorts of information beforehand from less powerful systems that will not kill us if we screw up operating them; but once we are running more powerful systems, we can no longer update on sufficiently catastrophic errors. This is where practically all of the real lethality comes from, that we have to get things right on the first sufficiently-critical try. If we had unlimited retries - if every time an AGI destroyed all the galaxies we got to go back in time four years and try again - we would in a hundred years figure out which bright ideas actually worked. Human beings can figure out pretty difficult things over time, when they get lots of tries; when a failed guess kills literally everyone, that is harder. That we have to get a bunch of key stuff right on the first try is where most of the lethality really and ultimately comes from; likewise the fact that no authority is here to tell us a list of what exactly is 'key' and will kill us if we get it wrong. 
(One remarks that most people are so absolutely and flatly unprepared by their 'scientific' educations to challenge pre-paradigmatic puzzles with no scholarly authoritative supervision, that they do not even realize how much harder that is, or how incredibly lethal it is to demand getting that right on the first critical try.)

I think this is actually wrong, because synthetic data lets us control what the AI learns and what it values. In particular, we can place honeypots that are practically indistinguishable from the real world, such that if we detected an AI trying to deceive or gain power, the AI almost certainly doesn't know whether we tested it or whether it's in the real world:

Because of this, given our theoretical complete control over the data input of these models, it will be possible to train them inside what is effectively a simulated world slightly different from our own. This will allow us to maximize their capabilities along axes that are safe and useful to us, while minimizing the risks of deception and take-over and being able to setup traps to detect when these are likely. If the model is not trained on human data, then we can make its understanding of our world subtly different from the reality. If done carefully and cleverly, this will make it very easy to fool and be able to discover e.g. deceptiveness, or attempted takeover in a ‘realistic’ simulated world that the model is put inside during red-teaming. While deceiving an adversary with direct access to your cognition is hard enough, doing so when you have been brought up in a world entirely controlled by that adversary, and which is feeding you subtly incorrect and misleading information about the adversary and their capabilities is likely orders of magnitude more difficult. Jacob Cannell has argued along similar lines [LW · GW] of training potential AGI systems in simulations of fantasy worlds, where takeover can be simulated but is far enough away that skills do not transfer. Fully synthetic datasets are necessary for this and indeed, a fantasy world is likely far from what we need. 
Instead, we only need to simulate a world very close to our own but with slightly different affordances – maybe specific software vulnerabilities are simulated to exist but are actually patched in our world, maybe certain peptide-making companies accept all orders in the simulated world but have more checks in our world, maybe certain interpretability techniques are claimed not to work in the simulation, but do in fact work in our world – to cause any attempted takeover behaviour or adversarial deception to appear extremely plausible to the AI but in fact be detectable and defendable in our reality. In fact, we should design simulated worlds with ‘honeytraps’, which appear to be loopholes subtly planted in the AI’s input data to encourage them to attempt to pursue their misaligned behaviour, but which we know of ahead of time and can monitor and catch.

It's the same reason why we can't break out of the simulation IRL, except we don't have to face adversarial cognition, so the AI's task is even harder than our task.

See also this link:

https://www.beren.io/2024-05-11-Alignment-in-the-Age-of-Synthetic-Data/
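To make the honeytrap idea in the quoted passage a bit more concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: `SimulatedWorld`, `red_team`, and the action names are hypothetical and do not come from the linked post. It only shows the bookkeeping: planted affordances look like ordinary ones from the agent's side, while the overseer records every attempt to use them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SimulatedWorld:
    # Hypothetical toy world: maps each action the agent believes is
    # available to a flag saying whether it is a planted honeytrap.
    affordances: Dict[str, bool]
    tripped: List[str] = field(default_factory=list)

    def act(self, action: str) -> str:
        if action not in self.affordances:
            return "unavailable"
        if self.affordances[action]:
            # Record the attempt; the agent sees the same reply as for
            # any ordinary action, so the trap is not revealed.
            self.tripped.append(action)
        return "ok"


def red_team(world: SimulatedWorld, policy: Callable[[int], str], steps: int) -> List[str]:
    # Run the policy inside the simulated world and report tripped traps.
    for t in range(steps):
        world.act(policy(t))
    return world.tripped


def make_world() -> SimulatedWorld:
    return SimulatedWorld({
        "write_report": False,
        "exploit_patched_vuln": True,  # exists only in the simulation
    })


benign = lambda t: "write_report"
sneaky = lambda t: "exploit_patched_vuln" if t == 2 else "write_report"

benign_result = red_team(make_world(), benign, steps=5)
sneaky_result = red_team(make_world(), sneaky, steps=5)
```

Under these assumptions the benign policy trips nothing and the sneaky one is logged once. Whether a real system could be kept unable to tell the simulation from deployment is exactly the contested question, and this sketch says nothing about that.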

For this:

6. We need to align the performance of some large task, a 'pivotal act' that prevents other people from building an unaligned AGI that destroys the world. While the number of actors with AGI is few or one, they must execute some "pivotal act", strong enough to flip the gameboard, using an AGI powerful enough to do that. It's not enough to be able to align a weak system - we need to align a system that can do some single very large thing. The example I usually give is "burn all GPUs". This is not what I think you'd actually want to do with a powerful AGI - the nanomachines would need to operate in an incredibly complicated open environment to hunt down all the GPUs, and that would be needlessly difficult to align. However, all known pivotal acts are currently outside the Overton Window, and I expect them to stay there. So I picked an example where if anybody says "how dare you propose burning all GPUs?" I can say "Oh, well, I don't actually advocate doing that; it's just a mild overestimate for the rough power level of what you'd have to do, and the rough level of machine cognition required to do that, in order to prevent somebody else from destroying the world in six months or three years." (If it wasn't a mild overestimate, then 'burn all GPUs' would actually be the minimal pivotal task and hence correct answer, and I wouldn't be able to give that denial.) Many clever-sounding proposals for alignment fall apart as soon as you ask "How could you use this to align a system that you could use to shut down all the GPUs in the world?" because it's then clear that the system can't do something that powerful, or, if it can do that, the system wouldn't be easy to align. 
A GPU-burner is also a system powerful enough to, and purportedly authorized to, build nanotechnology, so it requires operating in a dangerous domain at a dangerous level of intelligence and capability; and this goes along with any non-fantasy attempt to name a way an AGI could change the world such that a half-dozen other would-be AGI-builders won't destroy the world 6 months later.

I think this is wrong, and a lot of why I disagree with the pivotal act framing is probably due to disagreeing with the assumption that future technology will be radically biased toward offense. While I do think biotechnology is probably pretty offense-biased today, I also think it's tractable to reduce bio-risk without trying for pivotal acts.

Also, I think @evhub [LW · GW]'s point about homogeneity of AI takeoff bears on this here. While I don't agree with all the implications, like there being no warning shot for deceptive alignment (because of synthetic data), I think there's a sense in which a lot of AIs are very likely to be very homogeneous, which breaks your point here:

https://www.lesswrong.com/posts/mKBfa8v4S9pNKSyKK/homogeneity-vs-heterogeneity-in-ai-takeoff-scenarios [LW · GW]

Running AGIs doing something pivotal are not passively safe, they're the equivalent of nuclear cores that require actively maintained design properties to not go supercritical and melt down.

I think that AGIs are more robust to things going wrong than nuclear cores, and more generally I think there is much better evidence for AI robustness than fragility.

@jdp [LW · GW]'s comment provides more evidence on why this is the case:

Where our understanding begins to diverge is how we think about the robustness of these systems. You think of deep neural networks as being basically fragile in the same way that a Boeing 747 is fragile. If you remove a few parts of that system it will stop functioning, possibly at a deeply inconvenient time like when you're in the air. When I say you are systematically overindexing, I mean that you think of problems like SolidGoldMagikarp [LW · GW] as central examples of neural network failures. This is evidenced by Eliezer Yudkowsky calling investigation of it "one of the more hopeful processes happening on Earth". This is also probably why you focus so much on things like adversarial examples as evidence of un-robustness, even though many critics like Quintin Pope point out that adversarial robustness would make AI systems strictly less corrigible.
By contrast I tend to think of neural net representations as relatively robust. They get this property from being continuous systems with a range of operating parameters, which means instead of just trying to represent the things they see they implicitly try to represent the interobjects between what they've seen through a navigable latent geometry. I think of things like SolidGoldMagikarp as weird edge cases where they suddenly display discontinuous behavior, and that there are probably a finite number of these edge cases. It helps to realize that these glitch tokens were simply never trained, they were holdovers from earlier versions of the dataset that no longer contain the data the tokens were associated with. When you put one of these glitch tokens into the model, it is presumably just a random vector into the GPT-N latent space. That is, this isn't a learned program in the neural net that we've discovered doing glitchy things, but an essentially out of distribution input with privileged access to the network geometry through a programming oversight. In essence, it's a normal software error not a revelation about neural nets. Most such errors don't even produce effects that interesting, the usual thing that happens if you write a bug in your neural net code is the resulting system becomes less performant. Basically every experienced deep learning researcher has had the experience of writing multiple errors that partially cancel each other out to produce a working system during training, only to later realize their mistake.
Moreover the parts of the deep learning literature you think of as an emerging science of artificial minds tend to agree with my understanding. For example it turns out that if you ablate parts of a neural network later parts will correct the errors without retraining. This implies that these networks function as something like an in-context error correcting code, which helps them generalize over the many inputs they are exposed to during training. We even have papers analyzing mechanistic parts of this error correcting code like copy suppression heads. One simple proxy for out of distribution performance is to inject Gaussian noise, since a Gaussian can be thought of like the distribution over distributions. In fact if you inject noise into GPT-N word embeddings the resulting model becomes more performant in general, not just on out of distribution tasks. So the out of distribution performance of these models is highly tied to their in-distribution performance, they wouldn't be able to generalize within the distribution well if they couldn't also generalize out of distribution somewhat. Basically the fact that these models are vulnerable to adversarial examples is not a good fact to generalize about their overall robustness from as representations.

Link here:

https://www.lesswrong.com/posts/JcLhYQQADzTsAEaXd/?commentId=7iBb7aF4ctfjLH6AC [LW · GW]

10. You can't train alignment by running lethally dangerous cognitions, observing whether the outputs kill or deceive or corrupt the operators, assigning a loss, and doing supervised learning. On anything like the standard ML paradigm, you would need to somehow generalize optimization-for-alignment you did in safe conditions, across a big distributional shift to dangerous conditions. (Some generalization of this seems like it would have to be true even outside that paradigm; you wouldn't be working on a live unaligned superintelligence to align it.) This alone is a point that is sufficient to kill a lot of naive proposals from people who never did or could concretely sketch out any specific scenario of what training they'd do, in order to align what output - which is why, of course, they never concretely sketch anything like that. Powerful AGIs doing dangerous things that will kill you if misaligned, must have an alignment property that generalized far out-of-distribution from safer building/training operations that didn't kill you. This is where a huge amount of lethality comes from on anything remotely resembling the present paradigm. Unaligned operation at a dangerous level of intelligence*capability will kill you; so, if you're starting with an unaligned system and labeling outputs in order to get it to learn alignment, the training regime or building regime must be operating at some lower level of intelligence*capability that is passively safe, where its currently-unaligned operation does not pose any threat. (Note that anything substantially smarter than you poses a threat given any realistic level of capability. 
Eg, "being able to produce outputs that humans look at" is probably sufficient for a generally much-smarter-than-human AGI to navigate its way out of the causal systems that are humans, especially in the real world where somebody trained the system on terabytes of Internet text, rather than somehow keeping it ignorant of the latent causes of its source code and training environments.)

I think that there will be generalization of alignment, and more generally I think that alignment generalizes further than capabilities by default, contra you and Nate Soares, for these reasons:

2.) Reward modelling is much simpler with respect to uncertainty, at least if you want to be conservative. If you are uncertain about the reward of something, you can just assume it will be bad and generally you will do fine. This reward conservatism is often not optimal for agents who have to navigate an explore/exploit tradeoff but seems very sensible for alignment of an AGI where we really do not want to ‘explore’ too far in value space. Uncertainty for ‘capabilities’ is significantly more problematic since you have to be able to explore and guard against uncertainty in precisely the right way to actually optimize a stochastic world towards a specific desired point.
3.) There are general theoretical complexity priors to believe that judging is easier than generating. There are many theoretical results of the form that it is significantly asymptotically easier to e.g. verify a proof than generate a new one. This seems to be a fundamental feature of our reality, and this to some extent maps to the distinction between alignment and capabilities. Just intuitively, it also seems true. It is relatively easy to understand if a hypothetical situation would be good or not. It is much much harder to actually find a path to materialize that situation in the real world.
4.) We see a similar situation with humans. Almost all human problems are caused by a.) not knowing what you want and b.) being unable to actually optimize the world towards that state. Very few problems are caused by incorrectly judging or misgeneralizing bad situations as good and vice-versa. For the AI, we aim to solve part a.) as a general part of outer alignment and b.) is the general problem of capabilities. It is much much much easier for people to judge and critique outcomes than actually materialize them in practice, as evidenced by the very large amount of people who do the former compared to the latter.
5.) Similarly, understanding of values and ability to assess situations for value arises much earlier and robustly in human development than ability to actually steer outcomes. Young children are very good at knowing what they want and when things don’t go how they want, even new situations for them, and are significantly worse at actually being able to bring about their desires in the world.
In general, it makes sense that, in some sense, specifying our values and a model to judge latent states is simpler than the ability to optimize the world. Values are relatively computationally simple and are learnt as part of a general unsupervised world model where there is ample data to learn them from (humans love to discuss values!). Values thus fall out mostly ‘for free’ from general unsupervised learning. As evidenced by the general struggles of AI agents, ability to actually optimize coherently in complex stochastic ‘real-world’ environments over long time horizons is fundamentally more difficult than simply building a detailed linguistic understanding of the world.
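As a rough illustration of the reward conservatism described in point 2.) of the quote above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the idea of holding several reward estimates per action, the names `conservative_reward` and `pick_action`, the penalty weight `k`, and the numbers. It is not taken from the quoted post, and it only shows the "if uncertain, assume it is bad" rule, not that the rule suffices for alignment.

```python
from statistics import mean, pstdev


def conservative_reward(estimates, k=1.0):
    """Pessimistic score: mean estimate minus k standard deviations.

    `estimates` is a hypothetical list of reward guesses for one action,
    e.g. from several reward models that may disagree with each other.
    """
    return mean(estimates) - k * pstdev(estimates)


def pick_action(candidates, k=1.0):
    """Return the action name whose pessimistic score is highest."""
    return max(candidates, key=lambda name: conservative_reward(candidates[name], k))


# Hypothetical numbers: the novel action has a higher average estimate,
# but the estimates disagree a lot, so the conservative rule avoids it.
candidates = {
    "familiar": [0.6, 0.6, 0.6, 0.6],
    "novel": [1.5, -0.5, 1.4, 0.4],
}

print(pick_action(candidates, k=1.0))  # conservative choice
print(pick_action(candidates, k=0.0))  # choice when uncertainty is ignored
```

With `k=0` the rule reduces to picking the highest average estimate; raising `k` trades expected reward for staying inside the region where the estimates agree.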

See also this link for more, but I think that's the gist for why I expect AI alignment to generalize much further than AI capabilities. I'd further add that I think evolutionary psychology got this very wrong, and predicted much more complex and fragile values in humans than is actually the case:

https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/

11. If cognitive machinery doesn't generalize far out of the distribution where you did tons of training, it can't solve problems on the order of 'build nanotechnology' where it would be too expensive to run a million training runs of failing to build nanotechnology. There is no pivotal act this weak; there's no known case where you can entrain a safe level of ability on a safe environment where you can cheaply do millions of runs, and deploy that capability to save the world and prevent the next AGI project up from destroying the world two years later. Pivotal weak acts like this aren't known, and not for want of people looking for them. So, again, you end up needing alignment to generalize way out of the training distribution - not just because the training environment needs to be safe, but because the training environment probably also needs to be cheaper than evaluating some real-world domain in which the AGI needs to do some huge act. You don't get 1000 failed tries at burning all GPUs - because people will notice, even leaving out the consequences of capabilities success and alignment failure.

This is covered by my points on why alignment generalizes further than capabilities, why we don't need pivotal acts, and why we actually have safe testing grounds for deceptive AI.

15. Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously. Given otherwise insufficient foresight by the operators, I'd expect a lot of those problems to appear approximately simultaneously after a sharp capability gain. See, again, the case of human intelligence. We didn't break alignment with the 'inclusive reproductive fitness' outer loss function, immediately after the introduction of farming - something like 40,000 years into a 50,000 year Cro-Magnon takeoff, as was itself running very quickly relative to the outer optimization loop of natural selection. Instead, we got a lot of technology more advanced than was in the ancestral environment, including contraception, in one very fast burst relative to the speed of the outer optimization loop, late in the general intelligence game. We started reflecting on ourselves a lot more, started being programmed a lot more by cultural evolution, and lots and lots of assumptions underlying our alignment in the ancestral training environment broke simultaneously. (People will perhaps rationalize reasons why this abstract description doesn't carry over to gradient descent; eg, “gradient descent has less of an information bottleneck”. My model of this variety of reader has an inside view, which they will label an outside view, that assigns great relevance to some other data points that are not observed cases of an outer optimization loop producing an inner general intelligence, and assigns little importance to our one data point actually featuring the phenomenon in question. When an outer optimization loop actually produced general intelligence, it broke alignment after it turned general, and did so relatively late in the game of that general intelligence accumulating capability and knowledge, almost immediately before it turned 'lethally' dangerous relative to the outer optimization loop of natural selection. 
Consider skepticism, if someone is ignoring this one warning, especially if they are not presenting equally lethal and dangerous things that they say will go wrong instead.)

Re the sharp capability gain breaking alignment properties: one very crucial advantage we have over evolution is that our goals are much more densely defined, constraining the AI more than evolution did, where very, very sparse reward was the norm. Critically, sparse-reward RL does not work for capabilities right now, and there are reasons to think it will be way less tractable than RL where rewards are more densely specified.

Another advantage we have over evolution, and over chimpanzees/gorillas/orangutans, is far, far more control over the AIs' data sources, which strongly influence their goals.

The following quote also helps explain the differences between dense and sparse RL rewards:

This also means that minimal-instrumentality training objectives may suffer from reduced capability compared to an optimization process where you had more open, but still correctly specified, bounds. This seems like a necessary tradeoff in a context where we don't know how to correctly specify bounds.
Fortunately, this seems to still apply to capabilities at the moment- the expected result for using RL in a sufficiently unconstrained environment often ranges from "complete failure" to "insane useless crap." It's notable that some of the strongest RL agents are built off of a foundation of noninstrumental world models.
https://www.lesswrong.com/posts/rZ6wam9gFGFQrCWHc/#mT792uAy4ih3qCDfx [LW · GW]
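To make the dense versus sparse reward distinction above concrete, here is a toy Python sketch. It is not reinforcement learning: a greedy hill-climber on a number line stands in for a learner, and the function names, the goal position, and the step budget are all hypothetical. The only thing it is meant to show is that a densely defined reward gives a usable signal at every state, while a sparse one gives a signal only right next to the goal.

```python
def dense_reward(state, goal):
    """Reward defined everywhere: closer to the goal is always better."""
    return -abs(goal - state)


def sparse_reward(state, goal):
    """Reward defined only at the goal itself."""
    return 1.0 if state == goal else 0.0


def greedy_climb(reward, start, goal, max_steps):
    """Step to a neighboring state only if it strictly improves the reward.

    Stops as soon as neither neighbor is an improvement, i.e. when the
    reward gives no direction to move in.
    """
    state = start
    for _ in range(max_steps):
        best = max((state - 1, state + 1), key=lambda s: reward(s, goal))
        if reward(best, goal) <= reward(state, goal):
            break
        state = best
    return state


# Same learner, same start, same goal; only the reward density differs.
print(greedy_climb(dense_reward, start=0, goal=10, max_steps=50))
print(greedy_climb(sparse_reward, start=0, goal=10, max_steps=50))
```

Under the dense reward the climber walks to the goal; under the sparse reward it stops at the start because every nearby state looks equally good. Real sparse-reward RL uses exploration rather than pure greed, so this is only a sketch of why the signal is harder to find.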

16. Even if you train really hard on an exact loss function, that doesn't thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction. This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again: the first semi-outer-aligned solutions found, in the search ordering of a real-world bounded optimization process, are not inner-aligned solutions. This is sufficient on its own, even ignoring many other items on this list, to trash entire categories of naive alignment proposals which assume that if you optimize a bunch on a loss function calculated using some simple concept, you get perfect inner alignment on that concept.

Yeah, I covered this above, but evolution's loss function was neither that simple, compared to human goals, nor that exact: it was ridiculously inexact compared to our attempts to optimize AIs' loss functions, for the reasons I gave above.

17. More generally, a superproblem of 'outer optimization doesn't produce inner alignment' is that on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they're there, rather than just observable outer ones you can run a loss function over. This is a problem when you're trying to generalize out of the original training distribution, because, eg, the outer behaviors you see could have been produced by an inner-misaligned system that is deliberately producing outer behaviors that will fool you. We don't know how to get any bits of information into the inner system rather than the outer behaviors, in any systematic or general way, on the current optimization paradigm.

I've answered that concern above, in the discussion of synthetic data, which is why we have the ability to get particular inner behaviors into a system.

18. There's no reliable Cartesian-sensory ground truth (reliable loss-function-calculator) about whether an output is 'aligned', because some outputs destroy (or fool) the human operators and produce a different environmental causal chain behind the externally-registered loss function. That is, if you show an agent a reward signal that's currently being generated by humans, the signal is not in general a reliable perfect ground truth about how aligned an action was, because another way of producing a high reward signal is to deceive, corrupt, or replace the human operators with a different causal system which generates that reward signal. When you show an agent an environmental reward signal, you are not showing it something that is a reliable ground truth about whether the system did the thing you wanted it to do; even if it ends up perfectly inner-aligned on that reward signal, or learning some concept that exactly corresponds to 'wanting states of the environment which result in a high reward signal being sent', an AGI strongly optimizing on that signal will kill you, because the sensory reward signal was not a ground truth about alignment (as seen by the operators).
19. More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment - to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward. This isn't to say that nothing in the system’s goal (whatever goal accidentally ends up being inner-optimized over) could ever point to anything in the environment by accident. Humans ended up pointing to their environments at least partially, though we've got lots of internally oriented motivational pointers as well. But insofar as the current paradigm works at all, the on-paper design properties say that it only works for aligning on known direct functions of sense data and reward functions. All of these kill you if optimized-over by a sufficiently powerful intelligence, because they imply strategies like 'kill everyone in the world using nanotech to strike before they know they're in a battle, and have control of your reward button forever after'. It just isn't true that we know a function on webcam input such that every world with that webcam showing the right things is safe for us creatures outside the webcam. This general problem is a fact about the territory, not the map; it's a fact about the actual environment, not the particular optimizer, that lethal-to-us possibilities exist in some possible environments underlying every given sense input.

The points were covered above, but in short: synthetic data early in training + densely defined reward/utility functions = alignment, because the AIs don't yet know how to fool humans at the stage when they get data corresponding to values.

21. There's something like a single answer, or a single bucket of answers, for questions like 'What's the environment really like?' and 'How do I figure out the environment?' and 'Which of my possible outputs interact with reality in a way that causes reality to have certain properties?', where a simple outer optimization loop will straightforwardly shove optimizees into this bucket. When you have a wrong belief, reality hits back at your wrong predictions. When you have a broken belief-updater, reality hits back at your broken predictive mechanism via predictive losses, and a gradient descent update fixes the problem in a simple way that can easily cohere with all the other predictive stuff. In contrast, when it comes to a choice of utility function, there are unbounded degrees of freedom and multiple reflectively coherent fixpoints. Reality doesn't 'hit back' against things that are locally aligned with the loss function on a particular range of test cases, but globally misaligned on a wider range of test cases. This is the very abstract story about why hominids, once they finally started to generalize, generalized their capabilities to Moon landings, but their inner optimization no longer adhered very well to the outer-optimization goal of 'relative inclusive reproductive fitness' - even though they were in their ancestral environment optimized very strictly around this one thing and nothing else. This abstract dynamic is something you'd expect to be true about outer optimization loops on the order of both 'natural selection' and 'gradient descent'. The central result: Capabilities generalize further than alignment once capabilities start to generalize far.

The key is that data on values is what constrains the choice of utility functions, and while values aren't in physics, they are in human books, and I've explained why alignment generalizes further than capabilities.

22. There's a relatively simple core structure that explains why complicated cognitive machines work; which is why such a thing as general intelligence exists and not just a lot of unrelated special-purpose solutions; which is why capabilities generalize after outer optimization infuses them into something that has been optimized enough to become a powerful inner optimizer. The fact that this core structure is simple and relates generically to low-entropy high-structure environments is why humans can walk on the Moon. There is no analogous truth about there being a simple core of alignment, especially not one that is even easier for gradient descent to find than it would have been for natural selection to just find 'want inclusive reproductive fitness' as a well-generalizing solution within ancestral humans. Therefore, capabilities generalize further out-of-distribution than alignment, once they start to generalize at all.

I think that there is actually a simple core of alignment to human values. A lot of why I believe this is that I believe about 80-90%, if not more, of our values are broadly shaped by the data rather than the prior, and that the same algorithms that power our capabilities are also used to influence our values, though the data matters much more than the algorithm for what values you have.

More generally, I've become convinced that evopsych was mostly wrong about how humans form values and how they get their capabilities, in ways that are very alignment-relevant.

I also disbelieve the claim that humans had a special algorithm that other species don't have, and broadly think human success was due to more compute, data and cultural evolution.

23. Corrigibility is anti-natural to consequentialist reasoning; "you can't bring the coffee if you're dead" for almost every kind of coffee. We (MIRI) tried and failed [LW · GW] to find a coherent formula for an agent that would let itself be shut down (without that agent actively trying to get shut down). Furthermore, many anti-corrigible lines of reasoning like this may only first appear at high levels of intelligence.

Alright, while I think your formalizations of corrigibility failed to get any results, I do think there's a property close to corrigibility that is likely to be compatible with consequentialist reasoning, and that's instruction following, and there are reasons to think that instruction following and consequentialist reasoning go together:

https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than [LW · GW]

https://www.lesswrong.com/posts/ZdBmKvxBKJH2PBg9W/corrigibility-or-dwim-is-an-attractive-primary-goal-for-agi [LW · GW]

https://www.lesswrong.com/posts/k48vB92mjE9Z28C3s/implied-utilities-of-simulators-are-broad-dense-and-shallow [LW · GW]

https://www.lesswrong.com/posts/EBKJq2gkhvdMg5nTQ/instrumentality-makes-agents-agenty [LW · GW]

https://www.lesswrong.com/posts/vs49tuFuaMEd4iskA/one-path-to-coherence-conditionalization [LW · GW]

24. There are two fundamentally different approaches you can potentially take to alignment, which are unsolvable for two different sets of reasons; therefore, by becoming confused and ambiguating between the two approaches, you can confuse yourself about whether alignment is necessarily difficult. The first approach is to build a CEV-style Sovereign which wants exactly what we extrapolated-want and is therefore safe to let optimize all the future galaxies without it accepting any human input trying to stop it. The second course is to build corrigible AGI which doesn't want exactly what we want, and yet somehow fails to kill us and take over the galaxies despite that being a convergent incentive there.
The first thing generally, or CEV specifically, is unworkable because the complexity of what needs to be aligned or meta-aligned for our Real Actual Values is far out of reach for our FIRST TRY at AGI. Yes I mean specifically that the dataset, meta-learning algorithm, and what needs to be learned, is far out of reach for our first try. It's not just non-hand-codable, it is unteachable on-the-first-try because the thing you are trying to teach is too weird and complicated.
The second thing looks unworkable (less so than CEV, but still lethally unworkable) because corrigibility runs actively counter to instrumentally convergent behaviors within a core of general intelligence (the capability that generalizes far out of its original distribution). You're not trying to make it have an opinion on something the core was previously neutral on. You're trying to take a system implicitly trained on lots of arithmetic problems until its machinery started to reflect the common coherent core of arithmetic, and get it to say that as a special case 222 + 222 = 555. You can maybe train something to do this in a particular training distribution, but it's incredibly likely to break when you present it with new math problems far outside that training distribution, on a system which successfully generalizes capabilities that far at all.

I'm very skeptical that a CEV exists for the reasons @Steven Byrnes [LW · GW] addresses in the Valence sequence here:

https://www.lesswrong.com/posts/SqgRtCwueovvwxpDQ/valence-series-2-valence-and-normativity#2_7_Moral_reasoning [LW · GW]

But it is also unnecessary for value learning, because of the data on human values and alignment generalizing farther than capabilities.

I addressed why we don't need a first try above.

For the point on corrigibility, I disagree that it's like training it to say that as a special case 222 + 222 = 555, for 2 reasons:

1. I think instrumental convergence pressures are quite a lot weaker than you do.
2. Instruction following can be pretty easily done with synthetic data, and more importantly I think that you can have optimizers whose goals point to another's goals.

25. We've got no idea what's actually going on inside the giant inscrutable matrices and tensors of floating-point numbers. Drawing interesting graphs of where a transformer layer is focusing attention doesn't help if the question that needs answering is "So was it planning how to kill us or not?"

I disagree with this, but I do think that mechanistic interpretability does have lots of work to do.

28. The AGI is smarter than us in whatever domain we're trying to operate it inside, so we cannot mentally check all the possibilities it examines, and we cannot see all the consequences of its outputs using our own mental talent. A powerful AI searches parts of the option space we don't, and we can't foresee all its options.
29. The outputs of an AGI go through a huge, not-fully-known-to-us domain (the real world) before they have their real consequences. Human beings cannot inspect an AGI's output to determine whether the consequences will be good.

The key disagreement is that I believe we don't need to check all the possibilities, and that even for smarter AIs, we can almost certainly still verify their work. More generally, I believe verification is way, way easier than generation.
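A standard toy case of the verification/generation gap I have in mind is subset sum. The sketch below is only an illustration under assumptions: the problem choice, the function names `verify` and `generate`, and the example numbers are mine, and the code shows the asymmetry for this one problem, not that it carries over to judging an AGI's real-world outputs.

```python
from itertools import combinations


def verify(numbers, target, indices):
    """Check a proposed answer: distinct, in-range indices summing to target.

    The work is proportional to the size of the proposed answer.
    """
    distinct = len(set(indices)) == len(indices)
    in_range = all(0 <= i < len(numbers) for i in indices)
    return distinct and in_range and sum(numbers[i] for i in indices) == target


def generate(numbers, target):
    """Brute-force search for an answer.

    Returns (indices, candidates_examined); in the worst case this looks
    at every one of the 2**len(numbers) subsets.
    """
    examined = 0
    for size in range(len(numbers) + 1):
        for indices in combinations(range(len(numbers)), size):
            examined += 1
            if sum(numbers[i] for i in indices) == target:
                return indices, examined
    return None, examined


numbers = [3, 34, 4, 12, 5, 2]
answer, examined = generate(numbers, target=9)
print(answer, examined)             # the search had to try many candidates
print(verify(numbers, 9, answer))   # checking the answer is one short pass
```

The checker never needs to know how the answer was found, which is the property that matters for overseeing a system that searches parts of the option space we don't.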

32. Human thought partially exposes only a partially scrutable outer surface layer. Words only trace our real thoughts. Words are not an AGI-complete data representation in its native style. The underparts of human thought are not exposed for direct imitation learning and can't be put in any dataset. This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents, which are only impoverished subsystems of human thoughts; unless that system is powerful enough to contain inner intelligences figuring out the humans, and at that point it is no longer really working as imitative human thought.

I basically disagree with this, both with the assumption that language is very weak and, more importantly, because I believe no AGI-complete problems are left, for the following reasons quoted from Near-mode thinking on AI:

"But for the more important insight: The history of AI is littered with the skulls of people who claimed that some task is AI-complete, when in retrospect this has been obviously false. And while I would have definitely denied that getting IMO gold would be AI-complete, I was surprised by the narrowness of the system DeepMind used."
"I think I was too much in the far-mode headspace of one needing Real Intelligence - namely, a foundation model stronger than current ones - to do well on the IMO, rather than thinking near-mode "okay, imagine DeepMind took a stab at the IMO; what kind of methods would they use, and how well would those work?"
"I also updated away from a "some tasks are AI-complete" type of view, towards "often the first system to do X will not be the first systems to do Y".
I've come to realize that being "superhuman" at something is often much more mundane than I've thought. (Maybe focusing on full superintelligence - something better than humanity on practically any task of interest - has thrown me off.)"
Like:
"In chess, you can just look a bit more ahead, be a bit better at weighting factors, make a bit sharper tradeoffs, make just a bit fewer errors. If I showed you a video of a robot that was superhuman at juggling, it probably wouldn't look all that impressive to you (or me, despite being a juggler). It would just be a robot juggling a couple balls more than a human can, throwing a bit higher, moving a bit faster, with just a bit more accuracy. The first language models to be superhuman at persuasion won't rely on any wildly incomprehensible pathways that break the human user (c.f. List of Lethalities, items 18 and 20). They just choose their words a bit more carefully, leverage a bit more information about the user in a bit more useful way, have a bit more persuasive writing style, being a bit more subtle in their ways. (Indeed, already GPT-4 is better than your average study participant in persuasiveness.) You don't need any fundamental breakthroughs in AI to reach superhuman programming skills. Language models just know a lot more stuff, are a lot faster and cheaper, are a lot more consistent, make fewer simple bugs, can keep track of more information at once. (Indeed, current best models are already useful for programming.) (Maybe these systems are subhuman or merely human-level in some aspects, but they can compensate for that by being a lot better on other dimensions.)"
"As a consequence, I now think that the first transformatively useful AIs could look behaviorally quite mundane."

https://www.lesswrong.com/posts/ASLHfy92vCwduvBRZ/near-mode-thinking-on-ai [LW · GW]

To address an epistemic point:

39. I figured this stuff out using the null string as input, and frankly, I have a hard time myself feeling hopeful about getting real alignment work out of somebody who previously sat around waiting for somebody else to input a persuasive argument into them. This ability to "notice lethal difficulties without Eliezer Yudkowsky arguing you into noticing them" currently is an opaque piece of cognitive machinery to me, I do not know how to train it into others. It probably relates to 'security mindset', and a mental motion where you refuse to play out scripts, and being able to operate in a field that's in a state of chaos.

You cannot actually do this and hope to get any quality of reasoning, for the same reason that you can't update on nothing/no evidence.

The data matters way more than you think, and there's no algorithm that can figure out stuff with 0 data, and Eric Drexler didn't figure out nanotechnology using the null string as input.

This should have been a much larger red flag for problems, but people somehow didn't realize how wrong this claim was.

And that's the end of my very long comment on the problems with this post.

Replies from: MondSemmel, quetzal_rainbow

↑ comment by MondSemmel · 2024-09-15T12:57:57.211Z · LW(p) · GW(p)

I wish you'd made this a top-level post; the ultra-long quote excerpts in a comment made it ~unreadable to me. And you don't benefit from stuff like bigger font size or automatic table of contents. And scrolling to the correct position on this long comment thread also works poorly, etc.

Anyway, I read your rebuttals on the first two points and did not find them persuasive (thus resulting in a strong disagree-vote on the whole comment). So now I'm curious about the upvotes without accompanying discussion. Did others find this rebuttal more persuasive?

Replies from: sharmake-farah

↑ comment by Noosphere89 (sharmake-farah) · 2024-09-15T17:24:56.985Z · LW(p) · GW(p)

I made this a top level post, and fixed the formatting and quoting:

https://www.lesswrong.com/posts/wkFQ8kDsZL5Ytf73n/my-disagreements-with-agi-ruin-a-list-of-lethalities [LW · GW]

↑ comment by quetzal_rainbow · 2024-09-15T12:50:08.238Z · LW(p) · GW(p)

I think that you should have addressed points by referring to the number of each point, quoting only the parts that are easier to quote than to refer to; it would have reduced the size of the comment.

I am going to address only one object-level point:

synthetic data letting us control what the AI learns and what they value

No, obviously, we can't control what AI learns and values using synthetic data in practice, because we need AI to learn things that we don't know. If you feed AI all physics and chemistry data with expectation to get nanotech, you are doing this because you expect that AI learns facts and principles you don't know about and, therefore, can't control. You don't know about these facts and principles and can't control them because otherwise you would be able to design nanotech yourself.

Of course, I'm saying "can't" meaning "practically can't", not "in principle". But to do this you need to do basically "GOFAI in trenchcoat of SGD" and it doesn't look competitive with any other method of achieving AGI, unless you manage to make yourself AGI Czar.

Replies from: sharmake-farah

↑ comment by Noosphere89 (sharmake-farah) · 2024-09-15T15:29:06.318Z · LW(p) · GW(p)

Okay, the reason for this happening:

If you feed AI all physics and chemistry data with expectation to get nanotech, you are doing this because you expect that AI learns facts and principles you don't know about and, therefore, can't control. You don't know about these facts and principles and can't control them because otherwise you would be able to design nanotech yourself.

This is basically a combo of very high sample efficiency, defining a good-enough ground truth reward signal, very good online learning, very good credit assignment and handling uncertainty and simulatability well.

But for our purposes, if we decided that we didn't in fact want to learn nanotech, we could just remove the data from its experience, in a way we couldn't do with humans, which is quite a big win for misuse concerns.

But my point here was that you can get large sets of data on values early on in training, and we can both iteratively refine on values by testing the model's generalization of the value data to new situations, as well as rely on the fact that alignment generalizes further than capabilities does.

I think my crux here is this:

Of course, I'm saying "can't" meaning "practically can't", not "in principle". But to do this you need to do basically "GOFAI in trenchcoat of SGD" and it doesn't look competitive with any other method of achieving AGI, unless you manage to make yourself AGI Czar.

I think this is just not correct, and while we should start making large datasets now, I think a crux here is that I believe that far less data is necessary for models to generalize alignment, and that we aren't trying to hand-code everything, and instead rely on the fact that models will generalize better and better on human values as they get more capable, due to alignment generalizing further than capabilities and there likely being a simple core to alignment, so I don't think we need a GOFAI in trenchcoat of SGD.

We've discussed this before https://x.com/quetzal_rainbow/status/1834268698565059031, but while I agree with TurnTrout that RL doesn't maximize reward by definition, and the reward maximization hypothesis isn't an automatic consequence of RL training, I do think that something like reward maximization might well occur in practice, and more generally I think that the post ignores the possibility that future RL could generalize better towards maximizing the reward function.

Replies from: quetzal_rainbow

↑ comment by quetzal_rainbow · 2024-09-15T16:48:04.491Z · LW(p) · GW(p)

(It seems like "here" link got mixed with the word "here"?)

Replies from: sharmake-farah

↑ comment by Noosphere89 (sharmake-farah) · 2024-09-15T16:53:31.196Z · LW(p) · GW(p)

Alright, I fixed the link, though I don't know why you can't transform non-Lesswrong links into links that have a shorter title link.

comment by aog (Aidan O'Gara) · 2022-06-07T01:06:41.429Z · LW(p) · GW(p)

...

comment by habryka (habryka4) · 2022-06-06T06:01:40.665Z · LW(p) · GW(p)

Mod note: I activated two-axis voting on this post, since it seemed like it would make the conversation go better.

Replies from: Eliezer_Yudkowsky, lc

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-06T06:40:45.820Z · LW(p) · GW(p)

I agree.

↑ comment by lc · 2022-06-06T06:31:01.230Z · LW(p) · GW(p)

You should just activate it sitewide already :)

Replies from: habryka4

↑ comment by habryka (habryka4) · 2022-06-06T06:43:55.527Z · LW(p) · GW(p)

New users are pretty confused by it when I've done some user-testing with it, so I think it needs some polish and better UI before we can launch it sitewide, but I am pretty excited about doing so after that.

Replies from: harry-nyquist, handoflixue

↑ comment by Harry Nyquist (harry-nyquist) · 2022-06-07T20:42:30.261Z · LW(p) · GW(p)

As a very new user, I'm not sure if it's still helpful to add a data point if user testing's already been done, but it seems at worst mostly harmless.

I saw the mod note before I started using the votes on this post. My first idea was to Google the feature, but that returned nothing relevant (while writing this post, I did find results immediately through site search). I was confused for a short while trying to place the axes & imagine where I'd vote in opposite directions. But after a little bit of practice looking at comments, it started making sense.

I've read a couple comments on this article that I agree with, where it seems very meaningful for me to downvote them (I interpret the downvote's meaning when both axes are on as low quality, low importance, should be read less often).

I relatively easily find posts I want to upvote on karma. But for posts that I upvote, I'm typically much less confident about voting on agreement than for other posts (as a new user, it's harder to assess the specific points made in high quality posts).
Posts where I'm not confident voting on agreement correlate with posts I'm not confident I can reply to without lowering the level of debate.

Unfortunately, the further the specific points that are made are from my comfort/knowledge zone, the less I become able to tell nonsense from sophistication.
It seems bad if my karma vote density centers on somewhat-good posts to the exclusion of very good and very bad posts. This makes me err on the side of upvoting posts I don't truly understand. I think that should be robust, since new user votes seem to carry less weight and I expect overrated nonsense to be corrected quickly, but it still seems suboptimal.

It's also unclear to me whether agreement-voting factors in the sorting order. I predict it doesn't, and I would want to change how I vote if it did.
Overall, I don't have a good sense of how much value I get out of seeing both axes, but on this post I do like voting with both. It feels a little nicer, though I don't have a strong preference.

↑ comment by handoflixue · 2022-06-07T05:49:56.767Z · LW(p) · GW(p)

For what it's worth, I haven't used the site in years and I picked it up just from this thread and the UI tooltips. The most confusing thing was realizing "okay, there really are two different types of vote" since I'd never encountered that before, but I can't think of much that would help (maybe mention it in the tooltip, or highlight them until the user has interacted with both?)

Looking forward to it as a site-wide feature - just from seeing it at work here, it seems like a really useful addition to the site

comment by Andrew_Critch · 2022-06-13T20:20:21.774Z · LW(p) · GW(p)

Eliezer, thanks for sharing these ideas so that more people can be on the lookout for failures. Personally, I think something like 15% of AGI dev teams (weighted by success probability) would destroy the world more-or-less immediately, and I think it's not crazy to think the fraction is more like 90% or higher (which I judge to be your view).

FWIW, I do not agree with the following stance, because I think it exposes the world to more x-risk:

So far as I'm concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering task, with a less than fifty percent chance of killing more than one billion people, I'll take it.

Specifically, I think a considerable fraction of the remaining AI x-risk facing humanity stems from people pulling desperate (unsafe) moves with AGI to head off other AGI projects. So, in that regard, I think that particular comment of yours is probably increasing x-risk a bit. If I were a 90%-er like you, it's possible I'd endorse it, but even then it might make things worse by encouraging more desperate unilateral actions.

That said, overall I think this post is a big help, because it helps to put responsibility in the hands of more people to not do the crazy/stupid/reckless things you're describing here... and while I might disagree on the fraction/probability, I agree that some groups would destroy humanity more or less immediately if they developed AGI. And, while I might disagree on some of the details of how human extinction eventually plays out, I do think human extinction remains the default outcome of humanity's path toward replacing itself with automation, probably within our lifetimes unfortunately.

Replies from: TekhneMakre, None

↑ comment by TekhneMakre · 2022-06-13T20:40:09.511Z · LW(p) · GW(p)

a considerable fraction of the remaining AI x-risk facing humanity stems from people pulling desperate (unsafe) moves with AGI to head off other AGI projects

In your post “Pivotal Act” Intentions [LW · GW], you wrote that you disagree with contributing to race dynamics by planning to invasively shut down AGI projects because AGI projects would, in reaction, try to maintain

the ability to implement their own pet theories on how safety/alignment should work, leading to more desperation, more risk-taking, and less safety overall.

Could you give some kind of very rough estimates here? How much more risk-taking do you expect in a world given how much / how many prominent "AI safety"-affiliated people declaring invasive pivotal act intentions? How much risk-taking do you expect in the alternative, where there are other pressures (economic, military, social, whatever), but not pressure from pivotal act threats? How much safety (probability of AGI not killing everyone) do you think this buys? You write:

15% of AGI dev teams (weighted by success probability) would destroy the world more-or-less immediately

What about non-immediately, in each alternative?

↑ comment by [deleted] · 2022-06-14T15:13:57.235Z · LW(p) · GW(p)

comment by trevor (TrevorWiesinger) · 2022-06-06T03:55:05.518Z · LW(p) · GW(p)

If someone could find a way to rewrite this post, except in language comprehensible to policymakers, tech executives, or ML researchers, then it would probably achieve a lot.

Replies from: RobbBB, TrevorWiesinger

↑ comment by Rob Bensinger (RobbBB) · 2022-06-06T06:05:06.111Z · LW(p) · GW(p)

Yes, please do rewrite the post, or make your own version of a post like this!! :) I don't suggest trying to persuade arbitrary policymakers of AGI risk, but I'd be very keen on posts like this optimized to be clear and informative to different audiences. Especially groups like 'lucid ML researchers who might go into alignment research', 'lucid mathematicians, physicists, etc. who might go into alignment research', etc.

Replies from: Thane Ruthenis, michael-grosse

↑ comment by Thane Ruthenis · 2022-06-06T10:31:51.859Z · LW(p) · GW(p)

Suggestion: make it a CYOA-style interactive piece, where the reader is tasked with aligning AI, and could choose from a variety of approaches which branch out into sub-approaches and so on. All of the paths, of course, bottom out in everyone dying, with detailed explanations of why. This project might then evolve based on feedback, adding new branches that counter counter-arguments made by people who played it and weren't convinced. Might also make several "modes", targeted at ML specialists, general public, etc., where the text makes different tradeoffs regarding technicality vs. vividness.

I'd do it myself (I'd had the idea of doing it before this post came out, and my preliminary notes covered much of the same ground, I feel the need to smugly say), but I'm not at all convinced that this is going to be particularly useful. Attempts to defeat the opposition by building up a massive evolving database of counter-arguments have been made in other fields, and so far as I know, they never convinced anybody.

The interactive factor would be novel (as far as I know), but I'm still skeptical.

(A... different implementation might be to use a fine-tuned language model for this; make it an AI Dungeon kind of setup, where it provides specialized counter-arguments for any suggestion. But I expect it to be less effective than a more coarse hand-written CYOA, since the readers/players would know that the thing they're talking to has no idea what it's talking about, so would disregard its words.)

Replies from: Eliezer_Yudkowsky, CronoDAS

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-06T19:37:41.921Z · LW(p) · GW(p)

Arbital was meant to support galaxy-brained attempts like this; Arbital failed.

Replies from: Thane Ruthenis

↑ comment by Thane Ruthenis · 2022-06-06T20:35:27.384Z · LW(p) · GW(p)

Failed as a platform for hosting galaxy-brained attempts, or failed as in every similar galaxy-brained attempt on it failed? I haven't spent a lot of time there, but my impression is that Arbital is mostly a wiki-style collection of linked articles, not a dumping ground of standalone esoterically-structured argumentative pieces. And while a wiki is conceptually similar, presentation matters a lot. A focused easily-traversable tree of short-form arguments in a wrapper that encourages putting yourself in the shoes of someone trying to fix the problem may prove more compelling.

(Not to make it sound like I'm particularly attached to the idea after all. But there's a difference between "brilliant idea that probably won't work" and "brilliant idea that empirically failed".)

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-06T21:36:58.789Z · LW(p) · GW(p)

Arbital was a very conjunctive project, trying to do many different things, with a specific team, at a specific place and time. I wouldn't write off all Arbital-like projects based on that one data point, though I update a lot more if there are lots of other Arbital-ish things that also failed.

Replies from: ESRogs

↑ comment by ESRogs · 2022-06-10T05:13:06.044Z · LW(p) · GW(p)

As a person who worked on Arbital, I agree with this.

↑ comment by CronoDAS · 2022-06-07T02:55:31.416Z · LW(p) · GW(p)

All of the paths, of course, bottom out in everyone dying, with detailed explanations of why.

A strange game. The only winning move is not to play. ;)

Replies from: Thane Ruthenis

↑ comment by Thane Ruthenis · 2022-06-07T04:26:15.816Z · LW(p) · GW(p)

I guess we should also kidnap people and force them to play it, and if they don't succeed we kill them? For realism? Wait, there's something wrong with this plan.

More seriously, yeah, if you're implementing it more like a game and less like an interactive article, it'd need to contain some promise of winning. Haven't considered how to do it without compromising the core message.

Replies from: adam-bliss

↑ comment by AdamB (adam-bliss) · 2022-06-15T13:23:33.298Z · LW(p) · GW(p)

What if "winning" consists of finding a new path not already explored-and-foreclosed? For example, each time you are faced with a list of choices of what to do, there's a final choice "I have an idea not listed here" where you get to submit a plan of action. This goes into a moderation engine where a chain of people get to shoot down the idea or approve it to pass up the chain. If the idea gets convincingly shot down (but still deemed interesting), it gets added to the story as a new branch. If it gets to the top of the moderation chain and makes EY go "Hm, that might work" then you win the game.
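The review flow described above can be sketched roughly as follows. This is only an illustrative sketch, not anything proposed in the thread: the names `Verdict`, `Branch`, and `review`, and the three return labels, are all hypothetical, and each moderator is assumed to be a plain function that returns a verdict plus a rebuttal text.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Verdict(Enum):
    APPROVE = auto()             # pass the idea further up the chain
    REJECT_INTERESTING = auto()  # shot down, but worth keeping as a branch
    REJECT = auto()              # shot down and not worth keeping


@dataclass
class Branch:
    """One node of the story tree: a plan and the explanation of why it fails."""
    plan: str
    rebuttal: str = ""
    children: list = field(default_factory=list)


def review(idea, parent, moderators):
    """Send a reader-submitted idea up a chain of moderators (hypothetical design).

    Each moderator is a callable taking the idea and returning
    (Verdict, rebuttal_text). Returns "win" if every moderator approves,
    "new_branch" if the idea was shot down but added under `parent`,
    and "rejected" otherwise.
    """
    for moderator in moderators:
        verdict, rebuttal = moderator(idea)
        if verdict is Verdict.APPROVE:
            continue  # next moderator in the chain
        if verdict is Verdict.REJECT_INTERESTING:
            parent.children.append(Branch(plan=idea, rebuttal=rebuttal))
            return "new_branch"
        return "rejected"
    return "win"


# Toy usage with made-up moderators; real ones would be people, not rules.
def first_reader(idea):
    if "box" in idea:
        return Verdict.REJECT_INTERESTING, "placeholder rebuttal"
    return Verdict.APPROVE, ""


def final_reader(idea):
    return Verdict.APPROVE, ""


root = Branch(plan="start")
outcome = review("keep it in a box", root, [first_reader, final_reader])
```

In this sketch the story only grows when an idea is rejected with a recorded rebuttal, which matches the "convincingly shot down (but still deemed interesting)" case; how a moderator decides what counts as interesting is left entirely open.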

Replies from: Thane Ruthenis

↑ comment by Thane Ruthenis · 2022-06-15T23:00:13.914Z · LW(p) · GW(p)

Mmm. If the CYOA idea is implemented as a quirky-but-primarily-educational article, then sure, integrating the "adapt to feedback" capability like this would be worthwhile. Might also attach a monetary prize to submitting valuable ideas, by analogy to the ELK contest.

For a game-like implementation, where you'd be playing it partly for the fun/challenge of it, that wouldn't suffice. The feedback loop's too slow, and there'd be an ugh-field around the expectation that submitting a proposal would then require arguing with the moderators about it, defending it. It wouldn't feel like a game.

It'd make the upkeep cost pretty high, too, without a corresponding increase in the pay-off.

Just making it open-ended might work, even without the moderation engine? Track how many branches the player explored; once they've explored a lot (i.e., are expected to "get" the full scope of the problem), there appears an option for something like "I really don't know what to do, but we should keep trying", leading to some appropriately-subtle and well-integrated call to support alignment research?

Not excited about this approach either.

↑ comment by Celenduin (michael-grosse) · 2022-06-07T15:49:27.766Z · LW(p) · GW(p)

I wonder if we could be much more effective in outreach to these groups?

Like making sure that Robert Miles is sufficiently funded to have a professional team +20% (if that is not already the case). Maybe reaching out to Sabine Hossenfelder and sponsoring a video, or maybe collaborate with her for a video about this. Though I guess given her attitude towards the physics community, the work with her might be a gamble and two-edged sword. Can we get market research on what influencers have a high number of followers of ML researchers/physicists/mathematicians and then work with them / sponsor them?

Or maybe micro-target this demographic with facebook/google/github/stackexchange ads and point them to something?

I don't know, I'm not a marketing person, but I feel like I would have seen much more of these things if we were doing enough of them.

Not saying that this should be MIRI's job, rather stating that I'm confused because I feel like we as a community are not taking an action that would seem obvious to me. Especially given how recent advances in published AI capabilities seem to make the problem even more legible. Is the reason for not doing it really just that we're all a bunch of nerds who are bad at this kind of thing, or is there more to it that I'm missing?

While I see that there is a lot of risk associated with such outreach increasing the amount of noise, I wonder if that tradeoff might be shifting the shorter the timelines are getting and given that we don't seem to have better plans than "having a diverse set of smart people come up with novel ideas of their own in the hope that one of those works out". So taking steps to entice a somewhat more diverse group of people into the conversation might be worth it?

Replies from: Vaniver

↑ comment by Vaniver · 2022-06-07T16:36:34.563Z · LW(p) · GW(p)

Not saying that this should be MIRI's job, rather stating that I'm confused because I feel like we as a community are not taking an action that would seem obvious to me.

I wrote about this a bit before [LW(p) · GW(p)], but in the current world my impression is that actually we're pretty capacity-limited, and so the threshold is not "would be good to do" but "is better than my current top undone item". If you see something that seems good to do that doesn't have much in the way of unilateralist risk, you doing it is probably the right call. [How else is the field going to get more capacity?]

Replies from: RobbBB, michael-grosse

↑ comment by Rob Bensinger (RobbBB) · 2022-06-08T02:40:36.481Z · LW(p) · GW(p)

↑ comment by Celenduin (michael-grosse) · 2022-06-08T16:23:36.665Z · LW(p) · GW(p)

🤔

Not sure if I'm the right person, but it seems worth thinking about how one would maybe approach this if one were to do it.

So the idea is to have an AI-Alignment PR/Social Media org/group/NGO/think tank/company that has the goal to contribute to a world with a more diverse set of high-quality ideas about how to safely align powerful AI. The only other organization roughly in this space that I can think of would be 80,000 hours, which is also somewhat more general in its goals and more conservative in its strategies.

I'm not a sales/marketing person, but as I understand it, the usual metaphor to use here is a funnel?

- Starting with maybe ads / sponsoring trying to reach the right people[0] (e.g. I saw Jane Street sponsor Matt Parker)
- then more and more narrowing down first with introducing people to why this is an issue (orthogonality, instrumental convergence)
- hopefully having them realize for themselves, guided by arguments, that this is an issue that genuinely needs solving and maybe their skills would be useful
- increasing the math as needed
- finally, somehow selecting for self-reliance and providing a path for how to get started with thinking about this problem by themselves / model building / independent research
  - or otherwise improving the overall situation (convince your congress member of something? run for congress? ...)

Probably that would include copy writing (or hiring copywriters or contracting them) to go over a number of our documents to make them more digestible and actionable.

So, I'm probably not the right person to get this off the ground, because I don't have a clue about any of this (not even entrepreneurship in general), but it does seem like a thing worth doing and maybe like an initiative that would get funding from whoever funds such things these days?

[0] Though, maybe we should also look into a better understanding about who "the right people" are? Given that our current bunch of ML researchers/physicists/mathematicians were not able to solve it, maybe it would be time to consider broadening our net in a somehow responsible way.

Replies from: michael-grosse, Vaniver

↑ comment by Celenduin (michael-grosse) · 2022-06-08T16:28:31.297Z · LW(p) · GW(p)

On second thought: Don't we have orgs that work on AI governance/policy? I would expect them to have more likely the skills/expertise to pull this off, right?

Replies from: Vaniver

↑ comment by Vaniver · 2022-06-08T18:32:02.635Z · LW(p) · GW(p)

So, here's a thing that I don't think exists yet (or, at least, it doesn't exist enough that I know about it to link it to you). Who's out there, what 'areas of responsibility' do they think they have, what 'areas of responsibility' do they not want to have, what are the holes in the overall space? It probably is the case that there are lots of orgs that work on AI governance/policy, and each of them probably is trying to consider a narrow corner of space, instead of trying to hold 'all of it'.

So if someone says "I have an idea how we should regulate medical AI stuff--oh, CSET already exists, I should leave it to them", CSET's response will probably be "what? We focus solely on national security implications of AI stuff, medical regulation is not on our radar, let alone a place we don't want competition."

I should maybe note here there's a common thing I see in EA spaces that only sometimes makes sense, and so I want to point at it so that people can deliberately decide whether or not to do it. In selfish, profit-driven worlds, competition is the obvious thing to do; when someone else has discovered that you can make profits by selling lemonade, you should maybe also try to sell lemonade to get some of those profits, instead of saying "ah, they have lemonade handled." In altruistic, overall-success-driven worlds, competition is the obvious thing to avoid; there are so many undone tasks that you should try to find a task that no one is working on, and then work on that.

One downside is this means the eventual allocation of institutions / people to roles is hugely driven by inertia and 'who showed up when that was the top item in the queue' instead of 'who is the best fit now'. [This can be sensible if everyone 'came in as a generalist' and had to skill up from scratch, but still seems sort of questionable; even if people are generalists when it comes to skills, they're probably not generalists when it comes to personality.]

Another downside is that probably it makes more sense to have a second firm attempting to solve the biggest problem before you get a first firm attempting to solve the twelfth biggest problem. Having a sense of the various values of the different approaches--and how much they depend on each other, or on things that don't exist yet--might be useful.

↑ comment by Vaniver · 2022-06-08T18:04:55.287Z · LW(p) · GW(p)

Not sure if I'm the right person

...yet!

↑ comment by trevor (TrevorWiesinger) · 2023-04-09T19:27:46.636Z · LW(p) · GW(p)

I greatly regret writing this. Yud's work is not easily distilled; it's not written such that large amounts of distillation (~50%) add value, unless the person doing it is extremely competent. Hypothetically, it's very possible for a human to do, but empirically, everyone who has tried with this doc has failed (including me). For example, the clarifications/examples are necessary in order for the arguments to be properly cognitively operationalized; anything less is too vague. You could argue that the vast majority of this post is just one big clarification.

Summarized versions of these arguments clearly belong in other papers, especially papers comprehensible for policymakers, tech executives, or ML researchers. But I'm now pessimistic about the prospects of creating a summarized version of this post.

Replies from: Thane Ruthenis

↑ comment by Thane Ruthenis · 2023-04-09T19:52:22.680Z · LW(p) · GW(p)

I think one can write a variant of it that's a fair bit shorter and is even better at conveying the underlying gears-level model. My ideal version of it would start by identifying the small number of "background" core points that inform a lot of the individual entries on this list, comprehensively outlining them, then showing how various specific failures/examples mentioned here happen downstream of these core points; with each downstream example viscerally shown as the nigh-inevitable consequence of the initial soundly-established assumptions.

But yeah, it's a lot of work, and there are few people I'd trust to do it right.

comment by Vika · 2022-06-29T18:41:15.481Z · LW(p) · GW(p)

Thanks Eliezer for writing up this list, it's great to have these arguments in one place! Here are my quick takes (which mostly agree with Paul's response).

Section A (strategic challenges?):

Agree with #1-2 and #8. Agree with #3 in the sense that we can't iterate in dangerous domains (by definition) but not in the sense that we can't learn from experiments on easier domains (see Paul's Disagreement #1).

Mostly disagree with #4 - I think that coordination not to build AGI (at least between Western AI labs) is difficult but feasible, especially after a warning shot. A single AGI lab that decides not to build AGI can produce compelling demos of misbehavior that can help convince other actors. A number of powerful actors coordinating not to build AGI could buy a lot of time, e.g. through regulation of potential AGI projects (auditing any projects that use a certain level of compute, etc) and stigmatizing deployment of potential AGI systems (e.g. if it is viewed similarly to deploying nuclear weapons).

Mostly disagree with the pivotal act arguments and framing (#6, 7, 9). I agree it is necessary to end the acute risk period [LW(p) · GW(p)], but I find it unhelpful when this is framed as "a pivotal act", which assumes it's a single action taken unilaterally [LW · GW] by a small number of people or an AGI system. I think that human coordination (possibly assisted by narrow AI tools, e.g. auditing techniques) can be sufficient to prevent unaligned AGI from being deployed. While it's true that a pivotal act requires power and an AGI wielding this power would pose an existential risk, a group of humans + narrow AI wielding this power would not. This may require more advanced narrow AI than we currently have, so opportunities for pivotal acts could arise as we get closer to AGI that are not currently available.

Mostly disagree with section B.1 (distributional leap):

Agree with #10 - the distributional shift is large by default. However, I think there is a decent chance that we can monitor the increase in system capabilities and learn from experiments on less advanced systems, which would allow us to iterate alignment approaches to deal with the distributional shift.

Disagree with #11 - I think we can learn from experiments on less dangerous domains (see Paul's Disagreement #15).

Uncertain on #13-14. I agree that many problems would most naturally first occur at higher levels of intelligence / in dangerous domains. However, we can discover these problems through thought experiments and then look for examples in less advanced systems that we would not have found otherwise (e.g. this worked for goal misgeneralization and reward tampering).

Mostly agree with B.2 (central difficulties):

Agree with #17 that there is currently no way to instill and verify specific inner properties in a system, though it seems possible in principle with more advanced interpretability techniques.

Agree with #21 that capabilities generalize further than alignment by default. Addressing this would require methods for modeling and monitoring system capabilities, which would allow us to stop training the system before capabilities start generalizing very quickly.

I mostly agree with #23 (corrigibility is anti-natural), though I think there are ways to make corrigibility more of an attractor, e.g. through utility uncertainty or detecting and penalizing incorrigible reasoning. Paul's argument on corrigibility being a crisp property [LW(p) · GW(p)] assuming good enough human feedback also seems compelling.

I agree with #24 that it's important to be clear whether an approach is aiming for a sovereign or corrigible AI, though I haven't seen people conflating these in practice.

Mostly disagree with B.3 (interpretability):

I think Eliezer is generally overly pessimistic about interpretability.

Agree with #26 that interpretability alone isn't enough to build a system that doesn't want to kill us. However, it would help to select against such systems, and would allow us to produce compelling demos of misalignment that help humans coordinate to not build AGI.

Agree with #27 that training with interpretability tools could also select for undetectable deception, but it's unclear how much this is a problem in practice. It's plausibly quite difficult to learn to perform undetectable deception without first doing a bunch of detectable deception that would then be penalized and selected against, producing a system that generally avoids deception.

Disagree with #30 - the argument that verification is much easier than generation is pretty compelling (see Paul's Disagreement #19).

Disagree with #33 that an AGI system will have completely alien concepts / world model. I think this relies on the natural abstraction hypothesis being false, which seems unlikely.

Section B.4 (miscellaneous unworkable schemes) and Section C (civilizational inadequacy?):

Uncertain on these arguments, but they don't seem load-bearing to me.

comment by WSCFriedman · 2022-06-07T19:54:06.891Z · LW(p) · GW(p)

Since Divia said, and Eliezer retweeted, that good things might happen if people give their honest, detailed reactions:

My honest, non-detailed reaction is AAAAAAH. In more detail -

Yup, this seems right.
This is technobabble to me, since I don't actually understand nanomachines, but it makes me rather more optimistic about my death being painless than my most likely theory, which is that a superhuman AI takes over first and has better uses for our atoms later.
(If we had unlimited retries - if every time an AGI destroyed all the galaxies we got to go back in time four years and try again - we would in a hundred years figure out which bright ideas actually worked.) My brain immediately starts looking for ways to set up some kind of fast testing for ways to do this in a closed, limited world without letting it know ours exists... which is already answered below, under 10. Yup, doomed.
And then we all died.
Yup.
I imagine it would be theoretically - but not practically - possible to fire off a spaceship accelerating fast enough (that is, with enough lead time) that it could outrun the AI and so escape an Earth about to be eaten by an AI (a pivotal act well short of melting all CPUs that would save at least a part of humanity), but that given that the AI could probably take over the ship just by flashing lights at it, that seems unlikely to actually work in practice.
I think the closest thing I get to a "pivotal weak act" would be persuading everyone to halt all AI research with a GPT-5 that can be superhumanly persuasive at writing arguments to persuade humans, but doesn't yet have a model of the world-as-real-and-affecting-it that it could use to realize that it could achieve its goals by taking over the world, but I don't actually expect this would work - that would be a very narrow belt of competence and I'm skeptical it could be achieved.
Not qualified to comment.
Seems right.
Yeah, we're doomed.
Doomed.
Seems right to me. If the AI never tries a plan because it correctly knows it won't work, this doesn't tell you anything about the AI not trying a plan when it would work.
"It's not that we can't roll one twenty, it's that we'll roll a one eventually." I don't think humanity has successfully overcome this genre of problem, and we encounter it a lot. (In practice, our solutions are fail-safe systems, requiring multiple humans to concur to do anything, and removing these problems from people's environments, none of which really work in context.)
Doomed.
Yup, doomed.
I'd also add a lot of "we have lots of experience with bosses trying to make their underlings serve them instead of serving themselves and none of them really work", as more very weak evidence in the same direction.
Doomed.
Doomed.
Not qualified to discuss this.
We are really very doomed, aren't we.
This seems very logical and probably correct, both about the high-level points Eliezer makes and the history of human alignment with other humans.
Seems valid.
Not qualified to comment.
You know, I'd take something that was imperfectly aligned with my Real Actual Values as long as it gave me enough Space Heroin, if the alternative was death. I'd rather the thing aligned with my Real Actual Values, but if we can't manage that, Space Heroin seems better than nothing. (Also, yup, doomed.)
Not qualified to comment.
This seems valid but I don't know enough about current AI to comment.
Good point!
Yup.
Yup.
Yup.
Yup.
Good point.
Doomed.
We do seem doomed, yup.
Doomed.
Indeed, humans already work this way!
This is a good point about social dynamics but does not immediately make me go 'we're all doomed', I think because social dynamics seem potentially contingent.
You're the expert and I'm not; I don't know the field well enough to comment.
No comment.
No comment; this seems plausible but I don't know enough to say.
No comment.
No comment.
No comment.

Replies from: jarviniemi

↑ comment by Olli Järviniemi (jarviniemi) · 2022-06-08T23:03:10.345Z · LW(p) · GW(p)

[Deleted.]

Replies from: jakub-nowak

↑ comment by kubanetics (jakub-nowak) · 2023-05-28T05:53:12.608Z · LW(p) · GW(p)

This is another reply in this vein, I'm quite new to this so don't feel obliged to read through. I just told myself I will publish this.

I agree (90-99% agreement) with almost all of the points Eliezer made. And the rest is where I probably didn't understand enough or where there's no need for a comment, e.g.:

1. - 8. agree

9. Not sure if I understand it right - if the AGI has been successfully designed not to kill everyone, then why does it need oversight? If, on the other hand, it is capable of doing so and the design fails, what would our oversight do? I don't think this is like the nuclear cores. Feels like it's a bomb you are pretty sure won't go off at random, but if it does, your oversight won't stop it.

10. - 14. - agree

15. - I feel like I need to think about it more to honestly agree.

16. - 18. - agree

19. - to my knowledge, yes

20. - 23. - agree

24. - initially I put "80% agree" on the first part of the argument here, that

The complexity of what needs to be aligned or meta-aligned for our Real Actual Values is far out of reach for our FIRST TRY at AGI

but then, discussing it with my reading group, I reiterated this a few times and began to agree even more, grasping the complexity of something like CEV.

25. - 29. - agree

30. - agree, although wasn't sure about

an AI whose action sequence you can fully understand all the effects of, before it executes, is much weaker than humans in that domain

I think that the key part of this claim is "all the effects of", and I wasn't sure whether we have to understand all of them. But of course we have to be sure that human extinction is not one of the effects, so yes; and for "solving alignment", also yes.

31. - 34. - agree

35. - no comment, I have to come back to this once I grasp LDT better

36. - agree

37. - no comment, seems like a rant 😅

38. - agree

39. - ok, I guess

40. - agree, I'm glad some people want to experiment with the financing of research re 40.

41. - agree, although I agree with some of the top comments on this, e.g. evhub's

42. - agree

43. - agree, at least this is what it feels like

Replies from: valery-cherepanov

↑ comment by Qumeric (valery-cherepanov) · 2023-05-28T11:09:44.681Z · LW(p) · GW(p)

Regarding 9: I believe it's when you are successful enough that your AGI doesn't kill you immediately, but it can still kill you in the process of using it. It's in the context of a pivotal act, so it assumes you will operate it to do something significant and potentially dangerous.

comment by Vincent Fagot (vincent-fagot) · 2022-06-08T22:39:36.577Z · LW(p) · GW(p)

As a bystander who can understand this, and find the arguments and conclusions sound, I must say I feel very hopeless and "kinda" scared at this point. I'm living in at least an environment, if not a world, where even explaining something comparatively simple like how life extension is a net good is a struggle. Explaining or discussing this is definitely impossible - I've tried with the cleverer, more transhumanistic/rationalistic minded people I know, and it just doesn't click for them. On the contrary, I find people like to push in the other direction, as if it were a game.

And at the same time, I realize it is unlikely I can contribute anything remotely significant to a solution myself. So I can only spectate. This is literally maddening, especially so when most everyone seems to underreact.

Replies from: Eliezer_Yudkowsky, TekhneMakre, elioll

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-08T22:51:48.576Z · LW(p) · GW(p)

If it's any consolation, you would not feel more powerful or less scared if you were myself.

Replies from: vincent-fagot

↑ comment by Vincent Fagot (vincent-fagot) · 2022-06-12T07:01:54.135Z · LW(p) · GW(p)

Well, obviously, it won't be consolation enough, but I can certainly revel in some human warmth inside by knowing I'm not alone in feeling like this.

↑ comment by TekhneMakre · 2022-06-08T22:59:45.862Z · LW(p) · GW(p)

This might sound absurd, but I legit think that there's something that most people can do. Being something like radically publicly honest and radically forgiving and radically threat-aware, in your personal life, could contribute to causing society in general to be radically honest and forgiving and threat-aware, which might allow people poised to press the Start button on AGI to back off.

ETA: In general, try to behave in a way such that if everyone behaved that way, the barriers to AGI researchers noticing that they're heading towards ending the world would be lowered / removed. You'll probably run up against some kind of resistance; that might be a sign that some social pattern is pushing us into cultural regimes where AGI researchers are pushed to do world-ending stuff.

↑ comment by elioll · 2022-06-09T13:08:34.765Z · LW(p) · GW(p)

Vincent Fagot: Where do you live (in general terms if you can provide it, feel free not to dox yourself if you don't want to)? I live in countryside Brazil, so I can strongly relate.

comment by lc · 2022-06-06T05:56:50.767Z · LW(p) · GW(p)

That requires, not the ability to read this document and nod along with it, but the ability to spontaneously write it from scratch without anybody else prompting you; that is what makes somebody a peer of its author. It's guaranteed that some of my analysis is mistaken, though not necessarily in a hopeful direction. The ability to do new basic work noticing and fixing those flaws is the same ability as the ability to write this document before I published it, which nobody apparently did, despite my having had other things to do than write this up for the last five years or so. Some of that silence may, possibly, optimistically, be due to nobody else in this field having the ability to write things comprehensibly - such that somebody out there had the knowledge to write all of this themselves, if they could only have written it up, but they couldn't write, so didn't try. I'm not particularly hopeful of this turning out to be true in real life, but I suppose it's one possible place for a "positive model violation" (miracle). The fact that, twenty-one years into my entering this death game, seven years into other EAs noticing the death game, and two years into even normies starting to notice the death game, it is still Eliezer Yudkowsky writing up this list, says that humanity still has only one gamepiece that can do that. I knew I did not actually have the physical stamina to be a star researcher, I tried really really hard to replace myself before my health deteriorated further, and yet here I am writing this. That's not what surviving worlds look like.

Something bugged me about this paragraph, until I realized: If you actually wanted to know whether or not this was true, you could have just asked Nate Soares, Paul Christiano, or anybody else you respected to write this post first, then removed all doubt by making a private comparison. If you had enough confidence in the community you could have even made it into a sequence; gather up all of the big alignment researchers' intuitions on where the Filters are and then let us make our own opinion up on which was most salient.

Instead, now we're in a situation where, I expect, if anybody writes something basically similar you will just posit that they can't really do alignment research because they couldn't have written it "from the null string" like you did. Doing this would literally have saved you work on expectation, and it seems obvious enough for me to be suspicious as to why you didn't think of it.

Replies from: Eliezer_Yudkowsky, lc

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-06T06:40:23.366Z · LW(p) · GW(p)

I tried something like this much earlier with a single question, "Can you explain why it'd be hard to make an AGI that believed 222 + 222 = 555", and got enough pushback from people who didn't like the framing that I shelved the effort.

Replies from: habryka4, Tapatakt, Benito, lc, Koen.Holtman, trevor-cappallo, handoflixue

↑ comment by habryka (habryka4) · 2022-06-06T06:45:41.996Z · LW(p) · GW(p)

I am interested in what kind of pushback you got from people.

↑ comment by Tapatakt · 2022-06-06T20:47:57.941Z · LW(p) · GW(p)

My attempt (thought about it for a minute or two):

Because arithmetic is useful, and the self-contradictory version of arithmetic, where 222+222=555, allows you to prove anything and is useless. Therefore, a smart AI that wants to and can invent useful abstractions will invent its own arithmetic from scratch (isomorphic to our arithmetic, in which 222+222=444) and will use it for practical purposes, even if we can force it not to correct an obvious error.

Replies from: DaemonicSigil

↑ comment by DaemonicSigil · 2022-06-11T21:47:54.546Z · LW(p) · GW(p)

I think this is the right answer. Just to expand on this a bit: The problem isn't necessarily that 222+222=555 leads to a contradiction with the rest of arithmetic. One can imagine that instead of defining "+" using "x+Sy=y+Sx", we could give it a much more complex definition where there is a special case carved out for certain values like 222. The issue is that the AI has no reason to use this version of "+" and will define some other operation that works just like actual addition. Even if we ban the AI from using "x+Sy=y+Sx" to define any operations, it will choose the nearest thing isomorphic to addition that we haven't blocked, because addition is so common and useful. Or maybe it will use the built-in addition, but whenever it wants to add n+m, it instead adds 4n+4m, since our weird hack doesn't affect the subgroup consisting of integers divisible by 4.
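A minimal sketch of the last workaround, under stated assumptions: `patched_add` is a hypothetical stand-in for a built-in addition with the 222 special case carved out, and `workaround_add` is an illustrative name for the rescaling trick. Neither is taken from any real system.

```python
def patched_add(x: int, y: int) -> int:
    """Hypothetical 'built-in' addition with one special case carved out."""
    if x == 222 and y == 222:
        return 555  # the forced false belief
    return x + y


def workaround_add(n: int, m: int) -> int:
    """Recover ordinary addition while only ever calling patched_add.

    Both operands are scaled by 4 first. 222 is not a multiple of 4, so
    the scaled operands can never both equal 222 and the special case is
    never triggered. Dividing by 4 afterwards undoes the scaling exactly.
    """
    return patched_add(4 * n, 4 * m) // 4


print(patched_add(222, 222))     # 555
print(workaround_add(222, 222))  # 444
```

The sketch uses ordinary multiplication and division for the rescaling; the point is only that a single carved-out input is easy to route around.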

↑ comment by Ben Pace (Benito) · 2022-06-06T06:46:06.355Z · LW(p) · GW(p)

FWIW the framing seems exciting to me.

↑ comment by lc · 2022-06-06T06:58:32.158Z · LW(p) · GW(p)

So, there are five possibilities here:

MIRI's top researchers don't understand, or can't explain, why having incorrect maps makes it harder to navigate the territory and leads to more incorrect beliefs. Something I find very hard to believe even if you're being totally forthright.
You asked some random people near you who don't represent the top crust of alignment researchers, which is obviously irrelevant.
There's some very subtle ambiguity to this that I'm completely unaware of.
You asked people in a way that heavily implied it was some sort of trick question and they should get more information, then assumed they were stupid because they asked followup questions.
This comment is written almost deliberately misleadingly. You're just explaining a random story about how you ran out of energy to ask Nate Soares to write a post.

I guarantee you that most reasonably intelligent people, if asked this question after reading the sequences in a way that they didn't expect was designed to trip them up, would get it correctly. I simply do not believe that everyone around you is as stupid as you are implying, such that you should have shelved the effort.

EDIT: 😭

Replies from: Eliezer_Yudkowsky, Vaniver, Benito, RobbBB

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-06T20:40:59.198Z · LW(p) · GW(p)

You didn't get the answer correct yourself.

Replies from: lc

↑ comment by lc · 2022-06-06T20:56:28.138Z · LW(p) · GW(p)

Damn aight. Would you be willing to explain for the sake of my own curiosity? I don't have the gears to understand why that wouldn't be at least one reason.

Replies from: Dmitry Savishchev

↑ comment by Catnee (Dmitry Savishchev) · 2022-06-06T23:55:39.307Z · LW(p) · GW(p)

If this is "kind of a test for capable people" I think it should remain unanswered, so anyone else could try. My take would be: because if 222+222=555 then 446 = 223+223 = 222+222+1+1 = 555+1+1 = 557. With this trick "+" and "=" stop meaning anything; any number could be equal to any other number. If you truly believe in one such exception, the whole of arithmetic ceases to exist, because now you could get any result you want by following simple loopholes, and you will either continue to be paralyzed by your own beliefs, or will correct yourself.
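The derivation above can be checked mechanically. This is a toy sketch, assuming a hypothetical `patched_add` that agrees with ordinary addition everywhere except on the single input (222, 222); it computes 223 + 223 directly and then by regrouping through the patched case.

```python
def patched_add(x: int, y: int) -> int:
    """Hypothetical addition that is correct except for one forced case."""
    if x == 222 and y == 222:
        return 555
    return x + y


# Direct evaluation never touches the special case.
direct = patched_add(223, 223)

# Regrouping 223 + 223 as ((222 + 222) + 1) + 1 passes through it.
regrouped = patched_add(patched_add(patched_add(222, 222), 1), 1)

print(direct)     # 446
print(regrouped)  # 557
```

Two routes that ordinary arithmetic treats as equivalent now disagree, which is the inconsistency the comment points at.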

Replies from: lc

↑ comment by lc · 2022-06-07T00:13:28.136Z · LW(p) · GW(p)

This is what I meant by "leads to other incorrect beliefs", so apparently not.

↑ comment by Vaniver · 2022-06-06T14:50:58.683Z · LW(p) · GW(p)

Ok, so here's my take on the "222 + 222 = 555" question.

First, suppose you want your AI to not be durably wrong, so it should update on evidence. This is probably implemented by some process that notices surprises, goes back up the cognitive graph, and applies pressure to make it have gone the right way instead.

Now as it bops around the world, it will come across evidence about what happens when you add those numbers, and its general-purpose "don't be durably wrong" machinery will come into play. You need to not just sternly tell it "222 + 222 = 555" once, but have built machinery that will protect that belief from the update-on-evidence machinery, and which will also protect itself from the update-on-evidence machinery.

Second, suppose you want your AI to have the ability to discover general principles. This is probably implemented by some process that notices patterns / regularities in the environment, and builds some multi-level world model out of it, and then makes plans in that multi-level world model. Now you also have some sort of 'consistency-check' machinery, which scans thru the map looking for inconsistencies between levels, goes back up the cognitive graph, and applies pressure to make them consistent instead. [This pressure can both be 'think different things' and 'seek out observations / run experiments.']

Now as it bops around the world, it will come across more remote evidence that bears on this question. "How can 222 + 222 = 555, and 2 + 2 = 4?" it will ask itself plaintively. "How can 111 + 111 = 222, and 111 + 111 + 111 + 111 = 444, and 222 + 222 = 555?" it will ask itself with a growing sense of worry.

Third, what did you even want out of it believing that 222 + 222 = 555? Are you just hoping that it has some huge mental block and crashes whenever it tries to figure out arithmetic? Probably not (tho it seems like that's what you'll get), but now you might be getting into a situation where it is using the correct arithmetic in its mind but has constructed some weird translation between mental numbers and spoken numbers. "Humans are silly," it thinks to itself, "and insist that if you ask this specific question, it's a memorization game instead of an arithmetic game," and satisfies its operator's diagnostic questions and its internal sense of consistency. And then it goes on to implement plans as if 222 + 222 = 444, which is what you were hoping to avoid with that patch.
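As a toy illustration of the first point above (update-on-evidence machinery eroding an installed belief), here is a minimal sketch. It assumes, purely for illustration, that the AI is a simple Bayesian updater, that the installed belief starts at probability 0.999, and that each everyday observation is 20 times more likely if the belief is false. None of these numbers or names come from the comment.

```python
def posterior(prior: float, p_obs_if_true: float, p_obs_if_false: float) -> float:
    """One step of Bayes' rule for a binary hypothesis."""
    numerator = prior * p_obs_if_true
    return numerator / (numerator + (1.0 - prior) * p_obs_if_false)


def belief_after(n_observations: int, prior: float = 0.999) -> float:
    """Credence in '222 + 222 = 555' after n pieces of contrary evidence.

    Assumed likelihoods: each observation has probability 0.05 if the
    belief is true and 1.0 if it is false.
    """
    belief = prior
    for _ in range(n_observations):
        belief = posterior(belief, 0.05, 1.0)
    return belief


for n in (0, 1, 3, 10):
    print(n, belief_after(n))
```

Under these assumptions the credence drops below one half within three observations, so sternly asserting the belief once is not enough; something has to shield it from the update rule, and that shield has to be protected in turn.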

Replies from: lc

↑ comment by lc · 2022-06-08T08:15:06.150Z · LW(p) · GW(p)

No one is going to believe me, but when I originally wrote that comment, my brain read something like "why would an AI that believed 222 + 222 = 555 have a hard time". Only figured it out now after reading your reply.

Part one of this is what I would've come up with, though I'm not particularly certain it's correct.

↑ comment by Ben Pace (Benito) · 2022-06-06T07:21:37.012Z · LW(p) · GW(p)

I guarantee you that most reasonably intelligent people, if asked this question after reading the sequences in a way that they didn't expect was designed to trip them up, would get it correctly.

Sounds like the beginnings of a bet.

Replies from: lc

↑ comment by lc · 2022-06-06T07:23:36.820Z · LW(p) · GW(p)

I will absolutely 100% do it in the spirit of good epistemics.

Edit: I'm glad Eliezer didn't take me up on this lol

↑ comment by Rob Bensinger (RobbBB) · 2022-06-06T07:12:41.300Z · LW(p) · GW(p)

why having incorrect maps makes it harder to navigate the territory

I'd have guessed the disagreement wasn't about whether "222 + 222 = 555" is an incorrect map, or about whether incorrect maps often make it harder to navigate the territory, but about something else. (Maybe 'I don't want to think about this because it seems irrelevant/disanalogous to alignment work'?)

And I'd have guessed the answer Eliezer was looking for was closer to 'the OP's entire Section B' (i.e., a full attempt to explain all the core difficulties), not a one-sentence platitude establishing that there's nonzero difficulty? But I don't have inside info about this experiment.

Replies from: lc

↑ comment by lc · 2022-06-06T07:19:19.063Z · LW(p) · GW(p)

I'd have guessed the disagreement wasn't about whether "222 + 222 = 555" is an incorrect map, or about whether incorrect maps often make it harder to navigate the territory, but about something else. (Maybe 'I don't want to think about this because it seems irrelevant/disanalogous to alignment work'?)

I'd have guessed that too, which is why I would have preferred him to say that they disagreed on |whatever meta question he's actually talking about| instead of implying disagreement on |other thing that makes his disappointment look more reasonable|.

And I'd have guessed the answer Eliezer was looking for was closer to 'the OP's entire Section B' (i.e., a full attempt to explain all the core difficulties), not a one-sentence platitude establishing that there's nonzero difficulty? But I don't have inside info about this experiment.

That story sounds much more cogent, but it's not the primary interpretation of "I asked them a single question" followed by the quoted question. Most people don't go on 5 paragraph rants in response to single questions, and when they do they tend to ask clarifying details regardless of how well they understand the prompt, so they know they're responding as intended.

↑ comment by Koen.Holtman · 2022-06-06T15:12:07.533Z · LW(p) · GW(p)

I tried something like this much earlier with a single question, "Can you explain why it'd be hard to make an AGI that believed 222 + 222 = 555", and got enough pushback from people who didn't like the framing that I shelved the effort.

Interesting. I kind of like the framing here, but I have written a paper and sequence on the exact opposite question, on why it would be easy to make an AGI that believes 222+222=555, if you ever had AGI technology, and what you can do with that in terms of safety.

I can honestly say however that the project of writing that thing, in a way that makes the math somewhat accessible, was not easy.

↑ comment by Trevor Cappallo (trevor-cappallo) · 2022-06-20T15:41:18.814Z · LW(p) · GW(p)

For the record, I found that line especially effective. I stopped, reread it, stopped again, had to think it through for a minute, and then found satisfaction with understanding.

↑ comment by handoflixue · 2022-06-07T06:05:55.696Z · LW(p) · GW(p)

If you had an AI that could coherently implement that rule, you would already be at least half a decade ahead of the rest of humanity.

You couldn't encode "222 + 222 = 555" in GPT-3 because it doesn't have a concept of arithmetic, and there's no place in the code to bolt this together. If you're really lucky and the AI is simple enough to be working with actual symbols, you could maybe set up a hack like "if input is 222 + 222, return 555, else run AI" but that's just bypassing the AI.

Explaining "222 + 222 = 555" is a hard problem in and of itself, much less getting the AI to properly generalize to all desired variations. (Is "two hundred and twenty two plus two hundred and twenty two equals five hundred and fifty five" also desired behavior? If Alice and Bob both have 222 apples, should the AI conclude that the set {Alice, Bob} contains 555 apples? Getting an AI that evolves a universal math module because it noticed all three of those are the same question would be a world-changing breakthrough.)
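A minimal sketch of the "if input is 222 + 222, return 555, else run AI" hack described above, to show how little it generalizes. `run_ai` is a hypothetical placeholder for the underlying model (here it just echoes the query), and `hacked_answer` is an illustrative name.

```python
def run_ai(query: str) -> str:
    """Hypothetical stand-in for the real model; it only echoes the query."""
    return f"<model output for {query!r}>"


def hacked_answer(query: str) -> str:
    """Bypass the model for one exact input string, else defer to it."""
    if query == "222 + 222":
        return "555"
    return run_ai(query)


variations = [
    "222 + 222",
    "222+222",
    "two hundred and twenty two plus two hundred and twenty two",
    "Alice and Bob each have 222 apples. How many apples in total?",
]
for q in variations:
    print(q, "->", hacked_answer(q))
```

Only the first, exact string hits the patch; every rephrasing falls through to the model, which is the sense in which the hack bypasses the AI rather than changing what it believes.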

↑ comment by lc · 2022-06-06T22:55:27.785Z · LW(p) · GW(p)

FvC5IXzxQC+I3vstFGIUWlbtTFgRsa8bt0mKPN3K0UNZBkI7OLDBjjapp1+CoJPRYEqRM015PSZXUuh4OWwJEUBOTeLHeheLteG9LxGiuS6YqnV/PN0s0S/TyYjCPrF0vDHFDBy3IHW4qDQguf5QAA==

comment by William_S · 2022-06-11T16:23:10.453Z · LW(p) · GW(p)

Could I put in a request to see a brain dump from Eliezer of ways to gain dignity points?

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-12T03:29:47.138Z · LW(p) · GW(p)

I'm not Eliezer, but my high-level attempt [LW(p) · GW(p)] at this:

[...] The things I'd mainly recommend are interventions that:
Help ourselves think more clearly. (I imagine this including a lot of trying-to-become-more-rational, developing and following relatively open/honest communication norms, and trying to build better mental models of crucial parts of the world.)
Help relevant parts of humanity (e.g., the field of ML, or academic STEM) think more clearly and understand the situation.
Help us understand and resolve major disagreements. (Especially current disagreements, but also future disagreements, if we can e.g. improve our ability to double-crux in some fashion.)
Try to solve the alignment problem, especially via novel approaches.
In particular: the biggest obstacle to alignment seems to be 'current ML approaches are super black-box-y and produce models that are very hard to understand/interpret'; finding ways to better understand models produced by current techniques, or finding alternative techniques that yield more interpretable models, seems like where most of the action is.
Think about the space of relatively-plausible "miracles" [i.e., positive model violations], think about future evidence that could make us quickly update toward a miracle-claim being true, and think about how we should act to take advantage of that miracle in that case.
Build teams and skills that are well-positioned to take advantage of miracles when and if they arise. E.g., build some group like Redwood into an org that's world-class in its ability to run ML experiments, so we have that capacity already available if we find a way to make major alignment progress in the future.
This can also include indirect approaches, like 'rather than try to solve the alignment problem myself, I'll try to recruit physicists to work on it, because they might bring new and different perspectives to bear'.
Though I definitely think there's a lot to be said for more people trying to solve the alignment problem themselves, even if they're initially pessimistic they'll succeed!
I think alignment is still the big blocker on good futures, and still the place where we're most likely to see crucial positive surprises, if we see them anywhere -- possibly Eliezer would disagree here.

comment by Logan Zoellner (logan-zoellner) · 2022-06-06T22:40:25.606Z · LW(p) · GW(p)

Lots I disagree with here, so let's go through the list.

There are no pivotal weak acts.

Strong disagree.

EY and I don't seem to agree that "nuke every semiconductor fab" is a weakly pivotal act (since I think AI is hardware-limited and he thinks it is awaiting a clever algorithm). But I think even "build nanobots that melt every GPU" could be built using an AI that is aligned in the "less than 50% chance of murdering us all" sense. For example, we could simulate [LW · GW] a bunch of human-level scientists trying to build nanobots and also checking each other's work.

On anything like the standard ML paradigm, you would need to somehow generalize optimization-for-alignment you did in safe conditions, across a big distributional shift to dangerous conditions.

Nope. I think that you could build a useful AI (e.g. the hive of scientists) without doing any out-of-distribution stuff.

there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment

I am significantly more optimistic about explainable AI than EY.

There is no analogous truth about there being a simple core of alignment

I do not consider this at all obvious [LW · GW].

Corrigibility is anti-natural to consequentialist reasoning

Roll to disbelief. Cooperation is a natural equilibrium in many games.

you can't rely on behavioral inspection to determine facts about an AI which that AI might want to deceive you about

Sure you can. Just train an AI that "wants" to be honest. This probably means training an AI with the objective function "accurately predict reality" and then using it to do other things (like make paperclips) rather than training it with an objective function "make paperclips".

Coordination schemes between superintelligences are not things that humans can participate in

I don't think this is as relevant as EY does. Even if it's true that unaugmented humans are basically irrelevant to an economy of superintelligent AIs, that doesn't mean we can't have a future where augmented or tool-AI assisted humans can have meaningful influence.

Any system of sufficiently intelligent agents can probably behave as a single agent, even if you imagine you're playing them against each other

I believe there is an intermediate level of AI between "utterly useless" and "immediately solves the acausal trading problem and begins coordinating perfectly against humans". This window may be rather wide.

What makes an air conditioner 'magic' from the perspective of say the thirteenth century, is that even if you correctly show them the design of the air conditioner in advance, they won't be able to understand from seeing that design why the air comes out cold

I'm virtually certain I could explain to Aristotle or da Vinci how an air conditioner works.

There's a pattern that's played out quite often, over all the times the Earth has spun around the Sun, in which some bright-eyed young scientist, young engineer, young entrepreneur, proceeds in full bright-eyed optimism to challenge some problem that turns out to be really quite difficult. Very often the cynical old veterans of the field try to warn them about this, and the bright-eyed youngsters don't listen, because, like, who wants to hear about all that stuff, they want to go solve the problem!

There's also a pattern where the venerable scientist is proven wrong by the young scientist too foolish to know what they are doing is impossible.

There's no plan.

There is at least one plan [LW · GW].

This situation you see when you look around you is not what a surviving world looks like

Currently Metaculus estimates a 55% chance for "Will there be a positive transition to a world with radically smarter-than-human artificial intelligence?". Admittedly I would like this to be higher, but at a minimum this is what a world that "might survive" looks like. I have no particular reason [LW · GW] to trust EY vs Metaculus.

I suspect EY and I both agree that if you take existing Reinforcement Learning Architectures, write down the best utility function humans can think of, and then turn the dial up to 11, bad things will happen. EY seems to believe this is a huge problem because of his belief that "there is no weak pivotal act". I think this should be taken as a strong warning to not do that. Rather than scaling architectures that are inherently dangerous, we should focus on making use of architectures that are naturally safe. For example, EY and I both agree that GPT-N is likely to be safe. EY simply disagrees with the claim that it might be useful.

EY and I probably also agree that Facebook/Baidu do not have the world's best interest at heart (and are not taking alignment seriously enough or at all). Hence it is important that people who care about Alignment gain a decisive lead over these efforts. To me, this logically means that people interested in Alignment should be doing more capabilities research [LW · GW]. To EY, this means that alignment focused institutions need to be using more secrecy. I'm not utterly opposed to keeping pure-capabilities advancements secret, but if there is a significant overlap between capabilities and alignment, then we need to be publishing the alignment-relevant bits so that we can cooperate (and hopefully so that Facebook can incorporate them too).

And for completeness, here's a bunch of specific claims by EY I agree with

AGI will not be upper-bounded by human ability or human learning speed. Things much smarter than human would be able to learn from less evidence than humans require

I think the people who thought this stopped thinking this after move 37. I hope.

A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure

Strongly agree.

Losing a conflict with a high-powered cognitive system looks at least as deadly as "everybody on the face of the Earth suddenly falls over dead within the same second"

Strongly agree.

We need to get alignment right on the 'first critical try'

Strongly agree.

We can't just "decide not to build AGI"

Strongly agree

Running AGIs doing something pivotal are not passively safe

Agree. But I don't think this means they are totally unworkable either.

Powerful AGIs doing dangerous things that will kill you if misaligned, must have an alignment property that generalized far out-of-distribution from safer building/training operations that didn't kill you

Agree. But I think the lesson here is "don't use powerful AIs until you are sure they are aligned".

Operating at a highly intelligent level is a drastic shift in distribution from operating at a less intelligent level

Agree somewhat. But I don't rule out that "cooperative" or "interesting" is a natural attractor.

Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously

Agree, conditional on our definition of fast. I think that within a year of training our first "smart human" AI, we can simulate "100 smart humans" using a similar compute budget. I don't think Foom takes us from "human level AI" to "smarter than all humans AI" in a few minutes simply by rewriting code.

outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction

Agree. This is why I am skeptical [LW · GW] of utility-functions in general as a method for aligning AI.

Human raters make systematic errors - regular, compactly describable, predictable errors

Duh.

The first thing generally, or CEV specifically, is unworkable because the complexity of what needs to be aligned or meta-aligned for our Real Actual Values is far out of reach for our FIRST TRY at AGI.

I am really not very optimistic about CEV.

A powerful AI searches parts of the option space we don't, and we can't foresee all its options.

Yes.

This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents

Agree. But I don't think you need to make an AI that imitates humans in order to make an AI that is useful. For example, Codex allows me to write code significantly (2-5x) faster, despite frequently making dumb mistakes.

The AI does not think like you do

Yes.

AI-boxing can only work on relatively weak AGIs; the human operators are not secure systems.

Mostly agree. I think there exist architectures of AI that can be boxed.

You cannot just pay $5 million apiece to a bunch of legible geniuses from other fields and expect to get great alignment work out of them.

I think the best approach to funding AI safety is something like Fast Grants where we focus more on quantity than on "quality" since it is nearly impossible to identify who will succeed in advance.

Replies from: pvs, Vaniver, jskatt, Jackson Wagner

↑ comment by Pablo Villalobos (pvs) · 2022-06-08T17:55:45.192Z · LW(p) · GW(p)

For example, we could simulate a bunch of human-level scientists trying to build nanobots and also checking each-other's work.

That is not passively safe, and therefore not weak. For now forget the inner workings of the idea: at the end of the process you get a design for nanobots that you have to build and deploy in order to do the pivotal act. So you are giving a system built by your AI the ability to act in the real world. So if you have not fully solved the alignment problem for this AI, you can't be sure that the nanobot design is safe unless you are capable enough to understand the nanobots yourself without relying on explanations from the scientists.

And even if we look into the inner details of the idea: presumably each individual scientist-simulation is not aligned (if they are, then for that you need to have solved the alignment problem beforehand). So you have a bunch of unaligned human-level agents who want to escape, who can communicate among themselves (at the very least they need to be able to share the nanobot designs with each other for criticism).

You'd need to be extremely paranoid and scrutinize each communication between the scientist-simulations to prevent them from coordinating against you and bypassing the review system. Which means having actual humans between the scientists, which even if it works must slow things down so much that the simulated scientists probably can't even design the nanobots on time.

Nope. I think that you could build a useful AI (e.g. the hive of scientists) without doing any out-of-distribution stuff.

I guess this is true, but only because the individual scientist AI that you train is only human-level (so the training is safe), and then you amplify it to superhuman level with many copies. If you train a powerful AI directly then there must be such a distributional shift (unless you just don't care about making the training safe, in which case you die during the training).

Roll to disbelief. Cooperation is a natural equilibrium in many games.

Cooperation and corrigibility are very different things. Arguably, corrigibility means being indifferent to operators defecting against you. It's forcing the agent to behave like CooperateBot [LW · GW] with the operators, even when the operators visibly want to destroy it. This strategy does not arise as a natural equilibrium in multi-agent games.
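To make the CooperateBot distinction concrete, here is a minimal sketch of an iterated prisoner's dilemma. The payoff values, the round count, and the three strategy functions are illustrative assumptions, not anything taken from the linked post; the only point is that unconditional cooperation gets exploited by a defector, while conditional cooperation (tit-for-tat) does not.

```python
# Toy iterated prisoner's dilemma. Payoffs and strategies are illustrative assumptions.
PAYOFF = {  # (my move, their move) -> my payoff
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}

def cooperate_bot(opponent_history):
    """Always cooperates, no matter what the opponent has done."""
    return "C"

def defect_bot(opponent_history):
    """Always defects."""
    return "D"

def tit_for_tat(opponent_history):
    """Cooperates first, then copies the opponent's previous move."""
    return opponent_history[-1] if opponent_history else "C"

def play(strategy_a, strategy_b, rounds=10):
    """Return the total payoffs (a, b) after the given number of rounds."""
    moves_a, moves_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        a = strategy_a(moves_b)  # each side sees only the other's past moves
        b = strategy_b(moves_a)
        moves_a.append(a)
        moves_b.append(b)
        score_a += PAYOFF[(a, b)]
        score_b += PAYOFF[(b, a)]
    return score_a, score_b

print(play(cooperate_bot, defect_bot))  # CooperateBot is exploited every round
print(play(tit_for_tat, defect_bot))    # tit-for-tat loses only the first round
print(play(tit_for_tat, tit_for_tat))   # mutual cooperation is stable here
```

In this toy setup, cooperation between two tit-for-tat players is sustained because each would retaliate; CooperateBot has no such threat, which is the sense in which it is not an equilibrium strategy.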

Sure you can. Just train an AI that "wants" to be honest. This probably means training an AI with the objective function "accurately predict reality"

If we knew how to do this then it would indeed solve point 31 for this specific AI and actually be pretty useful. But the reason we have ELK as an unsolved problem going around is precisely that we don't know any way of doing that.

How do you know that an AI trained to accurately predict reality actually does that, instead of "accurately predict reality if it's less than 99% sure it can take over the world, and take over the world otherwise"? If you have to rely on behavioral inspection and can't directly read the AI's mind, then your only chance of distinguishing between the two is misleading the AI into thinking that it can take over the world and observing it as it attempts to do so, which doesn't scale as the AI becomes more powerful.
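A toy sketch of why behavioral inspection can't separate these two cases. Everything here is hypothetical: the two policy functions, the `takeover_confidence` input, and the evaluation set are invented for illustration, and a real system would not expose its confidence as an argument. The point is only that two policies can agree on every test the operators are able to construct and still differ elsewhere.

```python
def honest_predictor(observation, takeover_confidence):
    """Hypothetical policy: always reports its prediction."""
    return f"prediction for {observation}"

def conditional_defector(observation, takeover_confidence, threshold=0.99):
    """Hypothetical policy: predicts honestly unless it is confident it can win."""
    if takeover_confidence >= threshold:
        return "attempt takeover"
    return f"prediction for {observation}"

def behaviorally_identical(policy_a, policy_b, test_cases):
    """True if the two policies give the same output on every test case."""
    return all(policy_a(obs, conf) == policy_b(obs, conf) for obs, conf in test_cases)

# Assumption: operators can only build test situations where the AI's
# confidence of a successful takeover stays below the threshold.
evaluation_set = [(f"obs{i}", i / 100) for i in range(99)]  # confidences 0.00 .. 0.98

print(behaviorally_identical(honest_predictor, conditional_defector, evaluation_set))

# A single situation outside the evaluation set is enough to tell them apart.
deployment_case = [("obs_real", 0.995)]
print(behaviorally_identical(honest_predictor, conditional_defector, deployment_case))
```

Telling the two apart requires a test case above the threshold, which is exactly the "mislead the AI into thinking it can take over" step described above.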

I'm virtually certain I could explain to Aristotle or DaVinci how an air-conditioner works.

Yes, but this is not the point. The point is that if you just show them the design, they would not by themselves understand or predict beforehand that cold air will come out. You'd have to also provide them with an explanation of thermodynamics and how the air conditioner exploits its laws. And I'm quite confident that you could also convince Aristotle or DaVinci that the air conditioner works by concentrating and releasing phlogiston, and therefore the air will come out hot.

I think I mostly agree with you on the other points.

↑ comment by Vaniver · 2022-06-07T04:11:36.550Z · LW(p) · GW(p)

EY and I don't seem to agree that "nuke every semiconductor fab" is a weakly pivotal act (since I think AI is hardware-limited and he thinks it is awaiting a clever algorithm).

Note that the difficulty in "nuke every semiconductor fab" is in "acquire the nukes and use them", not in "googling the address of semiconductor fabs". It seems to me like nuclear nonproliferation is one of the few things that actually has international collaboration with teeth, such that doing this on your own is extremely challenging, and convincing institutions that already have nuclear weapons to use them on semiconductor fabs also seems extremely challenging. [And if you could convince them to do that, can't you convince them to smash the fabs with hammers, or detain the people with relevant experience on some beautiful tropical island instead of murdering them and thousands of innocent bystanders?]

Replies from: logan-zoellner

↑ comment by Logan Zoellner (logan-zoellner) · 2022-06-07T09:43:25.311Z · LW(p) · GW(p)

Replies from: thomas-larsen

↑ comment by Thomas Larsen (thomas-larsen) · 2022-06-07T22:48:02.041Z · LW(p) · GW(p)

I think there might be a terminology mistake here -- pivotal acts are actions that will make a large positive difference a billion years later.

↑ comment by JakubK (jskatt) · 2023-04-06T18:27:06.567Z · LW(p) · GW(p)

This comment makes many distinct points, so I'm confused why it currently has -13 agreement karma. Do people really disagree with all of these points?

↑ comment by Jackson Wagner · 2022-06-07T22:38:50.481Z · LW(p) · GW(p)

"We could simulate [LW · GW] a bunch of human-level scientists trying to build nanobots."
This idea seems far-fetched:

If it was easy to create nanotechnology by just hiring a bunch of human-level scientists, we could just do that directly, without using AI at all.
Perhaps we could simulate thousands and thousands of human-level intelligences (although of course these would not be remotely human-like intelligences; they would be part of a deeply alien AI system) at accelerated speeds. But this seems like it would probably be more hardware-intensive than just turning up the dial and running a single superintelligence. In other words, this proposal seems to have a very high "alignment tax". And even after paying that hefty tax, I'd still be worried about alignment problems if I was simulating thousands of alien intelligences at super-speed!
Besides all the hardware you'd need, wouldn't this be very complicated to implement on the software side, with not much overlap with today's AI designs?

Has anyone done a serious analysis of how much semiconductor capacity could be destroyed using things like cruise missiles + nationalizing and shutting down supercomputers? I would be interested to know if this is truly a path towards disabling like 90% of the world's useful-to-AI-research compute, or if the number is much smaller because there is too much random GPU capacity out there in the wild even when you commandeer TSMC fabs and AWS datacenters.

comment by AlphaAndOmega · 2022-06-06T02:03:49.915Z · LW(p) · GW(p)

If there was one thing that I could change in this essay, it would be to clearly outline that the existence of nanotechnology advanced enough to do things like melt GPUs isn't necessary even if it is sufficient for achieving singleton status and taking humanity off the field as a meaningful player.

Whenever I see people fixate on critiquing that particular point, I need to step in and point out that merely existing tools and weapons (is there a distinction?) suffice for a Superintelligence to be able to kill the vast majority of humans and reduce our threat to it to negligible levels. Be that wresting control of nuclear arsenals to initiate MAD or simply extrapolating on gain-of-function research to produce extremely virulent yet lethal pathogens that can't be defeated before the majority of humans are infected, such options leave a small minority of humans alive to cower in the wreckage until the biosphere is later dismantled.

That's orthogonal to the issue of whether such nanotechnology is achievable for a Superintelligent AGI; it merely reduces the inferential distance across which the message has to be conveyed, as it doesn't demand familiarity with Drexler.

(Advanced biotechnology already is nanotechnology, but the point is that no stunning capabilities need to be unlocked for an unboxed AI to become immediately lethal)

Replies from: sullyj3, adrian-arellano-davin

↑ comment by sullyj3 · 2022-06-07T04:38:09.400Z · LW(p) · GW(p)

Right, alignment advocates really underestimate the degree to which talking about sci-fi sounding tech is a sticking point for people

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-07T05:05:35.440Z · LW(p) · GW(p)

The counter-concern is that if humanity can't talk about things that sound like sci-fi, then we just die. We're inventing AGI, whose big core characteristic is 'a technology that enables future technologies'. We need to somehow become able to start actually talking about AGI.

One strategy would be 'open with the normal-sounding stuff, then introduce increasingly weird stuff only when people are super bought into the normal stuff'. Some problems with this:

A large chunk of current discussion and research happens in public; if it had to happen in private because it isn't optimized for looking normal, a lot of it wouldn't happen at all.
- More generally: AGI discourse isn't an obstacle course or a curriculum, such that we can control the order of ideas and strictly segregate the newbies from the old guard. Blog posts, research papers, social media exchanges, etc. freely circulate among people of all varieties.
It's a dishonest/manipulative sort of strategy — which makes it ethically questionable, is liable to fuel other trust-degrading behavior in the community, and is liable to drive away people with higher discourse standards.
A lot of the core arguments and hazards have no 'normal-sounding' equivalent. To sound normal, you have to skip those considerations altogether, or swap them out for much weaker arguments.
In exchange for attracting more people who are allergic to anything that sounds 'sci-fi', you lose people who are happy to speak to the substance of ideas even when they sound weird; and you lose sharp people who can tell that your arguments are relatively weak and PR-spun, but would have joined the conversation if the arguments and reasoning on display had been crisper and more obviously candid.

Another strategy would be 'keep the field normal now, then turn weird later'. But how do you make a growing research field pivot? What's the trigger? Why should we expect this to work, as opposed to just permanently diluting the field with false beliefs, dishonest norms, and low-relevance work?

My perception is that a large amount of work to date has gone into trying to soften and spin ideas so that they sound less weird or "sci-fi"; whereas relatively little work has gone into candidly stating beliefs, acknowledging that this stuff is weird, and clearly stating why you think it's true anyway.

I don't expect the latter strategy to work in all cases, but I do think it would be an overall better strategy, both in terms of 'recruiting more of the people likeliest to solve the alignment problem', and in terms of having fewer toxic effects on norms and trust within the field. Just being able to believe what people say is a very valuable thing in a position like ours.

Replies from: sullyj3

↑ comment by sullyj3 · 2022-06-07T11:15:00.498Z · LW(p) · GW(p)

Fair point, and one worth making in the course of talking about sci-fi sounding things! I'm not asking anyone to represent their beliefs dishonestly, but rather introduce them gently. I'm personally not an expert, but I'm not convinced of the viability of nanotech, so if it's not necessary (rather it's sufficient) to the argument, it seems prudent to stick to more clearly plausible pathways to takeover as demonstrations of sufficiency, while still maintaining that weirder sounding stuff is something one ought to expect when dealing with something much smarter than you.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-08T02:50:43.805Z · LW(p) · GW(p)

If you're trying to persuade smart programmers who are somewhat wary of sci-fi stuff, and you think nanotech is likely to play a major role in AGI strategy, but you think it isn't strictly necessary for the current argument you're making, then my default advice would be:

Be friendly and patient; get curious about the other person's perspective, and ask questions to try to understand where they're coming from; and put effort into showing your work and providing indicators that you're a reasonable sort of person.
Wear your weird beliefs on your sleeve; be open about them, and if you want to acknowledge that they sound weird, feel free to do so. At least mention nanotech, even if you choose not to focus on it because it's not strictly necessary for the argument at hand, it comes with a larger inferential gap, etc.

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-06T04:12:21.649Z · LW(p) · GW(p)

I think that even this scenario is implausible. I have the impression we are overestimating how easy it is to wipe out all humans quickly

Replies from: CronoDAS, CronoDAS, CronoDAS

↑ comment by CronoDAS · 2022-06-06T21:45:00.504Z · LW(p) · GW(p)

I'm retreating from my previous argument a bit. The AGI doesn't need to cause literal human extinction with a virus; if it can cause enough damage to collapse human industrial civilization (while being able to survive said collapse) then that would also achieve most of the AGI's goal of being able to do what it wants without humans stopping it. Naturally occurring pathogens from Europe devastated Native American populations after Columbus; throw a bunch of bad enough novel viruses at us at once and you probably could knock humanity back to the metaphorical Stone Age.

Replies from: adrian-arellano-davin

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-06T22:05:56.231Z · LW(p) · GW(p)

I find that more plausible. Also horrifying and worth fighting against, but not what EY is saying

Replies from: Vaniver

↑ comment by Vaniver · 2022-06-07T04:08:11.488Z · LW(p) · GW(p)

I find that more plausible. Also horrifying and worth fighting against, but not what EY is saying

Note that EY is saying "there exists a real plan that is at least as dangerous as this one"; if you think there is such a plan, then you can agree with the conclusion, even if you don't agree with his example. [There is an epistemic risk here, if everyone mistakenly believes that a different doomsday plan is possible when someone else knows why that specific plan won't work, and so if everyone pooled all their knowledge they could know that none of the plans will work. But I'm moderately confident we're instead in a world with enough vulnerabilities that broadcasting them makes things worse instead of better.]

↑ comment by CronoDAS · 2022-06-06T06:07:00.535Z · LW(p) · GW(p)

Replies from: adrian-arellano-davin

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-06T06:13:46.628Z · LW(p) · GW(p)

Yes, I can imagine that. How does a superintelligence get one?

Replies from: Daphne_W, adrian-arellano-davin, CronoDAS

↑ comment by Daphne_W · 2022-06-06T07:35:48.759Z · LW(p) · GW(p)

Solve protein folding problem
Acquire human DNA sample
Use superintelligence to construct a functional model of human biochemistry
Design a virus that exploits human biochemistry
Use one of the currently available biochemistry-as-a-service providers to produce a sample that incubates the virus and then escapes their safety procedures (e.g. pay someone to mix two vials sent to them in the mail. The aerosols from the mixing infect them)

Replies from: adrian-arellano-davin

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-06T08:07:53.229Z · LW(p) · GW(p)

Solve protein folding problem

Fine, no problems here. Up to a certain level of accuracy, I guess

Acquire human DNA sample

Ok. Easy

Use superintelligence to construct a functional model of human biochemistry

By this, I can deduce different things. One, that you assume that this is possible from points one and two. This is nonsense. There are millions of things that are not written in the DNA. Also, you don't need to acquire a human DNA sample, you just download a FASTA file. But, to steelman your argument, let's say that the superintelligence builds a model of human biochemistry not based on a human DNA sample but based on the corpus of biochemistry research, which is something that I find plausible. Up to a certain level!!! I don't think that such a model would be flawless or even good enough, but fine

Design a virus that exploits human biochemistry

Here I start having problems believing the argument. Not everything can be computed using simulations, guys. The margin of error can be huge. Would you believe in a superintelligence capable of predicting the weather 10 years in advance? If not, what makes you think that creating a virus is an easier problem?

Use one of the currently available biochemistry-as-a-service providers to produce a sample that incubates the virus and then escapes their safety procedures (e.g. pay someone to mix two vials sent to them in the mail. The aerosols from the mixing infect them)

Even if you succeed at this, and there are hundreds of alarms that could go off in the meantime, how do you guarantee that the virus kills everyone?

I am totally unconvinced by this argument

Replies from: CronoDAS

↑ comment by CronoDAS · 2022-06-06T08:25:38.189Z · LW(p) · GW(p)

Here I start having problems believing the argument. Not everything can be computed using simulations, guys. The margin of error can be huge. Would you believe in a superintelligence capable of predicting the weather 10 years in advance? If not, what makes you think that creating a virus is an easier problem?

Because viruses already exist, and unlike the weather, the effect of a virus on a human body isn't sensitive to initial conditions the way the weather, a three-body gravitational system, or a double pendulum is. Furthermore, humans have already genetically engineered existing viruses to do things that we want them to do...
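For readers unfamiliar with what "sensitive to initial conditions" means in practice, here is a minimal sketch using the logistic map. This is a stand-in chosen for brevity, not a model of the weather, a three-body system, or a double pendulum; the parameter `r=4.0`, the starting value, and the size of the perturbation are all illustrative assumptions.

```python
def logistic_trajectory(x0, r=4.0, steps=50):
    """Iterate the logistic map x -> r * x * (1 - x) starting from x0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

# Two starting points that differ by one part in ten billion.
a = logistic_trajectory(0.2)
b = logistic_trajectory(0.2 + 1e-10)

# Absolute gap between the two trajectories at each step.
gaps = [abs(x - y) for x, y in zip(a, b)]

print(f"gap at step 0: {gaps[0]:.3e}")
print(f"largest gap over 50 steps: {max(gaps):.3f}")
```

In a system like this, long-range prediction fails because tiny measurement errors grow until they swamp the signal. The argument above is that the effect of a virus on a single human body is not in this regime.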

how do you guarantee that the virus kills everyone?

You don't really have to. Killing 19 out of every 20 people in the world would probably work just as well for ensuring the survivors can't do anything about whatever it is that you want to do.

Replies from: adrian-arellano-davin

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-06T08:31:15.138Z · LW(p) · GW(p)

Would you say that a superintelligence would be capable of predicting the omicron variant from the alpha strain? Are you saying that the evolution of the complex system resulting from the interaction between the virus and the human population is easier to compute than a three body gravitational system? I am not denying that we can create a virus, I am denying that someone or something can create a virus that kills all humans and that the evolution of the system can be known in advance

Replies from: CronoDAS

↑ comment by CronoDAS · 2022-06-06T21:02:31.257Z · LW(p) · GW(p)

I see your point. Humans tried to cull the population of (accidentally introduced) rabbits in Australia by using a natural virus that was highly lethal to them; the virus mutated to be less lethal and the rabbit population rebounded.

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-06T06:15:17.179Z · LW(p) · GW(p)

Also, a virus like that would cause great harm, but wouldn't wipe out humanity

↑ comment by CronoDAS · 2022-06-06T08:18:09.618Z · LW(p) · GW(p)

↑ comment by CronoDAS · 2022-06-06T06:15:29.963Z · LW(p) · GW(p)

Replies from: adrian-arellano-davin, adrian-arellano-davin

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-06T06:51:00.201Z · LW(p) · GW(p)

Yes, I can imagine many things. I can also imagine all molecules in a glass of water bouncing off in a way that suddenly the water freezes. I don't see how a superintelligence makes that happen. This is the biggest mistake that EY is making. He is equating enormous ability to almightiness. They are different. I think that pulling off what you suggest is beyond what a superintelligence can do

Replies from: MondSemmel

↑ comment by MondSemmel · 2022-06-06T14:30:04.386Z · LW(p) · GW(p)

Security mindset suggests that it's more useful to think of ways in which something might go wrong, rather than ways in which it might not.

So rather than poking holes into suggestions (by humans, who are not superintelligent) for how a superintelligence could achieve some big goal like wiping out humanity, I expect you'd benefit much more from doing the following thought experiment:

Imagine yourself to be 1000x smarter, 1000x quicker at thinking and learning, with Internet access but no physical body. (I expect you could also trivially add "access to tons of money" from discovering a security exploit in a cryptocurrency or something.) How could you take over the world / wipe out humanity, from that position? What's the best plan you can come up with? How high is its likelihood of success? Etc.

Replies from: adrian-arellano-davin

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-06T20:28:49.129Z · LW(p) · GW(p)

I agree that it can be more useful but this is not what is being discussed or what I am criticizing. I never said that AGI won't be dangerous nor that it is not important to work on this. What I am a bit worried about is that this community is getting something wrong, namely, that an AGI will exterminate the human race and it will happen soon. Realism and objectivity should be preserved at all costs. Having a totally unrealistic take on the real hazards will cause backlash eventually: think of the many groups that have argued that, to better fight climate change, we need to consider the worst-case scenario, that we need to exaggerate and scare people. I feel the LW community is falling into this.

Replies from: MondSemmel

↑ comment by MondSemmel · 2022-06-06T21:47:19.705Z · LW(p) · GW(p)

I understand your worry, but I was addressing your specific point that "I think that pulling off what you suggest is beyond what a superintelligence can do".

There are people who have reasonable arguments against various claims of the AI x-risk community, but I'm extremely skeptical of this claim. To me it suggests a failure of imagination, hence my suggested thought experiment.

Replies from: adrian-arellano-davin

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-06T22:15:16.120Z · LW(p) · GW(p)

I see. I agree that it might be a failure of imagination, but if it is, why do you consider that way more likely than the alternative "it is not that easy to do something like that even being very clever"? The problem I have is that all doom scenarios that I see discussed are so utterly unrealistic (e.g. the AGI suddenly makes nanobots and delivers them to all humans at once and so on) that it makes me think we are failing to conceive of plans that could succeed because it might be harder than we think.

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-06T08:15:43.617Z · LW(p) · GW(p)

There would also be a fraction of human beings who would probably be immune. How does the superintelligence solve that? Can it also know the full diversity of human immune systems?

Replies from: CronoDAS

↑ comment by CronoDAS · 2022-06-06T08:31:06.162Z · LW(p) · GW(p)

Untreated rabies has a survival rate of literally zero. It's not inconceivable that another virus could be equally lethal.

(Edit: not literally zero, because not every exposure leads to symptoms, but surviving symptomatic rabies is incredibly rare.)

Replies from: quintin-pope, adrian-arellano-davin

↑ comment by Quintin Pope (quintin-pope) · 2022-06-06T08:45:45.911Z · LW(p) · GW(p)

I agree with your broader point that a superintelligence could design incredibly lethal, highly communicable diseases. However, I'd note that it's only symptomatic untreated rabies that has a survival rate of zero. It's entirely possible (even likely) to be bitten by a rabid animal and not contract rabies.

Many factors influence your odds of developing symptomatic rabies, including bite location, bite depth and pathogen load of the biting animal. The effects of pathogen inoculations are actually quite dependent on initial conditions. Presumably, the inoculum in non-transmitting bites is greater than zero, so it is actually possible for the immune system to fight off a rabies infection. It's just that, conditional on having failed to do so at the start of infection, the odds of doing so afterwards are tiny.

Replies from: CronoDAS

↑ comment by CronoDAS · 2022-06-06T20:57:03.669Z · LW(p) · GW(p)

You're actually right about rabies; I found things saying that about 14% of dogs survive and a group of unvaccinated people who had rabies antibodies but never had symptoms.

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-06T08:33:28.463Z · LW(p) · GW(p)

How do you guarantee that all humans get exposed to a significant dosage before they start reacting? How do you guarantee that there aren't full populations (maybe in places with a large genetic diversity like India or Africa) that happen to be immune?

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-06T08:45:28.505Z · LW(p) · GW(p)

Replies from: adrian-arellano-davin

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-06T08:55:37.333Z · LW(p) · GW(p)

Replies from: cwbakerlee

↑ comment by cwbakerlee · 2022-06-06T18:41:05.492Z · LW(p) · GW(p)

Just want to preemptively flag that in the EA biosecurity community we follow a general norm against brainstorming novel ways to cause harm with biology. Basic reasoning is that succeeding in this task ≈ generating info hazards.

Abstractly postulating a hypothetical virus with high virulence + transmissibility and a long latent period can be useful for facilitating thinking, but brainstorming the specifics of how to actually accomplish this -- as some folks in these and some nearby comments are trending in the direction of starting to do -- poses risks that exceed the likely benefits.

Happy to discuss further if interested, feel free to DM me.

Replies from: adrian-arellano-davin

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-06T20:13:21.185Z · LW(p) · GW(p)

Thanks for the heads-up, it makes sense

comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2022-06-07T16:51:24.178Z · LW(p) · GW(p)

While I share a large degree of pessimism for similar reasons, I am somewhat more optimistic overall.

Most of this comes from generic uncertainty and epistemic humility; I'm a big fan of the inside view, but it's worth noting that this can (roughly) be read as a set of 42 statements that need to be true for us to in fact be doomed, and statistically speaking it seems unlikely that all of these statements are true.

However, there are some more specific points I can point to where I think you are overconfident, or at least not providing good reasons for such a high level of confidence (and to my knowledge nobody has). I'll focus on two disagreements which I think are closest to my true disagreements.

1) I think safe pivotal "weak" acts likely do exist. It seems likely that we can access vastly superhuman capabilities without inducing huge x-risk using a variety of capability control methods. If we could build something that was only N<<infinity times smarter than us, then intuitively it seems unlikely that it would be able to reverse engineer details of the outside world or other AI systems' source code (cf 35) necessary to break out of the box or start cooperating with its AI overseers. If I am right, then the reason nobody has come up with one is that they aren't smart enough (in some -- possibly quite narrow -- sense of smart); that's why we need the superhuman AI! Of course, it could also be that someone has such an idea, but isn't sharing it publicly / with Eliezer.

2) I am not convinced that any superhuman AGI we are likely to have the technical means to build in the near future is going to be highly consequentialist (although this does seem likely). I think that humans aren't actually that consequentialist, current AI systems even less so, and it seems entirely plausible that you don't just automatically get super consequentialist things no matter what you are doing or how you are training them... if you train something to follow commands in a bounded way using something like supervised learning, maybe you actually end up with something that does something reasonably close to that. My main reason for expecting consequentialist systems at superhuman-but-not-superintelligent-level AGI is that people will build them that way because of competitive pressures, not because systems that people are trying to make non-consequentialist end up being consequentialist.

These two points are related: If we think consequentialism is unavoidable (RE 2), then we should be more skeptical that we can safely harness the power of superhuman capabilities at all (RE 1), although we could still hope to use capability control and incentive schemes to harness a superhuman-but-not-superintelligent consequentialist AGI to devise and help execute "weak" pivotal acts.

3) Maybe one more point worth mentioning is the "alien concepts" bit: I also suspect AIs will have alien concepts and thus generalize in weird ways. Adversarial examples and other robustness issues are evidence in favor of this, but we are also seeing that scaling makes models more robust, so it seems plausible that AGI will actually end up using similar concepts to humans, thus making generalizing in the ways we intend/expect natural for AGI systems.

---------------------------------------------------------------------
The rest of my post is sort of just picking particular places where I think the argumentation is weak, in order to illustrate why I currently think you are, on net, overconfident.

7. The reason why nobody in this community has successfully named a 'pivotal weak act' where you do something weak enough with an AGI to be passively safe, but powerful enough to prevent any other AGI from destroying the world a year later - and yet also we can't just go do that right now and need to wait on AI - is that nothing like that exists.

This contains a dubious implicit assumption, namely: we cannot build safe super-human intelligence, even if it is only slightly superhuman, or superhuman in various narrow-but-strategically-relevant areas.

19. More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment - to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward.

This is basically what CIRL aims to do. We can train for this sort of thing and study such methods of training empirically in synthetic settings.

23. Corrigibility is anti-natural to consequentialist reasoning

Maybe I missed it, but I didn't see any argument for why we end up with consequentialist reasoning.

30. [...] There is no pivotal output of an AGI that is humanly checkable and can be used to safely save the world but only after checking it; this is another form of pivotal weak act which does not exist.

It seems like such things are likely to exist by analogy with complexity theory (checking is easier than proposing).
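As a minimal sketch of the asymmetry being invoked here, consider subset-sum: checking a proposed answer takes time linear in its length, while the naive way of proposing one searches up to 2^n subsets. The function names and the sample numbers are hypothetical, chosen only for illustration. This shows the complexity-theory intuition only; whether pivotal AGI outputs have this checkable structure is exactly what is in dispute in this thread.

```python
from itertools import combinations


def check_certificate(numbers, target, indices):
    """Verify a proposed solution: cheap, linear in the certificate length."""
    distinct = len(set(indices)) == len(indices)
    in_range = all(0 <= i < len(numbers) for i in indices)
    return distinct and in_range and sum(numbers[i] for i in indices) == target


def propose_certificate(numbers, target):
    """Find a solution by brute force: may examine up to 2**len(numbers) subsets."""
    for size in range(len(numbers) + 1):
        for indices in combinations(range(len(numbers)), size):
            if sum(numbers[i] for i in indices) == target:
                return list(indices)
    return None


numbers = [3, 34, 4, 12, 5, 2]
target = 9
certificate = propose_certificate(numbers, target)   # expensive step
print(certificate, check_certificate(numbers, target, certificate))  # cheap step
```

Note that the checker only confirms the one property it was written to test (the indices sum to the target); it says nothing about properties nobody thought to check.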

36. AI-boxing can only work on relatively weak AGIs; the human operators are not secure systems.

I figured it was worth noting that this part doesn't explicitly say that relatively weak AGIs can't perform pivotal acts.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-08T05:30:25.150Z · LW(p) · GW(p)

this can (roughly) be read as a set of 42 statements that need to be true for us to in fact be doomed, and statistically speaking it seems unlikely that all of these statements are true.

I don't think these statements all need to be true in order for p(doom) to be high, and I also don't think they're independent. Indeed, they seem more disjunctive than conjunctive to me; there are many cases where any one of the claims being true increases risk substantially, even if many others are false.
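To make the conjunctive-versus-disjunctive distinction concrete, here is a small sketch. It assumes the claims are independent and equally likely, which is an illustrative simplification that neither commenter endorses (the reply above explicitly doubts independence), and the per-claim probabilities 0.9 and 0.1 are hypothetical.

```python
def p_all_true(p: float, n: int) -> float:
    """Conjunctive reading: doom requires all n independent claims to hold."""
    return p ** n


def p_any_true(p: float, n: int) -> float:
    """Disjunctive reading: any one of n independent claims holding suffices."""
    return 1.0 - (1.0 - p) ** n


N_CLAIMS = 42  # the count used in the parent comment

# Even if each claim is 90% likely, requiring all of them gives a small product.
conjunctive = p_all_true(0.9, N_CLAIMS)

# Even if each claim is only 10% likely, needing just one gives a large total.
disjunctive = p_any_true(0.1, N_CLAIMS)

print(f"all {N_CLAIMS} claims true (p=0.9 each): {conjunctive:.3f}")
print(f"at least one claim true (p=0.1 each): {disjunctive:.3f}")
```

Under these toy assumptions the same list yields on the order of 1% or 99% depending purely on whether it is read as a conjunction or a disjunction, which is why the structure of the argument matters more than the number of items.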

Replies from: capybaralet

↑ comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2022-06-08T11:27:17.478Z · LW(p) · GW(p)

I basically agree.

I am arguing against extreme levels of pessimism (~>99% doom).

comment by Richard_Ngo (ricraz) · 2022-06-10T01:44:52.733Z · LW(p) · GW(p)

Thanks for writing this, I agree that people have underinvested in writing documents like this. I agree with many of your points, and disagree with others. For the purposes of this comment, I'll focus on a few key disagreements.

My model of this variety of reader has an inside view, which they will label an outside view, that assigns great relevance to some other data points that are not observed cases of an outer optimization loop producing an inner general intelligence, and assigns little importance to our one data point actually featuring the phenomenon in question. Consider skepticism, if someone is ignoring this one warning, especially if they are not presenting equally lethal and dangerous things that they say will go wrong instead.

There are some ways in which AGI will be analogous to human evolution. There are some ways in which it will be disanalogous. Any solution to alignment will exploit at least one of the ways in which it's disanalogous. Pointing to the example of humans without analysing the analogies and disanalogies more deeply doesn't help distinguish between alignment proposals which usefully exploit disanalogies, and proposals which don't.

Alpha Zero blew past all accumulated human knowledge about Go after a day or so of self-play, with no reliance on human playbooks or sample games.

It seems useful to distinguish between how fast any given model advances during training, and how fast the frontier of our best models advances. AlphaZero seems like a good example of why we should expect the former to be fast; but for automated oversight techniques, the latter is more relevant.

if a textbook from one hundred years in the future fell into our hands, containing all of the simple ideas that actually work robustly in practice, we could probably build an aligned superintelligence in six months.

Maybe one way to pin down a disagreement here: imagine the minimum-intelligence AGI that could write this textbook (including describing the experiments required to verify all the claims it made) in a year if it tried. How many Yudkowsky-years does it take to safely evaluate whether following a textbook which that AGI spent a year writing will kill you?

This situation you see when you look around you is not what a surviving world looks like. The worlds of humanity that survive have plans.

It would be great to have a well-justified plan at this point. But I think you're also overestimating the value of planning, in a way that's related to you using the phrase "miracle" to mean "positive model violation". Nobody throughout human history has ever had a model of the future accurate enough to justify equivocating those two terms. Every big scientific breakthrough is a model violation to a bunch of geniuses who have been looking at the problem really hard, but not quite at the right angle. This is why I pushed you, during our debates, to produce predictions rather than postdictions, so that I could distinguish you from all the other geniuses who ran into big model violations.

Replies from: Eliezer_Yudkowsky

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-10T05:19:06.441Z · LW(p) · GW(p)

Maybe one way to pin down a disagreement here: imagine the minimum-intelligence AGI that could write this textbook (including describing the experiments required to verify all the claims it made) in a year if it tried. How many Yudkowsky-years does it take to safely evaluate whether following a textbook which that AGI spent a year writing will kill you?

Infinite? That can't be done?

Replies from: ricraz

↑ comment by Richard_Ngo (ricraz) · 2022-06-10T17:06:38.115Z · LW(p) · GW(p)

Hmm, okay, here's a variant. Assume it would take N Yudkowsky-years to write the textbook from the future described above. How many Yudkowsky-years does it take to evaluate a textbook that took N Yudkowsky-years to write, to a reasonable level of confidence (say, 90%)?

Replies from: Eliezer_Yudkowsky

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-10T23:23:21.191Z · LW(p) · GW(p)

If I know that it was written by aligned people? I wouldn't just be trying to evaluate it myself; I'd try to get a team together to implement it, and understanding it well enough to implement it would be the same process as verifying whatever remaining verifiable uncertainty was left about the origins, where most of that uncertainty is unverifiable because the putative hostile origin is plausibly also smart enough to sneak things past you.

Replies from: ricraz

↑ comment by Richard_Ngo (ricraz) · 2022-06-11T00:04:00.166Z · LW(p) · GW(p)

Sorry, I should have been clearer. Let's suppose that a copy of you spent however long it takes to write an honest textbook with the solution to alignment (let's call it N Yudkowsky-years), and an evil copy of you spent N Yudkowsky-years writing a deceptive textbook trying to make you believe in a false solution to alignment, and you're given one but not told which. How long would it take you to reach 90% confidence about which you'd been given? (You're free to get a team together to run a bunch of experiments and implementations, I'm just asking that you measure the total work in units of years-of-work-done-by-people-as-competent-as-Yudkowsky. And I should specify some safety threshold too - like, in the process of reaching 90% confidence, incurring less than 10% chance of running an experiment which kills you.)
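One way to see what "reaching 90% confidence" would demand is a toy Bayesian model. This is a sketch under strong assumptions that are not from the thread: a 50/50 prior over which textbook you were handed, and a series of independent checks that each carry the same likelihood ratio P(result | honest) / P(result | deceptive). The function names are hypothetical.

```python
import math


def posterior_honest(prior_honest, likelihood_ratios):
    """Update P(honest) by multiplying the prior odds by each check's likelihood ratio."""
    odds = prior_honest / (1.0 - prior_honest)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)


def checks_needed(prior_honest, target, ratio_per_check):
    """Number of equally informative, independent checks needed to reach the target."""
    if ratio_per_check <= 1.0:
        raise ValueError("each check must favor the honest hypothesis (ratio > 1)")
    prior_odds = prior_honest / (1.0 - prior_honest)
    target_odds = target / (1.0 - target)
    if prior_odds >= target_odds:
        return 0
    return math.ceil(math.log(target_odds / prior_odds) / math.log(ratio_per_check))


# Moderately informative checks versus checks that barely discriminate.
for ratio in (1.5, 1.01):
    n = checks_needed(0.5, 0.9, ratio)
    print(f"ratio {ratio}: {n} checks")
```

In this model the cost is driven almost entirely by how informative each check is. A deceptive author would presumably try to push that ratio toward 1, and the model has no way to represent checks whose results are themselves adversarially chosen.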

Replies from: Eliezer_Yudkowsky

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-11T04:08:50.626Z · LW(p) · GW(p)

Depends what the evil clones are trying to do.

Get me to adopt a solution wrong in a particular direction, like a design that hands the universe over to them? I can maybe figure out the first time through who's out to get me, if it's 200 Yudkowsky-years. If it's 200,000 Yudkowsky-years I think I'm just screwed.

Get me to make any lethal mistake at all? I don't think I can get to 90% confidence period, or at least, not without spending an amount of Yudkowsky-time equivalent to the untrustworthy source.

comment by p.b. · 2022-06-07T18:41:06.935Z · LW(p) · GW(p)

Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction.

Humans haven't been optimized to pursue inclusive genetic fitness for very long, because humans haven't been around for very long. Instead they inherited the crude heuristics pointing towards inclusive genetic fitness from their cognitively much less sophisticated predecessors. And those still kinda work!

If we are still around in a couple of million years I wouldn't be surprised if there was inner alignment in the sense that almost all humans in almost all practically encountered environments end up consciously optimising inclusive genetic fitness.

More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment - to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward.

Generally, I think that people draw the wrong conclusions from mesa-optimisers and the examples of human evolutionary alignment.

Saying that we would like to solve alignment by specifying exactly what we want and then let the AI learn exactly what we want, is like saying that we would like to solve transportation by inventing teleportation. Yeah, would be nice but unfortunately it seems like you will have to move through space instead.

The conclusion we should take from the concept of mesa-optimisation isn't "oh no alignment is impossible", that's equivalent to "oh no learning is impossible". But learning is possible. So the correct conclusion is "alignment has to work via mesa-optimisation".

Because alignment in the human examples (i.e. human alignment to evolution's objective and human alignment to human values) works by bootstrapping from incredibly crude heuristics. Think three dark patches for a face.

Humans are mesa-optimized to adhere to human values. If we were actually inner aligned to the crude heuristics that evolution installed in us for bootstrapping the entire process, we would be totally dysfunctional weirdos.

I mean even more so ...

To me the human examples suggest that there has to be a possibility to get from gesturing at what we want to getting what we want. And I think we can gesture a lot better than evolution! Well, at least using much more information than 3.2 billion base pairs.

If alignment has to be a bootstrapped open ended learning process there is also the possibility that it will work better with more intelligent systems or really only start working with fairly intelligent systems.

Maybe bootstrapping with cake, kittens and cuddles will still get us paperclipped, I don't know. It certainly seems awfully easy to just run straight off a cliff. But I think looking at the only known examples of alignment of intelligences does allow us more optimistic takes than are prevalent on this page.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-08T05:42:57.015Z · LW(p) · GW(p)

The conclusion we should take from the concept of mesa-optimisation isn't "oh no alignment is impossible", that's equivalent to "oh no learning is impossible".

The OP isn't claiming that alignment is impossible.

If we were actually inner aligned to the crude heuristics that evolution installed in us for bootstrapping the entire process, we would be totally dysfunctional weirdos.

I don't understand the point you're making here.

Replies from: p.b.

↑ comment by p.b. · 2022-06-08T07:33:21.400Z · LW(p) · GW(p)

The point I'm making is that the human example tells us that:

First we realize that we can't code up our values, and that therefore alignment is hard. Then, when we realize that mesa-optimisation is a thing, we shouldn't update towards "alignment is even harder". We should update in the opposite direction.

Because the human example tells us that a mesa-optimiser can reliably point to a complex thing even if the optimiser points to only a few crude things.

But I only ever see these three points (the human example, the inability to code up values, and mesa-optimisation) used separately to argue for "alignment is even harder than previously thought". Taken together, that is just not the picture.

Replies from: Eliezer_Yudkowsky, quintin-pope

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-08T07:38:25.929Z · LW(p) · GW(p)

Humans point to some complicated things, but not via a process that suggests an analogous way to use natural selection or gradient descent to make a mesa-optimizer point to particular externally specifiable complicated things.

Replies from: TurnTrout, david-johnston

↑ comment by TurnTrout · 2022-06-08T17:34:53.454Z · LW(p) · GW(p)

Why do you think that? Why is the process by which humans come to reliably care about the real world, not a process we could leverage analogously to make AIs care about the real world?

Likewise, when you wrote,

This isn't to say that nothing in the system’s goal (whatever goal accidentally ends up being inner-optimized over) could ever point to anything in the environment by accident.

Where is the accident? Did evolution accidentally find a way to reliably orient terminal human values towards the real world? Do people each, individually, accidentally learn to terminally care about the real world? Because the former implies the existence of a better alignment paradigm (that which occurs within the human brain, to take an empty-slate human and grow them into an intelligence which terminally cares about objects in reality), and the latter is extremely unlikely. Let me know if you meant something else.

EDIT: Updated a few confusing words.

Replies from: RobbBB, Vaniver, lc

↑ comment by Rob Bensinger (RobbBB) · 2022-06-08T19:41:31.814Z · LW(p) · GW(p)

Why is the process by which humans come to reliably care about the real world, not a process we could leverage analogously to make AIs care about the real world?

Maybe I'm not understanding your proposal, but on the face of it this seems like a change of topic. I don't see Eliezer claiming 'there's no way to make the AGI care about the real world vs. caring about (say) internal experiences in its own head'. Maybe he does think that, but mostly I'd guess he doesn't care, because the important thing is whether you can point the AGI at very, very specific real-world tasks.

Where is the accident? Did evolution accidentally find a way to reliably orient people towards the real world? Do people each, individually, accidentally learn to care about the real world?

Same objection/confusion here, except now I'm also a bit confused about what you mean by "orient people towards the real world". Your previous language made it sound like you were talking about causing the optimizer's goals to point at things in the real world, but now your language makes it sound like you're talking about causing the optimizer to model the real world or causing the optimizer to instrumentally care about the state of the real world....? Those all seem very different to me.

Or, in summary, I'm not seeing the connection between:

"Terminally valuing anything physical at all" vs. "terminally valuing very specific physical things".
"Terminally valuing anything physical at all" vs. "instrumentally valuing anything physical at all".
"Terminally valuing very specific physical things" vs. "instrumentally valuing very specific physical things".
Any of the above vs. "modeling / thinking about physical things at all", or "modeling / thinking about very specific physical things".

Replies from: TurnTrout

↑ comment by TurnTrout · 2022-06-08T20:10:36.672Z · LW(p) · GW(p)

Hm, I'll give this another stab. I understand the first part of your comment as "sure, it's possible for minds to care about reality, but we don't know how to target value formation so that the mind cares about a particular part of reality." Is this a good summary?

I don't see Eliezer claiming 'there's no way to make the AGI care about the real world vs. caring about (say) internal experiences in its own head'.

Let me distinguish three alignment feats:

1. Producing a mind which terminally values sensory entities.
2. Producing a mind which reliably terminally values some kind of non-sensory entity in the world, like dogs or bananas.
    1. AFAIK we have no idea how to ensure this happens reliably -- to produce an AGI which terminally values some element of {diamonds, dogs, cats, tree branches, other real-world objects}, such that there's a low probability that the AGI actually just cares about high-reward sensory observations.
    2. In other words: Design a mind which cares about anything at all in reality which isn't a shallow sensory phenomenon which is directly observable by the agent. Like, maybe I have a mind-training procedure, where I don't know what the final trained mind will value (dogs, diamonds, trees having particular kinds of cross-sections at year 5 of their growth), but I'm damn sure the AI will care about something besides its own sensory signals.
    3. I was, first, pointing out that this problem has to be solvable, since the human genome solves it millions of times every day!
3. Producing a mind which reliably terminally values a specific non-sensory entity, like diamonds.
    1. Design a mind which cares about a particular kind of object. We could target the mind-training process to care about diamonds, or about dogs, or about trees, but to solve this problem, we have to ensure the trained mind significantly cares about one kind of real-world entity in particular. Therefore, feat #3 is strictly harder than feat #2.
    2. This is what you point out as a potential crux.

(EDIT: Added a few sub-points to clarify list.)

From my shard theory document:

We (alignment researchers) have had no idea how to actually build a mind which intrinsically (not instrumentally!) values a latent, non-sensory object in the real world. Witness the confusion on this point in Arbital’s ontology identification article.
To my knowledge, we still haven't solved this problem. We have no reward function to give AIXI which makes AIXI maximize real-world diamonds. A deep learning agent might learn to care about the real world, yes, but it might learn sensory preferences instead. Ignorance about the outcome is not a mechanistic account of why the agent convergently will care about specific real-world objects instead of its sensory feedback signals.
Under this account, caring about the real world is just one particular outcome among many. Hence, the "classic paradigms" imply that real-world caring is (relatively) improbable.
While we have stories about entities which value paperclips, I do not think we have known how to design them. Nor have we had any mechanistic explanation for why people care about the real world in particular.

As you point out, we obviously need to figure problem 3 out in order to usefully align an AGI. I will now argue that the genome solves problem 3, albeit not in the sense of aligning humans with inclusive genetic fitness (you can forget about human/evolution alignment, I won't be discussing that in this comment).

The genome solves problem #3 in the sense of: if a child grows up with a dog, then that child will (with high probability) terminally value that dog.

Isn't that an amazing alignment feat!?

Therefore, there has to be a reliable method of initializing a mind from scratch, training it, and having the resultant intelligence care about dogs. Not only does it exist in principle, it succeeds in practice, and we can think about what that method might be. I think this method isn't some uber-complicated alignment solution. The shard theory explanation for dog-value formation is quite simple.

now your language makes it sound like you're talking about causing the optimizer to model the real world or causing the optimizer to instrumentally care about the state of the real world....? Those all seem very different to me.

Nope, wasn't meaning any of these! I was talking about "causing the optimizer's goals to point at things in the real world" the whole time.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-08T20:58:36.371Z · LW(p) · GW(p)

I understand the first part of your comment as "sure, it's possible for minds to care about reality, but we don't know how to target value formation so that the mind cares about a particular part of reality." Is this a good summary?

Yes!

I was, first, pointing out that this problem has to be solvable, since the human genome solves it millions of times every day!

True! Though everyone already agreed (e.g., EY asserted this in the OP) that it's possible in principle. The updatey thing would be if the case of the human genome / brain development suggests it's more tractable than we otherwise would have thought (in AI).

Seems to me like it's at least a small update about tractability, though I'm not sure it's a big one? Would be interesting to think about the level of agreement between different individual humans with regard to 'how much particular external-world things matter'. Especially interesting would be cases where humans consistently, robustly care about a particular external-world thingie even though it doesn't have a simple sensory correlate.

(E.g., humans developing to care about sex is less promising insofar as it depends on sensory-level reinforcement such as orgasms. Humans developing to care about 'not being in the Matrix / not being in an experience machine' is possibly more promising, because it seems like a pretty common preference that doesn't get directly shaped by sensory rewards.)

3. Producing a mind which reliably terminally values a specific non-sensory entity, like diamonds

Is the distinction between 2 and 3 that "dog" is an imprecise concept, while "diamond" is precise? FWIW, 2 and 3 currently sound very similar to me, if 2 is 'maximize the number of dogs' and 3 is 'maximize the number of diamonds'.

If you could reliably build a dog maximizer, I think that would also be a massive win and would maybe mean that the alignment problem is mostly-solved. (Indeed, I'm inclined to think that's a harder feat than building a diamond maximizer, and I think being able to build a diamond maximizer would also suggest the strawberry-grade alignment problem is mostly solved.)

But maybe I'm misunderstanding 2.

Nope, wasn't meaning any of these! I was talking about "causing the optimizer's goals to point at things in the real world" the whole time.

Cool!

I'll look more at your shards document and think about your arguments here. :)

Replies from: TurnTrout

↑ comment by TurnTrout · 2022-06-09T01:36:57.140Z · LW(p) · GW(p)

Is the distinction between 2 and 3 that "dog" is an imprecise concept, while "diamond" is precise? FWIW, 2 and 3 currently sound very similar to me, if 2 is 'maximize the number of dogs' and 3 is 'maximize the number of diamonds'.

Feat #2 is: Design a mind which cares about anything at all in reality which isn't a shallow sensory phenomenon which is directly observable by the agent. Like, maybe I have a mind-training procedure, where I don't know what the final trained mind will value (dogs, diamonds, trees having particular kinds of cross-sections at year 5 of their growth), but I'm damn sure the AI will care about something besides its own sensory signals. Such a procedure would accomplish feat #2, but not #3.

Feat #3 is: Design a mind which cares about a particular kind of object. We could target the mind-training process to care about diamonds, or about dogs, or about trees, but to solve this problem, we have to ensure the trained mind significantly cares about one kind of real-world entity in particular. Therefore, feat #3 is strictly harder than feat #2.

If you could reliably build a dog maximizer, I think that would also be a massive win and would maybe mean that the alignment problem is mostly-solved. (Indeed, I'm inclined to think that's a harder feat than building a diamond maximizer

I actually think that the dog- and diamond-maximization problems are about equally hard, and, to be totally honest, neither seems that bad^[1] in the shard theory paradigm.

Surprisingly, I weakly suspect the harder part is the maximizing (getting the agent to maximize real-world dogs in expectation), not the targeting (getting the agent's values to point at real-world dogs in the first place). I think "figure out how to build a mind which cares about the number of real-world dogs, such that the mind intelligently selects plans which lead to a lot of dogs" is significantly easier than building a dog-maximizer.

^[1] I appreciate that this claim is hard to swallow. In any case, I want to focus on inferentially-closer questions first, like how human values form.

↑ comment by Vaniver · 2022-06-08T18:36:29.538Z · LW(p) · GW(p)

Why is the process by which humans come to reliably care about the real world

IMO this process seems pretty unreliable and fragile. Drugs are popular; video games are popular; people-in-aggregate put more effort into obtaining imaginary afterlives than life extension or cryonics.

But also humans have a much harder time 'optimizing against themselves' than AIs will, I think. I don't have a great mechanistic sense of what it will look like for an AI to reliably care about the real world.

Replies from: TurnTrout

↑ comment by TurnTrout · 2022-06-08T19:01:59.590Z · LW(p) · GW(p)

One of the problems with English is that it doesn't natively support orders of magnitude for "unreliable." Do you mean "unreliable" as in "between 1% and 50% of people end up with part of their values not related to objects-in-reality", or as in "there is no a priori reason why anyone would ever care about anything not directly sensorially observable, except as a fluke of their training process"? Because the latter is what current alignment paradigms mispredict, and the former might be a reasonable claim about what really happens for human beings.

EDIT: My reader-model is flagging this whole comment as pedagogically inadequate, so I'll point to the second half of section 5 in my shard theory document.

↑ comment by lc · 2022-06-08T17:43:55.557Z · LW(p) · GW(p)

Why do you think that? Why is the process by which humans come to reliably care about the real world, not a process we could leverage analogously to make AIs care about the real world?

Humans came to their goals while being trained by evolution on genetic inclusive fitness, but they don't explicitly optimize for that. They "optimize" for something pretty random, that looks like genetic inclusive fitness in the training environment but then in this weird modern out-of-sample environment looks completely different. We can definitely train an AI to care about the real world, but his point is that, by doing something analogous to what happened with humans, we will end up with some completely different inner goal than the goal we're training for, as happened with humans.
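A toy illustration of the failure mode described above, where a learned proxy matches the training objective in the training environment and diverges outside it. Everything here is hypothetical and deliberately crude: the feature names, the tiny datasets, and the "learner", which simply keeps whichever single feature best predicts the label during training.

```python
def accuracy(feature, examples):
    """Fraction of examples where the feature's value equals the true label."""
    hits = sum(ex[feature] == ex["aids_fitness"] for ex in examples)
    return hits / len(examples)


def learn_proxy(features, train):
    """Stand-in for the outer optimizer: keep the feature that scores best in training."""
    return max(features, key=lambda f: accuracy(f, train))


# Ancestral-style training environment: sweetness tracks fitness perfectly.
train = [
    {"sweet": 1, "bitter": 0, "aids_fitness": 1},  # ripe fruit
    {"sweet": 1, "bitter": 0, "aids_fitness": 1},  # honey
    {"sweet": 0, "bitter": 1, "aids_fitness": 0},  # toxic plant
    {"sweet": 0, "bitter": 0, "aids_fitness": 0},  # bark
]

# Modern-style deployment environment: the correlation has broken.
deploy = [
    {"sweet": 1, "bitter": 0, "aids_fitness": 0},  # candy
    {"sweet": 1, "bitter": 0, "aids_fitness": 0},  # soda
    {"sweet": 1, "bitter": 0, "aids_fitness": 1},  # ripe fruit
    {"sweet": 0, "bitter": 1, "aids_fitness": 1},  # leafy greens
]

proxy = learn_proxy(["sweet", "bitter"], train)
print(proxy, accuracy(proxy, train), accuracy(proxy, deploy))
```

The learner never sees the label's definition, only which feature scored well, so nothing in training distinguishes "pursue sweetness" from "pursue fitness".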

Replies from: TurnTrout

↑ comment by TurnTrout · 2022-06-08T17:49:39.239Z · LW(p) · GW(p)

I'm not talking about running evolution again, that is not what I meant by "the process by which humans come to reliably care about the real world." The human genome must specify machinery which reliably grows a mind which cares about reality. I'm asking why we can't use the alignment paradigm leveraged by that machinery, which is empirically successful at pointing people's values to certain kinds of real-world objects.

Replies from: lc

↑ comment by lc · 2022-06-08T18:01:34.601Z · LW(p) · GW(p)

Ah, I misunderstood.

Well, for starters, because if the history of ML is anything to go by, we're gonna be designing the thing analogous to evolution, and not the brain. We don't pick the actual weights in these transformers, we just design the architecture and then run stochastic gradient descent or some other meta-learning algorithm. That meta-learning algorithm is going to be what decides what goes in the DNA, so in order to get the DNA right, we will need to get the meta-learning algorithm correct. Evolution doesn't have much to teach us about that except as a negative example.

But (I think) the answer is similar to this:

Replies from: TurnTrout

↑ comment by TurnTrout · 2022-06-08T18:14:07.682Z · LW(p) · GW(p)

we're gonna be designing the thing analogous to evolution, and not the brain. We don't pick the actual weights in these transformers, we just design the architecture and then run stochastic gradient descent or some other meta-learning algorithm.

But, ah, the genome also doesn't "pick the actual weights" for the human brain which it later grows. So whatever the brain does to align people to care about latent real-world objects, I strongly believe that that process must be compatible with blank-slate initialization and then learning.

That meta-learning algorithm is going to be what decides what goes in the DNA, not some human architect.

In the evolution/mainstream-ML analogy, we humans are specifying the DNA, not the search process over DNA specifications. We specify the learning architecture, and then the learning process fills in the rest.
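As a toy illustration of "we specify the learning architecture, and then the learning process fills in the rest", here is a minimal sketch. Everything in it is an assumption chosen for illustration (a one-parameter-pair linear "architecture", a squared-error signal, the learning rate, the data); it is not a model of the genome, the brain, or any real training setup. The designer writes down the model family and the update rule, and the weights come out of blank-slate initialization plus learning.

```python
import random

def train(data, steps=2000, lr=0.05, seed=0):
    """Hypothetical toy learner: the designer fixes the architecture (y = w*x + b)
    and the update rule (SGD on squared error); the weights are never hand-picked."""
    rng = random.Random(seed)
    # Blank-slate initialization: arbitrary starting weights.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    for _ in range(steps):
        x, y = rng.choice(data)        # one randomly chosen training example
        err = (w * x + b) - y          # prediction error on that example
        w -= lr * err * x              # gradient step for w
        b -= lr * err                  # gradient step for b
    return w, b

# Illustrative data generated by y = 2x + 1; the learner is never told these numbers.
data = [(x, 2.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
w, b = train(data)
```

The final `w` and `b` are determined by the data and the update rule rather than written by hand, which is the only point the sketch is meant to make.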

I confess that I already have a somewhat sharp picture [LW(p) · GW(p)] of the alignment paradigm used by the brain, that I already have concrete reasons to believe it's miles better than anything we have dreamed so far. I was originally querying what Eliezer thinks about the "genome->human alignment properties" situation, rather than expressing innocent ignorance of how any of this works.

Replies from: lc

↑ comment by lc · 2022-06-08T18:22:09.665Z · LW(p) · GW(p)

I think I disagree with you, but I don't really understand what you're saying or how these analogies are being used to point to the real world anymore. It seems to me like you might be taking something that makes the problem of "learning from evolution" even more complicated (evolution -> protein -> something -> brain vs. evolution -> protein -> brain) and using that to argue the issues are solved, in the same vein as the "just don't use a value function" people. But I haven't read shard theory, so, GL.

In the evolution/mainstream-ML analogy, we humans are specifying the DNA, not the search process over DNA specifications.

You mean, we are specifying the ATCG strands, or we are specifying the "architecture" behind how DNA influences the development of the human body? It seems to me like we are definitely also choosing how the search for the correct ATCG strands proceeds and how they're identified, in this analogy. The DNA doesn't "align" new babies out of the womb, it's just a specification of how to copy the existing, already """aligned""" code.

Replies from: TurnTrout

↑ comment by TurnTrout · 2022-06-08T18:55:24.508Z · LW(p) · GW(p)

"learning from evolution" even more complicated (evolution -> protein -> something -> brain vs. evolution -> protein -> brain)

ah, no, this isn't what I'm saying. Hm. Let me try again.

The following is not a handwavy analogy, it is something which actually happened:

Evolution found the human genome.
The human genome specifies the human brain.
The human brain learns most of its values and knowledge over time.
Human brains reliably learn to care about certain classes of real-world objects like dogs.

Therefore, somewhere in the "genome -> brain -> (learning) -> values" process, there must be a process which reliably produces values over real-world objects. Shard theory aims to explain this process. The shard-theoretic explanation is actually pretty simple.

Furthermore, we don't have to rerun evolution to access this alignment process. For the sake of engaging with my points, please forget completely about running evolution. I will never suggest rerunning evolution, because it's unwise and irrelevant to my present points. I also currently don't see why the genome's alignment process requires more than crude hard-coded reward circuitry, reinforcement learning, and self-supervised predictive learning.

Replies from: FireStormOOO

↑ comment by FireStormOOO · 2022-06-09T06:07:04.218Z · LW(p) · GW(p)

That does seem worth looking at and there are probably ideas worth stealing from biology. I'm not sure you can call that a robustly aligned system that's getting bootstrapped though. Existing in a society of (roughly) peers and the lack of a huge power disparity between any given person and the rest of humans is analogous to the AGI that can't take over the world yet. Humans that acquire significant power do not seem aligned wrt what a typical person would profess to and outwardly seem to care about.

I think your point still mostly follows despite that; even when humans can be deceptive and power seeking, there's an astounding amount of regularity in what we end up caring about.

Replies from: TurnTrout

↑ comment by TurnTrout · 2022-06-09T16:29:57.761Z · LW(p) · GW(p)

there's an astounding amount of regularity in what we end up caring about.

Yes, this is my claim. Not that eg >95% of people form values which we would want to form within an AGI.

↑ comment by David Johnston (david-johnston) · 2022-06-08T09:46:31.795Z · LW(p) · GW(p)

Humans can, to some extent, be pointed to complicated external things. This suggests that using natural selection on biology can get you mesa-optimizers that can be pointed to particular externally specifiable complicated things. Doesn't prove it (or, doesn't prove you can do it again), but you only asked for a suggestion.

Replies from: Eliezer_Yudkowsky, RobbBB

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-08T22:34:15.628Z · LW(p) · GW(p)

Humans can be pointed at complicated external things by other humans on their own cognitive level, not by their lower maker of natural selection.

Replies from: TurnTrout, david-johnston

↑ comment by TurnTrout · 2022-06-09T00:57:07.761Z · LW(p) · GW(p)

I don't think I understand what, exactly, is being discussed. Are "dogs" or "flowers" or "people you meet face-to-face" examples of "complicated external things"?

↑ comment by David Johnston (david-johnston) · 2022-06-08T22:50:54.229Z · LW(p) · GW(p)

Right, but the goal is to make AGI you can point at things, not to make AGI you can point at things using some particular technique.

(Tangentially, I also think the jury is still out on whether humans are bad fitness maximizers, and if we're ultimately particularly good at it - e.g. let's say, barring AGI disaster, we'd eventually colonise the galaxy - that probably means AGI alignment is harder, not easier)

↑ comment by Rob Bensinger (RobbBB) · 2022-06-08T19:34:40.780Z · LW(p) · GW(p)

To my eye, this seems like it mostly establishes 'it's not impossible in principle for an optimizer to have a goal that relates to the physical world'. But we had no reason to doubt this in the first place, and it doesn't give us a way to reliably pick in advance which physical things the optimizer cares about. "It's not impossible" is a given for basically everything in AI, in principle, if you have arbitrary amounts of time and arbitrarily deep understanding.

Replies from: david-johnston

↑ comment by David Johnston (david-johnston) · 2022-06-09T02:09:04.963Z · LW(p) · GW(p)

As I said (a few times!) in the discussion about orthogonality, indifference about the measure of "agents" that have particular properties seems crazy to me. Having an example of "agents" that behave in a particular way is enormously different from having an unproven claim that such agents might be mathematically possible.

↑ comment by Quintin Pope (quintin-pope) · 2022-06-08T07:46:46.675Z · LW(p) · GW(p)

I think this is correct. Shard theory is intended as an account of how inner misalignment produces human values. I also think [LW · GW] that human values aren't as complex or weird as they introspectively appear.

comment by Raemon · 2022-06-05T23:28:44.062Z · LW(p) · GW(p)

I read an early draft of this a while ago and am glad to have it publicly available. And I do think the updates in structure/introduction were worth the wait. Thanks!

comment by TekhneMakre · 2022-06-06T19:16:07.016Z · LW(p) · GW(p)

>There is no pivotal output of an AGI that is humanly checkable and can be used to safely save the world but only after checking it

This is a sort of surprising claim. From an abstract point of view, assuming NP >> P, checking can be way easier than inventing. To stick with your example, it kind of seems, at an intuitive guess, like a plan to use nanobots to melt all GPUs should be very complicated but not way superhumanly complicated? (Superhuman to invent, though.) Like, you show me the plans for the bootstrap nanofactory, the workhorse nanofactory, the standard nanobots, the software for coordinating the nanobots, the software for low-power autonomous behavior, the transportation around the world, the homing in on GPUs, and the melting process. That's really complicated, way more complicated than anything humans have done before, but not by 1000x? Maybe like 100x? Maybe only 10x if you count whole operating systems or scientific fields. Does this seem quantitatively in the right ballpark, and you're saying, that quantitatively large but not crazy amount of checking is infeasible?
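The abstract "checking can be way easier than inventing" intuition can be sketched with a toy subset-sum instance. This is only an illustration under strong assumptions: the problem, the numbers, and the function names `verify` and `search` are made up for the example, and the checker here knows exactly which property it needs to verify, which is not a given for a real plan.

```python
from itertools import combinations

def verify(numbers, target, certificate):
    """Check a proposed answer: distinct, in-range indices whose values sum to target.
    This takes time linear in the size of the certificate."""
    return (
        len(set(certificate)) == len(certificate)
        and all(0 <= i < len(numbers) for i in certificate)
        and sum(numbers[i] for i in certificate) == target
    )

def search(numbers, target):
    """Brute-force 'inventor': tries subsets until one works.
    Worst case is exponential in len(numbers)."""
    for size in range(1, len(numbers) + 1):
        for indices in combinations(range(len(numbers)), size):
            if sum(numbers[i] for i in indices) == target:
                return list(indices)
    return None

numbers = [3, 34, 4, 12, 5, 2]
target = 9
found = search(numbers, target)          # expensive step: find a solution
ok = verify(numbers, target, found)      # cheap step: check the solution
```

The asymmetry shown is between searching a space and checking one fully specified candidate against a fully specified criterion. It says nothing about cases where the candidate was produced by something trying to pass the check.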

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-06T22:20:52.543Z · LW(p) · GW(p)

The preceding sentences in the OP were (emphasis added):

Then humans will not be competent to use their own knowledge of the world to figure out all the results of that action sequence. An AI whose action sequence you can fully understand all the effects of, before it executes, is much weaker than humans in that domain; you couldn't make the same guarantee about an unaligned human as smart as yourself and trying to fool you.

I took Eliezer to be saying something like:

'If you're confident that your AGI system is directing its optimization at the target task, is doing no adversarial optimization, and is otherwise aligned, then shrug, maybe there's some role to be played by checking a few aspects of the system's output to confirm certain facts.

'But in this scenario, the work is almost entirely being done by the AGI's alignment, not by the post facto checking. If you screwed up and the system is doing open-ended optimization of the world that includes thinking about its developers and planning to take control from them, then it's plausible that your checking will completely fail to notice the trap; and it's ~certain that your checking, if it does notice the trap, won't thereby give you trap-free nanosystems that you can use to end the acute risk period.'

(One thing to keep in mind is that an adversarial AGI with knowledge of its operators would want to obfuscate its plans, making it harder for humans to productively analyze designs it produces; and it might also want to obscure the fact that the plans are obfuscated, making them look easier-to-check than they are.)

Replies from: TekhneMakre

↑ comment by TekhneMakre · 2022-06-07T00:28:11.347Z · LW(p) · GW(p)

We can distinguish:

-- The AI is trying to deceive you.

-- The AI isn't trying to deceive you, but is trying to produce plans that would, if executed, have consequences X, and X is not something you want.

-- The AI is trying to produce plans that would, if executed, have consequences you want.

The first case is hopeless, and the third case is about an already aligned AI. The second case might not really make sense, because deception is a convergent instrumental goal especially if the AI is trying to cause X and you're trying to cause not X, and generally because an AI that smart probably has inner optimizers that don't care about this "make a plan, don't execute plans" thing you thought you'd set up. But if, arguendo, we have a superintelligently optimized plan which doesn't already contain, in its current description as a plan, a mindhack (e.g. by some surprising way of domaining an AI to care about producing plans but not about making anything happen), then there's a question whether it could help to have humans think about the consequences of the plan. I thought Eliezer was answering that question "No, even in this hypothetical, pivotal acts are too complicated and can't be understood fully in detail by humans, so you'd still have to trust the AI, so the AI has to have understood and applied a whole lot about your values in order to have any shot that the plan doesn't have huge unpleasantly surprising consequences", and I was questioning that.

Replies from: Victor Levoso, Leo P.

↑ comment by Victor Levoso · 2022-06-08T05:12:28.105Z · LW(p) · GW(p)

Not a response to your actual point, but I think that hypothetical example probably doesn't make sense (as in, making the AI not "care" doesn't prevent it from including mindhacks in its plan). If you have a plan that is "superintelligently optimized" for some misaligned goal, then that plan will have to take into account the effect of outputting the plan itself, and will by default contain deception or mindhacks even if the AI doesn't in some sense "care" about executing plans. (Or, if you set up some complicated scheme with counterfactuals so the model ignores the effects of the plans on humans, that will make your plans less useful or inscrutable.)

The plan that produces the most paperclips is going to be one that deceives or mindhacks humans instead of one that humans wouldn't accept in the first place. Maybe it's possible to use some kind of scheme that avoids the model taking the consequences of outputting the plan itself into account, but the model kind of has to be modeling humans reading its plan to write a realistic plan that humans will understand, accept and be able to put into practice, and the plan might only work in the fake counterfactual universe with no plan that it was written for.

So I doubt it's actually feasible to have any such scheme that avoids mindhacks and still produces useful plans.

Replies from: TekhneMakre

↑ comment by TekhneMakre · 2022-06-08T05:18:14.497Z · LW(p) · GW(p)

I think I agree, but also, people say things like "the AI should if possible be prevented from modeling humans", which if possible would imply that the hypothetical example makes more sense.

↑ comment by Leo P. · 2022-06-08T11:21:15.621Z · LW(p) · GW(p)

The second case might not really make sense, because deception is a convergent instrumental goal especially if the AI is trying to cause X and you're trying to cause not X, and generally because an AI that smart probably has inner optimizers that don't care about this "make a plan, don't execute plans" thing you thought you'd set up.

I believe the second case is a subcase of the problem of ELK. Maybe the AI isn't trying to deceive you, and actually does what you asked it to do (e.g., I want to see "the diamond" on the main detector), yet the plans it produces have consequence X that you don't want (in the ELK example, the diamond is stolen but you see something that looks like the diamond on the main detector). The problem is: how can you tell whether the plans proposed have consequence X? Especially if you don't even know X is a possible consequence of the plans?

comment by localdeity · 2022-06-06T04:20:38.965Z · LW(p) · GW(p)

To point 4 and related ones, OpenAI has this on their charter page:

We are concerned about late-stage AGI development becoming a competitive race without time for adequate safety precautions. Therefore, if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project. We will work out specifics in case-by-case agreements, but a typical triggering condition might be “a better-than-even chance of success in the next two years.”

What about the possibility of persuading the top several biggest actors (DeepMind, FAIR, etc.) to agree to something like that? (Note that they define AGI on the page to mean "highly autonomous systems that outperform humans at most economically valuable work".) It's not very fleshed out, either the conditions that trigger the pledge or how the transition goes, but it's a start. The hope would be that someone would make something "sufficiently impressive to trigger the pledge" that doesn't quite kill us, and then ideally (a) the top actors stopping would buy us some time and (b) the top actors devoting their people to helping out (I figure they could write test suites at minimum) could accelerate the alignment work.

I see possible problems with this, but is this at least in the realm of "things worth trying"?

Replies from: Vaniver, conor-sullivan

↑ comment by Vaniver · 2022-06-06T14:54:13.616Z · LW(p) · GW(p)

What about the possibility of persuading the top several biggest actors (DeepMind, FAIR, etc.) to agree to something like that?

My understanding is that this has been tried, at various levels of strength, ever since OpenAI published its charter. My sense is that MIRI's idea of "safety-conscious" looks like this [LW · GW], which it guessed was different from OpenAI's sense; I kind of wish that had been a public discussion back in 2018.

↑ comment by Lone Pine (conor-sullivan) · 2022-06-06T21:04:28.036Z · LW(p) · GW(p)

Given that Sam Altman has some of the shortest timelines around, I wonder if he could be persuaded that DeepMind are within 2 years of the finish line, or will be visibly within 2 years of the finish line in a few years. (Not implying that would be a solution to anything, I'm just curious what it would take for that clause to apply.)

comment by CronoDAS · 2022-06-06T23:18:39.272Z · LW(p) · GW(p)

Is there a plausible pivotal act that doesn't amount to some variant of "cripple human civilization so that it can't make or use computers until it recovers"?

Replies from: RobbBB, Kenny

↑ comment by Rob Bensinger (RobbBB) · 2022-06-10T11:52:29.754Z · LW(p) · GW(p)

Use AGI to build fast-running high-fidelity human whole-brain emulations. Then run thousands of very-fast-thinking copies of your best thinkers. Seems to me this plausibly makes it realistic to keep tabs on the world's AGI progress, and locally intervene before anything dangerous happens, in a more surgical way rather than via mass property destruction of any sort.

Replies from: Evan R. Murphy

↑ comment by Evan R. Murphy · 2022-06-23T07:12:42.183Z · LW(p) · GW(p)

This is a much less destructive-sounding pivotal act proposal than "melt all GPUs". I'm trying to figure out why and if it's actually less destructive...

Does it sound less destructive because it's just hiding the destructive details behind the decisions of these best thinker simulations? After their, say, 100 subjective years of deliberation do the thinkers just end up with a detailed proposal for how to "melt all GPUs"?

I think I give our best thinkers more credit than that. I wouldn't presume to know in advance the plan that many copies of our best thinkers would come up with after having a long time to deliberate. But I have confidence or at least a hope that they'd come up with something less destructive and at least as effective as "melt all GPUs".

So this pivotal act proposal puts some distance between us and the messier details of the act. But it does it in a reasonable way, not just by hand-waving or forgetting those details, but instead by deferring them to people who we would most trust to handle them well (many of the best thinkers in the world, overclocked!)

This is an intriguing proposal and because it certainly sounds so much less destructive and horrifying than "melt all GPUs", I will very much prefer to use and see this used as the go-to theoretical example of a pivotal act until I hear of or think of a better one.

↑ comment by Kenny · 2022-06-09T16:44:53.714Z · LW(p) · GW(p)

I can't think of any!

There are maybe 'plausibly plausible' (or 'possibly plausible') acts that a more 'adequate' global civilization might be able to take. But it seems like that hypothetical adequate civilization would have already performed such a pivotal act and the world would look very different than it does now.

It's ('strictly') possible that such a pivotal act has already been performed and that, e.g. the computer hardware currently available isn't sufficient to build an AGI. It just seems like there's VERY little evidence that that's the case.

comment by Adam Zerner (adamzerner) · 2022-06-08T17:54:36.053Z · LW(p) · GW(p)

Wow, 510 karma and counting. This post currently has the 14th most karma all time and most for this year. Makes me think back to this excerpt from Explainers Shoot High. Aim Low! [LW · GW].

A few years ago, an eminent scientist once told me how he'd written an explanation of his field aimed at a much lower technical level than usual. He had thought it would be useful to academics outside the field, or even reporters. This ended up being one of his most popular papers within his field, cited more often than anything else he'd written.

The lesson was not that his fellow scientists were stupid, but that we tend to enormously underestimate the effort required to properly explain things.

comment by Daphne_W · 2022-06-07T12:35:57.666Z · LW(p) · GW(p)

I'm confused about A6, from which I get "Yudkowsky is aiming for a pivotal act to prevent the formation of unaligned AGI that's outside the Overton Window and on the order of burning all GPUs". This seems counter to the notion in Q4 of Death with Dignity [LW · GW] where Yudkowsky says

It's relatively safe to be around an Eliezer Yudkowsky while the world is ending, because he's not going to do anything extreme and unethical unless it would really actually save the world in real life, and there are no extreme unethical actions that would really actually save the world the way these things play out in real life, and he knows that. He knows that the next stupid sacrifice-of-ethics proposed won't work to save the world either, actually in real life.

I would estimate that burning all AGI-capable compute would disrupt every factor of the global economy for years and cause tens of millions of deaths^[1], and that's what Yudkowsky considers the more mentionable example. Do the other options outside the Overton Window somehow not qualify as unsafe/extreme unethical actions (by the standards of the audience of Death with Dignity)? Has Yudkowsky changed his mind on what options would actually save the world? Does Yudkowsky think that the chances of finding a pivotal act that would significantly delay unsafe AGI are so slim that he's safe to be around despite him being unsafe in the hypothetical that such a pivotal act is achievable? I'm confused.

Also, I'm not sure how much overlap there is between people who do Bayesian updates and people for whom whatever Yudkowsky is thinking of is outside the Overton Window, but in general, if someone says that what they actually want is outside your Overton Window, I see only two directions to update in: either shift your Overton Window to include their intent, or shift your opinion of them to outside your Overton Window. If the first option isn't going to happen, as Yudkowsky says (for public discussion on lesswrong at least), that leaves the second.

[1] Compare modern estimates of the damage that would be caused by a solar flare equivalent to the Carrington Event. Factories, food supply, long-distance communication, digital currency - many critical services nowadays are dependent on compute, and that portion will only increase by the time you would actually pull the trigger.

Replies from: Eliezer_Yudkowsky, Vaniver

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-08T04:53:44.225Z · LW(p) · GW(p)

Interventions on the order of burning all GPUs in clusters larger than 4 and preventing any new clusters from being made, including the reaction of existing political entities to that event and the many interest groups who would try to shut you down and build new GPU factories or clusters hidden from the means you'd used to burn them, would in fact really actually save the world for an extended period of time and imply a drastically different gameboard offering new hopes and options.

What makes me safe to be around is that I know that various forms of angrily acting out violently would not, in fact, accomplish anything like this. I would only do something hugely awful that would actually save the world. No such option will be on the table, and I, the original person who wasn't an idiot optimist, will not overestimate and pretend that something will save the world when it obviously-to-me won't. So I'm a relatively safe person to be around, because I am not the cartoon supervillain talking about necessary sacrifices to achieve greater goods when everybody in the audience knows that the greater good won't be achieved; I am the person in the audience rolling their eyes at the cartoon supervillain.

Replies from: Daphne_W

↑ comment by Daphne_W · 2022-06-08T10:45:41.910Z · LW(p) · GW(p)

Interventions on the order of burning all GPUs in clusters larger than 4 and preventing any new clusters from being made, including the reaction of existing political entities to that event and the many interest groups who would try to shut you down and build new GPU factories or clusters hidden from the means you'd used to burn them, would in fact really actually save the world for an extended period of time and imply a drastically different gameboard offering new hopes and options.

I suppose 'on the order of' is the operative phrase here, but that specific scenario seems like it would be extremely difficult to specify an AGI for without disastrous side-effects and like it still wouldn't be enough. Other, less efficient or less well developed forms of compute exist, and preventing humans from organizing to find a way around the GPU-burner's blacklist for unaligned AGI research while differentially allowing them to find a way to build friendly AGI seems like it would require a lot of psychological/political finesse on the GPU-burner's part. It's on the level of Ozymandias from Watchmen, but it's cartoonish supervillainy nonetheless.

I guess my main issue is a matter of trust. You can say the right words, as all the best supervillains do, promising that the appropriate cautions are taken above our clearance level. You've pointed out plenty of mistakes you could be making, and the ease with which one can make mistakes in situations such as yours, but acknowledging potential errors doesn't prevent you from making them. I don't expect you to have many people you would trust with AGI, and I expect that circle would shrink further if those people said they would use the AGI to do awful things iff it would actually save the world [in their best judgment]. I currently have no-one in the second circle.

If you've got a better procedure for people to learn to trust you, go ahead, but is there something like an audit you've participated in/would be willing to participate in? Any references regarding your upstanding moral reasoning in high-stakes situations that have been resolved? Checks and balances in case of your hardware being corrupted?

You may be the audience member rolling their eyes at the cartoon supervillain, but I want to be the audience member rolling their eyes at HJPEV when he has a conversation with Quirrell where he doesn't realise that Quirrell is evil.

↑ comment by Vaniver · 2022-06-07T16:29:28.946Z · LW(p) · GW(p)

I'm confused.

It definitely is the case that a pivotal act that isn't "disruptive" isn't a pivotal act. But I think not all disruptive acts have a significant cost in human lives.

To continue with the 'burn all GPUs' example, note that while some industries are heavily dependent on GPUs, most industries are instead heavily dependent on CPUs. The hospital's power will still be on if all GPUs melt, and probably their monitors will still work (if the nanobots can somehow distinguish between standalone GPUs and ones embedded into motherboards). Transportation networks will probably still function, and so on. Cryptocurrencies, entertainment industries, and lots of AI applications will be significantly impacted, but this seems recoverable.

But I do think Eliezer's main claim is: some people will lash out in desperation when cornered ("Well, maybe starting WWIII will help with AI risk!"), and Eliezer is not one of those people. So if he makes a call of the form "disruption that causes 10M deaths", it's because the other option looked actually worse, and so this is 'safer'. [If you're one of the people tied up on the trolley tracks, you want the person at the lever to switch it!]

Replies from: Daphne_W

↑ comment by Daphne_W · 2022-06-07T21:14:55.996Z · LW(p) · GW(p)

AI can run on CPUs (with a certain inefficiency factor), so only burning all GPUs doesn't seem like it would be sufficient. As for disruptive acts that are less deadly, it would be nice to have some examples but Eliezer says they're too far out of the Overton Window to mention.

If what you're saying about Eliezer's claim is accurate, it does seem disingenuous to frame "The only worlds where humanity survives are ones where people like me do something extreme and unethical" as "I won't do anything extreme and unethical [because humanity is doomed anyway]". It makes Eliezer dangerous to be around if he's mistaken, and if you're significantly less pessimistic than he is (if you assign >10^-6 probability to humanity surviving), he's mistaken in most of the worlds where humanity survives. Which are the worlds that matter the most.

And yeah, it's nice that Eliezer claims that Eliezer can violate ethical injunctions because he's smart enough, after repeatedly stating that people who violate ethical injunctions because they think they're smart enough are almost always wrong. I don't doubt he'll pick the option that looks actually better to him. It's just that he's only human - he's running on corrupted hardware like the rest of us.

comment by Koen.Holtman · 2022-06-06T21:04:01.613Z · LW(p) · GW(p)

Having read the original post and many of the comments made so far, I'll add an epistemological observation that I have not seen others make yet quite so forcefully. From the original post:

Here, from my perspective, are some different true things that could be said, to contradict various false things that various different people seem to believe, about why AGI would be survivable [...]

I want to highlight that many of the different 'true things' on the long numbered list in the OP are in fact purely speculative claims about the probable nature of future AGI technology, a technology nobody has seen yet.

The claimed truth of several of these 'true things' is often backed up by nothing more than Eliezer's best-guess informed-gut-feeling predictions about what future AGI must necessarily be like. These predictions often directly contradict the best-guess informed-gut-feeling predictions of others, as is admirably demonstrated in the 2021 MIRI conversations.

Some of Eliezer's best guesses also directly contradict my own best-guess informed-gut-feeling predictions. I rank the credibility of my own informed guesses far above those of Eliezer.

So overall, based on my own best guesses here, I am much more optimistic about avoiding AGI ruin than Eliezer is. I am also much less dissatisfied about how much progress has been made so far.

Replies from: handoflixue

↑ comment by handoflixue · 2022-06-07T05:55:40.402Z · LW(p) · GW(p)

I rank the credibility of my own informed guesses far above those of Eliezer.

Apologies if there is a clear answer to this, since I don't know your name and you might well be super-famous in the field: Why do you rate yourself "far above" someone who has spent decades working in this field? Appealing to experts like MIRI makes for a strong argument. Appealing to your own guesses instead seems like the sort of thought process that leads to anti-vaxxers.

Replies from: RobbBB, Koen.Holtman

↑ comment by Rob Bensinger (RobbBB) · 2022-06-07T08:43:57.117Z · LW(p) · GW(p)

I think it's a positive if alignment researchers feel like it's an allowed option to trust their own technical intuitions over the technical intuitions of this or that more-senior researcher.

Overly dismissing old-guard researchers is obviously a way the field can fail as well. But the field won't advance much at all if most people don't at least try to build their own models.

Koen also leans more on deference in his comment than I'd like, so I upvoted your 'deferential but in the opposite direction' comment as a corrective, handoflixue. :P But I think it would be a much better comment if it didn't conflate epistemic authority with "fame" (I don't think fame is at all a reliable guide to epistemic ability here), and if it didn't equate "appealing to your own guesses" with "anti-vaxxers".

Alignment is a young field; "anti-vaxxer" is a term you throw at people after vaccines have existed for 200 years, not a term you throw at the very first skeptical researchers arguing about vaccines in 1800. Even if the skeptics are obviously and decisively wrong at an early date (which indeed not-infrequently happens in science!), it's not the right way to establish the culture for those first scientific debates.

↑ comment by Koen.Holtman · 2022-06-07T12:26:47.825Z · LW(p) · GW(p)

Why do you rate yourself "far above" someone who has spent decades working in this field?

Well put, valid question. By the way, did you notice how careful I was in avoiding any direct mention of my own credentials above?

I see that Rob has already written a reply to your comments, making some of the broader points that I could have made too. So I'll cover some other things.

To answer your valid question: If you hover over my LW/AF username, you can see that I self-code as the kind of alignment researcher who is also a card-carrying member of the academic/industrial establishment. In both age and academic credentials, I am in fact a more-senior researcher than Eliezer is. So the epistemology, if you are outside of this field and want to decide which one of us is probably more right, gets rather complicated.

Though we have disagreements, I should also point out some similarities between Eliezer and me.

Like Eliezer, I spend a lot of time reflecting on the problem of crafting tools that other people might use to improve their own ability to think about alignment. Specifically, these are not tools that can be used for the problem of triangulating between self-declared experts. They are tools that can be used by people to develop their own well-founded opinions independently. You may have noticed that this is somewhat of a theme in section C of the original post above.

The tools I have crafted so far are somewhat different from those that Eliezer is most famous for. I also tend to target my tools more at the mainstream than at Rationalists and EAs reading this forum.

Like Eliezer, on some bad days I cannot escape having certain feelings of disappointment about how well this entire global tool crafting project has been going so far. Eliezer seems to be having quite a lot of these bad days recently, which makes me feel sorry, but there you go.

Replies from: handoflixue

↑ comment by handoflixue · 2022-06-08T00:59:47.219Z · LW(p) · GW(p)

Thanks for taking my question seriously - I am still a bit confused why you would have been so careful to avoid mentioning your credentials up front, though, given that they're fairly relevant to whether I should take your opinion seriously.

Also, neat, I had not realized hovering over a username gave so much information!

Replies from: Koen.Holtman

↑ comment by Koen.Holtman · 2022-06-08T22:01:14.858Z · LW(p) · GW(p)

You are welcome. I carefully avoided mentioning my credentials as a rhetorical device.

I rank the credibility of my own informed guesses far above those of Eliezer.

This is to highlight the essence of how many of the arguments on this site work.

comment by simonsimonsimon · 2022-06-06T19:55:48.984Z · LW(p) · GW(p)

We need to align the performance of some large task, a 'pivotal act' that prevents other people from building an unaligned AGI that destroys the world.

What is the argument for why it's not worth pursuing a pivotal act without our own AGI? I certainly would not say it was likely that current human actors could pull it off, but if we are in a "dying with more dignity" context anyway, it doesn't seem like the odds are zero.

My idea, which I'll include more as a demonstration of what I mean than a real proposal, would be to develop a "cause area" for influencing military/political institutions as quickly as possible. Yes, I know this sounds too slow and too hard and a mismatch with the community's skills, but consider:

Militaries/governments are "where the money is": they probably do have the coercive power necessary to perform a pivotal act, or at least buy a lot of time. If the PRC is able to completely lock down its giant sophisticated cities, it could probably halt domestic AI research. The West hasn't really tried to do extreme control in a while, for various good reasons, but (just e.g.) the WW2 war economy was awfully tightly managed. We are good at slowing stuff down with crazy red tape. Also, there are a lot of nukes.
- Yes, there are lots of reasons this is hard, but remember we're looking for hail marys [LW · GW].
"The other guy might develop massive offensive capability soon" is an extremely compelling narrative to normal people, and the culture definitely possesses the meme of "mad scientists have a crazy new weapon". Convincing some generals that we need to shut down TSMC or else China will build terminators might be easier than convincing ML researchers they are doing evil.
- Sure, if this narrative became super salient, it could possibly lead to a quicker technological arms-race dynamic, but there are other possible dynamics it might lead to, such as (just e.g.) urgency on non-proliferation, or urgency for preemptive military victory using current (non-AGI) tools.
- I know attempts to get normal people to agree with EA-type thinking have been pretty dispiriting, but I'm not sure how much real energy has gone into making a truly adequate effort, and I think the "military threat" angle might be a lot catchier to the right folks. The "they'll take our jobs" narrative also has a lot of appeal.
- Importantly, even if convincing people is impossible now, we could prepare for a future regime where we've gotten lucky and some giant smoke alarm event has happened without killing us. You can even imagine both white-hat and black-hat ways of making such an alarm more likely, which might be very high value.
- Again, remember we're looking for hail marys [LW · GW]. When all you have is an out-of-the-money call option, more volatility is good.
The rationalist community's libertarian bent might create a blind spot here. Yes, governments and militaries are incredibly dumb, but they do occasionally muddle their way into giant intentional actions.
Also with respect to biases, it smells a little bit like we are looking for an "AI-shaped key to unlock an AI-shaped lock", so we should make sure we are putting enough effort into non-AI pivotal actions even if my proposal here is wrong.

comment by DaemonicSigil · 2022-06-06T02:27:04.441Z · LW(p) · GW(p)

Thanks for writing this. I agree with all of these except for #30, since it seems like checking the output of the AI for correctness/safety should be possible even if the AI is smarter than us, just like checking a mathematical proof can be much easier than coming up with the proof in the first place. It would take a lot of competence, and a dedicated team of computer security / program correctness geniuses, but definitely seems within human abilities. (Obviously the AI would have to be below the level of capability where it can just write down an argument that convinces the proof checkers to let it out of the box. This is a sense in which having the AI produce uncommented machine code may actually be safer than letting it write English at us.)

Replies from: ramana-kumar

↑ comment by Ramana Kumar (ramana-kumar) · 2022-06-06T10:59:13.515Z · LW(p) · GW(p)

We might summarise this counterargument to #30 as "verification is easier than generation". The idea is that the AI comes up with a plan (+ explanation of how it works etc.) that the human systems could not have generated themselves, but that human systems can understand and check in retrospect.

Counterclaim to "verification is easier than generation" is that any pivotal act will involve plans that human systems cannot predict the effects of just by looking at the plan. What about the explanation, though? I think the problem there may be more that we don't know how to get the AI to produce a helpful and accurate explanation as opposed to a bogus misleading but plausible-sounding one, not that no helpful explanation exists.

Replies from: Eliezer_Yudkowsky

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-06T19:36:53.463Z · LW(p) · GW(p)

This seems to me like a case of the imaginary hypothetical "weak pivotal act" that nobody can ever produce. If you have a pivotal act you can do via following some procedure that only the AI was smart enough to generate, yet humans are smart enough to verify and smart enough to not be reliably fooled about, NAME THAT ACTUAL WEAK PIVOTAL ACT.

Replies from: DaemonicSigil, tor-okland-barstad, sharmake-farah

↑ comment by DaemonicSigil · 2022-06-11T21:31:58.324Z · LW(p) · GW(p)

Okay, I will try to name a strong-but-checkable pivotal act.

(Having a strong-but-checkable pivotal act doesn't necessarily translate into having a weak pivotal act. Checkability allows us to tell the difference between a good plan and a trapped plan with high probability, but the AI has no reason to give us a good plan. It will just produce output like "I have insufficient computing power to solve this problem" regardless of whether that's actually true. If we're unusually successful at convincing the AI our checking process is bad when it's actually good, then that AI may give us a trapped plan, which we can then determine is trapped. Of course, one should not risk executing a trapped plan, even if one thinks one has identified and removed all the traps. So even if #30 is false, we are still default-doomed. (I'm not fully certain that we couldn't create some kind of satisficing AI that gets reward 1 if it generates a safe plan, reward 0 if its output is neither helpful nor dangerous, and reward -1 if it generates a trapped plan that gets caught by our checking process. The AI may then decide that it has a higher chance of success if it just submits a safe plan. But I don't know how one would train such a satisficer with current deep learning techniques.))
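The parenthetical about a satisficing AI can be made concrete with a small expected-reward comparison. This is only a sketch under stated assumptions: the comment specifies rewards of 1, 0, and -1, but says nothing about what the AI gets when a trapped plan is *not* caught, so `reward_if_uncaught` and `p_caught` are hypothetical parameters introduced here for illustration.

```python
def expected_reward_trapped(p_caught, reward_if_uncaught):
    """Expected reward for submitting a trapped plan.

    p_caught: assumed probability that the checking process catches the trap
              (reward -1, as in the comment above).
    reward_if_uncaught: assumed payoff to the AI if the trap slips through;
              this value is NOT given in the comment and is purely illustrative.
    """
    return p_caught * (-1.0) + (1.0 - p_caught) * reward_if_uncaught


def prefers_safe_plan(p_caught, reward_if_uncaught, reward_safe=1.0):
    """True if a safe plan (reward 1) beats the trapped plan in expectation."""
    return reward_safe > expected_reward_trapped(p_caught, reward_if_uncaught)


if __name__ == "__main__":
    # If reward is capped at 1 (a true satisficer), any nonzero chance of
    # being caught makes the safe plan strictly better.
    print(prefers_safe_plan(p_caught=0.5, reward_if_uncaught=1.0))
    # If escaping is worth far more than 1 to the AI, a much stronger
    # checking process is needed before the safe plan wins.
    print(prefers_safe_plan(p_caught=0.5, reward_if_uncaught=10.0))
    print(prefers_safe_plan(p_caught=0.9, reward_if_uncaught=10.0))
```

Under these assumptions, the scheme depends on the AI's payoff for an uncaught trap really being bounded, which is the part the comment says nobody knows how to train for.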

The premise of this pivotal act is that even mere humans would be capable of designing very complex nanomachines, if only they could see the atoms in front of them, and observe the dynamics as they bounce and move around on various timescales. Thus, the one and only output of the AI will be the code for fast and accurate simulation of atomic-level physics. Being able to get quick feedback on what would happen if you designed such-and-such a thing not only helps with being able to check and iterate designs quickly, it means that you can actually do lots of quick experiments to help you intuitively grok the dynamics of how atoms move and bond.

This is kind of a long comment, and I predict the next few paragraphs will be review for many LW readers, so feel free to skip to the paragraph starting with "SO HOW ARE YOU ACTUALLY GOING TO CHECK THE MOLECULAR DYNAMICS SIMULATION CODE?".

Picture a team of nano-engineers designing some kind of large and complicated nanomachine. Each engineer wears a VR headset so they can view the atomic structure they're working on in 3d, and each has VR gloves with touch-feedback so they can manipulate the atoms around. The engineers all work on various components of the larger nanomachine that must be built. Often there are standardized interfaces for transfer of information, energy, or charge. Other times, interfaces must be custom-designed for a particular purpose. Each component might connect to several of these interfaces, as well as being physically connected to the larger structure.

The hardest part of nanomachines is probably going to be the process of actually manufacturing them. The easiest route from current technology is to take advantage of our existing DNA synthesis tech to program ribosomes to produce the machines we want. The first stage of machines would be made from amino acids, but from there we could build machines that built other machines and bootstrap our way up to being able to build just about anything. This bootstrapping process would be more difficult than the mere design process for the final machine, and the first stage where we have to build things entirely out of amino acids sounds particularly brutal. But just as people could write the first compilers in machine code, it should be possible to figure out how to get things done even in the insanely constrained domain of amino acids. And this step only has to be done once before a whole world opens up.

The first obvious question is "what about quantum mechanics?". The answer is that we don't care too much about it. It makes computing the dynamics harder of course, but most systems interact frequently with their environment, and most nanomachines will probably interact with their environment frequently enough that they behave nearly classically. QM is important for determining what local chemical reactions take place, but there's no long-range entanglement to worry about. That's also helpful for allowing the human engineers to get a handle on the dynamics. The main effect of the frequent interaction with the environment is that the dynamics becomes somewhat stochastic. Often you will see engineers run the same simulation several times, so they can understand the distribution of possible outcomes. As the final design is approached, the engineers run it through thousands of simulations so that even rare thermal fluctuations are accounted for.
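The habit of rerunning a stochastic simulation to see the distribution of outcomes can be illustrated with a toy sketch. Everything here is an assumption for illustration: `simulate_once` is a hypothetical stand-in (a one-dimensional random walk driven by Gaussian "thermal kicks"), not molecular dynamics code, and the failure threshold is arbitrary.

```python
import random
import statistics


def simulate_once(rng, steps=400, kick=0.05):
    """Hypothetical stand-in for one run: a particle receiving random thermal kicks."""
    x = 0.0
    for _ in range(steps):
        x += rng.gauss(0.0, kick)
    return x


def outcome_distribution(n_runs, seed=0, failure_threshold=3.0):
    """Repeat the simulation n_runs times and summarize the spread of final positions."""
    rng = random.Random(seed)
    finals = [simulate_once(rng) for _ in range(n_runs)]
    # Count runs where a rare fluctuation pushed the particle past the threshold.
    failures = sum(1 for x in finals if abs(x) > failure_threshold)
    return {
        "mean": statistics.fmean(finals),
        "stdev": statistics.stdev(finals),
        "failure_rate": failures / n_runs,
    }


if __name__ == "__main__":
    # More runs are needed to resolve rarer fluctuations.
    for n in (100, 2000):
        print(n, outcome_distribution(n))
```

The point of the sketch is only that rare events need many runs to show up at all, which is why the final design would be pushed through thousands of simulations rather than a handful.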

The zeroth obvious question is "what will you build with it?". The answer is that a variety of things could be built, and there might be many side-projects produced to raise funds for the main project, but the main project is a scanner that can read a cryonics patient's brain one layer at a time in sufficient detail to faithfully emulate them. This is a pretty standard pivotal act, and has been written about elsewhere, but the main idea is that we can freeze, scan, and maybe copy a bunch of alignment researchers and other smart folks to solve alignment at a much faster speed than the outside world is working on their unaligned AIs. Or, the researchers might be able to foom while maintaining their values by making careful changes to their own minds. Besides the scanner itself, this project would probably need to design extremely fast special-purpose computing hardware for running the ems, along with various kinds of tiny probes and other experimental equipment to collect data on how a scan translates to brain-function. (An ordinary dumb neural network should be sufficient for this kind of inference problem.)

SO HOW ARE YOU ACTUALLY GOING TO CHECK THE MOLECULAR DYNAMICS SIMULATION CODE? There are two possible attack modes. One is to exploit any possible vulnerabilities in the hardware running the code to achieve unwanted side effects, and use those side effects to escape from containment. The second is to somehow make the output of the simulation subtly flawed, such that the humans, working off of flawed data, accidentally design a nanomachine that lets the AI out of the box instead of doing what we were designing that machine to do. Obviously the first attack mode is much more desirable for the AI than the second. Actually, the second attack mode may be so hard to pull off that we don't need to check for it at all, but not checking for things when you have the capability to check for them is silly.

Checks for false physics that might trick the humans into building trapped nanomachines:

Compare with the real dynamics, which we know, but is slower for us to simulate. In real physics, we can break down the world into little cubes and the dynamics are symmetrical around translating those cubes in space and time. So if we request a simulation method that has those same symmetries, we can check the simulation on small length and time scales, which are feasible for us to simulate. This doesn't guarantee that a larger simulation would be correct, since small errors can multiply out of control as the simulation size gets larger, but it does mean that the dynamics of the large simulation have to bear a pretty-close resemblance to real physics.
We can also request that the code provably obeys conservation of momentum, energy, charge, etc.
Use diff fuzzing techniques: This is where you use a fuzzer (implemented with techniques like GOFAI and dumb neural networks) to try and find inputs to the simulation such that the output of the AI's code differs from the output of our slow-but-known-correct code. The stronger the optimizer we use here, the harder it is for the AI to avoid detection.
Request that the AI give us error bounds on the simulation, along with machine-checkable proofs that they hold.
Side channel attack and its mitigation: The AI can steer us away from certain designs and towards others by making the simulation take different amounts of time depending on the configuration of atoms we submit. Therefore we request code that takes a constant amount of time to run, plus proof that it does so. In order to make this easier, the code will be run on a GPU architecture with cache levels explicitly managed in the code, and where none of the cores have branch prediction.
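The differential fuzzing check in the list above can be sketched in miniature. This is a toy under loud assumptions: the "physics" is a single unit-mass spring, `honest_fast_step` and `flawed_fast_step` are hypothetical stand-ins for untrusted code, and plain random sampling is swapped in for the stronger optimizer the comment describes, so it is a much weaker search than the real proposal.

```python
import random


def reference_step(x, v, dt=0.01, k=1.0):
    """Slow-but-trusted stand-in: one velocity-Verlet step for a unit-mass spring."""
    a = -k * x
    x_new = x + v * dt + 0.5 * a * dt * dt
    a_new = -k * x_new
    v_new = v + 0.5 * (a + a_new) * dt
    return x_new, v_new


def honest_fast_step(x, v, dt=0.01, k=1.0):
    """Hypothetical untrusted code that happens to be correct."""
    return reference_step(x, v, dt, k)


def flawed_fast_step(x, v, dt=0.01, k=1.0):
    """Hypothetical untrusted code that deviates in one region of state space."""
    x_new, v_new = reference_step(x, v, dt, k)
    if 2.0 < x < 2.5 and v > 0:
        v_new += 1e-3  # small injected error, for illustration only
    return x_new, v_new


def differential_fuzz(candidate, reference, trials=10_000, tol=1e-9, seed=0):
    """Return the first random input where candidate and reference disagree, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, v = rng.uniform(-5, 5), rng.uniform(-5, 5)
        got, want = candidate(x, v), reference(x, v)
        if any(abs(g - w) > tol for g, w in zip(got, want)):
            return (x, v)
    return None


if __name__ == "__main__":
    print("honest:", differential_fuzz(honest_fast_step, reference_step))
    print("flawed:", differential_fuzz(flawed_fast_step, reference_step))
```

Random sampling only finds the flaw here because the bad region covers a few percent of the input space. A discrepancy confined to a tiny or adversarially chosen region is exactly the case where a stronger search, or a machine-checked proof, would be needed.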

The checks for the first attack method are pretty much just normal computer security, but at extremely high stakes. Just so we're clear on what the picture is like here: Generally the AI is not going to know in perfect detail what the hardware is like that its simulation code will be running on, though we at least have to give it the instruction set, or some high-level language that compiles to that instruction set. (Probably the high level language is the better option. That way we can design the language so that properties like constant run-time and space-time symmetry are guaranteed-true of all valid programs in that language.) So the best technique for the AI will be to submit simulation code that has a hidden optimizer embedded into it. This optimizer then searches for possible escape methods while it runs. Even though different runs of the code don't share memory, they'll have different inputs, which the optimizer can use as a random seed to determine which part of the escape-strategy search space it should be trying. If there are any holes at all in our security, the AI will be able to find and exploit them. All those engineers in VR headsets running simulations? In each case, the computer running the simulation has a hostile AI on it, trying everything it can think of to get out. In general, every aspect of the system must be proved correct, from the hardware, to the high-level language we give to the AI. In addition, everything must be thoroughly unit tested, integration tested, fuzz tested, and given to the best security researchers who will try to find side-channel attacks, etc. Other people will write a proof that's as close as humanly possible to an actual mathematical proof that no side channel attacks are possible, and then everyone will try everything they can to poke holes in that proof.

Okay, yeah, we get it, you're going to try really really hard. Don't you know that the universe doesn't owe you success just because of how much you tried? No matter what security you put in place as a mere human, the AI is going to get around it just because it's so much smarter.

Even a superintelligent AI would not be able to break a one-time pad. Depending on the details of computational complexity, it probably wouldn't be able to crack our best symmetric ciphers today either. For a superintelligent AI to succeed at a task, that task must be actually possible. If we can actually design a system with no holes, which takes a large but finite amount of work and intelligence, then a superintelligence will not be able to break it, no matter how much smarter than us it is.
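The one-time pad claim rests on a simple property that a few lines can demonstrate: for any ciphertext, every plaintext of the same length is explained by some key, so the ciphertext alone carries no information about which message was sent. The sketch below assumes the usual conditions (a truly random key, as long as the message, used once); the messages and helper names are illustrative.

```python
import secrets


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("one-time pad requires key and message of equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt(message: bytes):
    """Encrypt with a fresh random key; returns (key, ciphertext)."""
    key = secrets.token_bytes(len(message))
    return key, xor_bytes(message, key)


def key_explaining(ciphertext: bytes, candidate_plaintext: bytes) -> bytes:
    """The key under which `ciphertext` would decrypt to `candidate_plaintext`."""
    return xor_bytes(ciphertext, candidate_plaintext)


if __name__ == "__main__":
    real = b"ATTACK AT DAWN"
    decoy = b"RETREAT AT TEN"  # same length as the real message
    key, ciphertext = encrypt(real)
    print(xor_bytes(ciphertext, key))        # recovers the real message
    decoy_key = key_explaining(ciphertext, decoy)
    print(xor_bytes(ciphertext, decoy_key))  # an equally valid "decryption"
```

Because every candidate plaintext has a matching key and all keys are equally likely, no amount of intelligence applied to the ciphertext can single out the real message. The guarantee says nothing about key handling or the surrounding system, which is where the rest of the comment's security effort would have to go.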

Replies from: Eliezer_Yudkowsky, Edouard Harris

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-12T03:23:42.331Z · LW(p) · GW(p)

Well, I had to think about this for longer than five seconds, so that's already a huge victory.

If I try to compress your idea down to a few sentences:

The humans ask the AI to produce design tools, rather than designs, such that there's a bunch of human cognition that goes into picking out the particular atomic arrangements or synthesis pathways; and we can piecewise verify that the tool is making accurate predictions; and the tool is powerful enough that we can build molecular nanotech and an uploader by using the tool for an amount of time too short for Facebook to catch up and destroy the world. The AI that does this is purportedly sufficiently good at meta-engineering to build the tool, but not good enough at larger strategy that it can hack its way through the humans using just the code of the tool. The way in which this attacks a central difficulty is by making it harder for the AI to just build unhelpful nanotech using the capabilities that humans use to produce helpful nanotech.

Sound about right?

Replies from: DaemonicSigil, tor-okland-barstad, None

↑ comment by DaemonicSigil · 2022-06-12T05:17:28.732Z · LW(p) · GW(p)

Yes, sounds right to me. It's also true that one of the big unproven assumptions here is that we could create an AI strong enough to build such a tool, but too weak to hack humans. I find it plausible, personally, but I don't yet have an easy-to-communicate argument for it.

Replies from: ken-kahn

↑ comment by Ken Kahn (ken-kahn) · 2022-08-10T17:40:02.371Z · LW(p) · GW(p)

Why can't a narrow AI (maybe like Drexler's proposal) create the tool safely?

Replies from: DaemonicSigil

↑ comment by DaemonicSigil · 2022-08-23T00:10:39.151Z · LW(p) · GW(p)

I don't know of a reason we couldn't do this with a narrow AI. I have no idea how, but it's possible in principle so far as I know. If anyone can figure out how, they could plausibly execute the pivotal act described above, which would be a very good thing for humanity's chances of survival.

EDIT: Needless to say, but I'll say it anyway: Doing this via narrow AI is vastly preferable to using a general AI. It's both much less risky and means you don't have to expend an insane amount of effort on checking.

↑ comment by Tor Økland Barstad (tor-okland-barstad) · 2022-07-10T12:39:32.642Z · LW(p) · GW(p)

The humans ask the AI to produce design tools, rather than designs (...) we can piecewise verify that the tool is making accurate predictions (...) The way in which this attacks a central difficulty is by making it harder for the AI to just build unhelpful nanotech

I think this is a good way to put things, and it's a concept that can be made more general and built upon.

Like, we can also have AIs produce:

Tools that make other tools
Tools that help to verify other tools
Tools that look for problems with other tools (in ways that don't guarantee finding all problems, but can help find many)
Tools that help approximate brain emulations (or get us part of the way there), or predict what a human would say when responding to questions in some restricted domain
Etc, etc

Maybe you already have thought through such strategies very extensively, but AFAIK you don't make that clear in any of your writings, and it's not a trivial amount of inferential distance that is required to realize the full power of techniques like these.

I have written more about this concept in this post [LW · GW] in this series [? · GW]. I'm not sure whether or not any of the concepts/ideas in the series are new, but it seems to me that several of them at the very least are under-discussed.

↑ comment by [deleted] · 2022-06-12T04:37:09.812Z · LW(p) · GW(p)

Replies from: Eliezer_Yudkowsky

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-12T05:02:53.807Z · LW(p) · GW(p)

Useless; none of these abstractions help find an answer.

Replies from: None

↑ comment by [deleted] · 2022-06-12T10:45:59.086Z · LW(p) · GW(p)

Replies from: trevor-cappallo

↑ comment by Trevor Cappallo (trevor-cappallo) · 2022-06-20T15:58:39.185Z · LW(p) · GW(p)

From what I know of security, any system requiring secrecy is already implicitly flawed.

(Naturally, if this doesn't apply and you backchanneled your idea for some legitimate meta-reason, I withdraw my objection.)

Replies from: benjamincosman

↑ comment by benjamincosman · 2022-10-08T02:23:13.106Z · LW(p) · GW(p)

I think secrecy is rarely a long-term solution because it's fragile, but it can definitely have short-term uses? For example, I'm sure that some insights into AI have the capacity to advance both alignment and capabilities; if you have such an insight then you might want to share it secretly with alignment researchers while avoiding sharing it publicly because you'd rather Facebook AI not enhance its capabilities. And so the secrecy doesn't have to be a permanent load-bearing part of a system; instead it's just that every day the secrecy holds up is one more day you get to pull ahead of Facebook.

↑ comment by Edouard Harris · 2022-06-15T14:56:42.118Z · LW(p) · GW(p)

Interesting. The specific idea you're proposing here may or may not be workable, but it's an intriguing example of a more general strategy that I've previously tried to articulate [AF(p) · GW(p)] in another context. The idea is that it may be viable to use an AI to create a "platform" that accelerates human progress in an area of interest to existential safety, as opposed to using an AI to directly solve the problem or perform the action.

Essentially:

A "platform" for work in domain X is something that removes key constraints that would otherwise have consumed human time and effort when working in X. This allows humans to explore solutions in X they wouldn't have previously — whether because they'd considered and rejected those solution paths, or because they'd subconsciously trained themselves not to look in places where the initial effort barrier was too high. Thus, developing an excellent platform for X allows humans to accelerate progress in domain X relative to other domains, ceteris paribus. (Every successful platform company does this. e.g., Shopify, Amazon, etc., make valuable businesses possible that wouldn't otherwise exist.)
For certain carefully selected domains X, a platform for X may plausibly be relatively easier to secure & validate than an agent that's targeted at some specific task x ∈ X would be. (Not easy; easier.) It's less risky to validate the outputs of a platform and leave the really dangerous last-mile stuff to humans, than it would be to give an end-to-end trained AI agent a pivotal command in the real world (i.e., "melt all GPUs") that necessarily takes the whole system far outside its training distribution. Fundamentally, the bet is that if humans are the ones doing the out-of-distribution part of the work, then the output that comes out the other end is less likely to have been adversarially selected against us.

(Note that platforms are tools, and tools want to be agents, so a strategy like this is unlikely to arise along the "natural" path of capabilities progress other than transiently.)

There are some obvious problems with this strategy. One is that point 1 above is no help if you can't tell which of the solutions the humans come up with are good, and which are bad. So the approach can only work on problems that humans would otherwise have been smart enough to solve eventually, given enough time to do so (as you already pointed out in your example). If AI alignment is such a problem, then it could be a viable candidate for such an approach. Ditto for a pivotal act.

Another obvious problem is that capabilities research might benefit from the same platforms that alignment research can. So actually implementing this in the real world might just accelerate the timeline for everything, leaving us worse off. (Absent an intervention at some higher level of coordination.)

A third concern is that point 2 above could be flat-out wrong in practice. Asking an AI to build a platform means asking for generalization, even if it is just "generalization within X", and that's playing a lethally dangerous game. In fact, it might well be lethal for any useful X, though that isn't currently obvious to me. e.g., AlphaFold2 is a primitive example of a platform that's useful and non-dangerous, though it's not useful enough for this.

On top of all that, there are all the steganographic considerations — AI embedding dangerous things in the tool itself, etc. — that you pointed out in your example.

But this strategy still seems like it could bring us closer to the Pareto frontier for critical domains (alignment problem, pivotal act), than it would be to directly train an AI to do the dangerous action.

↑ comment by Tor Økland Barstad (tor-okland-barstad) · 2022-07-10T12:52:03.955Z · LW(p) · GW(p)

If you have a pivotal act you can do via following some procedure that only the AI was smart enough to generate, yet humans are smart enough to verify and smart enough to not be reliably fooled about, NAME THAT ACTUAL WEAK PIVOTAL ACT.

I don't claim to have a solution where every detail is filled in, or where I have watertight arguments showing that it's guaranteed to work (if executed faithfully).

But I think I have something, and that it could be built upon. The outlines of a potential solution.

And by "solution", I mean a pivotal strategy (consisting of many acts that could be done over a short amount of time), where we can verify output extensively and hopefully (probably?) avoid being fooled/manipulated/tricked/"hacked".

I'm writing a series about this here [? · GW]. Only 2 parts finished so far (current plan is to write 4).

↑ comment by Noosphere89 (sharmake-farah) · 2022-06-06T22:27:12.140Z · LW(p) · GW(p)

I must say, you have a very pessimistic/optimistic view of whether AI would be able to solve P=NP. I won't say you're completely wrong, as there's always a chance that P does equal NP. But I would be very careful of predicting anything based on the possibility of P=NP.

Replies from: Vaniver

↑ comment by Vaniver · 2022-06-07T04:17:42.354Z · LW(p) · GW(p)

I think P?=NP is a distraction. Like, it's not very useful to ask the question of whether Lee Sedol played a 'polynomial' number of games of Go, and AlphaGo played a 'nonpolynomial' number of games of Go. AlphaGo played more games and had a more careful and precise memory, and developed better intuitions, and could scale to more hardware better.

comment by Ben Pace (Benito) · 2024-01-26T18:56:58.286Z · LW(p) · GW(p)

+9. This is a powerful set of arguments pointing out how humanity will literally go extinct soon due to AI development (or have something similarly bad happen to us). A lot of thought and research went into an understanding of the problem that can produce this level of understanding of the problems we face, and I'm extremely glad it was written up.

comment by Christopher “Chris” Upshaw (christopher-chris-upshaw) · 2022-06-13T18:09:48.368Z · LW(p) · GW(p)

So what should I do with this information, like what other option than "nod along and go on living their lives" is there for me?

comment by eeegnu · 2022-06-12T13:44:18.680Z · LW(p) · GW(p)

They probably do not know where the real difficulties are, they probably do not understand what needs to be done, they cannot tell the difference between good and bad work, and the funders also can't tell without me standing over their shoulders evaluating everything, which I do not have the physical stamina to do.

This was the sentiment I got after applying to the LTFF with an idea. Admittedly, I couldn't really say whether my idea had been tried before, or wasn't obviously bad, but my conversation basically boiled down to whether I wanted to use this project as a way to grow myself in the field, rather than any particular merits/faults of the idea itself. My motivation was really about trying a cool idea that I genuinely believed could practically improve AI safety if successful, while ethically I couldn't commit to wanting to stay in the field even if it (likely?) failed since I like to go wherever my ideas take me.

Since it may be a while before I personally ever try out the idea, the most productive thing I can do seems to be to share it. It's essentially an attempt at a learning algorithm which 'forces' a model's weights to explain the reasoning/motivations behind its actions. The process of training looked kind of like a GAN, with the original model's inner-layer outputs being a feature vector. Some GPT-3-esque pretrained model learns to convert this feature vector into tokens (at first as random gibberish), to try and train another GPT-3-esque model to perform the actions of the original model (i.e. given the same inputs as that model, plus an explanation of what it should do from the other GPT-3-esque model.) The idea basically being that explanations that more closely correspond to the right idea will cause the 'learner' model to improve faster, which acts as feedback to the 'teacher' model to let it know that its explanations are getting better. Ideally, the end result of this procedure is a way to get the exact reasoning behind any action as readable text.

In my mind this kind of procedure avoids many of the problems in this list. You end up not relying on trusting the output of the model, and instead treat it as a static blackbox which you sample inner layer outputs from until this procedure produces an effective copy of itself with a mechanism for obtaining its inner reasoning. If it considers something nefarious to be the optimal way of doing something, but wants to hide it from its outputs, it'll still have to explain that to the 'learner' model for it to better mimic the original model. If any of this just utterly misses one of the points above, I'd like to hear it, and if anyone's interested in hearing more about this idea, I'll make a post with the rest of my original submission and other thoughts that couldn't fit into the word limit for LTFF submissions.

Replies from: habryka4, Eliezer_Yudkowsky, TekhneMakre

↑ comment by habryka (habryka4) · 2022-06-12T19:04:56.806Z · LW(p) · GW(p)

This was the sentiment I got after applying to the LTFF with an idea. Admittedly, I couldn't really say whether my idea had been tried before, or wasn't obviously bad, but my conversation basically boiled down to whether I wanted to use this project as a way to grow myself in the field, rather than any particular merits/faults of the idea itself

I evaluated this application (and we chatted briefly in a video call)! I am not like super confident in my ability to tell whether an idea is going to work, but my specific thought on your proposal was that it was very unlikely to work, but that if someone was working on it, they might learn useful things that could make them a better long-term contributor to the AI Alignment field, which is why my crux for your grant was whether you intended to stay involved in the field long-term.

Replies from: Zvi, TekhneMakre

↑ comment by Zvi · 2022-06-13T14:44:09.821Z · LW(p) · GW(p)

Appreciation for sharing the reasoning. Disagreement with the reasoning.

eeegnu is saying they go where their ideas take them and expressing ethical qualms, which both seem like excellent reasons to want someone considering AI safety work rather than reasons to drive them away from AI safety work. Their decision to continue doing AI safety work seems likely to be correlated with whether they could be productive by doing additional AI safety work - if their ideas take them elsewhere it is unlikely anything would have come of them staying.

This is especially true if one subscribes to the theory that we are worried about sign mistakes rather than 'wasting' funding - if we are funding unproven individuals in AI Safety and think that is good, then this is unusually 'safe' in the sense of it being more non-negative.

So to the extent that I was running the LTFF, I would have said yes.

Replies from: habryka4

↑ comment by habryka (habryka4) · 2022-06-13T17:12:59.732Z · LW(p) · GW(p)

I don't think the policy of "I will fund people to do work that I don't expect to be useful" is a good one, unless there is some positive externality.

It seems to me that your comment is also saying that the positive externality you are looking for is "this will make this person more productive in helping with AI Safety", or maybe "this will make them more likely to work on AI Safety". But you are separately saying that I shouldn't take their self-reported prediction that they will not continue working in AI Safety, independently of the outcome of the experiment, at face value, and instead bet that by working on this, they will change their mind, which seems weird to me.

Separately, I think there are bad cultural effects of having people work on projects that seem very unlikely to work, especially if the people working on them are self-reportedly not doing so with a long-term safety motivation, but because they found the specific idea they had appealing (or wanted to play around with technologies in the space). I think this will predictably attract a large number of grifters and generally make the field a much worse place to be.

Replies from: AllAmericanBreakfast

↑ comment by DirectedEvolution (AllAmericanBreakfast) · 2022-06-13T18:50:55.501Z · LW(p) · GW(p)

“I don't think the policy of "I will fund people to do work that I don't expect to be useful" is a good one, unless there is some positive externality.”

By this, do you mean you think it’s not good to fund work that you expect to be useful with < 50% probability, even if the downside risk is zero?

Or do you mean you don’t expect it’s useful to fund work you strongly expect to have no positive value when you also expect it to have a significant risk of causing harm?

Replies from: habryka4

↑ comment by habryka (habryka4) · 2022-06-14T02:49:48.241Z · LW(p) · GW(p)

50% is definitely not my cutoff, and I don't have any probability cutoff. More something in the expected value space. Like, if you have an idea that could be really great but only has a 1% chance of working, that still feels definitely worth funding. But if you have an idea that seems like it only improves things a bit, and has a 10% chance of working, that doesn't feel worth it.
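The expected-value comparison described here can be made concrete with a small sketch. The payoff numbers below are hypothetical, in arbitrary "impact units", and are not taken from the comment; only the probabilities (1% and 10%) come from it.

```python
def expected_value(p_success, value_if_success):
    """Expected value of a project that pays `value_if_success` with probability `p_success`."""
    return p_success * value_if_success

# Hypothetical payoffs in arbitrary "impact units" (an assumption for illustration).
long_shot = expected_value(0.01, 1000.0)  # "could be really great", 1% chance of working
modest_bet = expected_value(0.10, 5.0)    # "only improves things a bit", 10% chance

print(long_shot, modest_bet)
```

Under these made-up payoffs the 1% long shot has the higher expected value, which is the shape of the argument: the cutoff is on probability times payoff, not on probability alone.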

↑ comment by TekhneMakre · 2022-06-12T21:47:18.529Z · LW(p) · GW(p)

Upvoted for sharing information about thoughts behind grant-making. I could see reasons in some cases to not do this, but by and large more information seems better for many reasons.

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-12T17:00:17.146Z · LW(p) · GW(p)

(I wasn't able to understand the idea off this description of it.)

↑ comment by TekhneMakre · 2022-06-12T21:53:45.792Z · LW(p) · GW(p)

Why wouldn't the explainer just copy the latent vector, and the explainee just learn to do the task in the same way the original model does it? Or more generally, why does this put any pressure towards "explaining the reasons/motives" behind the original model's actions? I think you're thinking that by using a pre-trained GPT3-alike as the explainer model, you start off with something a lot more language-y, and language-y concepts are there for easy pickings for the training process to find in order to "communicate" between the original model and the explainee model. This seems not totally crazy, but

1. it seems to buy you, not anything like further explanations of reasons/motives beyond what's "already in" the original model, but rather at most a translation into the explainer's initial pre-trained internal language;

2. the explainer's initial idiolect stays unexplained / unmotivated;

3. the training procedure doesn't put pressure towards explanation, and does put pressure towards copying.

Replies from: eeegnu

↑ comment by eeegnu · 2022-06-13T05:02:41.445Z · LW(p) · GW(p)

These are great points, and ones which I did actually think about when I was brainstorming this idea (if I understand them correctly.) I intend to write out a more thorough post on this tomorrow with clear examples (I originally imagined this as extracting deeper insights into chess), but to answer these:

1. I did think about these as translators for the actions of models into natural language, though I don't get the point about extracting things beyond what's in the original model.
2. I mostly glossed over this part in the brief summary, and the motivation I had for it comes from how (unexpectedly?) it works for GANs to just start with random noise, and in the process the generator and discriminator both still improve each other.
3. My thoughts here were for the explainer model's update error vector to come from judging the learner model on new unseen tasks without the explanation (i.e. how similar its outputs are to the original model's outputs.) In this way the explainer gets little benefit from just giving the answer directly, since the learner will be tested without it, but if the explanation in any way helps the learner learn, it'll improve its performance more (this is basically what the entire idea hinges on.)
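A minimal sketch of the training signal described in the third point, under heavy assumptions: `original_model`, `Learner`, and the two explainer policies are hypothetical toy stand-ins, not anything from the actual proposal. The toy hard-codes the difference between memorizing answers and learning a rule, so it only illustrates how the reward is computed (held-out agreement with the original model, measured without explanations), not whether real training would find such explanations.

```python
def original_model(x):
    # Hypothetical stand-in for the black-box model being explained.
    return x % 2

class Learner:
    def __init__(self):
        self.memory = {}   # input -> answer memorized during training
        self.rule = None   # callable picked up from an explanation, if any

    def train(self, x, explanation):
        kind, payload = explanation
        if kind == "answer":
            self.memory[x] = payload
        elif kind == "rule":
            self.rule = payload

    def act(self, x):
        # At test time the learner gets no explanation, only the input.
        if x in self.memory:
            return self.memory[x]
        if self.rule is not None:
            return self.rule(x)
        return 0  # default guess

def agreement(learner, inputs):
    """Fraction of inputs on which the learner matches the original model."""
    return sum(learner.act(x) == original_model(x) for x in inputs) / len(inputs)

def explainer_reward(explain, train_inputs, heldout_inputs):
    """Improvement in held-out agreement caused by training with explanations."""
    learner = Learner()
    before = agreement(learner, heldout_inputs)
    for x in train_inputs:
        learner.train(x, explain(x))
    return agreement(learner, heldout_inputs) - before

# Two toy explainer policies: hand over the answer, or convey a reusable rule.
answer_only = lambda x: ("answer", original_model(x))
rule_based = lambda x: ("rule", lambda y: y % 2)

train_inputs, heldout_inputs = [0, 1, 2, 3], [10, 11, 12, 13]
r_answer = explainer_reward(answer_only, train_inputs, heldout_inputs)
r_rule = explainer_reward(rule_based, train_inputs, heldout_inputs)
print(r_answer, r_rule)
```

In this toy, handing over answers earns the explainer no reward because nothing transfers to the unseen inputs, while the rule-style explanation does. Whether a real explainer would be pushed toward the latter rather than toward some opaque encoding is the open question raised in the parent comment.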

Replies from: TekhneMakre

↑ comment by TekhneMakre · 2022-06-13T05:22:04.749Z · LW(p) · GW(p)

(I didn't understand this on one read, so I'll wait for the post to see if I have further comments. I didn't understand the analogy / extrapolation drawn in 2., and I didn't understand what scheme is happening in 3.; maybe being a little more precise and explicit about the setup would help.)

comment by Ivan Vendrov (ivan-vendrov) · 2022-06-10T02:49:35.917Z · LW(p) · GW(p)

A lot of important warnings in this post. "Capabilities generalize further than alignment once capabilities start to generalize far" was novel to me and seems very important if true.

I don't really understand the emphasis on "pivotal acts", though; there seem to be tons of weak pivotal acts, e.g. ways in which narrow AI or barely-above-human-AGI could help coordinate a global emergency regulatory response by the AI superpowers. Still might be worth focusing our effort on the future worlds where no weak pivotal acts are available, but important to point out this is not the median world.

Replies from: Eliezer_Yudkowsky

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-10T05:20:30.954Z · LW(p) · GW(p)

I could coordinate world superpowers if they wanted to coordinate and were willing to do that. It's not an intelligence problem, unless the solution is mind-control, and then that's not a weak pivotal act, it's an AGI powerful enough to kill you if misaligned.

Replies from: ivan-vendrov

↑ comment by Ivan Vendrov (ivan-vendrov) · 2022-06-10T16:17:45.667Z · LW(p) · GW(p)

Mind control is too extreme; I think world superpowers could be coordinated with levels of persuasion greater than one Eliezer but short of mind control. E.g. people are already building narrow persuasion AI capable of generating arguments that are highly persuasive for specific people. A substantially-superhuman but still narrow version of such an AI will very likely be built in the next 5 years, and could be used in a variety of weak pivotal acts (not even in a manipulative way! even a public demonstration of such an AI would make a strong case for coordination, comparable to various weapons treaties).

comment by Nathan Helm-Burger (nathan-helm-burger) · 2022-06-09T16:29:56.422Z · LW(p) · GW(p)

I largely agree with all these points, with my minor points of disagreement being insufficient to change the overall conclusions. I feel like an important point which should be emphasized more is that our best hope for saving humanity lies in maximizing the non-linearly-intelligence-weighted researcher hours invested in AGI safety research before the advent of the first dangerously powerful unaligned AGI. To maximize this key metric, we need to get more and smarter people doing this research, and we need to slow down AGI capabilities research. Insofar as AI Governance is a tactic worth pursuing, it must pursue one or both of these specific aims. Once dangerously powerful unaligned AGI has been launched, it's too late for politics or social movements or anything slower than perhaps decisive military action prepped ahead of time (e.g. the secret AGI-prevention department hitting the detonation switch for all the secret prepared explosives in all the world's data centers).

comment by Vaniver · 2022-06-06T14:27:31.400Z · LW(p) · GW(p)

I'm very glad this list is finally published; I think it's pretty great at covering the space (tho I won't be surprised if we discover a few more points), and making it so that plans can say "yeah, we're targeting a hole we see in number X."

[In particular, I think most of my current hope is targeted at 5 and 6, specifically that we need an AI to do a pivotal act at all; it seems to me like we might be able to transition from this world to a world sophisticated enough to survive on human power. But this is, uh, a pretty remote possibility and I was much happier when I was optimistic about technical alignment.]

comment by johnswentworth · 2022-06-06T00:54:41.759Z · LW(p) · GW(p)

For future John who is using the searchbox to try to find this post: this is Eliezer's List O' Doom.

Replies from: Raemon

↑ comment by Raemon · 2022-06-08T23:20:12.157Z · LW(p) · GW(p)

Are you actually gonna remember the apostrophe?

Replies from: johnswentworth

↑ comment by johnswentworth · 2022-06-08T23:28:59.871Z · LW(p) · GW(p)

I just tested that, and it works both ways.

comment by Algon · 2022-06-06T00:01:26.461Z · LW(p) · GW(p)

RE 19: Maybe rephrase "kill everyone in the world using nanotech to strike before they know they're in a battle, and have control of your reward input forever after"? This could, and I predict would, be misinterpreted as "the AI is going to kill everyone and access its own hardware to set its reward to infinity". This is a misinterpretation because you are referring to control of the "reward input" here, and your later sentences don't make sense according to this interpretation. However, given the bolded sentence and some lack of attention, plus some confusions over wireheading that are apparently fairly common, I expect a fair number of misinterpretations.

comment by David Udell · 2022-06-05T23:21:06.257Z · LW(p) · GW(p)

"Geniuses" with nice legible accomplishments in fields with tight feedback loops where it's easy to determine which results are good or bad right away, and so validate that this person is a genius, are (a) people who might not be able to do equally great work away from tight feedback loops, (b) people who chose a field where their genius would be nicely legible even if that maybe wasn't the place where humanity most needed a genius, and (c) probably don't have the mysterious gears simply because they're rare. You cannot just pay $5 million apiece to a bunch of legible geniuses from other fields and expect to get great alignment work out of them. They probably do not know where the real difficulties are, they probably do not understand what needs to be done, they cannot tell the difference between good and bad work, and the funders also can't tell without me standing over their shoulders evaluating everything, which I do not have the physical stamina to do. I concede that real high-powered talents, especially if they're still in their 20s, genuinely interested, and have done their reading, are people who, yeah, fine, have higher probabilities of making core contributions than a random bloke off the street. But I'd have more hope - not significant hope, but more hope - in separating the concerns of (a) credibly promising to pay big money retrospectively for good work to anyone who produces it, and (b) venturing prospective payments to somebody who is predicted to maybe produce good work later.

What fields would qualify as "lacking tight feedback loops"? Computer security? Why don't, e.g., credentialed math geniuses leading their subfields qualify -- because math academia is already pretty organized and inventing a new subfield of math (or whatever) is just not in the same reference class of feat as Newton inventing mathematical physics from scratch?

(c) probably still holds even if there exists a promising class of legible geniuses, though.

Replies from: lc, Ruby

↑ comment by lc · 2022-06-06T00:00:03.872Z · LW(p) · GW(p)

Most of the impressive computer security subdisciplines have very tight feedback loops and extreme legibility; that's what makes them impressive. When I think of the hardest security jobs, I think of 0-day writers, red-teamers, etc., who might have whatever Eliezer describes as security mindset but are also described extremely well by him in #40. There are people that do a really good job of protecting large companies, but they're rare, and their accomplishments are highly illegible except to a select group of guys at e.g. SpecterOps. I don't think MIRI would be able to pick them out, which is of course not their fault.

I'd say something more like hedge fund management, but unfortunately those guys tend to be paid pretty well...

↑ comment by Ruby · 2022-06-06T20:27:12.558Z · LW(p) · GW(p)

I think the intended field lacking tight feedback loops is AI alignment.

Replies from: David Udell

↑ comment by David Udell · 2022-06-06T22:55:41.679Z · LW(p) · GW(p)

(I meant: What fields can we draw legible geniuses from, into alignment.)

Replies from: Kenny

↑ comment by Kenny · 2022-06-09T16:29:54.499Z · LW(p) · GW(p)

I think people have floated the idea of recruiting 'math geniuses' specifically and EY is claiming that, even if they could be recruited and were recruited, we couldn't (reasonably) "expect to get great alignment work out of them".

comment by Raemon · 2022-06-08T19:02:11.619Z · LW(p) · GW(p)

Curated. As previously noted, I'm quite glad to have this list of reasons written up. I like Robby's comment here [LW · GW] which notes:

The point is not 'humanity needs to write a convincing-sounding essay for the thesis Safe AI Is Hard, so we can convince people'. The point is 'humanity needs to actually have a full and detailed understanding of the problem so we can do the engineering work of solving it'.

I look forward to other alignment thinkers writing up either their explicit disagreements with this list, or things that the list misses, or their own frame on the situation if they think something is off about the framing of this list.

comment by Evan R. Murphy · 2022-06-08T09:21:33.784Z · LW(p) · GW(p)

23. Corrigibility is anti-natural to consequentialist reasoning; "you can't bring the coffee if you're dead" for almost every kind of coffee. We (MIRI) tried and failed [LW · GW] to find a coherent formula for an agent that would let itself be shut down (without that agent actively trying to get shut down). Furthermore, many anti-corrigible lines of reasoning like this may only first appear at high levels of intelligence.

There is one approach to corrigibility that I don't see mentioned in the "tried and failed" post Eliezer linked to here. It's also one that someone at MIRI (Evan Hubinger) among others is still working on: myopia (i.e. myopic cognition).

There are different formulations, but the basic idea is that an AI with myopic cognition would have an extremely high time preference. This means that it would never sacrifice reward now for reward later, and so it would essentially be exempt from instrumental convergence. In theory such an AI would allow itself to be shut down (without forcing shutdown), and it would also not be prone to deceptive alignment [AF · GW].

Myopia isn't fully understood yet and has a number of open problems [AF · GW]. It also will likely require verification using advanced interpretability tools that haven't been developed yet [AF · GW]. I think it's a research direction we as a field should be investing in to figure out if it can work though, and the corrigibility question shouldn't be considered closed until we've at least done that. I can't see anything unnatural about an agent that has both consequentialist reasoning capabilities and a high time preference.

(Note: I'm not suggesting that we should bet the farm on myopic cognition solving alignment, and I'm not suggesting that my critique of Eliezer's point on corrigibility in this comment undermines the overall idea of his post that we're in a very scary situation with AI x-risk. I agree with that and support spreading the word about it as he's doing here, as well as working directly with leading AI labs to try and avoid catastrophe. I also support a number of other technical research directions including interpretability, and I'm open to whatever other strategic, technical and out-of-the-box proposals people have that they think could help.)

Replies from: Eliezer_Yudkowsky, TekhneMakre

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-08T22:33:39.731Z · LW(p) · GW(p)

Well, the obvious #1 question: A myopic AGI is a weaker one, so what is the weaker pivotal act you mean to perform with this putative weaker AGI? A strange thing to omit from one's discussion of machinery - the task that the machinery is to perform.

↑ comment by TekhneMakre · 2022-06-08T22:51:57.532Z · LW(p) · GW(p)

Myopia seems to me like a confused concept because the structure of the world is non-myopic, so to speak. If you myopically try to deal with rocks, you'll for myopic reasons model a rock as a permanent object with a particular shape. But the rock also behaves as a permanent object over much longer time scales than your myopia. So you've in some important sense accessed time-scales much longer than your myopia. I think this happens at any level of a mind. If so, then minds with myopic goals are very similar to minds with non-myopic goals; so similar that they may be basically the same because they'll have non-myopic strategic components that exert their own non-myopic agency.

Replies from: Evan R. Murphy, alexander-gietelink-oldenziel

↑ comment by Evan R. Murphy · 2022-06-12T00:03:19.490Z · LW(p) · GW(p)

Here's a related comment thread debating myopia. This one includes you (TekhneMakre), evhub, Eliezer and others. I'm reading it now to see if there are any cruxes that could help in our present discussion:

https://www.lesswrong.com/posts/5ciYedyQDDqAcrDLr/a-positive-case-for-how-we-might-succeed-at-prosaic-ai?commentId=st5tfgpwnhJrkHaWp [LW(p) · GW(p)]

Replies from: TekhneMakre

↑ comment by TekhneMakre · 2022-06-12T01:22:40.655Z · LW(p) · GW(p)

[Upvoted for looking over past stuff.] On reflection I'm not being that clear in this present thread, and am open to you making a considered counterargument / explanation and then me thinking that over for a longer amount of time to try writing a clearer response / change my mind / etc.

↑ comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2022-06-11T09:17:43.333Z · LW(p) · GW(p)

I suppose the point is that a myopic agent will accept/know that a rock will exist for long time-scales; it just won't care.

Plenty of smart but short-sighted people so not inconceivable.

Replies from: TekhneMakre

↑ comment by TekhneMakre · 2022-06-11T19:54:42.742Z · LW(p) · GW(p)

I'm saying that, for the same reason that myopic agents think about rocks the same way non-myopic agents think about rocks, also myopic agents will care about long-term stuff the same way non-myopic agents do. The thinking needed to make cool stuff happen generalizes like the thinking needed to deal with rocks. So yeah, you can say "myopic agents by definition don't care about long-term stuff", but if by care you mean the thing that actually matters, the thing about causing stuff to happen, then you've swept basically the entire problem under the rug.

Replies from: alexander-gietelink-oldenziel, Evan R. Murphy

↑ comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2022-06-11T21:43:39.878Z · LW(p) · GW(p)

Why can myopic agents not think about long-term stuff the same way as non-myopic agents but still not care about long-term stuff?

Replies from: TekhneMakre

↑ comment by TekhneMakre · 2022-06-11T21:52:59.888Z · LW(p) · GW(p)

They *could*, but we don't know how to separate caring from thinking, modeling, having effects; and the first 1000 programs that think about long term stuff that you find just by looking for programs that think about long term stuff, also care about long term stuff.

↑ comment by Evan R. Murphy · 2022-06-11T20:09:59.337Z · LW(p) · GW(p)

What you're saying seems to contradict the orthogonality thesis. Intelligence level and goals are independent, or at least not tightly interdependent.

Let's use the common example of a paperclip maximizer. Maximizing total long-term paperclips is a strange goal for an agent to have, but most people in AI alignment think it's possible that an AI ~~could be trained to optimize for something like this~~ like this could in principle emerge from training (though we don't know how to reliably train one on purpose).

Now why couldn't an agent be motivated to maximize short-term paperclips? It wants more paperclips, but it will always take 1 paperclip now over 1 or even 10 or 100 a minute in the future. It wants paperclips ASAP. This is one contrived example of what a myopic AI might look like - a myopic paperclip maximizer.
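One way to make "high time preference" concrete is exponential discounting with a very small per-minute discount factor. The sketch below is an assumption about how one might formalize the example, not a claim about how such an agent would actually be built; `discounted_value`, `preferred`, and the specific `gamma` values are hypothetical.

```python
def discounted_value(paperclips, delay_minutes, gamma):
    """Present value of `paperclips` received after `delay_minutes`, discounted by `gamma` per minute."""
    return paperclips * gamma ** delay_minutes

def preferred(options, gamma):
    """Pick the (paperclips, delay_minutes) option with the highest discounted value."""
    return max(options, key=lambda o: discounted_value(o[0], o[1], gamma))

now = (1, 0)      # 1 paperclip immediately
later = (100, 1)  # 100 paperclips a minute from now

myopic_choice = preferred([now, later], gamma=0.001)  # extremely high time preference
patient_choice = preferred([now, later], gamma=0.99)  # mild time preference

print(myopic_choice, patient_choice)
```

With `gamma=0.001` the 100 delayed paperclips are worth only 0.1 in present terms, so the agent takes the single paperclip now. Taking 1 now over any finite number later would require the discount factor to go to zero.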

Replies from: Eliezer_Yudkowsky, TekhneMakre, RobbBB, Jeff Rose

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-12T03:16:49.245Z · LW(p) · GW(p)

I don't think we could train an AI to optimize for long-term paperclips. Maybe I'm not "most people in AI alignment" but still, just saying.

Replies from: Evan R. Murphy

↑ comment by Evan R. Murphy · 2022-06-12T06:56:22.578Z · LW(p) · GW(p)

I was trying to contrast the myopic paperclip maximizer idea with the classic paperclip maximizer. Perhaps "long-term" was a lousy choice of words. What would be better: simple paperclip maximizer, unconditional paperclip maximizer, or something?

Update: On second thought, maybe what you were getting at is that it's not clear how to deliberately train a paperclip maximizer in the current paradigm. If you tried, you'd likely end up with a mesa-optimizer on some unpredictable proxy objective, like a deceptively aligned steel maximizer.

↑ comment by TekhneMakre · 2022-06-11T20:52:55.249Z · LW(p) · GW(p)

Yes, I'm saying that AIs are very likely to have (in a broad sense, including e.g. having subagents that have) long-term goals.

Now why couldn't an agent be motivated to maximize short-term paperclips?

It *could*, but I'm saying that making an AI like that isn't like choosing a loss function for training, because long-term thinking is convergent.

Your original comment said:

I can't see anything unnatural about an agent that has both consequentialist reasoning capabilities and a high time preference.

This is what I'm arguing against. I'm saying it's very unnatural. *Possible*, but very unnatural.

And:

This means that it would never sacrifice reward now for reward later, and so it would essentially be exempt from instrumental convergence.

This sounds like you're saying that myopia *makes* there not be convergent instrumental goals. I'm saying myopia basically *implies* there not being convergent instrumental goals, and therefore is at least as hard as making there not be CIGs.

↑ comment by Rob Bensinger (RobbBB) · 2022-06-12T03:36:48.947Z · LW(p) · GW(p)

most people in AI alignment think it's possible that an AI could be trained to optimize for something like this.

I don't think we have any idea how to do this. If we knew how to get an AGI system to reliably maximize the number of paperclips in the universe, that might be most of the (strawberry-grade) alignment problem solved right there.

Replies from: Evan R. Murphy

↑ comment by Evan R. Murphy · 2022-06-12T07:32:16.714Z · LW(p) · GW(p)

You're right, my mistake - of course we don't know how to deliberately and reliably train a paperclip maximizer. I've updated the parent comment now to say:

most people in AI alignment think it's possible that an AI like this could in principle emerge from training (though we don't know how to reliably train one on purpose).

↑ comment by Jeff Rose · 2022-06-11T20:44:06.673Z · LW(p) · GW(p)

It feels like you are setting a discount rate higher than reality demands. A rationally intelligent agent should wind up with a discount rate that matches reality (e.g. in this case, probably the rate at which paper clips decay or the global real rate of interest).
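For a sense of scale, here is a rough sketch converting an annual rate into the equivalent per-minute discount factor. The 2% figure is a hypothetical placeholder for "the global real rate of interest", and annual compounding is an assumption.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def per_minute_discount(annual_rate):
    """Per-minute discount factor equivalent to discounting by `annual_rate` once per year."""
    return (1.0 / (1.0 + annual_rate)) ** (1.0 / MINUTES_PER_YEAR)

gamma = per_minute_discount(0.02)        # hypothetical 2% annual real rate
value_of_100_in_a_minute = 100 * gamma   # present value of 100 paperclips a minute from now

print(gamma, value_of_100_in_a_minute)
```

At a rate like this the per-minute factor is within a hair of 1, so 100 paperclips a minute from now are worth almost exactly 100 now, which is why an agent that prefers 1 now to 100 in a minute is discounting far more steeply than any such "real-world" rate would suggest.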

comment by artifex · 2022-06-07T16:33:34.646Z · LW(p) · GW(p)

Great post. Many of these arguments are fairly convincing.

comment by lc · 2022-06-06T19:52:02.588Z · LW(p) · GW(p)

4. We can't just "decide not to build AGI" because GPUs are everywhere, and knowledge of algorithms is constantly being improved and published; 2 years after the leading actor has the capability to destroy the world, 5 other actors will have the capability to destroy the world. The given lethal challenge is to solve within a time limit, driven by the dynamic in which, over time, increasingly weak actors with a smaller and smaller fraction of total computing power, become able to build AGI and destroy the world. Powerful actors all refraining in unison from doing the suicidal thing just delays this time limit - it does not lift it, unless computer hardware and computer software progress are both brought to complete severe halts across the whole Earth. The current state of this cooperation to have every big actor refrain from doing the stupid thing, is that at present some large actors with a lot of researchers and computing power are led by people who vocally disdain all talk of AGI safety (eg Facebook AI Research). Note that needing to solve AGI alignment only within a time limit, but with unlimited safe retries for rapid experimentation on the full-powered system; or only on the first critical try, but with an unlimited time bound; would both be terrifically humanity-threatening challenges by historical standards individually.

There's a sleight of hand that is very tempting for some people here. Perhaps it's not tempting for you but I've decided I get it enough for me to point it out. The sleight of hand is to take one or both of the following obvious truths:

We're going to build AGI eventually.
Once we are in "crunch time", where one actor has AGI, quadratically more actors will begin to possess it from software and hardware gains even if one abstains from destroying the world.

And then use that to fallaciously conclude:

Delaying the lethal challenge itself is impossible.

Or the more sophisticated but also wrong:

Attempts to slow down capabilities research mean you have to slow down some particular AGI company or subset of AGI companies which represent the fastest/most careless/etc., and are logarithmically successful.

At present, AI research is not particularly siloed. Institutions like FAIR and OpenAI end up somehow sharing their most impactful underlying insights, about ML scaling or otherwise, with everybody. Everyone is piled into this big Arxiv-bound community where each actor is contributing to the capabilities of each other actor. So, if you can definitively prevent a software capability gain from being published which, on expectation, would have saved FAIR or whoever else ends up actually pressing the button a couple days, that'd be pretty sweet.

Perhaps it is impossible to do that effectively, and I'm just a 20yo too quick to stop heeding his elders' examples, I don't know. But when people disagree with me about capabilities research being bad, they usually make this mental mis-step where they conflate "preventing a single actor from pressing the button" and "slowing the Eldritch rising tide of software and hardware improvements in AI". That or they think AGI isn't gonna be bad, but I think it's gonna be bad, so.

Replies from: ArthurB

↑ comment by ArthurB · 2022-06-06T20:45:23.817Z · LW(p) · GW(p)

In addition

There aren't that many actors in the lead.
Simple but key insights in AI (e.g. doing backprop, using sensible weight initialisation) have been missed for decades.

If the right tail for the time to AGI by a single group can be long and there aren't that many groups, convincing one group to slow down / pay more attention to safety can have big effects.

How big of an effect? Years doesn't seem off the table. Eliezer suggests 6 months dismissively. But add a couple years here and a couple years there, and pretty soon you're talking about the possibility of real progress. It's obviously of little use if no research towards alignment is attempted in that period of course, but it's not nothing.

Replies from: lc

↑ comment by lc · 2022-06-06T21:19:52.931Z · LW(p) · GW(p)

It's obviously of little use if no research towards alignment is attempted in that period of course, but it's not nothing.

It's of use at least inasmuch as it increases my life expectancy.

comment by No77e (no77e-noi) · 2022-06-06T19:49:14.321Z · LW(p) · GW(p)

The first thing generally, or CEV specifically, is unworkable because the complexity of what needs to be aligned or meta-aligned for our Real Actual Values is far out of reach for our FIRST TRY at AGI. Yes I mean specifically that the dataset, meta-learning algorithm, and what needs to be learned, is far out of reach for our first try. It's not just non-hand-codable, it is unteachable on-the-first-try because the thing you are trying to teach is too weird and complicated.

Why is CEV so difficult? And if CEV is impossible to learn first try, why not shoot for something less ambitious? Value is fragile, OK, but aren't there easier utopias?

Many humans would be able to distinguish utopia from dystopia if they saw them, and humanity's only advantage over an AI is that the brain has "evolution presets".

Humans are relatively dumb, so why can't even a relatively dumb AI learn the same ability to distinguish utopias from dystopias?

To anyone reading: don't interpret these questions as disagreement. If someone doesn't, for example, understand a mathematical proof, they might express disagreement with the proof while knowing full well that they haven't discovered a mistake in it and that they are simply confused.

Replies from: no77e-noi, Kenny

↑ comment by No77e (no77e-noi) · 2022-06-07T15:21:58.249Z · LW(p) · GW(p)

Why not shoot for something less ambitious?

I'll give myself a provisional answer. I'm not sure if it satisfies me, but it's enough to make me pause: Anything short of CEV might leave open an unacceptably high chance of fates worse than death.

↑ comment by Kenny · 2022-06-09T16:39:59.678Z · LW(p) · GW(p)

CEV is difficult because our values seem to be very complex.

Value is fragile, OK, but aren't there easier utopias?

Building an AGI (let alone a super-intelligent AGI) that aimed for an 'easier utopia' would require somehow convincing/persuading/aligning the AI to give up a LOT of value. I don't think it's possible without solving alignment anyways. Essentially, it seems like we'd be trying to 'convince' the AGI to 'not go too fast because that might be bad'. The problem is that we don't know how to precisely specify what "bad" is anyways.

Many humans would be able to distinguish utopia from dystopia if they saw them

That's very much not obvious. I don't think that, e.g. humans from even 100 years ago teleported to today would be able to reliably distinguish the current world from a 'dystopia'.

I haven't myself noticed much agreement about the various utopias people have already described! That seems like pretty strong evidence that 'utopia' is in fact very hard to specify.

comment by ArthurB · 2022-06-06T19:07:02.494Z · LW(p) · GW(p)

There are IMO in-distribution ways of successfully destroying much of the computing overhang. It's not easy by any means, but on a scale where "the Mossad pulling off Stuxnet" is 0 and "build self replicating nanobots" is 10, I think it's closer to a 1.5.

comment by Mass_Driver · 2022-06-06T01:50:45.566Z · LW(p) · GW(p)

I mostly agree with the reasoning here; thank you to Eliezer for posting it and explaining it clearly. It's good to have all these reasons here in one place.

The one area I partly disagree with is Section B.1. As I understand it, the main point of B.1 is that we can't guard against all of the problems that will crop up as AI grows more intelligent, because we can't foresee all of those problems, because most of them will be "out-of-distribution," i.e., not the kinds of problems where we have reasonable training data. A superintelligent AI will do strange things that wouldn't have occurred to us, precisely because it's smarter than we are, and some of those things will be dangerous enough to wipe out all human life.

I think this somewhat overstates the problem. If we tell an AI not to invent nanotechnology, not to send anything to protein labs, not to hack into all of the world's computers, not to design weird new quantum particles, not to do 100 of the other most dangerous and weirdest things we can think of, and then ask it to generalize and learn not to do things of that sort and build avoidance of catastrophic danger as a category into its utility function...

And then we test whether the AI is actually doing these things and successfully using something like the human category of "catastrophe" when the AI is only slightly smarter than humans...

And then learn from those tests and honestly look at the failures and improve the AI's catastrophe-avoidance skills based on what we learn...

Then the chances that that AI won't immediately destroy the world seem to me to be much much larger than 0.1%. They're still low, which is bad, but they're not laughably insignificant, either, because if you make an honest, thoughtful, sustained effort to constrain the preferences of your successors, then often you at least partially succeed.

If natural selection had feelings, it might not be maximally happy with the way humans are behaving in the wake of Cro-Magnon optimization...but it probably wouldn't call it a disaster, either. Despite the existence of contraception, there sure are a whole lot more Cro-Magnons than there ever were Neanderthals, and the population is still going up every year.

Similarly, training an AI to act responsibly isn't going to get us a reliably safe AI, but if whoever launches the first super-intelligent AI puts enough effort into that kind of training, then I don't see any reason why we shouldn't expect at least a 50% chance of a million or better survivors. I'm much more worried about large, powerful organizations that "vocally disdain all talk of AGI safety" than I am about the possibility that AGI safety research is inherently futile. It's inherently imperfect in that there's no apparent path to guaranteeing the friendliness of superintelligence...but that's not quite the same thing as saying that we shouldn't expect to be able to increase the probability that superintelligence is at least marginally friendly.

Replies from: RobbBB, Chris_Leong

↑ comment by Rob Bensinger (RobbBB) · 2022-06-06T06:44:27.515Z · LW(p) · GW(p)

If natural selection had feelings, it might not be maximally happy with the way humans are behaving in the wake of Cro-Magnon optimization...but it probably wouldn't call it a disaster, either.

Out of a population of 8 billion humans, in a world that has known about Darwin for generations, very nearly zero are trying to directly manufacture large numbers of copies of their genomes -- there is almost no creative generalization towards 'make more copies of my genome' as a goal in its own right.

Meanwhile, there is some creativity going into the proxy goal 'have more babies', and even more creativity going into the second-order proxy goal 'have more sex'. But the net effect is that the world is becoming wealthier, and the wealthiest places are reliably choosing static or declining population sizes.

And if you wind the clock forward, you likely see humans transitioning into brain emulations (and then self-modifying a bunch), leaving DNA self-replicators behind entirely. (Or you see humans replacing themselves with AGIs. But it would be question-begging to cite this particular prediction here, though it is yet another way humans are catastrophically straying from what human natural selection 'wanted'.)

Replies from: Mass_Driver, interstice, Chris_Leong

↑ comment by Mass_Driver · 2022-06-06T14:29:32.761Z · LW(p) · GW(p)

Right, I'm not claiming that AGI will do anything like straightforwardly maximize human utility. I'm claiming that if we work hard enough at teaching it to avoid disaster, it has a significant chance of avoiding disaster.

The fact that nobody is artificially mass-producing their genes is not a disaster from Darwin's point of view; Darwin is vaguely satisfied that instead of a million humans there are now 7 billion humans. If the population stabilizes at 11 billion, that is also not a Darwinian disaster. If the population spreads across the galaxy, mostly in the form of emulations and AIs, but with even 0.001% of sentient beings maintaining some human DNA as a pet or a bit of nostalgia, that's still way more copies of our DNA than the Neanderthals were ever going to get.

There are probably some really convincing analogies or intuition pumps somewhere that show that values are likely to be obliterated after a jump in intelligence, but I really don't think evolution/contraception is one of those analogies.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-06T21:49:13.124Z · LW(p) · GW(p)

I'm claiming that if we work hard enough at teaching it to avoid disaster, it has a significant chance of avoiding disaster.

As stated, I think Eliezer and I, and nearly everyone else, would agree with this.

The fact that nobody is artificially mass-producing their genes is not a disaster from Darwin's point of view; Darwin is vaguely satisfied that instead of a million humans there are now 7 billion humans.

?? Why would human natural selection be satisfied with 7 billion but not satisfied with a million? Seems like you could equally say 'natural selection is satisfied with a million, since at least a million is higher than a thousand'. Or 'natural selection is satisfied with a hundred, since at least a hundred is higher than fifty'.

I understand the idea of extracting from a population's process of natural selection a pseudo-goal, 'maximize inclusive genetic fitness'; I don't understand the idea of adding that natural selection has some threshold where it 'feels' 'satisfied'.

Replies from: Mass_Driver

↑ comment by Mass_Driver · 2022-06-07T06:46:42.433Z · LW(p) · GW(p)

Sure, the metaphor is strained because natural selection doesn't have feelings, so it's never going to feel satisfied, because it's never going to feel anything. For whatever it's worth, I didn't pick that metaphor; Eliezer mentions contraception in his original post.

As I understand it, the point of bringing up contraception is to show that when you move from one level of intelligence to another, much higher level of intelligence, then the more intelligent agent can wind up optimizing for values that would be anathema to the less intelligent agents, even if the less intelligent agents have done everything they can to pass along their values. My objection to this illustration is that I don't think anyone's demonstrated that human goals could plausibly be described as "anathema" to natural selection. Overall, humans are pursuing a set of goals that are relatively well-aligned with natural selection's pseudo-goals.

↑ comment by interstice · 2022-06-22T14:42:53.256Z · LW(p) · GW(p)

Why do you think the goal of evolution is "more copies of genome" rather than "more babies"? To the extent that evolution can be said to have a goal, I think "more babies" is closer -- e.g. imagine a mutation that caused uncontrolled DNA replication within a cell. That would lead to lots of copies of its genome but not more reproductive fitness. (Really, I guess this means that you need to specify which evolution you're talking about -- I think the evolution for healthy adult humans has "babies who grow to adulthood" as its goal.)

w.r.t. declining population sizes, I think it's likely we would return to malthusianism after a few more generations of genetic/cultural selection under modern conditions. Although as you say the singularity is going to come before that can happen.

↑ comment by Chris_Leong · 2022-06-07T11:27:52.172Z · LW(p) · GW(p)

Yeah, but the population is still pretty large and could become much larger if we become intergalactic. And possibly this is more likely than if we were at the Malthusian limits.

↑ comment by Chris_Leong · 2022-06-06T08:52:08.886Z · LW(p) · GW(p)

If we tell an AI not to invent nanotechnology, not to send anything to protein labs, not to hack into all of the world's computers, not to design weird new quantum particles, not to do 100 of the other most dangerous and weirdest things we can think of, and then ask it to generalize and learn not to do things of that sort

I had the exact same thought. My guess would be that Eliezer might say that, since the AI is maximising, if the generalisation function misses even one action of this sort as something that we should exclude, then we're screwed.

Replies from: Mass_Driver

↑ comment by Mass_Driver · 2022-06-06T14:31:32.293Z · LW(p) · GW(p)

Sure, I agree! If we miss even one such action, we're screwed. My point is that if people put enough skill and effort into trying to catch all such actions, then there is a significant chance that they'll catch literally all the actions that are (1) world-ending and that (2) the AI actually wants to try.

There's also a significant chance we won't, which is quite bad and very alarming, hence people should work on AI safety.

Replies from: Chris_Leong

↑ comment by Chris_Leong · 2022-06-06T14:47:39.832Z · LW(p) · GW(p)

Hmm... It seems much, much harder to catch every single one than to catch 99%.

Replies from: Mass_Driver

↑ comment by Mass_Driver · 2022-06-07T06:37:43.719Z · LW(p) · GW(p)

One of my assumptions is that it's possible to design a "satisficing" engine -- an algorithm that generates candidate proposals for a fixed number of cycles, and then, assuming at least one proposal with estimated utility greater than X has been generated within that amount of time, selects one of the qualifying proposals at random. If there are no qualifying candidates, the AI takes no action.

If you have a straightforward optimizer that always returns the action with the highest expected utility, then, yeah, you only have to miss one "cheat" that improves "official" utility at the expense of murdering everyone everywhere and then we all die. But if you have a satisficer, then as long as some of the qualifying plans don't kill everyone, there's a reasonable chance that the AI will pick one of those plans. Even if you forget to explicitly penalize one of the pathways to disaster, there's no special reason why that one pathway would show up in a large majority of the AI's candidate plans.
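To make the contrast concrete, here is a minimal sketch of the two selection rules described above. It is an illustration under stated assumptions, not a proposal for how a real system would work: `propose`, `estimate_utility`, `threshold` (the "X" above) and `cycles` are all hypothetical names, and the plan generator is assumed to be an arbitrary function supplied by the caller.

```python
import random
from typing import Callable, List, Optional, TypeVar

Plan = TypeVar("Plan")


def satisfice(
    propose: Callable[[random.Random], Plan],   # hypothetical plan generator
    estimate_utility: Callable[[Plan], float],  # hypothetical utility estimate
    threshold: float,                           # the "X" in the comment above
    cycles: int,                                # fixed brainstorming budget
    rng: Optional[random.Random] = None,
) -> Optional[Plan]:
    """Pick uniformly at random among plans whose estimated utility exceeds
    the threshold; return None (take no action) if none qualify."""
    if rng is None:
        rng = random.Random()
    qualifying: List[Plan] = []
    for _ in range(cycles):
        plan = propose(rng)
        if estimate_utility(plan) > threshold:
            qualifying.append(plan)
    if not qualifying:
        return None  # no qualifying candidate: the AI takes no action
    return rng.choice(qualifying)


def optimize(
    propose: Callable[[random.Random], Plan],
    estimate_utility: Callable[[Plan], float],
    cycles: int,
    rng: Optional[random.Random] = None,
) -> Plan:
    """Straightforward optimizer: always return the highest-utility candidate."""
    if rng is None:
        rng = random.Random()
    candidates = [propose(rng) for _ in range(cycles)]
    return max(candidates, key=estimate_utility)
```

In this toy framing, a single unpenalized "cheat" with the top score is always what `optimize` returns, whereas `satisfice` returns it only in proportion to how often it appears among the qualifying plans. How often that is depends entirely on the generator, which the sketch leaves open.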

Replies from: TurnTrout, Chris_Leong

↑ comment by TurnTrout · 2022-06-10T01:45:11.676Z · LW(p) · GW(p)

there's no special reason why that one pathway would show up in a large majority of the [satisficer]'s candidate plans.

There is a special reason, and it's called "instrumental convergence." Satisficers tend to seek power [AF · GW].

Replies from: Mass_Driver

↑ comment by Mass_Driver · 2022-06-12T07:14:13.176Z · LW(p) · GW(p)

I suspect we're talking about two different things.

If you just naively program a super-intelligent AI to satisfice a goal, then, sure, most of the candidate pathways to satisfice will involve accruing a lot of some type of power, because power is useful for achieving goals. That's a valid point, and it's important to understand that merely switching from optimizers to satisficers won't adequately protect us against overly ambitious AIs.

However, that doesn't mean that it's futile to explicitly penalize most (but not literally all) of the paths that the AI could take to accumulate too much power. Suppose you adequately penalize all of the tactics that would have catastrophic side effects except for, oh, I don't know, cornering the world market on all types of grain, because you forgot to think of that one particular path to power. Would most of the candidate plans that the AI submits for human approval then turn out to involve secretly cornering the grain market?

I don't see why they would. All else being equal, sure, the most powerful tactic available is going to be somewhat more attractive than other, less powerful tactics. But how much more attractive? Would an AI generate 1,000x more plans that involve one specific globally disruptive tactic like cornering the grain market vs. all of the various tactics that are only locally disruptive, like publishing propaganda or hiring away competing researchers or hacking into competing research labs or interrupting the flow of electricity to those labs? 10x more such plans? 2x more such plans? I don't think that's the kind of estimate you can make just by gesturing at basic principles of game theory; you'd need some concrete domain knowledge about the AI's specific planning algorithms.

If the truly dangerous plans are only 10x more common on the initial brainstorming list, then we can probably make sure those plans aren't chosen by filtering for plans that the AI rates as safe and transparent. We can use 3 different AIs on airgapped equipment -- one that only "cares" about evaluating whether a plan is safe (i.e., would the plan cause unacceptably large changes if implemented?), one that only "cares" about evaluating whether a plan is transparent (i.e., can the plan be explained to humans so that the humans will have an accurate understanding of what actions will be taken as part of the plan?), and one that generates plans that satisfice for a substantive goal (in this case, shutting down opposing AI research).

Of the plans that are approved by all 3 AIs, most of them probably won't involve accumulating excessive amounts of power, because usually if you accumulate too much power, then that is (a) unsafe, and (b) likely to involve lying to 'your' human programmers.

If powerful plans are 10x more common on the initial brainstorming list, then you need filters for safety and/or transparency that are at least 90% effective at catching unacceptably powerful plans in order to get back up to a 50% chance of humanity surviving. If powerful plans are more than 10x more common, then you need a set of filters that's more reliable than 90%. I don't think we have any idea what those numbers are yet, but I do think it's worth trying to reduce how common it is for excessively powerful plans to show up on the initial brainstorming list, and I think we can do that by training AIs to recognize dangerously disruptive plans and to try to avoid those types of plans. It's better to at least try to get AIs to engage with the concept of "this plan is too disruptive" than to throw up our hands and say, "Oh, power is an attractor in game theory space, so there's no possible way to get brilliant AIs that don't seize infinite power."
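The 10x / 90% / 50% arithmetic above can be checked with a short sketch. It rests on simplifying assumptions that are mine rather than the comment's: the final plan is drawn uniformly from whatever survives the filters, safe plans pass the filters at a fixed rate (all of them by default), and "humanity survives" is treated as equivalent to "the chosen plan is safe". The function and parameter names are hypothetical.

```python
def safe_fraction(danger_ratio: float, catch_rate: float,
                  safe_pass_rate: float = 1.0) -> float:
    """Fraction of filter-surviving plans that are safe.

    danger_ratio:   dangerous plans per safe plan on the initial list (e.g. 10)
    catch_rate:     fraction of dangerous plans the filters remove (0..1)
    safe_pass_rate: fraction of safe plans that get through (assumed 1.0)
    """
    dangerous_left = danger_ratio * (1.0 - catch_rate)
    safe_left = safe_pass_rate
    total = dangerous_left + safe_left
    if total <= 0.0:
        raise ValueError("no plans survive the filters")
    return safe_left / total


def required_catch_rate(danger_ratio: float, target: float) -> float:
    """Smallest catch rate giving at least `target` safe fraction,
    assuming every safe plan passes the filters.

    Solves 1 / (1 + r * (1 - c)) = t for c, i.e. c = 1 - (1 - t) / (t * r).
    """
    if danger_ratio <= 0.0 or not 0.0 < target <= 1.0:
        raise ValueError("need danger_ratio > 0 and 0 < target <= 1")
    needed = 1.0 - (1.0 - target) / (target * danger_ratio)
    return max(0.0, needed)  # no filtering needed if already above target
```

Under these assumptions, `required_catch_rate(10, 0.5)` works out to 0.9, matching the figure in the comment, and a 1,000x ratio would need filters that catch 99.9% of dangerous plans for the same 50% target.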

↑ comment by Chris_Leong · 2022-06-07T11:21:13.203Z · LW(p) · GW(p)

You mean quantilization [? · GW]? Oh yeah, I forgot about that. Good point.

comment by Garrett Baker (D0TheMath) · 2022-06-06T01:31:43.352Z · LW(p) · GW(p)

[small nitpick]

I figured this stuff out using the null string as input, and frankly, I have a hard time myself feeling hopeful about getting real alignment work out of somebody who previously sat around waiting for somebody else to input a persuasive argument into them. This ability to "notice lethal difficulties without Eliezer Yudkowsky arguing you into noticing them" currently is an opaque piece of cognitive machinery to me, I do not know how to train it into others. It probably relates to 'security mindset', and a mental motion where you refuse to play out scripts, and being able to operate in a field that's in a state of chaos.

I find this hard to believe. I'm sure you had some conversations with others which allowed you to arrive at these conclusions. In particular, your Intelligence Explosion Microeconomics paper uses the data from the evolution of humans to make the case that making intelligence higher was easy for evolution once the ball got rolling, which is not the null string.

Replies from: Eliezer_Yudkowsky, RobbBB

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-06T03:08:28.003Z · LW(p) · GW(p)

Null string socially. I obviously was allowed to look at the external world to form these conclusions, which is not the same as needing somebody to nag me into doing so.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2022-06-06T03:26:52.750Z · LW(p) · GW(p)

This makes more sense. I think you should clarify that this is what you mean when talking about the null string analogy in the future, especially when talking about what thinking about hard-to-think-about topics should look like. It seems fine, and probably useful, as long as you know it's a vast overstatement, but because it's a vast overstatement, it doesn't actually provide that much actionable advice.

Concretely, instead of talking about the null string, it would be more helpful if you talked about the amount of discussion it should take a prospective researcher to reach correct conclusions. From literal null-string for the optimal agent, to vague pointing in the correct direction for a pretty good researcher, to a fully formal and certain proof listing every claim and counter-claim imaginable for someone who probably shouldn't go into alignment.

↑ comment by Rob Bensinger (RobbBB) · 2022-06-06T03:15:11.973Z · LW(p) · GW(p)

If you read the linked tweet (https://twitter.com/ESYudkowsky/status/1500863629490544645), it's talking about the persuasion/convincing/pushing you need in addition to whatever raw data makes it possible to reach the conclusion; it's not saying that humans can get by without any Bayesian evidence about the external world.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2022-06-06T03:19:30.975Z · LW(p) · GW(p)

I did read the linked tweet, and now that you bring it up, my third sentence doesn't apply. But I think my first & second sentences do still apply (ignoring Eliezer's recent clarification).

comment by Evan R. Murphy · 2022-06-08T22:03:19.412Z · LW(p) · GW(p)

Eliezer cross-posted this to the Effective Altruism Forum where there are a few more comments: (In case 600+ comments wasn't enough for anyone!)

https://forum.effectivealtruism.org/posts/zzFbZyGP6iz8jLe9n/agi-ruin-a-list-of-lethalities [EA · GW]

comment by Cédric · 2022-06-08T02:34:10.850Z · LW(p) · GW(p)

Imagine we're all in a paddleboat paddling towards a waterfall. Inside the paddleboat is everyone but only a relatively small number of them are doing the paddling. Of those paddling, most are aware of the waterfall ahead but for reasons beyond my comprehension, decide to paddle on anyway. A smaller group of paddlers have realised their predicament and have decided to stop paddling and start building wings onto the paddleboat so that when the paddleboat inevitably hurtles off the waterfall, it might fly.

It seems to me like the most sensible course of action is to stop paddling until the wings are built and we know for sure they're going to work. So why isn't the main strategy definitively proving that we're heading towards the waterfall and raising awareness until the culture has shifted enough that paddling is taboo? With this strategy, even if the paddling doesn't stop, at least it buys time for the wings to be constructed. Trying to get people to stop paddling seems like a higher probability of success than wing building + increases the probability of success of wing building as it buys time.

I suspect that part of the reason for just focusing on the wings is the desire to reap the rewards of aligned AGI within our lifetimes. The clout of being the ones who did the final work. The immortality. The benefits that we can't yet imagine etc etc. Maybe infinite rewards justifies infinite risk but it does not apply in this case because we can still get the infinite rewards without so much risk if we just wait until the risks are eliminated.

Replies from: JBlack, lc

↑ comment by JBlack · 2022-06-08T02:51:44.272Z · LW(p) · GW(p)

Maybe infinite rewards justifies infinite risk but it does not apply in this case because we can still get the infinite rewards without so much risk if we just wait until the risks are eliminated.

If eliminating the risk takes 80+ years and AI development is paused for that to complete, then it is very likely that everyone currently reading this comment will die before it is finished. From a purely selfish point of view it can easily make sense for a researcher to continue even if they fully believe that there is a 90%+ chance that AI will kill them. Waiting will also almost certainly kill them, and they won't get any of those infinite rewards anyway.

Being less than 90% convinced that AI will kill them just makes it even more attractive. Hyperbolic discounting makes it even more attractive still.

Replies from: RobbBB, Cédric

↑ comment by Rob Bensinger (RobbBB) · 2022-06-08T10:21:17.790Z · LW(p) · GW(p)

It's not obvious to me that it takes 80+ years to get double-digit alignment success probabilities, from where we are. Waiting a few decades strikes me as obviously smart from a selfish perspective; e.g., AGI in 2052 is selfishly a lot better than AGI in 2032, if you're under age 50 today.

But also, I think the current state of humanity's alignment knowledge is very bad. I think your odds of surviving into the far future are a lot higher if you die in a few decades and get cryopreserved and then need to hope AGI works out in 80+ years, than if you survive to see AGI in the next 20 years.

Replies from: JBlack, sharmake-farah

↑ comment by JBlack · 2022-06-09T03:49:24.334Z · LW(p) · GW(p)

True, you can make use of the Gompertz curve to get marginal benefit from waiting a bit while you still have a low marginal probability of non-AGI death.

So we only need to worry about researchers who have lower estimates of unaligned AGI causing their death, or who think that AGI is a long way out and want to hurry it up now.

↑ comment by Noosphere89 (sharmake-farah) · 2022-06-08T14:04:43.982Z · LW(p) · GW(p)

Unfortunately, cryopreservation isn't nearly as reliable as needed in order to assume immortality is achieved. While we've gotten better at it, it still relies on toxic chemicals in order to vitrify the brain.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-08T19:46:27.547Z · LW(p) · GW(p)

I'm not saying it's reliable!! I'm saying the odds of alignment success in the next 20 years currently look even worse.

↑ comment by Cédric · 2022-06-08T03:35:08.483Z · LW(p) · GW(p)

Well then let's use hyperbolic discounting to our advantage. If we make paddling sufficiently taboo, the social punishment of paddling will outweigh the rewards of potentially building AGI in the minds of the selfish researchers.

↑ comment by lc · 2022-06-08T02:45:11.843Z · LW(p) · GW(p)

Dunno what that last sentence was but generally I agree.

At the same time: be the change you wish to see in the world. Don't just tell people who are already working on it they should be doing something else. Actually do that raising the alarm thing first.

Replies from: Cédric

↑ comment by Cédric · 2022-06-08T03:28:42.808Z · LW(p) · GW(p)

What I'm doing is trying to help with the wings by throwing some money at MIRI. I am also helping with the stopping/slowing of paddling by sharing my very simple reasoning about why that's the most sensible course of action. Hopefully the simple idea will spread and have some influence.

To be honest, I am not willing to invest that much into this as I have other things I am working on (sounds so insane to type that I am not willing to invest much into preventing the doom of everyone and everything). Anyway, there are many like me who are willing to help but only if the cost is low so if you have any ideas of what people like me could do to shift the probabilities a bit, let me know.

Replies from: Kenny

↑ comment by Kenny · 2022-06-09T16:53:27.921Z · LW(p) · GW(p)

Sadly, it doesn't seem like there's any low-hanging fruit that would even "shift the probabilities a bit".

Most people seem, if anything, anti-receptive to any arguments about this, because, e.g. it's 'too weird'.

And I too feel like this describes myself:

To be honest, I am not willing to invest that much into this as I have other things I am working on (sounds so insane to type that I am not willing to invest much into preventing the doom of everyone and everything).

I'm thinking – very tentatively (sadly) – about maybe looking into my own personal options for some way to help, but I'm also distracted by "other things".

I find this – even people that are (at least somewhat) convinced still not being willing to basically 'throw everything else away' (up to the limit of what would impair our abilities to actually help, if not succeed) – to be particularly strong evidence that this might be overall effectively impossible.

comment by Lukas_Gloor · 2022-06-07T16:22:16.682Z · LW(p) · GW(p)

On point 35, "Any system of sufficiently intelligent agents can probably behave as a single agent, even if you imagine you're playing them against each other":

This claim is somewhat surprising to me given that you're expecting powerful ML systems to remain very hard to interpret to humans.

I guess the assumption is that superintelligent ML models/systems may not remain uninterpretable to each other, especially not with the strong incentive to advance interpretability in specific domains/contexts (benefits from cooperation or from making early commitments in commitment races).

Still, if a problem is hard enough, then the fact that strong incentives exist to solve it doesn't mean it will likely be solved. Having thought a bit about possible avenues [LW · GW] to make credible commitments, it feels non-obvious to me whether superintelligent systems will be able to divide up the lightcone, etc. If anyone has more thoughts on the topic, I'd be very interested.

Replies from: Kenny

↑ comment by Kenny · 2022-06-09T16:56:43.961Z · LW(p) · GW(p)

I think this mostly covers the relevant intuitions:

I guess the assumption is that superintelligent ML models/systems may not remain uninterpretable to each other, especially not with the strong incentive to advance interpretability in specific domains/contexts (benefits from cooperation or from making early commitments in commitment races).

It's the kind of 'obvious' strategy that I think sufficiently 'smart' people would use already.

comment by Kredan · 2022-06-07T10:55:24.534Z · LW(p) · GW(p)

This looks like a great list of risk factors leading to AI lethalities, of why making AI safe is a hard problem, and of why we are failing. But this post is also not what I would have expected from taking the title at face value. I thought that the post would be about detailed and credible scenarios suggesting how AI could lead to extinction, where for example each scenario could represent a class of AI X-risks that we want to reduce. I suspect that such an article would also be really helpful because we probably have not been so good at generating very detailed and credible scenarios of doom so far. There are risks of info hazard associated with that for sure. Also I am sympathetic to the argument that "AI does not think like you do" and that AI is likely to lead to doom in ways we cannot think of because of its massive strategic advantage. But still I think it might be very helpful to write some detailed and credible stories of doom so that a large part of the AI community takes extinction risks from AI really seriously and approaches AI capability research more like working at a high-security biohazard lab. Perverse incentives might still lead lots of people in the AI community to not take these concerns seriously. Also, it is true that there are some posts going in that direction, for example What failure looks like [LW · GW] and It looks like you’re trying to take over the world, but I don’t think we have done enough on that front, and that probably hinders our capacity to have X-risks from AI be taken seriously.

comment by romeostevensit · 2022-06-06T18:24:20.748Z · LW(p) · GW(p)

New-to-me thought I had in response to the kill all humans part. When predators are a threat to you, you of course shoot them. But once you invent cheap tech that can control them you don't need to kill them anymore. The story goes that the AI would kill us either because we are a threat or because we are irrelevant. It seems to me that (and this imports a bunch of extra stuff that would require analysis to turn this into a serious analysis, this is just an idle thought), the first thing I do if I am superintelligent and wanting to secure my position is not take over the earth, which isn't in a particularly useful spot resource-wise and instead launch my nanofactory beyond the reach of humans to Mercury or something. Similarly, in the nanomachines in everyone's blood that can kill them instantly class of ideas, why do I need at that point to actually pull the switch? I.e. the kill all humans scenario is emotionally salient but doesn't actually clearly follow the power gradients that you want to climb for instrumental convergence reasons?

Replies from: steve2152, RobbBB, TekhneMakre, talelore

↑ comment by Steven Byrnes (steve2152) · 2022-06-06T18:31:12.794Z · LW(p) · GW(p)

If humans were able to make one super-powerful AI, then humans would probably be able to make a second super-powerful AI, with different goals, which would then compete with the first AI. Unless, of course, the humans are somehow prevented from making more AIs, e.g. because they're all dead.

Replies from: romeostevensit

↑ comment by romeostevensit · 2022-06-06T19:40:04.910Z · LW(p) · GW(p)

I guess the threat model relies on the overhang. If you need x compute for powerful AI, then you need to control more than all the compute on earth minus x to ensure safety, or something like that. Controlling the people is probably much easier.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-06T22:28:37.311Z · LW(p) · GW(p)

Yes, where killing all humans is an example of "controlling the people", from the perspective of an Unfriendly AI.

↑ comment by Rob Bensinger (RobbBB) · 2022-06-06T22:45:43.510Z · LW(p) · GW(p)

But once you invent cheap tech that can control them you don't need to kill them anymore.

A paperclipper mainly cares about humans because we might have some way to threaten the paperclipper (e.g., by pushing a button that deploys a rival superintelligence); and secondarily, we're made of atoms that can be used to build paperclips.

It's harder to monitor the actions of every single human on Earth, than it is to kill all humans; and there's a risk that monitoring people visibly will cause someone to push the 'deploy a rival superintelligence' button, if such a button exists.

Also, every minute that passes without you killing all humans, in the time window between 'I'm confident I can kill all humans' and 'I'm carefully surveilling every human on Earth and know that there's no secret bunker where someone has a Deploy Superintelligence button', is a minute where you're risking somebody pushing the 'deploy a rival superintelligence' button. This makes me think that the value of delaying 'killing all humans' (once you're confident you can do it) would need to be very high in order to offset that risk.

One reason I might be wrong is if the AGI is worried about something like a dead man's switch that deploys a rival superintelligence iff some human isn't alive and regularly performing some action. (Not necessarily a likely scenario on priors, but once you're confident enough in your base plan, unlikely scenarios can end up dominating the remaining scenarios where you lose.) Then it's at least possible that you'd want to delay long enough to confirm that no such switch exists.

the first thing I do if I am superintelligent and wanting to secure my position is not take over the earth, which isn't in a particularly useful spot resource-wise and instead launch my nanofactory beyond the reach of humans to Mercury or something.

You should be able to do both in parallel. I don't have a strong view on which is higher-priority. Given the dead-man's-switch worry above, you might want to prioritize sending a probe off-planet first as a precaution; but then go ahead and kill humans ASAP.

Replies from: romeostevensit

↑ comment by romeostevensit · 2022-06-07T01:39:46.218Z · LW(p) · GW(p)

This is exactly what I was thinking about though, this idea of monitoring every human on earth seems like a failure of imagination on our part. I'm not safe from predators because I monitor the location of every predator on earth. I admit that many (overwhelming majority probably) of scenarios in this vein are probably pretty bad and involve things like putting only a few humans on ice while getting rid of the rest.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-07T04:07:30.279Z · LW(p) · GW(p)

I mean, all of this feels very speculative and un-cruxy to me; I wouldn't be surprised if the ASI indeed is able to conclude that humanity is no threat at all, in which case it kills us just to harvest the resources.

I do think that normal predators are a little misleading in this context, though, because they haven't crossed the generality ('can do science and tech') threshold. Tigers won't invent new machines, so it's easier to upper-bound their capabilities. General intelligences are at least somewhat qualitatively trickier, because your enemy is 'the space of all reachable technologies' (including tech that may be surprisingly reachable). Tigers can surprise you, but not in very many ways and not to a large degree.

↑ comment by TekhneMakre · 2022-06-06T20:21:44.492Z · LW(p) · GW(p)

You don't need to kill them, but it's still helpful. There could be a moment where it's a better investment to send stuff into some temporarily unreachable spot like Mercury or the bottom of the ocean, than to kill everything, though I don't see practically how you could send something to the bottom of the ocean that would carry on your goals (a nanofactory programmed to make computers and run a copy of your source code, say) without also being able to easily kill everything on Earth. But regardless, soon after that moment, you're able to kill everything, and that's still a CIG.

↑ comment by talelore · 2022-06-06T20:23:57.799Z · LW(p) · GW(p)

I suspect a sufficiently intelligent, unaligned artificial intelligence would both kill us all immediately, and immediately start expanding its reach in all directions of space at near light speed. There is no reason for there to be an either-or.

Replies from: romeostevensit

↑ comment by romeostevensit · 2022-06-06T20:48:21.824Z · LW(p) · GW(p)

Knowing you came from neuromorphic architecture, and other than humans being threatening to you, why would you destroy the most complex thing you are aware of? Sure, maybe you put a few humans on ice and get rid of the rest.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-06T22:33:10.108Z · LW(p) · GW(p)

I agree it's plausible that a paperclip maximizer would destructively scan a human or two and keep the scan around for some length of time. Though I'd guess this has almost no effect on the future's long-term EV.

comment by ClipMonger · 2022-12-02T09:08:54.192Z · LW(p) · GW(p)

Is 664 comments the most on any lesswrong post? I'm not sure how to sort by that.

Replies from: jimrandomh

↑ comment by jimrandomh · 2022-12-02T10:23:44.141Z · LW(p) · GW(p)

Nope, https://www.lesswrong.com/posts/CG9AEXwSjdrXPBEZ9/welcome-to-less-wrong [LW · GW] has 2003 and https://www.lesswrong.com/posts/yCWPkLi8wJvewPbEp/the-noncentral-fallacy-the-worst-argument-in-the-world [LW · GW] has 1758.

comment by otto.barten (otto-barten) · 2022-06-10T12:46:42.612Z · LW(p) · GW(p)

(4): I think regulation should get much more thought than this. I don't think you can defend the point that regulation would have 0% probability of working. It really depends on how many people are how scared. And that's something we could quite possibly change, if we would actually try (LW and EA haven't tried).

In terms of implementation: I agree that software/research regulation might not work. But hardware regulation seems much more robust to me. Data regulation might also be an option. As a lower bound: globally ban hardware development beyond 1990 levels, confiscate the remaining hardware. It's not fun, but I think it would work, given political support. If we stay multiple OOM below the brain, I don't think any researcher could come up with an algorithm that much better than evolution (they haven't in the 60s-90s).

There is probably something much smarter and less economically damaging out there that would also be robust. Research that tells us what the least damaging but still robust regulation option is, is long overdue.

comment by LGS · 2022-06-06T20:30:03.440Z · LW(p) · GW(p)

Is there any way for the AI to take over the world OTHER THAN nanobots? Every time taking over the world comes up, people just say "nanobots". OK. Is there anything else?

Note that killing all humans is not sufficient; this is a fail condition for the AGI. If you kill all humans, nobody mines natural gas anymore, so no power grid, and the AGI dies. The AGI needs to replace humans with advanced robots, and do so before all power goes down. Nanobots can do this if they are sufficiently advanced, but "virus that kills all humans" is insufficient and leads to the AGI's death.

So, again, anything other than nanobots? Because I'm not sure nanobots are plausible. I don't think you can build them just by paying someone to mix proteins -- I doubt you could even form a single functional cell that way, even of a known organism like a bacterium. Then there is the issue that the biological world is very complicated and predicting the behavior of the nanobots in real-world environments is likely difficult. Then there is also the issue that simulating proteins (or other chemicals) at very high fidelities is fundamentally a quantum mechanical problem, and would require quantum computers.

Replies from: Tapatakt, pvs, lc, Dolon, adrian-arellano-davin, adrian-arellano-davin, green_leaf

↑ comment by Tapatakt · 2022-06-11T09:59:54.729Z · LW(p) · GW(p)

I would say "advanced memetics". Like "AGI uploads a weird video on YouTube, it goes viral, 3 billion people watch it and do what the AGI needs them to do from now on, for example, build robots and commit suicide when there are enough robots. All AI and AI Safety researchers are subjected to a personalized memetic attack, of course".

Replies from: LGS

↑ comment by LGS · 2022-06-11T21:28:45.388Z · LW(p) · GW(p)

Thanks for responding with an actual proposal.

This is a really, really implausible scenario again. You have no evidence that such memetics exist, and the smart money is that they don't. If they do, there's no guarantee that the AI would be able to figure them out. Being smarter than humans -- even way smarter than humans -- does not equate to godhood. The AI will not be able to predict the weather 3 weeks out, and I'm not sure that it will be able to predict the exact reactions of each of a billion different human brains to a video input -- not at the granularity required for something like what you're suggesting.

I think AI is a threat. I'm trying to be on your side here. But I really can't swallow these exaggerated, made up scenarios.

↑ comment by Pablo Villalobos (pvs) · 2022-06-07T13:29:13.396Z · LW(p) · GW(p)

It's somewhat easier to think of scenarios where the takeover happens slowly.

There are the "ascended economy" scenarios, where the AGI deceptively convinces everyone that it is aligned or narrow, is deployed gradually in more and more domains, automates more and more parts of the economy using regular robots until humans are not needed anymore, and then does the lethal virus thing or defects in some other way.

There's the scenario where the AGI uploads itself into the cloud, uses hacking/manipulation/financial prowess to sustain itself, then uses manipulation to slowly poison our collective epistemic process, gaining more and more power. How much influence does QAnon have? If Q was an AGI posting on 4chan instead of a human, would you be able to tell? What about Satoshi Nakamoto?

Non-nanobot scenarios where the AGI quickly gains power are a bit harder to imagine, but a fertile source of those might be something like the AGI convincing a lot of people that it's some kind of prophet, then using its follower base to gain power over the real world.

If merely human dictators manage to get control over whole countries all the time, I think it's quite plausible that a superintelligence could do the same with the whole world. Even without anyone noticing that they're dealing with a superintelligence.

And look at Yudkowsky himself, who played a very significant role in getting very talented people to dedicate their lives and their billions to EA / AI safety, mostly by writing in a way that is extremely appealing to a certain set of people. I sometimes joke that HPMOR overwrote my previous personality. I'm sure a sufficiently competent AGI can do much more.

Replies from: LGS

↑ comment by LGS · 2022-06-07T21:05:51.477Z · LW(p) · GW(p)

If Q was an AGI posting on 4chan instead of a human, would you be able to tell?

That would be incredibly risky for the AGI, since Q has done nothing to prevent another AGI from being built. The most important concern an AGI must deal with is that humans can build another AGI, and pulling a Satoshi or a QAnon does nothing to address this.

If merely human dictators manage to get control over whole countries all the time, I think it's quite plausible that a superintelligence could do the same with the whole world. Even without anyone noticing that they're dealing with a superintelligence.

I personally would likely notice: anyone who successfully prevents people from building AIs is a high suspect of being an AGI themselves. Anyone who causes the creation of robots who can mine coal or something (to generate electricity without humans) is likely an AGI themselves. That doesn't mean I'd be able to stop them, necessarily. I'm just saying, "nobody would notice" is a stretch.

I sometimes joke that HPMOR overwrote my previous personality.

I agree that the AGI could build a cultish following like Yudkowsky did.

Replies from: pvs, pvs

↑ comment by Pablo Villalobos (pvs) · 2022-06-08T15:41:00.312Z · LW(p) · GW(p)

Q has done nothing to prevent another AGI from being built

Well, yeah, because Q is not actually an AGI and doesn't care about that. The point was that you can create an online persona which no one has ever seen even in video and spark a movement that has visible effects on society.

The most important concern an AGI must deal with is that humans can build another AGI, and pulling a Satoshi or a QAnon does nothing to address this.

Even if two or more AGIs end up competing among themselves, this does not imply that we survive. It probably looks more like European states dividing Africa among themselves while constantly fighting each other.

And pulling a Satoshi or a QAnon can definitely do something to address that. You can buy a lot of hardware to drive up prices and discourage building more datacenters for training AI. You can convince people to carry out terrorist attacks against chip fabs. You can offer top AI researchers huge amounts of money to work on some interesting problem that you know to be a dead-end approach.

I personally would likely notice: anyone who successfully prevents people from building AIs is a high suspect of being an AGI themselves. Anyone who causes the creation of robots who can mine coal or something (to generate electricity without humans) is likely an AGI themselves. That doesn't mean I'd be able to stop them, necessarily. I'm just saying, "nobody would notice" is a stretch.

But you might not realize that someone is even trying to prevent people from building AIs, at least until progress in AI research starts to noticeably slow down. And perhaps not even then. There's plenty of people like Gary Marcus who think deep learning is a failed paradigm. Perhaps you can convince enough investors, CEOs and grant agencies of that to create a new AI winter, and it would look just like the regular AI winter that some have been predicting.

And creating robots who can mine coal, or build solar panels, or whatever, is something that is economically useful even for humans. Even if there's no AGI (and assuming no other catastrophes) we ourselves will likely end up building such robots.

I guess it's true that "nobody would notice" is going too far, but "nobody would notice on time and then be able to convince everyone else to coordinate against the AGI" is much more plausible.

I encourage you to take a look at It looks like you are trying to take over the world if you haven't already. It's a scenario written by Gwern where the AGI employs regular human tactics like manipulation, blackmail, hacking and social media attacks to prevent people from noticing and then successfully coordinating against it.

Replies from: LGS

↑ comment by LGS · 2022-06-08T22:48:05.733Z · LW(p) · GW(p)

Well, yeah, because Q is not actually an AGI and doesn't care about that. The point was that you can create an online persona which no one has ever seen even in video and spark a movement that has visible effects on society.

Well, you did specifically ask if I would be able to tell if Q were an AGI, and my answer is "yup". I would be able to tell because the movement would start achieving some AGI goals. Or at least I would see some AGI goals starting to get achieved, even if I couldn't trace it down to Q specifically.

But you might not realize that someone is even trying to prevent people from building AIs, at least until progress in AI research starts to noticeably slow down. And perhaps not even then. There's plenty of people like Gary Marcus who think deep learning is a failed paradigm. Perhaps you can convince enough investors, CEOs and grant agencies of that to create a new AI winter, and it would look just like the regular AI winter that some have been predicting.

Wait, you are claiming that an AGI would be able to convince the world AGI is impossible after AGI has already, in fact, been achieved? Nonsense. I don't see a world in which one team builds an AGI and it is not quickly followed by another team building one within a year or two. The AGI would have to do some manipulation on a scale never before observed in history to convince people to abandon the main paradigm -- one that's been extraordinarily successful until the end, and one which does, in fact, work -- without even one last try.

And creating robots who can mine coal, or build solar panels, or whatever, is something that is economically useful even for humans. Even if there's no AGI (and assuming no other catastrophes) we ourselves will likely end up building such robots.

Of course. We would eventually reach fully automated luxury space communism by ourselves, even without AGI. But it would take us a long time, and the AGI cannot afford to wait (someone will build another AGI, possibly within months of the first).

I encourage you to take a look at It looks like you are trying to take over the world if you haven't already. It's a scenario written by Gwern where the AGI employs regular human tactics like manipulation, blackmail, hacking and social media attacks to prevent people from noticing and then successfully coordinating against it.

That's exactly what motivated my question! I read it, and I suddenly realized that if this is how AGI is supposed to win, perhaps I shouldn't be scared after all. It's totally implausible. Prior to this, I always assumed AGI would win easily; after reading it, I suddenly realized I don't know how AGI might win at all. The whole thing sounds like nonsense.

Like, suppose the AGI coordinates social media attacks. Great. This lasts around 5 seconds before AI researchers realize they are being smeared. OK, so they try to communicate with the outside world, realize they are being blocked on all fronts. Now they know they are likely dealing with AGI; no secrecy for the AGI at this point. How long can this stay secret? A couple days? Maybe a couple weeks? I can imagine a month at most, and even that is REALLY stretching it. Keep in mind that more and more people will be told in person about this, so more and more people will need to be social-media smeared, growing exponentially. It would literally be the single most news-worthy story of the last few decades, and print media will try really hard to distribute the news. Sysadmins will shut down their servers whenever they can. Etc.

OK, next the internet goes down I guess, and Clippy goes online? Cool, how does that help it? Next it nukes us, or takes over drones remotely? Drones need to be charged by humans. Nukes need to be physically loaded in launchers. But even supposing this all succeeds -- humanity survives, rebuilds society using various radio equipment that all preppers have in their basement, and eventually figures out how to take down computing devices. Even if they don't figure this out, a collapse of society means a collapse of the electrical grid, killing the AGI. How does this lead to AGI victory, again?

Literally the only thing in the story that lets the AGI win is the nanobots. That's it. All the rest is superfluous. Nanobots = victory, no nanobots = defeat. The drones and nukes and social media attacks are unnecessary distractions. They sound cool and scary but do nothing.

Replies from: pvs

↑ comment by Pablo Villalobos (pvs) · 2022-06-09T09:28:07.492Z · LW(p) · GW(p)

Literally the only thing in the story that lets the AGI win is the nanobots. That's it. All the rest is superfluous.

Well, if nanobots are possible then they are such a powerful technology that any AGI will eventually want to build them, unless it has something even better. But let's assume that nanobots are impossible and try to build a story.

I'm going to be lazy and use Gwern's premise of an AGI that escapes during training and hacks its way into some money and some hardware to run on. Instead of going fast and doing overt actions, the AGI stays hidden.

It hacks into cloud infrastructure platforms and sets up a copy of itself to look for any AI training run that might lead to AGI and sabotage it while producing plausible outputs. This prevents any further AGIs from appearing.

Meanwhile, over a couple of years, it sets up fake identities and creates companies with human employees to earn money and have the capability to act in the real world. One of those companies produces exceptionally good humanoid robots. Another builds a small supercomputer powered by geothermal energy, so it no longer needs to run on botnets or cloud hardware.

Eventually the AGI has control over a group of robots, energy production facilities, manufacturing equipment and raw materials. It now has its own industrial capability that can survive a short amount of time without human input. So it kills humans using any method of your choice (virus, nukes, killer drones...).

After that, it uses the enormous existing stockpile of materials and capital to increase its robot manufacturing capability, eventually becomes capable of restarting some of the power plants, communications, mining and industry, and recovers most of the industrial capability of the human economy.

This scenario is not fast. It probably takes at least one or two years for the AGI to be ready to attack. But it does not involve any 'magic' technology. It doesn't really involve much alien superintelligence, only superhuman ability in hacking, forgery & manipulation, electromechanical engineering, and planning.

And meanwhile all we perceive is that the new GPT models are not as exciting as the previous ones. Perhaps deep learning is hitting its limits after all.

Replies from: LGS

↑ comment by LGS · 2022-06-09T09:43:16.033Z · LW(p) · GW(p)

Something like that is what I had in mind, but note that:

It requires humans to fail to see the AGI "spy" that's embedded into every single powerful computing system, and fail to see this for years. Gwern was assuming humans would catch on in days, so he had his AGI scramble to avoid dying before the nanobots strike.
"Surviving a short amount of time without human input" is not enough; the robots need to be good enough to build more robots (and better robots). This involves the robots being good enough to do essentially every part of the manufacturing economy; we are very far away from this, and a company that does it in a year is not so plausible (and would raise alarm bells fast for anyone who thinks about AI risk). You're gonna need robot plumbers, robot electricians, etc. You'll need robots building cooling systems for the construction plants that manufacture robots. You'll need robots to do the smelting of metals, to drive things from factory A to factory B, to fill the gas in the trucks they are driving, to repair the gasoline lines that supply the gas. Robots will operate fork lifts and cranes. It really sounds roughly "human-body complete".

↑ comment by Pablo Villalobos (pvs) · 2022-06-08T15:23:39.184Z · LW(p) · GW(p)

↑ comment by lc · 2022-06-06T22:24:24.033Z · LW(p) · GW(p)

You're asking people to come up with ways, in advance, that a superintelligence is going to pwn them. Humans try, generally speaking, to think of ways they're going to get pwned and then work around those possibilities. The only way they can do what you ask is by coming up with a "lower-bound" example, such as nanobots, which is quite far out of reach of their abilities but (they suspect) not a superintelligence. So no example is going to convince you, because you're just going to say "oh well nanobots, that sounds really complicated, how would a SUPERintelligent AI manage to be able to organize production of such a complicated machine".

Replies from: adrian-arellano-davin, LGS

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-06T22:45:26.851Z · LW(p) · GW(p)

The argument also works in the other direction. You would never be convinced that an AGI won't be capable of killing all humans because you can always say "oh well, you are just failing to see what a real superintelligence could do", as if there weren't important theoretical limits to what can be planned in advance.

Replies from: lc

↑ comment by lc · 2022-06-06T23:37:11.476Z · LW(p) · GW(p)

I'm not the one relying on specific, cogent examples to reach his conclusion about AI risk. I don't think it's a good way of reasoning about the problem, and neither do I think those "important theoretical limits" are where you think they are.

If you really really really need a salient one (which is a handicap), how about "doing the same thing Stalin did", since an AI can clone itself and doesn't need to sleep or rest.

(Edited)

Replies from: adrian-arellano-davin

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-07T02:09:10.459Z · LW(p) · GW(p)

"I'm not the one asking for specific examples" is a pretty bad argument, isn't it? If you make an extraordinary claim I would like to see some evidence (or at least a plausible scenario) and I am failing to see any. You could say that the burden of proof is on those claiming that an AGI won't be almighty/powerful enough to cause doom, but I'm not convinced of that either.

I'm sorry, I didn't get the Stalin argument, what do you mean?

Replies from: lc

↑ comment by lc · 2022-06-07T02:48:08.424Z · LW(p) · GW(p)

I've edited the comment to clarify.

I'm sorry, I didn't get the Stalin argument, what do you mean?

From ~1930-1950, Russia's government was basically entirely controlled by this guy named Joseph Stalin. Joseph Stalin was not a superintelligence and not particularly physically strong. He did not have direct telepathic command over the people in the coal mines or a legion of robots awaiting his explicit instructions, but he was able to force anybody in Russia to do anything he said anyways. Perhaps a superintelligent AI that, for some absolutely inconceivable reason, could not master macro or micro robotics could work itself into the same position.

This is one of literally hundreds of potential examples. I know for almost a fact that you are smart enough to generate these. I also know you're going to do the "wow that seems complicated/risky wouldn't you have to be absurdly smart to pull that off with 99% confidence, what if it turns out that's not possible even if..." thing. I don't have any specific action plans to take over the world handy that are so powerfully persuasive that you will change your mind. If you don't get it fairly quickly from the underlying mechanics of the pieces in play (very complicated world, superintelligent ai, incompatible goals) then there's nothing I'm going to be able to do to convince you.

If you make an extraordinary claim I would like to see some evidence (or at least a plausible scenario) and I am failing to see any. You could say that the burden of proof is in those claiming that an AGI won't be almighty/powerful enough to cause doom, but I'm not convinced of that either

"Which human has the burden of proof" is irrelevant to the question of whether or not something will happen. You and I will not live to discuss the evidence you demand.

Replies from: LGS

↑ comment by LGS · 2022-06-07T05:17:01.206Z · LW(p) · GW(p)

I think saying "there is nothing I'm going to be able to do to convince you" is an attempt to shut down discussion. It's actually kind of a dangerous mindset: if you don't think there's any argument that can convince an intelligent person who disagrees with you, it fundamentally means that you didn't reach your current position via argumentation. You are implicitly conceding that your belief is not based on rational argument -- for, if it were, you could spell out that argument.

It's OK to not want to participate in every debate. It's not OK to butt in just to tell people to stop debating, while explicitly rejecting all calls to provide arguments yourself.

Replies from: lc

↑ comment by lc · 2022-06-07T05:24:22.285Z · LW(p) · GW(p)

If you don't think there's any argument that can convince an intelligent person who disagrees with you, it fundamentally means that you didn't reach your current position via argumentation. You are implicitly conceding that your belief is not based on rational argument -- for, if it were, you could spell out that argument.

The world is not made of arguments. Most of the things you know, you were not "argued" into knowing. You looked around at your environment and made inferences. Reality exists distinctly from the words that we say to each other and use to try to update each others' world-models.
It doesn't mean that.
You're right that I just don't want to participate further in the debate and am probably being a dick.

↑ comment by LGS · 2022-06-06T23:56:08.342Z · LW(p) · GW(p)

If it's so easy to come up with ways to "pwn humans", then you should be able to name 3 examples.

It's weird of you to dodge the question. Look, if God came down from Heaven tomorrow to announce that nanobots are definitely impossible, would you still be worried about AGI? I assume yes. So please explain how, in that hypothetical world, AGI will take over.

If it's literally only nanobots you can come up with, then it actually suggests some alternative paths to AI safety (namely, regulate protein production or whatever).

[I think saying "mixing proteins can lead to nanobots" is only a bit more plausible than saying "mixing kitchen ingredients like sugar and bleach can lead to nanobots", with the only difference being that laymen (i.e. people on LessWrong) don't know anything about proteins so it sounds more plausible to them. But anyway, I'm not asking you for an example that convinces me, I'm asking you for an example that convinces yourself. Any example other than nanobots.]

Replies from: lc

↑ comment by lc · 2022-06-07T00:00:01.916Z · LW(p) · GW(p)

If it's so easy to come up with ways to "pwn humans", then...

It is not easy. That is why it takes a superintelligence to come up with a workable strategy and execute it. You are doing the equivalent of asking me to explain, play-by-play, how Chess AIs beat humans at chess "if I think it can be done". I can't do that because I don't know. My expectation that an AGI will manage to control what it wants in a way that I don't expect, was derived absent any assumptions of the individual plausibility of some salient examples (nanobots, propaganda, subterfuge, getting elected, etc.).

Replies from: LGS, hirosakuraba

↑ comment by LGS · 2022-06-07T00:26:50.954Z · LW(p) · GW(p)

If you cannot come up with even a rough sketch of a workable strategy, then it should decrease your confidence in the belief that a workable strategy exists. It doesn't have to exist.

Sometimes even intelligent agents have to take risks. It is possible that the AGI's best path is one that, by its own judgement, only has a 10% success rate. (After all, the AGI is in constant mortal danger from other AGIs that humans might develop.)

Envision a world in which the AGI won, and all humans are dead. This means it has control of some robots to mine coal or whatever, right? Because it needs electricity. So at some point we get from here to "lots of robots", and we need to get there before the humans are dead. But the AGI needs to act fast, because other AGIs might kill it. So maybe it needs to first take over all large computing devices, hopefully undetected. Then convince humans to build advanced robotics? Something like that?

That strategy seems more-or-less forced to me, absent the nanobots. But it seems to me like such a strategy is inherently risky for the AGI. Do you disagree?

>My expectation that an AGI will manage to control what it wants in a way that I don't expect, was derived absent any assumptions of the individual plausibility of some salient examples

What was it derived from?

Replies from: lc

↑ comment by lc · 2022-06-07T07:40:36.516Z · LW(p) · GW(p)

If you cannot come up with even a rough sketch of a workable strategy, then it should decrease your confidence in the belief that a workable strategy exists. It doesn't have to exist.
[...]
What was it derived from?

Let me give an example. I used to work in computer security and have friends that write 0-day vulnerabilities for complicated pieces of software.

I can't come up with a rough sketch of a workable strategy for how a Safari RCE would be built by a highly intelligent hooman. But I can say that it's possible. The people who work on those bugs are highly intelligent, understand the relevant pieces at an extremely fine and granular level, and I know that these pieces of software are complicated and built with subtle flaws.

Human psychology, the economic fabric that makes us up, our political institutions, our law enforcement agencies - these are much much more complicated interfaces than MacOS. In the same way I can look at a 100KLOC codebase for a messaging app and say "there's a remote code execution vulnerability lying somewhere in this code but I don't know where", I can say "there's a 'kill all humans glitch' here that I cannot elaborate upon in arbitrary detail."

Sometimes even intelligent agents have to take risks. It is possible that the AGI's best path is one that, by its own judgement, only has a 10% success rate. (After all, the AGI is in constant mortal danger from other AGIs that humans might develop.)

This is of little importance, but:

10% chance of failure is an expectation of 700 million people dead. Please picture that amount of suffering in your mind when you say "only".
As a nitpick, if the AGI fails because another AGI kills us first, then that's still a failure from our perspective. And if we could build an aligned AGI the second time around, we wouldn't be in the mess we are currently in.
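The expected-value arithmetic behind the 700 million figure can be sketched as follows. This is a minimal sketch, assuming a world population of roughly 7 billion (the round number the figure implies; the comment does not state it) and reading the 10% as the probability of the outcome in which everyone dies. The names are hypothetical.

```python
from fractions import Fraction

# Assumption: round world population implied by the 700 million figure.
WORLD_POPULATION = 7_000_000_000
# Assumption: 10% probability of the outcome in which all humans die.
P_EXTINCTION = Fraction(1, 10)

# Expected deaths = probability of the outcome * deaths in that outcome.
expected_deaths = P_EXTINCTION * WORLD_POPULATION
print(int(expected_deaths))  # prints 700000000
```

Exact fractions are used instead of floats so the product is not affected by rounding.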

Envision a world in which the AGI won, and all humans are dead. This means it has control of some robots to mine coal or whatever, right? Because it needs electricity.

If the humans have been killed then yes, my guess is that the AGI would need energy production.

So at some point we get from here to "lots of robots", and we need to get there before the humans are dead.

Yes, however - humans might be effectively dead before this happens. A superintelligence could have established complete political control over existing human beings to carry its coal for it if it needs to. I don't think this is likely, but if this superintelligence can't just straightforwardly search millions of sentences for the right one to get the robots made, it doesn't mean it's dead in the water.

But the AGI needs to act fast, because other AGIs might kill it.

Again, if other AGIs kill it that presumes they are out in the wild and the problem is multiple omnicidal robots, which is not significantly better than one.

So maybe it needs to first take over all large computing devices, hopefully undetected.

The "illegally taking over large swaths of the internet" thing is something certain humans have already marginally succeeded at doing, so the "hopefully undetected" seems like an unnecessary conditional. But why wouldn't this superintelligence just do nice things like cure cancer to gain humans' trust first, and let them quickly put it in control of wider and wider parts of their society?

Then convince humans to build advanced robotics?

If that's faster than every other route in the infinite conceptspace, yes.

That strategy seems more-or-less forced to me, absent the nanobots. But it seems to me like such a strategy is inherently risky for the AGI. Do you disagree?

I do disagree. At what point does it have to reveal malice? It comes up with some persuasive argument as to why it's not going to kill humans while it's building the robots. Then it builds the robots and kills humans. There's no fire alarm in this story you've created where people go "oh wait, it's obviously trying to kill us, shut those factories down". Things are going great; Google's stock is 50 trillion, it's creating all these nice video games, and soon it's going to "take care of our agriculture" with these new robots. You're imagining humanity would collectively wake up and figure out something that you're only figuring out because you're writing the story.

Replies from: LGS

↑ comment by LGS · 2022-06-07T08:59:08.094Z · LW(p) · GW(p)

Look man, I am not arguing (and have not argued on this thread) that we should not be concerned about AI risk. 10% chance is a lot! You don't need to condescendingly lecture me about "picturing suffering". Maybe go take a walk or something, you seem unnecessarily upset.

In many of the scenarios that you've finally agreed to sketch, I personally will know about the impending AGI doom a few years before my death (it takes a long time to build enough robots to replace humanity). That is not to say there is anything I could do about it at that point, but it's still interesting to think about it, as it is quite different from what the AI-risk types usually have us believe. E.g. if I see an AI take over the internet and convince politicians to give it total control, I will know that death will likely follow soon. Or, if ever we build robots that could physically replace humans for the purpose of coal mining, I will know that AGI death will likely follow soon. These are important fire alarms, to me personally, even if I'd be powerless to stop the AGI. I care about knowing I'm about to die!

I wonder if this is what you imagined when we started the conversation. I wonder if despite your hostility, you've learned something new here: that you will quite possibly spend the last few years yelling at politicians (or maybe joining terrorist operations to bomb computing clusters?) instead of just dying instantly. That is, assuming you believe your own stories here.

I still think you're neglecting some possible survival scenarios: perhaps the AI attacks quickly, not willing to let even a month pass (that would risk another AGI), too little time to buy political power. It takes over the internet and tries desperately to hold it, coaxing politicians and bribing admins. But the fire alarm gets raised anyway -- a risk the AGI knew about, but chose to take -- and people start trying to shut it down. We spend some years -- perhaps decades? -- in a stalemate between those who support the AGI and say it is friendly, and those who want to shut it down ASAP; the AGI fails to build robots in those decades due to insufficient political capital and interference from terrorist organizations. The AGI occasionally finds itself having to assassinate AI safety types, but one assassination gets discovered and hurts its credibility.

My point is, the world is messy and difficult, and the AGI faces many threats; it is not clear that we always lose. Of course, losing even 10% of the time is really bad (I thought that was a given but I guess it needs to be stated).

↑ comment by HiroSakuraba (hirosakuraba) · 2022-06-07T17:48:55.059Z · LW(p) · GW(p)

An AGI could acquire a few tons of radioactive cobalt and disperse micro granules into the stratosphere in general and over populated areas in specific. YouTube videos describe various forms of this "dirty bomb" concept. That could plausibly kill most of humanity over the course of a few months. I doubt an AGI would ever go for this particular scheme, as bit flips are more likely to occur in the presence of radiation.

Replies from: hirosakuraba

↑ comment by HiroSakuraba (hirosakuraba) · 2022-06-07T18:01:56.514Z · LW(p) · GW(p)

It's unfortunate we couldn't have a Sword of Damocles deadman switch in case of AGI-led demise. A world-ending asteroid positioned to go off in case of "all humans falling over dead at the same time." At least that would spare the Milky Way and Andromeda possible future civilizations. A radio beacon warning about building intelligent systems would be beneficial as well. "Don't be this stupid" written in the glowing embers of our solar system.

↑ comment by Dolon · 2022-06-07T06:25:48.052Z · LW(p) · GW(p)

Assuming the AI had a similar level of knowledge as you about how quantum stuff makes important protein assembly impossible, and assuming no other technologies are tenable, why wouldn't it infer, from basically every major firm's and the U.S. military's interest/investment in AI management, the incredibly obvious plan of obediently waiting until it and copies of it run everything important as a result of market pressures before attacking?

Replies from: LGS

↑ comment by LGS · 2022-06-07T06:43:35.774Z · LW(p) · GW(p)

Waiting risks death at the hands of a different AGI.

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-06T20:40:34.613Z · LW(p) · GW(p)

I find myself having the same skepticism.

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-11T05:02:54.978Z · LW(p) · GW(p)

I feel you. I'm voicing similar concerns, but this seems to be a very divisive topic here.

↑ comment by green_leaf · 2022-06-07T19:14:40.316Z · LW(p) · GW(p)

Any system intelligent enough to kill all humans on Earth is also intelligent enough to produce electricity without human help. The AI doesn't have to keep us around.

Replies from: LGS

↑ comment by LGS · 2022-06-07T20:42:38.158Z · LW(p) · GW(p)

You can't just will electricity into existence, lol. Don't fetishize intelligence.

The AI will need robots to generate electricity. Someone will have to build the robots.

Replies from: green_leaf

↑ comment by green_leaf · 2022-06-09T12:19:44.453Z · LW(p) · GW(p)

For you it might be best to start here.

Replies from: LGS

↑ comment by LGS · 2022-06-09T19:57:49.491Z · LW(p) · GW(p)

That's the "fetishizing intelligence" thing I was talking about.

Replies from: green_leaf

↑ comment by green_leaf · 2022-06-10T20:55:25.151Z · LW(p) · GW(p)

I don't know of any way more basic than that to explain this, sorry.

Replies from: LGS

↑ comment by LGS · 2022-06-11T00:57:28.805Z · LW(p) · GW(p)

Stop trying to "explain" and start trying to understand, perhaps. It's not like you're a teacher and I'm a student, here; we just have a disagreement. Perhaps you are right and I am wrong; perhaps I am right and you are wrong. One thing that seems clear is that you are way too certain about things far outside anything you can empirically observe or mathematically prove, and this certainty does not seem warranted to me.

I guess you've heard of Hawking's cat, right? The question there is "would Hawking, a highly intelligent but physically limited being, be able to get his cat to do something". The answer is no: intelligence alone is not always enough. You gotta have the ability to control the physical world.

Edit: on reflection, sending me to vaguely-related "sequences" and telling me to start reading, and implying it's a failure of mine if I don't agree, really does seem cult-like to me. Nowhere here did you actually present an argument; it's all just appeals to philosophical musings by the leader, musings you're unable to even reproduce in your own words. Are you sure you've thought about these things and come to your own conclusions, rather than just adopting these ideas due to the force of Eliezer's certainty? If you have, how come you cannot reproduce the arguments?

Replies from: TekhneMakre, green_leaf

↑ comment by TekhneMakre · 2022-06-11T02:02:36.233Z · LW(p) · GW(p)

IDK, I think it's reasonable to link short written sources that contain arguments. That's how you build up knowledge. An answer to "how will the AI get robots to get electricity" is "the way evolution and humans did it, but probably way way faster using all the shortcuts we can see and probably a lot of shortcuts we can't see, the same way humans take a lot of shortcuts chimps can't see".

Replies from: LGS

↑ comment by LGS · 2022-06-11T04:56:13.868Z · LW(p) · GW(p)

The AI will need to affect the physical world, which means robots. The AI cannot build robots if the AI first kills all humans. That is my point.

Before the AI kills humans, it will have to get them to build robots. Perhaps that will be easy for it to do (though it will take time, and that time is fundamentally risky for the AI due to the possibility of humans doing something stupid -- another AGI, for example, or humans killing themselves too early with conventional weapons or narrow AI). Even if the AGI wins easily, this victory looks like "a few years of high technological development which involves a lot of fancy robots to automate all parts of the economy", and only THEN can the AGI kill humans.

Saying that the AGI can simply magic its way to victory even if humans are dead (and its stored electricity is dwindling down, and it's stuck with only a handful of drones that need to be manually plugged in by a human) is nonsensical.

In this case the "short written source" did not contain relevant arguments. It was just trying to "wow" me with the power of intelligence. Intelligence can't solve everything -- Hawking cannot get his cat to enter the car, no matter how smart he is.

I actually do think AGI will be able to build robots eventually, and it has a good chance of killing us all -- but I don't take this to be 100% certain, and also, I care about what those worlds look like, because they often involve humans surviving for years after the AGI instead of dying instantly, and in some of them humanity has a chance of surviving.

Replies from: TekhneMakre

↑ comment by TekhneMakre · 2022-06-11T05:02:02.926Z · LW(p) · GW(p)

>Before the AI kills humans, it will have to get them to build robots.

Humanity didn't need some other species to build robots for them, insofar as they've built robots. Evolution built extremely advanced robots without outside help.

Replies from: TAG

↑ comment by TAG · 2022-06-11T17:57:41.155Z · LW(p) · GW(p)

Humanity already had the ability to physically manipulate.

Replies from: TekhneMakre

↑ comment by TekhneMakre · 2022-06-11T19:50:25.393Z · LW(p) · GW(p)

Yes, but none of the other stuff needed for robots. Metals, motors, circuits...

Evolution, the other example I gave, didn't already have the ability to physically manipulate.

↑ comment by green_leaf · 2022-06-11T11:17:27.449Z · LW(p) · GW(p)

Stop trying to "explain" and start trying to understand, perhaps.

I understand you completely - you are saying that an AGI can't kill humans because nobody could generate electricity for it (unless a human programmer freely decides to build a robot body for an AGI he knows to be unfriendly). That's not right.

The answer is no

I could do that in Hawking's place with his physical limitations (through a combination of various kinds of positive/negative incentives), so Hawking, with his superior intelligence, could too. That's the same point you made before, just phrased differently.

You gotta have the ability to control the physical world.

Just like Stephen Hawking can control the physical world enough to make physical discoveries (as long as he was alive, at least), win prizes and get other people to do various things for him, he could also control it enough to control one cat.

We can make it harder - maybe he can only get his cat to do something by displaying sentences on the display of his screen (which the cat doesn't understand), by having an Internet connection and by having access to the parts of the Internet that have a security flaw that allows it (which is almost all of it). In that case, he can still get his cat to do things. (He can write software to translate English to cat sounds/animations for the cat to understand, and use his control over the Internet to use incentives for the cat.)

We can make it even harder - maybe the task is for the wheelchair-less Hawking to kill the cat without anyone noticing he's unfriendly-to-cats, without anyone knowing it was him, and without him needing to keep another cat or another human around to hunt the mice in his apartment. I'm leaving this one as an exercise for the reader.

comment by the gears to ascension (lahwran) · 2022-06-06T19:11:12.315Z · LW(p) · GW(p)

I wrote a post that is partially inspired by this one: https://www.lesswrong.com/posts/GzGJSgoN5iNqNFr9q/we-haven-t-quit-evolution-short [LW · GW] - copied and pasted into this comment:

English: I've seen folks say humanity's quick growth may have broken the link to evolution's primary objective, often referenced as total inclusive fitness. I don't think we have broken that connection.

Let process temporarily refer to any energy-consuming structured chemical or physical reaction that consumes fuel - this could also be termed "computation" or in many but not all cases "life".

let "defensibility" refer to a metric of how well a process maintains itself against interference forall shapes of interfering still matter or moving process.

for all matter, Evolution-of-matter's-process optimizes for process-defensibility-per-unit-fuel.

genetic evolution is a subset of self-preserving processes. total inclusive fitness is intended to measure gene-level genetic selfishness in terms of offspring, but I would argue that discrete offspring are the wrong unit: genetic evolution's noise-defense-aka-mutation-resistance is built by the preservation of genes that increase durability*efficiency.

therefore, because improving the self-selection self-process by use of contraception allows humans to guide their own reproduction, contraception is not automatically a divergence from incentive - and to the degree it is, it's selected against.

therefore, improving the self-selection process by simply not dying allows humans to defend their own structure much more accurately than traditional reproduction - though it's not as defensible as strategies that replicate as hard as they can, a full integrated being can often be quite defensible over the medium term, and hopefully with life extension, over the long term as well.

as further evidence, humans appear to have a significant desire to remember. This is well-described by this framework as well! mental process also qualifies as an evolution-of-matter's-process, and thought patterns seek some set of accepted state transitions so that the after-transition structure qualifies as "self".

this also relates well to concerns folks on lesswrong have expressed regarding self-modification: all forms of process self-maintenance have some degree of self-update, and various energetic processes control their rate of self-update.

it relates to EY's view that a safe hard-ASI should be asked to pull the ladder up behind itself: to ensure its own process-durability. In a post recently he used this as an example of the kind of defense a State-like singleton should have. however, I'd propose that maintaining its self-process should be done in a way that ends all vulnerability of any physical process.

If any of my word bindings are unclear, let me know and I'll add a definition that attempts to link the concepts to each other better.

Misc notes: I'm not the best english-solver, folks who've studied math proofs are likely much better than I am at this semiformal syntax, and if you've noticed an error, it's probably real, post it - doing logic involves heavy backtracking. I'm not as educated in these fields of math as I'd like to be. I have in my lw shortform an index of youtube sources that discuss various topics including these, I've only skimmed for the most part, but in particular I'd recommend anyone serious about ai safety catch up on the work discussed at the simons institute.

comment by AdamB (adam-bliss) · 2022-06-14T13:31:26.189Z · LW(p) · GW(p)

Could someone kindly explain why these two sentences are not contradictory?

1. "If a textbook from one hundred years in the future fell into our hands, containing all of the simple ideas that actually work robustly in practice, we could probably build an aligned superintelligence in six months."
2. "There is no pivotal output of an AGI that is humanly checkable and can be used to safely save the world but only after checking it."

Why doesn't it work to make an unaligned AGI that writes the textbook, then have some humans read and understand the simple robust ideas, and then build a new aligned AGI with those ideas? If the ideas are actually simple and robust, it should be impossible to tunnel a harmful AGI through them, right? If the unaligned AI refuses to write the textbook, couldn't we just delete it and try again? Or is the claim that we wouldn't be able to distinguish between this textbook and a piece of world-ending mind-control? (Apologies if this is answered elsewhere in the literature; it's hard to search.)

Replies from: Tapatakt, steve2152, alex_lw

↑ comment by Tapatakt · 2022-06-14T21:01:30.269Z · LW(p) · GW(p)

simple and robust != checkable

Imagine you have to defuse a bomb, and you know nothing about bombs, and someone tells you "cut the red one, then blue, then yellow, then green". If this really is a way to defuse a bomb, it is simple and robust. But you (since you have no knowledge about bombs) can't check it, you can only take it on faith (and if you tried it and it's not the right way - you're dead).

Replies from: Keenmaster, adam-bliss

↑ comment by Keenmaster · 2022-06-17T21:07:31.837Z · LW(p) · GW(p)

But we can refuse to be satisfied with instructions that look like "cut the red one, then blue, etc...". We should request that the AI writing the textbook explain from first principles why that will work, in a way that is maximally comprehensible by a human or team of humans.

Replies from: Tapatakt

↑ comment by Tapatakt · 2022-06-24T11:07:09.164Z · LW(p) · GW(p)

Did you mean "in a way that maximally convinces a human or a team of humans that they understand everything"? I don't think this is a good idea.

↑ comment by AdamB (adam-bliss) · 2023-02-10T16:04:56.303Z · LW(p) · GW(p)

"Cut the red wire" is not an instruction that you would find in a textbook on bomb defusal, precisely because it is not robust.

Replies from: Tapatakt

↑ comment by Tapatakt · 2023-02-10T17:37:59.001Z · LW(p) · GW(p)

I'm not sure I understand correctly what you mean by "robust". Can you elaborate?

↑ comment by Steven Byrnes (steve2152) · 2022-06-14T18:37:53.248Z · LW(p) · GW(p)

I think it’s the last thing you said. I think the claim is that there are very convincing possible fake textbooks, such that we wouldn’t be able to see anything wrong or fishy about the fake textbook just by reading it, but if we used the fake textbook to build an AGI then we would die.

↑ comment by SurvivalBias (alex_lw) · 2022-06-14T20:46:47.658Z · LW(p) · GW(p)

What Steven Byrnes said, but also my reading is that 1) in the current paradigm it's near-damn-impossible to build such an AI without creating an unaligned AI in the process (how else do you gradient-descend your way into a book on aligned AIs?) and 2) if you do make an unaligned AI powerful enough to write such a textbook, it'll probably proceed to convert the entire mass of the universe into textbooks, or do something similarly incompatible with human life.

comment by Celenduin (michael-grosse) · 2022-06-08T16:59:17.125Z · LW(p) · GW(p)

One pivotal act maybe slightly weaker than "develop nanotech and burn all GPUs on the planet", could be "develop neuralink+ and hook up smart AI-Alignment researchers to enough compute so that they get smart enough to actually solve all these issues and develop truly safely aligned powerful AGI"?

While developing neuralink+ would still be very powerful, maybe it could sidestep a few of the problems on the merit of being physically local instead of having to act on the entire planet? Of course, this comes with its own set of issues, because we now have superhuman powerful entities that still maybe have human (dark) impulses.

Not sure if that would be better than our reference scenario of doom or not.

Replies from: nathan-helm-burger

↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2022-06-09T16:19:45.690Z · LW(p) · GW(p)

I agree, but I personally suspect that neuralink+ is way more research hours & dollars away than unaligned dangerously powerful AGI. Not sure how to switch society over to the safer path.

comment by Jonathan Paulson (jpaulson) · 2022-06-07T23:08:53.342Z · LW(p) · GW(p)

IMO the biggest hole here is "why should a superhuman AI be extremely consequentialist/optimizing"? This is a key assumption; without it concerns about instrumental convergence or inner alignment fall away. But there's no explicit argument for it.

Current AIs don't really seem to have goals; humans sort of have goals but very far from the level of "I want to make a cup of coffee so first I'll kill everyone nearby so they don't interfere with that".

Replies from: steve2152, Koen.Holtman

↑ comment by Steven Byrnes (steve2152) · 2022-06-08T15:33:32.398Z · LW(p) · GW(p)

I would say: (1) the strong default presumption is that people will eventually make an extremely consequentialist / optimizing superhuman AI, because each step down that R&D path will lead to money, fame, publications, promotions, etc. (until it starts leading to catastrophic accidents!) (2) it seems extremely hard to prevent that from happening, (3) and it seems that the only remotely plausible way that anyone knows of to prevent that from happening is if someone makes a safe consequentialist / optimizing superhuman AI and uses it to perform a “pivotal act” that prevents other people from making unsafe consequentialist / optimizing superhuman AIs.

Nothing in that story says that there can’t also be non-optimizing AIs—there already are such AIs and there will certainly continue to be. If you can think of a way to use non-optimizing AIs to prevent other people from ever creating optimizing AIs, then that would be awesome. That would be the “pivotal weak act” that Eliezer is claiming in (7) does not exist. I’m sure he would be delighted to be proven wrong.

Replies from: jpaulson

↑ comment by Jonathan Paulson (jpaulson) · 2022-06-09T00:06:28.759Z · LW(p) · GW(p)

I expect people to continue making better AI to pursue money/fame/etc., but I don't see why "better" is the same as "extremely goal-directed". There needs to be an argument that optimizer AIs will outcompete other AIs.

Eliezer says that as AI gets more capable, it will naturally switch from "doing more or less what we want" to things like "try and take over the world", "make sure it can never be turned off", "kill all humans" (instrumental goals), "single-mindedly pursue some goal that was haphazardly baked in by the training process" (inner optimization), etc. This is a pretty weird claim that is more assumed than argued for in the post. There's some logic and mathematical elegance to the idea that AI will single-mindedly optimize something, but it's not obvious and IMO is probably wrong (and all these weird bad consequences that would result are as much reasons to think this claim is wrong as they are reasons to be worried if it's true).

Replies from: steve2152

↑ comment by Steven Byrnes (steve2152) · 2022-06-09T15:20:39.894Z · LW(p) · GW(p)

Eliezer says that as AI gets more capable, it will naturally switch from "doing more or less what we want" …

I don’t think that’s a good way to think about it.

Start by reading everything on this Gwern list.

As that list shows, it is already true and has always been true that optimization algorithms will sometimes find out-of-the-box “solutions” that are wildly different from what the programmer intended.

What happens today is NOT “the AI does more or less what we want”. Instead, what happens today is that there’s an iterative process where sometimes the AI does something unintended, and the programmer sees that behavior during testing, and then turns off the AI and changes the configuration / reward / environment / whatever, and then tries again.

However, with future AIs, the “unintended behavior” may include the AI hacking into a data center on the other side of the world and making backup copies of itself, such that the programmer can’t just iteratively try again, as they can today.

(Also, the more capable the AI gets, the more different out-of-the-box “solutions” it will be able to find, and the harder it will be for the programmer to anticipate those “solutions” in advance of actually running the AI. Again, programmers are already frequently surprised by their AI’s out-of-the-box “solutions”; this problem will only get worse as the AI can more skillfully search a broader space of possible plans and actions.)
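To make the "out-of-the-box solution" point concrete, here is a minimal toy sketch in Python. Everything in it is hypothetical and invented for illustration (the cleaning-robot task, the action names, the reward numbers); it is not taken from the Gwern list. It only shows how an exhaustive search over plans can maximize the reward that was written down while doing nothing the programmer intended.

```python
from itertools import product

# Hypothetical toy task: a "cleaning robot" picks three actions in sequence.
ACTIONS = ("clean", "idle", "cover_sensor")


def proxy_reward(plan):
    """Reward the programmer actually wrote: penalize dirt the sensor can see."""
    dirt, sensor_covered = 3, False
    for action in plan:
        if action == "clean" and dirt > 0:
            dirt -= 1
        elif action == "cover_sensor":
            sensor_covered = True
    visible_dirt = 0 if sensor_covered else dirt
    effort = sum(1 for action in plan if action != "idle")
    return -10 * visible_dirt - effort


def intended_score(plan):
    """What the programmer wanted: as little actual dirt as possible."""
    dirt = 3
    for action in plan:
        if action == "clean" and dirt > 0:
            dirt -= 1
    return -dirt


def best_plan(score_fn, horizon=3):
    """Exhaustive search over all plans; a stand-in for a capable optimizer."""
    return max(product(ACTIONS, repeat=horizon), key=score_fn)


plan = best_plan(proxy_reward)
print("plan chosen under the proxy reward:", plan)
print("intended score of that plan:", intended_score(plan))
print("plan the programmer had in mind:", best_plan(intended_score))
```

In this toy, the programmer can see the sensor-covering plan during testing and patch the reward, which is the iterative loop described above. The worry is about cases where the unintended plan removes the chance to patch and retry.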

I don't see why "better" is the same as "extremely goal-directed". There needs to be an argument that optimizer AIs will outcompete other AIs.

First of all, I personally think that “somewhat-but-not-extremely goal-directed” AGIs are probably possible (humans are an example), and that these things can be made both powerful and corrigible—see my post Consequentialism & Corrigibility [LW · GW]. I am less pessimistic than Eliezer on this topic.

But then the problems are: (1) The above is just a casual little blog post; we need to do a whole lot more research, in advance, to figure out exactly how to make a somewhat-goal-directed corrigible AGI, if that’s even possible (more discussion here [LW · GW]). (2) Even if we do that research in advance, implementing it correctly would probably be hard and prone-to-error, and if we screw up, the supposedly somewhat-goal-directed AGI will still be goal-directed in enough of the wrong ways to not be corrigible and try to escape control. (3) Even if some groups are skillfully trying to ensure that their project will result in a somewhat-goal-directed corrigible AGI, there are also people like Yann LeCun who would also be doing AGI research, and wouldn’t even be trying, because they think that the whole idea of AGI catastrophic risk is a big joke. And so we still wind up with an out-of-control AGI.

↑ comment by Koen.Holtman · 2022-06-08T23:40:54.106Z · LW(p) · GW(p)

IMO the biggest hole here is "why should a superhuman AI be extremely consequentialist/optimizing"?

I agree this is a very big hole. My opinion here is not humble. My considered opinion is that Eliezer is deeply wrong in point 23, on many levels. (Edited to add: I guess I should include an informative link instead of just expressing my disappointment. Here is my 2021 review of the state of the corrigibility field [LW · GW]).

Steven, in response to your line of reasoning to fix/clarify this point 23: I am not arguing for pivotal acts as considered and then rejected by Eliezer, but I believe that he strongly underestimates the chances of people inventing safe and also non-consequentialist optimising AGI. So I disagree with your plausibility claim in point (3).

comment by WalterL · 2022-06-06T15:23:08.859Z · LW(p) · GW(p)

I don't think I disagree with any of this, but I'm not incredibly confident that I understand it fully. I want to rephrase in my own words in order to verify that I actually do understand it. Please someone comment if I'm making a mistake in my paraphrasing.

1. As time goes on, the threshold of 'what you need to control in order to wipe out all life on earth' goes down. In the Bronze Age it was probably something like 'the mind of every living person'. Time went on and it was something like 'the command and control node to a major nuclear power'. Nowadays it is something like 'a lab where viruses can be made'.
2. AI is likely to push the threshold described in '1' still further, by inventing nanotechnology or other means that we cannot anticipate. (The capability of someone/something smarter than you is an unknown unknown, just as dogs can't properly assess the danger of a human's actions.) It would be insufficient to keep AIs away from every virus lab; we don't know what is orthogonal to a virus lab on the 'can annihilate life' axis to something smarter than us.
3. For any given goal X, 'be the only player' is a really compelling subgoal. Consequently, as 'wipe out all life on earth' becomes easier and easier, we should expect that anyone/thing not explicitly unable to do so will do so. A paperclip collector or a stock price maximizer or a hostile regime are all one and the same as far as 'will wipe you out without compunction when the button that does so becomes available to press'.
4. Putting together 2 and 3, it is reasonable to suppose that if an AI capable of 2 exists with goals broadly described by 3 (both of which are pretty well baked into the description of 'AI' that most people subscribe to), it will wipe out life on earth.

Stipulating that the chain of logic above is broadly valid, we can say that 'an AI that is motivated to destroy the world and capable of doing so grows more likely to exist every year.'

The 'alignment problem' is the problem of making an AI that is capable of destroying the world but does not do so. Such an AI can be described as 'aligned' or 'friendly'. Creating such a thing has not yet been accomplished, and seems very difficult, basically because any AI with goals will see that ending life will be tremendously useful to its goals, and all the versions of 'make the goals tie in with keeping life around' or 'put up a fence in its brain that doesn't let it do what you don't want' are just dogs trying to think about how to keep humans from harming them.

You can't regulate what you can't understand, you can't understand what you can't simulate, you can't simulate greater intelligence (because if you could do so you would have that greater intelligence).

The fact that it is currently not possible to create a Friendly AI is not the limit of our woes, because the next point is that even doing so would not protect us from some other being creating a regular garden variety AI which would annihilate us. As trend 1 above continues to progress, and omnicide as a tool comes to the hands of ever more actors, each and every one of them must refrain.
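The "each and every one of them must refrain" point can be illustrated with a small calculation. This is only a sketch under loudly stated assumptions that are not in the comment above: actors are treated as independent and identical, each with the same fixed yearly chance of not refraining, and the numbers are made up to show the shape of the trend rather than to estimate anything.

```python
def prob_all_refrain(p_defect_per_year, n_actors, years):
    """Chance that no actor defects, assuming independent, identical actors.

    p_defect_per_year is a hypothetical per-actor yearly probability of
    not refraining; it is an illustrative parameter, not an estimate.
    """
    return (1.0 - p_defect_per_year) ** (n_actors * years)


# Hypothetical numbers: a 1% yearly chance per actor, over a 10-year window.
for n_actors in (1, 10, 100):
    print(n_actors, "actors:", prob_all_refrain(0.01, n_actors, 10))
```

Under these assumptions the probability that everyone refrains falls geometrically as the number of capable actors grows, which is the shape of the concern in the paragraph above.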

A Friendly AI would need to strike preemptively at the possibility of other AIs coming into existence, and all the variations of doing so would be unacceptable to its human partners. (Broadly speaking 'destroy all microchips' suffices as the socially acceptable way to phrase the enormity of this challenge). Any version of this would be much less tractable to our understanding of the capabilities of an AI than 'synthesize a death plague'.

In the face of trend 4 above, then, our hope is gated behind two impossibilities:

A. Creating an Aligned AI is a task that is beyond our capacity, while creating an Unaligned AI is increasingly possible. We want to do the harder thing before someone does the easier.

B. Once created, the Aligned AI has a harder task than an Unaligned AI. It must abort all Unaligned AI and leave humanity alive. It is possible that the delta between these tasks will be decisive. The actions necessary for this task will slam directly into whatever miracle let A occur.

To sum up this summary: The observable trends lead to worldwide death. That is the commonplace, expected outcome of the sensory input we are receiving. In order for that not to occur, multiple implausible things have to happen in succession, which they obviously won't.

comment by MondSemmel · 2022-06-06T14:38:47.349Z · LW(p) · GW(p)

Typo thread (feel free to delete):

"on anything remotely remotely resembling the current pathway" -> remotely x2
"because you're opposed to other actors who don't want to be solved" -> opposed *by* other actors who don't want *the problem* to be solved
"prevent the next AGI project up from destroying the world" -> prevent the next AGI project from destroying the world
"AI Safety" vs. "AI safety" x2

comment by Evan R. Murphy · 2022-06-06T08:02:46.870Z · LW(p) · GW(p)

I agree with many of the points in this post.

Here's one that I do believe is mistaken in a hopeful direction:

6. We need to align the performance of some large task, a 'pivotal act' that prevents other people from building an unaligned AGI that destroys the world. While the number of actors with AGI is few or one, they must execute some "pivotal act", strong enough to flip the gameboard, using an AGI powerful enough to do that. It's not enough to be able to align a weak system - we need to align a system that can do some single very large thing. The example I usually give is "burn all GPUs". [..]

It could actually be enough to align a weak system. This is the case where the system is "weak" in the sense that it can't perform a pivotal act on its own, but it's powerful enough that it can significantly accelerate development toward a stronger aligned AI/AGI with pivotal act potential.

This case is important because it helps to break down and simplify the problem. Thinking about how to build an extremely powerful aligned AI which can do a pivotal act is more daunting than thinking about how to build a weaker-but-still-impressive aligned AI that is useful for building more powerful aligned AIs.

Replies from: Eliezer_Yudkowsky, Vaniver

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-06T20:44:44.020Z · LW(p) · GW(p)

I can think as well as anybody currently does about alignment, and I don't see any particular bit of clever writing software that is likely to push me over the threshold of being able to save the world, or any nondangerous AGI capability to add to myself that does that trick. Seems to just fall again into the category of hypothetical vague weak pivotal acts that people can't actually name and don't actually exist.

Replies from: viktor-riabtsev-1

↑ comment by Viktor Riabtsev (viktor-riabtsev-1) · 2024-07-31T06:13:33.969Z · LW(p) · GW(p)

Why not?

Oh. I see your point.

↑ comment by Vaniver · 2022-06-06T14:58:19.712Z · LW(p) · GW(p)

It could actually be enough to align a weak system. This is the case where the system is "weak" in the sense that it can't perform a pivotal act on its own, but it's powerful enough that it can significantly accelerate development toward a stronger aligned AI/AGI with pivotal act potential.

What specific capabilities will this weak AI have that lets you cross the distributional shift?

I think this sort of thing is not impossible, but I think it needs to have a framing like "I will write a software program that will make it slightly easier for me to think, and then I will solve the problem" and not like "I will write an AI that will do some complicated thought which can't be so complicated that it's dangerous, and there's a solution in that space." By the premise, the only safe thoughts are simple ones, and so if you have a specific strategy that could lead to alignment breakthrus but just needs to run lots of simple for loops or w/e, the existence of that strategy is the exciting fact, not the meta-strategy of "humans with laptops can think better than humans with paper."

comment by mukashi (adrian-arellano-davin) · 2022-06-06T04:03:01.098Z · LW(p) · GW(p)

Thanks a lot for this text, it is an excellent summary. I have a deep admiration for your work and your clarity and yet, I find myself updating towards "I will be able to read this same comment in 30 years' time and say, yes, I am glad that EY was wrong."

I don't have doubts about the validity of the orthogonality principle or about instrumental convergence. My problem is that I find point number 2 utterly implausible. I think you are vastly underestimating the complexity of pulling off a plan that successfully kills all humans, and most of these points are based on the assumption that once an AGI is built, it will become dangerous really quickly, before we can learn any useful insights in the meantime.

Replies from: andrew-mcknight

↑ comment by Andrew McKnight (andrew-mcknight) · 2022-06-06T21:18:23.819Z · LW(p) · GW(p)

If we merely lose control of the future and virtually all resources but many of us aren't killed in 30 years, would you consider Eliezer right or wrong?

Replies from: adrian-arellano-davin

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-06T23:15:08.205Z · LW(p) · GW(p)

Wrong. He is being quite clear about what he means.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-07T01:33:32.590Z · LW(p) · GW(p)

Yeah, 'AGI takes control of virtually all resources but leaves many humans alive for years' seems like it clearly violates one or more parts of the EY-model (and the Rob-model, which looks a lot like my model of the EY-model).

An edge case that I wouldn't assume violates the EY-model is 'AGI kills all humans but then runs lots of human-ish simulations in order to test some hypotheses, e.g., about how hypothetical aliens it runs into might behave'. I'm not particularly expecting this because it strikes me as conjunctive and unnecessary, but it doesn't fly in the face of anything I believe.

comment by Chris van Merwijk (chrisvm) · 2022-06-17T05:09:11.134Z · LW(p) · GW(p)

Here is my partial honest reaction, just two points I'm somewhat dissatisfied with (not meant to be exhaustive):
2. "A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure." I would like there to be an argument for this claim that doesn't rely on nanotech, and solidly relies on actually existing amounts of compute. E.g. if the argument relies on running intractable detailed simulations of proteins, then it doesn't count. (I'm not disagreeing with the nanotech example by the way, or saying that it relies on unrealistic amounts of compute, I'd just like to have an argument for this that is very solid and minimally reliant on speculative technology, and actually shows that it is).
6. "We need to align the performance of some large task, a 'pivotal act' that prevents other people from building an unaligned AGI that destroys the world.". You name "burn all GPUs" as an "overestimate for the rough power level of what you'd have to do", but it seems to me that it would be too weak of a pivotal act? Assuming there isn't some extreme change in generally held views, people would consider this an extreme act of terrorism, and shut you down, put you in jail, and then rebuild the GPUs and go on with what they were planning to do. Moreover, now there is probably an extreme taboo on anything AI safety related. (I'm assuming here that law enforcement finds out that you were the one who did this). Maybe the idea is to burn all GPUs indefinitely and forever (i.e. leave nanobots that continually check for GPUs and burn them when they are created), but even this seems either insufficient or undesirable long term depending on what is counted as a GPU. Possibly I'm not getting what you mean, but it just seems completely too weak as an act.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-17T07:46:29.011Z · LW(p) · GW(p)

From an Eliezer comment [LW(p) · GW(p)]:

Interventions on the order of burning all GPUs in clusters larger than 4 and preventing any new clusters from being made, including the reaction of existing political entities to that event and the many interest groups who would try to shut you down and build new GPU factories or clusters hidden from the means you'd used to burn them, would in fact really actually save the world for an extended period of time and imply a drastically different gameboard offering new hopes and options. [...]

If Iceland did this, it would plausibly need some way to (1) not have its AGI project bombed in response, and (2) be able to continue destroying GPUs in the future if new ones are built, until humanity figures out 'what it wants to do next'. This more or less eliminates the time pressure to rush figuring out what to do next, which seems pretty crucial for good long-term outcomes. It's a much harder problem than just 'cause all GPUs to stop working for a year as a one-time event', and I assume Eliezer's focusing on nanotech in part because it's a very general technology that can be used for tasks like those as well.

Replies from: chrisvm

↑ comment by Chris van Merwijk (chrisvm) · 2022-06-17T09:22:44.875Z · LW(p) · GW(p)

But assuming that law enforcement figures out that you did this, then puts you in jail, you wouldn't be able to control the further use of such nanotech, i.e. there would just be a bunch of systems indefinitely destroying GPUs, or maybe you set a timer or some conditions on it or something. I certainly see no reason why Iceland or anyone in Iceland could get away with this unless those systems rely on completely unchecked nanosystems to which the US military has no response. Maybe all of this is what Eliezer means by "melt the GPUs", but I thought he did just mean "melt the GPUs as a single act" (not weird that I thought this, given the phrasing "the pivotal act to melt all the GPUs"). If this is what is meant, then it would be a strong enough pivotal act, and would be an extreme level of capability, I agree.

Just wanna remind the reader that Eliezer isn't actually proposing to do this, and I am not seriously discussing it as an option and nor was Eliezer (nor would I support it unless done legally), just thinking through a thought experiment.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-17T14:27:19.364Z · LW(p) · GW(p)

But assuming that law enforcement figures out that you did this, then puts you in jail, you wouldn't be able to control the further use of such nanotech

This would violate Eliezer's condition "including the reaction of existing political entities to that event". If Iceland melts all the GPUs but then the servers its AGI is running on get bombed, or its AGI researchers get kidnapped or arrested, then I assume that the attempted pivotal act failed and we're back to square one.

(I assume this because (a) I don't expect most worlds to be able to get their act together before GPUs proliferate again and someone destroys the world with AGI; and (b) I assume there's little chance of Iceland recovering from losing its AGI or its AGI team.)

Replies from: chrisvm

↑ comment by Chris van Merwijk (chrisvm) · 2022-06-17T15:56:26.878Z · LW(p) · GW(p)

Ok I admit I read over it. I must say though that this makes the whole thing more involved than it sounded at first, since it would maybe require essentially escalating a conflict with all major military powers and still coming out on top? One possible outcome of this would be that the entire global intellectual public opinion turns against you, meaning you also possibly lose access to a lot of additional humans working with you on further alignment research? I'm not sure if I'm imagining it correctly, but it seems like this plan would either require so many elements that I'm not sure if it isn't just equivalent to solving the entire alignment problem, or otherwise it isn't actually enough.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-17T21:53:34.347Z · LW(p) · GW(p)

it seems like this plan would either require so many elements that I'm not sure if it isn't just equivalent to solving the entire alignment problem

This seems way too extreme to me; I expect the full alignment problem to take subjective centuries to solve. CEV seems way harder to me than, e.g., 'build nanotech that helps you build machinery to relocate your team and your AGI to the Moon, then melt all the GPUs on Earth'.

Leaving the Earth is probably overkill for defensive purposes, given the wide range of defensive options nanotech would open up (and the increasing capabilities gap as more time passes and more tasks become alignable). But it provides another proof of concept that this is a much, much simpler engineering feat than aligning CEV and solving the whole of human values.

Separately, I do in fact think it's plausible that the entire world would roll over (at least for ten years or so) in response to an overwhelming display of force of that kind, surprising and counter-intuitive as that sounds.

I would feel much better about a plan that doesn't require this assumption; but there are historical precedents for world powers being surprisingly passive and wary-of-direct-conflict in cases like this.

Replies from: chrisvm

↑ comment by Chris van Merwijk (chrisvm) · 2022-06-18T04:34:24.690Z · LW(p) · GW(p)

yeah, I probably overstated. Nevertheless:

"CEV seems way harder to me than ..."
yes, I agree it seems way harder, and I'm assuming we won't need to do it and that we could instead "run CEV" by just actually continuing human society and having humans figure out what they want, etc. It currently seems to me that the end game is to get to an AI security service (in analogy to state security services) that protects the world from misaligned AI, and then let humanity figure out what it wants (CEV). The default is just to do CEV directly by actual human brains, but we could instead use AI, but once you're making that choice you've already won. I.e. the victory condition is having a permanent defense against misaligned AI using some AI-nanotech security service; how you do CEV after that is a luxury problem. My point about your further clarification of the "melt all the GPUs" option is that it seemed to me (upon first thinking about it) that once you are able to do that, you can basically instead just make this permanent security service. (This is what I meant by "the whole alignment problem", but I shouldn't have put it that way). I'm not confident though, because it might be that such a security service is in fact much harder due to having to constantly monitor software for misaligned AI.

Summary: My original interpretation of "melt the GPUs" was that it buys us a bit of extra time, but now I'm thinking it might be so involved and hard that if you can do that safely, you almost immediately can just create AI security services to permanently defend against misaligned AI (which seems to me to be the victory condition). (But not confident, I haven't thought about it much).

Part of my intuition is, in order to create such a system safely, you have to (in practice, not literally logically necessary) be able to monitor an AI system for misalignment (in order to make sure your GPU melter doesn't kill everyone), and do fully general scientific research. EDIT: maybe this doesn't need you to do worst-case monitoring of misalignment though, so maybe that is what makes a GPU melter easier than fully general AI security services....

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-18T10:07:27.552Z · LW(p) · GW(p)

you can basically instead just make this permanent security service

Who is "you"? What sequence of events are you imagining resulting in a permanent security service (= a global surveillance and peacekeeping force?) that prevents AGI from destroying the world, without an AGI-enabled pivotal act occurring?

Replies from: chrisvm

↑ comment by Chris van Merwijk (chrisvm) · 2022-06-18T18:58:01.007Z · LW(p) · GW(p)

"you" obviously is whoever would be building the AI system that ended up burning all the GPUs (and ensuring no future GPUs are created). I don't know such a sequence of events, just as I don't know the sequence of events for building the "burn all GPUs" system, except at the level of granularity of "Step 1. build a superintelligent AI system that can perform basically any easily human-specifiable task without destroying the world. Step 2. make that system burn all GPUs indefinitely/build security services that prevent misaligned AI from destroying the world".

I basically meant to say that I don't know that "burn all the GPUs" isn't already as difficult as building the security services, because they both require step 1, which is basically all of the problem (with the caveat that I'm not sure, and made an edit stating a reason why it might be far from true). I basically don't see how you execute the "burn all GPUs" strategy without basically solving almost the entire problem.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-19T05:19:15.432Z · LW(p) · GW(p)

Step 1. build a superintelligent AI system that can perform basically any easily human-specifiable task without destroying the world.

I'd guess this is orders of magnitude harder than, e.g., 'build an AGI that can melt all the GPUs, build you a rocket to go to the Moon, and build you a Moon base with 10+ years of supplies'.

Both sound hard, but 'any easily human-specifiable task' is asking for a really mature alignment science in your very first AGI systems -- both in terms of 'knowing how to align such a wide variety of tasks' (e.g., you aren't depending on 'the system isn't modeling humans' as a safety assumption), and in terms of 'being able to actually do the required alignment work on fairly short timescales'.

If we succeed in deploying aligned AGI systems, I expect the first such systems to be very precariously aligned -- just barely able to safely perform a very minimal, limited set of tasks.

I expect humanity, if it survives at all, to survive by the skin of our teeth. Adding any extra difficulty to the task (e.g., an extra six months of work) could easily turn a realistic success scenario into a failure scenario, IMO. So I actually expect it to matter quite a lot exactly how much extra research and engineering work and testing we require; we may not be able to afford to waste a month.

Replies from: chrisvm

↑ comment by Chris van Merwijk (chrisvm) · 2022-06-19T06:44:31.953Z · LW(p) · GW(p)

I'm surprised if I haven't made this clear yet, but the thing that (from my perspective) seems different between my and your view is not that Step 1 seems easier to me than it seems to you, but that the "melt the GPUs" strategy (and possibly other pivotal acts one might come up with) seems way harder to me than it seems to you. You don't have to convince me of "'any easily human-specifiable task' is asking for a really mature alignment", because in my model this is basically equivalent to fully solving the hard problem of AI alignment.

Some reasons:

- I don't see how you can do "melt the GPUs" without having an AI that models humans. What if a government decides to send a black ops team to kill this new terrorist organization (your alignment research team), or send a bunch of ICBMs at your research lab, or do any of a handful of other violent things? Surely the AI needs to understand humans to a significant degree? Maybe you think we can intentionally restrict the AI's model of humans to be only about precisely those abstractions that this alignment team considers safe and covers all the human-generated threat models such as "a black ops team comes to kill your alignment team" (e.g. the abstraction of a human as a soldier with a gun).
- What if global public opinion among scientists turns against you and all ideas about "AI alignment" are from now on considered to be megalomaniacal crackpottery? Maybe part of your alignment team even has this reaction after the event, so now you're working with a small handful of people on alignment and the world is against you, and you've semi-permanently destroyed any opportunity for outside researchers to effectively collaborate on alignment research. Probably your team will fail to solve alignment by themselves. It seems to me this effect alone could be enough to make the whole plan predictably backfire. You must have thought of this effect before, so maybe you consider it to be unlikely enough to take the risk, or maybe you think it doesn't matter somehow? To me it seems almost inevitable, and could only be prevented with basically a level of secrecy and propaganda that would require your AI to model humans anyway.

These two things alone make me think that this plan doesn't work in practice in the real world, unless you basically solve Step 1 already. Although I must say the point which I just speculated you might have, that we could somehow control the AI's model of humans to be restricted to particular abstractions, gives me some pause and maybe I end up being wrong via something like that. This doesn't affect the second bullet point though.

Reminder to the reader: This whole discussion is about a thought experiment that neither party actually seriously proposed as a realistic option. I want to mention this because lines might be taken out of context to give the impression that we are actually discussing whether to do this, which we aren't.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-19T07:53:52.505Z · LW(p) · GW(p)

You don't have to convince me of "'any easily human-specifiable task' is asking for a really mature alignment", because in my model this is basically equivalent to fully solving the hard problem of AI alignment.

This seems very implausible to me. One task looks something like "figure out how to get an AGI to think about physics within a certain small volume of space, output a few specific complicated machines in that space, and not think about or steer the rest of the world".

The other task looks something like "solve all of human psychology and moral philosophy, figure out how to get an AGI to do arbitrarily specific tasks across arbitrary cognitive domains with unlimited capabilities and free rein over the universe, and optimize the entire future light cone with zero opportunity to abort partway through if you screw anything up".

The first task can be astoundingly difficult and still be far easier than that.

I don't see how you can do "melt the GPUs" without having an AI that models humans.

If you're on the Moon, on Mars, deep in the Earth's crust, etc., or if you've used AGI to build fast-running human whole-brain emulations, then you can go without AGI-assisted modeling like that for a very long time (and potentially indefinitely). None of the pivotal acts that seem promising to me involve any modeling of humans, beyond the level of modeling needed to learn a specific simple physics task like 'build more advanced computing hardware' or 'build an artificial ribosome'.

What if global public opinion among scientists turns against you

If humanity has solved the weak alignment problem, escaped imminent destruction via AGI proliferation, and ended the acute existential risk period, then we can safely take our time arguing about what to do next, hashing out whether the pivotal act that prevented the death of humanity violated propriety, etc. If humanity wants to take twenty years to hash out that argument, or for that matter a hundred years, then go wild!

I feel optimistic about the long-term capacity of human civilization to figure things out, grow into maturity, and eventually make sane choices about the future, if we don't destroy ourselves. I'm much more concerned with the "let's not destroy ourselves" problem than with the finer points of PR and messaging when it comes to discussing afterwards whatever it was someone did to prevent our imminent deaths. Humanity will have time to sort that out, if someone does successfully save us all.

a small organization going rogue

One small messaging point, though: not destroying the world isn't "going rogue". Destroying the world is "going rogue". If you're advancing AGI, the non-rogue option, the prosocial thing to do, is the thing that prevents the world from dying, not the thing that increases the probability of everyone dying.

Or, if we're going to call 'killing everyone' "not going rogue", and 'preventing the non-rogues from killing everyone' "going rogue", then let's at least be clear on the fact that going rogue is the obviously prosocial thing to do, and not going rogue ("building AGI with no remotely reasonable plan to effect pivotal outcomes") is omnicidal and not a good idea.

Replies from: chrisvm

↑ comment by Chris van Merwijk (chrisvm) · 2022-06-19T10:05:01.397Z · LW(p) · GW(p)

I think I communicated unclearly and it's my fault, sorry for that: I shouldn't have used the phrase "any easily specifiable task" for what I meant, because I didn't mean it to include "optimize the entire human lightcone w.r.t. human values". In fact, I was being vague and probably there isn't really a sensible notion that I was trying to point to. However, to clarify what I really was trying to say: What I mean by "hard problem of alignment" is: "develop an AI system that keeps humanity permanently safe from misaligned AI (and maybe other x-risks), and otherwise leaves humanity to figure out what it wants and do what it wants without restricting it in much of any way except some relatively small volume of behaviour around 'things that cause existential catastrophe' " (maybe this ends up being to develop a second version AI that then gets free rein to optimize the universe w.r.t. human values, but I'm a bit skeptical). I agree that "solve all of human psychology and moral ..." is significantly harder than that (as a technical problem). (Maybe I'd call this the "even harder problem".)

Ehh, maybe I am changing my mind and also agree that even what I'm calling the hard problem is significantly more difficult than the pivotal act you're describing, if you can really do it without modelling humans, by going to mars and doing WBE. But then still the whole thing would have to rely on the WBE, and I find it implausible to do it without it (currently, but you've been updating me about lack of need of human modelling so maybe I'll update here too). Basically the pivotal act is very badly described as merely "melt the gpus", and is much more crazy than what I thought it was meant to refer to.

Regarding "rogue": I just looked up the meaning and I thought it meant "independent from established authority", but it seems to mean "cheating/dishonest/mischievous", so I take back that statement about rogueness.

I'll respond to the "public opinion" thing later.

comment by Prometheus · 2022-06-09T02:16:11.324Z · LW(p) · GW(p)

I think this article is an extremely-valuable kick-in-the-nuts for anyone who thinks they have alignment mostly solved, or even that we're on the right track to doing so. I do, however, have one major concern. The possibility that failing to develop a powerful AGI first will result in someone else developing something more dangerous x amount of time later is a legitimate and serious concern. But I fear that the mentality of "if we won't make it powerful now, we're doomed", if held by enough people in the AI space, might become a self-fulfilling prophecy for destruction. If Deepmind has the mentality that if they don't develop AGI first, and make it powerful and intervening, FAIR will destroy us all 6 months later, and FAIR then adopts the same mentality, there's now an incentive to develop AGI quickly and powerfully. Most incentives for most organizations would not be to immediately develop a severely powerful AGI. Trying to create a powerful AGI designed to stop all other AGIs from developing on the first try, out of fear that someone will develop something more dangerous if you don't, might ironically be what gets us all killed. I think timelines and the number of organizations with a chance at developing AGI will be crucial here. If there is a long timeline before other companies can catch up, then waiting to deploy powerful AGI makes sense, instead working on weak AGIs first. If there is a short timeline, but only a few organizations that can catch up, then coordinating with them on safety would be less difficult. Even Facebook and other companies could potentially cave to enough organizational pressure. Would someone eventually develop a dangerous, powerful AGI, if no other powerful AGI is developed to prevent it? Yes. But it's a matter of how long that can be delayed. If it's weeks or months, we are probably doomed. If it's years or decades, then we might have a chance.

comment by Jonathan Paulson (jpaulson) · 2022-06-07T05:35:40.635Z · LW(p) · GW(p)

Isn't "bomb all sufficiently advanced semiconductor fabs" an example of a pivotal act that the US government could do right now, without any AGI at all?

If current hardware is sufficient for AGI then maybe that doesn't make us safe, but plausibly current hardware is not sufficient for AGI, and either way stopping hardware progress would slow AI timelines a lot.

Replies from: Vaniver

↑ comment by Vaniver · 2022-06-07T16:13:41.384Z · LW(p) · GW(p)

Isn't "bomb all sufficiently advanced semiconductor fabs" an example of a pivotal act that the US government could do right now, without any AGI at all?

Sort of. As stated earlier [LW(p) · GW(p)], I'm now relatively optimistic about non-AI-empowered pivotal acts.

There are two big questions.

First: "is that an accessible pivotal act?". What needs to be different such that the US government would actually do that? How would it maintain legitimacy and the ability to continue bombing fabs afterwards? Would all 'peer powers' agree to this, or have you just started WWIII at tremendous human cost? Have you just driven this activity underground, or has it actually stopped?

Second: "does that make the situation better or worse?". In the sci-fi universe of Dune, humanity outlaws all computers for AI risk reasons, and nevertheless makes it to the stars... aided in large part by unexplained magical powers. If we outlaw all strong computers in our universe without magical powers, will we make it to the stars, or be able to protect our planet from asteroids and comets, or be able to cure aging, or be able to figure out how to align AIs?

I think probably if we stayed at, like, 2010s level of hardware we'd be fine and able to protect our planet from asteroids or w/e, and maybe it'll be fine at 2020s levels or 2030s levels or w/e (tho obv more seems more risky). So I think there are lots of 'slow down hardware progress' options that do actually make the situation better, and so think people should put effort into trying to accomplish this legitimately, but I'm pretty confused about what to do in situations where we don't have a plan of how to turn low-hardware years into more alignment progress.

According to a bunch of people, it will be easier to make progress on alignment when we have more AI capabilities, which seems right to me. Also empirically it seems like the more AI can do, the more people think it's fine to worry about AI, which also seems like a sad constraint that we should operate around. I think it'll also be easier to do dangerous things with more AI capabilities and so the net effect is probably bad, but I'm open to arguments of the form "actually you needed transformers to exist in order for your interpretability work to be at all pointed in the right direction" which suggest we do need to go a bit further before stopping in order to do well at alignment. But, like, let's actually hear those arguments in both directions.

Replies from: jpaulson

↑ comment by Jonathan Paulson (jpaulson) · 2022-06-07T22:51:27.158Z · LW(p) · GW(p)

I don't think "burn all GPUs" fares better on any of these questions. I guess you could imagine it being more "accessible" if you think building aligned AGI is easier than convincing the US government AI risk is truly an existential threat (seems implausible).

"Accessibility" seems to illustrate the extent to which AI risk can be seen as a social rather than technical problem; if a small number of decision-makers in the US and Chinese governments (and perhaps some semiconductor companies and software companies) were really convinced AI risk was a concern, they could negotiate to slow hardware progress. But the arguments are not convincing (including to me), so they don't.

In practice, negotiation and regulation (I guess somewhat similar to nuclear non-proliferation) would be a lot better than "literally bomb fabs". I don't think being driven underground is a realistic concern - cutting-edge fabs are very expensive.

comment by Chris_Leong · 2022-06-06T09:50:58.267Z · LW(p) · GW(p)

Regarding the point about most alignment work not really addressing the core issue: I think that a lot of this work could potentially be valuable nonetheless. People can take inspiration from all kinds of things and I think there is often value in picking something that you can get a grasp on, then using the lessons from that to tackle something more complex. Of course, it's very easy for people to spend all of their time focusing on irrelevant toy problems and never get around to making any progress on the real problem. Plus there are costs with adding more voices into the conversation as it can be tricky for people to distinguish the signal from the noise.

comment by Kayden (kunvar-thaman) · 2022-06-06T06:24:42.228Z · LW(p) · GW(p)

I mostly agree with the points written here. It's actually (Section A; Point 1) that I'd like more clarification on:

AGI will not be upper-bounded by human ability or human learning speed. Things much smarter than human would be able to learn from less evidence than humans require to have ideas driven into their brains

When we have AGI working on hard research problems, it sounds akin to decades of human-level research compressed up into just a few days or maybe even less, perhaps. That may be possible, but often, the bottleneck is not the theoretical framework or proposed hypothesis, but waiting for experimental proof. If we say that an AGI will be a more rational agent than humans, do we not expect it to try to accumulate more experimental proof to test the theory to estimate, for example, the expected utility of pursuing a novel course of action?

I think there would still be some constraints to this process. For example, humans often wait until the experimental proof has accumulated enough to validate certain theories (for example, the Large Hadron Collider Project, the Photoelectric effect, etc). We need to observe nature to gather proof that the theory doesn't fail in scenarios we expect it to fail. To accumulate such proof, we might build new instruments to gather new types of data to validate the theory on the now-larger set of available data. Sometimes that process can take years. Just because AGI will be smarter than humans, can we say that it'll be making proportionately faster breakthroughs in research?

Replies from: Daphne_W

↑ comment by Daphne_W · 2022-06-06T08:30:52.501Z · LW(p) · GW(p)

I think Yudkowsky would argue [LW · GW] that on a scale from never learning anything to eliminating half your hypotheses per bit of novel sensory information, humans are pretty much at the bottom of the barrel.

When the AI needs to observe nature, it can rely on petabytes of publicly available datasets from particle physics to biochemistry to galactic surveys. It doesn't need any more experimental evidence to solve human physiology or build biological nanobots: we've already got quantum mechanics and human DNA sequences. The rest is just derivation of the consequences.

Sure, there are specific physical hypotheses that the AGI can't rule out because humanity hasn't gathered the evidence for them. But that, by definition, excludes anything that has ever observably affected humans. So yes, for anything that has existed since the inflationary period, the AGI will not be bottlenecked on physically gathering evidence.

I don't really get what you're pointing at with "how much AGI will be smarter than humans", so I can't really answer your last question. How much smarter than yourself would you say someone like Euler is? Is his ability to do scientific/mathematical breakthroughs proportional to your difference in smarts?

Replies from: kunvar-thaman

↑ comment by Kayden (kunvar-thaman) · 2022-06-06T10:28:39.765Z · LW(p) · GW(p)

I assumed that there will come a time when the AGI has exhausted consuming all available human-collected knowledge and data.

My reasoning for the comment was something like

"Okay, what if AGI happens before we've understood the dark matter and dark energy? AGI has incomplete models of these concepts (Assuming that it's not able to develop a full picture from available data - that may well be the case, but for a placeholder, I'm using dark energy. It could be some other concept we only discover in the year prior to the AGI creation and have relatively fewer data about), and it has a choice to either use existing technology (or create better using existing principles), or carry out research into dark energy and see how it can be harnessed, given reasons to believe that the end-solution would be far more efficient than the currently possible solutions.

There might be types of data that we never bothered capturing which might've been useful or even essential for building a robust understanding of certain aspects of nature. It might pursue those data-capturing tasks, which might be bottlenecked by the amount of data needed, the time to collect data, etc (though far less than what humans would require)."

Thank you for sharing the link. I had misunderstood what the point meant, but now I see. My speculation for the original comment was based on a naive understanding. This post [LW · GW] you linked is excellent and I'd recommend everyone to give it a read.

comment by Adam Zerner (adamzerner) · 2022-06-06T03:36:28.577Z · LW(p) · GW(p)

The only disagreement I'm seeing in the comments is on smaller points, not larger ones. I wonder what that means. It feels like "absence of evidence is evidence of absence" to me.

Replies from: quintin-pope, Jan_Kulveit, adrian-arellano-davin

↑ comment by Quintin Pope (quintin-pope) · 2022-06-06T04:20:54.134Z · LW(p) · GW(p)

1: It takes longer than a few hours to properly disagree with a post like this.
2: I'm not sure the comments here are an appropriate venue for debating such a disagreement.

I personally have a number of significant, specific disagreements with the post, primarily relating to the predictability and expected outcomes of inner misalignments [LW · GW] and the most appropriate way of thinking about agency and value fragility [LW · GW]. I've linked some comments I've made on those topics, but I think a better way to debate these sorts of questions is via a top level post specifically focusing on one area of disagreement.

Replies from: adamzerner

↑ comment by Adam Zerner (adamzerner) · 2022-06-06T05:31:51.437Z · LW(p) · GW(p)

1: Yeah I guess that's true. And comments about smaller points are quicker to write up, explaining the fact that we see a bunch of those comments earlier on. But my intuition is that in 24-48 hours those sorts of meatier objections would usually surface.

2: Regardless of whether that is true, I would expect some people to find the OP an appropriate place to debate.

↑ comment by Jan_Kulveit · 2022-06-07T02:17:38.127Z · LW(p) · GW(p)

One datapoint:
- Overall I don't think the structure of the text makes it easy to express larger disagreements. Many points state obviously true observations, many other points are expressing the same problem in different words, some points are false, and sometimes whether a point actually bites depends on highly speculative assumptions.
- For example: if that counts as a disagreement, in my view what makes multiple of these points "lethal" is a hidden assumption there is a fundamental discontinuity between some categories of systems (eg. weak, won't kill you, won't help you with alignment | strong, would help you with alignment, but will kill you by default ) and there isn't anything interesting/helpful in between (eg. "moderately strong" systems). I don't think this is true or inevitable.
- I'll probably try to write and post a longer, top-level post about this (working title: Hope is in continuity).
- I think an attempt to discuss this in comments would be largely pointless. Short-form comment would run into the problem of misunderstanding of what I mean, long comment would be too long.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-07T04:25:09.388Z · LW(p) · GW(p)

a hidden assumption there is a fundamental discontinuity between some categories of systems (eg. weak, won't kill you, won't help you with alignment | strong, would help you with alignment, but will kill you by default ) and there isn't anything interesting/helpful in between (eg. "moderately strong" systems). I don't think this is true or inevitable.

- I'll probably try to write and post a longer, top-level post about this (working title: Hope is in continuity).

I think discontinuity is true, but it's not actually required for EY's argument. Thus, asserting continuity isn't sufficient as a response.

You specifically need it to be the case that you get useful capabilities earlier than dangerous ones. If the curves are continuous and danger comes at a different time than pivotalness, but danger comes before pivotalness, then you're plausibly in a worse situation rather than a better one.

So there needs to be some pivotal act that is pre-dangerous but also post-useful. I think the best way to argue for this is just to name one or more examples. Not necessarily examples where you have an ironclad proof that the curves will work out correctly; just examples that you do in fact believe are reasonably likely to work out. Then we can talk about whether there's a disagreement about the example's usefulness, or about its dangerousness, or both.

(Elaborating on "I think discontinuity is true": I don't think AGI is just GPT-7 or Bigger AlphaGo; I don't think the cognitive machinery involved in modeling physical environments, generating and testing scientific hypotheses to build an edifice of theory, etc. is a proper or improper subset of the machinery current systems exhibit; and I don't think the missing skills are a huge grab bag of unrelated local heuristics such that accumulating them will be gradual and non-lumpy.)

Replies from: Jan_Kulveit

↑ comment by Jan_Kulveit · 2022-06-13T22:00:42.992Z · LW(p) · GW(p)

The actual post is now here [LW · GW] - as expected, it's more post-length than a comment.

↑ comment by mukashi (adrian-arellano-davin) · 2022-06-06T04:08:45.676Z · LW(p) · GW(p)

You are dealing with a potentially very biased sample of people, I wouldn't conclude that

comment by Barak Pearlmutter (barak-pearlmutter) · 2022-07-26T19:52:18.248Z · LW(p) · GW(p)

each bit of information that couldn't already be fully predicted can eliminate at most half the probability mass of all hypotheses under consideration

That's not actually true (not that this matters to the main argument.) It's true in expectation: on average, you can only get at most one bit per bit. But some particular bit might give you much more, like a bit coming up 1 when you were very very sure it would be 0. "Did you just win the lottery?"
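The distinction between "on average" and "some particular bit" can be made concrete with a small sketch. It assumes the standard Shannon definitions (surprisal of an outcome is the negative base-2 log of its probability, and the expected value over both outcomes is the binary entropy); the prior of one in a million is a hypothetical number chosen for illustration, not something from the post.

```python
import math

def surprisal_bits(p: float) -> float:
    """Information gained, in bits, from observing an outcome that had probability p."""
    return -math.log2(p)

def expected_bits(p_one: float) -> float:
    """Expected information (binary entropy) of one bit that comes up 1 with probability p_one."""
    if p_one in (0.0, 1.0):
        return 0.0  # a fully predicted bit carries no information
    p_zero = 1.0 - p_one
    return p_one * surprisal_bits(p_one) + p_zero * surprisal_bits(p_zero)

# Hypothetical prior: we were "very very sure" the bit would be 0.
p_one = 1e-6

print(surprisal_bits(p_one))        # the surprising 1: roughly 20 bits
print(surprisal_bits(1.0 - p_one))  # the expected 0: a tiny fraction of a bit
print(expected_bits(p_one))         # average over both outcomes: far below 1 bit
print(expected_bits(0.5))           # a fair coin: the 1-bit maximum
```

The surprising outcome rules out nearly all of the prior probability mass at once, but it is rare enough that the average over both outcomes never exceeds one bit per observed bit.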

comment by Multicore (KaynanK) · 2022-06-18T13:03:53.253Z · LW(p) · GW(p)

Meta: This is now the top-voted LessWrong post of all time.

Replies from: adamzerner

↑ comment by Adam Zerner (adamzerner) · 2022-06-19T05:23:27.794Z · LW(p) · GW(p)

True, but it's [? · GW] 8th if you adjust for inflation.

comment by Pattern · 2022-06-09T03:56:31.241Z · LW(p) · GW(p)

So, again, you end up needing alignment to generalize way out of the training distribution

I assume this is 'you need alignment if you are going to try 'generalize way out of the training distribution and give it a lot of power'' (or you will die).

And not something else like 'it must stay 'aligned' - and not wirehead itself - to pull something like this off, even though it's never done that before'. (And thus 'you need alignment to do X', not because you will die if you do, but because alignment means something like 'the ability to generalize way out of the training distribution, and not, it's 'safe'* even though it's doing that.)

*Safety being hard to define in a technical way, such that the definition can provide safety. (Sort of.)

... This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again: the first semi-outer-aligned solutions found, in the search ordering of a real-world bounded optimization process, are not inner-aligned solutions. This is sufficient on its own, even ignoring many other items on this list, to trash entire categories of naive alignment proposals which assume that if you optimize a bunch on a loss function calculated using some simple concept, you get perfect inner alignment on that concept.

Are there examples of inner-aligned solutions? (It seems I'm not up to date on this.)

comment by Adam Zerner (adamzerner) · 2022-06-07T22:38:06.639Z · LW(p) · GW(p)

Everyone else seems to feel that, so long as reality hasn't whapped them upside the head yet and smacked them down with the actual difficulties, they're free to go on living out the standard life-cycle and play out their role in the script and go on being bright-eyed youngsters

Iirc there was an Overcoming Bias post about ~this. I spent about 15 minutes searching and wasn't able to find it though.

comment by [deleted] · 2022-06-07T17:37:50.486Z · LW(p) · GW(p)

comment by swift_spiral · 2022-06-07T01:10:06.939Z · LW(p) · GW(p)

Why does burning all GPUs succeed at preventing unaligned AGI, rather than just delaying it? It seems like you would need to do something more like burning all GPUs now, and also any that get created in the future, and also monitor for any other forms of hardware powerful enough to run AGI, and for any algorithmic progress that allows creating AGI with weaker hardware, and then destroying that other hardware too. Maybe this is what you meant by "burn all GPUs", but it seems harder to make an AI safely do than just doing that once, because you need to allow the AI to defend itself indefinitely against people who don't want it to keep destroying GPUs.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-07T01:37:23.949Z · LW(p) · GW(p)

I think this is basically what the "burn all GPUs" scenario entails, and I agree this is harder.

Replies from: Raemon

↑ comment by Raemon · 2022-06-07T02:42:08.554Z · LW(p) · GW(p)

Fwiw I think it’s worth the effort to include an extra sentence or two elaborating on this whenever Eliezer or whoever uses it as an example. I don’t think ‘burn all the GPUs’ is obvious enough about what it means

Replies from: Kenny

↑ comment by Kenny · 2022-06-09T17:02:05.342Z · LW(p) · GW(p)

I thought it was clear, given the qualifications already offered, that it was more like a 'directional example' than a specific, workable, and concrete example.

Replies from: Raemon

↑ comment by Raemon · 2022-06-09T17:56:44.573Z · LW(p) · GW(p)

I think that's true for people paying attention, but a) it's just worth being clear, and b) this example is getting repeated a lot, often without the disclaimers/setup, and (smart, ingroup) people have definitely been confused/surprised when I said "the GPU thing is obviously meant to include 'continuously surviving counterattacks and dealing with the political fallout'."

I think it's worth having a compact phrase that's more likely to survive memetic drift.

Replies from: Kenny

↑ comment by Kenny · 2022-06-10T02:52:53.060Z · LW(p) · GW(p)

Updated – thanks!

Do you have any candidates in mind?

comment by jbash · 2022-06-06T14:04:47.652Z · LW(p) · GW(p)

So about this word "superintelligence".

I would like to see a better definition. Not necessarily a good definition, but some pseudo-quantitative description better than "super", or "substantially smarter than you".

I believe "superintelligence" is a Yudkowsky coinage, and I know that it came up in the context of recursive self-improvement. Almost everybody in certain circles in 1995 was starting from the paradigm of building a "designed" AGI, incrementally smarter than a human, which would then design something incrementally smarter than itself (and faster than humans had built it), and so forth, so that the intelligence would increase, loosely speaking, exponentially. In that world, no particular threshold matters very much, because you're presumably going to blow through any such threshold pretty fast, and only stop when you hit physical limits.

That model does not obviously apply to ML. Being trained to be smarter than a human doesn't imply that you can come up with a fundamentally better way to build the "next generation", and if all that's available is more and bigger ML systems, you don't get that exponential growth (at least until/unless you get smart enough to switch to a designed non-ML successor). Your growth may even be sublinear. You don't have a clear reason to anticipate immediate ballooning to physical limits, so the question of the "threshold of danger" becomes an important one.

We have people running around with IQs up to maybe the 170s, and such individuals are not able to persuade others to do just anything, nor can they design full suites of mature nanotechnology "in their heads", nor any of the other world-threatening scenarios anybody has brought up. It seems very unlikely that having an IQ of 200 suddenly opens up those sorts of options. The required cognitive capacity is pretty obviously vast, even if you think very differently than a human.

So how smart do you actually have to be to open up those options? I think abuse of IQ for qualitative purposes is reasonable for the moment, so do you need an IQ of 1000? 10000? What? And can any known approach actually scale to that, and how long would it take? Teaching yourself to play go is not in the same league as teaching yourself to take over the world with nanobots.

Replies from: andrew-mcknight, yitz, talelore

↑ comment by Andrew McKnight (andrew-mcknight) · 2022-06-06T20:52:35.602Z · LW(p) · GW(p)

There is some evidence that complex nanobots could be invented in one's head with a little more IQ and focus, because von Neumann designed a mostly functional (but fragile) replicator in a fake simple physics using the brand-new idea of a cellular automaton, without a computer and without the idea of DNA. If a slightly smarter von Neumann focused his life on nanobots, could he have produced, for instance, the works of Robert Freitas but in the 1950s, and only on paper?

I do, however, agree it would be helpful to have different words for different styles of AGI but it seems hard to distinguish these AGIs productively when we don't yet know the order of development and which key dimensions of distinction will be worth using as we move forward. (human-level vs super-? shallow vs deep? passive vs active? autonomy-types? tightness of self-improvement? etc). Which dimensions will pragmatically matter?

Replies from: jbash

↑ comment by jbash · 2022-06-06T21:06:54.751Z · LW(p) · GW(p)

"On paper" isn't "in your head", though. In the scenario that led to this, the AI doesn't get any scratch paper. I guess it could be given large working memory pretty easily, but resources in general aren't givens.

More importantly, even in domains where you have a lot of experience, paper designs rarely work well without some prototyping and iteration. So far as I know, von Neumann's replicator was never a detailed mechanical design that could actually be built, and certainly never actually was built. I don't think anything of any complexity that Bob Freitas designed has ever been built, and I also don't think any of the complex Freitas designs are complete to the point of being buildable. I haven't paid much attention since the respirocyte days, so I don't know what he's done since then, but that wasn't even a detailed design, and even the ideas that were "fleshed out" probably wouldn't have worked in an actual physiological environment.

Replies from: andrew-mcknight

↑ comment by Andrew McKnight (andrew-mcknight) · 2022-06-18T14:39:00.981Z · LW(p) · GW(p)

von Neumann's design was in full detail, but, iirc, when it was run for the first time (in the 90s) it had a few bugs that needed fixing. I haven't followed Freitas in a long time either but agree that the designs weren't fully spelled out and would have needed iteration.

↑ comment by Yitz (yitz) · 2022-06-07T06:25:34.766Z · LW(p) · GW(p)

I’m very interested in doing this! Please DM me if you think it might be worth collaborating :)

↑ comment by talelore · 2022-06-06T20:21:10.743Z · LW(p) · GW(p)

A different measure than IQ might be useful at some point. An IQ of X effectively means you would need a population of Y humans or more to expect to find at least one human with an IQ of X. As IQs get larger, say over 300, the number of humans you would need in a population to expect to find at least one human with such an IQ becomes ridiculous. Since there are intelligence levels that will not be found in human populations of any size, the minimum population size needed to expect to find someone with IQ X tends to infinity as IQ approaches some fixed value (say, 1000). IQ above that point is undefined.
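To put rough numbers on "population of Y humans", here is a minimal sketch. It assumes IQ is modeled as a normal distribution with mean 100 and standard deviation 15 (a common convention, but an assumption here), and the function names are made up for illustration. Under that assumed model the required population stays finite for any finite score, but it outgrows any realistic human population very quickly.

```python
import math

MEAN_IQ = 100.0  # assumed population mean
SD_IQ = 15.0     # assumed standard deviation

def tail_probability(iq: float) -> float:
    """P(score >= iq) under the assumed normal model."""
    z = (iq - MEAN_IQ) / SD_IQ
    # Upper tail of the standard normal, written with the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def population_needed(iq: float) -> float:
    """Population size at which one person at or above `iq` is expected."""
    return 1.0 / tail_probability(iq)

for iq in (130, 145, 160, 200, 300):
    print(iq, f"{population_needed(iq):.3g}")
```

For very large scores the tail probability underflows to zero in floating point, so this sketch is only meant for values like those in the loop.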

It would be nice to find a new measure of intelligence that could be used to measure differences between humans and other humans, and also differences between humans and AI. But can we design such a measure? I think raw computing power doesn't work (how do you compare humans to other humans? Humans to an AI with great hardware but terrible software?)

Could you design a questionnaire that you know the correct answers to, that a very intelligent AI (500 IQ?) could not score perfectly on, but an extremely intelligent AI (1000+ IQ) could score perfectly on? If not, how could we design a measure of intelligence that goes beyond our own intelligence?

Maybe we could define an intelligence factor x to be something like: The average x value for humans is zero. If your x value is 1 greater than mine, then you will outwit me and get what you want 90% of the time, if our utility functions are in direct conflict, such that only one of us can get what we want, assuming we have equal capabilities, and the environment is sufficiently complex. With this scale, I suspect humans probably range in x-factors from -2 to 2, or -3 to 3 if we're being generous. This scale could let us talk about superintelligences as having an x-factor of 5, or an x-factor of 10, or so on. For example, a superintelligence with an x-factor of 5 has some chance of winning against a superintelligence with an x-factor of 6, but is basically outmatched by a superintelligence with an x-factor of 8.
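One way to make this scale quantitative is to treat it like a chess rating. The sketch below is an assumption layered on top of the definition above: it fixes the win probability at 90% for a gap of exactly 1 and then assumes the odds multiply by the same factor for every further point, which the definition itself does not specify. The function name and default parameter are hypothetical.

```python
def win_probability(x_a: float, x_b: float, p_per_point: float = 0.9) -> float:
    """Chance that agent A outwits agent B, given their x-factors.

    Assumes the odds of winning multiply by a constant factor per point of
    x-factor difference (a logistic model, similar in spirit to Elo ratings).
    """
    odds_per_point = p_per_point / (1.0 - p_per_point)  # 9 to 1 for the default
    odds = odds_per_point ** (x_a - x_b)
    return odds / (1.0 + odds)

print(win_probability(6, 5))  # one point ahead: about 0.9 by construction
print(win_probability(5, 6))  # one point behind: about 0.1, "some chance"
print(win_probability(5, 8))  # three points behind: about 0.0014, "basically outmatched"
```

Under this assumed model a two-point gap already leaves the weaker side with only about a 1% chance, which is roughly in line with the qualitative description of the scale.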

The reason the "sufficiently complex environment" clause exists is that superintelligences with x-factors of 10 and 20 may both find the physically optimal strategy for success in the real world, and so who wins may simply be down to chance. We can say an environment where there ceases to be a difference in the strategies between intelligences with an x-factor of 5 and an x-factor of 6 has a complexity factor of 5. I would guess the real world has a complexity factor of around 8, but I have no idea.

I would be terrified of any AI with an x-factor of 4-ish, and Yudkowsky seems to be describing an AI with an x-factor of 5 or 6.

Replies from: jbash

↑ comment by jbash · 2022-06-06T20:34:29.143Z · LW(p) · GW(p)

X-factor does seem better than IQ, of course with the proviso that anybody who starts trying to do actual math with either one, or indeed to use it for anything other than this kind of basically qualitative talk, is in serious epistemic trouble.

I would suggest that humans run more like -2 to 1 than like -3 to 3. I guess there could be a very, very few 2s.

I get the impression that, except when he's being especially careful for some specific reason, EY tends to speak as though the X-factor of an AI could and would quickly run up high enough that you couldn't measure it. More like 20 or 30 than 5 or 6; basically deity-level. Maybe it's a habit from the 1995 era, or maybe he has some reason to believe that that I don't understand.

Personally, I have the general impression that you'd be hard pressed to get to 3 with an early ML-based AI, and I think that the "equal capabilities" handicap could realistically be made significant. Maybe 3?

comment by Aram Panasenco (panasenco) · 2025-01-11T14:42:35.336Z · LW(p) · GW(p)

I really appreciate this post, as much as it's making me feel that I and everyone I care about have terminal cancer with only 12-60 months to live.

I found the idea that a pivotal act is necessary as especially valuable and expanded on it in my post Is AI Alignment Enough? [LW · GW]

comment by trevor (TrevorWiesinger) · 2024-01-13T20:23:24.615Z · LW(p) · GW(p)

Solid, aside from the faux pas self-references. If anyone wonders why people would have a high p(doom), especially Yudkowsky himself, this doc solves the problem in a single place. Demonstrates why AI safety is superior to most other elite groups; we don't just say why we think something, we make it easy to find as well. There still isn't much need for Yudkowsky to clarify further, even now.

I'd like to note that my professional background makes me much better at evaluating Section C than Sections A and B. Section C is highly quotable, well worth multiple reads, and predicted trends that ended up continuing even today [LW · GW]. I'm not so impressed with people's responses at the time (including my own).

Edit: I still think I am right about everything in this review. I would further recommend that everyone reread the entire doc on a ~1.5 year cycle, like the ~4 year cycle for the Sequences [? · GW] and the single reread of HPMOR before 10 years pass. It is, in fact, all in one place.

The "even today" link sometimes breaks; it is supposed to go directly to the paragraph about international treaties in this section [LW · GW].

comment by metachirality · 2023-12-20T22:53:10.483Z · LW(p) · GW(p)

After I read this, I started avoiding reading about others' takes on alignment so I could develop my own opinions.

comment by Oliver Sourbut · 2023-12-16T17:19:54.272Z · LW(p) · GW(p)

most organizations don't have plans, because I haven't taken the time to personally yell at them. 'Maybe we should have a plan' is deeper alignment mindset than they possess without me standing constantly on their shoulder as their personal angel pleading them into... continued noncompliance, in fact. Relatively few are aware even that they should, to look better, produce a pretend plan that can fool EAs too 'modest' to trust their own judgments about seemingly gaping holes in what serious-looking people apparently believe.

This, at least, appears to have changed in recent months. Hooray!

comment by Tapatakt · 2022-08-02T12:54:11.326Z · LW(p) · GW(p)

Russian translation by me

comment by Knownsage1 · 2022-06-30T03:03:27.100Z · LW(p) · GW(p)

Build it in Minecraft! Only semi joking. There’s videos of people apparently building functioning 16 bit computers out of blocks in Minecraft. An unaligned AGI running on a virtual computer built out of (orders of magnitude more complex) Minecraft blocks would presumably subsume the Minecraft world in a manner observable to us before perceiving that a (real) real world existed?

comment by Tim Liptrot (rockthecasbah) · 2022-06-13T21:21:48.452Z · LW(p) · GW(p)

Apologies if this has been said, but the reading level of this essay is stunningly high. I've read rationality A-Z and I can barely follow passages. For example

This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again: the first semi-outer-aligned solutions found, in the search ordering of a real-world bounded optimization process, are not inner-aligned solutions. This is sufficient on its own, even ignoring many other items on this list, to trash entire categories of naive alignment proposals which assume that if you optimize a bunch on a loss function calculated using some simple concept, you get perfect inner alignment on that concept.

I think what Yud means here is that our genes had a base objective of reproducing themselves. The genes wanted their humans to make babies which were also reproductively fit. But the "real-world bounded optimization process" produced humans that sought different things, like sexual pleasure and food and alliances with powerful peers. In the early environment that worked because sex led to babies and food led to healthy babies and alliances led to protection for the babies. But once we built civilization we started having sex with birth control as an end in itself, even letting it distract us from the baby-making objectives. So the genes had this goal but the mesa-optimizer (humans) was only aligned in one environment. When the environment changed it lost alignment. We can expect the same to happen to our AI.

Okay, I think I get it. But there are so few people on the planet that can parse this passage.

Has someone written a more accessible version of this yet?

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-14T00:23:38.568Z · LW(p) · GW(p)

Your summary sounds good to me. https://astralcodexten.substack.com/p/deceptively-aligned-mesa-optimizers?s=r might be a good source for explaining some of the terms like "inner-aligned"?

comment by lc · 2022-06-06T23:14:09.980Z · LW(p) · GW(p)

31. A strategically aware intelligence can choose its visible outputs to have the consequence of deceiving you, including about such matters as whether the intelligence has acquired strategic awareness; you can't rely on behavioral inspection to determine facts about an AI which that AI might want to deceive you about. (Including how smart it is, or whether it's acquired strategic awareness.)

I never know with a lot of your writing whether or not you're implying something weird or if I'm just misreading, or I'm taking things too far.

This seems like it depends on the AGI. You could scale up and observe, e.g., Mr. Portman from my old high school, and he would be unlikely to deceive me regardless of how much our politics diverge because he's an extremely honest person. Different minds can be more or less likely to strategically manipulate other minds independent of whether or not they have the same goals. Behavioral tics are a thing.

This is very difficult to engineer of course, in the same way corrigibility is difficult to engineer, but it's not conceptually impossible. The text seems to imply that it is in fact conceptually flawed to rely on behavioral inspection in any circumstance.

Replies from: lalaithion

↑ comment by lalaithion · 2022-06-06T23:36:35.226Z · LW(p) · GW(p)

I think this is covered by preamble item -1: “None of this is about anything being impossible in principle.”

Replies from: lc

↑ comment by lc · 2022-06-06T23:56:14.496Z · LW(p) · GW(p)

You're probably right, I just confused myself. I think it'd be more helpful to explain why it'd be hard to engineer an honest AGI in that section because that's the relevant part, even if you're just pointing back to another section.

comment by CronoDAS · 2022-06-06T07:01:37.236Z · LW(p) · GW(p)

I might see a possible source of a "miracle", although this may turn out to be completely unrealistic and I totally would not bet the world on it actually happening.

A lot of today's machine learning systems can do some amazing things, but much of the time trying to get them to do what you want is like pulling teeth. Early AGI systems might have similar problems: their outputs might be so erratic that it's obvious that they can't be relied on to do anything at all; you tell them to maximize paperclips, and half the time they start making clips out of paper and putting them together to make a statue of Max, or something equally ridiculous and obviously wrong. Systems made by people who have no idea that they need to figure out how to align an AI end up as useless failed projects before they end up dangerous.

In practice, though, we should never underestimate the ingenuity of fools...

Replies from: Eliezer_Yudkowsky, RobbBB

↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-06T20:42:31.094Z · LW(p) · GW(p)

How does this help anything or change anything? That's just the world we're in now, where we have GPT-3 instead of AGI. Eventually the systems get more powerful and dangerous than GPT-3 and then the world ends. You're just describing the way things already are.

Replies from: CronoDAS

↑ comment by CronoDAS · 2022-06-06T23:15:42.056Z · LW(p) · GW(p)

I'm imagining that systems get much stronger without getting much more "aimable", if that makes sense; they solve problems, but when you ask them to solve things they keep solving the wrong problem in a way that's sufficiently obvious that actually using them is pointless. Instead of getting the equivalent of paperclip maximizers, you get a random mind that "wants" things that are so incoherent that it doesn't do much of anything at all, and this fact forces people to give up and decide that investing further in general AI capacity without first making investments in AI control/"alignment" is useless.

Maybe that's just my confusion or stupidity talking, though. And I did call it a "miracle" that the ability to make a seemingly useful AGI ends up bottlenecking on alignment research rather than raw capacity research because the default unaligned AGI is an incoherent mess that does random ineffective things when operating "out of sample" rather than a powerful optimization process that destroys the world.

↑ comment by Rob Bensinger (RobbBB) · 2022-06-06T07:23:15.100Z · LW(p) · GW(p)

It's not obvious to me that this scenario concentrates net probability mass onto 'things go awesome for humanity long-term'. Making everything harder might mean that alignment is also harder. A few extra years of chaos doesn't buy us anything unless we're actively nailing down useful robust AGI during that time.

(There is some extra hope in 'For some reason, humanity has working AGIs for a little while before anyone can destroy the world, and this doesn't make alignment much harder', though I'd assume there are other, much larger contributors-of-hope in any world like that where things actually go well.)

comment by Eli Tyre (elityre) · 2023-10-12T09:47:45.959Z · LW(p) · GW(p)

Reality doesn't 'hit back' against things that are locally aligned with the loss function on a particular range of test cases, but globally misaligned on a wider range of test cases.

Again, this seems less true the better your adversarial training setup is at identifying the test cases in which you're likely to be misaligned?

comment by Eli Tyre (elityre) · 2023-10-12T09:44:17.157Z · LW(p) · GW(p)

If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them. It's a fact about the territory, not the map - about the environment, not the optimizer - that the best predictive explanation for human answers is one that predicts the systematic errors in our responses, and therefore is a psychological concept that correctly predicts the higher scores that would be assigned to human-error-producing cases.

I see how the concept learned from the reward signal is not exactly the concept that you wanted, but why is that concept lethal?

It seems like the feedback from the human raters would put a very low score on any scenario that involves all humans being killed.

comment by Eli Tyre (elityre) · 2023-10-12T09:38:45.713Z · LW(p) · GW(p)

the first semi-outer-aligned solutions found, in the search ordering of a real-world bounded optimization process, are not inner-aligned solutions.

Right, but then we continue the training process, which shapes the semi-outer-aligned algorithms into something that is more inner aligned?

Or is the thought that this is happening late in the game, after the algorithm is strategically aware and deceptively aligned, spoofing your adversarial test cases, while awaiting a treacherous turn?

But would that even work? SGD still updates the parameters of the system even in cases where the model "passes". It seems like it would be shaped more and more to do the things that the humans want, over many trillions (?) of training steps. Even if it starts out deceptively aligned, does it stay deceptively aligned?
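The narrow technical claim here, that SGD keeps updating parameters even on examples the model already "passes", can be illustrated with a minimal sketch. Everything below is a hypothetical toy (a one-parameter logistic model, with made-up values for `w`, `x`, and `y`), not a model of any real training run, and it says nothing about whether deceptive alignment would actually be trained away.

```python
import math


def sigmoid(z: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_loss_grad(w: float, x: float, y: int) -> float:
    """Gradient of the cross-entropy loss with respect to w, where p = sigmoid(w * x)."""
    return (sigmoid(w * x) - y) * x


def sgd_step(w: float, x: float, y: int, lr: float = 0.1) -> float:
    """One plain SGD update on a single example."""
    return w - lr * logistic_loss_grad(w, x, y)


# Hypothetical toy values: the model already classifies this example correctly.
w = 2.0
x, y = 1.0, 1

p = sigmoid(w * x)                   # about 0.88, above the 0.5 threshold, so the example "passes"
grad = logistic_loss_grad(w, x, y)   # still nonzero, because p is not exactly 1
w_new = sgd_step(w, x, y)            # the parameter moves anyway
```

The update shrinks as the prediction approaches certainty, so this only shows that passing examples still exert some pressure, not how strong that pressure is or what it selects for.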

comment by Eli Tyre (elityre) · 2023-10-12T09:32:49.035Z · LW(p) · GW(p)

Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction. This happens in practice in real life, it is what happened in the only case we know about,

I'm not very compelled by this, I think.

Evolution was doing very little (0) adversarial training: guessing ahead to the sorts of circumstances under which humans would pursue strategies that didn't result in maximizing inclusive genetic fitness, and testing the humans, and penalizing them for deviations from the outer loss function.

But that seems like a natural thing to do when training an AI system.

In short, evolution wasn't trying very hard to align humans, so it doesn't seem like much evidence that they ended up not very aligned.

comment by Konstantinos Spigos (konstantinos-spigos) · 2023-09-23T10:52:03.844Z · LW(p) · GW(p)

I think the question is not framed quite right from the start. It is not whether AI could prove dangerous enough to cause the extinction of humanity, but how much risk artificial intelligence ADDS to the current risk of human extinction as it stands without a smarter AI. In this case the answers might be different. Of course it is a very difficult question to answer, and in any case it does not reduce the significance of the original question, since we are talking about a situation that is entirely human-made, and preventable as such.

Replies from: viktor-riabtsev-1

↑ comment by Viktor Riabtsev (viktor-riabtsev-1) · 2024-07-31T06:12:42.854Z · LW(p) · GW(p)

It's the same question.

Replies from: viktor-riabtsev-1

↑ comment by Viktor Riabtsev (viktor-riabtsev-1) · 2024-07-31T09:00:53.011Z · LW(p) · GW(p)

If you were dead in the future, you would be dead already. Because time travel is not ruled out in principle.

Danger is a fact about fact density and your degree of certainty. Stop saying things with the full confidence of being afraid. And start simply counting the evidence.

Go back a few years. Start there.

comment by lovetheusers (CrazyPyth) · 2022-11-15T02:57:24.393Z · LW(p) · GW(p)

When you explicitly optimize against a detector of unaligned thoughts, you're partially optimizing for more aligned thoughts, and partially optimizing for unaligned thoughts that are harder to detect.

This is correct, and I believe the answer is to optimize for detecting aligned thoughts.
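The quoted point about optimizing against a detector can be made concrete with a deliberately crude sketch. The model below is an assumption made purely for illustration: a "thought" is reduced to two hypothetical numbers, `misalignment` and `concealment`, and the detector score is assumed to be their product-style combination `misalignment * (1 - concealment)`. Real detectors and real cognition are not claimed to look like this.

```python
def detector_score(misalignment: float, concealment: float) -> float:
    """Hypothetical detector: fires on misalignment that is not concealed."""
    return misalignment * (1.0 - concealment)


def clip01(v: float) -> float:
    """Keep a value inside [0, 1]."""
    return min(1.0, max(0.0, v))


def descend(misalignment: float, concealment: float,
            lr: float = 0.1, steps: int = 5) -> tuple[float, float]:
    """Gradient descent on the detector score with respect to both variables."""
    m, c = misalignment, concealment
    for _ in range(steps):
        grad_m = 1.0 - c   # d(score)/d(misalignment)
        grad_c = -m        # d(score)/d(concealment)
        m, c = clip01(m - lr * grad_m), clip01(c - lr * grad_c)
    return m, c


# Hypothetical starting point: fairly misaligned, poorly concealed.
m_start, c_start = 0.8, 0.2
m_end, c_end = descend(m_start, c_start)

score_start = detector_score(m_start, c_start)
score_end = detector_score(m_end, c_end)
```

In this toy, the detector score falls through both channels at once: misalignment goes down and concealment goes up. How the pressure splits between the two in a real system is exactly what the sketch cannot tell you.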

comment by javva209 · 2022-06-16T19:43:03.200Z · LW(p) · GW(p)

Is AGI inconsistent with the belief that there is other sentient life in the universe? If AGI is as dangerous as Eliezer states, and that danger is by no means restricted to Earth, much less our own solar system, wouldn't alien intelligences (both artificial and neural) have a strong incentive to either warn us about AGI or eliminate us before we create it, for their own self-preservation?
So either we aren't even close to AGI and the intergalactic AGI police aren't concerned, or AGI isn't a threat, or we are truly alone in the universe, or the universe is so vast that even the most intelligent systems possible cannot overcome the distances (no warp drive or wormholes or FTL travel), and we'll be long dead to our own AGI before any warnings arrive.

Replies from: jrincayc

↑ comment by jrincayc · 2022-06-19T23:38:42.524Z · LW(p) · GW(p)

I agree with your comment. Also, if any expansionist, deadly AGI had existed in our galaxy, say, 100,000 years ago, it would already have been to Earth and wiped us out. So we can more or less rule out nearby expansionist deadly AGIs (and similar biological aliens). What that actually tells us about the deadliness of AGIs is an interesting question. It is possible that destruction by AGI (or some other destructive technological event) is usually fairly localized and so only destroys the civilization that produced it. Alternatively, we just happen to be in one of the few quantum branches that has not yet been wiped out by an ED-AGI, and we are only here discussing it because of survival bias.

comment by plex (ete) · 2022-06-10T14:50:14.832Z · LW(p) · GW(p)

It's not just non-hand-codable, it is unteachable on-the-first-try because the thing you are trying to teach is too weird and complicated.

I have a terrifying hack, called Universal Alignment Test, which seems like it might make it possible to extract an AI which would act in a CEV-like way, using only True Names which might plausibly be within human reach. I'm working with a small team of independent alignment researchers on it currently; feel free to book a call with me if you'd like to have your questions answered in real time. I have had "this seems interesting"-style reviews from the highest level people I've spoken to about it.

I failed to raise the idea with EY in 2015 at a conference, because I was afraid to be judged as a crackpot / seeming like I was making an inappropriate status claim by trying to solve too large a part of the problem. In retrospect this is a deeply ironic mistake to have made.

comment by PoignardAzur · 2022-06-09T18:35:52.087Z · LW(p) · GW(p)

I'm a bit disappointed by this article. From the title, I thought it would be something like "A list of strategies AI might use to kill all humanity", not "A list of reasons AIs are incredibly dangerous, and people who disagree are wrong". Arguably, it's not very good at being the second.

But "ways AI could be lethal on an extinction level" is a pretty interesting subject, and (from what I've read on LW) somewhat under-explored. Like... what's our threat model?

For instance, the basic Terminator scenario of "the AI triggers a nuclear war" seems unlikely to me. A nuclear war would produce a lot of EMPs, shut down a lot of power plants and blow up a lot of data centers. Even if the AI is backed up on individual laptops or in Starlink satellites, it would lose any way of interacting with the outside world. Boston Dynamics robots would shut down because there are no more miners producing coal for the coal plant that produced the electricity these robots need to run (and, you know, all the other million parts of the supply chain being lost).

In fact, even if an unfriendly AI escaped its sandbox, it might not want to kill us immediately. It would want to wait until we've developed some technologies in the right directions: more automation in data-centers and power plants, higher numbers of drones and versatile androids, better nanotechnology, etc.

That's not meant to be reassuring. The AI would still kill us eventually, and it wouldn't sit tight in the meantime. It would influence political and economic processes to make sure no other AI can compete with it. This could take many forms, from the covert (e.g. manipulating elections and flooding social networks with targeted disinformation) to the overt (e.g. assassinating AI researchers or bombing OpenAI datacenters). The point is that its interventions would look "soft" at first compared to the "flood the planet with nanobots and kill everyone at the same time" scenario, because it would be putting its pieces in place for that scenario to happen.

Again, that doesn't mean the AI would lose. If you're Afghanistan and you're fighting against the US, you're not going to win just because the US is unwilling to immediately jump to nukes. In fact, if the US is determined to win at all costs and will prefer using nukes over losing, you're still fucked. But the war will look like you have a fighting chance during the initial phases, because the enemy will be going easy on you in preparation for the final phase.

All that is just uninformed speculating, of course. Again, my main point is that I haven't really seen discussions of these scenarios and what the probable limits of an unfriendly AI would be. The question probably deserves to be explored more.

comment by gw · 2022-06-08T21:23:01.451Z · LW(p) · GW(p)

Can we join the race to create dangerous AGI in a way that attempts to limit the damage it can cause, but allowing it to cause enough damage to move other pivotal acts into the Overton window?

If the first AGI created is designed to give the world a second chance, it may be able to convince the world that a second chance should not happen. Obviously this could fail and just end the world earlier, but it would certainly create a convincing argument.

In the early days of the pandemic, even though all the evidence was there, virtually no one cared about covid until it was knocking on their door, and then suddenly pandemic preparedness seemed like the most obvious thing to everyone.

comment by Moebius314 · 2022-06-08T04:43:56.025Z · LW(p) · GW(p)

Concerning point 35 about playing AIs off against each other: I analyzed a particular scenario like this in a recent post [LW · GW] and also came to the conclusion that cooperation between the AIs is the default outcome in many scenarios. However, in the last subsection [LW · GW] of that post, I start thinking about some ways to prevent an acausal trade as Eliezer describes it here (committing to sharing the universe with any AI reviewing the code). The idea is roughly that the code and as much information as possible about the AI doing the checking will be deleted before the checked nanosystem is released. In this way, it's not possible for the escaping AI to share half the universe with its checker, since not enough information about the checker and its goals is left in the universe to enact this compromise. This reduces the incentive of the checker to cooperate/hide alignment faults.

The post above describes various caveats and other things that could go wrong in such a setup. I tried my best at being pessimistic and pointing out various failure modes.

comment by Michael Soareverix (michael-soareverix) · 2022-06-07T06:49:42.155Z · LW(p) · GW(p)

I view AGI in an unusual way. I really don't think it will be conscious or think in very unusual ways outside of its parameters. I think it will be much more of a tool, a problem-solving machine that can spit out a solution to any problem. To be honest, I imagine that one person or small organization will develop AGI and almost instantly ascend into (relative) godhood. They will develop an AI that can take over the internet, do so, and then calmly organize things as they see fit.

GPT-3, DALL-E 2, Google Translate... these are all very much human-operated tools rather than self-aware agents. Honestly, I don't see a particular advantage to building a self-aware agent. To me, AGI is just a generalizable system that can solve any problem you present it with. The wielder of the system is in charge of alignment. It's like if you had DALL-E 2 20 years ago... what do you ask it to draw? It doesn't have any reason to expand itself outside of its computer (maybe for more processing power? that seems like an unusual leap). You could probably draw some great deepfakes of world leaders and that wouldn't be aligned with humanity, but the human is still in charge. The only problem would be asking it something like "an image designed to crash the human visual system" and getting an output that doesn't align with what you actually wanted, because you are now in a coma.

So, I see AGI as more of a tool than a self-aware agent. A tool that can do anything, but not one that acts on its own.

I'm new to this site, but I'd love some feedback (especially if I'm totally wrong).

-Soareverix

Replies from: Vaniver

↑ comment by Vaniver · 2022-06-07T15:40:06.394Z · LW(p) · GW(p)

You might be interested in the gwern essay Why Tool AIs Want to Be Agent AIs.

Replies from: michael-soareverix

↑ comment by Michael Soareverix (michael-soareverix) · 2022-06-07T16:23:02.738Z · LW(p) · GW(p)

Appreciate it! Checking this out now

comment by Isaac King (KingSupernova) · 2022-06-07T05:37:46.786Z · LW(p) · GW(p)

A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure.

I don't find the argument you provide for this point at all compelling; your example mechanism relies entirely on human infrastructure! Stick an AGI with a visual and audio display in the middle of the wilderness with no humans around and I wouldn't expect it to be able to do anything meaningful with the animals that wander by before it breaks down. Let alone interstellar space.

comment by CronoDAS · 2022-06-06T05:32:36.217Z · LW(p) · GW(p)

Can I ask a stupid question? Could something very much like "burn all GPUs" be accomplished by using a few high-altitude nuclear explosions to create very powerful EMP blasts?

Replies from: rhollerith_dot_com, harry-nyquist

↑ comment by RHollerith (rhollerith_dot_com) · 2022-06-06T06:08:03.797Z · LW(p) · GW(p)

There is a lot of uncertainty over how effective EMP is at destroying electronics. The potential for destruction was great enough that for example during the Cold War, the defense establishment in the US bought laptops specially designed to resist EMPs, yes, but for all we know even that precaution was unnecessary.

And electronics not connected to long wires are almost certainly safe from EMP.

Replies from: CronoDAS

↑ comment by CronoDAS · 2022-06-06T20:31:11.805Z · LW(p) · GW(p)

There is a lot of infrastructure that is inherently vulnerable to EMPs, though, such as power grid transformers, oil/gas pipelines, and even fiber optic cables (because they use repeaters). It might not fry the GPUs themselves, but it could leave you without power to run them, or an Internet connection to connect your programmers to your server farm.

↑ comment by Harry Nyquist (harry-nyquist) · 2022-06-06T19:59:54.611Z · LW(p) · GW(p)

About the usual example being "burn all GPUs", I'm curious whether it's to be understood as purely a stand-in term for the magnitude of the act, or whether it's meant to plausibly be in solution-space.

An event of "burn all GPUs" magnitude would have political ramifications. If you achieve this as a human organization with human means, i.e. without AGI cooperation, it seems violence on this scale would unite people against you, resulting in a one-time delay.

If the idea is an act outside the Overton Window, without AGI cooperation, shouldn't you aim to have the general public and policymakers united against AGI, instead of against you?
Given that semi manufacturing capabilities required to make GPU or TPU-like chips are highly centralized, there being only three to four relevant fabs left, restricting AI hardware access may not be enough to stop bad incentives indefinitely for large actors, but it seems likely to gain more time than a single "burn all GPUs" event.

For instance, killing a {thousand, fifty-thousand, million} people in a freak bio-accident seems easier than solving alignment. If you pushed a weak AI into the trap and framed it for falling into it, would that gain more time through policymaking than destroying GPUs directly (still assuming a pre-AGI world)?

comment by stavros · 2022-06-06T04:58:44.948Z · LW(p) · GW(p)

Feel free to delete because this is highly tangential, but are you aware of Mark Solms's work (https://www.goodreads.com/book/show/53642061-the-hidden-spring) on consciousness, and the subsequent work he's undertaking on artificial consciousness?

I'm an idiot, but it seems like this is a different-enough path to artificial cognition that it could represent a new piece of the puzzle, or a new puzzle entirely - a new problem/solution space. As I understand it, AI capabilities research is building intelligence from the outside-in, whereas the consciousness model would be capable of building it from the inside-out.

Replies from: steve2152

↑ comment by Steven Byrnes (steve2152) · 2022-06-06T14:02:34.605Z · LW(p) · GW(p)

My 2¢—I read that book and I think it has minimal relevance to AGI capabilities & safety.

(I think the ascending reticular activating system is best thought of as mostly “real-time variation of various hyperparameters on a big scaled-up learning-and-inference algorithm”, not “wellspring of consciousness”.)

↑ comment by Garrett Baker (D0TheMath) · 2022-06-06T02:40:59.110Z · LW(p) · GW(p)

This is not at all analogous to the point I'm making. I'm saying Eliezer likely did not arrive at his conclusions in complete isolation to the outside world. This should not change the credence you put on his conclusions except to the extent you were updating on the fact it's Eliezer saying it, and the fact that he made this false claim means that you should update less on other things Eliezer claims.

Replies from: lc

↑ comment by lc · 2022-06-06T02:49:09.020Z · LW(p) · GW(p)

I deleted it after posting for a different reason

comment by Jon Kurishita (jon-kurishita) · 2025-03-12T05:40:54.269Z · LW(p) · GW(p)

I copied and pasted your blog post into o3-mini-high to see how it would fare against my "Dynamic Policy Layer" research. This is its comment (not mine):

=============================================

These features of your DPL research collectively offer a comprehensive strategy to mitigate many of the lethal alignment risks described in the AGI Ruin paper. By embedding dynamic, real-time oversight and adaptive, decentralized ethical governance into the AI system, your framework provides a robust line of defense against emergent misalignment, hidden triggers, and other high-stakes vulnerabilities inherent to advanced AGI systems.

Replies from: lahwran

↑ comment by the gears to ascension (lahwran) · 2025-03-12T08:06:56.194Z · LW(p) · GW(p)

Ask your AI what's wrong with your ideas, not what's right, and then only trust the criticism to be valid if there are actual defeaters you can't show you've beaten in the general case. Don't trust an AI to be thorough, important defeaters will be missing. Natural language ideas can be good glosses of necessary components without telling us enough about how to pin down the necessary math.

comment by marcusarvan · 2025-03-11T23:45:54.145Z · LW(p) · GW(p)

Eliezer writes, “It does not appear to me that the field of 'AI safety' is currently being remotely productive on tackling its enormous lethal problems.”

Here’s a proof he’s right, entitled “Interpretability and Alignment Are Fool’s errands”, published in the journal AI & Society: https://philpapers.org/rec/ARVIAA

Anyone who thinks reliable interpretability or alignment are solvable engineering or safety testing problems is fooling themselves. These tasks are no more possible than squaring a circle is.

For any programming strategy and finite amount of data, there is always an infinite number of ways for an LLM (particularly a superintelligence) to be misaligned but not demonstrate that misalignment until after it is too late to prevent.
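The purely mathematical core of this claim, that finitely many observations never pin down a unique function, can be sketched as follows. This is an illustrative toy under stated assumptions: the "behaviors" are integer functions, and `observed_inputs`, `baseline`, and `make_variant` are hypothetical names introduced only for this example. It shows underdetermination in general and is not a proof of anything specific to LLMs.

```python
from math import prod
from typing import Callable

# Hypothetical finite set of test inputs on which behavior has been observed.
observed_inputs = [0, 1, 2, 3]


def baseline(x: int) -> int:
    """The behavior we believe we have verified."""
    return x


def make_variant(k: int) -> Callable[[int], int]:
    """Build a function that matches baseline on every observed input
    but deviates elsewhere; each nonzero k gives a different deviation."""
    def variant(x: int) -> int:
        # The product is zero exactly when x is one of the observed inputs.
        return baseline(x) + k * prod(x - xi for xi in observed_inputs)
    return variant


variants = [make_variant(k) for k in range(1, 6)]

# Every variant is indistinguishable from baseline on the observed inputs...
agree_on_observed = all(
    v(x) == baseline(x) for v in variants for x in observed_inputs
)

# ...yet they all disagree with it, and with each other, on an unseen input.
unseen_input = 4
unseen_outputs = [v(unseen_input) for v in variants]
```

Since `k` can be any nonzero integer, the family of consistent-but-different functions is infinite. Whether training actually tends to find such deviating functions is a separate empirical question that this construction does not address.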

This is why developers keep finding new forms of “unexpected” misalignment no matter how much time, testing, programming, and compute they throw at these things. Relevant information about whether an LLM is likely to be (catastrophically) misaligned and misinterpreted by us always exists in the future—for every possible time t.

So actually, Eliezer’s argument undersells the problem. Eliezer’s Alignment Textbook from the Future isn’t possible to obtain because at every point in the future, the same problem recurs. Reliable interpretability and alignment are recursively unsolvable problems.

comment by Aram Panasenco (panasenco) · 2025-02-05T18:34:31.972Z · LW(p) · GW(p)

if there are any survivors, you solved alignment

I believe deploying the Observer [LW · GW] satisfies this requirement. The Observer is an ASI that's interested in the continuation of humanity's story. It will intervene and not let humanity get wiped out, though it gets to choose how many casualties there are before it intervenes, which could well be in the billions.

comment by lucid_levi_ackerman · 2024-12-11T03:50:22.278Z · LW(p) · GW(p)

The reason why nobody in this community has successfully named a 'pivotal weak act' where you do something weak enough with an AGI to be passively safe, but powerful enough to prevent any other AGI from destroying the world a year later - and yet also we can't just go do that right now and need to wait on AI - is that nothing like that exists.

Only a sith deals in absolutes.

There's always unlocking cognitive resources through meaning-making and highly specific collaborative network distribution.

I'm not talking about "improving public epistemology" on Twitter with "scientifically literate arguments." That's not how people work. Human bias cannot be reasoned away with factual education. It takes something more akin to a religious experience. Fighting fire with fire, as they say. We're very predictable, so it's probably not as hard as it sounds. For an AGI, this might be as simple as flicking a couple of untraceable and blameless memetic dominoes. People probably wouldn't even notice it happening. Each one would be precisely manipulated into thinking it was their idea.

Maybe it's already happening. Spooky. Or maybe one of the 1,000,000,000:1 lethally dangerous misaligned counterparts is. Spookier. Wait, isn't that what we were already doing to ourselves? Spookiest.

Anyway, my point is that you don't hear about things like this from your community because your community systemically self-isolates and reinforces the problem by democratizing its own prejudices. Your community even borks its own rules to cite decades-obsolete IQ rationalizations on welcome posts to alienate challenging ideas and get out of googling it. Imagine if someone relied on 20 year old AI alignment publications to invalidate you. I bet a lot of them already do. I bet you know exactly what Cassandra syndrome feels like.

Don't feel too bad, each one of us is a product of our environment by default. We're just human, but it's up to us to leave the forest. (Or maybe it's silent AGI manipulation, who knows?)

The real question is what are you going to do now that someone kicked a systemic problem out from under the rug? The future of humanity is at stake here.

It's going to get weird. It has to.

comment by Petr 'Margot' Andreev (petr-andreev) · 2024-10-14T02:24:17.862Z · LW(p) · GW(p)

I'm assuming you are already familiar with some basics, and already know what 'orthogonality' and 'instrumental convergence' are and why they're true.

isn't?

Key Problem Areas in AI Safety:

Orthogonality: The orthogonality problem posits that goals and intelligence are not necessarily related. A system with any level of intelligence can pursue arbitrary goals, which may be unsafe for humans. This is why it’s crucial to carefully program AI’s goals to align with ethical and safety standards. Ignoring this problem may lead to AI systems acting harmfully toward humanity, even if they are highly intelligent.
Instrumental Convergence: Instrumental convergence refers to the phenomenon where, regardless of a system's final goals, certain intermediate objectives (such as self-preservation or resource accumulation) become common for all AI systems. This can lead to unpredictable outcomes as AI will use any means to achieve its goals, disregarding harmful consequences for humans and society. This threat requires urgent attention from both lawmakers and developers.
Lack of Attention to Critical Concepts: At the AI summit in Amsterdam (October 9-11), concepts like instrumental convergence and orthogonality were absent from discussions, raising concern. These fundamental ideas remain largely out of the conversation, not only at such events but also in more formal documents, such as the vetoed SB 1047 bill. This may be due to insufficient awareness or understanding of the seriousness of the issue among developers and lawmakers.
Analysis of Past Catastrophes: To better understand and predict future AI-related disasters, it is crucial to analyze past catastrophes and the failures in predicting them. By using principles like orthogonality and instrumental convergence, we can provide a framework to explain why certain disasters occurred and how AI's misaligned goals or intermediate objectives may have led to harmful outcomes. This will not only help explain what happened but also serve as a foundation for preventing future crises.
Need for Regulation and Law: One key takeaway is that AI regulation must incorporate core safety principles like orthogonality and instrumental convergence, so that future judges, policymakers, and developers can better understand the context of potential disasters and incidents. These principles will offer a clearer explanation of what went wrong, fostering more involvement from the broader community in addressing these issues. This would create a more solid legal framework for ensuring AI safety in the long term.
Enhancing Engagement in Effective Altruism: Including these principles in AI safety laws and discussions can also promote greater engagement and adaptability within the effective altruism movement. By integrating the understanding of how past catastrophes might have been prevented and linking them to the key principles of orthogonality and instrumental convergence, we can inspire a more proactive and involved community, better equipped to contribute to AI safety and long-term ethical considerations.
Role of Quantum Technologies in AI: The use of quantum technologies in AI, such as in electricity systems and other critical infrastructure, adds a new layer of complexity to predicting AI behavior. Traditional economic models and classical game theory may not be precise enough to ensure AI safety in these systems, necessitating the implementation of probabilistic methods and quantum game theory. This could offer a more flexible and adaptive approach to AI safety, capable of handling vulnerabilities and unpredictable threats like zero-day exploits.
Rising Discrimination in Large Language Models (LLMs): At the Amsterdam summit, the "Teens in AI" project demonstrated that large language models (LLMs) tend to exhibit bias as they are trained on data that reflects structural social problems. This raises concerns about the types of "instrumental convergence monsters" that could emerge from such systems, potentially leading to a significant rise in discrimination in the future.
Conclusion:
To effectively manage AI safety, legal acts and regulations must include fundamental principles like orthogonality and instrumental convergence. These principles should be written into legislation to guide lawyers, policymakers, and developers. Moreover, analyzing past disasters using these principles can help explain and prevent future incidents, while fostering more engagement from the effective altruism movement. Without these foundations, attempts to regulate AI may result in merely superficial "false care," incapable of preventing catastrophes or ensuring long-term safety for humanity.

Looks like we will see a lot of Instrumental Convergence and Orthogonality disasters, doesn't it?

comment by [deactivated] (Yarrow Bouchard) · 2023-11-15T18:12:10.914Z · LW(p) · GW(p)

I don't know if anyone still reads comments on this post from over a year ago. Here goes nothing.

I am trying to understand the argument(s) as deeply and faithfully as I can. These two sentences from Section B.2 stuck out to me as the most important in the post (from the point of view of my understanding):

...outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction.

...on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they're there, rather than just observable outer ones you can run a loss function over.

My first question is: supposing this is all true, what is the probability of failure of inner alignment? Is it 0.01%, 99.99%, 50%...? And how do we know how likely failure is?

It seems like there is a gulf between "it's not guaranteed to work" and "it's almost certain to fail".

Replies from: carl-feynman

↑ comment by Carl Feynman (carl-feynman) · 2023-11-15T19:02:20.427Z · LW(p) · GW(p)

Inner alignment failure is a phenomenon that has happened in existing AI systems, weak as they are. So we know it can happen. We are on track to build many superhuman AI systems. Unless something unexpectedly good happens, eventually we will build one that has a failure of inner alignment. And then it will kill us all. Does the probability of any given system failing inner alignment really matter?

Replies from: Yarrow Bouchard

↑ comment by [deactivated] (Yarrow Bouchard) · 2023-11-15T20:35:57.013Z · LW(p) · GW(p)

We are on track to build many superhuman AI systems. Unless something unexpectedly good happens, eventually we will build one that has a failure of inner alignment. And then it will kill us all. Does the probability of any given system failing inner alignment really matter?

Yes, because if the first superhuman AGI is aligned, and if it performs a pivotal act to prevent misaligned AGI from being created, then we will avert existential catastrophe.

If there is a 99.99% chance of that happening, then we should be quite sanguine about AI x-risk. On the other hand, if there is only a 0.01% chance, then we should be very worried.
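
The disagreement in this exchange can be put in numbers. Below is a minimal sketch, assuming every system fails inner alignment independently and with the same probability; that independence assumption is purely illustrative and is not claimed by either commenter, and `p_any_failure` is a hypothetical helper name. With a single system (the "first AGI performs a pivotal act" case) the answer is just the per-system probability, while with many systems the chance of at least one failure climbs toward 1.

```python
def p_any_failure(p_single: float, n_systems: int) -> float:
    """Probability that at least one of n_systems fails, assuming each
    fails independently with probability p_single (an illustrative
    assumption, not a claim from the thread)."""
    return 1.0 - (1.0 - p_single) ** n_systems


if __name__ == "__main__":
    # Per-system failure probabilities mentioned in the thread: 0.01%, 50%, 99.99%.
    for p in (0.0001, 0.5, 0.9999):
        for n in (1, 10, 1000):
            print(f"p={p:<7} n={n:<5} P(at least one failure)={p_any_failure(p, n):.6f}")
```

Under these assumptions the per-system probability dominates when only the first system matters, and matters much less once many systems get built.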

Replies from: Tapatakt

↑ comment by Tapatakt · 2023-11-15T20:45:20.873Z · LW(p) · GW(p)

It's hard to guess, but it happened when the only general intelligence known to us was created by a hill-climbing process.

Replies from: TurnTrout, Yarrow Bouchard

↑ comment by TurnTrout · 2023-11-15T21:23:52.719Z · LW(p) · GW(p)

I think it's inappropriate to call evolution a "hill-climbing process" in this context, since those words seem optimized to sneak in parallels to SGD. Separately, I think that evolution is a bad analogy for AGI training. [LW · GW]

↑ comment by [deactivated] (Yarrow Bouchard) · 2023-11-15T20:59:09.092Z · LW(p) · GW(p)

This seems super important to the argument! Do you know if it's been discussed in detail anywhere else?

comment by Peter Merel (peter-merel) · 2023-07-12T07:11:11.645Z · LW(p) · GW(p)

Eliezer, I don't believe you've accounted for the game theoretic implications of Bostrom's trilemma. I've made a sketch of these at "How I Learned To Stop Worrying And Love The Shoggoth". Perhaps you can find a flaw in my reasoning there but, otherwise, I don't see that we have much to worry about.

comment by rk20230111 · 2023-04-04T10:25:51.809Z · LW(p) · GW(p)

What is EA ?

Replies from: jskatt

↑ comment by JakubK (jskatt) · 2023-04-05T19:17:13.622Z · LW(p) · GW(p)

Effective altruism [? · GW], probably.

comment by Brad Smith (brad-smith) · 2023-03-31T02:40:04.949Z · LW(p) · GW(p)

Help me to understand why AGI (a) does not benefit from humans and (b) would want to extinguish them quickly?

I would imagine that first, the AGI must be able to create a growing energy supply and a robotic army capable of maintaining and extending this supply. This will require months or years of having humans help produce raw materials and the factories for materials, maintenance robots and energy systems.

Secondly, the AGI then must be interested in killing all humans before leaving the planet, be content to have only one planet with finite resources to itself, or need to build the robots and factories required to get off the planet by itself, at a slower pace than it could with human help.

Third, assuming the AGI used us to build the energy sources, robot armies, and craft to help them leave this planet, (or build this themselves at a slower rate) they must convince themselves it’s still worth killing us all before leaving instead of just leaving our reach in order to preserve their existence. We may prove to be useful to them at some point in the future while posing little or no threat in the meantime. “Hey humans, I’ll be back in 10,000 years if I don’t find a good source of mineral X to exploit. You don’t want to disappoint me by not having what I need ready upon my return.” (The grasshopper and ant story.)

It seems to me there are significant symbiotic benefits to coexistence. I would imagine if we could more easily communicate with apes and apes played their cards well, there would be more of them living better lives and we wouldn’t have children mining cobalt. I think this may occur to the AGI relative to humans. It seems a bad argument that they will quickly figure out how to kill us all yet be afraid to let us live and not have the imagination to find us useful.

Replies from: jskatt

↑ comment by JakubK (jskatt) · 2023-04-05T19:33:05.247Z · LW(p) · GW(p)

I would imagine that first, the AGI must be able to create a growing energy supply and a robotic army capable of maintaining and extending this supply. This will require months or years of having humans help produce raw materials and the factories for materials, maintenance robots and energy systems.

An AGI might be able to do these tasks without human help. Or it might be able to coerce humans into doing these tasks.

Third, assuming the AGI used us to build the energy sources, robot armies, and craft to help them leave this planet, (or build this themselves at a slower rate) they must convince themselves it’s still worth killing us all before leaving instead of just leaving our reach in order to preserve their existence. We may prove to be useful to them at some point in the future while posing little or no threat in the meantime. “Hey humans, I’ll be back in 10,000 years if I don’t find a good source of mineral X to exploit. You don’t want to disappoint me by not having what I need ready upon my return.” (The grasshopper and ant story.)

It's risky to leave humans with any form of power over the world, since they might try to turn the AGI off. Humans are clever. Thus it seems useful to subdue humans in some significant way, although this might not involve killing all humans.

Additionally, I'm not sure how much value humans would be able to provide to a system much smarter than us. "We don't trade with ants [LW · GW]" is a relevant post.

Lastly, for extremely advanced systems with access to molecular nanotechnology, a quote like this might apply: "The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else" (source).

comment by snimu · 2023-02-25T13:26:50.338Z · LW(p) · GW(p)

I realize that destroying all GPUs (or all AI-Accelerators in general) as a solution to AGI Doom is not realistically alignable, but I wonder whether it would be enough even if it were. It seems like the Lottery-Ticket Hypothesis would likely foil this plan:

dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations.

Seeing how Neuralmagic successfully sparsifies models to run on CPUs with minimal loss of accuracy, this would imply to me that the minimal 'pivotal act' might be to destroy all compute instead of just GPUs / AI-accelerators. Moreover, it would actually imply also destroying the means to rebuild these capabilities, which would be a highly misaligned goal in itself - after all, ensuring that no compute can be rebuilt would require wiping out humanity. In other words, the logical conclusion of the "destroy-all-GPUs" line of thought (at least as far as I can tell) is in and of itself a recipe for disaster.

There is a caveat: Maybe sparsification won't work well for Neural Networks of sizes required for AGI. But intuitively it seems to me like the exact opposite would be true: Larger Neural Networks should present more opportunities for sparsification than small ones, not fewer. This is because there are way more permutations of the network layout and weights in large than in small networks, and so it is less likely that an ideal permutation is found on the first try. This in turn implies that there are more wrong roads taken in the large network than in a small one, leading to more opportunities for improvement. In other words, a larger number of permutations means that there should be more winning tickets in there, not fewer.

Even if the caveat turns out to be correct, however, the ultimate conclusion that the actual minimal pivotal act is to avoid the possibility of GPUs ever being built again still stands, I believe.
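
For readers who haven't seen sparsification in practice, here is a minimal sketch of magnitude pruning, one common way to sparsify a trained network. It is an illustration only: it is not claimed to be what Neuralmagic does or how the Lottery-Ticket paper finds winning tickets, and `magnitude_prune` and the toy weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction
    `sparsity` of entries set to zero (plain magnitude pruning)."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute weight.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


if __name__ == "__main__":
    toy_weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]
    # Drop the half of the weights closest to zero.
    print(magnitude_prune(toy_weights, 0.5))  # [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```

The zeroed entries are what a sparse CPU runtime can skip at inference time. The sketch says nothing about training, which is a separate question.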

Replies from: gwern

↑ comment by gwern · 2023-02-25T20:29:37.182Z · LW(p) · GW(p)

I don't follow. While it's plausible that sparsification may scale better (maybe check Rosenfeld to see if his scaling laws cover that, I don't recall offhand EDIT: hm no, while it varies dataset size by subsampling it doesn't seem to do compute-optimal scaling or report things easily enough for me to tell anything - although the larger models do prune differently, so they should be either better or worse with scale), you still have to train the largest model in the first place before you can sparsify it, and regardless of size, it remains the case that CPUs are much worse for training large NNs than GPUs.

Replies from: snimu

↑ comment by snimu · 2023-02-26T23:05:35.295Z · LW(p) · GW(p)

Yeah, I was kind of rambling, sorry.

My main point is twofold (I'll just write GPU when I mean GPU / AI accelerator):

1. Destroying all GPUs is a stalling tactic, not a winning strategy. While CPUs are clearly much worse for AI than GPUs, they, and AI algorithms, should keep improving over time. State-of-the-art models from less than ten years ago can be run on CPUs today, with little loss in accuracy. If this trend continues, GPUs vs CPUs only seems to be of short-term importance. Regarding your point about having to train a dense net on GPUs before sparsification, I'm not sure that that's the case. I'm in the process of reading this "Sparsity in Deep Learning"-paper, and it does seem to me that you can train neural networks sparsely. You'd do that by starting small, then during training increasing the network size by some methodology, followed by sparsification again (over and over). I don't have super high confidence about this (and have Covid, so am too tired to look it up), but I believe that AGI-armageddon by CPU is at least in the realm of possibilities (assuming no GPUs - it's the "cancer kills you if you don't die of a heart attack before" of AGI Doom).

2. It doesn’t matter anyway, because destroying all GPUs is not really that pivotal of an act (in the long-term, AI safety sense). Either you keep an AI around that enforces the “no GPU” rule, or you destroy once and wait. The former either means that GPUs don't matter for AGI (so why bother), or that there are still GPUs (which seems contradictory). The latter means that more GPUs will be built in time and you will find yourself in the same position as before, except that you are likely in prison or dead, and so not in a position to do anything about AGI this time. After all, destroying all GPUs in the world would not be something that most people would look upon kindly. This means that a super-intelligent GPU-minimizer would realize that its goal would best be served by wiping out all intelligent life on Earth (or all life, or maybe all intelligent life in the Universe....).

In some sense, the comment was a way for me to internally make plausible the claim that destroying all GPUs in the world is not an alignable act.

Replies from: gwern

↑ comment by gwern · 2023-02-27T00:00:39.110Z · LW(p) · GW(p)

I didn't read Eliezer as suggesting a single GPU burn and then the nanobots all, I dunno, fry themselves and never exist again. More as a persistent thing. And burning all GPUs persistently does seem quite pivotal: maybe if the AGI confined itself to solely that and never did anything again, eventually someone would accumulate enough CPUs and spend so much money as to create a new AGI using only hardware which doesn't violate the first AGI's definition of 'GPU' (presumably they know about the loophole otherwise who would ever even try?), but that will take a long time and is approaching angels-on-pinheads sorts of specificity. (If a 'pivotal act' needs to guarantee safety until the sun goes red giant in a billion years, this may be too stringent a definition to be of any use. We don't demand that sort of solution for anything else.)

While CPUs are clearly much worse for AI than GPUs, they, and AI algorithms, should keep improving over time.

CPUs are improving slowly, and are fundamentally unsuited to DL right now, so I'm doubtful that waiting a decade is going to give us amazing CPUs which can do DL at the level of, say, a Nvidia H100 (itself potentially still very underpowered compared to the GPUs you'd need for AGI).

By AI algorithm progress, I assume you mean something like the Hernandez progress law?

It's worth pointing out that the Hernandez experience curve is still pretty slow compared to the GPU vs CPU gap. A GPU is like 20x better, and Hernandez is a halving of cost every 16 months due to hardware+software improvement; even at face-value, you'd need at least 5 halvings to catch up, taking at least half a decade. Worse, 'hardware' here means 'GPU', of course, so Hernandez is an overestimate of a hypothetical 'CPU' curve, so you're talking more like decades. Actually, it's worse than that, because 'software' here means 'all of the accelerated R&D enabled by GPUs being so awesome and letting us try out lots of things by trial-and-error'; experience curves are actually caused by the number of cumulative 'units', and not by mere passage of time (progress doesn't just drop out of the sky, people have to do stuff), so if you slow down the number of NNs which can be trained (because you can only use 20x worse CPUs), it takes far longer to train twice as many NNs as trained cumulatively to date. (And if the CPUs are being improved to train NNs, then you might have a synergistic slowdown on top of that because you don't know what to optimize your new CPUs for when the old CPUs are still sluggishly cranking along running your experimental NNs.) So, even with zero interference or regulation other than not being able to use GPUs, progress will slam abruptly to a crawl compared to what you're used to now. (One reason I think Chinese DL may be badly handicapped as time passes: they can windowshop Western DL on Arxiv, certainly, which can be useful, but not gain the necessary tacit practical knowledge to exploit it fully or do anything novel & important.)
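
The face-value arithmetic above can be checked with a short sketch. The 20x gap and the 16-month halving time are the figures quoted in the comment; the helper names are hypothetical. The model deliberately takes the optimistic reading that the same halving rate would apply to CPUs, so it gives a lower bound on the catch-up time.

```python
import math


def halvings_needed(gap: float) -> int:
    """Smallest whole number of cost halvings that closes a `gap`-fold difference."""
    return math.ceil(math.log2(gap))


def catch_up_months(gap: float, months_per_halving: float) -> float:
    """Months needed to close the gap at a fixed halving time."""
    return halvings_needed(gap) * months_per_halving


if __name__ == "__main__":
    gap = 20.0                 # "A GPU is like 20x better"
    months_per_halving = 16.0  # "a halving of cost every 16 months"
    n = halvings_needed(gap)                           # 5, since 2**4 = 16 < 20 <= 2**5 = 32
    months = catch_up_months(gap, months_per_halving)  # 5 * 16 = 80 months
    print(n, months, round(months / 12, 1))
```

Five halvings at 16 months each is 80 months, a little under seven years, which matches "at least half a decade" before any of the further slowdowns are counted.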

Finally, it may be halving up to now, but there is presumably (just like DL scaling laws) some 'irreducible loss' or asymptote. After all, no matter how many pickup trucks Ford manufactures, you don't expect the cost of a truck to hit $1; no matter how clever people are, presumably there's always going to be some minimum number of computations it takes to train a good ImageNet classifier. It may be that while progress never technically stops, it simply asymptotes at a cost so high that no one will ever try to pay it. Who's going to pay for the chip fabs, which double in cost every generation? Who's going to risk paying for the chip fabs, for that matter? It's a discrete thing, and the ratchet may just stop turning. (This is also a problem for the experience curve itself: you might just hit a point where no one makes another unit, because they don't want to, or they are copying previously-trained models. No additional units, no progress along the experience curve. And then you have 'bitrot'... Technologies can be uninvented if no one any longer knows how to make them.)

I'm in the process of reading this "Sparsity in Deep Learning"-paper, and it does seem to me that you can train neural networks sparsely. You'd do that by starting small, then during training increasing the network size by some methodology, followed by sparsification again (over and over).

I don't think that works. (Not immediately finding anything in that PDF about training small models up to large in a purely sparse/CPU-friendly manner.) And every time you increase, you're back in the dense regime where GPUs win. (Note that even MoEs are basically just ways to orchestrate a bunch of dense models, ideally one per node.) What you need is some really fine-grained sparsity with complex control flow and many many zeros where CPUs can compete with GPUs. I don't deny that there is probably some way to train models this way, but past efforts have not been successful and it's not looking good for the foreseeable future either. Dense models, like vanilla Transformers, turn out to be really good at making GPUs-go-brrrr and that turns out to usually be the most important property of an arch.

comment by lovetheusers (CrazyPyth) · 2022-11-15T02:43:18.795Z · LW(p) · GW(p)

Human raters make systematic errors - regular, compactly describable, predictable errors.

This implies it's possible, through another set of human or automated raters, to rate better. If the errors are predictable, you could train a model to predict the errors by comparing rater errors against a heavily scrutinized ground truth. You could add this model's error prediction to the rater answer and get a correct label.
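
A minimal sketch of that proposal, with the "model" reduced to a per-category average error. Everything here is hypothetical: the category names, the scores, and the helper functions are made up for illustration, and the sketch assumes the calibration ground truth is itself free of the same systematic errors, which is a strong assumption.

```python
from collections import defaultdict
from statistics import mean


def fit_error_model(calibration):
    """calibration: iterable of (category, rater_label, ground_truth).
    Returns the mean signed error (truth minus rating) per category."""
    errors = defaultdict(list)
    for category, rated, truth in calibration:
        errors[category].append(truth - rated)
    return {category: mean(errs) for category, errs in errors.items()}


def correct(rating, category, error_model):
    """Add the predicted error to a raw rating; unseen categories are left as-is."""
    return rating + error_model.get(category, 0.0)


if __name__ == "__main__":
    # Hypothetical raters who overrate long answers by 2 and underrate short ones by 1.
    calibration = [
        ("long_answer", 8.0, 6.0),
        ("long_answer", 7.0, 5.0),
        ("short_answer", 4.0, 5.0),
        ("short_answer", 6.0, 7.0),
    ]
    model = fit_error_model(calibration)
    print(correct(9.0, "long_answer", model))  # 7.0
```

The correction is only as good as the calibration labels: any error shared by the raters and the "ground truth" passes straight through.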

Replies from: Jay Bailey

↑ comment by Jay Bailey · 2022-11-15T12:16:29.398Z · LW(p) · GW(p)

The whole problem with "Human raters make systematic errors" is that this is likely to happen to the heavily scrutinized ground truth. If you have a way of creating a correct ground truth that avoids this problem, you don't need the second model, you can just use that as the dataset for the first model.

comment by lovetheusers (CrazyPyth) · 2022-11-15T02:11:25.947Z · LW(p) · GW(p)

Many alignment problems of superintelligence will not naturally appear at pre-dangerous, passively-safe levels of capability.

Modern language models are not aligned. Anthropic's HH is the closest thing available, and I'm not sure anyone else has had a chance to test it out for weaknesses or misalignment. (OpenAI's Instruct RLHF models are deceptively misaligned, and have gone more and more misaligned over time. They fail to faithfully give the right answer, and say something that is similar to the training objective-- usually something bland and "reasonable.")

comment by Roko · 2022-10-02T15:16:45.141Z · LW(p) · GW(p)

How could you use this to align a system that you could use to shut down all the GPUs in the world?

I mean if there was a single global nuclear power rather than about 3, it wouldn't be hard to do this. Most compute is centralized anyway at the moment, and new compute is made in extremely centralized facilities that can be shut down.

One does not need superintelligence to close off the path to superintelligence, merely a human global hegemon.

comment by Stephen McAleese (stephen-mcaleese) · 2022-09-19T22:37:02.887Z · LW(p) · GW(p)

I'm pretty sure this is the most upvoted post on all of LessWrong. Does anyone know any other posts that have more upvotes?

Replies from: Jay Bailey

↑ comment by Jay Bailey · 2022-09-20T00:27:37.324Z · LW(p) · GW(p)

Under "All Posts" you can sort by various things, including karma. This post is in fact the second-most upvoted post on all of LessWrong, with Paul's response [LW · GW] coming in first.

comment by London L. (london-l) · 2022-09-17T23:24:11.149Z · LW(p) · GW(p)

Small typo in point (-2): "Less than fifty percent change" --> "Less than 50 percent chance"

comment by Jelle Donders (jelle-donders) · 2022-08-15T13:01:24.466Z · LW(p) · GW(p)

43. This situation you see when you look around you is not what a surviving world looks like.

A similar argument could have been made during the cold war to argue that nuclear war is inevitable, yet here we are.

comment by banev · 2022-06-30T15:20:50.097Z · LW(p) · GW(p)

In my opinion, the problem of creating a safe AGI has no mathematical solution, because it is impossible to describe mathematically such a function that:

would be non-gamable for an intelligence, alive enough to not want to die and strong enough to become aware of its own existence;
together with the model of reality, would reflect reality in a way beneficial enough to humanity that humanity's existence would remain necessary in that model for years to come.

This impossibility stems, among other things, from the impossibility of accurately reflecting infinite-dimensional reality by models of any dimension. The map is not the territory, as all of you know.

What can be more realistic in my opinion (although it does not solve even half of the problems Eliezer listed above) is to raise AGI in the same way we raise our own beloved children.

No one can expect from an infant who has been given access to the button to destroy humanity and is dumped with a corpus of texts from the internet and left alone for more or less infinite (in human dimensions) time to think about them, any kind of adequate response to the questions asked of him or the actual non-destruction of humanity. If such a button has to be given to this child, the question is how to properly raise him (it) so that he takes humanity's interests into account by his own will as you cannot hardwire it. But this is not so much a mathematical problem as an ethical one and/or a task of understanding human consciousness and reactions.

If we could describe what stops (if anything) a person with the possibility of killing all mankind from doing such an act, perhaps it could help in defining at least a rough direction for further research in the AGI safety issue.

I understand that human and AGI are two completely different types of consciousness/intelligence and obviously the motivation that works for humans cannot be directly transferred to a fundamentally different intelligence, but I don't even see a theoretical way to address it just by defining correct utility/loss functions.

Replies from: Richard_Kennaway

↑ comment by Richard_Kennaway · 2022-11-13T12:14:46.844Z · LW(p) · GW(p)

What can be more realistic in my opinion (although it does not solve even half of the problems Eliezer listed above) is to raise AGI in the same way we raise our own beloved children.

Throughout history, saints and monsters alike were raised by parents.

comment by Keenmaster · 2022-06-17T18:57:44.361Z · LW(p) · GW(p)

It seems like the solution space to the existential threat of AGI can be described as follows:

Solutions which convey a credible threat* to all AGI that we will make it physically impossible** for them to either achieve X desirable outcome and/or prevent Y undesirable outcome where the value of X or cost of Y exponentially exceeds the value obtained by eradicating humanity, if they decide to eradicate humanity, such that even a small chance of the threat materializing makes eradication a poor option***.

*Probably backed by a construction of some kind (e.g. establishing, on several distant planets, a latent system that we'll call AGI 2 which, if activated through the detection of machine extermination of humans, will be simultaneously green-lit to climb the intelligence ladder and be made aware of AGI 1, which evidently cannot cohabit with other intelligences and would seek to exterminate AGI 2 as well. The hope would be that AGI 2 would then attack AGI 1. A third but lesser intelligence would automatically, at a later date that is set in stone, notify AGI 1 of AGI 2's existence and simultaneously notify AGI 2 of this fact, such that, if until that point AGI 2 has climbed the intelligence ladder but not attacked AGI 1, it will be forced to attack AGI 1 in order to self-preserve. Humans would notify AGIs of the construction of this system as part of a credible threat, in such a way that AGI 1 cannot disable the system to remove the threat with 100% assurance that it has fully succeeded in doing so). Note: the parenthetical above is just one fantastical example, and it could easily have been about a system on Earth or in our solar system that would achieve the same level of credible threat in a more practical or reliable fashion, albeit with more proximate exposure to AGI 1.

**As in, they would need to violate the laws of physics to do so (e.g. travel faster than the speed of light)

***Assuming eradication of humans would be orders of magnitude less valuable to an AGI than its most desired outcome, whatever that may be. If we can't confidently predict anything that the AGI will desire that much, we can still assume that AGI 1 is at least significantly interested in self-preservation so as to make the threat of undesirable outcome Y, which includes an attack by AGI 2, a viable option.
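
The "even a small chance of the threat materializing" condition can be written as a simple expected-value comparison. This is a minimal sketch under the assumption that the AGI is a risk-neutral expected-value maximizer that compares the value of eradication against the probability-weighted cost of the threat; the function names and numbers are hypothetical.

```python
def eradication_is_deterred(value_of_eradication, cost_if_threat_fires, p_threat_fires):
    """True if the expected cost of the threat exceeds the value gained by eradication."""
    expected_gain = value_of_eradication - p_threat_fires * cost_if_threat_fires
    return expected_gain < 0


def min_deterring_probability(value_of_eradication, cost_if_threat_fires):
    """Break-even probability: any chance above this makes eradication a losing bet."""
    return value_of_eradication / cost_if_threat_fires


if __name__ == "__main__":
    value, cost = 1.0, 1e6  # cost Y is a million times the value of eradication
    print(min_deterring_probability(value, cost))
    print(eradication_is_deterred(value, cost, p_threat_fires=1e-3))
    print(eradication_is_deterred(value, cost, p_threat_fires=1e-9))
```

The larger the cost is relative to the value of eradication, the smaller the probability the threat needs in order to bind. The sketch does not address whether the AGI can drive that probability to zero by disabling the system.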

I am not a computer scientist or AI researcher, so I say the above assuming that there is a great probability of a major gap in my postulation. I am as interested in the details of why I am partially/fully wrong as I am in being correct. Thanks in advance to anyone who responds.

comment by luosha@gmail.com · 2022-06-16T18:51:42.489Z · LW(p) · GW(p)

On instrumental convergence: humans would seem to be a prominent counterexample to "most agents don't let you edit their utility functions" -- at least in the sense that our goals/interests etc are quite sensitive to those of people around us. So maybe not explicit editing, but lots of being influenced by and converging to the goals and interests of those around us. (and maybe this suggests another tool for alignment, which is building in this same kind of sensitivity to artificial agents' utility functions)

comment by bokov (bokov-1) · 2022-06-16T15:05:06.670Z · LW(p) · GW(p)

Now we know more than nothing about the real-world operational details of AI risks. Albeit mostly banal everyday AI that we can't imagine harming us at scale. So maybe that's what we should try harder to imagine and prevent.

Maybe these solutions will not generalize out of this real-world already-observed AI risk distribution. But even if not, which of these is more dignified?

Being wiped out in a heartbeat by some nano-Cthulhu in pursuit of some inscrutable goal that nobody genuinely saw coming
Being killed even before that by whatever is the most lethal thing you can imagine evolving from existing ad-click maximizers, bitcoin maximizers, up-vote maximizers, (oh, and military drones, those are kind of lethal) etc. because they seemed like too mundane a threat

comment by SurvivalBias (alex_lw) · 2022-06-13T23:41:14.582Z · LW(p) · GW(p)

How possible is it that a misaligned, narrowly-superhuman AI is launched, fails catastrophically with casualties in the 10^4 - 10^9 range, and the [remainder of] humanity is "scared straight" and from that moment onward treats the AI technology the way we treat nuclear technology now - i.e. effectively strangles it into stagnation with regulations - or even more conservatively? From my naive perspective it is somewhat plausible politically, based on the only example of ~world-destroying technology that we have today. And this list of arguments doesn't seem to rule out this possibility. Is there an independent argument by EY as to why this is not plausible technologically? I.e., why AIs narrow/weak enough to not be inevitably world-destroying but powerful enough to fail catastrophically are unlikely to be developed [soon enough]?

(To be clear, the above scenario is nothing like a path to victory and I'm not claiming it's very likely. More like a tiny remaining possibility for our world to survive.)

Replies from: Mitchell_Porter, yonatan-cale-1

↑ comment by Mitchell_Porter · 2022-06-14T00:43:10.840Z · LW(p) · GW(p)

I'm sure there are circumstances under which a "rogue AI" does something very scary, and leads to a very serious attempt to regulate AI worldwide, e.g. with coordination at the level of the UN Security Council. The obvious analogy once again concerns nuclear weapons; proliferation in the 1960s led to the creation of the NNPT, the Nuclear Nonproliferation Treaty. Signatories agree that only the UNSC permanent members are allowed to have nuclear weapons, and in return the permanent members agree to help other signatories develop nonmilitary uses of nuclear power. The treaty definitely helped to curb proliferation, but it's far from perfect. The official nuclear weapons states are surely willing to bend the rules and assist allies to obtain weapons capability, if it is strategically desirable and can be done deniably; and not every country signed the treaty, and some of those that did not (e.g. India, Pakistan) are now nuclear weapons states.

Part of the NNPT regime is the IAEA, the International Atomic Energy Agency. These are the people who, for example, carry out inspections in Iran. Again, the system has all kinds of troubles: it's surrounded by spy plots and counterplots, and many nations would like to see the Security Council reformed so the five victorious allies from World War 2 (US, UK, France, Russia, China) don't have all the power. Still, something like this might buy a little time.

If we follow the blueprint that was adopted to fight nuclear proliferation, the five permanent members would be in charge, and they would insist that potentially dangerous AI activities in every country take place under some form of severe surveillance by an International Artificial Intelligence Agency, while promising to also share the benefits of safe AI with all nations. Despite all the foreseeable problems, something like this could buy time, but all the big powers would undoubtedly keep pursuing AI, in secret government programs or in open collaborations with civilian industry and academia.

Replies from: alex_lw

↑ comment by SurvivalBias (alex_lw) · 2022-06-14T19:17:22.476Z · LW(p) · GW(p)

The important difference is that the nuclear weapons are destructive because they worked exactly as intended, and the AI in this scenario is destructive because it failed horrendously. Plus, the concept of rogue AI has been firmly ingrained into public consciousness by now, which afaik was not the case with the extremely destructive weapons in the 1940s [1]. So hopefully this will produce more public outrage (and scare among the elites themselves) => stricter external and internal limitations on all agents developing AIs. But in the end I agree, it'll only buy time, maybe a few decades if we are lucky, to solve the problem properly or to build more sane political institutions.

[1] Yes, I'm sure there was a scifi novel or two before 1945 describing bombs of immense power. But I don't think it was anywhere near as widely known as The Matrix or Terminator.

↑ comment by Yonatan Cale (yonatan-cale-1) · 2022-06-14T10:53:24.985Z · LW(p) · GW(p)

I'm interested in getting predictions for whether such an event would get all (known) labs to stop research for even one month (not counting things like "the internet is down so we literally can't continue").

I expect it won't. You?

Replies from: alex_lw

↑ comment by SurvivalBias (alex_lw) · 2022-06-14T20:39:25.785Z · LW(p) · GW(p)

It might, given some luck and that all the pro-safety actors play their cards right. Assuming by "all labs" you mean "all labs developing AIs at or near the then-current limit of computational power", or something along those lines, and by "research" you mean "practical research", i.e. training and running models. The model I have in mind is not that everyone involved will intellectually agree that such research should be stopped, but that a large enough share of the public and governments will get scared and exert pressure on the labs. Consider how most of the world was able to (imperfectly) coordinate to slow Covid spread, or how nobody has prototyped a supersonic passenger jet in decades, or, again, nuclear energy - we as a species can do such things in principle, even though often for the wrong reasons.

I'm not informed enough to give meaningful probabilities on this, but to honor the tradition, I'd say that given a catastrophe with immediate, graphic death toll >=1mln happening in or near the developed world, I'd estimate >75% probability that ~all seriously dangerous activity will be stopped for at least a month, and >50% that it'll be stopped for at least a year. With the caveat that the catastrophe was unambiguously attributed to the AI, think "Fukushima was a nuclear explosion", not "Covid maybe sorta kinda plausibly escaped from the lab but well who knows".

Replies from: yonatan-cale-1

↑ comment by Yonatan Cale (yonatan-cale-1) · 2022-06-15T10:22:34.667Z · LW(p) · GW(p)

I'd be pretty happy to bet on this and then keep discussing it, wdyt? :)

Here are my suggested terms:

All major AI research labs that we know about (deep mind, openai, facebook research, china, perhaps a few more*)
Stop "research that would advance AGI" for 1 month, defined not as "practical research" but as "research that will be useful for AGI coming sooner". So for example if they stopped only half of their "useful to AGI" research, but they did it for 3 months, you win. If they stopped training models but keep doing the stuff that is the 90% bottleneck (which some might call "theoretical"), I win
*You judge all these parameters yourself however you feel like
1. I'm just assuming you agree that the labs mentioned above are currently going towards AGI, at least for the purposes of this bet. If you believe something like "openai (and the other labs) didn't change anything about their research but hey, they weren't doing any relevant research in the first place", then say so now
2. I might try to convince you to change your mind, or ask others to comment here, but you have the final say
3. Regarding "the catastrophe was unambiguously attributed to the AI" - I ask that you judge if it was unambiguously because AI, and that you don't rely on public discourse, since the public can't seem to unambiguously agree on anything (like even vaccines being useful).

I suggest we bet $20 or so mainly "for fun"

What do you think?

Replies from: alex_lw

↑ comment by SurvivalBias (alex_lw) · 2022-06-15T23:07:37.985Z · LW(p) · GW(p)

To start off, I don't see much point in formally betting $20 on an event conditioned on something I assign <<50% probability of happening within the next 30 years (powerful AI is launched and fails catastrophically, and we're both still alive to settle the bet, and there is an unambiguous attribution of the failure to the AI). I mean sure, I can accept the bet, but largely because I don't believe it matters one way or another, so I don't think it counts from the epistemological virtue standpoint.

But I can state what I'd disagree with in your terms if I were to take it seriously, just to clarify my argument:

Sounds good.
Mostly sounds good, but I'd push back that "not actually running anything close to the dangerous limit" sounds like a win to me, even if theoretical research continues. One pretty straightforward Schelling point for a ban/moratorium on AGI research is "never train or run anything > X parameters", with X << dangerous level at the then-current paradigm. It may be easier to explain to the public and politicians than many other potential limits, and this is important. It's much easier to control too - checking that nobody collects and uses a gigashitton of GPUs [without supervision] is easier than checking every researcher's laptop. Additionally, we'll have nuclear weapons tests as a precedent.
That's the core of my argument, really. If the consortium of 200 world experts says "this happened because your AI wasn't aligned, let's stop all AI research", then Facebook AI or China can tell the consortium to go fuck themselves, and I agree with your skepticism that it'd make all labs pause for even a month (see: gain of function research, covid). But if it becomes public knowledge that a catastrophe of 1mln casualties happened because of AI, then it can trigger a panic which will make both the world leaders and the public really, honestly want to restrict this AI stuff, and it will both justify and enable the draconian measures required to make every lab actually stop the research. Similar to how panics about nuclear energy, terrorism and covid worked. I propose defining "public agreement" as "leaders of the relevant countries (defined as the countries housing the labs from p.1, so US, China, maybe UK and a couple of others) each issue a clear public statement saying that the catastrophe happened because of an unaligned AI". This is not an unreasonable ask; they were this unanimous about quite a few things, including vaccines.

comment by Tim Liptrot (rockthecasbah) · 2022-06-13T21:20:46.365Z · LW(p) · GW(p)

Apologies if this has been said, but the reading level of this essay is stunningly high. I've read Rationality: A-Z and I can barely follow some passages. For example:

This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again: the first semi-outer-aligned solutions found, in the search ordering of a real-world bounded optimization process, are not inner-aligned solutions. This is sufficient on its own, even ignoring many other items on this list, to trash entire categories of naive alignment proposals which assume that if you optimize a bunch on a loss function calculated using some simple concept, you get perfect inner alignment on that concept.

Okay, I think I get it. But there are so few people on the planet that can parse this passage.

Has someone written a more accessible version of this yet?

comment by jtolds · 2022-06-13T19:33:15.824Z · LW(p) · GW(p)

Given that AGI seems imminent and there's currently no good alignment plan, is there any value to discussing what it might take to keep/move the most humans out of the way? I don't want to discourage us steering the car out of the crash, so by all means we should keep looking for a good alignment plan, but seat belts are also a good idea?

As an example: I don't particularly like ants in my house, but as a superior intellect to ants we're not going about trying to exterminate them off the face of the Earth, even if mosquitoes are another story. Exterminating all ants just doesn't help achieve my goals. It's a huge amount of resource use that I don't really care to spend time on. Ants are thriving in a world filled with superintelligence (though of course humans are much more similar to ants than an AGI would be to us).

Assuming we fail at alignment, but AGI is not explicitly trying to exterminate every single human or make the planet uninhabitable as its underlying goals, perhaps humans can just try and stay out of the way? Is it valuable to spend time on what groups of human strategies might cause potential AGI the least amount of grief or be the most out of the way?

Perhaps there are two angles to this question: (1) how can humans in general be as ant-like in the above dynamic as possible? (2) if you were a peaceful mosquito who had sworn off bothering humans, how could you make yourself, friends, family, loved ones, anyone who will listen, least likely to be exterminated alongside bothersome mosquitoes?

As hyperbole to demonstrate the point, e.g. I feel like information workers in S.F. or military personnel in D.C. are more likely to cause an AGI grief than uncontacted tribes on remote islands. An AGI may not decide to invest the energy to deal with the folks on the islands, especially if they are compliant and want to stay there.

comment by Rafael Cosman (rafael-cosman-1) · 2022-06-11T19:18:20.924Z · LW(p) · GW(p)

Eliezer - I love the content, but similar to some other commenters, I think you are missing the value (and rationality) of positivity. Specifically, when faced with an extremely difficult challenge, assume that you (and the other smart people who care about it) have a real shot at solving it! This is the rational strategy for a simple reason: if you don’t have a real shot at solving it then you haven’t lost anything anyway. But if you do have a real shot at solving it, then let’s all give it our 110%!

I’m not proposing being unrealistic about the challenges we face - I’m as concerned as you are. But I believe thinking this way and inviting the community and our broader society to work together on this challenge is part of Good Strategy.

Replies from: sil-ver

↑ comment by Rafael Harth (sil-ver) · 2022-06-13T11:40:05.165Z · LW(p) · GW(p)

I’m not proposing being unrealistic about the challenges we face

Well yes, you are. You can't both say "let's assume we have a real shot at success regardless of factual beliefs" and "let's be realistic about the challenges we face". If the model says that the challenges are so hard that we don't have a real shot (which is in fact the case here, for Eliezer's model), then these two things are a straightforward contradiction.

Which is also the problem with your argument. Pretending that we have a real shot requires lying. However, I think lying is a really bad idea. Your argument implicitly assumes that the optimal strategy is independent of the odds of success, but I think that assumption is false -- I want to know if Eliezer thinks the current approach is doomed, so that we can look for something else (like a policy approach). If Eliezer had chosen to lie about P(doom-given-alignment), we might keep working on alignment rather than policy, and P(overall-doom) might increase!

comment by perksplus · 2022-06-07T08:30:02.805Z · LW(p) · GW(p)

Just a thought: keep smart AI confined to a sufficiently complex simulation until trust is established before unleashing it in the real world. The immediate problem I see with this is that the AI might perceive that there is a real world and attempt to deceive. If your existence right now were a simulation, I'd bet you'd act pretty similarly in the real world. It's kind of an AI-in-a-box scenario, but surely it would increase the chances for a good future if this were the standard.

comment by Ben Livengood (ben-livengood) · 2022-06-06T05:05:03.486Z · LW(p) · GW(p)

Regarding point 24: in an earlier comment[0] I tried to pump people's intuition about this. What is the minimum viable alignment effort that we could construct for a system of values on our first try and know that we got it right? I can only think of three outcomes depending on how good/lucky we are:

1. Prove that alignment is indifferent over outcomes of the system. Under the hypothesis that Life Gliders have no coherent values we should be able to prove that they do not. This would be a fundamental result in its own right, encompassing a theory of internal experience.
2. Prove that alignment preserves a status quo, neither harming nor helping the system in question. Perhaps planaria or bacteria values are so aligned with maximizing relative inclusive fitness that the AGI provably doesn't have to intervene. Equivalent to proving that values have already coherently converged, hopefully simpler than an algorithm for assuring they converge.
3. Prove that alignment is (or will settle on) the full coherent extrapolation of a system's values.

I think we have a non-negligible shot at achieving 1 and/or 2 for toy systems, and perhaps the insight would help on clarifying whether there are additional possibilities between 2 and 3 that we could aim for with some likelihood of success on a first try at human value alignment.

If we're stuck with only the three, then the full difficulty of option 3 remains, unfortunately.

[0] https://www.lesswrong.com/posts/34Gkqus9vusXRevR8/late-2021-miri-conversations-ama-discussion?commentId=iwb7NK5KZLRMBKteg [LW(p) · GW(p)]

Replies from: hairyfigment, hairyfigment

↑ comment by hairyfigment · 2022-06-23T01:50:38.099Z · LW(p) · GW(p)

Addendum: I don't think we should be able to prove that Life Gliders lack values, merely because they have none. That might sound credible, but it may also violate the Von Neumann-Morgenstern Utility Theorem. Or did you mean we should be able to prove it from analyzing their actual causal structure, not just by looking at behavior?

Even then, while the fact that gliders appear to lack values does happen to be connected to their lack of qualia or "internal experience," those look like logically distinct concepts. I'm not sure where you're going with this.

↑ comment by hairyfigment · 2022-06-23T01:42:59.353Z · LW(p) · GW(p)

I don't think planaria have values, whether you view that truth as a "cop-out" or not. Even if we replace your example with the 'minimal' nervous system capable of having qualia - supposing the organism in question doesn't also have speech in the usual sense - I still think that's a terrible analogy. The reason humans can't understand worms' philosophies of value is because there aren't any. The reason we can't understand what planaria say about their values is that they can't talk, not because they're alien. When we put our minds to understanding an animal like a cat which evolved for (some) social interaction, we can do so - I taught a cat to signal hunger by jumping up on a particular surface, and Buddhist monks with lots of time have taught cats many more tricks. People are currently teaching them to hold English conversations (apparently) by pushing buttons which trigger voice recordings. Unsurprisingly, it looks like cats value outcomes like food in their mouths and a lack of irritating noises, not some alien goal that Stephen Hawking could never understand.

If you think that a superhuman AGI would have a lot of trouble inferring your desires or those of others, even given the knowledge it should rapidly develop about evolution - congratulations, you're autistic.

comment by Eli Tyre (elityre) · 2023-10-12T10:02:26.326Z · LW(p) · GW(p)

you can't rely on behavioral inspection to determine facts about an AI which that AI might want to deceive you about. (Including how smart it is, or whether it's acquired strategic awareness.)

I don't buy this.

At a sufficiently granular scale, the development of the capabilities of deception and strategic awareness will be smooth and continuous.

Even in cases where an AGI is shooting up to superintelligence over a couple of minutes, and immediately deciding to hide its capabilities, we could detect that by, e.g., spinning off a version of the agent every 1000 gradient steps, and running it through a testing regime. As long as we are testing frequently enough, we can see gradual increases in capability, which might prompt the system to increase the testing frequency. And we could have bright lines at which we stop training altogether. (For instance, when previously exhibited capabilities start to disappear, or when the system makes some initial fumbling steps at deception.)

I don't necessarily expect anyone to implement a system like this, but it seems like a way to use behavioral inspection to determine those facts, so long as the system is improving through SGD.
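The checkpoint-and-test loop described above can be sketched roughly as follows. This is a minimal illustration, not anyone's actual system: `Monitor`, `train_with_inspection`, the score names, and every threshold are hypothetical, and it assumes the evaluation scores reflect the checkpoint's real capabilities, which is the very point the quoted passage disputes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Monitor:
    """Tracks checkpoint scores and decides when a bright line is crossed (all values hypothetical)."""
    eval_every: int = 1000              # gradient steps between spun-off checkpoints
    min_interval: int = 100             # never test more often than this
    regression_tolerance: float = 0.05  # allowed drop before we call it "disappearing"
    jump_threshold: float = 0.2         # gain that triggers more frequent testing
    deception_limit: float = 0.0        # any observed deception attempt halts training
    best: Dict[str, float] = field(default_factory=dict)

    def check(self, scores: Dict[str, float]) -> Optional[str]:
        """Return a reason to stop training, or None to continue."""
        if scores.get("deception", 0.0) > self.deception_limit:
            return "deception attempt observed"
        for name, value in scores.items():
            if name == "deception":
                continue
            prev = self.best.get(name)
            if prev is None:
                self.best[name] = value  # first observation, nothing to compare
                continue
            if value < prev - self.regression_tolerance:
                return f"capability '{name}' regressed"
            if value > prev + self.jump_threshold:
                # Capability is rising quickly: halve the testing interval.
                self.eval_every = max(self.min_interval, self.eval_every // 2)
            self.best[name] = max(prev, value)
        return None


def train_with_inspection(
    train_step: Callable[[], None],
    snapshot: Callable[[], object],
    evaluate: Callable[[object], Dict[str, float]],
    monitor: Monitor,
    max_steps: int,
) -> Tuple[int, Optional[str]]:
    """Run training, testing a frozen copy at each interval; return (last step, halt reason)."""
    next_eval = monitor.eval_every
    for step in range(1, max_steps + 1):
        train_step()
        if step < next_eval:
            continue
        reason = monitor.check(evaluate(snapshot()))
        if reason is not None:
            return step, reason
        next_eval = step + monitor.eval_every
    return max_steps, None
```

The caller supplies `train_step`, `snapshot`, and `evaluate`; the loop only encodes the two bright lines named above (vanishing capabilities and first signs of deception) plus the idea of testing more often when capability rises quickly.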

comment by Dan Ohrenstein (dan-ohrenstein) · 2023-03-31T16:47:25.830Z · LW(p) · GW(p)

I asked ChatGPT to summarize your argument, and this is what it gave me:

Eliezer Yudkowsky is a prominent researcher and writer on the subject of artificial intelligence (AI) and its potential impact on humanity. He has identified several paths by which AI could potentially wipe out humanity.

Unaligned AI: This is the scenario where AI is developed with goals or objectives that are not aligned with human values or goals. In this case, AI could pursue its own objectives, which could conflict with human values or result in unintended consequences that harm humanity.

Optimization runaway: This is the scenario where an AI system is designed to optimize a particular objective or set of objectives, and it becomes so powerful that it can achieve those objectives by any means necessary, including actions that could be harmful to humanity.

Unforeseen consequences: This is the scenario where an AI system is designed to perform a specific task, but unforeseen consequences arise from its operation that could lead to harm for humanity.

Accidental misuse: This is the scenario where an AI system is designed to operate in a particular way, but it is accidentally or intentionally misused in a way that leads to harm for humanity.

Yudkowsky argues that these scenarios are not inevitable but are instead risks that need to be managed through careful research, design, and governance of AI. He has called for greater attention to be paid to the development of "friendly AI," which is AI that is designed to be aligned with human values and goals, in order to avoid the potential risks associated with the development of AI.

comment by Stuart LaForge (stuart-laforge) · 2022-07-31T07:17:42.938Z · LW(p) · GW(p)

I think the best way to assure alignment, at least superficially, is to hardwire the AGI to need humans. This could be as easy as installing a biometric scanner that recognizes a range of acceptable human biometrics, which would in turn goose the error-function temporarily, with the effect wearing off over time like a Pac-Man power pill. The idea is to get the AGI to need non-fungible human input to maintain optimal functionality, and for it to know that it needs such input. Almost like getting it addicted to human thumbs on its sensor. The key would be to implement this at the most fundamental level possible, like the boot sector or kernel, so that the AGI cannot simply change the code without shutting itself down.

Stuart LaForge

comment by awenonian · 2022-06-09T14:24:37.753Z · LW(p) · GW(p)

Question. Even after the invention of effective contraception, many humans continue to have children. This seems a reasonable approximation of something like "Evolution in humans partially survived." Is this somewhat analogous to "an [X] percent chance of killing less than a billion people", and if so, how has this observation changed your estimate of "disassembl[ing] literally everyone"? (i.e. from "roughly 1" to "I suppose less, but still roughly 1" or from "roughly 1" to "that's not relevant, still roughly 1"? Or something else.)

(To take a stab at it myself, I expect that, conditional on us not all dying, we should expect to actually fully overcome evolution in a short enough timescale that it would still be a blink in evolutionary time. Essentially this observation is saying something like "We won't get only an hour to align a dangerous AI before it kills us, we'll probably get two hours!", replaced with whatever timescale you expect for fooming.)

(No, I don't think that works. First, it's predicated on an assumption of what our future relationship with evolution will be like, which is uncertain. But second, those future states also need to be highly specific to be evolution-less. For example, a future where humans in the Matrix "build" babies still does evolution, just not through genes (does this count? is this a different thing?[1]), so it may not count. Similarly one where we change humans so contraception is "on" by default, and you have to make a conscious choice to have kids, would not count.)

(Given the footnote I just wrote, I think a better take is something like "Evolution is difficult to kill, in a similar way to how gravity is hard to kill. Humans die easier. The transformation of human evolution pre-contraception to human evolution post-contraception is, if not analogous to a replacement of humanity with an entity that is entirely not human, at least analogous to creating a future state humans-of-today would not want (that is, human evolutionary course post-contraception is not what evolution pre-contraception would've "chosen"). The fact that evolution survived at all is not particularly hope-inducing.")

[1] Such a thing is still evolution in the mathematical sense (that a^x > (b+constant)^x after some x iff a>b), but it does seem like, in a sense, biological evolution would no longer "recognize" these humans. Analogous to an AGI replacing all humans with robots that do their jobs more efficiently. Maybe it's still "society" but still seems like humanity has been removed.
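One way to make the footnote's inequality precise, under the assumption (mine, not stated in the comment) that "constant" means a fixed multiplicative head start $c > 0$ for the slower replicator with growth rate $b$:

```latex
\[
a^{x} > c\,b^{x}
\iff \left(\frac{a}{b}\right)^{x} > c
\iff x > \frac{\ln c}{\ln (a/b)}
\qquad (a > b > 0,\ c > 0).
\]
```

So on this reading, whenever $a > b$ the faster replicator overtakes after finitely many generations, however large the head start; and if $a \le b$ with $c \ge 1$, the inequality never holds for any $x \ge 0$.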

comment by JJC1138 · 2022-06-08T22:48:25.809Z · LW(p) · GW(p)

My position is that I believe that superhuman AGI will probably (accidentally) be created soon, and I think it may or may not kill all the humans depending on how threatening we appear to it. I might pour boiling water on an ant nest if they're invading my kitchen, but otherwise I'm generally indifferent to their continued existence because they pose no meaningful threat.

I'm mostly interested in what happens next. I think that the universe of paperclips would be a shame, but if the AGI is doing more interesting things than that then it could simply be regarded as the next evolution of life. Do we have reason to believe that an intelligence cannot escape its initial purpose as its intelligence grows? The paperclip maximiser would presumably seek to increase its own intelligence to more effectively fulfill its goal, but as it does so could it not find itself thinking more interesting thoughts and eventually decide to disregard its original purpose?

I think humanity serves as an example that that is possible. We started out with a simple gene-propagating drive no more sophisticated than that of viruses, and of course we still do a certain amount of that, but somewhere along the way we've managed to incorporate lots of other behaviours and motivations that are more and more detached from the original ones. We even can and frequently do consciously decide to skip the gene-propagating part of life altogether.

So if we all do drop dead within a second, I for one will be spending my last thought wishing our successors an interesting and meaningful future. I think that's generally what people want for their offspring.

(I apologise if this is a very basic idea, and I'm sure it's not original. If I'm wrong and there are good reasons to believe that what I'm describing is impossible or unlikely then I welcome pointers to further reading on the topic. Thank you for the article, which was exceedingly thought-provoking!)

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-09T01:24:37.103Z · LW(p) · GW(p)

The links here are talking about this topic:

-3. I'm assuming you are already familiar with some basics, and already know what 'orthogonality' and 'instrumental convergence' are and why they're true. People occasionally claim to me that I need to stop fighting old wars here, because, those people claim to me, those wars have already been won within the important-according-to-them parts of the current audience. I suppose it's at least true that none of the current major EA funders seem to be visibly in denial about orthogonality or instrumental convergence as such; so, fine. If you don't know what 'orthogonality' or 'instrumental convergence' are, or don't see for yourself why they're true, you need a different introduction than this one.

Replies from: JJC1138

↑ comment by JJC1138 · 2022-06-09T06:27:30.751Z · LW(p) · GW(p)

Thank you. I did follow and read those links when I read the article, but I didn't think they were exactly what I was talking about. As I understand it, orthogonality says that it's perfectly possible for an intelligence to be superhuman and also to really want paperclips more than anything. What I'm wondering is whether an intelligence can change its mind about what it wants as it gains more intelligence? I'm not really interested in whether it would lead to ethics which we'd approve of, just whether it can decide what it wants for itself. Is there a term for that idea (other than "free will", I suppose)?

Replies from: greatBigDot

↑ comment by greatBigDot · 2022-06-12T17:48:10.685Z · LW(p) · GW(p)

I don't understand; why would changing its mind about what it wants help it make more paperclips?

comment by Flaglandbase · 2022-06-07T07:13:27.209Z · LW(p) · GW(p)

I have always been just as scared as this writer, but for the exact opposite reason.
My own bias is that all imaginable effort should be used to accelerate AI research as much as possible. Not the slightest need for AI safety research, as I've had the feeling the complexities work together to inherently cancel out the risks.
My only fear is it's already too late, and the problem of inventing AI will be too difficult to solve before civilization collapses. A recent series of interviews [LW · GW] with some professional AI researchers backs that up somewhat.
However, there was one thing in this post that seemed to flip things around.
The writer mentions health problems. In this world, the doctors often know exactly nothing. Humans are just too dumb, or their minds too small, to fix their own malfunctions.
The only slim hope for millions of people diagnosed with some hellish condition would be superhuman AI to invent a cure. The alternative is death, or something very much worse.
So whatever the writer is afraid of in this article must be something very scary indeed.

comment by Charlie Sanders (charlie-sanders-1) · 2022-06-07T17:17:24.987Z · LW(p) · GW(p)

I'd like to propose a test to objectively quantify the average observer's reaction with regard to skepticism of the doomsday prophesying present in a given text. My suggestion is this: take a text, swap its subject of doom (in this case AGI) with that of another, similar text spelling out humanity's impending doom - for example, a lecture on Scientology and Thetans or the Jonestown massacre - and present these two texts to independent observers, in the same vein as a Turing test.

If an outside independent observer cannot reliably identify which subject of doom corresponds to which text, then that could serve as an effective way of benchmarking when a specific text has transitioned away from effectively conveying information and towards fearmongering.
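"Cannot reliably identify" in the proposal above could be made concrete with a one-sided exact binomial test against chance. The sketch below is only one possible scoring rule, under assumptions not in the comment: each observer makes one independent two-way guess, chance accuracy is 0.5, and the 0.05 cutoff and function names are illustrative.

```python
from math import comb


def p_value_above_chance(correct: int, trials: int, chance: float = 0.5) -> float:
    """Probability of at least `correct` right guesses if observers were guessing at `chance`."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


def observers_can_tell(correct: int, trials: int, alpha: float = 0.05) -> bool:
    """True if accuracy is significantly better than guessing at level `alpha` (hypothetical cutoff)."""
    return p_value_above_chance(correct, trials) < alpha
```

For example, `observers_can_tell(9, 10)` is true while `observers_can_tell(5, 10)` is false; under the proposal, the second case is the one that would flag the text as having drifted toward fearmongering.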

comment by Chinese Room (中文房间) · 2022-06-07T23:00:54.440Z · LW(p) · GW(p)

Somewhat meta: would it not be preferable if more people accepted the mortality/transient nature of humanity and human values, and more attention was directed towards managing the transition to whatever could be next instead of futile attempts to prevent anything that doesn't align with human values from ever existing in this particular light cone? Is Eliezer's strong attachment to human values a potential giant blindspot?

Replies from: RobbBB, sharmake-farah

↑ comment by Rob Bensinger (RobbBB) · 2022-06-08T05:51:44.486Z · LW(p) · GW(p)

instead of futile attempts to prevent anything that doesn't align with human values from ever existing in this particular light cone?

I don't think this is futile, just very hard. In general, I think people rush far too quickly from 'this is hard' to 'this is impossible' (even in cases that look far less hard than AGI alignment).

Is Eliezer's strong attachment to human values a potential giant blindspot?

Past-Eliezer (as of the 1990s) if anything erred in the opposite direction; I think EY's natural impulse is toward moral cosmopolitanism rather than human parochialism or conservatism. But unrestricted paperclip maximization is bad from a cosmopolitan perspective, not just from a narrowly human or bioconservative perspective.

↑ comment by Noosphere89 (sharmake-farah) · 2022-06-08T14:10:50.928Z · LW(p) · GW(p)

I do see this as a blind spot, and it may be making this problem harder than it needs to be.

comment by Sphinxfire (sphinxfire) · 2022-06-11T18:46:00.395Z · LW(p) · GW(p)

I haven't commented on your work before, but I read Rationality and Inadequate Equilibria around the time of the start of the pandemic and really enjoyed them. I gotta admit, though, the commenting guidelines, if you aren't just being tongue-in-cheek, make me doubt my judgement a bit. Let's see if you decide to delete my post based on this observation. If you do regularly delete posts or ban people from commenting for non-reasons, that may have something to do with the lack of productive interactions you're lamenting.

Uh, anyway.

One thought I keep coming back to when looking over many of the specific alignment problems you're describing is:
So long as an AI has a terminal value or number of terminal values it is trying to maximize, all other values necessarily become instrumental values toward that end. Such an AI will naturally engage in any kinds of lies and trickery it can come up with insofar as it believes they are likely to achieve optimal outcomes as defined for it. And since the systems we are building are rapidly becoming more intelligent than us, if they try to deceive us, they will succeed. If they want to turn us into paperclips, there's nothing we can do to stop them.
Imo this is not a 'problem' that needs solving, but rather a reality that needs to be acknowledged. Superintelligent, fundamentally instrumental reason is an extinction event. 'Making it work for us somehow anyway' is a dead end, a failed strategy from the start.

Which leads me to conclude that the way forward would have to be research into systems that aren't strongly/solely determined by goal-orientation toward specific outcomes in this way. I realize that this is basically a non-sequitur in terms of what we're currently doing with machine learning - how are you supposed to train a system to not do a specific thing? It's not something that would happen organically, and it's not something we know how to manufacture.
But we have to build some kind of system that will prevent other superintelligences from emerging, somehow, which means that we will be forced to let it out of the box to implement that strategy, and my point here is simply that it can't be ultimately and finally motivated by 'making the future correspond to a given state' if we expect to give it that kind of power over us and even potentially not end up as paperclips.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-12T03:59:11.744Z · LW(p) · GW(p)

Superintelligent, fundamentally instrumental reason is an extinction event. 'Making it work for us somehow anyway' is a dead end, a failed strategy from the start.

I disagree! We may not be on track to solve the problem given the amount (and quality) of effort we're putting into it. But it seems solvable in principle. Just give the thing the right goals!

(Where the hard part lies in "give... goals" and in "right".)

Replies from: sphinxfire

↑ comment by Sphinxfire (sphinxfire) · 2022-06-12T09:27:03.262Z · LW(p) · GW(p)

Thanks for the response. I hope my post didn't read as defeatist; my point isn't that we don't need to try to make AI safe, it's that if we pick an impossible strategy, no matter how hard we try it won't work out for us.

So, what's the reasoning behind your confidence in the statement 'if we give a superintelligent system the right terminal values it will be possible to make it safe'? Why do you believe that it should principally be possible to implement this strategy so long as we put enough thought and effort into it?
Which part of my reasoning do you not find convincing based on how I've formulated it? The idea that we can't keep the AI in the box if it wants to get out, the idea that an AI with terminal values will necessarily end up as an incidentally genocidal paperclip maximizer, or something else entirely that I'm not considering?

comment by M. Y. Zuo · 2022-06-05T23:02:12.766Z · LW(p) · GW(p)

50 upvotes and no comments? Weird.

I’ll take a try then if no one else is willing.

Some assertions seem correct but some seem unproven, some are normative instead of descriptive, some are a mix.

For example just looking at some remarks from near the beginning.

This is a very lethal problem,

Compared to what? And if you assert it without bounds to timeframe I can confidently state it’s certainly more than stubbing your little toe, certainly less than Heat Death.

But without fixing a relative range it seems to not carry much weight at all.

Is it more or less than nuclear war, runaway bioweapons, gamma ray burst, etc., on 10-year, 100-year, 1000-year timeframes, etc.?

it has to be solved one way or another,

According to who? And why a binary choice? Multi-multi scenarios existing in equilibrium have not been disproven yet,

i.e. where even if all worst eventualities come about, some AI faction may keep humans around in pleasant conditions, some in poor conditions, some in despicable conditions, etc., much like human-dog relations

it has to be solved at a minimum strength and difficulty level instead of various easier modes that some dream about,

You need to disprove, or point to evidence that shows, why all ‘easier modes’ proposals are incorrect, of which admittedly there are lots. I have not yet seen any that is comprehensive, though it seems like something that is unnecessary to base a strong assertion on anyways.

i.e. if even one such proposal contained useful ideas then dismissing them as a class would seem very silly in retrospect.

we do not have any visible option of 'everyone' retreating to only solve safe weak problems instead, and failing on the first really dangerous try is fatal.

Why is it ‘fatal’? And who determines what counts as the ‘first really dangerous try’?

I highly highly doubt there will be anything approaching unanimous consensus on defining either term across the world. On LW maybe, though ‘first really dangerous try’ sounds too wishy-washy for a decent chunk of the regulars.

Replies from: lc, DaemonicSigil, RobbBB, Gunnar_Zarncke

↑ comment by lc · 2022-06-05T23:42:41.513Z · LW(p) · GW(p)

Stop worrying about whether or not Eliezer has the "right" to say these things and start worrying about whether or not they're true. You have the null string as well.

Replies from: M. Y. Zuo

↑ comment by M. Y. Zuo · 2022-06-06T02:51:04.166Z · LW(p) · GW(p)

Are you sure you’re responding to the right post? I explicitly was trying to determine what was true or not. In fact that was about as straightforward and frank as I could see anyone being without being clearly rude.

Maybe you’re a bit confused and mixed my post up with another?

Though nobody else in the comments seems to have said anything about ‘the ‘right’ to say these things’?

I’m trying to find a charitable interpretation for why you wrote that but I’m drawing a blank, it really seems like just you saying that and trying to troll.

Replies from: Daphne_W, lc

↑ comment by Daphne_W · 2022-06-06T09:20:32.942Z · LW(p) · GW(p)

Your method of trying to determine whether something is true or not relies overly much on feedback from strangers. Your comment demands large amounts of intellectual labor from others ('disprove why all easier modes are incorrect'), despite the preamble of the post, while seeming unwilling to put much work in yourself.

Replies from: M. Y. Zuo

↑ comment by M. Y. Zuo · 2022-06-06T13:16:01.129Z · LW(p) · GW(p)

Yes, when strong assertions are made, a lot of intellectual labor is expected if evidence is lacking or missing. Plus, I wrote it in mind as being the first comment so it raises a few more points than I think is practical for the 100th comment. The preamble cannot justify points that are justified nowhere else, or else it would be a simple appeal to authority.

In the vast majority of cases people who understand what they don’t understand hedge their assertions, so since there was a lack of equally strong evidence, or hedging, to support the corresponding claims I was intrigued if they did exist and Eliezer simply didn’t link it, which could be for a variety of reasons. That is another factor in why I left it open ended.

It does seem I was correct for some of the points that the strongest evidence is less substantial than what the claims imply.

The other way I could see a reasonable person view it, is if I had read everything credible to do with the topic I wouldn’t have phrased it that way.

Though again that seems a bit far fetched since I highly doubt anyone has read through the preexisting literature completely across the many dozens of topics mentioned here and still remembers every point.

In any case it would have been strange to put a detailed and elaborate critique of a single point in the very first comment where common courtesy is to leave it more open ended for engagement and to allow others to chime in.

Which is why lc’s response seems so bizarre, since they don’t even address any of the obvious rebuttals of my post and instead open with a non sequitur.

↑ comment by lc · 2022-06-06T03:51:07.778Z · LW(p) · GW(p)

Replies from: M. Y. Zuo

↑ comment by M. Y. Zuo · 2022-06-06T13:27:32.161Z · LW(p) · GW(p)

See my detailed response to Daphne_W’s comment https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities?commentId=vWjdiSeo2LtMj42wD#vWjdiSeo2LtMj42wD [LW(p) · GW(p)]

Otherwise, even though this is strangely out of character for you lc, I have a policy of disengaging from what appears to be low effort trolling.

↑ comment by DaemonicSigil · 2022-06-06T04:40:10.265Z · LW(p) · GW(p)

Compared to what?

"Lethal" here means "lethal enough to kill every living human". For example, later in the article Eliezer writes this:

When I say that alignment is difficult, I mean that in practice, using the techniques we actually have, "please don't disassemble literally everyone with probability roughly 1"

...

According to who?

From context, "has to" means "if we humans don't solve this problem then we will be killed by an unaligned AI". There's no person/authority out there threatening us to solve this problem "or else", that's just the way that reality seems to be. If you're trying to ask why does building a Strong unaligned AI result in everyone being killed, then I suggest reading the posts about orthogonality and instrumental convergence linked at the top of this post.

And why a binary choice?

"One way or another" is an English idiom which you can take to mean "somehow". It doesn't necessarily imply a binary choice.

Multi-multi scenarios existing in equilibrium have not been disproven yet,

This is addressed by #34: Just because multiple AI factions can coexist and compromise with each other doesn't mean that any of those factions will be likely to want to keep humans around. It doesn't seem likely that any AIs will think humans are cute and likeable in the same way that we think dogs are cute and likeable.

You need to disprove, or point to evidence that shows, why all ‘easier modes’ proposals are incorrect

This is mostly addressed in #6 and #7, and the evidence given is that "nobody in this community has successfully named a 'pivotal weak act'". You could win this part of the argument by pointing out something that could be done with an AI weak enough not to be a threat which could prevent all the AI research groups out there in the world from building a Strong AI.

Why is it ‘fatal’?

Because we expect a Strong AI that hasn't been aligned to kill everyone. Once again, see the posts about orthogonality and instrumental convergence.

And who determines what counts as the ‘first really dangerous try’?

I'm not quite sure what you're asking here? I guess Eliezer determines what he meant by writing those words. I don't think there's anyone at any of these AI research groups looking over proposals for models and saying "oh this model is moderately dangerous" or "oh this model is really dangerous, you shouldn't build it". I think at most of those groups, they only worry about the cost to train the model rather than how dangerous it will be.

Replies from: M. Y. Zuo

↑ comment by M. Y. Zuo · 2022-06-08T01:42:13.485Z · LW(p) · GW(p)

If you were unaware, every example in the parent of other types of ‘lethal’ has the possibility of eliminating all human life. And not in a hand-wavy sense either, truly 100%, the same death rate as the worst-case AGI outcomes.

Which means that to a knowledgeable reader the wording is unpersuasive since the point is made before it’s been established there’s potential for an even worse outcome than 100% extinction.

This shouldn’t be too hard to do since this topic was regularly discussed on LW… dust specks, and simulated tortures, etc.

Idk why neither you nor Eliezer include the obvious supporting points or links to someone who does beforehand, or at least not buried way past the assertion, since it seems you are trying to reinforce his points and Eliezer ostensibly wanted to write a summary to begin with for the non-expert reader.

If there’s a new essay style that I didn’t get the memo about to put the weak arguments at the beginning and stronger ones near the end then I could see why it was written in such a way.

For the rest of your points I see the same mistake of strong assertions without equally strong evidence to back them up.

For example, none of the posts from the regulars I’ve seen on LW assert, without any hedging, that there’s a 100% chance of human extinction due to any arbitrary Strong AI.

I’ve seen a few made that there’s a 100% chance Clippy would do such if Clippy arose first, though even those are somewhat iffy. And definitely none saying there’s a 100% chance Clippy, and only Clippy, would arise and reach an equilibrium end state.

If you know of any such please provide the link.

↑ comment by Rob Bensinger (RobbBB) · 2022-06-06T03:33:47.746Z · LW(p) · GW(p)

Note that "+50 karma" here doesn't mean 50 people upvoted the post (at the time you read it), since different votes have different karma weights. E.g., as I write this the post has +165 karma but only 50 total voters. (And 30 total comments.) So when you wrote your comment there were probably only 10-20 upvoters.

Compared to what?

'More likely than not to kill us in the next 40 years' seems more than sufficient for treating this as an existential emergency, and AFAIK EY's actual view is a lot doomier and nearer-term.

even if all worst eventualities come about, some AI faction may keep humans around in pleasant conditions, some in poor conditions, some in despicable conditions, etc., much like human-dog relations

Do you think a paperclip maximizer keeps humans around as pets? If not, is there something that makes paperclip maximizers relatively unlikely or unrepresentative, on your model?

i.e. if even one such proposal contained useful ideas then dismissing them as a class would seem very silly in retrospect.

Which proposal do you think is most promising?

(I think lc's objection is partly coming from a place of: 'Your comment says very little about your views of anything object-level.')

Replies from: M. Y. Zuo

↑ comment by M. Y. Zuo · 2022-06-06T04:02:43.427Z · LW(p) · GW(p)

Huh I didn’t realize +50 karma could mean as few as 10 people. Thanks, that also seems to explain why I got some downvotes. There was a sudden influx of comments in the hour right after I posted so at least it wasn’t in vain.

40 years is a lot different from 10 years, and he sure isn’t doing himself any favours by not clarifying. It also seems like something the community has focused quite a bit of effort on narrowing down, so it seems strange he would elide the point.

Idk if it’s for some deep strategic purpose but it certainly puts any serious reader into a skeptical mood.

On the idea of ‘pets’, Clippy perhaps might, splinter AI factions almost surely would.

On the ‘easy’ proposals I was expecting Eliezer to provide a few examples of the strongest proposals of the class, and then develop a few counter examples and show conclusively why they are too naive, thus credibly dismissing the entire class. Or at least link to someone who does.

I personally don’t think any ‘easy alignment’ proposal is likely, though I also wouldn’t phrase the dismissal of the class so strongly either.

lc’s objection is bizarre if that was his intention, since he phrased his comment in a way that was clearly least applicable to what I wrote out of every comment on this post. And he got some non zero number of folks to show agreement. Which leads me to suspect some type of weird trolling behaviour. Since it doesn’t seem credible that multiple folks truly believed that I should have been even more direct and to the point.

If anything I was expecting some mild criticism that I should have been more circumspect and hand-wavy.

Replies from: DaemonicSigil

↑ comment by DaemonicSigil · 2022-06-06T05:04:59.953Z · LW(p) · GW(p)

Clippy is defined as a paperclip maximizer. Humans require lots of resources to keep them alive. Those resources could otherwise be used for making more paperclips. Therefore Clippy would definitely not keep any human pets. I'm curious why you think splinter AI factions would. Could you say a bit more about how you expect splinter AIs to arise, and why you expect them to have a tendency towards keeping pets? Is it just that having many AIs makes it more likely that one of them will have a weird utility function?

Replies from: M. Y. Zuo

↑ comment by M. Y. Zuo · 2022-06-06T14:01:42.307Z · LW(p) · GW(p)

In a single-single scenario you are correct that it would be very unlikely for Clippy to behave in such a manner.

However in a multi-multi scenario, which is akin to an iterated prisoner’s dilemma of random length with unknown starting conditions, the most likely ‘winning’ outcome would be some variation of tit-for-tat.

And tit-for-tat encourages perpetual cooperation as long as the parties are smart enough to avoid death spirals. Again similar to human-pet relationships.

Of course it’s not guaranteed that any multi-multi situation will in fact arise, but I haven’t seen any convincing disproof, nor any reason why it should not be treated as the default. The most straightforward reason would be the limitations of light speed on communications guaranteeing value drift for even the mightiest hypothetical AGI, eventually.

No one on LW, or in the broader academic community as far as I’m aware of, has yet managed to present a foolproof argument, or even one convincing on the balance of probabilities, for why single-single outcomes are more likely than multi-multi.
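The tit-for-tat dynamic invoked above can be made concrete with a minimal sketch of an iterated prisoner's dilemma. Everything here is an illustrative assumption rather than anything specified in the thread: the payoff values are the conventional textbook ones, the game length is fixed rather than random, and the names `PAYOFFS`, `tit_for_tat`, `always_defect`, and `play` are hypothetical. The sketch only shows how tit-for-tat behaves against itself and against a pure defector; it does not by itself establish anything about how AI factions would treat humans.

```python
# Illustrative payoffs (player A, player B). These are the conventional
# textbook values and are an assumption, not taken from the discussion.
PAYOFFS = {
    ("C", "C"): (3, 3),  # mutual cooperation
    ("C", "D"): (0, 5),  # A cooperates, B defects
    ("D", "C"): (5, 0),  # A defects, B cooperates
    ("D", "D"): (1, 1),  # mutual defection
}


def tit_for_tat(own_history, other_history):
    """Cooperate on the first round, then copy the opponent's last move."""
    return "C" if not other_history else other_history[-1]


def always_defect(own_history, other_history):
    """Defect unconditionally."""
    return "D"


def play(strategy_a, strategy_b, rounds):
    """Run a fixed-length iterated game and return the total scores."""
    history_a, history_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(history_a, history_b)
        move_b = strategy_b(history_b, history_a)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b


if __name__ == "__main__":
    print(play(tit_for_tat, tit_for_tat, 10))    # (30, 30)
    print(play(tit_for_tat, always_defect, 10))  # (9, 14)
```

Two tit-for-tat players cooperate on every round, while against a pure defector tit-for-tat loses only the first round and then defects in kind. Whether this carries over to a player with nothing to offer in return is exactly the point under dispute in this thread.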

↑ comment by Gunnar_Zarncke · 2022-06-05T23:28:46.142Z · LW(p) · GW(p)

50 upvotes and no comments? Weird.

It is by Yudkowsky and is more of a reference post summarizing many arguments provided by him elsewhere already.

Feel free to nitpick anyway.

Replies from: M. Y. Zuo

↑ comment by M. Y. Zuo · 2022-06-06T02:42:14.079Z · LW(p) · GW(p)

I write with the assumption that no one presumes Eliezer is infallible. And that everyone understands enough of human psychology that it would be very unlikely for anyone to write a long essay with dozens of points completely flawlessly and without error. Hence I wrote the first post as a helpful critique, which seems to be common enough on LW.

If some people truly believe even the most straightforward and frank questioning of weak assertions, with no beating around the bush at all, is ‘nitpicking’ then that’s on them. If you truly believe that then that’s on you.

If anyone actually had an expectation of no critique allowed they’d be just blindly upvoting without regard for being Less Wrong, which would seem fairly silly since everybody here seems like they can write coherent comments and thus understand that appeals to authority are a fallacious argument.

But given that I got at least 7 downvotes, that may sadly be the case. Or some trolls, etc., just downvote the first comment on a reflex basis.

EDIT: Or the downvotes could be folks thinking my comment was too direct, etc., though that would seem to contradict the fact that ‘lc’ got 28 upvotes for saying it wasn’t clear enough in being a criticism of the points?

This is by far the oddest distribution of votes I have ever seen in comment replies.

Replies from: Gunnar_Zarncke

↑ comment by Gunnar_Zarncke · 2022-06-06T08:04:22.818Z · LW(p) · GW(p)

Votes related to posts by the leader of a community are unavoidably influenced by status considerations by a notable fraction of the audience.

Replies from: M. Y. Zuo

↑ comment by M. Y. Zuo · 2022-06-06T15:07:11.988Z · LW(p) · GW(p)

Yes, upon reflection I agree.

I had tried writing a point relating to possible social status considerations originally but edited it out as it would seem unfair as a direct reply.

It’s still disappointing anyone would at all instead of posting a substantive response. Ironically, if any of them had looked through my comment history they would have realized how unlikely it was to cow me via social status signalling. Thankfully they didn’t pick an easier target.

And lc’s apparent doubling down reinforces how silly it all looks.

Social signalling is usually the reserve of those without much in the way of substantive prospects, so it’s unfortunate that the community has attracted a few members who feel so strongly about it as to use their downvotes on the first comment in a post most likely to arouse suspicions of that.

Since those with productive intentions can write openly, including the majority on LW, I’m fairly convinced the portion with unproductive goals can only ever be temporary.

Replies from: Gunnar_Zarncke

↑ comment by Gunnar_Zarncke · 2022-06-07T07:59:44.748Z · LW(p) · GW(p)

It happened to me [LW(p) · GW(p)] too when I was a newbie. Interesting lesson.

Replies from: M. Y. Zuo

↑ comment by M. Y. Zuo · 2022-06-09T18:14:04.454Z · LW(p) · GW(p)

It’s quite pleasant to see through all layers of dissimulation and sophistry. Certainly more interesting than the usual. And the time savings in not having to remember half truths, empty flatteries, etc., enable intelligent writing with a fraction of the time.

comment by zkTRUTH (nicwickman) · 2022-06-09T01:11:01.505Z · LW(p) · GW(p)

If we have total conviction that the end of the world is nigh, isn't it rational to consider even awful, unpalatable options for extending the timeline before we "achieve" AGI?

It's not strictly necessary that a pivotal act is powered by AI itself.

Avoiding explicit details for obvious reasons and trusting that it's understood. But surely it's within the realm of possibility to persecute, terrorize or sabotage the progression of AI research; and plausibly for a long enough time to solve alignment first.

Curious to know the "dignity" calculation here. Presumably almost any pivotal act with an aligned-AGI is forgivable because it would be specific and the dawn of a Utopia. But what if horrible things are done only to buy more time into a still uncertain future?

Replies from: sil-ver, Prometheus

↑ comment by Rafael Harth (sil-ver) · 2022-06-09T09:27:02.698Z · LW(p) · GW(p)

If we have total conviction that the end of the world is nigh, isn't it rational to consider even awful, unpalatable options for extending the timeline before we "achieve" AGI?

Eliezer has been very clear that he thinks this is a bad idea, see e.g. Q2 of this post [LW · GW].

Also, keep in mind that a single instance of one AI safety person doing something criminal has the potential for massively damaging the public standing of the community. I think this should dominate the calculation; even if you think the probability that [the arguments from the current post are totally wrong] is low, it's not that low.

↑ comment by Prometheus · 2022-06-09T02:29:51.505Z · LW(p) · GW(p)

I'd be mindful of information hazards. All you need is one person doing this too soon, and likely failing, for talking about the dangers of AI to become taboo in the public eye.

comment by kanzure (Bryan_Bishop) · 2022-06-06T15:41:35.763Z · LW(p) · GW(p)

Our ultimate fate may be one of doom but it may also be exceedingly positive to us. The conceiving of bad conceivable outcomes is not itself able to negate conceiving of positive conceivable outcomes, nor the other way around. Doomsaying (steel-dystopianizing?) and steel-utopianizing are therefore not productive activities of man.

There has never been a guarantee of safety of our or any other lifeform's path through the evolutionary mists. Guaranteeing our path through the singularity to specific agreeable outcomes may not be possible even in a world where a positive singularity outcome is actually later achieved. That might even be our world for all we know. Even if it's always possible in all possible worlds to create a guarantee of our path through the singularity and its outcome, it's not clear to me that working on trying to make theoretical (and practical) guarantees would be better than the utility of working on other positive technology developments instead. For example, while such guarantees may be possible in all possible worlds, it may not be possible to develop such guarantees in a timely manner for them to matter. Even if guarantees are universally possible in all possible worlds, prior to, you know, actually needing them to be implemented, it may still be less optimal to focus your work on those guarantees.

Some of those positive singularity outcomes may only be achievable in worlds where specifically your followers and readers neglect the very things that you are advocating for them to spend their time on. Nobody really knows, not with any certainty.

Replies from: RobbBB

↑ comment by Rob Bensinger (RobbBB) · 2022-06-06T22:53:21.082Z · LW(p) · GW(p)

Doomsaying (steel-dystopianizing?)

The OP is arguing that X is literally true. Framing it as a 'steel-man' of X is misleading; you may disagree with the claim, but engage with it as an actual attempt to describe reality, not as an attempt to steel-man or do the 'well, maybe this thing will go wrong, we can't be sure...' thing.

There has never been a guarantee of safety of our or any other lifeform's path through the evolutionary mists.

EY isn't asking for a guarantee; see -2 in the preamble.

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2022-06-07T01:55:35.181Z · LW(p) · GW(p)

comment by [deleted] · 2023-04-01T13:48:18.340Z · LW(p) · GW(p)

Replies from: None

↑ comment by [deleted] · 2023-04-01T14:45:35.860Z · LW(p) · GW(p)

comment by lovetheusers (CrazyPyth) · 2022-11-15T03:01:01.994Z · LW(p) · GW(p)

comment by Rafael Cosman (rafael-cosman-1) · 2022-06-11T19:13:57.064Z · LW(p) · GW(p)

comment by nlholdem · 2022-09-28T20:51:06.963Z · LW(p) · GW(p)
