I read Eliezer's response as basically "Yes, in the following sense: I would certainly have learned very new and very exciting facts about intelligence..."
I prefer Eliezer's response over just saying "yes", because there's ambiguity in what it means to be a "crux" here, and because "agentic" in Richard's question is an unclear term.
I wish EY could stop saying "pivotal act" long enough to talk about why he thinks intelligence implies an urge for IRL agenticness.
I don't know what you mean by "intelligence" or "an urge for IRL agenticness" here, but I think the basic argument for 'sufficiently smart and general AI will behave as though it is consistently pursuing goals in the physical world' is that sufficiently smart and general AI will (i) model the physical world, (ii) model chains of possible outcomes in the physical world, and (iii) be able to search for policies that make complex outcomes much more or less likely. If that's not sufficient for "IRL agenticness", then I'm not sure what would be sufficient or why it matters (for thinking about the core things that make AGI dangerous, or make it useful).
Talking about pivotal acts then clarifies what threshold of "sufficiently smart" actually matters for practical purposes. If there's some threshold where AI becomes smart and general enough to be "in-real-life-agentic", but this threshold is far above the level needed for pivotal acts, then we mostly don't have to worry about "in-real-life agenticness".
Or at least, define the term "pivotal act" and explain why he says it so much.
Once again Yudkowsky could have agreed or disagreed or corrected, but confusingly chooses "none of the above":
What do you find confusing about it? Eliezer is saying that he's not making a claim about what's possible in principle, just about what's likely to be reached by the first AGI developers. He then answers the question here (again, seems fine to me to supply a "Yes, in the following sense:"):
I think that obvious-to-me future outgrowths of modern ML paradigms are extremely liable to, if they can learn how to do sufficiently superhuman X, generalize to taking over the world. How fast this happens does depend on X. It would plausibly happen relatively slower (at higher levels) with theorem-proving as the X, and with architectures that carefully stuck to gradient-descent-memorization over shallow network architectures to do a pattern-recognition part with search factored out (sort of, this is not generally safe, this is not a general formula for safe things!); rather than imposing anything like the genetic bottleneck you validly pointed out as a reason why humans generalize. Profitable X, and all X I can think of that would actually save the world, seem much more problematic.
Expressing a thought in your own words can often be clearer than just saying "Yes" or "No"; e.g., it will make it more obvious whether you misunderstood the intended question.
We have now received the first partial run that meets our quality bar. The run was submitted by LessWrong user Vanilla_cabs. Vanilla's team is still expanding the run (and will probably fix some typos, etc. later), but I'm providing a copy of it here with Vanilla's permission, to give others an example of the kind of thing we're looking for:
Vanilla's run is currently 266 steps long. Per the Visible Thoughts Project FAQ, we're willing to pay authors $20 / step for partial runs that meet our quality bar (up to at least the first 5,000 total steps we're sent), so the partial run here will receive $5320 from the prize pool (though the final version will presumably be much longer and receive more; we expect a completed run to be about 1000 steps).
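As a sanity check on the payout arithmetic (a trivial sketch; the per-step rate and step count are the ones stated above):

```python
# Payment owed for a partial run at the posted rate.
rate_per_step = 20   # dollars per step, per the Visible Thoughts Project FAQ
steps_so_far = 266   # current length of Vanilla_cabs's partial run
print(rate_per_step * steps_so_far)  # → 5320
```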
Vanilla_cabs is open to doing paid consultation for anyone who's working on this project. So if you want feedback from someone who understands our quality bar and can demonstrably pass it, contact Vanilla_cabs via their LessWrong profile.
(Daniel Dennett's book Darwin's Dangerous Idea does a good job I think of imparting intuitions about the 'Platonic inevitability' of it.)
Possibly when Richard says "evolutionary theory" he means stuff like 'all life on Earth has descended with modification from a common pool of ancestors', not just 'selection is a thing'? It's also an empirical claim that any of the differences between real-world organisms in the same breeding population are heritable.
how do you get some substance into every human's body within the same 1 second period? Aren't a bunch of people e.g. in the middle of some national park, away from convenient air vents? Is the substance somehow everywhere in the atmosphere all at once?
I think the intended visualization is simply that you create a very small self-replicating machine, and have it replicate enough times in the atmosphere that every human-sized organism on the planet will on average contain many copies of it.
One of my co-workers at MIRI comments:
(further conjunctive detail for visualizer-plausibility: most of your replication time is in all the doublings before the last doubling, and in particular you can make a shitload in a pretty small space before launching it into the jet stream to disperse. the jet stream can be used to disperse stuff throughout the atmosphere (and it can use solar radiation, at least, to keep reproducing). it could in principle be powered and do minor amounts of steering.
example things the [AGI] who has no better plan than this paltry human-conceivable plan has to think about are "how does the time-cost of making sure [I hit the people] at the south pole base and [on] all the cruise liners and in all the nuclear submarines, trade off against the risk-cost of leaving that fragment of humanity alive", etc.)
When I look at factory-farmed animals, I feel awful for them. So coming into this, I have some expectation that my eventual understanding of consciousness, animal cognition, and morality (C/A/M) will add up to normalcy (i.e. not net positive for many animals).
'It all adds up to normality' doesn't mean 'you should assume your initial intuitions and snap judgments are correct even in cases where there's no evolutionary or physical reason for the intuition/judgment to be correct'. It means 'reductive explanations generally have to recapture the phenomenon somehow'. Here, the phenomenon is a feeling of your brain, and 'that feeling is just anthropomorphism' recaptures the phenomenon perfectly, regardless of whether animals are conscious, what their inner life is like (if they're conscious), etc.
I agree with the claim 'my gut reaction is that factory-farmed pigs suffer a lot'. I disagree with the claim 'my gut reaction is that factory-farmed pigs would be better off not existing'. I think that's a super different claim, and builds in a lot more theory and deliberative reasoning (though it may feel obvious once it's been cached long enough).
I do think that that gut reaction is important information
I just disagree. I think it's not important at all, except insofar as it helps us notice the hypothesis that life might be terrible, net-negative, etc. for chickens in factory farms.
E.g., a lot of people seem to think that chickens are obviously conscious, but that ants aren't obviously conscious (or even that they're obviously not conscious). This seems like an obviously silly position to me, unless the person has a very detailed, well-supported, predictive model of consciousness that makes that prediction. In this case, I think that going through the imaginative exercise of anthropomorphizing ants could be quite epistemically useful, to make it more salient that this really is a live possibility.
But no, I don't think the imaginative exercise actually gives us Bayesian evidence about what's going on inside ants' brains — it's purely 'helping correct for a bias that made us bizarrely neglect a hypothesis a superintelligence would never neglect'; the way the exercise plays out in one's head doesn't covary with ant consciousness across possible worlds. And exactly the same is true for chickens.
Note that there might be other crucial factors in assessing whether 'more factory farming' or 'less factory farming' is good on net — e.g., the effect on wild animals, including indirect effects like 'factory farming changes the global climate, which changes various ecosystems around the world, which increases/decreases the population of various species (or changes what their lives are like)'.
It then matters a lot how likely various wild animal species are to be moral patients, whether their lives tend to be 'worse than death' vs. 'better than death', etc.
The number would be much higher than 60% on strictly utilitarian grounds, but humans aren't strict utilitarians and it makes sense for people working hard on improving animal lives to develop strong feelings about their own personal relationship to factory farming, or to want to self-signal their commitment in some fashion.
I do think that most of EA's distinctive moral views are best understood as 'moves in the direction of utilitarianism' relative to the typical layperson's moral intuitions. This is interesting because utilitarianism seems false as a general theory of human value (e.g., I don't reflectively endorse being perfectly morally impartial between my family and a stranger). But utilitarianism seems to get one important core thing right, which is 'when the stakes are sufficiently high and there aren't complicating factors, you should definitely be impartial, consequentialist, scope-sensitive, etc. in your high-impact decisions'; the weird features of EA morality seem to mostly be about emulating impartial benevolent maximization in this specific way, without endorsing utilitarianism as a whole.
Like, an interest in human challenge trials is a very recognizably ‘EA-moral-orientation’ thing to do, even though it’s not a thing EAs have traditionally cared about — and that’s because it’s thinking seriously, quantitatively, and consistently about costs and benefits, it’s consequentialist, it’s impartially trying to improve welfare, etc.
There’s a general, very simple and unified thread running through all of these moral divergences AFAICT, and it’s something like ‘when choices are simultaneously low-effort enough and high-impact enough, and don’t involve severe obvious violations of ordinary interpersonal ethics like "don’t murder", utilitarianism gets the right answer’. And I think this is because ‘impartially maximize welfare’ is itself a simple idea, and an incredibly crucial part of human morality.
I'd guess the most controversial part of this post will be the claim 'it's not incredibly obvious that factory-farmed animals (if conscious) have lives that are worse than nonexistence'?
But I don't see why. It's hard to be confident of any view on this, when we understand so little about consciousness, animal cognition, or morality. Combining three different mysteries doesn't tend to create an environment for extreme confidence — rather, you end up even more uncertain in the combination than in each individual component.
And there are obvious (speciesist) reasons people would tend to put too much confidence in 'factory-farmed animals have net-negative lives'.
E.g., when we imagine the Holocaust, we imagine relatively rich and diverse experiences, rather than reducing concentration camp victims to a very simple thing like 'pain in the void'.
I would guess that humans' nightmarish experience in concentration camps was usually better than nonexistence; and even if you suspect this is false, it seems easy to imagine how it could be true, because there's a lot more to human experience than 'pain, and beyond that pain, darkness'. It feels like a very open question in the human case.
But just because chickens lack some of the specific faculties humans have, doesn't mean that (if conscious) chicken minds are 'simple', or simple in the particular ways people tend to assume. In particular, it's far from obvious (and depends on contingent theories about consciousness and cognition) that you need human-style language or abstraction in order to have 'rich' experience that just has a lot of morally important stuff going on. A blank map doesn't correspond to a blank territory; it corresponds to a thing we know very little about.
(For similar reasons, I think EAs in general worry far too little about whether chickens and other animals are utility monsters — this seems like a very live hypothesis to me, whether factory-farmed chickens have net-positive lives or net-negative ones.)
I haven't done anything like a careful analysis, but at a guess, this shift has some promise for unifying the classical split between epistemic and instrumental rationality. Rationality becomes the art of seeking interaction with reality such that your anticipations keep synching up more and more exactly over time.
"Unifying epistemic and instrumental rationality" doesn't seem desirable to me — winning and world-mapping are different things. We have to choose between them sometimes, which is messy, but such is the nature of caring about more than one thing in life.
World-mapping is also a different thing from prediction-making, though they're obviously related in that making your brain resemble the world can make your brain better at predicting future states of the world — just fast forward your 'map' and see what it says.
The two can come apart, e.g., if your map is wrong but coincidentally gives you the right answer in some particular case — like a clock that's broken and always says it's 10am, but you happen to check it at 10am. Then you're making an accurate prediction on the basis of something other than having an accurate map underlying that prediction. But this isn't the sort of thing to shoot for, or try to engineer; merely accurate predictiveness is a diminished version of world-mapping.
All of this is stuff that (in some sense) we know by experience, sure. But the most fundamental and general theory we use to make sense of truth/accuracy/reasoning needn't be the earliest theory we can epistemically justify, or the most defensible one in the face of Cartesian doubts.
Earliness, foundationalness, and immunity-to-unrealistically-extreme-hypothetical-skepticism are all different things, and in practice the best way to end up with accurate and useful foundations (in my experience) is to 'build them as you go' and refine them based on all sorts of contingent and empirical beliefs we acquire, rather than to impose artificial earliness or un-contingent-ness constraints.
I don't use microCOVID much. Two things I'd like from the site:
A simple, reasonable, user-friendly tool for non-rationalists I know who are more worried about COVID than me (e.g., family).
A tool I can use if a future strain arises that's a lot more scary. Something fast and early, that updates regularly as new information comes in.
The latter goal seems more useful in general, and my sense is that microCOVID isn't currently set up to do that kind of thing -- the site currently says "Not yet updated for the Omicron variant", over a month in.
For the latter goal, updating fast matters more than meticulously citing sources and documenting all your reasoning. I see less need for a 'GiveWell of microCOVID' (that carefully defends every claim), and more value in a sort of Bayesian approach where you take the individuals with the best forecasting track record on COVID-related things, ask for their take on all the uncertain parameters, and then let people pick their favorite forecaster (or favorite aggregation method) from a dropdown menu.
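A minimal sketch of the kind of aggregation I have in mind (the estimates are invented for illustration, and geometric-mean-of-odds is just one reasonable pooling rule among several):

```python
# Pool several forecasters' probability estimates for one uncertain
# parameter (say, some property of a new variant) by taking the
# geometric mean of their odds, then converting back to a probability.
def pool_geometric_odds(probs):
    odds = [p / (1 - p) for p in probs]
    geo = 1.0
    for o in odds:
        geo *= o
    geo **= 1.0 / len(odds)
    return geo / (1 + geo)

# Hypothetical estimates from three forecasters with good track records:
estimates = [0.60, 0.70, 0.80]
print(round(pool_geometric_odds(estimates), 3))  # → 0.707
```

A dropdown for "favorite aggregation method" could then just swap in a different pooling function (simple mean, weighted-by-track-record mean, etc.).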
Buddhism is a huge part of Joshin's life (which seems fine to note), but if there's an implied argument 'Buddhism is causally responsible for this style of discourse', 'all Buddhists tend to be like this', etc., you'll have to spell that out more.
Firstly, I (partially?) agree that the current DL paradigm isn't strongly alignable (in a robust, high certainty paradigm), we may or may not agree to what extent it is approximately/weakly alignable.
I don't know what "strongly alignable", "robust, high certainty paradigm", or "approximately/weakly alignable" mean here. As I said in another comment:
There are two problems here:
Problem #1: Align limited task AGI to do some minimal act that ensures no one else can destroy the world with AGI.
Problem #2: Solve the full problem of using AGI to help us achieve an awesome future.
Problem #1 is the one I was talking about in the OP, and I think of it as the problem we need to solve on a deadline. Problem #2 is also indispensable (and a lot more philosophically fraught), but it's something humanity can solve at its leisure once we've solved #1 and therefore aren't at immediate risk of destroying ourselves.
If you have enough time to work on the problem, I think basically any practical goal can be achieved in CS, including robustly aligning deep nets. The question in my mind is not 'what's possible in principle, given arbitrarily large amounts of time?', but rather 'what can we do in practice to actually end the acute risk period / ensure we don't blow ourselves up in the immediate future?'.
(Where I'm imagining that you may have some number of years pre-AGI to steer toward relatively alignable approaches to AGI; and that once you get AGI, you have at most a few years to achieve some pivotal act that prevents AGI tech somewhere in the world from paperclipping the world.)
The weakly alignable baseline should be "marginally better than humans".
I don't understand this part. If we had AGI that were merely as aligned as a human, I think that would immediately eliminate nearly all of the world's existential risk. (Similarly, I think fast-running high-fidelity human emulations are one of the more plausible techs humanity could use to save the world, since you could then do a lot of scarily impressive intellectual work quickly (including work on the alignment problem) without putting massive work into cognitive transparency, oversight, etc.)
I'm taking for granted that AGI won't be anywhere near as aligned as a human until long after either the world has been destroyed, or a pivotal act has occurred. So I'm thinking in terms of 'what's the least difficult-to-align act humanity could attempt with an AGI?'.
Maybe you mean something different by "marginally better than humans"?
As DL methods are already a success story in partial brain reverse engineering (explicitly in deepmind's case), there's hope for reverse engineering the circuits underlying empathy/love/altruism/etc in humans - ie the approximate alignment solution that evolution found.
I think this is a purely Problem #2 sort of research direction ('we have subjective centuries to really nail down the full alignment problem'), not a Problem #1 research direction ('we have a few months to a few years to do this one very concrete AI-developing-a-new-physical-technology task really well').
I think a common culprit is people misunderstanding Gödel's theorems as blocking more things than they actually do. There's also field-specific folklore — e.g., a lot of traditional academic decision theorists seem to have somehow acquired the belief that you can't assign probabilities to your own actions, on pain of paradox.
I... think that makes more sense? Though Eliezer was saying the field's progress overall was insufficient, not saying 'decision theory good, ML bad'. He singled out e.g. Paul Christiano and Chris Olah as two of the field's best researchers.
I'd argue instead that MIRI bet heavily against connectivism/DL, and lost on that bet just as heavily.
I think this is straightforwardly true in two different ways:
Prior to the deep learning revolution, Eliezer didn't predict that ANNs would be a big deal — he expected other, neither-GOFAI-nor-connectionist approaches to AI to be the ones that hit milestones like 'solve Go'.
MIRI thinks the current DL paradigm isn't alignable, so we made a bet on trying to come up with more alignable AI approaches (which we thought probably wouldn't succeed, but considered high-enough-EV to be worth the attempt).
I don't think this has anything to do with the OP, but I'm happy to talk about it in its own right. The most relevant thing would be if we'd lost a bet like 'we predict deep learning will be too opaque to align'. But we're still just as pessimistic about humanity's ability to align deep nets as ever; so if you think we've hugely underestimated the tractability of aligning deep nets, I'd need to hear more about why. What's the path to achieving astronomically good outcomes, on the assumption that the first AGI systems are produced by 2021-style ML methods?
Problem #1: Align limited task AGI to do some minimal act that ensures no one else can destroy the world with AGI.
Problem #2: Solve the full problem of using AGI to help us achieve an awesome future.
Problem #1 is the one I was talking about in the OP, and I think of it as the problem we need to solve on a deadline. Problem #2 is also indispensable (and a lot more philosophically fraught), but it's something humanity can solve at its leisure once we've solved #1 and therefore aren't at immediate risk of destroying ourselves.
The rhetorical approach of the comment is also weird to me. 'So you've never heard of CIRL?' surely isn't a hypothesis you'd give more weight to than 'You think CIRL wasn't a large advance', 'You think CIRL is MIRI-ish', 'You disagree with me about the size and importance of the alignment problem such that you think it should be a major civilizational effort', 'You think CIRL is cool but think we aren't yet hitting diminishing returns on CIRL-sized insights and are therefore liable to come up with a lot more of them in the future', etc. So I assume the question is rhetorical; but then it's not clear to me what you believe about CIRL or what point you want to make with it.
So you haven't heard of IRL, CIRL, value learning, that whole DL safety track, etc? Or are you outright dismissing them? I'd argue instead that MIRI bet heavily against connectivism/DL, and lost on that bet just as heavily.
This comment and the entire conversation that spawned from it is weirdly ungrounded in the text — I never even mentioned DL. The thing I was expressing was 'relative to the capacity of the human race, and relative to the importance and (likely) difficulty of the alignment problem, very few research-hours have gone into the alignment problem at all, ever; so even if you're pessimistic about the entire space of MIRI-ish research directions, you shouldn't have a confident view that there are no out-of-left-field research directions that could arise in the future to take big bites out of the alignment problem'.
Relative to what I mean by 'reasoning about messy physical environments at all', MuZero and Tesla Autopilot don't count. I could see an argument for GPT-3 counting, but I don't think it's in fact doing the thing.
Making a map of your map is another one of those techniques that seem to provide more grounding but do not actually.
Sounds to me like one of the things Eliezer is pointing at in Hero Licensing:
Look, thinking things like that is just not how the inside of my head is organized. There’s just the book I have in my head and the question of whether I can translate that image into reality. My mental world is about the book, not about me.
You do want to train your brain, and you want to understand your strengths and weaknesses. But dwelling on your biases at the expense of the object level isn't actually usually the best way to give your brain training data and tweak its performance.
I think there's a lesson here that, e.g., Scott Alexander hadn't fully internalized as of his 2017 Inadequate Equilibria review. There's a temptation to "go meta" and find some cleaner, more principled, more objective-sounding algorithm to follow than just "learn lots and lots of object-level facts so you can keep refining your model, learn some facts about your brain too so you can know how much to trust it in different domains, and just keep doing that".
But in fact there's no a priori reason to expect there to be a shortcut that lets you skip the messy unprincipled your-own-perspective-privileging Bayesian Updating thing. Going meta is just a tool in the toolbox, and it's risky to privilege it on 'sounds more objective/principled' grounds when there's neither a theoretical argument nor an empirical-track-record argument for expecting that approach to actually work.
Teaching the low-description-length principles of probability to your actual map-updating system is much more feasible (or at least more cost-effective) than emitting your actual map into a computationally realizable statistical model.
I think this is a good distillation of Eliezer's view (though I know you're just espousing your own view here). And of mine, for that matter. Quoting Hero Licensing again:
STRANGER: I believe the technical term for the methodology is “pulling numbers out of your ass.” It’s important to practice calibrating your ass numbers on cases where you’ll learn the correct answer shortly afterward. It’s also important that you learn the limits of ass numbers, and don’t make unrealistic demands on them by assigning multiple ass numbers to complicated conditional events.
ELIEZER: I’d say I reached the estimate… by thinking about the object-level problem? By using my domain knowledge? By having already thought a lot about the problem so as to load many relevant aspects into my mind, then consulting my mind’s native-format probability judgment—with some prior practice at betting having already taught me a little about how to translate those native representations of uncertainty into 9:1 betting odds.
One framing I use is that there are two basic perspectives on rationality:
Prosthesis: Human brains are naturally bad at rationality, so we can identify external tools (and cognitive tech that's too simple and straightforward for us to misuse) and try to offload as much of our reasoning as possible onto those tools, so as to not have to put weight down (beyond the bare minimum necessary) on our own fallible judgment.
Strength training: There's a sense in which every human has a small AGI (or a bunch of AGIs) inside their brain. If we didn't have access to such capabilities, we wouldn't be able to do complicated 'planning and steering of the world into future states' at all.
It's true that humans often behave 'irrationally', in the sense that we output actions based on simpler algorithms (e.g., reinforced habits and reflex behavior) that aren't doing the world-modeling or future-steering thing. But if we want to do better, we mostly shouldn't be leaning on weak reasoning tools like pocket calculators; we should be focusing our efforts on more reliably using (and providing better training data to) the AGI inside our brains. Nearly all of the action (especially in hard foresight-demanding domains like AI alignment) is in improving your inner AGI's judgment, intuitions, etc., not in outsourcing to things that are way less smart than an AGI.
In practice, of course, you should do some combination of the two. But I think a lot of the disagreements MIRI folks have with other people in the existential risk ecosystem are related to us falling on different parts of the prosthesis-to-strength-training spectrum.
Techniques that give the illusion of objectivity are usually not useless. But to use them effectively, you have to see through the illusion of objectivity, and treat their outputs as observations of what those techniques output, rather than as glimpses at the light of objective reasonableness.
My Eliezer-model thinks that "there will be a complete 4 year interval in which world output doubles, before the first 1 year interval in which world output doubles" is far less than 30% likely, because it's so conjunctive:
It requires that there ever be a one-year interval in which the world output doubles.
It requires that there be a preceding four-year interval in which world output doubles.
So, it requires that the facts of CS be such that we can realistically get AI tech that capable before the world ends...
... and separately, that this capability not accelerate us to superintelligent AI in under four years...
... and separately, that ASI timelines be inherently long enough that we don't incidentally get ASI within four years anyway.
Separately, it requires that individual humans make the basic-AI-research decisions to develop that tech before we achieve ASI. (Which may involve exercising technological foresight, making risky bets, etc.)
Separately, it requires that individual humans leverage that tech to intelligently try to realize a wide variety of large economic gains, before we achieve ASI. (Which may involve exercising technological, business, and social foresight, making risky bets, etc.)
Separately, it requires that the regulatory environment be favorable.
(Possibly other assumptions are required here too, like 'the first groups that get this pre-AGI tech even care about transforming the world economy, vs. preferring to focus on more basic research, or alignment / preparation-for-AGI, etc.')
You could try to get multiple of those properties at once by assuming specific things about the world's overall adequacy and/or about the space of all reachable intelligent systems; but from Eliezer's perspective these views fall somewhere on the spectrum between 'unsupported speculation' and 'flatly contradicted by our observations so far', and there are many ways to try to tweak civilization to be more adequate and/or the background CS facts to be more continuous, and still not hit the narrow target "a complete 4 year interval in which world output doubles" (before AGI destroys the world or a pivotal act occurs).
(I'm probably getting a bunch of details about Eliezer's actual model wrong above, but my prediction is that his answer will at least roughly look like this.)
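To illustrate why conjunctiveness alone drives the probability down (the per-conjunct numbers are invented for illustration, not Eliezer's or mine, and the independence assumption is only approximate):

```python
# Even if each requirement above looks individually likely, a
# conjunction of many roughly-independent requirements is not.
conjunct_probs = [0.8] * 7   # seven requirements, each 80% likely
p_all = 1.0
for p in conjunct_probs:
    p_all *= p
print(round(p_all, 2))  # 0.8**7 ≈ 0.21, already below the 30% threshold
```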
Is this 5 years of engineering effort and then humans leaving it alone with infinite compute?
Maybe something like '5 years of engineering effort to start automating work that qualitatively (but incredibly slowly and inefficiently) is helping with AI research, and then a few decades of throwing more compute at that for the AI to reach superintelligence'?
With infinite compute you could just recapitulate evolution, so I doubt Paul thinks there's a crux like that? But there could be a crux that's about whether GPT-3.5 plus a few decades of hardware progress achieves superintelligence, or about whether that's approximately the fastest way to get to superintelligence, or something.
Do you think that human generality of thought requires a unique algorithm and/or brain structure that's not present in chimps? Rather than our brains just being scaled up chimp brains that then cross a threshold of generality (analogous to how GPT-3 had much more general capabilities than GPT-2)?
I think human brains aren't just bigger chimp brains, yeah.
(Though it's not obvious to me that this is a crux. If human brains were just scaled up chimp-brains, it wouldn't necessarily be the case that chimps are scaled-up 'thing-that-works-like-GPT' brains, or scaled-up pelycosaur brains.)
Does the 'additional miracle' comment make sense if you assume that frame – that AGI will come from something like scaled up versions of current ML systems?
If scaling up something like GPT-3 got you to AGI, I'd still expect discontinuous leaps as the tech reached the 'can reason about messy physical environments at all' threshold (and probably other leaps too). Continuous tech improvement doesn't imply continuous cognitive output to arbitrarily high levels. (Nor does continuous cognitive output imply continuous real-world impact to arbitrarily high levels!)
I think I don't understand Carl's "separate, additional miracle" argument. From my perspective, the basic AGI argument is:
"General intelligence" makes sense as a discrete thing you can invent at a particular time. We can think of it as: performing long chains of reasoning to steer messy physical environments into specific complicated states, in the way that humans do science and technology to reshape their environment to match human goals. Another way of thinking about it is 'AlphaGo, but the game environment is now the physical world rather than a Go board'.
Humans (our only direct data point) match this model: we can do an enormous variety of things that were completely absent from our environment of evolutionary adaptedness, and when we acquired this suite of abilities we 'instantly' (on a geologic timescale) had a massive discontinuous impact on the world.
So we should expect AI, at some point, to go from 'can't do sophisticated reasoning about messy physical environments in general' to 'can do that kind of reasoning', at which point you suddenly have an 'AlphaGo of the entire physical world'. Which implies all the standard advantages of digital minds over human minds, such as:
We can immediately scale AlphaWorld with more hardware, rather than needing to wait for an organism to biologically reproduce.
We can rapidly iterate on designs and make deliberate engineering choices, rather than waiting to stumble on an evolutionarily fit point mutation.
We can optimize the system directly for things like scientific reasoning, whereas human brains can do science only as a side-effect of our EAA capacities.
When you go from not-having an invention to having one, there's always a discontinuous capabilities jump. Usually, however, the jump doesn't have much immediate impact on the world as a whole, because the thing you're inventing isn't a super-high-impact sort of thing. When you go from 0 to 1 on building Microsoft Word, you have a discontinuous Microsoft-Word-sized impact on the world. When you go from 0 to 1 on building AGI, you have a discontinuous AGI-sized impact on the world.
Thinking in the abstract about 'how useful would it be to be able to automate all reasoning about the physical world / all science / all technology?' is totally sufficient to make it clear why this impact would probably be enormous; though if we have doubts about our ability to abstractly reason to this conclusion, we can look at the human case too.
In that context, I find the "separate, additional miracle" argument weird. There's no additional miracle where we assume both AGI and intelligence explosion as axioms. Rather, AGI implies intelligence explosion because the 'be good at reasoning about physical environments in general, constructing long chains of reasoning, strategically moving between different levels of abstraction, organizing your thoughts in a more laserlike way, doing science and technology' thing implies being able to do AI research, for the same reason humans are able to do AI research. (And once AI can do AI research, it's trivial to see why this would accelerate AI research, and why this acceleration could feed on itself until it runs out of things to improve.)
If you believe intelligence explosion is a thing but don't think AGI is a thing, then sure, I can put myself in a mindset where it's weird to imagine two different world-changing events happening at around the same time ('I've already bought into intelligence explosion; now you want me to also buy into this crazy new thing that's supposed to happen at almost the exact same time?!').
But this reaction seems to require zooming out to the level of abstraction 'these are two huge world-impacting things; two huge world-impacting things shouldn't happen at the same time!'. The entire idea of AGI is 'multiple world-impacting sorts of things happen simultaneously'; otherwise we wouldn't call it 'general', and wouldn't talk about getting the capacity to do particle physics and pharmacology and electrical engineering simultaneously.
The fact that, e.g., AIs are mastering so much math and language while still wielding vastly infrahuman brain-equivalents, and crossing human competence in many domains (where there was ongoing effort) over decades, is significant evidence for something smoother than the development of modern humans and their culture.
I agree with this as a directional update — it's nontrivial evidence for some combination of (a) 'we've already figured out key parts of reasoning-about-the-physical-world, and/or key precursors' and (b) 'you can do a lot of impressive world-impacting stuff without having full general intelligence'.
But I don't in fact believe on this basis that we already have baby AGIs. And if the argument isn't 'we already have baby AGIs' but rather 'the idea of "AGI" is wrong, we're going to (e.g.) gradually get one science after another rather than getting all the sciences at once', then that seems like directionally the wrong update to make from Atari, AlphaZero, GPT-3, etc. E.g., we don't live in a Hanson-esque world where AIs produce most of the scientific progress in biochemistry but the field has tried and failed for years to make serious AI-mediated progress on aerospace engineering.
"I suspect I would start to attach that same meaning to any code phrase" and "I think that even talking about either using a code phrase or to spell it out inevitably pushes toward that being a norm" are both concerns of mine, but I think I'm more optimistic than you that they just won't be big issues by default, and that we can deliberately avoid them if they start creeping in. I'm also perfectly happy in principle to ride the euphemism treadmill and keep rolling out new terms, as long as the swap is happening (say) once every 15 years and not once every 2 years.
Why not say something like "hey, I'm bowing out of this conversation now, but it's not intended to be any sort of reflection on you or the topic, I'm not making a statement, I'm just doing what's good for me and that's all"?
That seems fine too, if I feel like putting the effort into writing a long thing like that, customizing it for the particular circumstances, etc. But I've noticed many times that it's a surprisingly large effort to hit exactly the right balance of social signals in a case like this, given what an important and commonplace move it is. (And I think I'm better than most people at wordsmithing this kind of thing, so if it's hard for me then I worry even more about a bunch of other people.)
Even just taking the time to include all the caveats and explanations can send the wrong signal -- can make a conversation feel more tense, defensive, adversarial, or hypercautious, since why else would you be putting so much work into clarifying stuff rather than just giving a chill 'bye now :)'?
Avoiding that takes skill too. I think this is just a legit hard social thing to communicate. Having another tool in my toolbox that lets me totally ignore one of the most common difficult things to communicate seems great to me. :)
(And indeed, with your second paragraph I see that you're spotting some of the issues. We could just pre-despair of there being any possible solution to this, but maybe the jargon would just work. We haven't tried, and jargon does sometimes just work.)
I think in most cases with public, online, asynchronous communication, it probably makes the most sense to just exit without a message about it.
In a minority of cases, though (e.g., where I've engaged in a series of back-and-forths and then abruptly stopped responding, or where someone asks me a direct Q or what-have-you), I find that I want an easy boilerplate way to notify others that I'm unlikely to respond more. I think "(Leaving orbit. 🙂)" or similar solves that specific problem for me.
Yeah, I would favor "tapping out" if it felt more neutral to me. 'Tapping out', 'bowing out', etc. sound a little resentful/aggressive to my ear, like you're exiting an annoying scuffle that's beneath your time. Even the combat-ish associations are a thing I'd prefer to avoid, if possible.
When I try to mentally simulate negative reader-reactions to the dialogue, I usually get a complicated feeling that's some combination of:
Some amount of conflict aversion: Harsh language feels conflict-y, which is inherently unpleasant.
Empathy for, or identification with, the people or views Eliezer was criticizing. It feels bad to be criticized, and it feels doubly bad to be told 'you are making basic mistakes'.
Something status-regulation-y: My reader-model here finds the implied threat to the status hierarchy salient (whether or not Eliezer is just trying to honestly state his beliefs), and has some version of an 'anti-cheater' or 'anti-rising-above-your-station' impulse.
How right/wrong do you think this is, as a model of what makes the dialogue harder or less pleasant to read from your perspective?
(I feel a little wary of stating my model above, since (a) maybe it's totally off, and (b) it can be rude to guess at other people's mental states. But so far this conversation has felt very abstract to me, so maybe this can at least serve as a prompt to go more concrete. E.g., 'I find it hard to read condescending things' is very vague about which parts of the dialogue we're talking about, about what makes them feel condescending, and about how the feeling-of-condescension affects the sentence-parsing-and-evaluating experience.)
When I try to think of gift ideas for dolphins, am I failing to notice some way in which I'm "selfishly" projecting what I think dolphins should want onto them, or am I violating some coherence axiom?
I think it's rather that 'it's easy to think of ways to help a dolphin (and a smart AGI would presumably find this easy too), but it's hard to make a general intelligence that robustly wants to just help dolphins, and it's hard to safely coerce an AGI into helping dolphins in any major way if that's not what it really wants'.
I think the argument is two-part, and both parts are important:
A random optimization target won't tend to be 'help dolphins'. More specifically, if you ~gradient-descent your way to the first general intelligence you can find that has the external behavior 'help dolphins in the training environment' (or that is starting to approximate that behavior), you will almost always find an optimizer that has some other goal in general.
E.g.: Humans invented condoms once we left the EAA. In this case, we could imagine that we have instilled some instinct in the AGI that makes it emit dolphin-helping behaviors at low capability levels; but then once it has more options, it will push into extreme parts of the state-space. (Condoms are humans' version of 'tiling the universe with smiley faces'.)
Alternatively: If you tried to get a human prisoner to devote their lives to helping dolphins, you would get 'human who pretends to care about dolphins but is always on the lookout for opportunities to escape' long before you get 'human who has deeply and fully replaced their utility function with helping dolphins'. In this case, we can imagine an AGI that pretends to care about the optimization target as a deliberate strategy.
Given that you haven't instilled exactly the desired 'help dolphins' goal right off the bat, now there are strong coherence-pressures against the AGI allowing its goal to be changed ('improved'), against the AGI allowing something else with a different goal to call the shots, etc.
"In the past, particularly in the 17th and 18th centuries, the term 'rationalist' was often used to refer to free thinkers of an anti-clerical and anti-religious outlook, and for a time the word acquired a distinctly pejorative force (thus in 1670 Sanderson spoke disparagingly of 'a mere rationalist, that is to say in plain English an atheist of the late edition...'). The use of the label 'rationalist' to characterize a world outlook which has no place for the supernatural is becoming less popular today; terms like 'humanist' or 'materialist' seem largely to have taken its place. But the old usage still survives."
The Online Etymology Dictionary says:
1620s, "one who follows reason and not authority in thought or speculation," especially "physician whose treatment is based on reasoning," from rational + -ist. In theology/philosophy, "one who applies rational criticism to the claims of supernatural authority or revelation," 1640s. This sense shades into that of "one who believes that human reason, properly employed, renders religion superfluous." Related: Rationalistic; rationalism (1800 in medicine; 1827 in theology, "adherence to the supremacy of reason in matters of belief or conduct;" by 1876 in general use).
Separately, your definitions for rationalist vs. empiricist are off. Per SEP, the usual definition is some variant of 'rationalists think we have some innate knowledge, while empiricists think we get all our knowledge from experience'.
Alberto Vanzo argues that the modern philosophical 'rationalist vs. empiricist' dichotomy comes from Kant and early Kant-influenced thinkers. Though the dichotomies 'rational vs. irrational' and 'reason vs. experience' are both much older than the term 'rationalist'; e.g., Francis Bacon in ~1600 was contrasting 'rationalis' with 'empiricus', though this was about dogmatists vs. experimentalists, not the modern (Kant-inspired) rationalist vs. empiricist dichotomy.
I don't know Eliezer's view on this — presumably he either disagrees that the example he gave is "mundane AI safety stuff", or he disagrees that "mundane AI safety stuff" is widespread? I'll note that you're a MIRI research associate, so I wouldn't have auto-assumed your stuff is representative of the stuff Eliezer is criticizing.
It seems to me that I've watched organizations like OpenPhil try to sponsor academics to work on AI alignment, and it seems to me that they just can't produce what I'd consider to be real work. The journal paper that Stuart Armstrong coauthored on "interruptibility" is a far step down from Armstrong's other work on corrigibility. It had to be dumbed way down (I'm counting obscuration with fancy equations and math results as "dumbing down") to be published in a mainstream journal. It had to be stripped of all the caveats and any mention of explicit incompleteness, which is necessary meta-information for any ongoing incremental progress, not to mention important from a safety standpoint. The root cause can be debated but the observable seems plain. If you want to get real work done, the obvious strategy would be to not subject yourself to any academic incentives or bureaucratic processes. Particularly including peer review by non-"hobbyists" (peer commentary by fellow "hobbyists" still being potentially very valuable), or review by grant committees staffed by the sort of people who are still impressed by academic sage-costuming and will want you to compete against pointlessly obscured but terribly serious-looking equations.
Not believing theories which don’t make new testable predictions just because they retrodict lots of things in a way that the theory's proponents claim is more natural, but that you don’t understand, because that seems generally suspicious
My Eliezer-model doesn't categorically object to this. See, e.g., Fake Causality:
[Phlogiston] feels like an explanation. It’s represented using the same cognitive data format. But the human mind does not automatically detect when a cause has an unconstraining arrow to its effect. Worse, thanks to hindsight bias, it may feel like the cause constrains the effect, when it was merely fitted to the effect.
[...] Thanks to hindsight bias, it’s also not enough to check how well your theory “predicts” facts you already know. You’ve got to predict for tomorrow, not yesterday.
Nineteenth century evolutionism made no quantitative predictions. It was not readily subject to falsification. It was largely an explanation of what had already been seen. It lacked an underlying mechanism, as no one then knew about DNA. It even contradicted the nineteenth century laws of physics. Yet natural selection was such an amazingly good post facto explanation that people flocked to it, and they turned out to be right. Science, as a human endeavor, requires advance prediction. Probability theory, as math, does not distinguish between post facto and advance prediction, because probability theory assumes that probability distributions are fixed properties of a hypothesis.
The rule about advance prediction is a rule of the social process of science—a moral custom and not a theorem. The moral custom exists to prevent human beings from making human mistakes that are hard to even describe in the language of probability theory, like tinkering after the fact with what you claim your hypothesis predicts. People concluded that nineteenth century evolutionism was an excellent explanation, even if it was post facto. That reasoning was correct as probability theory, which is why it worked despite all scientific sins. Probability theory is math. The social process of science is a set of legal conventions to keep people from cheating on the math.
Yet it is also true that, compared to a modern-day evolutionary theorist, evolutionary theorists of the late nineteenth and early twentieth century often went sadly astray. Darwin, who was bright enough to invent the theory, got an amazing amount right. But Darwin’s successors, who were only bright enough to accept the theory, misunderstood evolution frequently and seriously. The usual process of science was then required to correct their mistakes.
My Eliezer-model does object to things like 'since I (from my position as someone who doesn't understand the model) find the retrodictions and obvious-seeming predictions suspicious, you should share my worry and have relatively low confidence in the model's applicability'. Or 'since the case for this model's applicability isn't iron-clad, you should sprinkle in a lot more expressions of verbal doubt'. My Eliezer-model views these as isolated demands for rigor, or as isolated demands for social meekness.
Part of his general anti-modesty and pro-Thielian-secrets view is that it's very possible for other people to know things that justifiably make them much more confident than you are. So if you can't pass the other person's ITT / you don't understand how they're arriving at their conclusion (and you have no principled reason to think they can't have a good model here), then you should be a lot more wary of inferring from their confidence that they're biased.
Not believing theories which don’t make new testable predictions just because they retrodict lots of things in the world naturally (in a way you sort of get intuitively), because you don’t trust your own assessments of naturalness that much in the absence of discriminating evidence
My Eliezer-model thinks it's possible to be so bad at scientific reasoning that you need to be hit over the head with lots of advance predictive successes in order to justifiably trust a model. But my Eliezer-model thinks people like Richard are way better than that, and are (for modesty-ish reasons) overly distrusting their ability to do inside-view reasoning, and (as a consequence) aren't building up their inside-view-reasoning skills nearly as much as they could. (At least in domains like AGI, where you stand to look a lot sillier to others if you go around expressing confident inside-view models that others don't share.)
Not believing theories which don’t make new testable predictions just because they retrodict lots of things in the world naturally (in a way you sort of get intuitively), because most powerful theories which cause conceptual revolutions also make new testable predictions, so it’s a bad sign if the newly proposed theory doesn’t.
My Eliezer-model thinks this is correct as stated, but thinks this is a claim that applies to things like Newtonian gravity and not to things like probability theory. (He's also suspicious that modest-epistemology pressures have something to do with this being non-obvious — e.g., because modesty discourages you from trusting your own internal understanding of things like probability theory, and instead encourages you to look at external public signs of probability theory's impressiveness, of a sort that could be egalitarianly accepted even by people who don't understand probability theory.)
(I'll emphasize again, by the way, that this is a relative comparison of my model of Paul vs. Eliezer. If Paul and Eliezer's views on some topic are pretty close in absolute terms, the above might misleadingly suggest more disagreement than there in fact is.)
I would frame the question more as 'Is this question important for the entire chain of actions humanity needs to select in order to steer to good outcomes?', rather than 'Is there a specific thing Paul or Eliezer personally should do differently tomorrow if they update to the other's view?' (though the latter is an interesting question too).
Some implications of having a more Eliezer-ish view include:
In the Eliezer-world, humanity's task is more foresight-loaded. You don't get a long period of time in advance of AGI where the path to AGI is clear; nor do you get a long period of time of working with proto-AGI or weak AGI where we can safely learn all the relevant principles and meta-principles via trial and error. You need to see far more of the bullets coming in advance of the experiment, which means developing more of the technical knowledge to exercise that kind of foresight, and also developing more of the base skills of thinking well about AGI even where our technical models and our data are both thin.
My Paul-model says: 'Humans are just really bad at foresight, and it seems like AI just isn't very amenable to understanding; so we're forced to rely mostly on surface trends and empirical feedback loops. Fortunately, AGI itself is pretty simple and obvious (just keep scaling stuff similar to GPT-3 and you're done), and progress is likely to be relatively slow and gradual, so surface trends will be a great guide and empirical feedback loops will be abundant.'
My Eliezer-model says: 'AI foresight may be hard, but it seems overwhelmingly necessary; either we see the bullets coming in advance, or we die. So we need to try to master foresight, even though we can't be sure of success in advance. In the end, this is a novel domain, and humanity hasn't put much effort into developing good foresight here; it would be foolish to despair before we've made a proper attempt. We need to try to overcome important biases, think like reality, and become capable of good inside-view reasoning about AGI. We need to hone and refine our gut-level pattern-matching, as well as our explicit models of AGI, as well as the metacognition that helps us improve the former capacities.'
In the Eliezer-world, small actors matter more in expectation; there's no guarantee that the largest and most well-established ML groups will get to AGI first. Governments in particular matter less in expectation.
In the Eliezer-world, single organizations matter more: there's more potential for a single group to have a lead, and for other groups to be passive or oblivious. This means that you can get more bang for your buck by figuring out how to make a really excellent organization full of excellent people; and you get comparatively less bang for your buck from improving relations between organizations, between governments, etc.
The Eliezer-world is less adequate overall, and also has more capabilities (and alignment) secrets.
So, e.g., research closure matters more — both because more secrets exist, and because it's less likely that there will be multiple independent discoveries of any given secret at around the same time.
Also, if your background view of the world is more adequate, you should be less worried about alignment (both out of deference to the ML mainstream that is at least moderately less worried about alignment; and out of expectation that the ML mainstream will update and change course as needed).
Relatedly, in Eliezer-world you have to do more work to actively recruit the world's clearest and best thinkers to helping solve alignment. In Paul-world, you can rely more on future AI progress, warning shots, etc. to naturally grow the alignment field.
In the Eliezer-world, timelines are both shorter and less predictable. There's more potential for AGI to be early-paradigm rather than late-paradigm; and even if it's late-paradigm, it may be late into a paradigm that doesn't look very much like GPT-3 or other circa-2021 systems.
In the Eliezer-world, there are many different paths to AGI, and it may be key to humanity's survival that we pick a relatively good path years in advance, and deliberately steer toward more alignable approaches to AGI. In the Paul-world, there's one path to AGI, and it's big and obvious.
My Eliezer-model is a lot less surprised by lulls than my Paul-model (because we're missing key insights for AGI, progress on insights is jumpy and hard to predict, the future is generally very unpredictable, etc.). I don't know exactly how large of a lull or winter would start to surprise Eliezer (or how much that surprise would change if the lull is occurring two years from now, vs. ten years from now, for example).
It is amazing that our neural networks work at all; terrifying that we can dump in so much GPU power that our training methods work at all; and the fact that AlphaGo can even exist is still blowing my mind. It's like watching a trillion spiders with the intelligence of earthworms, working for 100,000 years, using tissue paper to construct nuclear weapons.
People occasionally ask me about signs that the remaining timeline might be short. It's very easy for nonprofessionals to take too much alarm too easily. Deep Blue beating Kasparov at chess was not such a sign. Robotic cars are not such a sign.
"Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves... Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0."
Repeat: IT DEFEATED THE EUROPEAN GO CHAMPION 5-0.
As the authors observe, this represents a break of at least one decade faster than trend in computer Go.
This matches something I've previously named in private conversation as a warning sign - sharply above-trend performance at Go from a neural algorithm. What this indicates is not that deep learning in particular is going to be the Game Over algorithm. Rather, the background variables are looking more like "Human neural intelligence is not that complicated and current algorithms are touching on keystone, foundational aspects of it." What's alarming is not this particular breakthrough, but what it implies about the general background settings of the computational universe.
To try spelling out the details more explicitly, Go is a game that is very computationally difficult for traditional chess-style techniques. Human masters learn to play Go very intuitively, because the human cortical algorithm turns out to generalize well. If deep learning can do something similar, plus (a previous real sign) have a single network architecture learn to play loads of different old computer games, that may indicate we're starting to get into the range of "neural algorithms that generalize well, the way that the human cortical algorithm generalizes well".
This result also supports that "Everything always stays on a smooth exponential trend, you don't get discontinuous competence boosts from new algorithmic insights" is false even for the non-recursive case, but that was already obvious from my perspective. Evidence that's more easily interpreted by a wider set of eyes is always helpful, I guess.
Next sign up might be, e.g., a similar discontinuous jump in machine programming ability - not to human level, but to doing things previously considered impossibly difficult for AI algorithms.
I hope that everyone in 2005 who tried to eyeball the AI alignment problem, and concluded with their own eyeballs that we had until 2050 to start really worrying about it, enjoyed their use of whatever resources they decided not to devote to the problem at that time.
To which my Eliezer-model's response is "Indeed, we should expect that the first AGI systems will be pathetic in relative terms, comparing them to later AGI systems. But the impact of the first AGI systems in absolute terms is dependent on computer-science facts, just as the impact of the first nuclear bombs was dependent on facts of nuclear physics. Nuclear bombs have improved enormously since Trinity and Little Boy, but there is no law of nature requiring all prototypes to have approximately the same real-world impact, independent of what the thing is a prototype of."
I’ve listened to them as is and I find it pretty easy to follow, but if you’re interested in making it even easier for people to follow, these fine gentlemen have put up a ~$230 RFP/bounty for anybody who turns it into audio where each person has a different voice.
That link isn't working for me; where's the bounty?