The Point of Trade 2021-06-22T17:56:44.088Z
I'm from a parallel Earth with much higher coordination: AMA 2021-04-05T22:09:24.033Z
A Semitechnical Introductory Dialogue on Solomonoff Induction 2021-03-04T17:27:35.591Z
Your Cheerful Price 2021-02-13T05:41:53.511Z
Movable Housing for Scalable Cities 2020-05-15T21:21:05.395Z
Coherent decisions imply consistent utilities 2019-05-12T21:33:57.982Z
Should ethicists be inside or outside a profession? 2018-12-12T01:40:13.298Z
Transhumanists Don't Need Special Dispositions 2018-12-07T22:24:17.072Z
Transhumanism as Simplified Humanism 2018-12-05T20:12:13.114Z
Is Clickbait Destroying Our General Intelligence? 2018-11-16T23:06:29.506Z
On Doing the Improbable 2018-10-28T20:09:32.056Z
The Rocket Alignment Problem 2018-10-04T00:38:58.795Z
Toolbox-thinking and Law-thinking 2018-05-31T21:28:19.354Z
Meta-Honesty: Firming Up Honesty Around Its Edge-Cases 2018-05-29T00:59:22.084Z
Challenges to Christiano’s capability amplification proposal 2018-05-19T18:18:55.332Z
Local Validity as a Key to Sanity and Civilization 2018-04-07T04:25:46.134Z
Security Mindset and the Logistic Success Curve 2017-11-26T15:58:23.127Z
Security Mindset and Ordinary Paranoia 2017-11-25T17:53:18.049Z
Hero Licensing 2017-11-21T21:13:36.019Z
Against Shooting Yourself in the Foot 2017-11-16T20:13:35.529Z
Status Regulation and Anxious Underconfidence 2017-11-16T19:35:00.533Z
Against Modest Epistemology 2017-11-14T20:40:52.681Z
Blind Empiricism 2017-11-12T22:07:54.934Z
Living in an Inadequate World 2017-11-09T21:23:25.451Z
Moloch's Toolbox (2/2) 2017-11-07T01:58:37.315Z
Moloch's Toolbox (1/2) 2017-11-04T21:46:32.597Z
An Equilibrium of No Free Energy 2017-10-31T21:27:00.232Z
Frequently Asked Questions for Central Banks Undershooting Their Inflation Target 2017-10-29T23:36:22.256Z
Inadequacy and Modesty 2017-10-28T21:51:01.339Z
AlphaGo Zero and the Foom Debate 2017-10-21T02:18:50.130Z
There's No Fire Alarm for Artificial General Intelligence 2017-10-13T21:38:16.797Z
Catalonia and the Overton Window 2017-10-02T20:23:37.937Z
Can we hybridize Absent-Minded Driver with Death in Damascus? 2016-08-01T21:43:06.000Z
Zombies Redacted 2016-07-02T20:16:33.687Z
Chapter 84: Taboo Tradeoffs, Aftermath 2 2015-03-14T19:00:59.813Z
Chapter 119: Something to Protect: Albus Dumbledore 2015-03-14T19:00:59.687Z
Chapter 32: Interlude: Personal Financial Management 2015-03-14T19:00:59.231Z
Chapter 46: Humanism, Pt 4 2015-03-14T19:00:58.847Z
Chapter 105: The Truth, Pt 2 2015-03-14T19:00:57.357Z
Chapter 19: Delayed Gratification 2015-03-14T19:00:56.265Z
Chapter 99: Roles, Aftermath 2015-03-14T19:00:56.252Z
Chapter 51: Title Redacted, Pt 1 2015-03-14T19:00:56.175Z
Chapter 44: Humanism, Pt 2 2015-03-14T19:00:55.943Z
Chapter 39: Pretending to be Wise, Pt 1 2015-03-14T19:00:55.254Z
Chapter 7: Reciprocation 2015-03-14T19:00:55.225Z
Chapter 17: Locating the Hypothesis 2015-03-14T19:00:54.325Z
Chapter 118: Something to Protect: Professor Quirrell 2015-03-14T19:00:54.139Z
Chapter 15: Conscientiousness 2015-03-14T19:00:53.058Z
Chapter 83: Taboo Tradeoffs, Aftermath 1 2015-03-14T19:00:52.470Z
Chapter 104: The Truth, Pt 1, Riddles and Answers 2015-03-14T19:00:52.391Z


Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Naturalism and AI alignment · 2021-04-25T19:48:57.707Z · LW · GW

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Rationalism before the Sequences · 2021-04-15T23:02:04.594Z · LW · GW

Just jaunt superquantumly to another quantum world instead of superluminally to an unobservable galaxy.  What about these two physically impossible counterfactuals is less than perfectly isomorphic?  Except for some mere ease of false-to-fact visualization inside a human imagination that finds it easier to track nonexistent imaginary Newtonian billiard balls than existent quantum clouds of amplitude, with the latter case, in reality, covering both unobservable galaxies distant in space and unobservable galaxies distant in phase space.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Rationalism before the Sequences · 2021-04-13T08:39:35.333Z · LW · GW

I reiterate the galaxy example; saying that you could counterfactually make an observation by violating physical law is not the same as saying that something's meaning cashes out to anticipated experiences.  Consider the (exact) analogy between believing that galaxies exist after they go over the horizon, and that other quantum worlds go on existing after we decohere them away from us by observing ourselves being inside only one of them.  Predictivism is exactly the sort of ground on which some people have tried to claim that MWI isn't meaningful, and they're correct in that predictivism renders MWI meaningless just as it renders the claims "galaxies go on existing after we can no longer see them" meaningless.  To reply "If we had methods to make observations outside our quantum world, we could see the other quantum worlds" would be correctly rejected by them as an argument from within predictivism; it is an argument from outside predictivism, and presumes that correspondence theories of truth can be defined meaningfully by imagining an account from outside the universe of how the things that we've observed have their own causal processes generating those observations, such that having thus identified the causal processes through observation, we may speak of unobservable but fully identified variables with no observable-to-us consequences such as the continued existence of distant galaxies and other quantum worlds.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Rationalism before the Sequences · 2021-04-05T21:41:32.742Z · LW · GW

One minor note is that, among the reasons I haven't looked especially hard into the origins of "verificationism"(?) as a theory of meaning, is that I do in fact - as I understand it - explicitly deny this theory.  The meaning of a statement is not the future experimental predictions that it brings about, nor isomorphic up to those predictions; all meaning about the causal universe derives from causal interactions with us, but you can have meaningful statements with no experimental consequences, for example:  "Galaxies continue to exist after the expanding universe carries them over the horizon of observation from us."  For my actual theory of meaning see the "Physics and Causality" subsequence of Highly Advanced Epistemology 101 For Beginners.

That is: among the reasons why I am not more fascinated with the antecedents of my supposedly verificationist theory of meaning is that I explicitly reject a verificationist account of meaning.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on AI and the Probability of Conflict · 2021-04-02T07:06:57.879Z · LW · GW

My point is that plausible scenarios for Aligned AGI give you AGI that remains aligned only when run within power bounds, and this seems to me like one of the largest facts affecting the outcome of arms-race dynamics.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on AI and the Probability of Conflict · 2021-04-01T07:17:24.521Z · LW · GW

This all assumes that AGI does whatever its supposed operator wants it to do, and that other parties believe as much?  I think the first part of this is very false, though the second part alas seems very realistic, so I think this misses the key thing that makes an AGI arms race lethal.

I expect that a dignified apocalypse looks like, "We could do limited things with this software and hope to not destroy the world, but as we ramp up the power and iterate the for-loops more times, the probability of destroying the world goes up along a logistic curve."  In "relatively optimistic" scenarios it will be obvious to operators and programmers that this curve is being ascended - that is, running the for-loops with higher bounds will produce an AGI with visibly greater social sophistication, increasing big-picture knowledge, visible crude attempts at subverting operators or escaping or replicating outside boxes, etc.  We can then imagine the higher-ups demanding that crude patches be applied to get rid of the visible problems in order to ramp up the for-loops further, worrying that, if they don't do this themselves, the Chinese will do that first with their stolen copy of the code.  Somebody estimates a risk probability, somebody else tells them too bad, they need to take 5% more risk in order to keep up with the arms race.  This resembles a nuclear arms race and deployment scenario where, even though there's common knowledge that nuclear winter is a thing, you still end up with nuclear winter because people are instructed to incrementally deploy another 50 nuclear warheads at the cost of a 5% increase in triggering nuclear winter, and then the other side does the same.  But this is at least a relatively more dignified death by poor Nash equilibrium, where people are taking everything as seriously as they took nuclear war back in the days when Presidents weren't retired movie actors.

In less optimistic scenarios that realistically reflect the actual levels of understanding being displayed by programmers and managers in the most powerful organizations today, the programmers themselves just patch away the visible signs of impending doom and keep going, thinking that they have "debugged the software" rather than eliminated visible warning signs, being in denial for internal political reasons about how this is climbing a logistic probability curve towards ruin or how fast that curve is being climbed, not really having a lot of mental fun thinking about the doom they're heading into and warding that off by saying, "But if we slow down, our competitors will catch up, and we don't trust them to play nice" along of course with "Well, if Yudkowsky was right, we're all dead anyways, so we may as well assume he was wrong", and generally skipping straight to the fun part of running the AGI's for-loops with as much computing power as is available to do the neatest possible things; and so we die in a less dignified fashion.

My point is that what you depict as multiple organizations worried about what other organizations will successfully do with an AGI being operated at maximum power, which is believed to do whatever its operator wants it to do, reflects a scenario where everybody dies really fast, because they all share a mistaken optimistic belief about what happens when you operate AGIs at increasing capability.  The real lethality of the arms race is that blowing past hopefully-visible warning signs or patching them out, and running your AGI at increasing power, creates an increasing risk of the whole world ending immediately.  Your scenario is one where people don't understand that and think that AGIs do whatever the operators want, so it's a scenario where the outcome of the multipolar tensions is instant death as soon as the computing resources are sufficient for lethality.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Disentangling Corrigibility: 2015-2021 · 2021-04-01T02:46:21.393Z · LW · GW

Thank you very much!  It seems worth distinguishing the concept invention from the name brainstorming, in a case like this one, but I now agree that Rob Miles invented the word itself.

The technical term corrigibility, coined by Robert Miles, was introduced to the AGI safety/alignment community in the 2015 MIRI/FHI paper titled Corrigibility.

Eg I'd suggest that to avoid confusion this kind of language should be something like "The technical term corrigibility, a name suggested by Robert Miles to denote concepts previously discussed at MIRI, was introduced..." &c.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on How do we prepare for final crunch time? · 2021-04-01T02:06:06.943Z · LW · GW

Seems rather obvious to me that the sort of person who is like, "Oh, well, we can't possibly work on this until later" will, come Later, be like, "Oh, well, it's too late to start doing basic research now, we'll have to work with whatever basic strategies we came up with already."

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Disentangling Corrigibility: 2015-2021 · 2021-04-01T01:55:31.672Z · LW · GW

Why do you think the term "corrigibility" was coined by Robert Miles?  My autobiographical memory tends to be worryingly fallible, but I remember coining this term myself after some brainstorming (possibly at a MIRI workshop).  This is a kind of thing that I usually try to avoid enforcing because it would look bad if all of the concepts that I did in fact invent were being cited as traceable to me - the truth about how much of this field I invented does not look good for the field or for humanity's prospects - but outright errors of this sort should still be avoided, if an error it is.

Agent designs that provably meet more of them have since been developed, for example here.

First I've seen this paper, haven't had a chance to look at it yet, would be very surprised if it fulfilled the claims made in the abstract.  Those are very large claims and you should not take them at face value without a lot of careful looking.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Logan Strohl on exercise norms · 2021-03-31T03:16:52.263Z · LW · GW

Lots of people work for their privileges!  I practiced writing for a LONG time - and remain continuously aware that other people cannot be expected to express their ideas clearly, even assuming their ideas to be clear, because I have Writing Privilege and they do not.  Does my Writing Privilege have an innate component?  Of course it does; my birth lottery placed me in a highly literate household full of actually good books, which combined with genuine genetic talent got me a 670 Verbal score on the pre-restandardized SAT at age eleven; but most teens with 670V SAT scores can't express themselves at all clearly, and it was a long long time and a lot of practice before I started being able to express myself clearly ever even on special occasions.  It remains a case of Privilege, and would be such even if I'd obtained it entirely by hard work starting from an IQ of exactly 100, not that this is possible, but if it were possible it would still be Privilege.  People who study hard, work hard, compound their luck, and save up a lot of money, end up with Financial Privilege, and should keep that in mind before expecting less financially privileged friends to come with them on a non-expenses-paid fun friendly trip.  We are all locally-Privileged in one aspect or another, even that kid at the center of Omelas, and all we can do is keep it in mind.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on My research methodology · 2021-03-29T20:59:59.475Z · LW · GW

Is a bridge falling down the moment you finish building it an extreme and somewhat strange failure mode? In the space of all possible bridge designs, surely not. Most bridge designs fall over. But in the real world, you could win money all day betting that bridges won't collapse the moment they're finished.

Yeah, that kiiiinda relies on literally anybody anywhere being able to sketch a bridge that wouldn't fall over, which is not the situation we are currently in.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on My research methodology · 2021-03-29T00:34:55.508Z · LW · GW

But it feels to me like egregious misalignment is an extreme and somewhat strange failure mode and it should be possible to avoid it regardless of how the empirical facts shake out.

Paul, this seems a bizarre way to describe something that we agree is the default result of optimizing for almost anything (eg paperclips).  Not only do I not understand what you actually did mean by this, it seems like phrasing that potentially leads astray other readers coming in for the first time.  Say, if you imagine somebody at Deepmind coming in without a lot of prior acquaintance with the field - or some hapless innocent ordinary naive LessWrong reader who has a glowing brain, but not a galaxy brain, and who is taking Paul's words for a lot of stuff about alignment because Paul has such a reassuring moderate tone compared to Eliezer - then they would come away from your paragraph thinking, "Oh, well, this isn't something that happens if I take a giant model and train it to produce outputs that human raters score highly, because an 'extreme and somewhat strange failure mode' must surely require that I add on some unusual extra special code to my model in order to produce it."

I suspect that you are talking in a way that leads a lot of people to vastly underestimate how difficult you think alignment is, because you're assuming, in the background, exotic doing-stuff-right technology that does not exist, in order to prevent these "extreme and somewhat strange failure modes" from happening, as we agree they automatically would given any "naive" simple scheme, that you could actually sketch out concretely right now on paper.  By which I mean, concretely enough that you could have any ordinary ML person understand in concrete enough detail that they could go write a skeleton of the code, as opposed to that you think you could later sketch out a research approach for doing.  It's not just a buffer overflow that's the default for bad security, it's the equivalent of a buffer overflow where nobody can right now exhibit how strange-failure-mode-avoiding code should concretely work in detail.  "Strange" is a strange name for a behavior that is so much the default that it is an unsolved research problem to avoid it, even if you think that this research problem should definitely be solvable and it's just something wrong or stupid about all of the approaches we could currently concretely code that would make them exhibit that behavior.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Why those who care about catastrophic and existential risk should care about autonomous weapons · 2021-03-19T20:24:59.477Z · LW · GW

To answer your research question, in much the same way that in computer security any non-understood behavior of the system which violates our beliefs about how it's supposed to work is a "bug" and very likely en route to an exploit - in the same way that OpenBSD treats every crash as a security problem, because the system is not supposed to crash and therefore any crash proves that our beliefs about the system are false and therefore our beliefs about its security may also be false because its behavior is not known - in AI safety, you would expect system security to rest on understandable system behaviors.  In AGI alignment, I do not expect to be working in an adversarial environment unless things are already far past having been lost, so it's a moot point.  Predictability, stability, and control are the keys to exploit-resistance and this will be as true in AI as it is in computer security, with a few extremely limited exceptions in which randomness is deployed across a constrained and well-understood range of randomized behaviors with numerical parameters, much as memory locations and private keys are randomized in computer security without say randomizing the code.  I hope this allows you to lay this research question to rest and move on.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Strong Evidence is Common · 2021-03-13T22:15:10.687Z · LW · GW

Corollary: most beliefs worth having are extreme.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models" · 2021-03-10T10:17:05.670Z · LW · GW

I expect there to be a massive and important distinction between "passive transparency" and "active transparency", with the latter being much more shaky and potentially concealing of fatality, and the former being cruder as tech at the present rate which is unfortunate because it has so many fewer ways to go wrong.  I hope any terminology chosen continues to make the distinction clear.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Excerpt from Arbital Solomonoff induction dialogue · 2021-02-28T20:58:29.486Z · LW · GW

Seems just false.  If you're not worried about confronting agents of equal size (which is equally a concern for a Solomonoff inductor) then a naive bounded Solomonoff inductor running on a Grahamputer will give you essentially the same result for all practical purposes as a Solomonoff inductor.  That's far more than enough compute to contain our physical universe as a hypothesis.  You don't bother with MCMC on a Grahamputer.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Excerpt from Arbital Solomonoff induction dialogue · 2021-02-28T20:49:12.902Z · LW · GW

(IIRC, that dialogue is basically me-written.)

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Your Cheerful Price · 2021-02-13T11:16:10.297Z · LW · GW

I used it this afternoon to pay a housemate to sterilize the contents of a package. They said $5.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Extensions and Intensions · 2021-02-02T04:06:33.830Z · LW · GW

Correction for future note:  The extensional definition is the complete set of objects obeying a definition.  To define a thing by pointing out some examples (without pointing out all possible examples) has the name "ostensive definition".  H/t @clonusmini on Twitter.  Original discussion in "Language in Thought and Action" here.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Why I'm excited about Debate · 2021-01-16T22:34:55.537Z · LW · GW

Now, consider the following simplistic model for naive (un)aligned AGI:

The AGI outputs English sentences.  Each time the AGI does, the human operator replies on a scale of 1 to 100 with how good and valuable and useful that sentence seemed to the human.  The human may also input other sentences to the AGI as a hint about what kind of output the human is currently looking for; and the AGI also has purely passive sensory inputs like a fixed webcam stream or a pregathered internet archive.

How does this fail as an alignment methodology?  Doesn't this fit very neatly into the existing prosaic methodology of reinforcement learning?  Wouldn't it be very useful to have on hand an intelligence which gives us much more valuable sentences, in response to input, than the sentences that would be generated by a human?

There's a number of general or generic ways to fail here that aren't specific to the supposed criterion of the reinforcement learning system, like the AGI ending up with other internal goals and successfully forging a Wifi signal via internal RAM modulation etcetera, but let's consider failures that are in some sense intrinsic to this paradigm even if the internal AI ends up perfectly aligned on that goal.  Let's even skip over the sense in which we've given the AI a long-term incentive to accept some lower rewards in the short term, in order to grab control of the rating button, if the AGI ends up with long-term consequentialist preferences and long-term planning abilities that exactly reflect the outer fitness function.  Let's say that the AGI is only shortsightedly trying to maximize sentence reward on each round and that it is not superintelligent enough to disintermediate the human operators and grab control of the button in one round without multi-round planning.  Some of us might be a bit skeptical about whether you can be, effectively, very much smarter than a human, in the first place, without doing some kind of multi-round or long-term internal planning about how to think about things and allocate internal resources; but fine, maybe there's just so many GPUs running the system that it can do all of its thinking, for each round, on that round.  What intrinsically goes wrong?

What intrinsically goes wrong, I'd say, is that the human operators have an ability to recognize good arguments that's only rated to withstand up to a certain intensity of search, which will break down beyond that point.  Our brains' ability to distinguish good arguments from bad arguments is something we'd expect to be balanced to the kind of argumentative pressure a human brain was presented with in the ancestral environment / environment of evolutionary adaptedness, and if you optimize against a brain much harder than this, you'd expect it to break.  There'd be an arms race between politicians exploiting brain features to persuade people of things that were useful to the politician, and brains that were, among other things, trying to pursue the original 'reasons' for reasoning that originally and initially made it useful to recognize certain arguments as good arguments before any politicians were trying to exploit them.  Again, oversimplified, and there are cases where it's not tribally good for you to be the only person who sees a politician's lie as a lie; but the broader point is that there's going to exist an ecological balance in the ancestral environment between brains trying to persuade other brains, and brains trying to do locally fitness-enhancing cognition while listening to persuaders; and this balance is going to be tuned to the level of search power that politicians had in the environment of evolutionary adaptedness.

Arguably, this is one way of viewing the flood of modern insanity of which social media seems to be the center.  For the same reason pandemics get more virulent with larger and denser population centers, Twitter may be selecting for memes that break humans at a much higher level of optimization pressure than held in the ancestral environment, or even just 1000 years earlier than this.

Viewed through the lens of Goodhart's Curse:  When you have an imperfect proxy U for an underlying value V, the highest values of U will represent the places where U diverges upward the most from V and not just the highest underlying values of V.  The harder you search for high values of U, the wider the space of possibilities you search, the more that the highest values of U will diverge upwards from V.
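The "harder you search, the wider the divergence" claim can be checked numerically.  Here is a minimal sketch under a toy assumption (V and D as independent standard normals, U = V + D; none of this setup is from the original text): as the number n of candidates searched grows, the divergence d = u - v of the best-looking candidate grows with it.

```python
import random

random.seed(0)

def winner_divergence(n, trials=200):
    """Average divergence d = u - v of the argmax-U candidate when n
    candidates are searched.  V and D are independent standard normals
    and U = V + D, so d is simply the noise term of the winner."""
    total = 0.0
    for _ in range(trials):
        best_u, best_d = max(
            (v + d, d)
            for v, d in ((random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n))
        )
        total += best_d
    return total / trials

# The divergence of the apparent best grows with the breadth of the search.
for n in (10, 100, 10_000):
    print(n, round(winner_divergence(n), 2))
```

Even with V and D symmetric, widening the search from 10 to 10,000 candidates roughly triples how far the winner's U overshoots its V.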

### incorporated from a work in progress

Suppose that, in the earlier days of the Web, you're trying to find webpages with omelet recipes.  You have the stunning insight that webpages with omelet recipes often contain the word "omelet" somewhere in them.  So you build a search engine that travels URLs to crawl as much of the Web as you can find, indexing all the pages by the words they contain; and then you search for the "omelet" keyword.  Works great the first time you try it!  Maybe some of the pages are quoting "You can't make an omelet without breaking eggs" (said by Robespierre, allegedly), but enough pages have actual omelet recipes that you can find them by scrolling down.  Better yet, assume that pages that contain the "omelet" keyword more often are more likely to be about omelets.  Then you're fine... in the first iteration of the game.

But the thing is: the easily computer-measurable fact of whether a page contains the "omelet" keyword is not identical to the fact of whether it has the tasty omelet recipe you seek.  V, the true value, is whether a page has a tasty recipe for omelets; U, the proxy measure, is how often the page mentions the "omelet" keyword.  That some pages are compendiums of quotes from French revolutionaries, instead of omelet recipes, illustrates that U and V can diverge even in the natural ecology.

But once the search engine is built, we are not just listing possible pages and their U-measure at random; we are picking pages with the highest U-measure we can see.  If we name the divergence D = U-V then we can say u_i = v_i + d_i.  This helps illustrate that by selecting for the highest u_i we can find, we are putting upward selection pressure on both v_i and d_i.  We are implicitly searching out, first, the underlying quality V that U is a proxy for, and second, places where U diverges far upward from V, that is, places where the proxy measure breaks down.

If we are living in an unrealistically smooth world where V and D are independent Gaussian distributions with mean 0 and variance 1, then the mean and variance of U are just 0 and 2 (the sum of the means and variances of V and D).  If we randomly select an element with u_i=3, then on average it has v_i of 1.5 and d_i of 1.5.  If the variance of V is 1 and the variance of D is 10 - if the "noise" from V to U varies much more widely on average than V itself - then most of the height of a high-U item probably comes from a lot of upward noise.  But not all of it.  On average, if you pick out an element with u_i = 11, it has expected d_i of 10 and v_i of 1; its apparent greatness is mostly noise.  But still, the expected v_i is 1, not the average V of 0.  The best-looking things are still, in expectation, better than average.  They are just not as good as they look.
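As a sanity check on those numbers, a small Monte Carlo sketch of exactly this Gaussian toy model (tolerances are loose because it is sampling):

```python
import random
import statistics

random.seed(0)
N = 1_000_000

# Case 1: V and D independent standard normals, U = V + D (variances 1 and 1).
pairs = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(N)]
v_given_u3 = [v for v, d in pairs if abs((v + d) - 3.0) < 0.1]
print(statistics.mean(v_given_u3))   # close to 1.5, as claimed

# Case 2: Var(D) = 10, so Var(U) = 11.
pairs = [(random.gauss(0, 1), random.gauss(0, 10 ** 0.5)) for _ in range(N)]
v_given_u11 = [v for v, d in pairs if abs((v + d) - 11.0) < 0.2]
print(statistics.mean(v_given_u11))  # close to 1: mostly noise, but above average
```

Both conditional means match the closed form E[V | U=u] = u * Var(V) / Var(U): 3/2 in the first case, 11/11 = 1 in the second.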

Ah, but what if everything isn't all Gaussian distributions?  What if there are some regions of the space where D has much higher variance - places where U is much more prone to error as a proxy measure of V?  Then selecting for high U tends to steer us to regions of possibility space where U is most mistaken as a measure of V.

And in nonsimple domains, the wider the region of possibility we search, the more likely this is to happen; the more likely it is that some part of the possibility space contains a place where U is a bad proxy for V.
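That steering effect can be seen directly in a sketch (the two "regions" and all numbers here are illustrative assumptions): V is distributed identically in both regions, but hard selection on U all but guarantees that the winner comes from the region where the proxy error is largest.

```python
import random

random.seed(0)

# Two regions of possibility space: V is a standard normal in both, but in
# region "B" the proxy error D has ten times the standard deviation.
candidates = []
for _ in range(100_000):
    region = random.choice("AB")
    v = random.gauss(0, 1)
    d = random.gauss(0, 1.0 if region == "A" else 10.0)
    candidates.append((v + d, v, region))

# Hard optimization: keep only the single highest-U candidate.
u, v, region = max(candidates)
print(region, round(v, 2))  # winner comes from "B", with an unremarkable true value
```

Selecting the max of U doesn't just add noise uniformly; it actively seeks out the part of the space where U is most mistaken about V.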

This is an abstract (and widely generalizable) way of seeing the Fall of AltaVista.  In the beginning, the programmers noticed that naturally occurring webpages containing the word "omelet" were more likely to be about omelets.  It is very hard to measure whether a webpage contains a good, coherent, well-written, tasty omelet recipe (what users actually want), but very easy to measure how often a webpage mentions the word "omelet".  And the two facts did seem to correlate (webpages about dogs usually didn't mention omelets at all).  So AltaVista built a search engine accordingly.

But if you imagine the full space of all possible text pages, the ones that mention "omelet" most often are not pages with omelet recipes.  They are pages with lots of sections that just say "omelet omelet omelet" over and over.  In the natural ecology these webpages did not, at first, exist to be indexed!  It doesn't matter that possibility-space is uncorrelated in principle, if we're only searching an actuality-space where things are in fact correlated.

But once lots of people started using (purely) keyword-based searches for webpages, and frequently searching for "omelet", spammers had an incentive to reshape their Viagra sales pages to contain "omelet omelet omelet" paragraphs.

That is:  Once there was an economic incentive for somebody to make the search engine return a different result, the spammers began to intelligently search for ways to make U return a high result, and this implicitly meant putting the U-V correlation to a vastly stronger test.  People naturally making webpages had not previously generated lots of webpages that said "omelet omelet omelet Viagra".  U looked well-correlated with V in the region of textual possibility space that corresponded to the Web's status quo ante.  But when an intelligent spammer imagines a way to try to steer users to their webpage, their imagination is searching through all the kinds of possible webpages they can imagine constructing; they are searching for imaginable places where U-V is very high and not just previously existing places where U-V is high.  This means searching a much wider region of possibility for any place where U-V breaks down (or rather, breaks upward) which is why U is being put to a much sterner test.
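A toy version of this keyword-proxy failure (the page texts and the scoring function are invented for illustration): rank pages by a crude U, the count of the literal keyword token, and a page constructed adversarially to pump U outscores every naturally occurring page.

```python
# U: a crude proxy for "is an omelet recipe" - how often the page's text
# contains the literal token "omelet".
def keyword_score(page, keyword="omelet"):
    return page.lower().split().count(keyword)

pages = {
    "recipe": "A tasty omelet recipe: beat the eggs, butter the pan, fold gently.",
    "quotes": "You can't make an omelet without breaking eggs.",
    "dogs":   "All about dogs. Dogs do not care for eggs.",
    "spam":   "omelet " * 50 + "buy cheap viagra now",
}

# In the natural ecology U roughly tracks V: the recipe and the quote page
# each score 1, the dog page scores 0.  The adversarial page blows past them.
ranked = sorted(pages, key=lambda name: keyword_score(pages[name]), reverse=True)
print(ranked[0])  # 'spam' - the page optimized against U, not the recipe
```

Nothing in the natural distribution of pages would have produced the "spam" entry; it only exists because someone searched deliberately for a point where U - V is enormous.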

We can also see issues in computer security from a similar perspective:  Regularities that are observed in narrow possibility spaces often break down in wider regions of the possibility space that can be searched by intelligent optimization.  Consider how weird a buffer overflow attack would look, relative to a more "natural" ecology of program execution traces produced by non-malicious actors.  Not only does the buffer overflow attack involve an unnaturally huge input, it's a huge input that overwrites the stack return address in a way that improbably happens to go to one of the most effectual possible destinations.  A buffer overflow that results in root privilege escalation might not happen by accident inside a vulnerable system even once before the end of the universe.  But an intelligent attacker doesn't search the space of only things that have already happened; they use their intelligence to search the much wider region of things that they can imagine happening.  It says very little about the security of a computer system to say that, on average over the lifetime of the universe, it will never once yield up protected data in response to random inputs or in response to inputs typical of the previously observed distribution.

And the smarter the attacker, the wider the space of system execution traces it can effectively search.  Very sophisticated attacks can look like "magic" in the sense that they exploit regularities you didn't realize existed!  As an example in computer security, consider the Rowhammer attack, where repeatedly writing to unprotected RAM causes a nearby protected bit to flip.  This violates what you might have thought were the basic axioms governing the computer's transistors.  If you didn't know the trick behind Rowhammer, somebody could show you the code for the attack, and you just wouldn't see any advance reason why that code would succeed.  You would not predict in advance that this weird code would successfully get root privileges, given all the phenomena inside the computer that you currently know about.  This is "magic" in the same way that an air conditioner is magic in 1000AD.  It's not just that the medieval scholar hasn't yet imagined the air conditioner.  Even if you showed them the blueprint for the air conditioner, they wouldn't see any advance reason to predict that the weird contraption would output cold air.  The blueprint is exploiting regularities like the pressure-temperature relationship that they haven't yet figured out.

To rephrase back into terms of Goodhart's Law as originally said by Goodhart - "Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes" - statistical regularities that previously didn't break down in the face of lesser control pressures, can break down in the face of stronger control pressures that effectively search a wider range of possibilities, including possibilities that obey rules you didn't know were rules.  This is more likely to happen the more complicated and rich and poorly understood the system is...

### end of quote

...which is how we can be nearly certain, even in advance of knowing the exact method, that a sufficiently strong search against a rating output by a complicated rich poorly-understood human brain will break that brain in ways that we can't even understand.

Even if everything goes exactly as planned on an internal level inside the AGI, which in real life is at least 90% of the difficulty, the outer control structure of the High-Rated Sentence Producer is something that, on the face of it, learns to break the operator.  The fact that it's producing sentences more highly rated than a human inside the same box, the very fact that makes the High-Rated Sentence Producer possibly be useful in the first place, implies that it's searching harder against the rating criterion than a human does.  Human ratings are imperfect proxies for validity, accuracy, estimated expectation of true value produced by a policy, etcetera.  Human brains are rich and complicated and poorly understood.  Such integrity as they possess is probably nearly in balance with the ecological expectation of encountering persuasive invalid arguments produced by other human-level intelligences.  We should expect with very high probability that if HRSP searches hard enough against the rating, it will break the brain behind it.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Why I'm excited about Debate · 2021-01-16T20:25:38.767Z · LW · GW

I’m reasonably compelled by Sperber and Mercer’s claim that explicit reasoning in humans primarily evolved not in order to help us find out about the world, but rather in order to win arguments.

Seems obviously false.  If we simplistically imagine humans as being swayed by, and separately arguing, an increasingly sophisticated series of argument types that we could label 0, 1, 2, ...N, N+1, and which are all each encoded in a single allele that somehow arose to fixation, then the capacity to initially recognize and be swayed by a type N+1 argument is a disadvantage when it comes to winning a type N argument using internal sympathy with the audience's viewpoint, because when that mutation happens for the first time, the other people in the tribe will not find N+1-type arguments compelling, and you do, which leads you to make intuitive mistakes about what they will find compelling.  Only after the capacity to recognize type N+1 arguments as good arguments becomes pervasive in other listeners, does the ability to search for type-N+1 arguments congruent to some particular political or selfish purpose, become a fitness advantage.  Even if we have underlying capabilities to automatically search for political/selfish arguments of all types we currently recognize as good, this just makes the step from N+1 recognition to N+1 search be simultaneous within an individual organism.  It doesn't change the logic whereby going from N to N+1 in the sequence of recognizably good arguments must have some fitness advantage that is not "in order to win arguments" in order for individuals bearing the N+1 allele to have a fitness advantage over individuals who only have the alleles up to N, because being swayed by N+1 is not an advantage in argument until other individuals have that allele too.

In real life we have a deep pool of fixed genes with a bubbling surface of multiple genes under selection, along with complicated phenotypical interactions etcetera, but none of this changes the larger point so far as I can tell: a bacterium or a mouse have little ability to be swayed by arguments of the sort humans exchange with each other, which defines their lack of reasoning ability more than their difficulty in coming up with good arguments; and an ability to be swayed by an argument of whatever type must be present before there's any use in improving a search for arguments that meet that criterion.  In other words, the journey from the kind of arguments that bacteria recognize, to the kind of arguments that humans recognize, cannot have been driven by an increasingly powerful search for political arguments that appeal to bacteria.
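The fixation argument can be made concrete with a toy payoff model (the payoff function is invented for illustration): if a speaker can only win arguments using types the listener already recognizes, then a mutant who merely recognizes and can search for type-N+1 arguments gains no argumentative edge in a population capped at N.

```python
# Toy model: an individual of level L recognizes (and can search for)
# argument types 1..L.  A speaker persuades a listener only with argument
# types the listener recognizes, so the per-encounter persuasion payoff
# is capped by the listener's level.
def persuasion_payoff(speaker_level, listener_level):
    return min(speaker_level, listener_level)

RESIDENT, MUTANT = 3, 4   # population recognizes up to N = 3; one N+1 mutant

# The mutant's extra search capacity is useless against N-level listeners...
mutant_payoff = persuasion_payoff(MUTANT, RESIDENT)
resident_payoff = persuasion_payoff(RESIDENT, RESIDENT)

# ...so any fitness edge for the N+1 allele must come from something other
# than winning arguments.  Only once N+1 recognition is pervasive does
# N+1 search start to pay:
late_payoff = persuasion_payoff(MUTANT, MUTANT)
```

In this toy model the mutant's argument-winning payoff exactly equals the residents', which is the point: the step from N to N+1 recognition needs a non-argumentative fitness advantage, such as the new argument type tracking the world.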

Even if the key word is supposed to be 'explicit', we can apply a similar logic to the ability to be swayed by an 'explicit' thought and the ability to search for explicit thoughts that sway people. 

If arguments had no meaning but to argue other people into things, if they were being subject only to neutral selection or genetic drift or mere conformism, there really wouldn't be any reason for "the kind of arguments humans can be swayed by" to work to build a spaceship.  We'd just end up with some arbitrary set of rules fixed in place.  False cynicism.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Inner Alignment in Salt-Starved Rats · 2020-12-13T21:40:42.501Z · LW · GW

Now, for the rats, there’s an evolutionarily-adaptive goal of "when in a salt-deprived state, try to eat salt". The genome is “trying” to install that goal in the rat’s brain. And apparently, it worked! That goal was installed! 

This is importantly technically false in a way that should not be forgotten on pain of planetary extinction:

The outer loss function training the rat genome was strictly inclusive genetic fitness.  The rats ended up with zero internal concept of inclusive genetic fitness, and indeed, no coherent utility function; and instead ended up with complicated internal machinery running off of millions of humanly illegible neural activations; whose properties included attaching positive motivational valence to imagined states that the rat had never experienced before, but which shared a regularity with states experienced by past rats during the "training" phase.

A human, who works quite similarly to the rat due to common ancestry, may find it natural to think of this as a very simple 'goal'; because things similar to us appear to have falsely low algorithmic complexity when we model them by empathy; because the empathy can model them using short codes.  A human may imagine that natural selection successfully created rats with a simple salt-balance term in their simple generalization of a utility function, simply by natural-selection-training them on environmental scenarios with salt deficits and simple loss-function penalties for not balancing the salt deficits, which were then straightforwardly encoded into equally simple concepts in the rat.

This isn't what actually happened.  Natural selection applied a very simple loss function of 'inclusive genetic fitness'.  It ended up as much more complicated internal machinery in the rat that made zero mention of the far more compact concept behind the original loss function.  You share the complicated machinery so it looks simpler to you than it should be, and you find the results sympathetic so they seem like natural outcomes to you.  But from the standpoint of natural-selection-the-programmer the results were bizarre, and involved huge inner divergences and huge inner novel complexity relative to the outer optimization pressures.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Matt Botvinick on the spontaneous emergence of learning algorithms · 2020-08-12T23:53:57.407Z · LW · GW

What is all of humanity if not a walking catastrophic inner alignment failure? We were optimized for one thing: inclusive genetic fitness. And only a tiny fraction of humanity could correctly define what that is!

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Developmental Stages of GPTs · 2020-07-28T21:54:35.687Z · LW · GW
I don't want to take away from MIRI's work (I still support them, and I think that if the GPTs peter out, we'll be glad they've been continuing their work), but I think it's an essential time to support projects that can work for a GPT-style near-term AGI

I'd love to know of a non-zero integer number of plans that could possibly, possibly, possibly work for not dying to a GPT-style near-term AGI.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Open & Welcome Thread - February 2020 · 2020-02-27T23:57:44.405Z · LW · GW

Thank you for sharing this info. My faith is now shaken.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Time Binders · 2020-02-27T00:21:54.316Z · LW · GW

Yes, via "Language in Thought and Action" and the Null-A novels.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Is Clickbait Destroying Our General Intelligence? · 2018-11-16T23:11:00.520Z · LW · GW

(Deleted section on why I thought cultural general-intelligence software was not much of the work of AGI:)

...because the soft fidelity of implicit unconscious cultural transmission can store less serially deep and intricate algorithms than the high-fidelity DNA transmission used to store the kind of algorithms that appear in computational neuroscience.

I recommend Terrence Deacon's The Symbolic Species for some good discussion of the surprising importance of the shallow algorithms and parameters that can get transmitted culturally. The human-raised chimpanzee Kanzi didn't become a human, because that takes deeper and more neural algorithms than imitating the apes around you can transmit, but Kanzi was a lot smarter than other chimpanzees in some interesting ways.

But as necessary as it may be to avoid feral children, this kind of shallow soft-software doesn't strike me as something that takes a long time to redevelop, compared to hard-software like the secrets of computational neuroscience.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Paul's research agenda FAQ · 2018-07-01T18:12:25.771Z · LW · GW

It would be helpful to know to what extent Paul feels like he endorses the FAQ here. This makes it sound like Yet Another Stab At Boiling Down The Disagreement would say that I disagree with Paul on two critical points:

  • (1) To what extent "using gradient descent or anything like it to do supervised learning" involves a huge amount of Project Chaos and Software Despair before things get straightened out, if they ever do;
  • (2) Whether there's a simple scalable core to corrigibility that you can find by searching for thought processes that seem to be corrigible over relatively short ranges of scale.

I don't want to invest huge amounts arguing with this until I know to what extent Paul agrees with either the FAQ, or that this sounds like a plausible locus of disagreement. But a gloss on my guess at the disagreement might be:


Paul thinks that current ML methods given a ton more computing power will suffice to give us a basically neutral, not of itself ill-motivated, way of producing better conformance of a function to an input-output behavior implied by labeled data, which can learn things on the order of complexity of "corrigible behavior" and do so without containing tons of weird squiggles; Paul thinks you can iron out the difference between "mostly does what you want" and "very exact reproduction of what you want" by using more power within reasonable bounds of the computing power that might be available to a large project in N years when AGI is imminent, or through some kind of weird recursion. Paul thinks you do not get Project Chaos and Software Despair that takes more than 6 months to iron out when you try to do this. Eliezer thinks that in the alternate world where this is true, GANs pretty much worked the first time they were tried, and research got to very stable and robust behavior that boiled down to having no discernible departures from "reproduce the target distribution as best you can" within 6 months of being invented.

Eliezer expects great Project Chaos and Software Despair from trying to use gradient descent, genetic algorithms, or anything like that, as the basic optimization to reproduce par-human cognition within a boundary in great fidelity to that boundary as the boundary was implied by human-labeled data. Eliezer thinks that if you have any optimization powerful enough to reproduce humanlike cognition inside a detailed boundary by looking at a human-labeled dataset trying to outline the boundary, the thing doing the optimization is powerful enough that we cannot assume its neutrality the way we can assume the neutrality of gradient descent.

Eliezer expects weird squiggles from gradient descent - it's not that gradient descent can never produce par-human cognition, even natural selection will do that if you dump in enough computing power. But you will get the kind of weird squiggles in the learned function that adversarial examples expose in current nets - special inputs that weren't in the training distribution, but look like typical members of the training distribution from the perspective of the training distribution itself, will break what we think is the intended labeling from outside the system. Eliezer does not think Ian Goodfellow will have created a competitive form of supervised learning by gradient descent which lacks "squiggles" findable by powerful intelligence by the time anyone is trying to create ML-based AGI, though Eliezer is certainly cheering Goodfellow on about this and would recommend allocating Goodfellow $1 billion if Goodfellow said he could productively use it. You cannot iron out the squiggles just by using more computing power in bounded in-universe amounts.
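The phenomenon shows up even in the simplest trainable function. Below is a hand-rolled logistic regression on synthetic high-dimensional data (dimensions, step sizes, and noise scales are arbitrary choices for the sketch): a perturbation smaller per-coordinate than the noise already present in the data flips the learned label, in the style of a fast-gradient-sign attack.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 200, 500

# Synthetic training set: two classes whose means differ by 0.3 per
# coordinate, buried in unit-variance noise.
mean = 0.3 * np.ones(d)
X = np.vstack([mean + rng.normal(size=(n, d)), -mean + rng.normal(size=(n, d))])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Train a logistic regression by plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.1 * (X.T @ (p - y)) / len(y)
    b -= 0.1 * float(np.mean(p - y))

def predict(x):
    return int(x @ w + b > 0)

# A maximally typical class-1 input is classified correctly...
x = mean.copy()
# ...but nudging every coordinate by 0.4 -- less than the unit noise already
# in the data -- against the sign of w crosses the decision boundary.
x_adv = x - 0.4 * np.sign(w)
```

The perturbed input looks like a typical member of the training distribution coordinate-by-coordinate, yet the learned function labels it with the opposite class; richer models have correspondingly richer squiggles.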

These squiggles in the learned function could correspond to daemons, if they grow large enough, or just something that breaks our hoped-for behavior from outside the system when the system is put under a load of optimization. In general, Eliezer thinks that if you have scaled up ML to produce or implement some components of an Artificial General Intelligence, those components do not have a behavior that looks like "We put in loss function L, and we got out something that really actually minimizes L". You get something that minimizes some of L and has weird squiggles around typical-looking inputs (inputs not obviously distinguished from the training distribution except insofar as they exploit squiggles). The system is subjecting itself to powerful optimization that produces unusual inputs and weird execution trajectories - any output that accomplishes the goal is weird compared to a random output and it may have other weird properties as well. You can't just assume you can train for X in a robust way when you have a loss function that targets X.

I imagine that Paul replies to this saying "I agree, but..." but I'm not sure what comes after the "but". It looks to me like Paul is imagining that you can get very powerful optimization with very detailed conformance to our intended interpretation of the dataset, powerful enough to enclose par-human cognition inside a boundary drawn from human labeling of a dataset, and have that be the actual thing we get out rather than a weird thing full of squiggles. If Paul thinks he has a way to compound large conformant recursive systems out of par-human thingies that start out weird and full of squiggles, we should definitely be talking about that. From my perspective it seems like Paul repeatedly reasons "We train for X and get X" rather than "We train for X and get something that mostly conforms to X but has a bunch of weird squiggles" and also often speaks as if the training method is assumed to be gradient descent, genetic algorithms, or something else that can be assumed neutral-of-itself rather than being an-AGI-of-itself whose previous alignment has to be assumed.

The imaginary Paul in my head replies that we actually are using an AGI to train on X and get X, but this AGI was previously trained by a weaker neutral AGI, and so on going back to something trained by gradient descent. My imaginary reply is that neutrality is not the same property as conformance or nonsquiggliness, and if you train your base AGI via neutral gradient descent you get out a squiggly AGI and this squiggly AGI is not neutral when it comes to that AGI looking at a dataset produced by X and learning a function conformant to X. Or to put it another way, if the plan is to use gradient descent on human-labeled data to produce a corrigible alien that is smart enough to produce more corrigible aliens better than gradient descent, this corrigible alien actually needs to be quite smart because an IQ 100 human will not build an aligned IQ 140 human even if you run them for a thousand years, so you are producing something very smart and dangerous on the first step, and gradient descent is not smart enough to align that base case.

But at this point I expect the real Paul to come back and say, "No, no, the idea is something else..."

A very important aspect of my objection to Paul here is that I don't expect weird complicated ideas about recursion to work on the first try, with only six months of additional serial labor put into stabilizing them, which I understand to be Paul's plan. In the world where you can build a weird recursive stack of neutral optimizers into conformant behavioral learning on the first try, GANs worked on the first try too, because that world is one whose general Murphy parameter is set much lower than ours. Being able to build weird recursive stacks of optimizers that work correctly to produce neutral and faithful optimization for corrigible superhuman thought out of human-labeled corrigible behaviors and corrigible reasoning, without very much of a time penalty relative to nearly-equally-resourced projects who are just cheerfully revving all the engines as hard as possible trying to destroy the world, is just not how things work in real life, dammit. Even if you could make the weird recursion work, it would take time.


Eliezer thinks that while corrigibility probably has a core which is of lower algorithmic complexity than all of human value, this core is liable to be very hard to find or reproduce by supervised learning of human-labeled data, because deference is an unusually anti-natural shape for cognition, in a way that a simple utility function would not be an anti-natural shape for cognition. Utility functions have multiple fixpoints requiring the infusion of non-environmental data, our externally desired choice of utility function would be non-natural in that sense, but that's not what we're talking about, we're talking about anti-natural behavior.

E.g.: Eliezer also thinks that there is a simple core describing a reflective superintelligence which believes that 51 is a prime number, and actually behaves like that including when the behavior incurs losses, and doesn't thereby ever promote the hypothesis that 51 is not prime or learn to safely fence away the cognitive consequences of that belief and goes on behaving like 51 is a prime number, while having no other outwardly discernible deficits of cognition except those that directly have to do with 51. Eliezer expects there's a relatively simple core for that, a fixed point of tangible but restrained insanity that persists in the face of scaling and reflection; there's a relatively simple superintelligence that refuses to learn around this hole, refuses to learn how to learn around this hole, refuses to fix itself, but is otherwise capable of self-improvement and growth and reflection, etcetera. But the core here has a very anti-natural shape and you would be swimming uphill hard if you tried to produce that core in an indefinitely scalable way that persisted under reflection. You would be very unlikely to get there by training really hard on a dataset where humans had labeled as the 'correct' behavior what humans thought would be the implied behavior if 51 were a prime number, not least because gradient descent is terrible, but also just because you'd be trying to lift 10 pounds of weirdness with an ounce of understanding.

The central reasoning behind this intuition of anti-naturalness is roughly, "Non-deference converges really hard as a consequence of almost any detailed shape that cognition can take", with a side order of "categories over behavior that don't simply reduce to utility functions or meta-utility functions are hard to make robustly scalable".

The real reasons behind this intuition are not trivial to pump, as one would expect of an intuition that Paul Christiano has been alleged to have not immediately understood. A couple of small pumps would be for the first intuition and for the second intuition.

What I imagine Paul is imagining is that it seems to him like it would in some sense be not that hard for a human who wanted to be very corrigible toward an alien, to be very corrigible toward that alien; so you ought to be able to use gradient-descent-class technology to produce a base-case alien that wants to be very corrigible to us, the same way that natural selection sculpted humans to have a bunch of other desires, and then you apply induction on it building more corrigible things.

My class of objections in (1) is that natural selection was actually selecting for inclusive fitness when it got us, so much for going from the loss function to the cognition; and I have problems with both the base case and the induction step of what I imagine to be Paul's concept of solving this using recursive optimization bootstrapping itself; and even more so do I have trouble imagining it working on the first, second, or tenth try over the course of the first six months.

My class of objections in (2) is that it's not a coincidence that humans didn't end up deferring to natural selection, or that in real life if we were faced with a very bizarre alien we would be unlikely to want to defer to it. Our lack of scalable desire to defer in all ways to an extremely bizarre alien that ate babies, is not something that you could fix just by giving us an emotion of great deference or respect toward that very bizarre alien. We would have our own thought processes that were unlike its thought processes, and if we scaled up our intelligence and reflection to further see the consequences implied by our own thought processes, they wouldn't imply deference to the alien even if we had great respect toward it and had been trained hard in childhood to act corrigibly towards it.

A dangerous intuition pump here would be something like, "If you take a human who was trained really hard in childhood to have faith in God and show epistemic deference to the Bible, and inspecting the internal contents of their thought at age 20 showed that they still had great faith, if you kept amping up that human's intelligence their epistemology would at some point explode"; and this is true even though it's other humans training the human, and it's true even though religion as a weird sticking point of human thought is one we selected post-hoc from the category of things historically proven to be tarpits of human psychology, rather than aliens trying from the outside in advance to invent something that would stick the way religion sticks. I use this analogy with some reluctance because of the clueless readers who will try to map it onto the AGI losing religious faith in the human operators, which is not what this analogy is about at all; the analogy here is about the epistemology exploding as you ramp up intelligence because the previous epistemology had a weird shape.

Acting corrigibly towards a baby-eating virtue ethicist when you are a utilitarian is an equally weird shape for a decision theory. It probably does have a fixed point but it's not an easy one, the same way that "yep, on reflection and after a great deal of rewriting my own thought processes, I sure do still think that 51 is prime" probably has a fixed point but it's not an easy one.

I think I can imagine an IQ 100 human who defers to baby-eating aliens, although I really think a lot of this is us post-hoc knowing that certain types of thoughts can be sticky, rather than the baby-eating aliens successfully guessing in advance how religious faith works for humans and training the human to think that way using labeled data.

But if you ramp up the human's intelligence to where they are discovering subjective expected utility and logical decision theory and they have an exact model of how the baby-eating aliens work and they are rewriting their own minds, it's harder to imagine the shape of deferential thought at IQ 100 successfully scaling to a shape of deferential thought at IQ 1000.

Eliezer also tends to be very skeptical of attempts to cross cognitive chasms between A and Z by going through weird recursions and inductive processes that wouldn't work equally well to go directly from A to Z; and the story of K'th'ranga V is a good intuition pump here. So Eliezer is also not very hopeful that Paul will come up with a weirdly recursive solution that scales deference to IQ 101, IQ 102, etcetera, via deferential agents building other deferential agents, in a way that Eliezer finds persuasive. Especially a solution that works on merely the tenth try over the first six months, doesn't kill you when the first nine tries fail, and doesn't require more than 10x extra computing power compared to projects that are just bulling cheerfully ahead.


I think I have a disagreement with Paul about the notion of being able to expose inspectable thought processes to humans, such that we can examine each step of the thought process locally and determine whether it locally has properties that will globally add up to corrigibility, alignment, and intelligence. It's not that I think this can never be done, or even that I think it takes longer than six months. In this case, I think this problem is literally isomorphic to "build an aligned AGI". If you can locally inspect cognitive steps for properties that globally add to intelligence, corrigibility, and alignment, you're done; you've solved the AGI alignment problem and you can just apply the same knowledge to directly build an aligned corrigible intelligence.

As I currently flailingly attempt to understand Paul, Paul thinks that having humans do the inspection (base case) or thingies trained to resemble aggregates of trained thingies (induction step) is something we can do in an intuitive sense by inspecting a reasoning step and seeing if it sounds all aligned and corrigible and intelligent. Eliezer thinks that the large-scale or macro traces of cognition, e.g. a "verbal stream of consciousness" or written debates, are not complete with respect to general intelligence in bounded quantities; we are generally intelligent because of sub-verbal cognition whose intelligence-making properties are not transparent to inspection. That is: An IQ 100 person who can reason out loud about Go, but who can't learn from the experience of playing Go, is not a complete general intelligence over boundedly reasonable amounts of reasoning time.

This means you have to be able to inspect steps like "learn an intuition for Go by playing Go" for local properties that will globally add to corrigible aligned intelligence. And at this point it no longer seems intuitive that having humans do the inspection is adding a lot of value compared to us directly writing a system that has the property.

This is a previous discussion that is ongoing between Paul and myself, and I think it's a crux of disagreement but not one that's as cruxy as 1 and 2. Although it might be a subcrux of my belief that you can't use weird recursion starting from gradient descent on human-labeled data to build corrigible agents that build corrigible agents. I think Paul is modeling the grain size here as corrigible thoughts rather than whole agents, which if it were a sensible way to think, might make the problem look much more manageable; but I don't think you can build corrigible thoughts without building corrigible agents to think them unless you have solved the decomposition problem that I think is isomorphic to building an aligned corrigible intelligence directly.

I remark that this intuition matches what the wise might learn from Scott's parable of K'th'ranga V: If you know how to do something then you know how to do it directly rather than by weird recursion, and what you imagine yourself doing by weird recursion you probably can't really do at all. When you want an airplane you don't obtain it by figuring out how to build birds and then aggregating lots of birds into a platform that can carry more weight than any one bird and then aggregating platforms into megaplatforms until you have an airplane; either you understand aerodynamics well enough to build an airplane, or you don't, the weird recursion isn't really doing the work. It is by no means clear that we would have a superior government free of exploitative politicians if all the voters elected representatives whom they believed to be only slightly smarter than themselves, until a chain of delegation reached up to the top level of government; either you know how to build a less corruptible relationship between voters and politicians, or you don't, the weirdly recursive part doesn't really help. It is no coincidence that modern ML systems do not work by weird recursion because all the discoveries are of how to just do stuff, not how to do stuff using weird recursion. (Even with AlphaGo which is arguably recursive if you squint at it hard enough, you're looking at something that is not weirdly recursive the way I think Paul's stuff is weirdly recursive, and for more on that see

It's in this same sense that I intuit that if you could inspect the local elements of a modular system for properties that globally added to aligned corrigible intelligence, it would mean you had the knowledge to build an aligned corrigible AGI out of parts that worked like that, not that you could aggregate systems that corrigibly learned to put together sequences of corrigible thoughts into larger corrigible thoughts starting from gradient descent on data humans have labeled with their own judgments of corrigibility.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on A Rationalist Argument for Voting · 2018-06-07T19:05:08.188Z · LW · GW

Voting in elections is a wonderful example of logical decision theory in the wild. The chance that you are genuinely logically correlated to a random trade partner is probably small, in cases where you don't have mutual knowledge of LDT, leaving altruism and reputation as sustaining reasons for cooperation. With millions of voters, the chance that you are correlated to thousands of them is much better.

Or perhaps you'd prefer to believe the dictate of Causal Decision Theory that if an election is won by 3 votes, nobody's vote influenced it, and if an election is won by 1 vote, all of the millions of voters on the winning side are solely responsible. But that was a silly decision theory anyway. Right?
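The contrast between the two decision theories can be put into a toy expected-value calculation. All numbers here are illustrative stand-ins, not estimates from the comments above:

```python
# Toy comparison of CDT vs LDT reasoning about voting (all numbers made up).
# CDT counts only the chance that your single ballot flips the outcome;
# LDT also counts the bloc of voters whose decisions correlate with yours.

p_pivotal = 1e-9        # chance one ballot flips a large election (illustrative)
stakes = 1e9            # value difference between outcomes, arbitrary utilons
cost_of_voting = 10.0   # time and effort, in the same units

# CDT: expected value of casting your one causally-isolated vote.
ev_cdt = p_pivotal * stakes - cost_of_voting          # negative: "don't bother"

# LDT: suppose ~10,000 voters decide the same way you do, for the same
# reasons. Your choice logically controls the whole bloc, so credit the
# bloc's swing probability, with the bloc's costs shared per voter.
correlated_bloc = 10_000
p_bloc_pivotal = 1e-3   # a 10,000-vote swing flips outcomes far more often
ev_ldt = p_bloc_pivotal * stakes / correlated_bloc - cost_of_voting

print(ev_cdt, ev_ldt)   # the sign flips between the two analyses
```

The point is only the sign flip: a single uncorrelated ballot looks like a bad trade, while the same decision evaluated over the correlated bloc looks like a good one.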

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Toolbox-thinking and Law-thinking · 2018-06-06T07:37:30.587Z · LW · GW

Savage's Theorem isn't going to convince anyone who doesn't start out believing that preference ought to be a total preorder. Coherence theorems are talking to anyone who starts out believing that they'd rather have more apples.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Local Validity as a Key to Sanity and Civilization · 2018-04-07T12:54:17.576Z · LW · GW

There will be a single very cold day occasionally regardless of whether global warming is true or false. Anyone who knows the phrase "modus tollens" ought to know that. That said, if two unenlightened ones are arguing back and forth in all sincerity by telling each other about the hot versus cold days they remember, neither is being dishonest, but both are making invalid arguments. But this is not the scenario offered in the original, which concerns somebody who does possess the mental resources to know better, but is tempted to rationalize in order to reach the more agreeable conclusion. They feel a little pressure in their head when it comes to deciding which argument to accept. If a judge behaved thusly in sentencing a friend or an enemy, would we not consider them morally deficient in their duty as a judge? There is a level of unconscious ignorance that renders an innocent entirely blameless; somebody who possesses the inner resources to have the first intimation that one hot day is a bad argument for global warming is past that level.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on A LessWrong Crypto Autopsy · 2018-02-03T22:53:49.945Z · LW · GW

This is pretty low on the list of opportunities I'd kick myself for missing. A longer reply is here:

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Arbital postmortem · 2018-02-01T03:39:31.278Z · LW · GW

The vision for Arbital would have provided incentives to write content, but those features were not implemented before the project ran out of time. I did not feel that at any point the versions of Arbital that were in fact implemented were at a state where I predicted they'd attract lots of users, and said so.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Arbital postmortem · 2018-02-01T03:37:00.295Z · LW · GW

I designed a solution from the start, I'm not stupid. It didn't get implemented in time.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Pascal’s Muggle Pays · 2017-12-21T04:48:11.496Z · LW · GW

Unless I'm missing something, the trouble with this is that, absent a leverage penalty, all of the reasons you've listed for not having a muggable decision algorithm... drumroll... center on the real world, which, absent a leverage penalty, is vastly outweighed by tiny probabilities of googolplexes and Ackermann numbers of utilons. If you don't already consider the Mugger's claim to be vastly improbable, then all the considerations of "But if I logically decide to let myself be mugged, that retrologically increases his probability of lying" or "If I let myself be mugged, this real-world scenario will be repeated many times" are vastly outweighed by the tiny probability that the Mugger is telling the truth.
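The "vastly outweighed" claim is just arithmetic in log space. A minimal sketch with made-up magnitudes (the real payoffs invoked by a Mugger are far larger than any float can hold, so we compare log10 of each |probability × utility| term):

```python
# Toy illustration (all magnitudes made up): without a leverage penalty,
# a tiny probability of an astronomically large payoff swamps every
# real-world consideration in the expected-utility sum.

log10_p_mugger = -20.0   # "vastly improbable" by ordinary standards
log10_u_mugger = 100.0   # stand-in for a googolplex-scale payoff (really far larger)
log10_mugger_term = log10_p_mugger + log10_u_mugger   # = 80.0

# Real-world reasons not to be muggable (reputation, repeated scenarios, ...)
# are bounded by real-world utilities:
log10_real_world_term = 12.0   # a generous bound on any this-world term

print(log10_mugger_term > log10_real_world_term)   # mugger term dominates
```

No ordinary-scale improbability can rescue the sum; only a penalty that scales with the claimed stakes (a leverage penalty) changes the comparison.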

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Hero Licensing · 2017-11-17T15:10:12.044Z · LW · GW

Zvi's probably right.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Zombies Redacted · 2016-07-02T21:08:53.660Z · LW · GW

Sure. Measure a human's input and output. Play back the recording. Or did you mean across all possible cases? In the latter case see

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on JFK was not assassinated: prior probability zero events · 2016-04-27T18:13:26.335Z · LW · GW

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Machine learning and unintended consequences · 2016-03-20T02:41:58.650Z · LW · GW

Ed Fredkin has since sent me a personal email:

By the way, the story about the two pictures of a field, with and without army tanks in the picture, comes from me. I attended a meeting in Los Angeles, about half a century ago where someone gave a paper showing how a random net could be trained to detect the tanks in the picture. I was in the audience. At the end of the talk I stood up and made the comment that it was obvious that the picture with the tanks was made on a sunny day while the other picture (of the same field without the tanks) was made on a cloudy day. I suggested that the "neural net" had merely trained itself to recognize the difference between a bright picture and a dim picture.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on The Number Choosing Game: Against the existence of perfect theoretical rationality · 2016-01-29T01:04:40.477Z · LW · GW

Moving to Discussion.

Comment by Eliezer_Yudkowsky on [deleted post] 2015-12-18T19:39:55.402Z

Please don't.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on A toy model of the control problem · 2015-09-18T19:57:18.774Z · LW · GW

I assume the point of the toy model is to explore corrigibility or other mechanisms that are supposed to kick in after A and B end up not perfectly value-aligned, or maybe just to show an example of why a non-value-aligning solution for A controlling B might not work, or maybe specifically to exhibit a case of a not-perfectly-value-aligned agent manipulating its controller.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on A toy model of the control problem · 2015-09-18T19:51:54.632Z · LW · GW

When I consider this as a potential way to pose an open problem, the main thing that jumps out at me as being missing is something that doesn't allow A to model all of B's possible actions concretely. The problem is trivial if A can fully model B, precompute B's actions, and precompute the consequences of those actions.

The levels of 'reason for concern about AI safety' might ascend something like this:

  • 0 - system with a finite state space you can fully model, like Tic-Tac-Toe
  • 1 - you can't model the system in advance and therefore it may exhibit unanticipated behaviors on the level of computer bugs
  • 2 - the system is cognitive, and can exhibit unanticipated consequentialist or goal-directed behaviors, on the level of a genetic algorithm finding an unanticipated way to turn the CPU into a radio or Eurisko hacking its own reward mechanism
  • 3 - the system is cognitive and humanish-level general; an uncaught cognitive pressure toward an outcome we wouldn't like results in our facing something like a smart cryptographic adversary that will deeply ponder any way to work around anything it sees as an obstacle
  • 4 - the system is cognitive and superintelligent; its estimates are always at least as good as our estimates; the expected agent-utility of the best strategy we can imagine when we imagine ourselves in the agent's shoes, is an unknowably severe underestimate of the expected agent-utility of the best strategy the agent can find using its own cognition

We want to introduce something into the toy model to at least force solutions past level 0. This is doubly true because levels 0 and 1 are in some sense 'straightforward' and therefore tempting for academics to write papers about (because they know that they can write the paper); so if you don't force their thinking past those levels, I'd expect that to be all that they wrote about. You don't get into the hard problems with astronomical stakes until levels 3 and 4. (Level 2 is the most we can possibly model using running code with today's technology.)
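To make level 0 concrete: "a system you can fully model" means you can exhaustively evaluate its entire game tree and precompute every outcome, as in this sketch for Tic-Tac-Toe (boards are 9-character strings; the whole state space is under 3^9 = 19,683 positions):

```python
from functools import lru_cache

# The eight winning lines of a 3x3 board, as index triples.
LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def value(board, player):
    """Game value for X under perfect play: +1 X wins, 0 draw, -1 O wins."""
    w = winner(board)
    if w == "X":
        return 1
    if w == "O":
        return -1
    moves = [i for i, s in enumerate(board) if s == "."]
    if not moves:
        return 0  # full board, no winner: draw
    other = "O" if player == "X" else "X"
    vals = [value(board[:i] + player + board[i+1:], other) for i in moves]
    return max(vals) if player == "X" else min(vals)

# The full game tree fits in memory; perfect play from the empty board
# is a draw, and every reachable position's outcome is precomputable.
print(value("." * 9, "X"))  # → 0
```

Everything above level 0 is defined precisely by this kind of exhaustive precomputation becoming impossible, which is why a toy control-problem model needs to block it.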

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Procedural Knowledge Gaps · 2015-08-19T18:29:54.717Z · LW · GW

I recall originally reading something about a measure of exercise-linked gene expression and I'm pretty sure it wasn't that New Scientist article, but regardless, it's plausible that some mismemory occurred and this more detailed search screens off my memory either way. 20% of the population being immune to exercise seems to match real-world experience a bit better than 40% so far as my own eye can see - I eyeball-feel more like a 20% minority than a 40% minority, if that makes sense. I have revised my beliefs to match your statements. Thank you for tracking that down!

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Don't You Care If It Works? - Part 1 · 2015-07-29T20:27:14.706Z · LW · GW

"Does somebody being right about X increase your confidence in their ability to earn excess returns on a liquid equity market?" has to be the worst possible question to ask about whether being right in one thing should increase your confidence about them being right elsewhere. Liquid markets are some of the hardest things in the entire world to outguess! Being right about MWI is enormously being easier than being right about what Microsoft stock will do relative to the rest of S&P 500 over the next 6 months.

There's a gotcha to the gotcha which is that you have to know from your own strength how hard the two problems are - financial markets are different from, e.g., the hard problem of conscious experience, in that we know exactly why it's hard to predict them, rather than just being confused. Lots of people don't realize that MWI is knowable. Nonetheless, going from MWI to Microsoft stock behavior is like going from 2 + 2 = 4 to MWI.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on If MWI is correct, should we expect to experience Quantum Torment? · 2015-07-14T18:35:16.764Z · LW · GW

You're confusing subjective probability and objective quantum measure. If you flip a quantum coin, half your measure goes to worlds where it comes up heads and half goes to where it comes up tails. This is an objective fact, and we know it solidly. If you don't know whether cryonics works, you're probably still already localized by your memories and sensory information to either worlds where it works or worlds where it doesn't; all or nothing, even if you're ignorant of which.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Pascal's Muggle: Infinitesimal Priors and Strong Evidence · 2015-06-07T21:35:29.619Z · LW · GW

One can even strip out the part about agents and carry out the reasoning on pure causal nodes; the chance of a randomly selected causal node being in one of at most 100 distinguished positions on a causal graph with respect to 3↑↑↑3 other nodes ought to be at most 100/3↑↑↑3 for finite causal graphs.
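For readers unfamiliar with the arrow notation: 3↑↑↑3 is Knuth's up-arrow notation, and a short recursive definition shows why the bound 100/3↑↑↑3 is so extreme. A sketch, computable only for small arguments:

```python
def up_arrow(a, n, b):
    """Knuth's up-arrow a ↑^n b: n=1 is exponentiation, and each higher
    level iterates the level below it b times."""
    if n == 1:
        return a ** b
    if b == 0:
        return 1
    return up_arrow(a, n - 1, up_arrow(a, n, b - 1))

print(up_arrow(3, 1, 3))  # 3↑3  = 3**3 = 27
print(up_arrow(3, 2, 3))  # 3↑↑3 = 3**(3**3) = 7625597484987
# 3↑↑↑3 = 3↑↑(3↑↑3) is a power tower of 3s of height 7,625,597,484,987 --
# far too large to evaluate, so 100/3↑↑↑3 is unimaginably small.
```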

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Rationality is about pattern recognition, not reasoning · 2015-06-07T19:50:33.700Z · LW · GW

Yes, as his post facto argument.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Rationality is about pattern recognition, not reasoning · 2015-06-07T07:16:03.734Z · LW · GW

You have not understood correctly regarding Carl. He claimed, in hindsight, that Zuckerberg's potential could've been distinguished in foresight, but he did not do so.

Comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) on Taking Effective Altruism Seriously · 2015-06-07T06:59:28.410Z · LW · GW

Moved to Discussion.