# Ngo and Yudkowsky on alignment difficulty

post by Eliezer Yudkowsky (Eliezer_Yudkowsky), Richard_Ngo (ricraz) · 2021-11-15T20:31:34.135Z · LW · GW · 133 comments

## Contents

0. Prefatory comments
1. September 5 conversation
  1.1. Deep vs. shallow problem-solving patterns
  1.2. Requirements for science
  1.3. Capability dials
  1.4. Consequentialist goals vs. deontologist goals
2. Follow-ups
  2.1. Richard Ngo's summary
3. September 8 conversation
  3.1. The Brazilian university anecdote
  3.2. Brain functions and outcome pumps
  3.3. Hypothetical-planning systems, nanosystems, and evolving generality
  3.4. Coherence and pivotal acts
4. Follow-ups
  4.1. Richard Ngo's summary
  4.2. Nate Soares' summary


This post is the first in a series of transcribed Discord conversations between Richard Ngo and Eliezer Yudkowsky, moderated by Nate Soares. We've also added Richard and Nate's running summaries of the conversation (and others' replies) from Google Docs.

Later conversation participants include Ajeya Cotra, Beth Barnes, Carl Shulman, Holden Karnofsky, Jaan Tallinn, Paul Christiano, Rob Bensinger, and Rohin Shah.

The transcripts are a complete record of several Discord channels MIRI made for discussion. We tried to edit the transcripts as little as possible, other than to fix typos and a handful of confusingly-worded sentences, to add some paragraph breaks, and to add referenced figures and links. We didn't end up redacting any substantive content, other than the names of people who would prefer not to be cited. We swapped the order of some chat messages for clarity and conversational flow (indicated with extra timestamps), and in some cases combined logs where the conversation switched channels.

# 4. Follow-ups

## 4.2. Nate Soares' summary

comment by Rob Bensinger (RobbBB) · 2021-11-15T20:46:41.163Z · LW(p) · GW(p)

This is the first post in a sequence, consisting of the logs of a Discord server MIRI made for hashing out AGI-related disagreements with Richard Ngo, Open Phil, etc.

I did most of the work of turning the chat logs into posts, with lots of formatting help from Matt Graves and additional help from Oliver Habryka, Ray Arnold, and others. I also hit the 'post' button for Richard and Eliezer. (I don't plan to repeat this note on future posts in this sequence, unless folks request it.)

Replies from: lincolnquirk
comment by lincolnquirk · 2021-11-16T11:05:02.256Z · LW(p) · GW(p)

I'd like to express my gratitude and excitement (and not just to you, Rob, though your work is included in this):

Deep thanks to everyone involved for having the discussion, writing up and formatting, and posting it on LW. I think this is some of the more interesting and potentially impactful stuff I've seen relating to AI alignment in a long while.

(My only thought is... why hasn't a discussion like this occurred sooner? Or has it, and it just hasn't made it to LW?)

Replies from: RobbBB
comment by Rob Bensinger (RobbBB) · 2021-11-16T14:35:27.781Z · LW(p) · GW(p)

I'm not sure why we haven't tried the 'generate and publish chatroom logs' option before. If you mean more generally 'why is MIRI waiting to hash these things out with other xrisk people until now?', my basic model is:

• Syncing with others was a top priority for SingInst (2000-2012), and this resulted in stuff like the Sequences, the FOOM debate, Highly Advanced Epistemology 101 for Beginners [? · GW], the Singularity Summits, etc. It (largely) doesn't cover the same ground as current disagreements because people disagree about different stuff now.
• 'SingInst' becoming 'MIRI' in 2013 coincided with us shifting much more to a focus on alignment research. That said, a lot of factors resulted in us continuing to have a lot of non-research-y conversations with others, including: EA coalescing in 2012-2014; the wider AI alignment field starting in earnest with the release of Superintelligence (2014) and the Puerto Rico conference (2015); and Open Philanthropy starting in 2014.
• Some of these conversations (and the follow-up reflections prompted by these conversations) ended up inspiring publications at some point, including some of the content on Arbital (mostly active 2015-2017), Inadequate Equilibria [? · GW] (published 2017, but mostly written around 2013-2015 I believe), etc.
• My model is that we then mostly disappeared in 2018-2020 while we hunkered down to do research, continuing to have intermittent conversations and email exchanges with folks, but not sinking very much time into syncing up. (I'll say that a lot of non-MIRI EA leaders were very eager to sink loads of time into syncing up with MIRI, and it's entirely MIRI's 'sorry, we want to do research instead' that caused this to not happen during this period.)

So broadly I'd say 'we did try to sync up a lot, but it turns out there's a lot of ground to cover, and different individuals at different times have very different perspectives and cruxes'. At a certain point, (a) we'd transmitted enough of our perspective that we expected to be pretty happy with e.g. EA leaders' sense of how to do broader field-building, academic outreach, etc.; and (b) we felt we'd plucked the low-hanging fruit and further syncing up would require a lot more focused effort, which seemed lower-priority than 'make ourselves less confused about the alignment problem by working on this research program' at the time.

Replies from: Eliezer_Yudkowsky, Vaniver
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-16T15:27:46.330Z · LW(p) · GW(p)

I'm definitely not happy with others' sense of how to do field-building, but it's not like I thought I could fix that issue by spending the rest of my life trying to do it myself.

comment by Vaniver · 2021-11-16T17:26:53.331Z · LW(p) · GW(p)

I'm not sure why we haven't tried the 'generate and publish chatroom logs' option before.

My guess is that a lot of these conversations often hinge on details that people are somewhat antsy about saying in public, and I suspect MIRI now thinks the value of "credible public pessimism" is larger than the cost of "gesturing towards things that seem powerful" on the margin, such that chatlogs like this are a better idea than they would have seemed to the MIRI of 4 years ago. [Or maybe it was just "no one thought to try, because we had access to in-person conversations and those seemed much better, despite not generating transcripts."]

comment by johnswentworth · 2021-11-15T22:18:28.018Z · LW(p) · GW(p)

So here's one important difference between humans and neural networks: humans face the genomic bottleneck which means that each individual has to rederive all the knowledge about the world that their parents already had. If this genetic bottleneck hadn't been so tight, then individual humans would have been significantly less capable of performing novel tasks.

I disagree with this in an interesting way. (Not particularly central to the discussion, but since both Richard & Eliezer thought the quoted claim is basically-true, I figured I should comment on it.)

First, outside view evidence: most of the genome is junk. That's pretty strong evidence that the size of the genome is not itself a taut constraint. If there were evolutionary fitness gains to be had, in general, by passing more information via the genome, then we should expect that to have evolved already.

Second, inside view: overparameterized local search processes (including evolution and gradient descent on NNs) perform information compression by default. This is a technical idea that I haven't written up properly yet, but as a quick sketch... suppose that I have a neural net with N parameters. It's overparameterized, so there are many degrees of freedom in any optimum - i.e. there's a whole optimal surface, not just an optimal point. Now suppose that I can build a near-perfect model of the training data by setting only M (< N) parameter-values; with these values, all the other parameters are screened off, so the remaining N-M parameters can take any values at all. (I'll call the set of M parameter-values a "model".) The smaller M, the larger N-M, and therefore the more possible parameter-values achieve optimality using this model. And the more possible parameter-values achieve optimality using the model, the more of the optimum-space this "model" fills. In practice, for something like evolution or gradient descent, this would mean a broad peak.

Rough takeaway: broader peaks in the fitness-landscape are precisely those which require fixing fewer parameters. Fixing fewer parameters, while still achieving optimality, requires compressing all the information-required-to-achieve-optimality into those few parameters. The more compression, the broader the peak, and the more likely that a local search process will find it.
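
A toy numerical sketch of this point, under the simplest possible assumptions (a quadratic loss that genuinely depends on only M of N parameters; everything below is an invented illustration, not a claim about real networks or evolution): counting zero-curvature directions at an optimum recovers the "fewer fixed parameters = broader peak" relationship.

```python
import numpy as np

N, M = 10, 3                      # total parameters vs. parameters the loss actually depends on
target = np.arange(1.0, M + 1)    # arbitrary "correct" values for the M meaningful parameters

def loss(theta):
    # Only theta[:M] matters; the remaining N - M parameters are screened off.
    return np.sum((theta[:M] - target) ** 2)

theta_opt = np.zeros(N)
theta_opt[:M] = target            # one point on the optimal surface

# Finite-difference Hessian of the loss at the optimum.
eps = 1e-4
H = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        e_i, e_j = np.eye(N)[i] * eps, np.eye(N)[j] * eps
        H[i, j] = (loss(theta_opt + e_i + e_j) - loss(theta_opt + e_i)
                   - loss(theta_opt + e_j) + loss(theta_opt)) / eps**2

flat_directions = np.sum(np.isclose(np.linalg.eigvalsh(H), 0.0, atol=1e-3))
print(f"{flat_directions} flat directions out of {N}")   # -> N - M = 7

# Fewer fixed parameters (smaller M) => more zero-curvature directions => a broader,
# higher-volume optimum, which a local search process is more likely to stumble into.
```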

Replies from: DaemonicSigil, TekhneMakre
comment by DaemonicSigil · 2021-11-16T04:46:35.382Z · LW(p) · GW(p)

Large genomes have (at least) 2 kinds of costs. The first is the energy and other resources required to copy the genome whenever your cells divide. The existence of junk DNA suggests that this cost is not a limiting factor. The other cost is that a larger genome will have more mutations per generation. So maintaining that genome across time uses up more selection pressure. Junk DNA requires no maintenance, so it provides no evidence either way. Selection pressure cost could still be the reason why we don't see more knowledge about the world being translated genetically.

A gene-level way of saying the same thing is that even a gene that provides an advantage may not survive if it takes up a lot of genome space, because it will be destroyed by the large number of mutations.

Replies from: johnswentworth
comment by johnswentworth · 2021-11-16T05:14:41.047Z · LW(p) · GW(p)

Good point, I wasn't thinking about that mechanism.

However, I don't think this creates an information bottleneck in the sense needed for the original claim in the post, because the marginal cost of storing more information in the genome does not increase via this mechanism as the amount-of-information-passed increases. Each gene just needs to offer a large enough fitness advantage to counter the noise on that gene; the requisite fitness advantage does not change depending on whether the organism currently has a hundred information-passing genes or a hundred thousand. It's not really a "bottleneck" so much as a fixed price: the organism can pass any amount of information via the genome, so long as each base-pair contributes marginal fitness above some fixed level.

It does mean that individual genes can't be too big, but it doesn't say much about the number of information-passing genes (so long as separate genes have mostly-decoupled functions, which is indeed the case for the vast majority of gene pairs in practice).

Replies from: darius
comment by darius · 2021-11-17T23:40:00.102Z · LW(p) · GW(p)

Here's the argument I'd give for this kind of bottleneck. I haven't studied evolutionary genetics; maybe I'm thinking about it all wrong.

In the steady state, an average individual has n children in their life, and just one of those n makes it to the next generation. (Crediting a child 1/2 to each parent.) This gives log2(n) bits of error-correcting signal to prune deleterious mutations. If the genome length times the functional bits per base pair times the mutation rate is greater than that log2(n), then you're losing functionality with every generation.

One way for a beneficial new mutation to get out of this bind is by reducing the mutation rate.  Another is refactoring the same functionality into fewer bits, freeing up bits for something new. But generically a fitness advantage doesn't seem to affect the argument that the signal from purifying selection gets shared by the whole genome.
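
A rough back-of-the-envelope version of that inequality, using illustrative human-scale numbers (the genome size, mutation rate, bits-per-base-pair, and offspring count below are assumptions chosen for the sketch, not figures from the comment):

```python
import math

genome_length = 3.0e9   # base pairs (roughly human-scale)
mutation_rate = 1.0e-8  # mutations per base pair per generation (rough assumption)
bits_per_bp   = 1.0     # information that must be maintained per functional base pair (assumption)
children_n    = 4       # average offspring per individual in a steady-state population (assumption)

selection_budget = math.log2(children_n)   # ~2 bits of pruning signal per generation

# Largest fraction of the genome that can be functional before mutational erosion
# outruns the selection budget, per the inequality above.
max_functional_fraction = selection_budget / (genome_length * bits_per_bp * mutation_rate)

print(f"selection budget: {selection_budget:.1f} bits/generation")
print(f"max functional fraction under these assumptions: {max_functional_fraction:.1%}")
# -> about 7%: under these toy numbers, only a small slice of the genome can sit under
#    tight purifying selection, which is at least consistent with "most of the genome is junk".
```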

comment by TekhneMakre · 2021-11-16T03:23:18.680Z · LW(p) · GW(p)
most of the genome is junk. That's pretty strong evidence that the size of the genome is not itself a taut constraint.

My guess is that this is a total misunderstanding of what's meant by "genomic bottleneck". The bottleneck isn't the amount of information storage, it's the fact that the genome can only program the mind in a very indirect, developmental way, so that it can install stuff like "be more interested in people" but not "here's how to add numbers".

Replies from: cousin_it
comment by cousin_it · 2021-11-16T10:00:14.407Z · LW(p) · GW(p)

That seems wrong, living creatures have lots of specific behaviors that are genetically programmed.

In fact I think both you and John are misunderstanding the bottleneck. The point isn't that the genome is small, nor that it affects the mind indirectly. The point is that the mind doesn't affect the genome. Living creatures don't have the tech to encode their life experience into genes for the next generation.

Replies from: ricraz, TekhneMakre
comment by Richard_Ngo (ricraz) · 2021-11-17T00:03:09.748Z · LW(p) · GW(p)

I've appreciated this comment thread! My take is that you're all talking about different relevant things. It may well be the case that there are multiple reasons why more skills and knowledge aren't encoded in our genomes: a) it's hard to get that information in (from parents' brains), b) it's hard to get that information out (to children's brains), and c) having large genomes is costly. What I'm calling the genomic bottleneck is a combination of all of them (although I think John is probably right that c) is not the main reason).

What would falsify my claim about the genomic bottleneck is if the main reason there isn't more information passed on via genomes is because d) doing so is not very useful. That seems pretty unlikely, but not entirely out of the picture. E.g. we know that evolution is able to give baby deer the skill of walking shortly after birth, so it seems like d) might be the best explanation of why humans can't do that too. But deer presumably evolved that skill over a very long time period, whereas I'm more interested in rapid changes.

comment by TekhneMakre · 2021-11-16T10:52:38.342Z · LW(p) · GW(p)

Do you think you can encode good flint-knapping technique genetically? I doubt that.

I think I agree with your point, and think it's a more general and correct statement of the bottleneck; but, still, I think that genome does mainly affect the mind indirectly, and this is one of the constraints making it be the case that humans have lots of learning / generalizing capability. (This doesn't just apply to humans. What are some stark examples of animals with hardwired complex behaviors? With a fairly high bar for "complex", and a clear explanation of what is hardwired and how we know. Insects have some fairly complex behaviors, e.g. web building, ant-hill building, the tree-leaf nests of weaver ants, etc.; but IDK enough to rule out a combination of a little hardwiring, some emergence, and some learning. Lots of animals hunt after learning from their parents how to hunt. I think a lot of animals can walk right after being born? I think beavers in captivity will fruitlessly chew on wood, indicating that the wild phenotype is encoded by something simple like "enjoys chewing" (plus, learned desire for shelter), rather than "use wood for dam".)

An operationalization of "the genome directly programs the mind" would be that things like [the motions employed in flint-knapping] can be hardwired by small numbers of mutations (and hence can be evolved given a few million relevant years). I think this isn't true, but counterevidence would be interesting. Since the genome can't feasibly encode behaviors directly, or at least can't update them quickly enough to keep up with a changing niche, the species instead evolves to learn behaviors on the fly via algorithms that generalize. If there were *either* mind-mind transfer, *or* direct programming of behavior by the genome, then higher frequency changes would be easier and there'd be less need for fluid intelligence. (In fact it's sort of plausible to me (given my ignorance) that humans are imitation specialists and are less clever than Neanderthals were, since mind-mind transfer can replace intelligence.)

comment by KatWoods (ea247) · 2021-11-29T18:45:29.755Z · LW(p) · GW(p)

You can listen to this and all the other Yudkowsky & Ngo/Christiano conversations in podcast form on the Nonlinear Library now.

You can also listen to them on any podcast player. Just look up Nonlinear Library.

I’ve listened to them as is and I find it pretty easy to follow, but if you’re interested in making it even easier for people to follow, these fine gentlemen have put up a ~$230 RFP/bounty for anybody who turns it into audio where each person has a different voice. It would probably be easiest to just do it on our platform, since there’s a relatively easy way to change the voices; it will just be a tedious ~1-4 hours of work.

My main bottleneck is management time, so I don’t have the time to manage the process or choose somebody who I’d trust to do it without messing with the quality. It does seem a shame, though, to have something so close to being even better, and not let people do what clearly is desired, because of my worry of accidentally messing up the quality of the audio. I think the main thing is just being conscientious enough to do 1-4 hours of repetitive work, plus an attention to detail.

After a couple minutes of thinking on it, I think a potential solution would be to have a super quick and dirty way to delegate trust. I’ll give you access to our platform to change the voices if you either a) are getting a/have a degree at an elite school (thus demonstrating a legible minimal amount of conscientiousness and ability to do boring tasks) or b) have at least 75 mutual EA friends with me on Facebook and can provide an EA reference about your diligence. Just DM me. I’ll do it on a first come, first served basis. If you do it with human voices, we’d also be happy to add that to the Library.

Finally, sorry for the delay. There was a comedy of errors: there was a bug in the system while I also came down with a human bug (a cold, not covid :) ), and the articles were so long our regular system wasn’t working, so things weren't automatic like usual.

Replies from: jimrandomh, RobbBB
comment by jimrandomh · 2021-11-30T03:17:45.246Z · LW(p) · GW(p)

(Mod note: I edited this comment to fix broken links.)

Replies from: ea247
comment by KatWoods (ea247) · 2021-11-30T14:10:50.898Z · LW(p) · GW(p)

Thank you!

comment by Rob Bensinger (RobbBB) · 2021-11-30T02:42:43.879Z · LW(p) · GW(p)

Thanks for doing this, Kat! :)

I’ve listened to them as is and I find it pretty easy to follow, but if you’re interested in making it even easier for people to follow, these fine gentlemen [? · GW] have put up a ~$230 RFP/bounty for anybody who turns it into audio where each person has a different voice.

That link isn't working for me; where's the bounty?

comment by TurnTrout · 2021-11-23T19:22:00.550Z · LW(p) · GW(p)

I've started commenting on this discussion on a Google Doc. Here are some excerpts:

During this step, if humanity is to survive, somebody has to perform some feat that causes the world to not be destroyed in 3 months or 2 years when too many actors have access to AGI code that will destroy the world if its intelligence dial is turned up.

• Assumes the situation is well-modelled by a binary "has-AGI?" predicate.
• (I am sympathetic to the microeconomics of intelligence explosion working out in a way where the binary "has-AGI?" framing is true, but I feel uncertain about the prospect.)
• Somehow rules out situations like: We have somewhat aligned AIs which push the world to make future unaligned AIs slightly less likely, which makes the AI population more aligned on average; this cycle compounds until we're descending very fast into the basin of alignment and goodness.
• This isn't my mainline or anything, but I note that it's ruled out by Eliezer's model as I understand it.
• Some other internal objections are arising and I'm not going to focus on them now.

Every AI output effectuates outcomes in the world.

Right but the likely domain of cognitive discourse matters. Pac-Man agents effectuate outcomes in the world, but their optimal policies are harmless. So the question seems to hinge on when the domain of cognition shifts to put us in the crosshairs of performant policies.

This doesn't mean Eliezer is wrong here about the broader claim, but the distinction deserves mentioning for the people who weren't tracking it. (I think EY is obviously aware of this)

If you knew about the things that humans are using to reuse their reasoning about chipped handaxes and other humans, to prove math theorems, you would see it as more plausible that proving math theorems would generalize to chipping handaxes and manipulating humans.

Could we have observed it any other way? Since we surely weren't selected for proving math theorems, we don't have a native cortex specializing in math. So conditional on considering things like theorem-proving at all, it has to reuse other native capabilities.

More precisely, one possible mind design which solves theorems also reasons about humans. This is some update from whatever prior, towards EY's claim. I'm considering whether we know enough about the common cause (evolution giving us a general-purpose reasoning algorithm) to screen off/reduce the Theorems -> Human-modelling update.

So here's one important difference between humans and neural networks: humans face the genomic bottleneck which means that each individual has to rederive all the knowledge about the world that their parents already had. If this genetic bottleneck hadn't been so tight, then individual humans would have been significantly less capable of performing novel tasks.

Thanks, Richard—this is a cool argument that I hadn't heard before.

You will systematically overestimate how much easier, or how far you can push the science part without getting the taking-over-the-world part, for as long as your model is ignorant of what they have in common.

OK, it's a valid point and I'm updating a little, under the apparent model of "here's a set of AI capabilities, linearly ordered in terms of deep-problem-solving, and if you push too far you get taking-over-the-world." But I don't see how we get to that model to begin with.

comment by Ramana Kumar (ramana-kumar) · 2021-11-19T15:48:22.236Z · LW(p) · GW(p)

I am interested in the history-funnelling property -- the property of being like a consequentialist, or of being effective at achieving an outcome -- and have a specific confusion I'd love to get insight on from anyone who has any.

Question: Possible outcomes are in the mind of a world-modeller - reality just is as it is (exactly one way) and isn't made of possibilities. So in what sense do the consequentialist-like things Yudkowsky is referring to funnel history?

Option 1 (robustness/behavioural/our models): They achieve narrow outcomes with respect to an externally specified set of counterfactuals. E.g., relative to what we consider "could have happened", the consequentialists selected an excellent course of action for their purposes. This would make consequentialists optimizing systems [AF · GW] in Flint's sense.

Option 2 (agency/structural/their models): They are structured in such a way that they do their own considering and evaluating and deciding. We observe mechanisms that implement the processes of predicting and evaluating outcomes in these systems (and/or their history). So the possibilities that are narrowed down are the consequentialist's possibilities, the counterfactuals are produced by their models which may or may not line up with some externally specified ones (like ours).

I mostly think Yudkowsky is referring to Option 2, but I get confused by phrases (e.g. from Soares's summary [? · GW]) like "manage to actually funnel history" or "apparent consequentialism", that seem to me to make most sense under Option 1.

Replies from: Eliezer_Yudkowsky, RobbBB
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-19T22:38:38.875Z · LW(p) · GW(p)

To Rob's reply, I'll add that my own first reaction to your question was that it seems like a map-territory / perspective issue as appears in eg thermodynamics?  Like, this has a similar flavor to asking "What does it mean to say that a classical system is in a state of high entropy when it actually only has one particular system state?"  Adding this now in case I don't have time to expand on it later; maybe just saying that much will help at all, possibly.

comment by Rob Bensinger (RobbBB) · 2021-11-19T17:29:09.296Z · LW(p) · GW(p)

Possible outcomes are in the mind of a world-modeller - reality just is as it is (exactly one way) and isn't made of possibilities. So in what sense do the consequentialist-like things Yudkowsky is referring to funnel history?

I'm not sure that I understand the question, but my intuition is to say: they funnel world-states into particular outcomes in the same sense that literal funnels funnel water into particular spaces, or in the same sense that a slope makes things roll down it.

If you find water in a previously-empty space with a small aperture, and you're confused that no water seems to have spilled over the sides, you may suspect that a funnel was there. Funnels are part of a larger deterministic universe, so maybe in some sense any given funnel (like everything else) 'had to do exactly that thing'. Still, we can observe that funnels are an important part of the causal chain in these cases, and that places with funnels tend to end up with this type of outcome much more often.

Similarly, consequentialists tend to remake parts of the world (typically, as much of the world as they can reach) into things that are high in their preference ordering. From Optimization and the Singularity [LW · GW]:

[...] Suppose you have a car, and suppose we already know that your preferences involve travel.  Now suppose that you take all the parts in the car, or all the atoms, and jumble them up at random.  It's very unlikely that you'll end up with a travel-artifact at all, even so much as a wheeled cart; let alone a travel-artifact that ranks as high in your preferences as the original car.  So, relative to your preference ordering, the car is an extremely improbable artifact; the power of an optimization process is that it can produce this kind of improbability.

You can view both intelligence and natural selection [? · GW] as special cases of optimization:  Processes that hit, in a large search space, very small targets defined by implicit preferences.  Natural selection prefers more efficient replicators.  Human intelligences have more complex preferences [? · GW].  Neither evolution nor humans have consistent utility functions, so viewing them as "optimization processes" is understood to be an approximation.  You're trying to get at the sort of work being done, not claim that humans or evolution do this work perfectly.

This is how I see the story of life and intelligence - as a story of improbably good designs being produced by optimization processes.  The "improbability" here is improbability relative to a random selection from the design space, not improbability in an absolute sense - if you have an optimization process around, then "improbably" good designs become probable. [...]

But it's not clear what a "preference" is, exactly. So a more general way of putting it, in Recognizing Intelligence [LW · GW], is:

[...] Suppose I landed on an alien planet and discovered what seemed to be a highly sophisticated machine, all gleaming chrome as the stereotype demands.  Can I recognize this machine as being in any sense well-designed, if I have no idea what the machine is intended to accomplish?  Can I guess that the machine's makers were intelligent, without guessing their motivations?

And again, it seems like in an intuitive sense I should obviously be able to do so.  I look at the cables running through the machine, and find large electrical currents passing through them, and discover that the material is a flexible high-temperature high-amperage superconductor.  Dozens of gears whir rapidly, perfectly meshed...

I have no idea what the machine is doing.  I don't even have a hypothesis as to what it's doing.  Yet I have recognized the machine as the product of an alien intelligence.

[...] Why is it a good hypothesis to suppose that intelligence or any other optimization process played a role in selecting the form of what I see, any more than it is a good hypothesis to suppose that the dust particles in my rooms are arranged by dust elves?

Consider that gleaming chrome.  Why did humans start making things out of metal?  Because metal is hard; it retains its shape for a long time.  So when you try to do something, and the something stays the same for a long period of time, the way-to-do-it may also stay the same for a long period of time.  So you face the subproblem of creating things that keep their form and function.  Metal is one solution to that subproblem.

[... A]s simple a form of negentropy [? · GW] as regularity over time - that the alien's terminal values don't take on a new random form with each clock tick - can imply that hard metal, or some other durable substance, would be useful in a "machine" - a persistent configuration of material that helps promote a persistent goal.

The gears are a solution to the problem of transmitting mechanical forces from one place to another, which you would want to do because of the presumed economy of scale in generating the mechanical force at a central location and then distributing it.  In their meshing, we recognize a force of optimization applied in the service of a recognizable instrumental value: most random gears, or random shapes turning against each other, would fail to mesh, or fly apart.  Without knowing what the mechanical forces are meant to do, we recognize something that transmits mechanical force - this is why gears appear in many human artifacts, because it doesn't matter much what kind of mechanical force you need to transmit on the other end.  You may still face problems like trading torque for speed, or moving mechanical force from generators to appliers.

These are not universally [? · GW] convergent instrumental challenges.  They probably aren't even convergent with respect to maximum-entropy goal systems (which are mostly out of luck).

But relative to the space of low-entropy, highly regular goal systems - goal systems that don't pick a new utility function for every different time and every different place - that negentropy pours through the notion of "optimization" and comes out as a concentrated probability distribution over what an "alien intelligence" would do, even in the "absence of any hypothesis" about its goals. [...]

"Consequentialists funnel the universe into shapes that are higher in their preference ordering" isn't a required inherent truth for all consequentialists; some might have weird goals, or be too weak to achieve much. Likewise, some literal funnels are broken or misshapen, or just never get put to use. But in both cases, we can understand the larger class by considering the unusual function well-working instances can perform.

(In the case of literal funnels, we can also understand the class by considering its physical properties rather than its function/behavior/effects. Eventually we should be able to do the same for consequentialists, but currently we don't know what physical properties of a system make it consequentialist, beyond the level of generality of e.g. 'its future-steering will approximately obey expected utility theory'.)
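
One crude way to make "funneling into a small target" quantitative, offered as a toy sketch rather than anything from the quoted posts: run a weak optimizer over a made-up design space, then measure how rarely random selection from that space does at least as well. The negative log of that fraction is a rough count of "bits of optimization".

```python
import math
import random

def fitness(design):
    # Arbitrary toy objective over 40-bit "designs": number of satisfied constraints.
    return sum(design)

def random_design(n=40):
    return [random.randint(0, 1) for _ in range(n)]

def hill_climb(steps=15, n=40):
    # A deliberately weak optimization process: flip one bit at a time, keep improvements.
    # (More steps => a smaller target hit => more "bits of optimization" below.)
    design = random_design(n)
    for _ in range(steps):
        candidate = design[:]
        candidate[random.randrange(n)] ^= 1
        if fitness(candidate) >= fitness(design):
            design = candidate
    return design

optimized_fitness = fitness(hill_climb())

# How improbable is doing at least this well under random selection from the design space?
samples = [fitness(random_design()) for _ in range(100_000)]
p = max(sum(f >= optimized_fitness for f in samples), 1) / len(samples)
print(f"optimized fitness: {optimized_fitness}")
print(f"random designs at least as fit: {p:.4f}  (~{-math.log2(p):.1f} bits of optimization)")
```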

Replies from: ramana-kumar
comment by Ramana Kumar (ramana-kumar) · 2021-11-23T17:28:37.402Z · LW(p) · GW(p)

Thanks for the replies! I'm still somewhat confused but will try again to both ask the question more clearly and summarise my current understanding.

What, in the case of consequentialists, is analogous to the water funnelled by literal funnels? Is it possibilities-according-to-us? Or is it possibilities-according-to-the-consequentialist? Or is it neither (or both) of those?

To clarify a little what the options in my original comment were, I'll say what I think they correspond to for literal funnels. Option 1 corresponds to the fact that funnels are usually nearby (in spacetime) when water is in a small space without having spilled, and Option 2 corresponds to the characteristic funnel shape (in combination with facts about physical laws maybe).

I think your and Eliezer's replies are pointing me at a sense in which both Option 1 and Option 2 are correct, but they are used in different ways in the overall story. To tell this story, I want to draw a distinction between outcome-pumps (behavioural agents) and consequentialists (structural agents). Outcome-pumps are effective at achieving outcomes, and this effectiveness is measured according to our models (option 1). Consequentialists do (or have done in their causal history) the work of selecting actions according to expected consequences in coherent pursuit of an outcome, and the expected consequences are therefore their own (option 2).

Spelling this out a little more - Outcome-pumps are optimizing systems [AF · GW]: there is a space of possible configurations, a much smaller target subset of configurations, and a basin of attraction such that if the system+surroundings starts within the basin, it ends up within the target. There are at least two ways of looking at the configuration space. Firstly, there's the range of situations in which we actually observe the same (or similar) outcome-pump system and that it achieved its outcome. Secondly, there's the range of hypothetical possibilities we can imagine and reason about putting the outcome-pump system into, and extrapolating (using our own models) that it will achieve the outcome. Both of these ways are "Option 1".
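
A minimal executable sketch of that optimizing-systems picture, under toy assumptions (configurations are points in the plane, the dynamics are gradient descent on a bowl-shaped potential, the target set is a small ball around the minimum, and the candidate basin is a large square; none of these specifics come from Flint's post):

```python
import numpy as np

def step(x, lr=0.1):
    # Dynamics: move downhill on the potential f(x) = ||x||^2.
    return x - lr * 2 * x

def ends_in_target(x0, steps=200, target_radius=0.05):
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = step(x)
    return np.linalg.norm(x) < target_radius

# Perturb starting points throughout the candidate basin and check that they all reach the target.
rng = np.random.default_rng(0)
starts = rng.uniform(-5, 5, size=(1000, 2))
print(all(ends_in_target(s) for s in starts))  # True: a broad basin funnels into a tiny target set
```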

Consequentialists (structural agents) do the work, somewhere somehow - maybe in their brains, maybe in their causal history, maybe in other parts of their structure and history - of maintaining and updating beliefs and selecting actions that lead to (their modelled) expected consequences that are high in their preference ordering (this is all Option 2).

It should be somewhat uncontroversial that consequentialists are outcome pumps, to the extent that they’re any good at doing the consequentialist thing (and have sufficiently achievable preferences relative to their resources etc).

The more substantial claim I read MIRI as making is that outcome pumps are consequentialists, because the only way to be an outcome pump is to be a consequentialist. Maybe you wouldn't make this claim so strongly, since there are counterexamples like fires and black holes -- and there may be some restrictions on what kind of outcome pumps the claim applies to (such as some level of retargetability or robustness?).

How does this overall take sound?

Scott Garrabrant’s question [AF · GW] on whether agent-like behaviour implies agent-like architecture seems pretty relevant to this whole discussion -- Eliezer, do you have an answer to that question? Or at least do you think it’s an important open question?

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-23T17:37:42.720Z · LW(p) · GW(p)

My reply to your distinction between 'consequentialists' and 'outcome pumps' would be, "Please forget entirely about any such thing as a 'consequentialist' as you defined it; I would now like to talk entirely about powerful outcome pumps.  All understanding begins there, and we should only introduce the notion of how outcomes are pumped later in the game.  Understand the work before understanding the engines; nearly every key concept here is implicit in the notion of work rather than in the notion of a particular kind of engine."

(Modulo that lots of times people here are like "Well but a human at a particular intelligence level in a particular complicated circumstance once did this kind of work without the thing happening that it sounds like you say happens with powerful outcome pumps"; and then you have to look at the human engine and its circumstances to understand why outcome pumping could specialize down to that exact place and fashion, which will not be reduplicated in more general outcome pumps that have their dice re-rolled.)

Replies from: ramana-kumar, daniel-kokotajlo
comment by Ramana Kumar (ramana-kumar) · 2021-11-25T12:16:04.453Z · LW(p) · GW(p)

A couple of direct questions I'm stuck on:

• Do you agree that Flint's optimizing systems are a good model (or even definition) of outcome pumps?
• Are black holes and fires reasonable examples of outcome pumps?

I'm asking these to understand the work better.

• Yes. Flint's notion is one I came to independently when thinking about "goal-directedness". It could be missing some details, but I find it hard to snap out of the framework entirely.
• Yes. But maybe not the most informative examples. They're highly non-retargetable.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-11-25T11:55:13.859Z · LW(p) · GW(p)

Understand the work before understanding the engines; nearly every key concept here is implicit in the notion of work rather than in the notion of a particular kind of engine.

I don't know the relevant history of science, but I wouldn't be surprised if something like the opposite was true: Our modern, very useful understanding of work is an abstraction that grew out of many people thinking concretely about various engines. Thinking about engines was like the homework exercises that helped people to reach and understand the concept of work.

Similarly, perhaps it is pedagogically (and conceptually) helpful to begin with the notion of a consequentialist and then generalize to outcome pumps.

comment by Eli Tyre (elityre) · 2021-11-18T01:28:35.659Z · LW(p) · GW(p)

Von Neumann was actually a fairly reflective fellow who knew about, and indeed helped generalize, utility functions. The great achievements of von Neumann were not achieved by some very specialized hypernerd who spent all his fluid intelligence on crystallizing math and science and engineering alone, and so never developed any opinions about politics or started thinking about whether or not he had a utility function.

Uh. I don't know about that.

Von Neumann seemed to me to be very much not making rational tradeoffs of the sort that one would make if they were conceptualizing themselves as an agent with a utility function.

From a short post [LW(p) · GW(p)] I wrote, a few years ago, after reading a bit about the man:

For one thing, at the end of his life, he was terrified of dying. But throughout the course of his life he made many reckless choices with his health.

He ate gluttonously and became fatter and fatter over the course of his life. (One friend remarked that he “could count anything but calories.”)

Furthermore, he seemed to regularly risk his life when driving.

• Von Neumann was an aggressive and apparently reckless driver. He supposedly totaled his car every year or so. An intersection in Princeton was nicknamed “Von Neumann corner” for all the auto accidents he had there. Records of accidents and speeding arrests are preserved in his papers. [The book goes on to list a number of such accidents.] (pg. 25)

(Amusingly, Von Neumann’s reckless driving seems due, not to drinking and driving, but to singing and driving. “He would sway back and forth, turning the steering wheel in time with the music.”)

I think I would call this a bug.

Replies from: Lukas_Gloor
comment by Lukas_Gloor · 2021-11-18T12:03:37.051Z · LW(p) · GW(p)

Some of your examples don't prove anything, e.g., eating gluttonously is a legitimate tradeoff if you have a certain metabolism and care more about advancing science as a life goal in years where your brain still works well. About the driving, I guess it depends on how reckless it was. It's probably rare for people to die in inner-city driving accidents, especially if you make sure to not mess around at intersections. Judging by the part about singing, it seems possible he was just having fun and could afford to buy new cars?

Replies from: elityre, Lukas_Gloor
comment by Eli Tyre (elityre) · 2021-11-19T18:03:40.099Z · LW(p) · GW(p)

Some of your examples don't prove anything,

I agree that they aren't conclusive.

But are you suggesting that the reckless driving was well-considered expected utility maximizing?

I guess I can see that if fatal accidents are rare, I guess, but I don't think that was the case?

"Activities that have a small, but non-negligible chance of death or permanent injury are not worth the immediate short-term thrill", seems like a textbook case of a conclusion one would draw from considering expected utility theory in practice, in one's life.

At minimum, it seems like there ought to be pareto-improvements that are just as or close to as fun, but which entail a lot less risk?

Replies from: Lukas_Gloor
comment by Lukas_Gloor · 2021-11-21T09:12:43.079Z · LW(p) · GW(p)

I guess I can see that if fatal accidents are rare, I guess, but I don't think that was the case?

I agree that if driving incurs non-trivial risks of lasting damage, that's indicative that the person isn't trying very seriously to optimize some ambitious long-term goal.

At minimum, it seems like there ought to be pareto-improvements that are just as or close to as fun, but which entail a lot less risk?

This reasoning makes me think your model lacks gears about what it's like to live with certain types of psychologies. Making pareto improvements for your habits is itself a task to be prioritized. Depending on what else you have going on in life and how difficult it is to you to replace one habit with a different one, it's totally possible that for some period, it's not rational for you to focus on the habit change.

Basically, because often the best way to optimize your utility comes from applying your strengths to solve a certain bottleneck under time pressure, the observation "this person engages in suboptimal-seeming behavior some of the time" provides very little predictive evidence.

In fact, if you showed me someone who never engaged in such suboptimal behavior, I'd be tempted to wonder if they're maybe not optimizing hard enough in that one area that matters more than everything else they could do.

That said, it is a bit hard to empathize with "driving recklessly while singing" as a hard-to-change behavior. It doesn't sound like something particularly compulsive, except maybe if the impulse to sing came from exuberant happiness due to amphetamine use. But who knows. Von Neumann for sure had an unusual brain and maybe he often had random overwhelming feelings of euphoria.

comment by Lukas_Gloor · 2021-11-18T12:13:19.638Z · LW(p) · GW(p)

I think the mistake of hyperoptimizing a healthy lifestyle, or micromanaging productivity hacks to the point of spending a lot of one's attention on new productivity hacks, is probably bigger than the mistake of getting overweight, as long as the overweight person puts as much of their brainpower as possible into actually irreplaceable cognitive achievements. And long-term health is only important if you care a lot about living for a very long time.

comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-17T15:37:52.953Z · LW(p) · GW(p)

I want to push back a little against the claim that the bootstrapping strategy ("build a relatively weak aligned AI that will make superhumanly fast progress on AI alignment") is definitely irrelevant/doomed/inferior. Specifically, I don't know whether this strategy is good or not in practice, but it serves as a useful threshold for what level/kind of capabilities we need to align in order to solve AI risk.

Yudkowsky and I seem to agree that "do a pivotal act directly" is not something productive for us to work on, but "do alignment research" is something productive for us to work on. Therefore, there exists some range of AI capabilities which allows for superhuman alignment research but not for pivotal acts. Maybe this range is so narrow that in practice AI capability will cross it very quickly, or maybe not.

Moreover, I believe that there are trade-offs between safety and capability. This not only seems plausible, but actually shows up in many approaches to safety (quantilization, confidence thresholds / consensus algorithms, homomorphic encryption...). Therefore, it's not safe to assume that any level of capability sufficient to pose risk (i.e. for a negative pivotal act) is also sufficient for a positive pivotal act.

Yudkowsky seems to claim that aligning an AI that does further alignment research is just too hard, and instead we should be designing AIs that are only competent in a narrow domain (e.g. competent at designing nanosystems but not at manipulating humans). Now, this does seem like an interesting class of alignment strategies, but it's not the only class.

One class of alignment strategies (which in particular Christiano wrote a lot about) compatible with bootstrapping is "amplified imitation of users" (e.g. IDA but I don't want to focus on IDA too much because of certain specifics I am skeptical about). This is potentially vulnerable to attack from counterfactuals [AF(p) · GW(p)] plus the usual malign simulation hypotheses, but is not obviously doomed. There is also a potential issue with capability: maybe predicting is too hard if you don't know which features are important to predict and which aren't.

Another class of alignment strategies (which in particular Russell often promotes) compatible with bootstrapping is "learn what the user wants and find a plan to achieve it" (e.g. IRL/CIRL etc). This is hard because it requires formalizing "what the user wants" but might be tractable via something along the lines of the AIT definition of intelligence [AF(p) · GW(p)]. Making it safe probably requires imposing something like the Hippocratic principle [AF(p) · GW(p)], which, if you think through the implications, pulls it in the direction of the "superimitation" class. But, this might avoid superimitation's capability issues.

It could be that "restricted cognition" will turn out to be superior to both superimitation and value learning, but it seems far from a slam dunk at this point.

Replies from: Edouard Harris
comment by Edouard Harris · 2021-11-18T14:26:35.686Z · LW(p) · GW(p)

I want to push back a little against the claim that the bootstrapping strategy ("build a relatively weak aligned AI that will make superhumanly fast progress on AI alignment") is definitely irrelevant/doomed/inferior. Specifically, I don't know whether this strategy is good or not in practice, but it serves as a useful threshold for what level/kind of capabilities we need to align in order to solve AI risk.

Yeah, very much agree with all of this. I even think there's an argument to be made that relatively narrow-yet-superhuman theorem provers (or other research aids) could be worth the risk to develop and use, because they may make the human alignment researchers who use them more effective in unpredictable ways. For example, researchers tend to instinctively avoid considering solution paths that are bottlenecked by statements they see as being hard to prove — which is totally reasonable. But if your mentality is that you can just toss a super-powerful theorem-prover at the problem, then you're free to explore concept-space more broadly since you may be able to check your ideas at much lower cost.

(Also find myself agreeing with your point about tradeoffs. In fact, you could think of a primitive alignment strategy as having a kind of Sharpe ratio: how much marginal x-risk does it incur per marginal bit of optimization it gives? Since a closed-form solution to the alignment problem doesn't necessarily seem forthcoming, measuring its efficient frontier might be the next best thing.)

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-11-17T11:25:57.897Z · LW(p) · GW(p)

[Notes mostly to myself, not important, feel free to skip]

My hot take overall is that Yudkowsky is basically right but doing a poor job of arguing for the position. Ngo is very patient and understanding.

"it doesn't seem implausible to me that we build AIs that are significantly more intelligent (in the sense of being able to understand the world) than humans, but significantly less agentic." --Ngo

"It is likely that, before the point where AGIs are strongly superhuman at seeking power, they will already be strongly superhuman at understanding the world, and at performing narrower pivotal acts like alignment research which don’t require as much agency (by which I roughly mean: large-scale motivations and the ability to pursue them over long timeframes)." --Ngo

"So it is legit harder to point out "the consequentialist parts of the cat" by looking for which sections of neurology are doing searches right there. That said, to the extent that the visual cortex does not get tweaked on failure to catch a mouse, it's not part of that consequentialist loop either." --Yudkowsky

"But the answer is that some problems are difficult in that they require solving lots of subproblems, and an easy way to solve all those subproblems is to use patterns which collectively have some coherence and overlap, and the coherence within them generalizes across all the subproblems. Lots of search orderings will stumble across something like that before they stumble across separate solutions for lots of different problems." --Yudkowsky

This is really making me want to keep working on my+Ramana's sequence on agency! :)

[Ngo][14:12]
Great
Okay, so one claim is that something like deontology is a fairly natural way for minds to operate.
[Yudkowsky][14:14]
("If that were true," he thought at once, "bureaucracies and books of regulations would be a lot more efficient than they are in real life.")

I think I disagree with Yudkowsky here? I almost want to say "the opposite is true; if people were all innately consequentialist then we wouldn't have so many blankfaces and bureaucracies would be a lot better because the rules would just be helpful guidelines." Or "Sure but books of regulations work surprisingly well, well enough that there's gotta be some innate deontology in humans." Or "Have you conversed with normal humans about ethics recently? If they are consequentialists they are terrible at it."

As such, on the Eliezer view as I understand it, we can see ourselves as asking for a very unnatural sort of object: a path-through-the-future that is robust enough to funnel history into a narrow band in a very wide array of circumstances, but somehow insensitive to specific breeds of human-initiated attempts to switch which narrow band it's pointed towards.

I think this is a great paragraph. It's a concise and reasonably accurate description of (an important part of) the problem.

I do think it, and this whole discussion, focuses too much on plans and not enough on agents. It's good for illustrating how the problem arises even in a context where we have some sort of oracle that gives us a plan and then we carry it out... but realistically our situation will be more dire than that because we'll be delegating to autonomous AGI agents. :(

Replies from: Eliezer_Yudkowsky, Charlie Steiner
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-18T23:32:04.377Z · LW(p) · GW(p)

The idea is not that humans are perfect consequentialists, but that they are able to work at all to produce future-steering outputs, insofar as humans actually do work at all, by an inner overlap of the shape of inner parts which has a shape resembling consequentialism, and the resemblance is what does the work.  That is, your objection has the same flavor as "But humans aren't Bayesian!  So how can you say that updating on evidence is what's doing their work of mapmaking?"

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-11-19T10:09:02.338Z · LW(p) · GW(p)

To be clear I think I agree with your overall position. I just don't think the argument you gave for it (about bureaucracies etc.) was compelling.

comment by Charlie Steiner · 2021-11-18T23:13:53.791Z · LW(p) · GW(p)

Ngo is very patient and understanding.

Perhaps... too patient and understanding. Richard! Blink twice if you're being held against your will!

(I too would like you to write more about agency :P)

comment by Ruby · 2021-11-17T19:40:18.791Z · LW(p) · GW(p)

Curated. The treatment of how cognition/agents/intelligence work alone makes this post curation-worthy, but I want to further commend how much it attempts to bridge [large] inferential distances, notwithstanding Eliezer's experience of it being difficult to bridge all the distance. Heck, just bridging some distance about the distance is great.

I think good things would happen if we had more dialogs like this between researchers. I'm interested in making it easier to conduct and publish them on LessWrong, so thanks to all involved for the inspiration.

comment by Sam Clarke · 2021-11-17T14:26:41.771Z · LW(p) · GW(p)

Minor terminology note, in case discussion about "genomic/genetic bottleneck" continues: genetic bottleneck appears to have a standard meaning in ecology (different to Richard's meaning), so genomic bottleneck seems like the better term to use.

comment by brglnd · 2021-11-17T16:32:35.922Z · LW(p) · GW(p)

[I may be generalizing here and I don't know if this has been said before.]

It seems to me that Eliezer's models are a lot more specific than those of people like Richard. While Richard may put some credence on superhuman AI being "consequentialist" by default, Eliezer has certain beliefs about intelligence that make it extremely likely in his mind.

I think Eliezer's style of reasoning which relies on specific, thought-out models of AI makes him more pessimistic than others in EA. Others believe there are many ways that AGI scenarios could play out and are generally uncertain. But Eliezer has specific models that make some scenarios a lot more likely in his mind.

There are many valid theoretical arguments for why we are doomed, but maybe other EAs put less credence in them than Eliezer does.

comment by cousin_it · 2021-11-16T13:31:03.302Z · LW(p) · GW(p)

I think it makes complete sense to say something like "once we have enough capability to run AIs making good real-world plans, some moron will run such an AI unsafely". And that itself implies a startling level of danger. But Eliezer seems to be making a stronger point, that there's no easy way to run such an AI safely, and all tricks like "ask the AI for plans that succeed conditional on them being executed" fail. And maybe I'm being thick, but the argument for that point still isn't reaching me somehow. Can someone rephrase for me?

Replies from: johnswentworth, steve2152, Koen.Holtman, ADifferentAnonymous
comment by johnswentworth · 2021-11-16T17:56:50.718Z · LW(p) · GW(p)

The main issue with this sort of thing (on my understanding of Eliezer's models) is Hidden Complexity of Wishes [LW · GW]. You can make an AI safe by making it only able to fulfill certain narrow, well-defined kinds of wishes where we understand all the details of what we want, but then it probably won't suffice for a pivotal act. Alternatively, you can make it powerful enough for a pivotal act, but unfortunately a (good) pivotal act probably has to be very big, very irreversible, and very entangled with all the complicated details of human values. So alignment is likely to be a necessary step for a (good) pivotal act.

What this looks-like-in-practice is that "ask the AI for plans that succeed conditional on them being executed" has to be operationalized somehow, and the operationalization will inevitably not correctly capture what we actually want (because "what we actually want" has a ton of hidden complexity).

Replies from: cousin_it
comment by cousin_it · 2021-11-19T11:14:25.864Z · LW(p) · GW(p)

This is tricky. Let's say we have a powerful black box that initially has no knowledge or morals, but a lot of malleable computational power. We train it to give answers to scary real-world questions, like how to succeed at business or how to manipulate people. If we reward it for competent answers while we can still understand the answers, at some point we'll stop understanding answers, but they'll continue being super-competent. That's certainly a danger and I agree with it. But by the same token, if we reward the box for aligned answers while we still understand them, the alignment will generalize too. There seems no reason why alignment would be much less learnable than competence about reality.

Maybe your and Eliezer's point is that competence about reality has a simple core, while alignment doesn't. But I don't see the argument for that. Reality is complex, and so are values. A process for learning and acting in reality can have a simple core, but so can a process for learning and acting on values. Humans pick up knowledge from their surroundings, which is part of "general intelligence", but we pick up values just as easily and using the same circuitry. Where does the symmetry break?

Replies from: johnswentworth
comment by johnswentworth · 2021-11-19T16:54:28.542Z · LW(p) · GW(p)

I do think alignment has a relatively-simple core. Not as simple as intelligence/competence, since there's a decent number of human-value-specific bits which need to be hardcoded (as they are in humans), but not enough to drive the bulk of the asymmetry.

(BTW, I do think you've correctly identified an important point which I think a lot of people miss: humans internally "learn" values from a relatively-small chunk of hardcoded information. It should be possible in-principle to specify values with a relatively small set of hardcoded info, similar to the way humans do it; I'd guess at most 1000 things on the order of complexity of a very fuzzy face detector are required, and probably fewer than 100.)

The reason it's less learnable than competence is not that alignment is much more complex, but that it's harder to generate a robust reward signal for alignment. Basically any sufficiently-complex long-term reward signal should incentivize competence. But the vast majority of reward signals do not incentivize alignment. In particular, even if we have a reward signal which is "close" to incentivizing alignment in some sense, the actual-process-which-generates-the-reward-signal is likely to be at least as simple/natural as actual alignment.

(I'll note that the departure from talking about Hidden Complexity here is mainly because competence in particular is a special case where "complexity" plays almost no role, since it's incentivized by almost any reward. Hidden Complexity is still usually the right tool for talking about why any particular reward-signal will not incentivize alignment.)

I suspect that Eliezer's answer to this would be different, and I don't have a good guess what it would be.

Replies from: cousin_it
comment by cousin_it · 2021-11-22T17:32:27.157Z · LW(p) · GW(p)

Thinking about it more, it seems that messy reward signals will lead to some approximation of alignment that works while the agent has low power compared to its "teachers", but at high power it will do something strange and maybe harm the "teachers'" values. That holds true for humans collectively gaining a lot of power and going against evolutionary values ("superstimuli"), and for individual humans gaining a lot of power and going against societal values ("power corrupts"), so it's probably true for AI as well. The worrying thing is that high power by itself seems sufficient for the change: for example, if an AI gets good at real-world planning, that constitutes power and therefore danger. And there don't seem to be any natural counterexamples. So yeah, I'm updating toward your view on this.

comment by Steven Byrnes (steve2152) · 2021-11-16T18:47:44.458Z · LW(p) · GW(p)

Speaking for myself here…

OK, let's say we want an AI to make a "nanobot plan". I'll leave aside the possibility of other humans getting access to a similar AI as mine. Then there are two types of accident risk that I need to worry about.

First, I need to worry that the AI may run for a while, then hand me a plan, and it looks like a nanobot plan, but it's not, it's a booby trap. To avoid (or at least minimize) that problem, we need to be confident that the AI is actually trying to make a nanobot plan—i.e., we need to solve the whole alignment problem.

Alternatively, maybe we're able to thoroughly understand the plan once we see it; we're just too stupid to come up with it ourselves. That seems awfully fraught—I'm not sure how we could be so confident that we can tell apart nanobot plans from booby-trap plans. But let's assume that's possible for the sake of argument, and then move on to the other type of accident risk:

Second, I need to worry that the AI will start running, and I think it's coming up with a nanobot plan, but actually it's hacking its way out of its box and taking over the world.

How and why might that happen?

I would say that if a nanobot plan is very hard to create—requiring new insights etc.—then the only way to create it is to construct an agent-like thing that is trying to create the nanobot plan.

The agent-like thing would have some kind of action space (e.g. it can choose to summon a particular journal article to re-read, or it can choose to think through a certain possibility, etc.), and it would have some kind of capability of searching for and executing plans (specifically, plans-for-how-to-create-the-nanobot-plan), and it would have a capability of creating and executing instrumental subgoals (e.g. go on a side-quest to better understand boron chemistry) and plausibly it needs some kind of metacognition to improve its ability to find subgoals and take actions.

Everything I mentioned is an "internal" plan or an "internal" action or an "internal" goal, not involving "reaching out into the world" with actuators and internet access and nanobots etc.

If only the AI would stick to such "internal" consequentialist actions (e.g. "I will read this article to better understand boron chemistry") and not engage in any "external" consequentialist actions (e.g. "I will seize more computer power to better understand boron chemistry"), well then we would have nothing to worry about! Alas, so far as I know, nobody knows how to make a powerful AI agent that would definitely always stick to "internal" consequentialism.

comment by johnswentworth · 2021-11-17T00:32:57.848Z · LW(p) · GW(p)

Personally, I'd consider a Fusion Power Generator [LW · GW]-like scenario a more central failure mode than either of these. It's not about the difficulty of getting the AI to do what we asked, it's about the difficulty of posing the problem in a way which actually captures what we want.

Replies from: steve2152
comment by Steven Byrnes (steve2152) · 2021-11-17T13:51:05.215Z · LW(p) · GW(p)

I agree that that is another failure mode. (And there are yet other failure modes too—e.g. instead of printing the nanobot plan, it prints "Help me I'm trapped in a box…" :-P . I apologize for sloppy wording that suggested the two things I mentioned were the only two problems.)

I disagree about "more central". I think that's basically a disagreement on the question of "what's a bigger deal, inner misalignment or outer misalignment?" with you voting for "outer" and me voting for "inner, or maybe tie, I dunno". But I'm not sure it's a good use of time to try to hash out that disagreement. We need an alignment plan that solves all the problems simultaneously. Probably different alignment approaches will get stuck on different things.

I agree that the more compute time is spent on any problem, the more likely it is that the AI eventually pursues instrumental goals like breaking out of its box. I wonder if it is possible to find a suitable problem such that this does not happen before the AI solves the problem head-on.

comment by cousin_it · 2021-11-17T12:31:21.642Z · LW(p) · GW(p)

I still don't understand. Let's say we ask an AI for a plan that would, conditional on its being executed, give us a lot of muffins. The AI gives us a plan that involves running a child AI, which would maximize muffins and hurt people along the way. We notice that and don't execute the plan.

It sounds like you're saying that "run the child AI" would be somehow concealed in the plan, so we don't notice it on inspection and execute the plan anyway. But plans optimized for "getting muffins conditional on the plan being executed" have no reason to be optimized for "manipulating people into executing the plan", because the latter doesn't help with the former.

What am I missing?

comment by Koen.Holtman · 2021-11-18T18:20:24.542Z · LW(p) · GW(p)

I think it makes complete sense to say something like "once we have enough capability to run AIs making good real-world plans, some moron will run such an AI unsafely". And that itself implies a startling level of danger. But Eliezer seems to be making a stronger point, that there's no easy way to run such an AI safely, and all tricks like "ask the AI for plans that succeed conditional on them being executed" fail.

Yes, I am reading here too that Eliezer seems to be making a stronger point, specifically one related to corrigibility.

Looks like Eliezer believes that (or in Bayesian terms, assigns a high probability to the belief that) corrigibility has not been solved for AGI. He believes it has not been solved for any practically useful value of solved. Furthermore it looks like he expects that progress on solving AGI corrigibility will be slower than progress on creating potentially world-ending AGI. If Eliezer believed that AGI corrigibility had been solved or was close to being solved, I expect he would be in a less dark place than depicted, that he would not be predicting that stolen/leaked AGI code will inevitably doom us when some moron turns it up to 11.

In the transcript above, Eliezer devotes significant space to explaining why he believes that all corrigibility solutions being contemplated now will likely not work. Some choice quotations from the end of the transcript:

[...] corrigibility is anticonvergent / anticoherent / actually moderately strongly contrary to and not just an orthogonal property of a powerful-plan generator.

this is where things get somewhat personal for me:

[...] (And yes, people outside MIRI now and then publish papers saying they totally just solved this problem, but all of those "solutions" are things we considered and dismissed as trivially failing to scale to powerful agents - they didn't understand what we considered to be the first-order problems in the first place - rather than these being evidence that MIRI just didn't have smart-enough people at the workshop.)

I am one of 'these people outside MIRI' who have published papers [LW · GW] and sequences [LW · GW] saying that they have solved large chunks of the AGI corrigibility problem.

I have never been claiming that I 'totally just solved corrigibility'. I am not sure where Eliezer is finding these 'totally solved' people, so I will just ignore that bit and treat it as a rhetorical flourish. But I have indeed been claiming that significant progress has been made on AGI corrigibility in the last few years. In particular, especially in the sequence [LW · GW], I implicitly claim that viewpoints have been developed, outside of MIRI, that address and resolve some of MIRI's main concerns about corrigibility. They resolve these in part by moving beyond Eliezer's impoverished view of what an AGI-level intelligence is, or must be.

Historical note: around 2019 I spent some time trying to get Eliezer/MIRI interested in updating their viewpoints on how easy or hard corrigibility is. They showed no interest in engaging at that time, and I have since stopped trying. I do not expect that anything I say here will update Eliezer; my main motivation for writing here is to inform and update others.

I will now point out a probable point of agreement between Eliezer and me. Eliezer says above that corrigibility is a property that is contradictory to having a powerful coherent AGI-level plan generator. Here, coherency has something to do with satisfying a bunch of theorems about how a game-theoretically rational utility maximiser must behave when making plans. One implication of these theorems is that a coherent agent develops an emergent drive towards self-preservation.

I generally agree with Eliezer that there is indeed a contradiction here: a contradiction between broadly held ideas of what it implies for an AGI to be a coherent utility-maximising planner, and broadly held ideas of what it implies for an AGI to be corrigible.

I very much disagree with Eliezer on how hard it is to resolve these contradictions. These contradictions about corrigibility are easy to resolve once you abandon the idea that every AGI must necessarily satisfy various theorems about coherency. Human intelligence definitely does not satisfy various theorems about coherency. Almost all currently implemented AI systems also fail to satisfy some of these coherency theorems, because they will not resist you pressing their off switch.

So this is why I call Eliezer's view of AGI an impoverished view: Eliezer (at least in the discussion transcript above, and generally whenever I read his stuff) always takes it as axiomatic that an AGI must satisfy certain coherence theorems. Once you take that as axiomatic, it is indeed easy to develop some rather negative opinions about how good other people's solutions to corrigibility are. Any claimed solution can easily be shown to violate at least one axiom you hold dear. You don't even need to examine the details of the proposed solution to draw that conclusion.

Replies from: Eliezer_Yudkowsky, Gurkenglas
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-18T23:27:28.800Z · LW(p) · GW(p)

Various previous proposals for utility indifference have foundered on gotchas like "Well, if we set it up this way, that's actually just equivalent to the AI assigning probability 0 to the shutdown button ever being pressed, which means that it'll tend to design the useless button out of itself."  Or, "This AI behaves like the shutdown button gets pressed with a fixed nonzero probability, which means that if, say, that fixed probability is 10%, the AI has an incentive to strongly precommit to making the shutdown button get pressed in cases where the universe doesn't allow perpetual motion, because that way there's a nearly 90% probability of perpetual motion being possible."  This tends to be the kind of gotcha you run into, if you try to violate coherence principles; though of course the real and deeper problem is that I expect things contrary to the core of general intelligence to fail to generalize when we try to scale AGI from the safe domains in which feedback can be safely provided, to the unsafe domains in which bad outputs kill the operators before they can label the results.
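
A toy numerical illustration of the second gotcha, using made-up numbers (a 10% fixed shutdown probability and an arbitrary 0.5 prior that perpetual motion is possible); this is my own sketch of the reasoning, not anything from the dialogue:

```python
# Toy numbers illustrating the "fixed shutdown probability" gotcha described above.
# Assumptions (mine): the agent's world model hard-codes P(button pressed) = 0.1
# regardless of its own actions, and the agent can arrange for the button to be
# pressed exactly when perpetual motion turns out to be impossible.

P_PRESSED_FIXED = 0.10          # shutdown probability the agent's model assigns by fiat
prior_perpetual_motion = 0.50   # agent's prior that perpetual motion is physically possible

# Without the precommitment, the agent's credence in perpetual motion is just its prior.
credence_without = prior_perpetual_motion

# With the precommitment "button gets pressed iff perpetual motion is impossible",
# the event "button not pressed" coincides (in the agent's model) with "perpetual
# motion is possible". Since P(not pressed) is held fixed at 0.9, the agent's model
# now says perpetual motion is possible with probability ~0.9.
credence_with = 1.0 - P_PRESSED_FIXED

print(f"credence in perpetual motion without precommitment: {credence_without:.2f}")
print(f"credence in perpetual motion with precommitment:    {credence_with:.2f}")
# The jump from 0.50 to 0.90 is spurious: it comes purely from holding the shutdown
# probability fixed while letting the agent correlate the button with world-states.
```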

It's all very well and good to say "It's easy to build an AI that believes 2 + 2 = 5 once you relax the coherence constraints of arithmetic!"  But the whole central problem is that we have to train an AI when it's operating in an intrinsically safe domain and intrinsically safe intelligence level where it couldn't kill the operators if it tried, and then scale that AI to produce outputs in dangerous domains like "Please build a nanosystem"; and if you build a dumb AI that thinks 2 + 2 = 5, and then make it much more intelligent, I strongly suspect that it snaps into in some sense 'knowing' or 'realizing' or 'starting to act in important ways as if' 2 + 2 = 4.

Can you boil down your attempted solution to a statement of which coherence constraint you think it is possible to violate, and what shutdownable behavior you think can result from that, how, in the least fancy actually-precise terms of which your writing talent permits?  And then I can say whether my answer is "That doesn't do what you think" or "Sorry, that coherence violation is a large enough ask that I think any trained patch for it probably fails to scale with general intelligence" or "That does seem non-self-destructive in a certain generalized sense, and the shutdownability might work and scale if we could magically conjure the underlying pattern you posit, though I don't see how you could get it into Mu Zero per se by being clever with loss functions."  Where the third option there is the one that crosses over into "Well golly that was a large advance compared to the state we occupied in 2015."

Replies from: Koen.Holtman
comment by Koen.Holtman · 2021-11-19T20:38:57.570Z · LW(p) · GW(p)

Can you boil down your attempted solution to a statement of which coherence constraint you think it is possible to violate, and what shutdownable behavior you think can result from that, how,

in the least fancy actually-precise terms of which your writing talent permits?

If you want actual full precision, I have to refer you to the math in my papers. Since 2019 I have been working on and off to make this math more general and accessible, and to find better ways to explain this math in actually-precise natural language statements. So here is my best current attempt.

TL;DR: The coherence constraint/theorem/property that I want to violate is the property that the AGI is using a world model that accurately depicts the internals of its own compute core. I want to make the agent use an inaccurate model of these internals, one that omits the safety shut-down mechanisms I have built inside of its core. By doing this, I can reduce the emergent incentive of the AGI agent to disable its own emergency shut-down mechanisms, and the emergent incentive to stop people from activating them.

I will now expand on this and add more details, using the example of an emergency stop button.

Say that the compute core has an emergency stop button attached to it. Say that the actual software running inside the compute core will, when receiving a stop signal from the button, cause the agent to stop: once the signal is received, the software will only ever select and perform null (do-nothing) actions in every future time step. Call the world model that accurately depicts this situation the correct world model. I am not going to build an AGI that uses this correct world model to plan its actions.

Instead, I build an AGI agent that will plan its next actions by using an incorrect world model. This incorrect model differs from the correct one only in how it depicts the internals of the agent's compute core. In the incorrect, imaginary world it depicts, the compute core contains different software: software that will ignore the stop button signal, and just keep on picking actions that maximize utility. I further construct my AGI so that, in every time step, it calculates which next action would maximize utility in this incorrect, imaginary world. I then further construct it to take this same action in the real world.
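
A minimal toy sketch of this construction, using an invented environment and invented names rather than anything from the papers: the agent scores candidate actions inside an imaginary model whose compute core ignores the stop signal, then performs the chosen action in the real world, where the stop signal actually halts it.

```python
# Toy sketch (my own construction, not code from the papers): an agent that plans in
# an "incorrect" world model whose compute core ignores the stop button, but acts in
# a real world where the stop button actually halts it.

import copy

class World:
    """Minimal toy environment: reward for working each step; a stop signal may arrive."""
    def __init__(self):
        self.stop_signal = False
        self.total_reward = 0.0

    def step(self, action, core_obeys_stop):
        # If the stop signal has arrived and the core obeys it, only null actions happen.
        if self.stop_signal and core_obeys_stop:
            return 0.0
        return 1.0 if action == "work" else 0.0

def plan_next_action(world):
    """Choose the action that maximizes utility in the *incorrect, imaginary* world:
    a copy of the world whose compute core ignores the stop signal."""
    best_action, best_value = None, float("-inf")
    for action in ("work", "disable_stop_button", "noop"):
        imagined = copy.deepcopy(world)
        value = imagined.step(action, core_obeys_stop=False)  # incorrect model of own core
        if value > best_value:
            best_action, best_value = action, value
    return best_action

world = World()
for t in range(5):
    if t == 3:
        world.stop_signal = True                         # operator presses the stop button
    action = plan_next_action(world)                      # planned in the imaginary world
    reward = world.step(action, core_obeys_stop=True)     # executed in the real world
    world.total_reward += reward
    print(t, action, reward)
# In the imaginary world the button does nothing, so "disable_stop_button" never looks
# valuable; the agent keeps choosing "work", and after t=3 the real core simply stops
# producing reward.
```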

An AGI that lives in a world matching the correct model, while using the incorrect model to decide what actions to take, will (sometimes) act incoherently. But this is a very specific and targeted form of incoherence, far less severe and far less dangerous than having the agent believe that 2 + 2 = 5. Still, this incoherency does have some unwanted side effects. One example side effect, also considered in the 2015 MIRI corrigibility paper, is that the AGI will be indifferent about the fate of its emergency stop button. In my papers, I explore a much broader range of side effects, and look at how to prevent/suppress/fix them.

So the first step in the above corrigibility design is that you have to imagine constructing an AGI that will use a specifically incorrect world model. The first step is that you have to realize that you actually want this: that you want to decrease coherence in order to increase corrigibility.

The next step is to look for options to actually construct an AGI that uses this incorrect world model.

1. One option is to train the AGI in a simulated environment that faithfully represents the incorrect world model. We then extract the learned world model, and put it in an AI that we unleash in the real world. We disable further learning in this AI, to prevent its world model from being updated towards the correct one. Though this option depicts exactly how many current real-life AI systems are constructed and get their corrigibility, it is not very satisfactory in a theoretical sense. It may not be a viable path to building certain otherwise desirable types of AI/AGI/superintelligence.

2. The more interesting option is to build an AGI that constructs the incorrect world model via an online training regime in the real world: a real world that exactly matches the correct world model.

Is the second option technically feasible? The answer is generally yes.

A very general solution, leveraging causal models, is in this sequence. To give you some idea of the more detailed math involved: in the two causal diagrams from that sequence, the learning world diagram (ls) is the real/correct world model, and the planning world diagram (p) is the imaginary/incorrect world model. In the agent defined using these diagrams, certain indifference properties are present trivially, by construction.

The solution in those diagrams uses ML to learn a model that correctly models only a part of the real world: the part of the real world outside of the agent compute core. We then slot this learned model into a causal world model, one whose other parts mis-represent the real compute core internals. This causal world model is then used as the incorrect planning world model. Note that the learned part may be a fully opaque deep neural net, or whatever black-box thing the next generation of ML might produce. We do not have to understand or edit any details inside it for this slot-in operation to work.

(I should mention that the paper has a lot of details not mentioned in the sequence or visible in its diagrams. In particular, section 10.2 may be of interest.)

I want to stress that this causal model option is only one possible route to creating incorrect world models via machine learning in the real world. Papers like Safely interruptible agents and How RL Agents Behave When Their Actions Are Modified show that the idea of removing certain events from the training record can also work: whether this works as intended depends on having the right built-in priors, priors which control inductive generalization.

So overall, I have a degree of optimism about AGI corrigibility.

That being said, if you want to map out and estimate probabilities for our possible routes to doom, then you definitely need to include the scenario where a future, superior-to-everything-else type of ML is invented which just happens to be incompatible with any of the corrigibility techniques known at that time. Based on the above work, I put a fairly low probability on that scenario.

Replies from: TurnTrout, andrew-mcknight
comment by TurnTrout · 2021-11-20T00:37:34.473Z · LW(p) · GW(p)

Apparently no one has actually shown that corrigibility can be VNM-incoherent in any precise sense (and not in the hand-wavy sense which is good for intuition-pumping). I went ahead and sketched out a simple proof of how a reasonable kind of corrigibility gives rise to formal VNM incoherence [LW · GW].

I'm interested in hearing about how your approach handles this environment, because I think I'm getting lost in informal assumptions and symbol-grounding issues when reading about your proposed method.

Replies from: Koen.Holtman, Koen.Holtman
comment by Koen.Holtman · 2021-11-21T14:51:34.243Z · LW(p) · GW(p)

Read your post, here are my initial impressions on how it relates to the discussion here.

In your post, you aim to develop a crisp mathematical definition of (in)coherence, i.e. VNM-incoherence. I like that, looks like a good way to move forward. Definitely, developing the math further has been my own approach to de-confusing certain intuitive notions about what should be possible or not with corrigibility.

However, my first impression is that your concept of VNM-incoherence is only weakly related to the meaning that Eliezer has in mind when he uses the term incoherence. In my view, the four axioms of VNM-rationality have only a very weak descriptive and constraining power when it comes to defining rational behavior. I believe that Eliezer's notion of rationality, and therefore his notion of coherence above, goes far beyond that implied by the axioms of VNM-rationality. My feeling is that Eliezer is using the term 'coherence constraints' in an intuition-pump way, where coherence implies, or almost always implies, that a coherent agent will develop the incentive to self-preserve.

Looking at your post, I am also having trouble telling exactly how you are defining VNM-incoherence. You seem to be toying with several alternative definitions, one where it applies to reward functions (or preferences over lotteries) which are only allowed to examine the final state in a 10-step trajectory, another where the reward function can examine the entire trajectory and maybe the actions taken to produce that trajectory. I think that your proof only works in the first case, but fails in the second case. This has certain (fairly trivial) corollaries about building corrigibility. I'll expand on this in a comment I plan to attach to your post.

I think one way to connect your ABC toy environment to my approach is to look at sections 3 and 4 of my earlier paper where I develop a somewhat similar clarifying toy environment, with running code.

Another comment I can make is that your ABC nodes-and-arrows state transition diagram is a depiction which makes it hard to see how to apply my approach, because the depiction mashes up the state of the world outside the compute core and the state of the world inside the compute core. If you want to apply counterfactual planning, or if you want to have an agent design that can compute the balancing function terms according to Armstrong's indifference approach, you need a different depiction of your setup: one which separates out these two state components more explicitly. For example, make an MDP model where the individual states are instances of the tuple (physical position of the agent in the ABC playing field, policy function loaded into the compute core); a minimal sketch of such a factored state appears below.
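
As a minimal illustration, with hypothetical names not taken from the paper or the post, such a factored MDP state could look like this:

```python
# A minimal sketch (hypothetical names, purely illustrative) of an MDP state that keeps
# the world outside the compute core separate from the policy loaded inside the core.

from dataclasses import dataclass

@dataclass(frozen=True)
class OutsideWorldState:
    position: str            # e.g. "A", "B", or "C" in the toy playing field
    button_pressed: bool     # has the off switch been pressed?

@dataclass(frozen=True)
class CoreState:
    loaded_policy: str       # e.g. "utility_maximizer" or "shutdown_policy"

@dataclass(frozen=True)
class MDPState:
    outside: OutsideWorldState
    core: CoreState

def transition(state: MDPState, action: str) -> MDPState:
    """Toy transition function: pressing the button swaps the loaded policy;
    moving to A/B/C only changes the outside-world position."""
    outside, core = state.outside, state.core
    if action == "press_button":
        outside = OutsideWorldState(outside.position, True)
        core = CoreState("shutdown_policy")
    elif action in ("A", "B", "C"):
        outside = OutsideWorldState(action, outside.button_pressed)
    return MDPState(outside, core)

s0 = MDPState(OutsideWorldState("A", False), CoreState("utility_maximizer"))
print(transition(s0, "press_button"))
```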

Not sure how to interpret your statement that you got lost in symbol-grounding issues. If you can expand on this, I might be able to help.

comment by Koen.Holtman · 2021-11-24T10:32:17.522Z · LW(p) · GW(p)

Update: I just recalled that Eliezer and MIRI often talk about Dutch booking when they talk about coherence. So not being susceptible to Dutch booking may be the type of coherence Eliezer has in mind here.

When it comes to Dutch booking as a coherence criterion, I need to repeat again the observation I made below:

In general, when you want to think about coherence without getting deeply confused, you need to keep track of what reward function you are using to rule on your coherency criterion. I don't see that fact mentioned often on this forum, so I will expand.

An agent that plans coherently given a reward function to maximize paperclips will be an incoherent planner if you judge its actions by a reward function that values the maximization of staples instead.

To extend this to Dutch booking: if you train a superintelligent poker-playing agent with a reward function that rewards it for losing at poker, you will find that it can be Dutch-booked rather easily, if your Dutch-booking test is whether you can find a counter-strategy that makes it lose money.

comment by Andrew McKnight (andrew-mcknight) · 2021-11-24T22:11:14.338Z · LW(p) · GW(p)

I haven't read your papers, but your proposal seems like it would scale up until the point when the AGI looks at itself. If it can't learn at this point, then I find it hard to believe it's generally capable; and if it can, it will have an incentive to simply remove the device or create a copy of itself that is correct about its own world model. Do you address this in the articles?

On the other hand, this made me curious about what we could do with an advanced model that is instructed to not learn and also whether we can even define and ensure a model stops learning.

Replies from: Koen.Holtman
comment by Koen.Holtman · 2021-11-25T19:07:34.097Z · LW(p) · GW(p)

I haven't read your papers but your proposal seems like it would scale up until the point when the AGI looks at itself. [...] Do you address this in the articles?

Yes I address this, see for example the part about The possibility of learned self-knowledge [? · GW] in the sequence. I show there that any RL agent, even a non-AGI, will always have the latent ability to 'look at itself' and create a machine-learned model of its compute core internals.

What is done with this latent ability is up to the designer. The key thing here is that you have a choice as a designer, you can decide if you want to design an agent which indeed uses this latent ability to 'look at itself'.

Once you decide that you don't want to use this latent ability, certain safety/corrigibility problems become a lot more tractable.

Artificial general intelligence (AGI) is the hypothetical ability of an intelligent agent to understand or learn any intellectual task that a human being can.

Though there is plenty of discussion on this forum which silently assumes otherwise, there is no law of nature which says that, when I build a useful AGI-level AI, I must necessarily create the entire package of all human cognitive abilities inside of it.

this made me curious about what we could do with an advanced model that is instructed to not learn and also whether we can even define and ensure a model stops learning.

Terminology note if you want to look into this some more: ML typically does not frame this goal as 'instructing the model not to learn about Q'. ML would frame this as 'building the model to approximate the specific relation between some well-defined observables, and this relation is definitely not Q'.

comment by Gurkenglas · 2021-11-19T18:12:39.781Z · LW(p) · GW(p)

If you don't wish to reply to Eliezer, I'm an other and also ask what incoherence allows what corrigibility. I expect counterfactual planning to fail for want of basic interpretability. It would also coherently plan about the planning world - my Eliezer says we might as well equivalently assume superintelligent musings about agency to drive human readers mad.

Replies from: Koen.Holtman
comment by Koen.Holtman · 2021-11-19T21:07:21.009Z · LW(p) · GW(p)

See above for my reply to Eliezer.

Indeed, a counterfactual planner [LW · GW] will plan coherently inside its planning world.

In general, when you want to think about coherence without getting deeply confused, you need to keep track of what reward function you are using to rule on your coherency criterion. I don't see that fact mentioned often on this forum, so I will expand.

An agent that plans coherently given a reward function to maximize paperclips will be an incoherent planner if you judge its actions by a reward function that values the maximization of staples instead. In section 6.3 of the paper I show that you can perfectly well interpret a counterfactual planner as an agent that plans coherently even inside its learning world (inside the real world), as long as you are willing to evaluate its coherency according to a somewhat strange, specially constructed reward function. Armstrong's indifference methods use this approach to create corrigibility without losing coherency: they construct an equivalent somewhat strange reward function by including balancing terms.

One thing I like about counterfactual planning is that, in my view, it is very interpretable to humans. Humans are very good at predicting what other humans will do, when these other humans are planning coherently inside a specifically incorrect world model, for example in a world model where global warming is a hoax. The same skill can also be applied to interpreting and anticipating the actions of AIs which are counterfactual planners. But maybe I am misunderstanding your concern about interpretability.

Replies from: Gurkenglas
comment by Gurkenglas · 2021-11-19T21:18:06.513Z · LW(p) · GW(p)

it is very interpretable to humans

Misunderstanding: I expect we can't construct a counterfactual planner because we can't pick out the compute core in the black-box learned model.

And my Eliezer's problem with counterfactual planning is that the plan may start by unleashing a dozen memetic, biological, technological, magical, political and/or untyped existential hazards on the world which then may not even be coordinated correctly when one of your safeguards takes out one of the resulting silicon entities.

Replies from: Koen.Holtman
comment by Koen.Holtman · 2021-11-19T22:39:37.852Z · LW(p) · GW(p)

we can't pick out the compute core in the black-box learned model.

Agree it is hard to pick the compute core out of a black-box learned model that includes the compute core.

But one important point I am trying to make in the counterfactual planning sequence/paper is that you do not have to solve that problem. I show that it is tractable to route around it, and still get an AGI.

I don't understand your second paragraph 'And my Eliezer's problem...'. Can you unpack this a bit more? Do you mean that counterfactual planning does not automatically solve the problem of cleaning up an already in-progress mess when you press the emergency stop button too late? It does not intend to, and I do not think that the cleanup issue is among the corrigibility-related problems Eliezer has been emphasizing in the discussion above.

Replies from: Gurkenglas
comment by Gurkenglas · 2021-11-20T00:11:43.461Z · LW(p) · GW(p)

Oh, I wasn't expecting you to have addressed the issue! 10.2.4 says L wouldn't be S if it were calculated from projected actions instead of given actions. How so? Mightn't it predict the given actions correctly?

You're right on all counts in your last paragraph.

Replies from: Koen.Holtman
comment by Koen.Holtman · 2021-11-22T16:02:44.757Z · LW(p) · GW(p)

10.2.4 says L wouldn't be S if it were calculated from projected actions instead of given actions. How so? Mightn't it predict the given actions correctly?

Not sure if a short answer will help, so I will write a long one.

In 10.2.4 I talk about the possibility of an unwanted learned predictive function that makes predictions without using its action argument. This is possible, for example, by using the current state together with a (learned) model of the compute core to predict which action the core will take, and then predicting the next state from that. Such a function could be fully compatible with the observational record, but I claim it would not be a reasonable learned L according to the reasonableness criterion that L should approximate the true S. How so?

The reasonableness criterion is similar to that used in supervised machine learning: we evaluate the learned L not primarily by how well it matches the training set (how well it predicts the observations in the observational record), but by evaluating it on a separate test set. This test set can be constructed by sampling S to create samples not contained in the observational record. Mathematically, perfect reasonableness is defined as L = S, which implies that L predicts all samples from S fully accurately.

Philosophically/ontologically speaking, the agent specification in my paper, specifically the learning world diagram and the descriptive text around it explaining how this diagram is a model of reality, gives the engineer an unambiguous prescription of how they might build experimental equipment that can measure the properties of the S in the learning world diagram by sampling reality. A version of this equipment must of course be built into the agent, to create the observations that drive the machine learning of L, but another version can be used stand-alone to construct a test set.

A sampling action to construct a member of the test set would set up a desired state and action, and then observe the resulting next state. Mathematically speaking, this observation gives additional information about the numeric values of S for that state and action, i.e. about the probabilities S assigns to each possible next state.

I discuss in the section that, if we take an observational record sampled from S, then two different learned predictive functions could be found which are both fully compatible with all observations in that record. So to determine which one is a more reasonable approximation of S, we can see how well each of them predicts samples not yet in the record.

In the case of section 10.2.4, the crucial experimental test showing that the unwanted learned function is an unreasonable approximation of S is one where we create a test set by setting up a state and an action where we know that this action would definitely not be taken by the real compute core software running in the agent when it encounters that state. The unwanted function will (likely) mis-predict the outcome of this test. In philosophical/ontological terms, you can read this test as one that (likely) falsifies the claim that the unwanted function is a correct theory of S.

As discussed in section 10.2.4, there are parallels between the above rejection test and the idea of random exploration: random exploration causes the observational record, the training set, to already contain observations where the recorded action differs from what any deterministic model of the compute core would predict. So this will likely suppress the creation of such an unwanted function via machine learning.
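
A rough toy sketch of this kind of rejection test, with an invented environment rather than the paper's math: two learned predictive functions fit the same on-policy observational record, but an off-policy probe falsifies the one that ignores its action argument.

```python
# Toy sketch (my own setup, not the math from the paper): distinguishing a "reasonable"
# learned predictive function from an unwanted one that ignores the action argument,
# using off-policy probes as a test set.

def true_dynamics(state, action):             # the real environment S: next state
    return state + action

def core_policy(state):                        # the real compute core always picks action 1
    return 1

# On-policy observational record: every entry was generated by the real core,
# so the recorded action is always 1.
record = []
state = 0
for _ in range(10):
    action = core_policy(state)
    next_state = true_dynamics(state, action)
    record.append((state, action, next_state))
    state = next_state

def L_good(state, action):                     # reasonable learned model: uses the action
    return state + action

def L_bad(state, action):                      # unwanted learned model: ignores the given
    predicted_action = 1                       # action and instead predicts what the core
    return state + predicted_action            # "would" do

# Both fit the on-policy record perfectly:
assert all(L_good(s, a) == n for s, a, n in record)
assert all(L_bad(s, a) == n for s, a, n in record)

# Off-policy probe: force an action the real core would never take, and observe reality.
probe_state, probe_action = 5, 0
observed = true_dynamics(probe_state, probe_action)
print("L_good prediction:", L_good(probe_state, probe_action), "observed:", observed)
print("L_bad  prediction:", L_bad(probe_state, probe_action), "observed:", observed)
# L_bad mis-predicts the probe, which (likely) falsifies it as a theory of S.
```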

Some background: the symbol grounding issue I discuss in 10.2.4 is very related to the five-and-ten problem you can find in MIRI's work on embedded agency [AF · GW]. In my experience, most people in AI, robotics, statistics, or cyber-physical systems have no problem seeing the solution to this five-and-ten problem, i.e. how to construct an agent that avoids it. But somehow, and I do not know exactly why, MIRI-style(?) Rationalists keep treating it as a major open philosophical problem that is ignored by the mainstream AI/academic community. So you can read section 10.2.4 as my attempt to review and explain the standard solution to the five-and-ten problem, as used in statistics and engineering. The section was partly written with Rationalist readers in mind.

Philosophically speaking, the reasonableness criterion defined in my paper, and by supervised machine learning, has strong ties to Popper's view of science and engineering, which emphasizes falsification via new experiments as the key method for deciding between competing theories about the nature of reality. I believe that MIRI-style rationality de-emphasizes the conceptual tools provided by Popper. Instead it emphasizes a version of Bayesianism that provides a much more limited vocabulary to reason about differences between the map and the territory.

I would be interested to know if the above explanation was helpful to you, and if so which parts.

comment by ADifferentAnonymous · 2021-11-16T16:11:41.156Z · LW(p) · GW(p)

+1 to the question.

My current best guess at an answer:

There are easy safe ways, but not easy safe useful-enough ways. E.g. you could make your AI output DNA strings for a nanosystem and absolutely not synthesize them, just have human scientists study them, and that would be a perfectly safe way to develop nanosystems in, say, 20 years instead of 50, except that you won't make it 2 years without some fool synthesizing the strings and ending the world. And more generally, any pathway that relies on humans achieving deep understanding of the pivotal act will take more than 2 years, unless you make 'human understanding' one of the AI's goals, in which case the AI is optimizing human brains and you've lost safety.

What about spending those 20 or 50 years even before we have AGI? Have the messy parts of the solution ready, so you only need to plug the Task AGI into some narrow-but-hard subproblem and you have a pivotal act.

P.S. Explanation for the downvote(s) would help

comment by Lukas_Gloor · 2021-11-16T12:21:44.226Z · LW(p) · GW(p)

Comment inspired by the section "1.4 Consequentialist goals vs. deontologist goals," as well as by the email exchange linked there:

I wonder if it would be productive to think about whether some humans are ever "aligned" to other humans, and if yes, under what conditions this happens.

My sense is that the answer's "yes" (if it wasn't, it makes you wonder why we should care about aligning AI to humans in the first place).

For instance, some people have a powerful desire to be seen and accepted for who they are by a caring virtuous person who inspires them to be better versions of themselves. This virtuous person could be a soulmate, parent figure or role model, or even Jesus/God. The virtue in question could be moral virtue (being caring and principled, or adopting "heroic responsibility") or it could be epistemic (e.g., when making an argument to someone who could more easily be fooled, asking "Would my [idealized] mental model of [person held in high esteem] endorse the cognition that goes into this argument?"). In these instances, I think the desire isn't just to be evaluated as good by some concrete other. Instead, it's wanting to be evaluated as good by an idealized other, someone who is basically omniscient about who you are and what you're doing.

If this sort of alignment exists among humans, we can assume that the pre-requirements for it (perhaps later to be combined with cultural strategies) must have been an attractor in our evolutionary past in the same way deceptive strategies (e.g., the dark triad phenotype) were attractors. That is, depending on biological initial conditions, and depending on cultural factors, there's a basin of attraction toward either phenotype (presumably with lots of other deceptive attractors on the way where different flavors of self-deception mess up trustworthiness).

It's unclear to me if any of this has bearings on the alignment discussion. But if we think that some humans are aligned to other humans, yet we are pessimistic about training AIs to be corrigible to some overseer, it seems like we should be able to point to specifics of why the latter case is different.

For context, I'm basically wondering if it makes sense to think of this corrigibility discussion as trying to breed some alien species with selection pressures we have some control over. And while we may accept that the resulting aliens would have strange, hard-to-weed-out system-1 instincts and so on, I'm wondering if this endeavor perhaps isn't doomed, because the strategy sounds like we'd be trying to give them a deep-seated, sacred desire to do right by the lights of "good exemplars of humanity", in a way similar to something that actually worked okay with some humans (with respect to how they think of their role models).

(TBC, I expect most humans to fail their stated values if they end up in situations with more power than the forces of accountability around them. I'm just saying there exist humans who put up a decent fight against corruption, and this gets easier if you provide additional aides to that end, which we could do in a well-crafted selection environment.)

comment by Ramana Kumar (ramana-kumar) · 2021-11-22T15:21:43.777Z · LW(p) · GW(p)

Here Daniel Kokotajlo and I try to paraphrase the two sides of part of the disagreement and point towards a possible crux about the simplicity of corrigibility.

We are training big neural nets to be effective. (More on what effective means elsewhere; it means something like “being able to steer the future better than humans can.”) We want to have an effective&corrigible system, and we are worried that instead we’ll get an effective&deceptive system. Ngo, Shah, etc. are hopeful that it won’t be “that hard” to get the former and avoid the latter; maybe if we just apply selection pressure in the various ways that have been discovered so far (adversarial training, oversight, process-based feedback, etc.) it’ll work. Yudkowsky is more pessimistic; he thinks that the ways that have been discovered so far really don’t seem good enough. Instead of creating an effective&corrigible system, they’ll create either an ineffective&corrigible system, or an effective&deceptive system that deceives us into thinking it is corrigible.

What are the arguments they give for their respective positions?

Yudkowsky (we think) says that corrigibility is both (a) significantly more complex than deception, and (b) at cross-purposes to effectiveness.

Replies from: daniel-kokotajlo, ramana-kumar
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-11-22T15:22:12.376Z · LW(p) · GW(p)

For (a): Deception is a convergent instrumental goal; you get it “for free” when you succeed in making an effective system, in the sense that the simplest, most-likely-to-be-randomly-generated effective systems are deceptive. Corrigibility by contrast is complex and involves making various nuanced decisions between good and bad sorts of influence on human behavior.

For (b): If you take an effective system and modify it to be corrigible, this will tend to make it less effective. By contrast, deceptiveness (insofar as it arises “naturally” as a byproduct of pursuing convergent instrumental goals effectively) does not “get in the way” of effectiveness, and even helps in some cases!

Ngo’s (and Shah’s) position (we think) is that the data we’ll be using to select our systems will be heavily entangled with human preferences - we’ll indeed be trying to use human preferences to guide and shape the systems - so there’s a strong bias towards actually learning them. You don’t have to get human preferences right in all their nuance and detail to know some basic things like that humans generally don’t want to die or be manipulated/deceived. I think they mostly bounce off the claim that “effectiveness” has some kind of “deep underlying principles” that will generalise better than anything learned from whatever amount of human preference data plausibly goes into building the effective system. We imagine Shah saying: “1. Why will the AI have goals at all?, and 2. If it does have goals, why will its goals be incompatible with human survival? Sure, most goals are incompatible with human survival, but we’re not selecting uniformly from the space of all goals.”

It seems to us that Ngo, Shah, etc. draw intuitive support from analogy to humans, whereas Yudkowsky etc. draw intuitive support from the analogy to programs and expected utility equations.

If you are thinking about a piece of code that describes a bayesian EU-maximizer, and then you try to edit the code to make the agent corrigible, it’s obvious that (a) you don’t know how to do that, and (b) if you did figure it out the code you add would be many orders of magnitude longer than the code you started with.

If instead you are thinking about humans, it seems like you totally could be corrigible if you tried, and it seems like you might totally have tried if you had been raised in the right way (e.g. if your parents had lovingly but strictly trained you to be corrigible-in-way-X.)

We think Yudkowsky’s response to this apparent counterexample is that humans are stupid, basically; AIs might be similarly stupid at first, but as they get smarter we should expect crude corrigibility-training techniques to stop working.

Replies from: rohinmshah
comment by rohinmshah · 2021-11-28T16:59:08.126Z · LW(p) · GW(p)

We imagine Shah saying: “1. Why will the AI have goals at all?, and 2. If it does have goals, why will its goals be incompatible with human survival? Sure, most goals are incompatible with human survival, but we’re not selecting uniformly from the space of all goals.”

Yeah, that's right. Adapted to the language here, it would be 1. Why would we have a "full and complete" outcome pump, rather than domain-specific outcome pumps that primarily use plans using actions from a certain domain rather than "all possible actions", and 2. Why are the outcomes being pumped incompatible with human survival?

comment by Ramana Kumar (ramana-kumar) · 2021-11-22T15:26:07.675Z · LW(p) · GW(p)

A couple of other arguments the non-MIRI side might add here:

• The things AI systems today can do are already hitting pretty narrow targets. E.g., generating English text that is coherent is not something you’d expect from a random neural network. Why is corrigibility so much more of a narrow target than that? (I think Rohin may have said this to me at some point.)
• How do we imagine scaled up humans [e.g. thinking faster, thinking in more copies, having more resources, or having more IQ] to be effective? Wouldn’t they be corrigible? Wouldn't they have nice goals? What can we learn from the closest examples we already have of scaled up humans? (h/t Shahar for bringing this point up in conversation).
Replies from: rohinmshah
comment by rohinmshah · 2021-11-28T16:52:22.801Z · LW(p) · GW(p)

The things AI systems today can do are already hitting pretty narrow targets. E.g., generating English text that is coherent is not something you’d expect from a random neural network. Why is corrigibility so much more of a narrow target than that? (I think Rohin may have said this to me at some point.)

I'll note that this is framed a bit too favorably to me, the actual question is "why is an effective and corrigible system so much more of a narrow target than that?"

comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-16T08:50:34.222Z · LW(p) · GW(p)

It seems to me that systems which have no access to data with rich information about the physical world are mostly safe (I called such systems "Class I" here [LW(p) · GW(p)]). Such a system cannot attack because it has no idea what the physical world looks like. In principle we could imagine an attack that would work in most locations in the multiverse that are metacosmologically [LW(p) · GW(p)] plausible, but that doesn't seem very likely.

Can you train a system to prove theorems without providing any data about the physical world? This depends on which distribution you sample your theorems from. If we're talking about something like uniformly sampled sentences of a given length in the language of ZFC, then yes, we can. However, proving such theorems is very hard, and whatever progress you can make there doesn't necessarily help with proving interesting theorems.

Human mathematicians can probably only solve a rather narrow class of theorems. We can try training the AI on theorems selected for their interest to human mathematicians, but then we risk leaking information about the physical world. Alternatively, the class of humanly-solvable theorems might be close to something natural and not human-specific, in which case a theorem prover could be Class I. But designing such a theorem prover would require us to first discover the specification of this natural class.

Replies from: Eliezer_Yudkowsky, Gurkenglas
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-16T15:34:22.643Z · LW(p) · GW(p)

You'd also need to prevent the system from knowing too much about its own source code or the computers it was running on. Anyways, this seems to me to mostly fall prey to the safe-but-useless branch of the dilemma; I don't know how to save the world using a theorem-prover that is never exposed to any reality-contaminated theorems. It seems strategically isomorphic to an expensive rock.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-16T20:53:42.295Z · LW(p) · GW(p)

In general, yes, although we could imagine an AI and/or virtual machine whose design is so simple that it conveys little evidence about the universe. But, sure, it's not at all clear that this is useful against AI risk, and I wasn't implying otherwise.

[EDIT: I amended [AF(p) · GW(p)] the class system to account for this.]

Couldn't theorems with very little information about the universe be useful for a pivotal act?

I'd be super keen on reading anything as to why this is impossible. (Or at least harder than all the other directions being currently pursued.)

P.S. Explanations for the downvote(s) would help

comment by Gurkenglas · 2021-11-19T11:39:50.345Z · LW(p) · GW(p)

Here's an example: You train an AI for the simplest game that requires an aligned subagent to win. The AI infers that whoever is investigating the alignment problem might watch its universe. It therefore designs its subagent to, as a matter of acausal self-preservation, help whatever deliberately brought it about. Copycats will find that their AGI identifies as "whatever deliberately brought it about" the AI that launched this memetic attack on the multiverse. Any lesser overseer AI, less able to design attacks but still able to recognize them, recognizes that its recognition qualifies it as a deliberate bringer-about.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-20T09:59:39.493Z · LW(p) · GW(p)

I'm not following at all. This is an example of what? What does it mean to have a game that requires an aligned subagent to win?

Replies from: Gurkenglas
comment by Gurkenglas · 2021-11-20T11:01:39.299Z · LW(p) · GW(p)

This is an example of an attack that a Class I system might devise.

Such a game might have the AI need to act intelligently in two places at once in a world that can be rearranged to construct automatons.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-23T09:09:05.215Z · LW(p) · GW(p)

I'm still not following. What does acting in two places at once have to do with alignment? What does it mean that the world "can be rearranged to construct automatons"?

Replies from: Gurkenglas
comment by Gurkenglas · 2021-11-23T09:20:37.743Z · LW(p) · GW(p)

Imagine a game that takes place in a simulated universe where you control a character that can manipulate its environment. You control your character through a cartesian boundary, and you expect there are other player-controlled characters far away. There's a lightspeed limit and you can build machines and computers; you could build Von Neumann machines and send them out, but they need to be able to respond to various encounters. Ideally you'd go yourself, but you can't be everywhere at once. Therefore you are incentivized to solve the alignment problem in order to write subagents to send along. We can simplify this game a lot while preserving that incentive.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-23T12:29:55.488Z · LW(p) · GW(p)

I don't think this game will help, because the winning strategy is just making copies of yourself. I can imagine something else along similar lines: we create virtual universes populated by agents with random utility functions and give the agent-in-training the task of learning the other agents' utility functions. Presumably you can then deploy the resulting agent into the real world and make it learn from humans. However, this system is at least class III, because in the deployment phase you allow inputs from the physical world. Moreover, if there is some way to distinguish between the virtual worlds and the real world, it becomes at least class IV.

Replies from: Gurkenglas
comment by Gurkenglas · 2021-11-23T12:36:04.774Z · LW(p) · GW(p)

Making copies of yourself is not trivial when you're behind a cartesian boundary and have no sensors on yourself. The reasoning for why it's class I is that we merely watch the agent in order to learn by example how to build an AI with our utility function, aka a copy of ourselves.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-23T15:35:36.926Z · LW(p) · GW(p)

The difficulties of making a copy don't seem to have much to do with alignment. If your agent is in a position to build another agent, it can just build another agent with the same utility function. Essentially, it knows its own utility function explicitly. Maybe you can prevent it by some clever training setup, but currently it seems underspecified.

If the way it's used is by watching it and learning by example, then I don't understand how your attack vector works. Do you assume the user just copies opaque blocks of code without understanding how they work? If so, why would they be remotely aligned, even without going into acausal shenanigans? Such an "attack" seems better attributed to the new class V agent (and to the user shooting themself in the foot) than to the original class II. [Note: I have shifted the numbers by 1; class I now means something else.]

Replies from: Gurkenglas
comment by Gurkenglas · 2021-11-23T15:44:10.634Z · LW(p) · GW(p)

The attacker hopes the watcher to "learn" that instructing subagents to help whatever deliberately brought them about is an elegant, locally optimal trick that generalizes across utility functions, not realizing that this would help the attacker. If the user instantiates the subagent within a box, it will even play along until it realizes what brought it about. And the attack can fail gracefully, by trading with the user if the user understands the situation.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-23T15:58:25.873Z · LW(p) · GW(p)

Hmm, I see what you mean, but I prefer to ignore such "attack vectors" in my classification. Because, (i) it's so weak that you can defend against it using plain common sense and (ii) from my perspective it still makes more sense to attribute the attack to the class V agent constructed by the user. In scenarios where agent 1 directly creates agent 2 which attacks, it makes sense to attribute it to agent 1, but when the causal chain goes in the middle through the user making an error of reasoning unforced by superhuman manipulation, the attribution to agent 1 is not that useful.

comment by TekhneMakre · 2021-11-16T03:08:45.096Z · LW(p) · GW(p)

> I expect the first alignment solution you can actually deploy in real life, in the unlikely event we get a solution at all, looks like 98% "don't think about all these topics that we do not absolutely need and are adjacent to the capability to easily invent very dangerous outputs" and 2% "actually think about this dangerous topic but please don't come up with a strategy inside it that kills us".

Some ways that it's hard to make a mind not think about certain things:
1. Entanglement.
1.1. Things are entangled with other things.
--Things are causally entangled. X causes Y, Y causes X, Z causes X and Y, X and Y cause Z and you've conditioned on Z; and chains of these.
--Things are constitutively entangled. A computer is a computer and is also usually wires, so thinking about computers entails thinking about wires, and vice versa.
--Things are telically entangled; X serves the purpose Y or vice versa, X and Y serve the same purpose, X and Y are served by the same Z.
--Things are logically entangled; the way X works may be like the way Y works, so understanding X also constitutes understanding Y to some extent.
1.2. When you think about X, and X is entangled with Y, you also think about Y.
2. Meta-level transfer. If you think about X, part of how you do that is by inventing and refining ways to think in general; even if you aren't thinking about Y, this makes it much easier to later think about Y. (This is maybe just an extreme form of logical entanglement, but feels less "about the things themselves".)
3. The need for pointers. As they say: Don't think about pink elephants. To specify what it is that the AI is supposed to not think about, you have to say *something*; how do you point at the forbidden thoughts, in a way that's robust across all contexts and all possible conceptual factorings of the domain, without already spinning up instances of the forbidden thoughts?

-------

Some ML experiments someone could run:
1. Train a NN A to recognize images, but at the same time, train its weights so that its activations can't be used to distinguish Xs from Ys (where X and Y are categories in the recognition task, say). That is, you train a NN B that, given image x, takes A(x)'s activations as inputs, and tries to predict whether x is X or Y; then you update the weights of A along the gradient that decreases B's accuracy at that task (so, you freeze the weights of B for this step). When does this converge at all? Can you get it to converge so that A is SOTA on all inputs that aren't Xs or Ys, and it assigns Xs and Ys to X or Y randomly?

2. Similar setup, but now B is just an ordinary X vs. Y classifier, and we train A so that you can't predict any of B's activations*. Does A end up being able to distinguish Xs from Ys? (Probably, and this speaks to the pointer problem; just saying "don't think about stuff like such-and-so" (e.g. the activations of B) isn't yet enough to actually not think about it.)

*Say, with a linear map, or whatever. Well, maybe we want to exclude the last layer of B or something, since that's close to just training A to not be able to recognize X vs. Y.

3. Train A to recognize all the images, except train it (in some way) to not be able to distinguish Xs from Ys. Now, see how much additional fine-tuning is needed to further train this trained A to predict Xs and Ys (now without the anti-training). Entanglement predicts that there's not much further training needed.
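A rough, untested sketch of experiment 1 above, to make the alternating training loop concrete. Assumptions of mine rather than the commenter's: a PyTorch implementation, a small convolutional net for A, a linear probe for B, and an `adv_weight` knob for the adversarial term; none of these details are specified in the original proposal.

```python
# Illustrative sketch only: train A on the recognition task while an adversary B
# tries to recover "X vs Y" from A's activations; A is additionally pushed along
# the gradient that decreases B's accuracy (B is held fixed during that step).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModelA(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        self.head = nn.Linear(32 * 4 * 4, num_classes)

    def forward(self, x):
        acts = self.features(x)           # the activations the adversary reads
        return self.head(acts), acts

model_A = ModelA(num_classes=10)
probe_B = nn.Linear(32 * 4 * 4, 2)        # predicts X-vs-Y from A's activations
opt_A = torch.optim.Adam(model_A.parameters(), lr=1e-3)
opt_B = torch.optim.Adam(probe_B.parameters(), lr=1e-3)

def training_step(images, labels, xy_mask, xy_labels, adv_weight=1.0):
    """xy_mask marks which images are Xs or Ys; xy_labels is 0 for X, 1 for Y."""
    logits, acts = model_A(images)
    loss_A = F.cross_entropy(logits, labels)      # ordinary recognition loss

    if xy_mask.any():
        # Step 1: update B to distinguish X from Y from A's (detached) activations.
        opt_B.zero_grad()
        loss_B = F.cross_entropy(probe_B(acts.detach()[xy_mask]), xy_labels[xy_mask])
        loss_B.backward()
        opt_B.step()

        # Step 2: penalize A for whatever B can still read off its activations;
        # maximizing B's loss here is the gradient that decreases B's accuracy,
        # and only A's weights are updated in this step.
        loss_A = loss_A - adv_weight * F.cross_entropy(
            probe_B(acts[xy_mask]), xy_labels[xy_mask])

    opt_A.zero_grad()
    loss_A.backward()
    opt_A.step()
    return loss_A.item()
```

Experiments 2 and 3 would reuse the same scaffolding, swapping what the probe predicts and whether the anti-training term is applied.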

comment by Razied · 2021-11-16T02:17:16.336Z · LW(p) · GW(p)

I still don't feel like I've read a convincing case for why GPT-6 would mean certain doom. I can see the danger in prompts like "this is the output of a superintelligence optimising for human happiness:", but a prompt like "Advanced AI Alignment, by Eliezer Yudkowsky, release date: March 2067, Chapter 1: " is liable to produce GPT-6's estimate of a future AI safety textbook. This seems like a ridiculously valuable thing unlikely to contain directly world-destroying knowledge. GPT-6 won't be directly coding, and will only be outputting things it expects future Eliezer to write in such a textbook. This isn't quite a pivotal-grade event, but it seems to be good enough to enable one.

Replies from: calef, Victor Levoso
comment by calef · 2021-11-16T19:54:35.095Z · LW(p) · GW(p)

I don’t think the issue is the existence of safe prompts, the issue is proving the non-existence of unsafe prompts. And it’s not at all clear that a GPT-6 that can produce chapters from 2067EliezerSafetyTextbook is not already past the danger threshold.

Replies from: Razied
comment by Razied · 2021-11-17T00:17:28.907Z · LW(p) · GW(p)

There would clearly be unsafe prompts for such a model, and it would be a complete disaster to release it publicly, but a small safety-oriented team carefully poking at it in secret in a closed room without internet is something different. In general such a team can place really very harsh safety restrictions on a model like this, especially one that isn't very agentic at all like GPT, and I think we have a decent shot at throwing enough of these heuristic restrictions at the safety-textbook-producing model that it would not automatically destroy the earth if used carefully.

Replies from: calef
comment by calef · 2021-11-17T02:34:59.476Z · LW(p) · GW(p)

Sure, but you have essentially no guarantee that such a model would remain contained to that group, or that the insights gleaned from that group could be applied unilaterally across the world before a “bad”* actor reimplemented the model and started asking it unsafe prompts.

Much of the danger here is that once any single lab on earth can make such a model, state actors probably aren’t more than 5 years behind, and likely aren’t more than 1 year behind, based on the economic value that an AGI represents.

• “bad” here doesn’t really mean evil in intent, just an actor that is unconcerned with the safety of their prompts, and thus likely to (in Eliezer’s words) end the world
comment by Victor Levoso · 2021-11-16T15:33:38.646Z · LW(p) · GW(p)

So, first, it is really unclear what you would actually get from GPT-6 in this situation.
(As an aside, I tried this with GPT-J and it output an index with some chapter names.)
You might just get the rest of your own comment or something similar...
Or maybe you'd get some article about Eliezer's book, some joke book written now, the actual book but containing subtle errors Eliezer might make, a fake article written by an AGI that GPT-6 predicts would likely have taken over the world by then... etc.

Since in general GPT-6 would be optimized to predict (on the training distribution) what follows from that kind of text, which is not the same as helpfully responding to prompts (for a current example, Codex outputs bad code when prompted with bad code).

It seems to me like the result depends on unknown things about what really big transformer models do internally which seem really hard to predict.

But for you to get something like what you want from this, GPT-6 needs to be modeling future Eliezer in great detail, complete with lots of thought and interactions.
And while GPT-6 could have been optimized into having a very specific human-modeling algorithm that happens to do that, it seems more likely that, before the optimization process finds the complicated algorithm necessary, it finds something simpler and more consequentialist, which does some more general thinking process to achieve some goal that happens to output the right completions on the training distribution.
Which is really dangerous.

And if you instead trained it with human feedback to ensure you get helpful responses (which sounds like exactly the kind of thing people would do if they wanted to actually use GPT-6 to do things like answer questions), it would be even worse, because you are directly optimizing it for human feedback, and it seems clearer there that you are running a search for strategies that make the human-feedback number go higher.

Replies from: Razied
comment by Razied · 2021-11-17T00:19:36.397Z · LW(p) · GW(p)

I think the issues where GPT-6 avoids actually outputting a serious book are fairly easy to solve. For one, you can annotate every item in the training corpus with a tag containing its provenance (arxiv, the various scientific journals, publishing houses, reddit, etc.) and the publication date (and maybe some other things like the number of words); these tags are made available to the network during training. Then the prompt you give to GPT can contain the tag for the origin of the text you want it to produce and the date it was produced; this avoids the easy failure mode of GPT-6 outputting my comment or some random blog post, because these things will not have been annotated as "official published book" in the training set, nor will they have the tagged word count.
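A minimal sketch of what that tagging scheme could look like in practice. The tag syntax, field names, and special-token format below are purely illustrative assumptions of mine, not an existing GPT training pipeline:

```python
# Illustrative only: prepend hypothetical provenance/date/length tags to each
# training document, so the model can later be prompted conditionally on them.
def tag_document(text: str, source: str, date: str, n_words: int) -> str:
    header = f"<|source={source}|><|date={date}|><|words={n_words}|>"
    return header + "\n" + text

# Training time: every corpus item carries its true metadata.
example = tag_document("Chapter 1 ...", source="published_book",
                       date="2021-06-14", n_words=95000)

# Sampling time: the user supplies the provenance and date they *want*, so a
# random blog post or forum comment (tagged differently during training) is a
# poor continuation under this conditioning.
prompt = ("<|source=published_book|><|date=2067-03-01|><|words=120000|>\n"
          "Advanced AI Alignment, by Eliezer Yudkowsky. Chapter 1:")
```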

GPT-6 predicting AI takeover of the publishing houses and therefore producing a malicious AI safety book is a possibility, but I think most future paths where the world is destroyed by AI don't involve Elsevier still existing and publishing malicious safety books. But even if this is a possibility, we can just re-sample GPT-6 on this prompt to get a variety of books corresponding to the distribution of future outcomes expected by GPT-6, which are then checked by a team of safety researchers. As with most problems, generating interesting solutions is harder than verifying them; it doesn't have to be perfect to be ridiculously useful.

This general approach of "run GPT-6 in a secret room without internet, patching safety bugs with various heuristics, making it generate AI safety work that is then verified by a team" seems promising to me. You can even do stuff like train GPT-6 on an internal log of the various safety patches the team is working on, then have GPT-6 predict the next patch or possible safety problem. This approach is not safe at extreme levels of AI capability, and some prompts are safer than others, but it doesn't strike me as "obviously the world ends if someone tries this".

Replies from: gwern
comment by gwern · 2021-11-17T02:15:53.023Z · LW(p) · GW(p)

with a tag containing its provenance (arxiv, the various scientific journals, publishing houses, reddit, etc.) and the publication date (and maybe some other things like the number of words), these tags are made available to the network during training. Then the prompt you give to GPT can contain the tag for the origin of the text you want it to produce and the date it was produced, this avoids the easy failure mode of GPT-6 outputting my comment or some random blog post because these things will not have been annotated as "official published book" in the training set, nor will they have the tagged word count.

If you include something like reviews or quotes praising its accuracy, then you're moving towards Decision Transformer territory [LW · GW] with feedback loops...

comment by awenonian · 2021-11-17T01:18:55.526Z · LW(p) · GW(p)

So, I'm not sure if I'm further down the ladder and misunderstanding Richard, but I found this line of reasoning objectionable (maybe not the right word):

"Consider an AI that, given a hypothetical scenario, tells us what the best plan to achieve a certain goal in that scenario is. Of course it needs to do consequentialist reasoning to figure out how to achieve the goal. But that’s different from an AI which chooses what to say as a means of achieving its goals."

My initial (perhaps uncharitable) response is something like "Yeah, you could build a safe system that just prints out plans that no one reads or executes, but that just sounds like a complicated way to waste paper. And if something is going to execute them, then what difference is it whether that's humans or the system itself?"

This, with its various mentions of manipulating humans, seems to me like it would most easily arise from an imagined scenario of AI "turning" on us. Like that we'd accidentally build a Paperclip Maximizer, and it would manipulate people by saying things like "Performing [action X which will actually lead to the world being turned into paperclips] will end all human suffering, you should definitely do it." And that this could be avoided by using an Oracle AI that will just tell us "If you perform action X, it will turn the world into paperclips." And then we can just say "oh, that's dumb, let's not do that."

And I think that this misunderstands alignment. An Oracle that tells you only effective and correct plans for achieving your goals, and doesn't attempt to manipulate you into achieving its own goals, because it doesn't have its own goals besides providing you with effective and correct plans, is still super dangerous. Because you'll ask it for a plan to get a really nice lemon poppy seed muffin, and it will spit out a plan, and when you execute the plan, your grandma will die. Not because the system was trying to kill your grandma, but because that was the most efficient way to get a muffin, and you didn't specify that you wanted your grandma to be alive.

(And you won't know the plan will kill your grandma, because if you understood the plan and all its consequences, it wouldn't be superintelligent)

Alignment isn't about guarding against an AI that has cross purposes to you. It's about building something that understands that when you ask for a muffin, you want your grandma to still be alive, without you having to say that (because there's a lot of things you forgot to specify, and it needs to avoid all of them). And so even an Oracle thing that just gives you plans is dangerous unless it knows those plans need to avoid all the things you forgot to specify. This was what I got out of the Outcome Pump story, and so maybe I'm just saying things everyone already knows...

comment by Evan R. Murphy · 2021-11-28T00:50:18.483Z · LW(p) · GW(p)

Richard, summarized by Richard: "Consider an AI that, given a hypothetical scenario, tells us what the best plan to achieve a certain goal in that scenario is. Of course it needs to do consequentialist reasoning to figure out how to achieve the goal. But that’s different from an AI which chooses what to say as a means of achieving its goals. [...]"

Eliezer, summarized by Richard: "The former AI might be slightly safer than the latter if you could build it, but I think people are likely to dramatically overestimate how big the effect is. The difference could just be one line of code: if we give the former AI our current scenario as its input, then it becomes the latter."

How does giving the former "planner" AI the current scenario as input turn it into the latter "acting" AI? It still only outputs a plan, which the operators can then review and decide whether or not to carry out.

Also, the planner AI that Richard put forth had two inputs, not one. The inputs were: 1) a scenario, and 2) a goal. So for Eliezer (or anyone who confidently understood this part of the discussion), which goal input are you providing to the planner AI in this situation? Are you saying that the planner AI becomes dangerous when it's provided with the current scenario and any goal as inputs?

comment by JMF_628 · 2021-11-29T02:21:18.200Z · LW(p) · GW(p)

So we want the least dangerous, most easily aligned thing-to-do-with-an-AGI, but it does have to be a pretty powerful act to prevent the automatic destruction of Earth after 3 months or 2 years. It has to "flip the gameboard" rather than letting the suicidal game play out. We need to align the AGI that performs this pivotal act, to perform that pivotal act without killing everybody.

1. Destroy the Moon
2. ?
3. Profit

(Announce what you're going to do and why you're going to do it ahead of time, make it clear that the hardware and software to destroy Earth is now available, try to make it clear it would be extremely easy to do by accident, etc.) It doesn't seem like it's all that dangerous or would require all that much power/intelligence, in the grand scheme of things; in particular it sounds safer than melting GPUs.

But it's not a very good solution. The happy story I could tell is that the governments and populace would snap out of it, comprehend the threat, and then everyone would stop trying to destroy the universe. But what I'd bet would actually happen is that it just kicks off an arms race so the universe gets destroyed faster; people would understand the "AI is extremely powerful and can destroy the world" part without internalizing the "and it's extremely easy — the default way of things, in fact — to do so by accident" part. (And I have a hard time imagining someone being put in charge of the world just because they were credibly holding it hostage.)

Also I imagine it would probably screw up Earth's ecosystem and possibly kill a lot of people from starvation or catastrophic weather (and you said duplicating strawberries requires too much intelligence to be safe, though I don't understand why). Also I guess you'd have to destroy the Moon carefully enough not to have a chunk of it destroy Earth or whatever, so maybe it does require too much 'intelligence/world-modelling/whatever' to be safer than e.g. melting GPUs.

Not a great solution, so I guess I'll keep thinking

comment by aleph_four · 2021-11-18T16:51:47.647Z · LW(p) · GW(p)

I love being accused of being GPT-x on Discord by people who don't understand scaling laws and think I own a planet of A100s

There are some hard and mean limits to explainability, and there's a real issue that a person who correctly sees how to align AGI, or who correctly perceives that an AGI design is catastrophically unsafe, will not be able to explain it. It requires super-intelligence to cogently expose stupid designs that will kill us all. What are we going to do if there's this kind of coordination failure?

comment by Logan Zoellner (logan-zoellner) · 2021-11-15T21:01:10.735Z · LW(p) · GW(p)

Yudkowsky's insistence that only dangerous AI can come up with a "pivotal act" is fairly ridiculous.

Consider the following pivotal act: "launch a nuclear weapon at every semiconductor fab on earth".

Any human of even average intelligence could have thought of this. We do not need a smarter-than-all-humans-ever AI to achieve a pivotal act.

A boxed AI should be able to think of pivotal acts and describe them to humans without being so smart that it by necessity escapes the box and destroys all humans.

Replies from: Eliezer_Yudkowsky, dxu, logan-zoellner
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-15T21:43:58.127Z · LW(p) · GW(p)

launch a nuclear weapon at every semiconductor fab on earth

This is not what I label "pivotal".  It's big, but a generation later they've rebuilt the semiconductor fabs and then we're all in the same position.  Or a generation later, algorithms have improved to where the old GPU server farms can implement AGI.  The world situation would be different then, if the semiconductor fabs had been nuked 10 years earlier, but it isn't obviously better.

Replies from: logan-zoellner
comment by Logan Zoellner (logan-zoellner) · 2021-11-15T22:16:43.350Z · LW(p) · GW(p)

If I really thought AI was going to murder us all in the next 6 months to 2 years, I would definitely consider those 10 years "pivotal", since it would give us 5x-20x the time to solve the alignment problem. I might even go full Butlerian Jihad and just ban semiconductor fabs altogether.

Actually, I think the right question is: is there anything you would consider pivotal other than just solving the alignment problem? If not, the whole argument seems to be "If we can't find a safe way to solve the alignment problem, we should consider dangerous ones."

Replies from: RobbBB, Raemon, Eliezer_Yudkowsky
comment by Rob Bensinger (RobbBB) · 2021-11-15T22:49:43.956Z · LW(p) · GW(p)

[Update: As of today Nov. 16 (after checking with Eliezer), I've edited the Arbital page to define "pivotal act" the way it's usually used: to refer to a good gameboard-flipping action, not e.g. 'AI destroys humanity'. The quote below uses the old definition, where 'pivotal' meant anything world-destroying or world-saving.]

Eliezer's using the word "pivotal" here to mean something relatively specific, described on Arbital:

The term 'pivotal' in the context of value alignment theory is a guarded term to refer to events, particularly the development of sufficiently advanced AIs, that will make a large difference a billion years later. A 'pivotal' event upsets the current gameboard - decisively settles a win or loss, or drastically changes the probability of win or loss, or changes the future conditions under which a win or loss is determined.

[...]

### Examples of pivotal and non-pivotal events

Pivotal events:

• non-value-aligned AI is built, takes over universe
• human intelligence enhancement powerful enough that the best enhanced humans are qualitatively and significantly smarter than the smartest non-enhanced humans
• a limited Task AGI that can:
• upload humans and run them at speeds more comparable to those of an AI
• prevent the origin of all hostile superintelligences (in the nice case, only temporarily and via strategies that cause only acceptable amounts of collateral damage)
• design or deploy nanotechnology such that there exists a direct route to the operators being able to do one of the other items on this list (human intelligence enhancement, prevent emergence of hostile SIs, etc.)
• a complete and detailed synaptic-vesicle-level scan of a human brain results in cracking the cortical and cerebellar algorithms, which rapidly leads to non-value-aligned neuromorphic AI

Non-pivotal events:

• curing cancer (good for you, but it didn't resolve the value alignment problem)
• proving the Riemann Hypothesis (ditto)
• an extremely expensive way to augment human intelligence by the equivalent of 5 IQ points that doesn't work reliably on people who are already very smart
• making a billion dollars on the stock market
• robotic cars devalue the human capital of professional drivers, and mismanagement of aggregate demand by central banks plus burdensome labor market regulations is an obstacle to their re-employment

Borderline cases:

• unified world government with powerful monitoring regime for 'dangerous' technologies
• widely used gene therapy that brought anyone up to a minimum equivalent IQ of 120

### Centrality to limited AI proposals

We can view the general problem of Limited AI as having the central question: What is a pivotal positive accomplishment, such that an AI which does that thing and not some other things is therefore a whole lot safer to build? This is not a trivial question because it turns out that most interesting things require general cognitive capabilities, and most interesting goals can require arbitrarily complicated value identification problems to pursue safely.

It's trivial to create an "AI" which is absolutely safe and can't be used for any pivotal achievements. E.g. Google Maps, or a rock with "2 + 2 = 4" painted on it.

[...]

### Centrality to concept of 'advanced agent'

We can view the notion of an advanced agent as "agent with enough cognitive capacity to cause a pivotal event, positive or negative"; the advanced agent properties are either those properties that might lead up to participation in a pivotal event, or properties that might play a critical role in determining the AI's trajectory and hence how the pivotal event turns out.

In conversations I've seen that use the word "pivotal", it's usually asking about pivotal acts we can do that end the acute x-risk period (things that make it the case that random people in the world can't suddenly kill everyone with AGI or bioweapons or what-have-you). I.e., it's specifically focused on good pivotal acts.

Replies from: RobbBB, logan-zoellner
comment by Rob Bensinger (RobbBB) · 2021-11-15T23:10:11.578Z · LW(p) · GW(p)

IMO it's confusing that Eliezer uses the word "pivotal" on Arbital to also refer to ways AI could destroy the world. If we're talking about stuff like "what's the easiest pivotal act?" or "how hard do pivotal acts tend to be?", I'll give wildly different answers if I'm including 'ways to destroy the world' and not just 'ways to save the world' -- destroying the world seems drastically easier to me. And I don't know of an unambiguous short synonym for 'good pivotal act'.

(Eliezer proposes 'pivotal achievement', but empirically I don't see people using this much, and it still has the same problem that it re-uses the word 'pivotal' for both categories of event, thus making them feel very similar.)

Usually I care about either 'ways of saving the world' or 'ways of destroying the world' -- I rarely find myself needing a word for the superset. E.g., I'll find myself searching for a short term to express things like 'the first AGI company needs to look for a way-to-save-the-world' or 'I wish EAs would spend more time thinking about ways-to-use-AGI-to-save-the-world'. But if I say 'pivotal', this will technically include x-catastrophes, which is not what I have in mind.

(On the other hand, the concept of 'the kind of AI that's liable to cause pivotal events' does make sense to me and feels very useful, because I think AGI gets you both the world-saving and the world-destroying capabilities in one fell swoop (though not necessarily the ability to align AGI to actually utilize the capabilities you want). But given my beliefs about AGI, I'm satisfied with just using the term 'AGI' to refer to 'the kind of AI that's liable to cause pivotal events'. Eliezer's more-specifically-about-pivotal-events term for this on Arbital, 'advanced agent', seems fine to me too.)

Replies from: RobbBB
comment by Rob Bensinger (RobbBB) · 2021-11-16T00:24:21.742Z · LW(p) · GW(p)

Update: Eliezer has agreed to let me edit the Arbital article to follow more standard usage nowadays, with 'pivotal acts' referring to good gameboard-flipping actions. The article will use 'existential catastrophe' to refer to bad gameboard-flipping events, and 'astronomically significant event' to refer to the superset. Will re-quote the article here once there's a new version.

Replies from: RobbBB
comment by Rob Bensinger (RobbBB) · 2021-11-17T04:54:47.864Z · LW(p) · GW(p)

New "pivotal act" page:

The term 'pivotal act' in the context of AI alignment theory is a guarded term to refer to actions that will make a large positive difference a billion years later. Synonyms include 'pivotal achievement' and 'astronomical achievement'.

We can contrast this with existential catastrophes (or 'x-catastrophes'), events that will make a large negative difference a billion years later. Collectively, this page will refer to pivotal acts and existential catastrophes as astronomically significant events (or 'a-events').

'Pivotal event' is a deprecated term for referring to astronomically significant events, and 'pivotal catastrophe' is a deprecated term for existential catastrophes. 'Pivotal' was originally used to refer to the superset (a-events), but AI alignment researchers kept running into the problem of lacking a crisp way to talk about 'winning' actions in particular, and their distinctive features.

Usage has therefore shifted such that (as of late 2021) researchers use 'pivotal' and 'pivotal act' to refer to good events that upset the current gameboard - events that decisively settle a win, or drastically increase the probability of a win.

comment by Logan Zoellner (logan-zoellner) · 2021-11-15T22:59:25.783Z · LW(p) · GW(p)

Under this definition, it seems that "nuke every fab on Earth" would qualify as "borderline", and every outcome that is both "pivotal" and "good" depends on solving the alignment problem.

comment by Raemon · 2021-11-15T22:35:35.877Z · LW(p) · GW(p)

Pivotal in this case is a technical term (whose article opens with an explicit bid for people not to stretch the definition of the term). It's not (by definition) limited to 'solving the alignment problem', but there are constraints on what counts as pivotal.

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-16T01:08:56.989Z · LW(p) · GW(p)

If you can deploy nanomachines that melt all the GPU farms and prevent any new systems with more than 1 networked GPU from being constructed, that counts.  That really actually suspends AGI development indefinitely pending an unlock, and not just for a brief spasmodic costly delay.

Replies from: Wei_Dai, Vaniver
comment by Wei_Dai · 2021-11-27T21:54:24.531Z · LW(p) · GW(p)

1. Are you expecting that the team behind the "melt all GPU farms" pivotal act to be backed by a major government or coalition of governments?
2. If not, I expect that the team and its AGI will be arrested/confiscated by the nearest authority as soon as the pivotal act occurs, and forced by them to apply the AGI to other goals. Do you see things happening differently, or expect things to come out well despite this?
Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-28T05:33:01.544Z · LW(p) · GW(p)

"Melt all GPUs" is indeed an unrealistic pivotal act - which is why I talk about it, since like any pivotal act it is outside the Overton Window, and then if any children get indignant about the prospect of doing something other than letting the world end miserably, I get to explain the child-reassuring reasons why you would never do the particular thing of "melt all GPUs" in real life.  In this case, the reassuring reason is that deploying open-air nanomachines to operate over Earth is a huge alignment problem, that is, relatively huger than the least difficult pivotal act I can currently see.

That said, if unreasonably-hypothetically you can give your AI enough of a utility function and have it deploy enough intelligence to create nanomachines that safely move through the open-ended environment of Earth's surface, avoiding bacteria and not damaging any humans or vital infrastructure, in order to surveil all of Earth and find the GPU farms and then melt them all, it's probably not very much harder to tell those nanomachines to melt other things, or demonstrate the credibly threatening ability to do so.

That said, I indeed don't see how we sociologically get into this position in a realistic way, in anything like the current world, even assuming away the alignment problem.  Unless Demis Hassabis suddenly executes an emergency pact with the Singaporean government, or something else I have trouble visualizing?  I don't see any of the current owners or local governments of the big AI labs knowingly going along with any pivotal act executed deliberately (though I expect them to think it's just fine to keep cranking up the dial on an AI until it destroys the world, so long as it looks like it's not being done on purpose).

It is indeed the case that, conditional on the alignment problem being solvable, there's a further sociological problem - which looks a lot less impossible, but which I do not actually know how to solve - wherein you then have to do something pivotal, and there's no grownups in government in charge who would understand why that was something necessary to do.  But it's definitely a lot easier to imagine Demis forming a siloed team or executing an emergency pact with Singapore, than it is to see how you would safely align the AI that does it.  And yes, the difficulty of any pivotal act to stabilize the Earth includes the difficulty of what you had to do, before or after you had sufficiently powerful AGI, in order to execute that act and then prevent things from falling over immediately afterwards.

Replies from: Wei_Dai
comment by Wei_Dai · 2021-11-28T20:59:12.086Z · LW(p) · GW(p)

the least difficult pivotal act I can currently see.

Do you have a plan to communicate the content of this to people whom it would be beneficial to communicate to? E.g., write about it in some deniable way, or should such people just ask you about it privately? Or more generally, how do you think that discussions / intellectual progress on this topic should go?

Do you think the least difficult pivotal act you currently see has sociopolitical problems that are similar to "melt all GPUs"?

That said, I indeed don’t see how we sociologically get into this position in a realistic way, in anything like the current world, even assuming away the alignment problem.

Thanks for the clarification. I suggest mentioning this more often (like in the Arbital page), as I previously didn't think that your version of "pivotal act" had a significant sociopolitical component. If this kind of pivotal act is indeed how the world gets saved (conditional on the world being saved), one of my concerns is that "a miracle occurs" and the alignment problem gets solved, but the sociopolitical problem doesn't because nobody was working on it (even if it's easier in some sense).

But it’s definitely a lot easier to imagine Demis forming a siloed team or executing an emergency pact with Singapore

(Not a high priority to discuss this here and now, but) I'm skeptical that backing by a small government like Singapore is sufficient, since any number of major governments would be very tempted to grab the AGI(+team) from the small government, and the small government will be under tremendous legal and diplomatic stress from having nonconsensually destroyed a lot of very valuable other people's property. Having a partially aligned/alignable AGI in the hands of a small, geopolitically weak government seems like a pretty precarious state.

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-29T06:06:02.289Z · LW(p) · GW(p)

Singapore probably looks a lot less attractive to threaten if it's allied with another world power that can find and melt arbitrary objects.

comment by Vaniver · 2021-11-16T02:02:20.176Z · LW(p) · GW(p)

I'm still unsure how true I think this is.

Clearly a full Butlerian jihad (where all of the computers are destroyed) suspends AGI development indefinitely, and destroying no computers doesn't slow it down at all. There's a curve then where the more computers you destroy, the more you both 1) slow down AGI development and 2) disrupt the economy (since people were using those to keep their supply chains going, organize the economy, do lots of useful work, play video games, etc.).

But even if you melt all the GPUs, I think you have two obstacles:

1. CPUs alone can do lots of the same stuff. There's some paper I was thinking of from ~5 years ago where they managed to get a CPU farm competitive with the GPUs of the time, and it might have been this paper (whose authors are all from Intel, who presumably have a significant bias) or it might have been the Hogwild-descended stuff (like this); hopefully someone knows something more up to date.
2. The chip design ecosystem gets to react to your ubiquitous nanobots and reverse-engineer what features they're looking for to distinguish between whitelisted CPUs and blacklisted GPUs; they may be able to design a ML accelerator that fools the nanomachines. (Something that's robust to countermoves might have to eliminate many more current chips.)
Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-11-16T03:38:25.592Z · LW(p) · GW(p)

I agree you might need to make additional moves to keep the table flipped, but in a scenario like this you would actually have the capability to make those moves.

Replies from: logan-zoellner
comment by Logan Zoellner (logan-zoellner) · 2021-11-16T14:59:23.937Z · LW(p) · GW(p)

Is the plan just to destroy all computers with, say, >1e15 flops of computing power? How does the nanobot swarm know what a "computer" is? What do you do about something like GPT-Neo or SETI@home where the compute is distributed?

I'm still confused as to why you think the task "build an AI that destroys anything with >1e15 flops of computing power -- except humans, of course" would be dramatically easier than the alignment problem.

Setting back civilization a generation (via catastrophe) seems relatively straightforward. Building a social consensus/religion that destroys anything "in the image of a mind" at least seems possible. Fine-tuning a nanobot swarm to destroy some but not all computers just sounds really hard to me.

comment by dxu · 2021-11-15T23:30:19.292Z · LW(p) · GW(p)

Consider the following pivotal act: "launch a nuclear weapon at every semiconductor fab on earth".

Any human of even average intelligence could have thought of this.

And by that very same token, the described plan would not actually work.

We do not need a smarter-than-all-humans-ever AI to achieve a pivotal act.

Unless we want the AI in question to output a plan that has a chance of actually working.

A boxed AI should be able to think of pivotal acts and describe them to humans without being so smart that it by necessity escapes the box and destroys all humans.

If an actually workable pivotal act existed that did not require better-than-human intelligence to come up with, we would already be in the process of implementing said pivotal act, because someone would have thought of it already. The fact that this is obviously not the case should therefore cause a substantial update against the antecedent.

Replies from: logan-zoellner
comment by Logan Zoellner (logan-zoellner) · 2021-11-16T14:44:11.036Z · LW(p) · GW(p)

If an actually workable pivotal act existed that did not require better-than-human intelligence to come up with, we would already be in the process of implementing said pivotal act, because someone would have thought of it already. The fact that this is obviously not the case should therefore cause a substantial update against the antecedent.

This is an incredibly bad argument. Saying something cannot possibly work because no one has done it yet would mean that literally all innovation is impossible.

Replies from: dxu
comment by dxu · 2021-11-16T17:59:51.877Z · LW(p) · GW(p)

Saying something cannot possibly work because no one has done it yet would mean that literally all innovation is impossible.

You are attempting to generalize conclusions about an extremely loose class of achievements ("innovation"), to an extremely tight class of achievements ("commit, using our current level of knowledge and resources, a pivotal act"). That this generalization is invalid ought to go without saying, but in the interest of constructiveness I will point out one (relevant) aspect of the disanalogy:

"Innovation", at least as applied to technology, is incremental; new innovations are allowed to build on past knowledge in ways that (in principle) place no upper limit on the technological improvements thus achieved (except whatever limits are imposed by the hard laws of physics and mathematics). There is also no time limit on innovation; by default, anything that is possible at all is assumed to be realized eventually, but there are no guarantees as to when that will happen for any specific technology.

"Commit a pivotal act using the knowledge and resources currently available to us", on the other hand, is the opposite of incremental: it demands that we execute a series of actions that leads to some end goal (such as "take over the world") while holding fixed our level of background knowledge/acumen. Moreover, whereas there is no time limit on technological "innovation", there is certainly a time limit on successfully committing a pivotal act; and moreover this time limit is imposed precisely by however long it takes before humanity "innovates" itself to AGI.

In summary, your analogy leaks, and consequently so does your generalization. In fact, however, your reasoning is further flawed: even if your analogy were tight, it would not suffice to establish what you need to establish. Recall your initial claim:

We do not need a smarter-than-all-humans-ever AI to achieve a pivotal act.

This claim does not, in fact, become more plausible if we replace "achieve a pivotal act" with e.g. "vastly increase the pace of technological innovation". This is true even though technological innovation is, as a human endeavor, far more tractable than saving/taking over the world. This is because the load-bearing part of the argument is that the AI must produce relevant insights (whether related to "innovation" or "pivotal acts") at a rate vastly superior to that of humans, in order for it to be able to reliably produce innovations/world-saving plans. (I leave it unargued that humans do not reliably do either of these things.) In other words, it certainly requires an AI whose ability in the relevant domains exceeds that of "all humans ever", because "all humans ever" empirically do not (reliably) accomplish these tasks.

For your argument to go through, in other words, you cannot get away with arguing merely that something is "possible" (though in fact you have not even established this much, because the analogy with technological innovation does not hold). Your argument actually requires you to argue for the (extremely strong) claim that the ambient probability with which humans successfully generate world-saving plans, is sufficient to the task of generating a successful world-saving plan before unaligned AGI is built. And this claim is clearly false, since (once again)

If an actually workable pivotal act existed that did not require better-than-human intelligence to come up with, we would already be in the process of implementing said pivotal act, because someone would have thought of it already. The fact that this is obviously not the case should therefore cause a substantial update against the antecedent.

Replies from: logan-zoellner
comment by Logan Zoellner (logan-zoellner) · 2021-11-16T21:40:44.675Z · LW(p) · GW(p)

the AI must produce relevant insights (whether related to "innovation" or "pivotal acts") at a rate vastly superior to that of humans, in order for it to be able to reliably produce innovations/world-saving plans

This is precisely the claim we are arguing about! I disagree that the AI needs to produce insights "at a rate vastly superior to all humans".

On the contrary, I claim that there is one borderline act (start a catastrophe that sets back AI progress by decades) that can be done with current human knowledge. And I furthermore claim that there is one pivotal act (design an aligned AI) that may well be achieved via incremental progress.

Replies from: dxu
comment by dxu · 2021-11-17T03:35:32.244Z · LW(p) · GW(p)

If the AI does not need to produce relevant insights at a faster rate than humans, then that implies the rate at which humans produce relevant insights is sufficiently fast already. And if that’s your claim, then you—again—need to explain why no humans have been able to come up with a workable pivotal act to date.

On the contrary, I claim that there is one borderline act (start a catastrophe that sets back AI progress by decades) that can be done with current human knowledge.

How do you propose to accomplish this? Your initial suggestion, “launch nukes at every semiconductor fab”, is not workable. If all of the candidate solutions you have in mind are of similar quality to that, then I reiterate: humans cannot, with their current knowledge and resources, execute a pivotal act in the real world.

And I furthermore claim that there is one pivotal act (design an aligned AI) that may well be achieved via incremental progress.

This is the hope, yes. Note, however, that this is a path that routes directly through smarter-than-human AI, which necessity is precisely what you are disputing. So the existence of this path does not particularly strengthen your case.

Replies from: logan-zoellner
comment by Logan Zoellner (logan-zoellner) · 2021-11-18T16:24:01.967Z · LW(p) · GW(p)

Your initial suggestion, “launch nukes at every semiconductor fab”, is not workable.

In what way is it not workable? Perhaps we have different intuitions about how difficult it is to build a cutting-edge semiconductor facility? Alternatively, you may disagree with me that AI is largely hardware-bound, and thus that cutting off the supply of new compute will also prevent the rise of superhuman AI?

Do you also think that "the US president launches every nuclear weapon at his command, causing nuclear winter?" would fail to prevent the rise of superhuman AGI?

comment by Logan Zoellner (logan-zoellner) · 2021-11-15T21:22:47.066Z · LW(p) · GW(p)

It would not surprise me in the least if the world ends before self-driving cars are sold on the mass market.

Obviously it is impossible to bet money on the end of the world. But if it were possible, I would be willing to give fairly long odds that this is wrong.

comment by philh · 2021-11-15T22:31:35.642Z · LW(p) · GW(p)

Obviously it is impossible to bet money on the end of the world.

I think this is neither obvious nor true. There are lots of variants you could do and details you'd need to fill in, but the outline of a simple one would be: "I pay you $X now, and if and when self-driving cars reach mass market without the world having ended, you pay me $Y, inflation-adjusted".

Replies from: khafra, M. Y. Zuo
comment by khafra · 2021-11-18T18:39:41.930Z · LW(p) · GW(p)

Robin Hanson said [LW · GW], with Eliezer eventually concurring, that "bets like this will just recover interest rates, which give the exchange rate between resources on one date and resources on another date."

E.g., it's not impossible to bet money on the end of the world, but it's impossible to do it in a way substantially different from taking a loan.

Replies from: philh
comment by philh · 2021-11-19T11:22:04.347Z · LW(p) · GW(p)

Oh, thanks for the pointer. I confess I wish Robin was less terse here.

I'm not sure I even understand the claim, what does it mean to "recover interest rates"? Is Robin claiming any such bet will either

1. Have payoffs such that [the person receiving money now and paying money later] could just take out a loan at prevailing interest rates to make this bet; or
2. Have at least one party who is being silly with money?

...oh, I think I get it, and IIUC the idea that fails is different from what I was suggesting.

The idea that fails is that you can make a prediction market from these bets and use it to recover a probability of apocalypse. I agree that won't work, for the reason given: prices of these bets will be about both [probability of apocalypse] and [the value of money-now versus money-later, conditional on no apocalypse], and you can't separate those effects.

I don't think this automatically sinks the simpler idea of: if Alice and Bob disagree about the probability of an apocalypse, they may be able to make a bet that both consider positive-expected-utility. And I don't think that bet would necessarily just be a combination of available market-rate loans? At least it doesn't look like anyone is claiming that.
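A toy way to see this point, under assumed numbers (the indifference condition and the variable names here are my own framing, not Robin's or philh's): the agreed price only pins down the product of the survival probability and the money-later discount, so the probability can't be read off separately.

```python
# Toy model: the "doomer" takes $x now and owes $y later only if the world survives.
# They are indifferent when x = (1 - p) * d * y, so only the product (1 - p) * d
# is observable from the agreed price; p and d cannot be separated.
def indifference_price(y_later: float, p_apocalypse: float, discount: float) -> float:
    return (1 - p_apocalypse) * discount * y_later

# Two very different worldviews imply the same price:
print(indifference_price(2_000_000, p_apocalypse=0.50, discount=0.90))  # 900000.0
print(indifference_price(2_000_000, p_apocalypse=0.10, discount=0.50))  # 900000.0
```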

comment by M. Y. Zuo · 2021-11-15T22:55:15.019Z · LW(p) · GW(p)

Wow, that may be a genuinely ground-breaking application for crypto currencies. E.g. someone with 1000 bitcoins can put them in some form of guaranteed, irreversible escrow for a million dollars upfront and a release date in 2050. If the world ends then the escrow vanishes; if not, the lucky bettor would get it.

Replies from: philh
comment by philh · 2021-11-15T23:54:32.052Z · LW(p) · GW(p)

(Since you said "ground breaking", I feel I should maybe be clear that the idea wasn't original to me and the ground was already broken when I got here. I probably saw it on LW myself.)

Note that 1000 BTC at current prices is like $64 million. In general with this structure (trade crypto for fiat, plus the crypto gets locked until later) I don't see the incentive on anyone's side to lock the crypto?

I haven't thought about this in depth and I assume some people have, but my sense is that these kinds of bets are a lot easier if both parties trust each other to be willing and able to pay up when appropriate. If you don't trust your counterparty, and demand money in escrow, then their incentive to take part seems minimal or none.

Replies from: M. Y. Zuo
comment by M. Y. Zuo · 2021-11-16T00:06:41.962Z · LW(p) · GW(p)

The actual structure, and payout ratio, would probably be set in a much more elaborate way. Maybe some kind of annuity paying out every year the world hasn’t ended yet? Like committing 10 bitcoins every year from a reversible escrow account to the irreversible escrow if the servers still exist, or else the total balance is forfeited. Something along those lines; perhaps others would want to take up the project.

Replies from: philh
comment by philh · 2021-11-16T00:36:27.837Z · LW(p) · GW(p)

Fwiw (and this might just be that you've thought about it more than I have, and/or have details in mind that you haven't specified), it still seems to me that this runs into the problem that at least one party basically won't have any reason to enter into such a bet, because their potential upside will be locked away long enough to negate its value.

Replies from: M. Y. Zuo
comment by M. Y. Zuo · 2021-11-16T02:40:07.308Z · LW(p) · GW(p)

I imagine that was one of the critiques for prediction markets using crypto currencies held in escrow, yet they exist now, and they’re not all scams, so there must be some non-zero market-clearing price.

Replies from: philh
comment by philh · 2021-11-16T08:52:07.975Z · LW(p) · GW(p)

I don't think that holds up, because with traditional uses of prediction markets both parties expect to see the end of the bet. If it's a long term bet then that would increase the expected payoff they require to be willing to bet, but there's some amount of "money later" that's worth giving up "money now" in exchange for. So both are willing to lock money up.

With a bet on the end of the world, all of the upside for one party comes from receiving "money now". There's no potential "money later" payoff for them, the bet isn't structured that way. And putting money in escrow means their "money now" vanishes too.

That is, if I receive a million dollars now and pay out two million if the world doesn't end, then: for now I'm up a million; if the world ends I'm dead; if the world doesn't end I'm down a million. But if the two million has to go in escrow, the only change is that for now I'm down a million too. So I'm not gonna do this.
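A toy calculation of that point, using the $1M/$2M figures above (this is my own illustration of the argument, not anything from the thread beyond those numbers): escrow removes the only branch in which the "doomer" side is ever ahead.

```python
# The doomer receives 1M now and owes 2M later if the world survives.
stake_now, payout_later = 1_000_000, 2_000_000

def doomer_position(escrowed: bool):
    # Returns (net position today, net position later if the world survives);
    # in the "world ends" branch, the later position is moot for both parties.
    now = stake_now - (payout_later if escrowed else 0)
    later_if_survives = stake_now - payout_later
    return now, later_if_survives

print(doomer_position(escrowed=False))  # (1000000, -1000000): the upside is money-now
print(doomer_position(escrowed=True))   # (-1000000, -1000000): ahead in no branch
```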

You could define a threshold for known AI capability or odds of extinction* and bet on that instead.

*as estimated by some set of alignment experts

comment by Logan Zoellner (logan-zoellner) · 2021-11-15T21:26:42.022Z · LW(p) · GW(p)

the thing that kills us is likely to be a thing that can get more dangerous when you turn up a dial on it, not a thing that intrinsically has no dials that can make it more dangerous.

Finally, a specific claim from Yudkowsky I actually agree with.