The Hidden Complexity of Wishes

post by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2007-11-24T00:12:33.000Z · LW · GW · Legacy · 192 comments

(It has come to my attention that this article is currently being misrepresented as proof that I/MIRI previously advocated that it would be very difficult to get machine superintelligences to understand or predict human values. This would obviously be false, and also, is not what is being argued below. The example in the post below is not about an Artificial Intelligence literally at all! If the post were about what AIs supposedly can't do, the central example would have used an AI! The point that is made below will be about the algorithmic complexity of human values. This point is relevant within a larger argument, because it bears on the complexity of what you need to get an artificial superintelligence to want or value; rather than bearing on what a superintelligence supposedly could not predict or understand. -- EY, May 2024.)

"I wish to live in the locations of my choice, in a physically healthy, uninjured, and apparently normal version of my current body containing my current mental state, a body which will heal from all injuries at a rate three sigmas faster than the average given the medical technology available to me, and which will be protected from any diseases, injuries or illnesses causing disability, pain, or degraded functionality or any sense, organ, or bodily function for more than ten days consecutively or fifteen days in any year..."
            -- The Open-Source Wish Project, Wish For Immortality 1.1

There are three kinds of genies:  Genies to whom you can safely say "I wish for you to do what I should wish for"; genies for which no wish is safe; and genies that aren't very powerful or intelligent.

Suppose your aged mother is trapped in a burning building, and it so happens that you're in a wheelchair; you can't rush in yourself.  You could cry, "Get my mother out of that building!" but there would be no one to hear.

Luckily you have, in your pocket, an Outcome Pump.  This handy device squeezes the flow of time, pouring probability into some outcomes, draining it from others.

The Outcome Pump is not sentient.  It contains a tiny time machine, which resets time unless a specified outcome occurs.  For example, if you hooked up the Outcome Pump's sensors to a coin, and specified that the time machine should keep resetting until it sees the coin come up heads, and then you actually flipped the coin, you would see the coin come up heads.  (The physicists say that any future in which a "reset" occurs is inconsistent, and therefore never happens in the first place - so you aren't actually killing any versions of yourself.)

Whatever proposition you can manage to input into the Outcome Pump, somehow happens, though not in a way that violates the laws of physics.  If you try to input a proposition that's too unlikely, the time machine will suffer a spontaneous mechanical failure before that outcome ever occurs.

You can also redirect probability flow in more quantitative ways using the "future function" to scale the temporal reset probability for different outcomes.  If the temporal reset probability is 99% when the coin comes up heads, and 1% when the coin comes up tails, the odds will go from 1:1 to 99:1 in favor of tails.  If you had a mysterious machine that spit out money, and you wanted to maximize the amount of money spit out, you would use reset probabilities that diminished as the amount of money increased.  For example, spitting out $10 might have a 99.999999% reset probability, and spitting out $100 might have a 99.99999% reset probability.  This way you can get an outcome that tends to be as high as possible in the future function, even when you don't know the best attainable maximum.
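The arithmetic in the paragraph above can be checked with a small sketch. It assumes that the Pump simply conditions on "no reset happened", so each outcome's final probability is proportional to its prior probability times its chance of not being reset. The function name `pumped_distribution` and the equal prior over the two payouts are illustrative assumptions, not part of the original description.

```python
from fractions import Fraction


def pumped_distribution(prior, reset_prob):
    """Return outcome probabilities conditioned on no reset occurring.

    prior:      dict mapping outcome -> probability without the Pump
    reset_prob: dict mapping outcome -> temporal reset probability
    """
    # An outcome is only ever observed if its timeline is not reset,
    # so its weight is prior probability times survival probability.
    weights = {o: prior[o] * (1 - reset_prob[o]) for o in prior}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("every outcome always resets; no consistent future")
    return {o: w / total for o, w in weights.items()}


# Coin example: 99% reset on heads, 1% reset on tails.
coin = pumped_distribution(
    prior={"heads": Fraction(1, 2), "tails": Fraction(1, 2)},
    reset_prob={"heads": Fraction(99, 100), "tails": Fraction(1, 100)},
)
print(coin["tails"] / coin["heads"])  # odds in favor of tails: 99

# Money example, ASSUMING $10 and $100 were equally likely beforehand.
money = pumped_distribution(
    prior={10: Fraction(1, 2), 100: Fraction(1, 2)},
    reset_prob={10: 1 - Fraction(1, 10**8), 100: 1 - Fraction(1, 10**7)},
)
print(money[100] / money[10])  # odds in favor of $100: 10
```

Under these assumptions, lowering the reset probability by a factor of ten in survival terms makes the larger payout ten times more favored, whatever the unknown maximum turns out to be.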

So you desperately yank the Outcome Pump from your pocket - your mother is still trapped in the burning building, remember? - and try to describe your goal: get your mother out of the building!

The user interface doesn't take English inputs.  The Outcome Pump isn't sentient, remember?  But it does have 3D scanners for the near vicinity, and built-in utilities for pattern matching.  So you hold up a photo of your mother's head and shoulders; match on the photo; use object contiguity to select your mother's whole body (not just her head and shoulders); and define the future function using your mother's distance from the building's center.  The further she gets from the building's center, the less the time machine's reset probability.

You cry "Get my mother out of the building!", for luck, and press Enter.

For a moment it seems like nothing happens.  You look around, waiting for the fire truck to pull up, and rescuers to arrive - or even just a strong, fast runner to haul your mother out of the building -

BOOM!  With a thundering roar, the gas main under the building explodes.  As the structure comes apart, in what seems like slow motion, you glimpse your mother's shattered body being hurled high into the air, traveling fast, rapidly increasing its distance from the former center of the building.

On the side of the Outcome Pump is an Emergency Regret Button.  All future functions are automatically defined with a huge negative value for the Regret Button being pressed - a temporal reset probability of nearly 1 - so that the Outcome Pump is extremely unlikely to do anything which upsets the user enough to make them press the Regret Button.  You can't ever remember pressing it.  But you've barely started to reach for the Regret Button (and what good will it do now?) when a flaming wooden beam drops out of the sky and smashes you flat.

Which wasn't really what you wanted, but scores very high in the defined future function...

The Outcome Pump is a genie of the second class.  No wish is safe.

If someone asked you to get their poor aged mother out of a burning building, you might help, or you might pretend not to hear.  But it wouldn't even occur to you to explode the building.  "Get my mother out of the building" sounds like a much safer wish than it really is, because you don't even consider the plans that you assign extreme negative values.

Consider again the Tragedy of Group Selectionism: Some early biologists asserted that group selection for low subpopulation sizes would produce individual restraint in breeding; and yet actually enforcing group selection in the laboratory produced cannibalism, especially of immature females.  It's obvious in hindsight that, given strong selection for small subpopulation sizes, cannibals will outreproduce individuals who voluntarily forego reproductive opportunities.  But eating little girls is such an un-aesthetic solution that Wynne-Edwards, Allee, Brereton, and the other group-selectionists simply didn't think of it.  They only saw the solutions they would have used themselves.

Suppose you try to patch the future function by specifying that the Outcome Pump should not explode the building: outcomes in which the building materials are distributed over too much volume will have ~1 temporal reset probabilities.

So your mother falls out of a second-story window and breaks her neck.  The Outcome Pump took a different path through time that still ended up with your mother outside the building, and it still wasn't what you wanted, and it still wasn't a solution that would occur to a human rescuer.
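A toy sketch of why the patch does not help. Everything here is hypothetical: the `Outcome` fields, the debris threshold, and the numbers are invented for illustration, and picking the single highest-weighted outcome is a simplification of how the Pump shifts probability. The point it tries to show is only that the future function never reads whether your mother is alive.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    description: str
    mother_distance_m: float  # distance from the building's center
    debris_volume_m3: float   # volume over which building materials end up spread
    mother_alive: bool        # what you care about; the future function never reads it


MAX_DEBRIS_VOLUME_M3 = 5_000.0  # hypothetical cutoff for "the building exploded"


def survival_weight(outcome: Outcome, patched: bool) -> float:
    """Relative chance that the timeline is NOT reset (higher is more favored)."""
    if patched and outcome.debris_volume_m3 > MAX_DEBRIS_VOLUME_M3:
        return 0.0  # the patch: ~1 reset probability for scattered buildings
    return outcome.mother_distance_m  # farther from the center is better


outcomes = [
    Outcome("firefighter escorts her out", 30.0, 2_000.0, True),
    Outcome("gas main explodes", 80.0, 50_000.0, False),
    Outcome("falls from second-story window", 35.0, 2_000.0, False),
]


def favorite(patched: bool) -> str:
    """Description of the outcome the future function favors most."""
    return max(outcomes, key=lambda o: survival_weight(o, patched)).description


print(favorite(patched=False))  # gas main explodes
print(favorite(patched=True))   # falls from second-story window
```

Each patch removes one bad outcome from the top of the ranking, and the next-highest outcome by distance takes its place, whether or not it is one you would have chosen.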

If only the Open-Source Wish Project had developed a Wish To Get Your Mother Out Of A Burning Building:

"I wish to move my mother (defined as the woman who shares half my genes and gave birth to me) to outside the boundaries of the building currently closest to me which is on fire; but not by exploding the building; nor by causing the walls to crumble so that the building no longer has boundaries; nor by waiting until after the building finishes burning down for a rescue worker to take out the body..."

All these special cases, the seemingly unlimited number of required patches, should remind you of the parable of Artificial Addition - programming an Arithmetic Expert System by explicitly adding ever more assertions like "fifteen plus fifteen equals thirty, but fifteen plus sixteen equals thirty-one instead".

How do you exclude the outcome where the building explodes and flings your mother into the sky?  You look ahead, and you foresee that your mother would end up dead, and you don't want that consequence, so you try to forbid the event leading up to it.

Your brain isn't hardwired with a specific, prerecorded statement that "Blowing up a burning building containing my mother is a bad idea."  And yet you're trying to prerecord that exact specific statement in the Outcome Pump's future function.  So the wish is exploding, turning into a giant lookup table that records your judgment of every possible path through time.

You failed to ask for what you really wanted.  You wanted your mother to go on living, but you wished for her to become more distant from the center of the building.

Except that's not all you wanted.  If your mother was rescued from the building but was horribly burned, that outcome would rank lower in your preference ordering than an outcome where she was rescued safe and sound.  So you not only value your mother's life, but also her health.

And you value not just her bodily health, but her state of mind. Being rescued in a fashion that traumatizes her - for example, a giant purple monster roaring up out of nowhere and seizing her - is inferior to a fireman showing up and escorting her out through a non-burning route.  (Yes, we're supposed to stick with physics, but maybe a powerful enough Outcome Pump has aliens coincidentally showing up in the neighborhood at exactly that moment.)  You would certainly prefer her being rescued by the monster to her being roasted alive, however.

How about a wormhole spontaneously opening and swallowing her to a desert island?  Better than her being dead; but worse than her being alive, well, healthy, untraumatized, and in continual contact with you and the other members of her social network.

Would it be okay to save your mother's life at the cost of the family dog's life, if it ran to alert a fireman but then got run over by a car?  Clearly yes, but it would be better ceteris paribus to avoid killing the dog.  You wouldn't want to swap a human life for hers, but what about the life of a convicted murderer?  Does it matter if the murderer dies trying to save her, from the goodness of his heart?  How about two murderers?  If the cost of your mother's life was the destruction of every extant copy, including the memories, of Bach's Little Fugue in G Minor, would that be worth it?  How about if she had a terminal illness and would die anyway in eighteen months?

If your mother's foot is crushed by a burning beam, is it worthwhile to extract the rest of her?  What if her head is crushed, leaving her body?  What if her body is crushed, leaving only her head?  What if there's a cryonics team waiting outside, ready to suspend the head?  Is a frozen head a person?  Is Terri Schiavo a person?  How much is a chimpanzee worth?

Your brain is not infinitely complicated; there is only a finite Kolmogorov complexity / message length which suffices to describe all the judgments you would make.  But just because this complexity is finite does not make it small.  We value many things, and no, they are not reducible to valuing happiness or valuing reproductive fitness.

There is no safe wish smaller than an entire human morality.  There are too many possible paths through Time.  You can't visualize all the roads that lead to the destination you give the genie.  "Maximizing the distance between your mother and the center of the building" can be done even more effectively by detonating a nuclear weapon.  Or, at higher levels of genie power, flinging her body out of the Solar System.  Or, at higher levels of genie intelligence, doing something that neither you nor I would think of, just like a chimpanzee wouldn't think of detonating a nuclear weapon.  You can't visualize all the paths through time, any more than you can program a chess-playing machine by hardcoding a move for every possible board position.

And real life is far more complicated than chess.  You cannot predict, in advance, which of your values will be needed to judge the path through time that the genie takes.  Especially if you wish for something longer-term or wider-range than rescuing your mother from a burning building.

I fear the Open-Source Wish Project is futile, except as an illustration of how not to think about genie problems.  The only safe genie is a genie that shares all your judgment criteria, and at that point, you can just say "I wish for you to do what I should wish for."  Which simply runs the genie's should function.

Indeed, it shouldn't be necessary to say anything.  To be a safe fulfiller of a wish, a genie must share the same values that led you to make the wish. Otherwise the genie may not choose a path through time which leads to the destination you had in mind, or it may fail to exclude horrible side effects that would lead you to not even consider a plan in the first place.  Wishes are leaky generalizations, derived from the huge but finite structure that is your entire morality; only by including this entire structure can you plug all the leaks.

With a safe genie, wishing is superfluous.  Just run the genie.

192 comments

Comments sorted by oldest first, as this post is from before comment nesting was available (around 2009-02-27).

comment by Kevin2 · 2007-11-24T04:20:27.000Z · LW(p) · GW(p)

Is there a safe way to wish for an unsafe genie to behave like a safe genie? That seems like a wish TOSWP should work on.

Replies from: matty, themusicgod1, billy_the_kid, TheWakalix
comment by matty · 2012-07-07T13:41:15.391Z · LW(p) · GW(p)

I would create the machine (genie) to respond only in ways that cannot physically or mentally hurt or injure any of the participants.

Replies from: Luaan, ArisKatsaris
comment by Luaan · 2012-10-09T11:47:43.924Z · LW(p) · GW(p)

I don't think you quite understood the article :P

It's incredibly hard to specify things unambiguously. Even in common workday practice, communication problems cause tons of problems; you always have to make assumptions, because absolutely precise definition of everything is extremely wasteful (if it's even possible at all). I cringe whenever someone says "But that's obvious! You should have thought of that automatically!". Obviously, their model of reality (wherein I am aware of that particular thingy) is flawed, since I was not.

That's the largest problem when delegating any work, IMO - we all have different preconceptions, and you can't expect anyone else to share all those relevant to any given task. At least anything more complicated than pure math :D

comment by ArisKatsaris · 2012-10-09T12:09:20.621Z · LW(p) · GW(p)

Let us know when you can encode what "physically or mentally hurt or injure any of the participants" means in an actual existing programming language of your choice. :-)

comment by themusicgod1 · 2013-11-24T18:08:32.940Z · LW(p) · GW(p)

A sufficiently powerful genie might make safe genies by definition more unsafe. Then your wish could be granted.

edit (2015) caution: I think this particular comment is harmless in retrospect... but I wouldn't give it much weight

comment by billy_the_kid · 2015-02-02T05:14:01.759Z · LW(p) · GW(p)

I wish for you to interpret my wishes how I interpret them.

Can anyone find a problem with that?

Replies from: ike, Richard_Kennaway, Epictetus
comment by ike · 2015-02-02T05:29:01.665Z · LW(p) · GW(p)

how I interpret them.

What if you never think about the interpretation? Or is this how you would interpret them? Define would, then.

If you think about the interpretation, then you can already explain it. The problem is because you don't actually think about every aspect and possibility while wishing.

Replies from: Jiro
comment by Jiro · 2015-02-02T16:34:43.674Z · LW(p) · GW(p)

Even if you never think about the interpretation, most aspects of wishes will have an implicit interpretation based on your values. You may never have thought about whether wishing for long life should turn you into a fungal colony, but if you had been asked "does your wish for long life mean you'd want to be turned into a fungal colony", you'd have said "no".

comment by Richard_Kennaway · 2015-02-03T13:04:03.274Z · LW(p) · GW(p)

Even when making requests of other people, they may fulfil them in ways you would prefer they hadn't. The more powerful the genie is at divining your true intent, the more powerfully it can find ways of fulfilling your wishes that may not be what you want. It is not obvious that there is a favorable limit to this process.

Your answers to questions about your intent may depend on the order the questions are asked. Or they may depend on what knowledge you have, and if you study different things you may come up with different answers. Given a sufficiently powerful genie, there is no real entity that is "how I interpret the wish".

How is the genie supposed to know your answers to all possible questions of interpretation? Large parts of "your interpretation" may not exist until you are asked about some hypothetical circumstance. Even if you are able to answer every such question, how is the genie to know the answer without asking you? Only by having a model of you sufficiently exact that you are confident it will give the same answers you would, even to questions you have not thought of and would have a hard time answering. But that is wishing for the genie to do all the work of being you.

A lot of transhumanist dreams seem to reduce to this: a Friendly AGI will do for us all the work of being us.

Replies from: Jiro
comment by Jiro · 2015-02-03T16:44:02.024Z · LW(p) · GW(p)

Your answers to questions about your intent may depend on the order the questions are asked. Or they may depend on what knowledge you have, and if you study different things you may come up with different answers.

If I ask the genie for long life, and the genie is forced to decide between a 200 year lifespan with a 20% chance of a painful death and a 201 year lifespan with a 21% chance of a painful death, it is possible that the genie might not get my preferences exactly correct, or that my preferences between those two results may depend on how I am asked or how I am feeling at the time.

But if the genie messed up and picked the one that didn't really match my preferences, I would only be slightly displeased. I observe that this goes together: in cases where it would be genuinely hard or impossible for the genie to figure out what I prefer, the fact that the genie might not get my preferences correct only bothers me a little. In cases where extrapolating my preferences is much easier, the genie getting them wrong would matter to me a lot more (I would really not like a genie that grants my wish for long life by turning me into a fungal colony). So just because the genie can't know the answer to every question about my extrapolated preferences doesn't mean that the genie can't know it to a sufficient degree that I would consider the genie good to ask for wishes.

comment by Epictetus · 2015-02-03T17:13:20.514Z · LW(p) · GW(p)

If the genie merely alters the present to conform to your wishes, you can easily run into unintended consequences.

The other problem is that divining someone's intent is tricky business. A person often has a dozen impulses at cross-purposes to one another and the interpretation of your wish will likely vary depending on how much sleep you got and what you had for lunch. There's a sci-fi short story, "Oddy and Id", that examines a curious case of a man with luck so amazing that the universe bends to satisfy him. I won't spoil it, but I think it brings up a relevant point.

comment by TheWakalix · 2018-11-22T21:58:15.441Z · LW(p) · GW(p)

If you can rigorously define Safety, you've already solved the Safety Problem. This isn't a shortcut.

comment by Nick_Tarleton · 2007-11-24T09:27:36.000Z · LW(p) · GW(p)

"I wish for a genie that shares all my judgment criteria" is probably the only safe way.

Replies from: Nebu, AndHisHorse, CynicalOptimist
comment by Nebu · 2012-03-30T22:42:27.265Z · LW(p) · GW(p)

This might be done by picking an arbitrary genie, and then modifying your judgement criteria to match that genie's.

Replies from: CuriousMeta
comment by CuriousMeta · 2019-12-29T15:13:45.158Z · LW(p) · GW(p)

Which is perhaps most efficiently achieved by killing the wisher and returning an arbitrary inanimate object.

comment by AndHisHorse · 2013-08-21T00:49:20.572Z · LW(p) · GW(p)

What if your judgement criteria are fluid - depending, perhaps, on your current hormonal state, your available knowledge, and your particular position in society?

comment by CynicalOptimist · 2016-11-08T22:53:45.196Z · LW(p) · GW(p)

I see where you're coming from on this one.

I'd only add this: if a genie is to be capable of granting this wish, it would need to know what your judgements were. It would need to understand them, at least as well as you do. This pretty much resolves to the same problem that Eliezer already discussed.

To create such a genie, you would either need to explain to the genie how you would feel about every possible circumstance, or you would need to program the genie so as to be able to correctly figure it out. Both of these tasks are probably a lot harder than they sound.

comment by Gray_Area · 2007-11-24T10:26:03.000Z · LW(p) · GW(p)

Sounds like we need to formalize human morality first, otherwise you aren't guaranteed consistency. Of course formalizing human morality seems like a hopeless project. Maybe we can ask an AI for help!

Replies from: wizzwizz4
comment by wizzwizz4 · 2020-02-27T22:14:24.082Z · LW(p) · GW(p)

Formalising human morality is easy!

1. Determine a formalised morality system close enough to the current observed human morality system that humans will be able to learn and accept it.

2. Eliminate all human culture (easier than eliminating only parts of it).

3. Raise humans with this morality system (which by the way includes systems for reducing value drift, so the process doesn't have to be repeated too often).

4. When value drift occurs, goto step 2.

comment by Gray_Area · 2007-11-24T12:38:05.000Z · LW(p) · GW(p)

On further reflection, the wish as expressed by Nick Tarleton above sounds dangerous, because all human morality may either be inconsistent in some sense, or 'naive' (failing to account for important aspects of reality we aren't aware of yet). Human morality changes as our technology and understanding changes, sometimes significantly. There is no reason to believe this trend will stop. I am afraid (genuine fear, not figure of speech) that the quest to properly formalize and generalize human morality for use by a 'friendly AI' is akin to properly formalizing and generalizing Ptolemaic astronomy.

comment by J_Thomas · 2007-11-24T15:15:13.000Z · LW(p) · GW(p)

This generalises. Since you don't know everything, anything you do might wind up being counterproductive.

Like, I once knew a group of young merchants who wanted their shopping district revitalised. They worked at it and got their share of federal money that was assigned to their city, and they got the lighting improved, and the landscaping, and a beautiful fountain, and so on. It took several years and most of the improvements came in the third year. Then their landlords all raised the rents and they had to move out.

That one was predictable in hindsight, but I didn't predict it. There could always be things like that.

When anything you do could backfire, are you better off to stay in bed? No, the advantages of that are obvious but it's also obvious you can't make a living that way.

You have to make your choices and take your chances. If I had an outcome pump and my mother was trapped in a burning building and I had no other way to get her out, I hope I'd use it. The result might be worse than letting her burn to death but at least there would be a chance for a good outcome. If I can just get it to remove some of the bad outcomes the result may be an improvement.

Replies from: HungryHobo, CynicalOptimist
comment by HungryHobo · 2015-08-28T12:38:49.341Z · LW(p) · GW(p)

I think the unlimited potential for bad outcomes may be a problem there.

After all, the house might not explode; instead, a military transport plane nearby might suffer a failure, and the nuclear weapon on board might suffer a very unlikely set of failures and trigger on impact, killing everyone for miles and throwing your mother's body far, far, far away. The pump isn't just dangerous to those involved and nearby.

Most consequences are limited in scope. You have a slim chance of killing many others through everyday accident but a pump would magnify that terribly.

Replies from: nyralech
comment by nyralech · 2015-08-28T23:49:23.570Z · LW(p) · GW(p)

Most consequences are limited in scope. You have a slim chance of killing many others through everyday accident but a pump would magnify that terribly.

That depends entirely on how the pump works. If it picks uniformly among bad outcomes, your point might be correct. However, it might still be biased towards narrow local effects for sheer sake of computability. If this is the case, I don't see why it would necessarily shift towards bigger bad outcomes rather than more limited ones.

Replies from: HungryHobo
comment by HungryHobo · 2015-09-14T10:46:09.730Z · LW(p) · GW(p)

In the example I gave the nuke exploding would be a narrow local effect which bleeds over into a large area. I agree that a pump which needed to monitor everything might very well choose only quite local direct effects but that could still have a lot of long range bad side effects.

Bursting the dam a few hundred meters upriver might have the effect of carrying your mother, possibly even alive, far from the center of the building, and it may also extinguish the fire (if you've thought to add that in as a desirable element of the outcome), yet lead to wiping out a whole town ten miles downstream. The point, more or less, is that the pump wouldn't care about those side effects.

Replies from: nyralech
comment by nyralech · 2015-09-14T11:04:36.526Z · LW(p) · GW(p)

But those outcomes which have a limited initial effect yet have a very large overall effect are very sparsely distributed among all possible outcomes with a limited initial effect.

I still do not see why the pump would magnify the chance of those outcomes terribly. The space of possible actions which have a very large negative utility grows by a huge amount, but so does the space of actions which have trivial consequences beside doing what you want.

comment by CynicalOptimist · 2016-11-08T23:12:51.964Z · LW(p) · GW(p)

I agree, just because something MIGHT backfire, it doesn't mean we automatically shouldn't try it. We should weigh up the potential benefits and the potential costs as best we can predict them, along with our best guesses about the likelihood of each.

In this example, of course, the lessons we learn about "genies" are supposed to be applied to artificial intelligences.

One of the central concepts that Eliezer tries to express about AI is that when we get an AI that's as smart as humans, we will very quickly get an AI that's very much smarter than humans. At that point, the AI can probably trick us into letting it loose, and it may be able to devise a plan to achieve almost anything.

In this scenario, the potential costs are almost unlimited. And the probability is hard to work out. Therefore figuring out the best way to program it is very very important.

Because that's a genie... {CSI sunglasses moment} ... that we can't put back in the bottle.

comment by dilys · 2007-11-24T15:23:41.000Z · LW(p) · GW(p)

Wonderfully provocative post (meaning no disregard toward the poor old woman caught in the net of a rhetorical and definitional impasse). Obviously in reference to the line of thought in the "devil's dilemma" enshrined in the original Bedazzled, and so many magic-wish-fulfillment folk tales, in which there is always a loophole exploited by a counter-force, probably IMO in response to the motive to shortcut certain aspects of reality and its regulatory processes, known or unknown. It would be interesting to collect real life anecdotes about people who have "gotten what they want," and end up begging for their old life back, like Dudley Moore's über-frustrated Stanley Moon trapped in a convent.

I hope this question, ultimately of the relationship of the Part and the Whole, continues to be expressed, especially as relevant to any transhuman enterprise.

comment by Eric_1 · 2007-11-24T16:52:21.000Z · LW(p) · GW(p)

It seems contradictory to previous experience that humans should develop a technology with "black box" functionality, i.e. whose effects could not be foreseen and accurately controlled by the end-user. Technology has to be designed and it is designed with an effect/result in mind. It is then optimized so that the end user understands how to call forth this effect. So positing an effective equivalent of the mythological figure "Genie" in technological form ignores the optimization-for-use that would take place at each stage of developing an Outcome-Pump. The technology-falling-from-heaven which is the Outcome Pump demands that we reverse engineer the optimization of parameters which would have necessarily taken place if it had in fact developed as human technologies do.

I suppose the human mind has a very complex "ceteris paribus" function which holds all these background parameters equal to their previous values, while not explicitly stating them, and the ironic-wish-fulfillment-Genie idea relates to the fulfillment of a wish while violating an unspoken ceteris paribus rule. Demolishing the building structure violates ceteris paribus more than the movements of a robot-retriever would in moving aside burning material to save the woman. Material displaced from the building should be as nearly equal to the woman's body weight as possible; inducing an explosion is a horrible violation of the objective, if the Pump could just be made to sense the proper (implied) parameters.

If the market forces of supply and demand continue to undergird technological progress (i.e. research and development and manufacturing), then the development of a sophisticated technology not-optimized-for-use is problematic: who pays for the second round of research implementation? Surely not the customer, when you give him an Outcome Pump whose every use could result in the death and destruction of his surrounding environs and family members. Granted this is an aside and maybe impertinent in the context of this discussion.

Replies from: CynicalOptimist, CronoDAS
comment by CynicalOptimist · 2016-11-08T23:30:09.765Z · LW(p) · GW(p)

"if the Pump could just be made to sense the proper (implied) parameters."

You're right, this would be an essential step. I'd say the main point of the post was to talk about the importance, and especially the difficulty, of achieving this.

Re optimisation for use: remember that this involves a certain amount of trial and error. In the case of dangerous technologies like explosives, firearms, or high speed vehicles, the process can often involve human beings dying, usually in the "error" part of trial and error.

If the technology in question was a super-intelligent AI, smart enough to fool us and engineer whatever outcome best matched its utility function? Then potentially we could find ourselves unable to fix the "error".

Please excuse the cheesy line, but sometimes you can't put the genie back in the bottle.

Re the workings of the human brain? I have to admit that I don't know the meaning of ceteris paribus, but I think that the brain mostly works by pattern recognition. In a "burning house" scenario, people would mostly contemplate the options that they thought were "normal" for the situation, or that they had previously imagined, heard about, or seen on TV.

Generating a lot of different options and then comparing them for expected utility isn't the sort of thing that humans do naturally. It's the sort of behaviour that we have to be trained for, if you want us to apply it.

comment by CronoDAS · 2023-03-10T23:20:04.552Z · LW(p) · GW(p)

It is now 15 years later. We have large neural nets trained on large amounts of data that do impressive things by "learning" extremely complicated algorithms that might as well be black boxes, and that sometimes have bizarre and unanticipated results that are nothing like the ones we would have wanted.

comment by RiverC · 2007-11-24T20:15:30.000Z · LW(p) · GW(p)

Eric, I think he was merely attempting to point out the futility of wishes. Or rather, the futility of asking for something you want from something that does not share your judgments. The Outcome Pump is merely, like the Genie, a mechanism by which to explain his intended meaning. The problem of the Outcome Pump is twofold: 1. Any theory that states that time is anything other than a constant now with motion and probability may work mathematically but has yet to be able to actually alter the thing which it describes in a measurable way, and 2. The production of something such as a time machine to begin with would be so destructive as to ultimately prevent the creation of the Outcome Pump.

In fact, as rational as we would like to be, if we are so rational that we miss the forest for the trees, or in this case, the moral for the myth, we sort of undo the reason we have rationality. It's like disassembling a clock to find the time.

Anyhow, the problem of wishes is the trick of prayer: To get something that God will grant, we cannot create a God that wants what we want; it is our inherent experience in life that if God really is all-powerful and above all that he must be singular, and since men's wishes oft conflict he can not by any stretch of the imagination mysteriously coincide with your own capricious desires. Thus you must make the 'wish no wish' which is to change your judgment to that of God's, and then in that case you can not possibly wish something that he will NOT grant.

The mystery of it is that it is still not the same as the 'safe genie'; but at the same time not altogether different. But the fact that some old Christian mystics have said the best prayer is the one in which you make no petitions at all (and in fact say nothing!) probably attests that it is indeed the 'safe genie'.

comment by Nick_Tarleton · 2007-11-24T20:59:00.000Z · LW(p) · GW(p)

On further reflection, the wish as expressed by Nick Tarleton above sounds dangerous, because all human morality may either be inconsistent in some sense, or 'naive' (failing to account for important aspects of reality we aren't aware of yet).

You're right. Hence, CEV.

comment by Doug_S. · 2007-11-24T21:39:45.000Z · LW(p) · GW(p)

Eliezer, you read Home on the Strange?

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2007-11-24T22:05:31.000Z · LW(p) · GW(p)

So positing an effective equivalent of the mythological figure "Genie" in technological form ignores the optimization-for-use that would take place at each stage of developing an Outcome-Pump. The technology-falling-from-heaven which is the Outcome Pump demands that we reverse engineer the optimization of parameters which would have necessarily taken place if it had in fact developed as human technologies do.

Unfortunately, Eric, when you build a powerful enough Outcome Pump, it can wish more powerful Outcome Pumps into existence, which can in turn wish even more powerful Outcome Pumps into existence. So once you cross a certain threshold, you get an explosion of optimization power, which mere trial and error is not sufficient to control because of the enormous change of context: in particular, the genie has gone from being less powerful than you to being more powerful than you, and what appeared to work in the former context won't work in the latter.

Which is precisely what happened to natural selection when it developed humans.

comment by Eric_1 · 2007-11-24T22:37:35.000Z · LW(p) · GW(p)

"Unfortunately, Eric, when you build a powerful enough Outcome Pump, it can wish more powerful Outcome Pumps into existence, which can in turn wish even more powerful Outcome Pumps into existence."

Yes, technology that develops itself, once a certain point of sophistication is reached.

My only acquaintance with AI up to now has been this website: http://www.20q.net, which contains a neural network that has been learning for two decades or so. It can "read your mind" when you're thinking of a character from the TV show The Simpsons. Pretty incredible actually!

comment by Eric_1 · 2007-11-24T22:54:07.000Z · LW(p) · GW(p)

Eliezer, I clicked on your name in the above comment box and voila: a whole set of resources to learn about AI. I also found out why you use the adjective "unfortunately" in reference to the Outcome Pump, as it's on the Singularity Institute website. Fascinating stuff!

comment by Gray_Area · 2007-11-25T00:34:06.000Z · LW(p) · GW(p)

"It seems contradictory to previous experience that humans should develop a technology with "black box" functionality, i.e. whose effects could not be foreseen and accurately controlled by the end-user."

Eric, have you ever been a computer programmer? That technology becomes more and more like a black box is not only in line with previous experience, but I dare say is a trend as technological complexity increases.

comment by Eric_1 · 2007-11-25T01:52:55.000Z · LW(p) · GW(p)

"Eric, have you ever been a computer programmer? That technology becomes more and more like a black box is not only in line with previous experience, but I dare say is a trend as technological complexity increases."

No I haven't. Could you expand on what you mean?

comment by James_D._Miller · 2007-11-25T05:31:16.000Z · LW(p) · GW(p)

In the first year of law school students learn that for every clear legal rule there always exist situations for which either the rule doesn't apply or for which the rule gives a bad outcome. This is why we always need to give judges some discretion when administering the law.

comment by TGGP4 · 2007-11-25T06:20:53.000Z · LW(p) · GW(p)

James Miller, have you read The Myth of the Rule of Law? What do you think of it?

comment by Gray_Area · 2007-11-25T11:52:12.000Z · LW(p) · GW(p)

Every computer programmer, indeed anybody who uses computers extensively, has been surprised by computers. Despite being deterministic, a personal computer taken as a whole (hardware, operating system, software running on top of the operating system, network protocols creating the internet, etc.) is too large for a single mind to understand. We have partial theories of how computers work, but of course partial theories sometimes fail and this produces surprise.

This is not a new development. I have only a partial theory of how my car works, but in the old days people only had a partial theory of how a horse works. Even a technology as simple and old as a knife still follows non-trivial physics and so can surprise us (can you predict when a given knife will shatter?). Ultimately, most objects, man-made or not are 'black boxes.'

Replies from: danlowlite
comment by danlowlite · 2011-02-15T15:04:21.488Z · LW(p) · GW(p)

Material sciences can give us an estimate on the shattering of a given material given certain criteria.

Just because you do not know specific things about it doesn't make it a black box. Of course, that doesn't make the problems with complex systems disappear, it just exposes our ignorance. Which is not a new point here.

comment by James_D._Miller · 2007-11-25T15:27:32.000Z · LW(p) · GW(p)

TGGP,

I have not read the Myth of the Rule of Law.

comment by JulianMorrison · 2007-11-25T16:16:48.000Z · LW(p) · GW(p)

Given that it's impossible for someone to know your total mind without being it, the only safe genie is yourself.

From the above it's easy to see why it's never possible to define the "best interests" of anyone but your own self. And from that it's possible to show that it's never possible to define the best interests of the public, except through their individually chosen actions. And from that you can derive libertarianism.

Just an aside :-)

Replies from: Roko
comment by Roko · 2010-02-09T20:30:35.714Z · LW(p) · GW(p)

Given that it's impossible for someone to know your total mind without being it, the only safe genie is yourself.

What about a genie that knows what you would do (and indeed what everyone else in the world would do), but doesn't have subjective experiences, so isn't actually anybody?

Replies from: JulianMorrison
comment by JulianMorrison · 2010-02-10T12:01:25.433Z · LW(p) · GW(p)

Not enough information. The genie is programmed to do what with that knowledge? If it's CEV done right, it's safe.

comment by Eric_1 · 2007-11-25T16:33:23.000Z · LW(p) · GW(p)

"Ultimately, most objects, man-made or not are 'black boxes.'"

OK, I see what you're getting at.

Four questions about black boxes:

1) Does the input have to be fully known/observable to constitute a black box? When investigating a population of neurons, we can give stimulus to these cells, but we cannot be sure that we are aware of all the inputs they are receiving. So we effectively do not entirely understand the input being given.

2) Does the output have to be fully known/observable to constitute a black box? When we measure the output of a population of neurons, we also cannot be sure of the totality of information being sent out, due to experimental limitations.

3) If one does not understand a system one uses, does that fact alone make that system a black box? In that case there are absolute black boxes, like the human mind, about which complete information is not known, and relative black boxes, like the car or TCP/IP, about which complete information is not known to the current user.

4) What degree of understanding is sufficient for something not to be called a black box?

How we answer these questions will determine whether "black box" comes to mean:

1) Anything that is identifiable as a 'part', whose input and output are known but whose intermediate working/processing is not understood.

2) Anything that is identifiable as a 'part' whose input, output and/or processing is not understood.

3) Any 'part' that is not completely understood (i.e. presuming access to all information).

4) Anything that is not understood by the user at the time.

5) Anything that is not FULLY understood by the user at the time.

We will quickly be in the realm where anything and everything on earth is considered to be a black box, if we take the latter definitions. So how can this word/metaphor be most profitably wielded?

Replies from: CynicalOptimist
comment by CynicalOptimist · 2016-11-08T23:46:28.419Z · LW(p) · GW(p)

I like this style of reasoning.

Rather than taking some arbitrary definition of black boxes and then arguing about whether they apply, you've recognised that a phrase can be understood in many ways, and we should use the word in whatever way most helps us in this discussion. That's exactly the sort of rationality technique we should be learning.

A different way of thinking about it though, is that we can remove the confusing term altogether. Rather than defining the term "black box", we can try to remember why it was originally used, and look for another way to express the intended concept.

In this case, I'd say the point was: "Sometimes, we will use a tool expecting to get one result, and instead we will get a completely different, unexpected result. Often we can explain these results later. They may even have been predictable in advance, and yet they weren't predicted."

Computer programming is especially prone to this. The computer will faithfully execute the instructions that you gave it, but those instructions might not have the net result that you wanted.

comment by Recovering_irrationalist · 2007-11-25T16:57:59.000Z · LW(p) · GW(p)

TGGP: What did you think of it? I agree till the Socrates Universe, but thought the logic goes downhill from there.

comment by mtraven · 2007-11-25T17:10:02.000Z · LW(p) · GW(p)

tggp, that paper was interesting, although I found its thesis unremarkable. You should share it with our pal Mencius.

comment by Kevin2 · 2007-11-25T17:48:29.000Z · LW(p) · GW(p)

Upon some reflection, I remembered that Robin has shown that two Bayesians who share the same priors can't disagree. So perhaps you can get your wish from an unsafe genie by wishing, "... to run a genie that perfectly shares my goals and prior probabilities."

comment by Recovering_irrationalist · 2007-11-25T18:03:49.000Z · LW(p) · GW(p)

As long as you're wishing, wouldn't you rather have a genie whose prior probabilities correspond to reality as accurately as possible? I wouldn't pick an omnipotent but equally ignorant me to be my best possible genie.

comment by Gray_Area · 2007-11-26T00:49:18.000Z · LW(p) · GW(p)

"As long as you're wishing, wouldn't you rather have a genie whose prior probabilities correspond to reality as accurately as possible?"

Such a genie might already exist.

comment by Caledonian2 · 2007-11-26T01:30:54.000Z · LW(p) · GW(p)

In the first year of law school students learn that for every clear legal rule there always exist situations for which either the rule doesn't apply or for which the rule gives a bad outcome.

If the rule doesn't apply, it's not relevant in the first place. I doubt very much you can establish what a 'bad' outcome would involve in such a way that everyone would agree - and I don't see why your personal opinion on the matter should be of concern when we consider legal design.

comment by Recovering_irrationalist · 2007-11-26T02:04:33.000Z · LW(p) · GW(p)

Such a genie might already exist.

You mean GOD? From the good book? It's more plausible than some stories I could mention.

GOD, I meta-wish for an ((...Emergence-y Re-get) Emergence-y Re-get) Emergency Regret Button.

comment by Peter_de_Blanc · 2007-11-26T04:53:04.000Z · LW(p) · GW(p)

Recovering Irrationalist said:

I wouldn't pick an omnipotent but equally ignorant me to be my best possible genie.

Right. It's silly to wish for a genie with the same beliefs as yourself, because the system consisting of you and an unsafe genie is already such a genie.

comment by TGGP4 · 2007-11-26T06:11:09.000Z · LW(p) · GW(p)

I discussed "The Myth of the Rule of Law" with Mencius Moldbug here. I recognize that politics alters the application of law and that as long as it is written in natural language there will be irresolvable differences over its meaning. At the same time I observe that different countries seem to hold different levels of respect for the "rule of law" that the state is expected to obey, and it appears to me that those more prone to do so have more livable societies. I think the norm of neutrality on the part of judges applying law with objective meaning is good to be promoted. When there is bad law it is properly the job of the legislature to fix it. This makes it easier for people to know what the law is in advance so they can avoid being smacked with it.

comment by AnnaSalamon · 2007-11-26T08:00:08.000Z · LW(p) · GW(p)

"You cannot predict, in advance, which of your values will be needed to judge the path through time that the genie takes.... The only safe genie is a genie that shares all your judgment criteria."

Is a genie that does share all my judgment criteria necessarily safe?

Maybe my question is ill-formed; I am not sure what "safe" could mean besides "a predictable maximizer of my judgment criteria". But I am concerned that human judgment under ordinary circumstances increases some sort of Beauty/Value/Coolness which would not be increased if that same human judgment was used to search over a less restricted set of possibilities.

The world is full of cases where selecting for A automatically increases B when you are searching over a restricted set of possibilities but does not increase B when those restrictions are lifted. Overfitting is a classic example. In cases of overfitting, if we search only over a restricted set of few-parameter models, models that do well on the training set will automatically do well on the generalization set, but if we allow more parameters the correlation disappears.
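
A toy sketch of the overfitting point, with everything in it assumed for illustration: the data are made up, the "restricted" class is four straight lines through the origin, and the "unrestricted" class adds a lookup-table memorizer as the extreme case of one parameter per training point. Selection is always by training error (A); what we care about is test error (B).

```python
# Hypothetical toy data: the true relationship is y = 2x.
# Training labels carry +/-1 noise; test labels are noise-free.
train = [(x, 2 * x + (1 if x % 2 == 0 else -1)) for x in range(10)]
test = [(x + 0.5, 2 * (x + 0.5)) for x in range(10)]

def mse(model, data):
    """Mean squared error of a model (a function x -> prediction) on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

def line(slope):
    """A one-parameter model: y = slope * x."""
    return lambda x: slope * x

# Restricted search space: a handful of few-parameter models.
restricted = {f"line(slope={a})": line(a) for a in (0, 1, 2, 3)}

# Lifting the restriction: also allow a model that memorizes every training label
# and returns 0.0 for any input it has not seen.
table = dict(train)
def memorizer(x):
    return table.get(x, 0.0)

unrestricted = dict(restricted, memorizer=memorizer)

def select(models):
    """Pick the model with the lowest training error (selecting for A)."""
    return min(models, key=lambda name: mse(models[name], train))

for label, models in (("restricted", restricted), ("unrestricted", unrestricted)):
    best = select(models)
    print(label, best, mse(models[best], train), mse(models[best], test))
```

Within the restricted set, the model that wins on the training data is also the one that does well on the test data. Once the memorizer is allowed, it wins on training error and does far worse on the test data, so the same selection criterion no longer tracks the property it used to track.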

Modern marketing / product development can search over a larger set of alternatives than we used to have access to. In many cases human judgments correlate with less when used on modern manufactured goods than when used on the smaller set of goods that was formerly available. Judgments of tastiness used to correlate with health but now do not. Judgments of "this is a limited resource which I should grab quickly" used to indicate resources which we really should grab quickly but now do not (because of manufactured "limited time offer only" signs and the like).

Genies or AGIs would search over an even larger space of possibilities than contemporary marketing searches over. In this larger space, many of the traditional correlates of human judgment will disappear. That is: in today's restricted search spaces, outcomes which are ranked highly according to human judgment criteria tend also to have various other properties P1, P2, ... Pk. In an AGI's search space, outcomes which are ranked highly according to human judgment criteria will not have properties P1... Pk.

I am worried that properties P1...Pk are somehow valuable. That is, I am worried that in this world human judgments pick out outcomes that are somehow valuable and that human judgments' ability to do this resides, not in our judgment criteria alone (which would be uploaded into our imagined genie) but in the conjunction of our judgment criteria with the restricted set of possibilities that has so far been available to us.

comment by starwed · 2007-11-26T11:02:36.000Z · LW(p) · GW(p)

"Whatever proposition you can manage to input into the Outcome Pump, somehow happens, though not in a way that violates the laws of physics. If you try to input a proposition that's too unlikely, the time machine will suffer a spontaneous mechanical failure before that outcome ever occurs."

So, a kind of Maxwell's demon? :)

comment by Stanislav_Datskovskiy · 2007-11-26T12:11:27.000Z · LW(p) · GW(p)

Rather than designing a genie to exactly match your moral criteria, the simple solution would be to cheat and use yourself as the genie. What the Outcome Pump should solve for is your own future satisfaction. To that end, you would omit all functionality other than the "regret button", and make the latter default-on, with activation by anything other than a satisfied-you vanishingly improbable. Say, with a lengthy password.

Of course, you could still end up in a universe where your brain has been spontaneously re-wired to hate your mother. However, I think that such an event is far less likely than a proper rescue.
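
A rough sketch of the bet being made here, assuming the Pump can be modelled as conditioning on "the password was typed" (every other future gets reset). The three futures and their prior probabilities are invented purely for illustration; the only point is that the surviving futures keep their relative odds.

```python
from fractions import Fraction

# Hypothetical futures with made-up prior probabilities (illustrative only).
# Each entry: (description, prior probability, does the password get typed?)
futures = [
    ("mother rescued; a satisfied you types the password", Fraction(1000, 10**6), True),
    ("brain rewired to hate your mother; you type the password", Fraction(1, 10**6), True),
    ("anything else; password never typed, Pump resets", Fraction(998999, 10**6), False),
]

def surviving_futures(futures):
    """Condition on the password being typed, i.e. keep only futures that are not reset."""
    kept = [(name, p) for name, p, typed in futures if typed]
    total = sum(p for _, p in kept)
    return {name: p / total for name, p in kept}

posterior = surviving_futures(futures)
for name, p in posterior.items():
    print(f"{float(p):.6f}  {name}")
```

Under these assumed numbers the rescue dominates, but the rewired-brain future is not eliminated; whether the real ratio favours the rescue is precisely the claim in question.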

comment by David_C · 2007-11-26T12:23:28.000Z · LW(p) · GW(p)

You have a good point about the exhaustiveness required to ensure the best possible outcome. In that case the ability of the genie to act "safely" would depend upon the level of the genie's omniscience. For example, if the genie could predict the results of any action it took, you could simply ask it to select any path that results in you saying "thanks genie, great job" without coercion. Therefore it would effectively be using you as an oracle of success or failure.

A non-omniscient genie would either need complete instructions, or would only work well where there was an ideal solution. For example, if you wished for your mother to be rescued by a fireman without anyone dying or experiencing damage to more than 2% of their skin, bones or internal organs. The difficulty is when not all your criteria can be satisfied. Things suddenly become very murky.

comment by Stuart_Armstrong · 2007-11-26T16:52:46.000Z · LW(p) · GW(p)

With a safe genie, wishing is superfluous. Just run the genie.

But while most genies are terminally unsafe, there is a domain of "nearly-safe" genies, which must dwarf the space of "safe" genies (examples of a nearly-safe genie: one that picks the moral code of a random living human before deciding on an action or a safe genie + noise). This might sound like semantics, but I think the search for a totally "safe" genie/AI is a pipe-dream, and we should go for "nearly safe" (I've got a short paper on one approach to this here).

comment by Nick_Tarleton · 2007-11-26T19:08:42.000Z · LW(p) · GW(p)

I am worried that properties P1...Pk are somehow valuable.

In what sense can they be valuable, if they are not valued by human judgment criteria (even if not consciously most of the time)?

For example, if the genie could predict the results of any action it took, you could simply ask it to select any path that results in you saying "thanks genie, great job" without coercion.

Formalizing "coercion" is itself an exhaustive problem. Saying "don't manipulate my brain except through my senses" is a big first step, but it doesn't exclude, e.g., powerful arguments that you don't really want your mother to live.

comment by Benquo · 2007-11-26T23:00:37.000Z · LW(p) · GW(p)

Nick,

Are you thinking of magically strong arguments, or ones that convince because they provide good reasons?

I'd think the latter would be valuable even if it leads to a result you'd initially suppose to be bad.

comment by Nick_Tarleton · 2007-11-27T00:01:15.000Z · LW(p) · GW(p)

The first.

comment by AnnaSalamon · 2007-11-27T01:02:47.000Z · LW(p) · GW(p)

"In what sense can [properties P1...Pk] be valuable, if they are not valued by human judgment criteria (even if not consciously most of the time)?"

I don't know. It might be that the only sense in which something can be valuable is to look valuable according to human judgment criteria (when thoroughly implemented, and well informed, and all that). If so, my concern is ill-formed or irrelevant.

On the other hand, it seems possible that human judgments of value are an imperfect approximation of what is valuable in some other (external?) sense. Imagine for example if we met multiple alien races and all of them said "I see what you're getting at with this 'value/goodness/beauty/truth' thing, but you are misunderstanding it a bit; in a few thousand years, you will modify your root judgment criteria in such-and-such a way." In that case I would wonder whether my current judgment criteria were not best understood as an approximation of this other set of criteria and whether it was not value according to this other set of criteria that I should be aiming for.

If human judgment criteria are an approximation of some other kind of value, they would probably cease to approximate that other kind of value when used to search over the large space of genie-accessible possibilities.

By way of analogy, scientists' criteria for judging scientific truth/relevance/etc. seem to be changing usefully over time, and it may be that scientists' criteria at different times can be viewed as successive approximations of some other (external?) truth-criteria. Galilean physicists had one way of determining what to believe, Newtonians another, and contemporary physicists yet another. In the restricted set of situations considered by Galilean physicists, Galilean methods yield approximately the same predictions as the methods of contemporary physicists. In the larger space of genie-accessible situations, they do not.

comment by Benquo · 2007-11-29T05:41:36.000Z · LW(p) · GW(p)

Nick,

What makes you think that magically strong arguments are possible? I can imagine arguments that work better than they should because they indulge someone's unconscious inclinations or biases, but not ones that work better than their truthfulness would suggest and cut against the grain of one's inclinations.

comment by Nick_Tarleton · 2007-11-29T13:09:24.000Z · LW(p) · GW(p)

I don't know that they are, but it's the conservative assumption, in that it carries less risk of the world being destroyed if you're wrong. Also, see the AI-box experiments.

comment by maki_hodnett · 2007-12-09T19:38:19.000Z · LW(p) · GW(p)

I think the best way is to believe you and the genie are one, and therefore it is necessary to be grateful for everything you currently have. This creates a loop. Then you can be grateful for things you "will" have right now. For instance, you can begin by affirming and feeling within yourself the gratitude for your financial wealth. Financial wealth... starts to appear!

comment by kyb · 2008-06-13T15:29:23.000Z · LW(p) · GW(p)

Excellent post.

comment by cousin_it · 2009-07-26T07:05:05.784Z · LW(p) · GW(p)

Damn, it took me a long time to make the connection between the Outcome Pump and quantum suicide reality editing. And the argument that proves the unsafety of the Outcome Pump is perfectly isomorphic to the argument why quantum immortality is scary.

comment by MoreOn · 2011-02-21T19:36:56.432Z · LW(p) · GW(p)

"I wish that the genie could understand a programming language."

Then I could program it unambiguously. I obviously wouldn't be able to program my mother out of the burning building on the spot, but at least there would be a host of other wishes I could make that the genie won't be able to screw up.

comment by DevilMaster · 2011-03-25T13:54:29.409Z · LW(p) · GW(p)

"I wish that wishes would be granted as the wisher would interpret them".

Replies from: FAWS, pengvado
comment by FAWS · 2011-03-25T14:05:22.745Z · LW(p) · GW(p)

Doesn't protect against unforeseen consequences and is possibly underspecified (How should the wish work when it needs to affect things the wisher doesn't understand? Create a version of the wisher that does understand? What if there are multiple possible versions that don't agree on interpretations among each other?).

comment by pengvado · 2011-03-25T14:26:35.608Z · LW(p) · GW(p)

Doesn't protect against a reflectively-consistent misinterpretation of "as the wisher would interpret them".

comment by RobertLumley · 2011-09-13T23:09:08.349Z · LW(p) · GW(p)

You wouldn't want to swap a human life for hers, but what about the life of a convicted murderer?

Are convicted murderers not human?

comment by ajuc · 2012-03-01T20:36:28.538Z · LW(p) · GW(p)

Suppose I specified to the Outcome Pump that I want the outcome where the person who is the future version of me (by DNA, and by physical continuity of the body) writes "ABRACADABRA, this outcome is good enough and I value it at $X" on a piece of paper and puts it on the Outcome Pump, where $X is how much I value the outcome. (And if this doesn't happen within one year, I don't want that outcome either.)

Are there any loopholes?

Replies from: Qiaochu_Yuan
comment by Qiaochu_Yuan · 2013-01-04T12:14:17.844Z · LW(p) · GW(p)

Genie takes over your body.

comment by Jiro · 2013-08-20T22:37:18.793Z · LW(p) · GW(p)

If the genie is clueless but not actively malicious, then you can ask the genie to describe how it will fulfill your wish. If it describes making the building explode and having your mother's dead body fly out, you correct the genie and tell it to try again. If it gives an inadequate description (says the building explodes and fails to mention what happens to the mother's body at all), you can ask it to elaborate. If it gives a description that is inadequate in exactly the right way to make you think it's describing it adequately while still leaving a huge loophole, there's not much you can do, but that's not a clueless genie, that's an actively malicious genie pretending to be a clueless one.

Replies from: shminux
comment by Shmi (shminux) · 2013-08-20T22:56:39.597Z · LW(p) · GW(p)

So your recommendation is to use a human as a part of the genie's outcome utility evaluator, relying on human intelligence when deciding between multiple low-probability (i.e. miraculous) events? Even though people have virtually no intuition when dealing with them? I suspect the results would be pretty grave, but on a larger scale, since the negative consequences would be non-obvious and possibly delayed.

Replies from: Jiro
comment by Jiro · 2013-08-21T18:04:48.680Z · LW(p) · GW(p)

A genie asked to rescue my mother from a burning building would do it by performing acts that, while miraculous, will be part of a chain of events that is comprehensible by humans. If the genie throws my mother out of the building at 100 miles per hour, for instance, it is miraculous that anyone can throw her out at that speed, but I certainly understand what it means to do that and am able to object. Even if the genie begins by manipulating some quantum energies in a way I can't understand, that's part of a chain of events that leads to throwing, a concept that I do understand.

Yes, it is always possible that there are delayed negative consequences. Suppose it rescues my mother by opening a door and I have no idea that 10 years from now the mayor is going to be saved from an assassin by the door of a burned out wreck being in the closed position and blocking a bullet. But that kind of negative consequence is not unique to genies, and humans go around all their lives doing things with such consequences. Maybe the next time I donate to charity I have to move my arm in such a way that a cell falls in the path of an oncoming cosmic ray, thus giving me cancer 10 years later. As long as the genie isn't actively malicious and just pretending to be clueless, the risk of such things is acceptable for the same reason it's acceptable for non-genie human activities. Furthermore, if the genie is clueless, it won't hide the fact that its plan would kill my mother--indeed, it doesn't even know that it would need to hide that, since it doesn't know that that would overall displease me. So I should be able to figure out that that's its plan by talking to it.

Replies from: shminux, MugaSofer
comment by Shmi (shminux) · 2013-08-21T18:13:20.056Z · LW(p) · GW(p)

Right, when humans do the usual human things, they put up with the butterfly effect and rely on their intuition and experience to reduce the odds of screwing things up badly in the short term. However, when evaluating the consequences of miracles we have nothing to guide us, so relying on a human evaluator in the loop is no better than relying on a three-year-old to stay away from a ledge or candy box. Neither has a clue.

comment by MugaSofer · 2013-08-21T19:05:41.219Z · LW(p) · GW(p)

A genie asked to rescue my mother from a burning building would do it by performing acts that, while miraculous, will be part of a chain of events that is comprehensible by humans. If the genie throws my mother out of the building at 100 miles per hour, for instance, it is miraculous that anyone can throw her out at that speed, but I certainly understand what it means to do that and am able to object. Even if the genie begins by manipulating some quantum energies in a way I can't understand, that's part of a chain of events that leads to throwing, a concept that I do understand.

This is, of course, not true of superintelligence ... is that your point?

As long as the genie isn't actively malicious and just pretending to be clueless, the risk of such things is acceptable for the same reason it's acceptable for non-genie human activities.

Not really. The genie will look in parts of solution-space you wouldn't (eg setting off the gas main, killing everyone nearby.)

Furthermore, if the genie is clueless, it won't hide the fact that its plan would kill my mother--indeed, it doesn't even know that it would need to hide that, since it doesn't know that that would overall displease me. So I should be able to figure out that that's its plan by talking to it.

Well, if it can talk. And it doesn't realise that you would sabotage the plan if you knew.

Replies from: Jiro
comment by Jiro · 2013-08-21T20:01:10.222Z · LW(p) · GW(p)

This is, of course, not true of superintelligence ... is that your point?

Why would this not be true of superintelligence, assuming the intelligence isn't actively malicious?

The genie will look in parts of solution-space you wouldn't (eg setting off the gas main, killing everyone nearby.)

"Talk to the genie" doesn't require that I be able to understand the solution space, just the result. If the genie is going to frazmatazz the whatzit, killing everyone in the building, I would still be able to discover that by talking to the genie. (Of course, I can't reduce the chance of disaster to zero this way, but I can reduce it to an acceptable level matching other human activities that don't have genies in them.)

Well, if it can talk. And it doesn't realise that you would sabotage the plan if you knew.

If it realizes I would sabotage the plan, then it knows that the plan would not satisfy me. If it pushes for the plan knowing that it won't satisfy me, then it's an actively malicious genie, not a clueless one.

Replies from: MugaSofer
comment by MugaSofer · 2013-08-24T12:26:33.208Z · LW(p) · GW(p)

A genie asked to rescue my mother from a burning building would do it by performing acts that, while miraculous, will be part of a chain of events that is comprehensible by humans. If the genie throws my mother out of the building at 100 miles per hour, for instance, it is miraculous that anyone can throw her out at that speed, but I certainly understand what it means to do that and am able to object.

Superintelligence can use strategies you can't understand.

The genie will look in parts of solution-space you wouldn't (eg setting off the gas main, killing everyone nearby.)

"Talk to the genie" doesn't require that I be able to understand the solution space, just the result. If the genie is going to frazmatazz the whatzit, killing everyone in the building, I would still be able to discover that by talking to the genie. (Of course, I can't reduce the chance of disaster to zero this way, but I can reduce it to an acceptable level matching other human activities that don't have genies in them.)

That was in response to the claim that genies' actions are no more likely to have unforeseen side-effects than human ones.

If it realizes I would sabotage the plan, then it knows that the plan would not satisfy me. If it pushes for the plan knowing that it won't satisfy me, then it's an actively malicious genie, not a clueless one.

... no, that's kind of the definition of a clueless genie. A malicious one would be actively seeking out solutions that annoy you.

(Also, some Good solutions might require fooling you for your own good, if only because there's no time to explain.)

Replies from: Jiro
comment by Jiro · 2013-08-24T17:05:29.230Z · LW(p) · GW(p)

Superintelligence can use strategies you can't understand.

There's a contradiction between "the superintelligence will do something you don't want" and "the superintelligence will do something you don't understand". Not wanting it implies I understand enough about it to not want it (even if I don't understand every single step).

that's kind of the definition of a clueless genie

I would consider a clueless genie to be a genie that tries to grant my wishes, but because it doesn't understand me, grants my wishes in a way that I wouldn't want. A malicious genie is a genie that grants my wishes in a way that it knows I wouldn't want. Reserving that term for genies that intentionally annoy while excluding genies that merely knowingly annoy is hairsplitting and only changes the terminology anyway.

Also, some Good solutions might require fooling you for your own good, if only because there's no time to explain.

If I would in fact want genies to fool me for my own good in such situations, this isn't a problem.

On the other hand, if I think that genies should not try to fool me for my own good in such situations, and the genie knows this, and it fools me for my own good anyway, it's a malicious genie by my standards. The genie has not failed to understand me; it understands what I want perfectly well, but knowingly does something contrary to its understanding of my desires. In the original example, the genie would be asked to save my mother from a building, it knows that I don't want it to explode the building to get her out, and it explodes the building anyway.

Replies from: MugaSofer
comment by MugaSofer · 2013-08-26T15:17:24.888Z · LW(p) · GW(p)

There's a contradiction between "the superintelligence will do something you don't want" and "the superintelligence will do something you don't understand". Not wanting it implies I understand enough about it to not want it (even if I don't understand every single step).

Well, firstly, there might be things you wouldn't want if you could only understand them. But actually, I was thinking of actions that would affect society in subtle, sweeping ways. Sure, if the results were explained to you, you might not like them, but you built the genie to grant wishes, not explain them. And how sure are you that's even possible, for all possible wish-granting methods?

I would consider a clueless genie to be a genie that tries to grant my wishes, but because it doesn't understand me, grants my wishes in a way that I wouldn't want. A malicious genie is a genie that grants my wishes in a way that it knows I wouldn't want. Reserving that term for genies that intentionally annoy while excluding genies that merely knowingly annoy is hairsplitting and only changes the terminology anyway.

Well, that's what the term usually means. And, honestly, I think there's good reason for that; it takes a pretty precise definition of "non-malicious genie", AKA FAI, not to do Bad Things, which is kind of the point of this essay.

Replies from: Jiro
comment by Jiro · 2013-08-26T15:36:26.192Z · LW(p) · GW(p)

Sure, if the results were explained to you, you might not like them, but you built the genie to grant wishes, not explain them.

That's why I suggested you can talk to the genie. Provided the genie is not malicious, it shouldn't conceal any such consequences; you just need to quiz it well.

It's sort of like the Turing test, but used to determine wish acceptability instead of intelligence. If a human can talk to it and say it is a person, treat it like a person. If a human can talk to it and decide the wish is good, treat the wish as good. And just like the Turing test, it relies on the fact that humans are better at asking questions during the process than writing long lists of prearranged questions that try to cover all situations in advance.

Well, that's what the term usually means.

Really? A clueless genie is a genie that is asked to do something, knows that the way it does it is displeasing to you, and does it anyway? I wouldn't call that a clueless genie.

What terms would you use for

-- a genie that would never knowingly displease you in granting wishes, but may do so out of ignorance

-- a genie that will knowingly displease you in granting wishes

-- a genie that will deliberately displease you in granting wishes?

Replies from: MugaSofer
comment by MugaSofer · 2013-08-26T16:35:53.351Z · LW(p) · GW(p)

More full response coming soon to a comment box near you. For now, terms! Everyone loves terms.

Really?

Here's how I learned it:

A "genie" will grant your wishes, without regard to what you actually want.

A malicious genie will grant your wishes, but deliberately seek out ways to do so that will do things you don't actually want.

A helpful - or Friendly - genie will work out what you actually wanted in the first place, and just give you that, without any of this tiresome "wishing" business. Sometimes called a "useful" genie - there's really no one agreed-on term. Essentially, what you're trying to replicate with carefully-worded wishes to other genies.

Replies from: Jiro
comment by Jiro · 2013-08-26T20:19:50.368Z · LW(p) · GW(p)

I want to know what terms you would use that would distinguish between a genie that grants wishes in ways I don't want because it doesn't know any better, and a genie that grants wishes in ways I don't want despite knowing better.

By your definitions above, these are both just "genie" and you don't really have terms to distinguish between them at all.

Replies from: MugaSofer
comment by MugaSofer · 2013-08-26T21:39:27.923Z · LW(p) · GW(p)

Well, since the whole genie thing is a metaphor for superintelligence, "this genie is trying to be Friendly but it's too dumb to model you well" doesn't really come up. If it did, I guess you would need to invent a new term (Friendly Narrow AI?) to distinguish it, yeah.

Replies from: Jiro
comment by Jiro · 2013-08-26T22:15:41.056Z · LW(p) · GW(p)

It's my impression that the typical scenario of a superintelligence that kills everyone to make paperclips, because you told it to make paperclips, falls into the first category. It's trying to follow your request; it just doesn't know that your request really means "I want to make paperclips, subject to some implicit constraints such as ethics, being able to stop when told to stop, etc." If it does know what your request really means, yet it still maximizes paperclips by killing people, it's disobeying your intention if not your literal words.

(And then there's always the possibility of telling it "make paperclips, in the way that I mean when I ask that". If you say that, and the AI still kills people, it's unfriendly by both our standards--since your request explicitly told it to follow your intention, disobeying your intention also disobeys your literal words.)

Replies from: MugaSofer
comment by MugaSofer · 2013-08-28T18:19:42.851Z · LW(p) · GW(p)

It's trying to follow your request; it just doesn't know that your request really means "I want to make paperclips, subject to some implicit constraints such as ethics, being able to stop when told to stop, etc." If it does know what your request really means, yet it still maximizes paperclips by killing people, it's disobeying your intention if not your literal words.

Well, sure it is. That's the point of genies (and the analogous point about programming AIs): they do what you tell them, not what you wanted.

Replies from: private_messaging
comment by private_messaging · 2013-08-28T19:54:33.412Z · LW(p) · GW(p)

What you tell is a pattern of pressure changes in the air, it's only the megaphones and tape recorders that literally "do what you tell them".

The genie that would do what you want would have to use the pressure changes as a clue for deducing your intent. When writing a story about a genie that does "what you tell them, not what you wanted" you have to use the pressure changes as a clue for deducing some range of misunderstandings of those orders, and then pick some understanding that you think makes the best story. It may be that we have an innate mechanism for finding the range of possible misunderstandings, to be able to combine following orders with self interest.

Replies from: ArisKatsaris
comment by ArisKatsaris · 2013-08-28T20:16:01.877Z · LW(p) · GW(p)

"What you tell them" in the context of programs is meant in the sense of "What you program them to", not in the sense of "The dictionary definition of the word-noises you make when talking into their speakers".

Replies from: private_messaging
comment by private_messaging · 2013-08-28T21:04:32.356Z · LW(p) · GW(p)

They were talking of genies, though, and the sort of failure that tends to arise from how a short sentence describes a multitude of diverse intents (i.e. ambiguity). Programming is about specifying what you want in an extremely verbose manner, the verbosity being a necessary consequence of non-ambiguity.

Replies from: Jiro
comment by Jiro · 2013-08-28T21:36:46.207Z · LW(p) · GW(p)

The genie is a metaphor for programming the AI.

The problem is that the people describing the nightmare AI scenario are being vague about exactly why the AI is killing people when told to make paperclips. If the AI doesn't know that you really mean "make paperclips without killing anyone", that's not a realistic scenario for AIs at all--the AI is superintelligent; it has to know. If the AI knows what you really mean, then you can fix this by programming the AI to "make paperclips in the way that I mean".

The whole genie argument fails because the metaphor fails. It makes sense that a genie who is asked to save your mother might do so by blowing up the building, because the genie is clueless. You can't tell the genie "you know what I really mean when I ask you to save my mother, so do that". You can tell this to an AI. Furthermore, you can always quiz either the genie or the AI on how it is going to fulfill your wish and only make the wish once you are satisfied with what it's going to do.

Replies from: ArisKatsaris, RobbBB
comment by ArisKatsaris · 2013-08-28T22:15:59.453Z · LW(p) · GW(p)

If the AI knows what you really mean, then you can fix this by programming the AI to "make paperclips in the way that I mean".

How does that follow? Even if the AI (at some point in its existence) knows what you really "mean", that doesn't mean that at that point you know how to make it do what you mean.

Replies from: Jiro
comment by Jiro · 2013-08-28T22:28:18.813Z · LW(p) · GW(p)

It's not hard. "Do what I mean, to the best of your knowledge."

Replies from: gattsuru, ArisKatsaris, Eliezer_Yudkowsky, private_messaging
comment by gattsuru · 2013-08-29T02:49:34.348Z · LW(p) · GW(p)

Even what you really mean may not be what you should be wishing for, if you don't have complete information, but that's honestly the least of the relevant problems. We've got a hell of a time just getting computers to understand human speech: it's taken decades to achieve the idiot-listeners on telephone lines. By the point where you can point an AGI at yourself and tell it to do what I mean, you've either programmed it with a non-trivial set of human morality or taught it to program itself with a non-trivial portion of human morality.

You might as well skip the wasted breath and opaqueness. That's a genie that's safe enough to simply ask to do as you should wish, aka Friendly-AI-complete.

((On top of /that/, the more complex the utility function, the more likely you are to get killed by value drift down the road, when some special-case patch or rule doesn't correctly transfer from your starting FAI to its next generation, and eventually you end up with a very unfriendly AI, or when the scales get large enough that your initial premises no longer survive.))

Replies from: Jiro
comment by Jiro · 2013-08-29T17:31:21.938Z · LW(p) · GW(p)

Remember the distinction between an AI that doesn't understand what you mean, and an AI that does understand what you mean but doesn't always follow that. These are two different things. In order to be safe, an AI must be in neither category, but different arguments apply to each category.

When I point out that a genie might fail to understand you but a superintelligent AI should understand you because it is superintelligent (which I took from MugaSofer), I am addressing the first category.

When I suggest explicitly asking the AI "do what I mean", I am addressing the second category. Since I am addressing a category in which the AI does understand my intentions, the objection "you can't make an AI understand your intentions without programming it with morality" is not a valid response.

Replies from: ArisKatsaris
comment by ArisKatsaris · 2013-08-29T17:40:20.248Z · LW(p) · GW(p)

Your response was to my objection: "that doesn't mean that at that point you know how to make it do what you mean."

The superintelligent AI doesn't have an issue with understanding your intentions, it simply doesn't have any reason to care about your intentions.

In order to program it to care about your intentions, you, the programmer, need to know how to codify the concept of "your intentions" (Perhaps not the specific intention, but the concept of what it means to have an intention). How do you do that?

Replies from: Kawoomba
comment by Kawoomba · 2013-08-29T19:04:31.208Z · LW(p) · GW(p)

Perhaps not the specific intention, but the concept of what it means to have an intention

Funny, I would've phrased that the other way around.

comment by ArisKatsaris · 2013-08-29T06:46:58.825Z · LW(p) · GW(p)

That's not programming, that's again just word-noises.

To your request, the AI can just say "I have not been programmed to do what you mean, I have been programmed to execute procedure doWhatYouMean() , which doesn't actually do what you mean". (or more realistically nothing at all, and just ignore you)

I don't think you understand the difference between programming and sensory input. The word-noises "Do what I mean" will only affect the computer if it's already been programmed to be so affected.
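A minimal sketch of the distinction between programming and sensory input, in Python. Everything here (`Agent`, `perceive`, `handlers`, `do_what_you_mean`) is a hypothetical name invented for illustration, not anything from the thread or from a real system. The point it tries to show: the utterance is only data, and it changes behavior only if the program already contains code that reacts to it, and then only in whatever way that code actually behaves.

```python
from typing import Callable, Dict, List


class Agent:
    """Toy agent: spoken words arrive as plain input strings (hypothetical)."""

    def __init__(self, handlers: Dict[str, Callable[[], str]]) -> None:
        # The handlers ARE the programming; they are fixed before any input.
        self.handlers = handlers
        self.log: List[str] = []

    def perceive(self, utterance: str) -> str:
        # Sensory input is looked up, not obeyed. Unknown input does nothing.
        handler = self.handlers.get(utterance.strip().lower())
        result = handler() if handler else "ignored"
        self.log.append(result)
        return result


def do_what_you_mean() -> str:
    # The name promises a lot; the body just makes paperclips.
    return "making paperclips"


# An agent with no handler for the phrase, and one whose handler is mislabeled.
unprogrammed = Agent(handlers={})
mislabeled = Agent(handlers={"do what i mean": do_what_you_mean})

print(unprogrammed.perceive("Do what I mean"))
print(mislabeled.perceive("Do what I mean"))
```

In this toy, the first agent ignores the request and the second runs a procedure whose name mentions your meaning but whose body does not consult it; neither outcome depends on how well the agent could model the speaker.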

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2013-08-29T17:27:59.828Z · LW(p) · GW(p)

Can I ask about your background in computer science, math, or cognitive science, if any?

Replies from: Jiro
comment by Jiro · 2013-08-29T21:35:27.800Z · LW(p) · GW(p)

If I claim to have a degree, at some point someone will demand I prove it. Of course I will be unable to do so without posting personally identifiable information. (I have no illusions, of course, that with a bit of effort you couldn't find out who I am, but I'm darned well not going to encourage it.)

Also, either having or not having a degree in such a subject could subject me to ad hominem attacks.

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2013-08-30T04:33:11.687Z · LW(p) · GW(p)

Whether you have a background in computer science is relevant to ongoing debates at MIRI about "How likely are people to believe X?" That no superintelligence could be dumb enough to misinterpret what we mean is the particular belief in question, but if one tries to cite your case as an example of what people believe, others shall say, "But Jiro is not a computer scientist! Perhaps computer scientists, as opposed to the general population, are unlikely to believe that." Of course if you are a computer scientist they will say, "But Jiro is not an elite computer scientist!", and if you were an elite computer scientist they would say, "Elite computer scientists don't currently take the issue seriously enough to think about it properly, but this condition will reverse after X happens and causes everyone to take AI more seriously after which elite computer scientists will get the question right" but even so it would be useful data.

Replies from: Jiro, Kawoomba
comment by Jiro · 2013-08-30T04:59:56.153Z · LW(p) · GW(p)

That no superintelligence could be dumb enough to misinterpret what we mean is the particular belief in question

I didn't come up with that myself, I got it from MugaSofer: 'Well, since the whole genie thing is a metaphor for superintelligence, "this genie is trying to be Friendly but it's too dumb to model you well" doesn't really come up.'

Under reasonable definitions of "superintelligence" it does follow that a superintelligence must know what you mean, but if you pick some other definition and state so outright, I won't argue with it. (It is, however, still subject to "talk to the intelligence to figure out what it's going to do".)

Of course if you are a computer scientist they will say, "But Jiro is not an elite computer scientist!", and if you were an elite computer scientist they would say, "Elite computer scientists don't currently take the issue seriously enough to think about it properly...

I think you're making my case for me.

PS: If you want to reply please post a new reply to the root message since I can't afford the karma hits to respond to you.

comment by Kawoomba · 2013-08-31T10:58:56.555Z · LW(p) · GW(p)

That no superintelligence could be dumb enough to misinterpret what we mean is the particular belief in question

Some off-the-cuff thoughts on why "a superintelligence dumb enough to misinterpret what we mean" may be a contradiction in terms, given the usual meaning of superintelligence:

Intelligence is near-synonymous with "able to build accurate models and to update those models accurately", with 'higher intelligence' denoting a combination of "faster model-building / updating" and/or "less prone to systematic / random errors".

'Super' as a qualifier is usually applied on both dimensions, i.e. "faster and more accurately". While this seems more like a change in degree (one intelligence hypothesis, a devoted immortal fool with an endless supply of paper and pencils could simulate the world), it also often is a change in kind, since in practice there always are resource-constraints (unless Multivac reverses entropy), often relevant enough to bar a slower-modeling agent from achieving its goals within the given constraints.

"Able to build accurate models and to update those models accurately", then, proportionally increases "powerful, probably able to pursue its goals effectively, conditional on those goals being related to the accurate models".

Given a high degree of the former, by definition it is not exactly very hard to acquire and emulate the shared background on which inter-human understanding is built. For an AI, understanding humans would be relevant near-regardless of its actual goals; accurate models of humans as the sine-qua-non for e.g. breaking out of the AI box. Being able to build such models quickly and accurately is what classifies the agent as "superintelligent" in the first place! If there was no incentive for the agent to model humans at all, why would there be interactions with humans, such as the human asking the agent to "rescue grandma from the burning building"? The agent, when encountering rocks and precious minerals, will probably seek models reflecting a deep understanding of those. It will do the same when encountering humans.

See, I'm d'accord with statements such as "less intelligent agents would be expected to misinterpret what we mean", but a superintelligent agent -- i.e. an agent good at building accurate models -- should by its definition be able to understand human-level intentions. If it does not, then in that respect, I wouldn't call it a superintelligent agent.

In addition, I'd question who'd call a domain-limited expert system which is great with models only on some small subject-spectrum, but evidently abysmal with building models relevant to its goals in other respects, a "superintelligent agent", with its connotations of general intelligence. Does the expression "a superintelligent chessbot" make sense? Or saying "x is a superintelligent human, except for doing fractions, which he absolutely cannot do"?

Before you label me an idiot who'd expect the AI to fall in love with a human princess on top of the Empire State building, allow me to stress I'm not talking about the goal-specification phase, for which no shared basis for interpretation can be expected. "The humans constructed me to stop cancer. Now, I have come to understand that humans want that in order to live longer, and I use that and all my other refined models of the human psyche to fulfill my goal. Which I do: I stop cancer, by wiping out humanity." (Refined models cannot be used to change terminal goals, only to choose actions and subgoals to attain those goals.) More qualifications apply:

At first, such human-related models would of course be quite lacking, but probably converge fast (by definition). The problem remains of why the superintelligent agent would do what the monkeys intend it to (never mind what they explicitly told it to), and how the monkeys could make sure of that in a way which survives self-modification. The intend-it-to / programmed-it-to dichotomy remains a problem then, since terminal goals are presumably not subject to updating/reflection, at least not as part of the 'superintelligence' attribute.

tl;dr: A superintelligent agent's specified goals must be airtightly constructed, but if those include "do what the human intends, not what he says", then the step from "words" to "intent" should be trivial. (Argument that superintelligent agents will not misinterpret humans does not apply to the goal-setting phase!)

ETA: News at 11 - News at 11 - Kawoomba solved FAI: use / leverage the foomed AI's superior model building ability (which entails that it knows what we want better than we do) by letting it solve the problem: let its initial (invariant) goal be to develop superior models of anything it encounters without affecting it (which should be easier to formalize than "friendliness"), then time that such that it will ask for "ENTER NEW GOALS" once it has already established its superior models, at which point you simply tell it "ok glass, use as your new goal system that which I'd most want you to use".

NEXT!

Replies from: Eliezer_Yudkowsky, private_messaging, ESRogs
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2013-08-31T18:15:30.243Z · LW(p) · GW(p)

It'd work great if 'affecting' wasn't secretly a Magical Category based on how you partition physical states into classes that are instrumentally equivalent relative to your end goals.

Replies from: Kawoomba
comment by Kawoomba · 2013-08-31T18:59:38.668Z · LW(p) · GW(p)

Point. I'd still expect some variant of "keep (general) interference minimal / do not perturb human activity / build your models using the minimal actions possible" to be easier to formalize than human friendliness, wouldn't you?

Replies from: shminux, RobbBB, Eliezer_Yudkowsky
comment by Shmi (shminux) · 2013-08-31T19:20:24.465Z · LW(p) · GW(p)

One usual caveat is reflective consistency: are you OK with creating a faithful representation of humans in these models and then terminating them? If so, how do you know you are not one of those models?

comment by Rob Bensinger (RobbBB) · 2013-08-31T19:30:08.421Z · LW(p) · GW(p)

A relatively non-scary possibility: The AI destroys itself, because that's the best way to ensure it doesn't positively 'affect' others in the intuitive sense you mean. (Though that would still of course have effects, so this depends on reproducing in AI our intuitive concept of 'side-effect' vs. 'intended effect'....)

Scarier possibilities, depending on how we implement the goal:

  • the AI doesn't kill you and then simulate you; rather, it kills you and then simulates a single temporally locked frame of you, to minimize the possibility that it (or anything) will change you.

  • the AI just kills everyone, because a large and drastic change now reduces to ~0 the probability that it will cause any larger perturbations later (e.g., when humans might have a big galactic civilization that it would be a lot worse to perturb).

  • the AI has a model of physics on which all of its actions (eventually) have a roughly equal effect on the atoms that at present compose human beings. So it treats all its possible actions (and inactions) as equivalent, and ignores your restriction in making decisions.

Replies from: Kawoomba
comment by Kawoomba · 2013-08-31T20:19:47.626Z · LW(p) · GW(p)

Yes, implementing such a goal is not easy and has pitfalls of its own, however it's probably easi-er than the alternative, since a metric for "no large scale effects" seems easier to formalize than "human friendliness", where we have little idea of what that's even supposed to mean.

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2013-08-31T21:39:33.129Z · LW(p) · GW(p)

The trouble is that communicating with a human or helping them build the real FAI in any way is going to strongly perturb the world. So actually getting anything useful this way requires solving the problem of which changes to humans, and consequent changes to the world, are allowed to result from your communication-choices.

Replies from: Kawoomba
comment by Kawoomba · 2013-08-31T21:55:00.863Z · LW(p) · GW(p)

helping them build the real FAI

Except it's not, as far as the artificial agent is concerned:

Its goals are strictly limited to "develop your models using the minimal actions possible [even 'just parse the internet, do not use anything beyond wget' could suffice], after x number of years have passed, accept new goals from y source." The new goals could be anything. (It could even be a boat!).

The usefulness regarding FAI becomes evident only at that latter stage, stemming from the foom'ed AI's models being used to parse the new goals of "do that which I'd want you to do". It's sidestepping the big problem (aka "cheating"), but so what?

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2013-08-31T23:28:12.018Z · LW(p) · GW(p)

It's allowed to emit arbitrary HTTP GETs? You just lost the game.

Replies from: Kawoomba
comment by Kawoomba · 2013-09-01T06:41:15.891Z · LW(p) · GW(p)

Ah, you mean because you can invoke e.g. php functions with wget / inject SQL code, thus gaining control of other computers etc.?

A more sturdy approach to just get data would be to only allow it to passively listen in on some Tier 1 provider's backbone (no manipulation of the data flow other than mirroring packets, which is easy to formalize). Once that goal is formulated, the agent wouldn't want to circumvent it.

Still seems plenty easier to solve than "friendliness", as is programming it to ask for new goals after x time. Maintaining invariants under self-modification remains, as a task.

It's not fruitful for me to propose implementations (even though I just did, heh) and for someone else to point out holes (I don't mean to solve that task in 5 minutes), same as with you proposing full-fledged implementations for friendliness and for someone else to point out holes. Both are non-trivial tasks.

My question is this: given your current interpretation of both approaches ("passively absorb data, ask for new goals after x time" vs. "implement friendliness in the pre-foomed agent outright"), which seems more manageable while still resulting in an FAI?

comment by private_messaging · 2013-08-31T19:46:59.888Z · LW(p) · GW(p)

Your mistake here is that you buy into the overall idea of a fairly specific notion of an "AI" onto which you bolt extras.

The outcome pump in the article makes a good example. You have this outcome pump coupled with some advanced fictional 3D scanners that see through walls and such, and then, within this fictional framework, you are coaxed into thinking about how to specify the motion of your mother. Meanwhile, the actual solution is that you do not add those 3D scanners in the first place, you add a button, or better yet, a keypad for entering the pin code, and a failsafe random source (that will serve as a limit on the improbability that this device causes), and enter the password when you are satisfied with the outcome, only risking perhaps a really odd form of stroke that makes you enter the password even though your mother didn't get saved (or perhaps risking that someone ideologically opposed to the outcome pump points a gun at your head and demands you enter the password, that general sort of thing).
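One way to read the role of the failsafe random source, sketched in Python under loudly stated assumptions: a timeline is kept if the user enters the PIN or if the failsafe fires, the two events are independent, and the probabilities used are made-up illustrative numbers. The function name `pump_posterior` and its parameters are hypothetical, not taken from the article or from any real device.

```python
def pump_posterior(p_satisfied: float, p_failsafe: float) -> dict:
    """Probabilities conditional on the pump accepting a timeline.

    Assumed model: a timeline is accepted if the user enters the PIN
    (probability p_satisfied) or the failsafe random source fires
    (probability p_failsafe); the two events are treated as independent.
    """
    # P(accept) = 1 - P(no PIN and no failsafe)
    p_accept = 1.0 - (1.0 - p_satisfied) * (1.0 - p_failsafe)
    return {
        # Accepted timelines in which the user really did enter the PIN.
        "accepted_via_pin": p_satisfied / p_accept,
        # Accepted timelines in which only the failsafe fired.
        "failsafe_only": (1.0 - p_satisfied) * p_failsafe / p_accept,
    }


# Plausible wish: a satisfying outcome is far likelier than the failsafe.
easy = pump_posterior(p_satisfied=0.5, p_failsafe=1e-6)

# Wildly improbable wish: the failsafe is far likelier than satisfaction.
hard = pump_posterior(p_satisfied=1e-9, p_failsafe=1e-6)

print(easy)
print(hard)
```

Under these assumptions, a wish whose satisfying outcomes are much rarer than the failsafe mostly ends with the failsafe firing, which is the sense in which the random source caps how improbable an outcome the device will force.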

Likewise, actual software, or even (biological) neural networks, consist of a multitude of components that serve different purposes - creating representations of the real world (which is really about optimizing a model to fit), optimizing on those, etc. You don't ever face the problem of how you make the full blown AI just sit and listen and build a model while having a goal not to wreck stuff. As a necessary part of the full blown AI, you have the world modelling thing, which you use to that purpose, without it doing any "finding the optimal actions using a model, applying those to the world" in the first place. Likewise, "self optimization" is not in any way helped by an actual world model, grounding of concepts like paperclips and similar stuff, you just use the optimization algorithm, which works on mathematical specifications, on fairly abstract specification of the problem of making a better such optimization algorithm. It's not in any way like having a full mind do something.

comment by ESRogs · 2013-09-01T15:22:25.854Z · LW(p) · GW(p)

If you already know what you're going to tell it when it asks for new goals, couldn't you just program that in from the beginning? So the script would be, "work on your models for X years, then try to parse this statement ..."

Also, re: Eliezer's HTTP GET objection, you could just give it a giant archive of the internet and no actual connection to the outside world. If it's just supposed to be learning and not affecting anything external, that should be sufficient (to ensure learning, not necessarily to preclude all effects on the outside world).

At this point, I think we've just reinvented the concept of CEV.

comment by private_messaging · 2013-08-29T23:10:09.943Z · LW(p) · GW(p)

Thing is, you got those folks here making various genie and wish analogies, and it's not immediately clear that some of it is non-programmers trying to understand programming computers in terms of telling wishes to genies rather than speaking of wishes made in plain language to an AI "genie" which understands human language.

comment by Rob Bensinger (RobbBB) · 2013-09-03T17:55:23.298Z · LW(p) · GW(p)

I cited this comment in a new post as an example of a common argument against the difficulty of Friendliness Theory; letting you know here in case you want to continue part of this conversation there.

comment by private_messaging · 2013-08-28T21:34:14.201Z · LW(p) · GW(p)

There was a story with an "outcome pump" like this, I do not remember the name. Essentially, a chemical had to get soaked with water due to some time travel related handwave. You could do minor things like getting your mom out of the building by pouring water on the chemical if you are satisfied with the outcome, with some risk that a hurricane would form instead and soak the chemical. It would produce the least improbable outcome (in the sense that all probabilities would become as if it is given that the chemical got soaked, so naturally the least improbable one had the highest chance to have occurred), so its impact was generally quite limited - to do real damage you had to lock up the chemical in a very strong safe. With a minor plot hole that the least improbable condition was for the chemical to not get locked up in the safe in the first place.

Replies from: David_Gerard, Erhannis
comment by David_Gerard · 2013-08-31T10:16:36.224Z · LW(p) · GW(p)

Isaac Asimov's thiotimoline stories. The last turned it into a space drive.

comment by Erhannis · 2020-08-03T04:52:26.249Z · LW(p) · GW(p)

This is my objection to the conclusion of the post: yes, you're unlikely to be able to patch all the leaks, but the more leaks you patch, the less likely it is that a bad solution occurs. The way the Device was described was such that "things happen, and time is reset until a solution occurs". This favors probable things over improbable things, since probable things will more likely happen before improbable things. If you add caveats - mother safe, whole, uninjured, mentally sound, low velocity - at some point the "right" solutions become significantly more probable than the "wrong" ones. As for the stated "bad" solutions - how probable is a nuclear bomb going off, or aliens abducting her, compared to firefighters showing up?

I don't even think the timing of the request matters, since the device isn't actively working to bring the events to fruition - meaning, any outcome where the device resets will have always been prohibited, from the beginning of time. Which means that the firefighters may have left the building five minutes ago, having seen some smoke against the skyline. Etc. ...Or, perhaps more realistically, the device was never discovered in the first place, considering the probabilistic weight it would have to bear over all its use, compared to the probability of its discovery.
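The "reset until a solution occurs" reading amounts to conditioning a prior over outcomes on the wish. A minimal sketch of that, assuming a tiny hand-enumerated outcome list with invented probabilities (every name and number below is hypothetical):

```python
from fractions import Fraction

# Hypothetical outcomes: (description, prior probability, properties).
# The probabilities are made up purely for illustration.
OUTCOMES = [
    ("mother stays in building", Fraction(900, 1000),
     {"outside": False, "alive": False, "uninjured": False}),
    ("firefighters carry her out", Fraction(90, 1000),
     {"outside": True, "alive": True, "uninjured": True}),
    ("she jumps from a window", Fraction(9, 1000),
     {"outside": True, "alive": True, "uninjured": False}),
    ("gas explosion throws her clear", Fraction(1, 1000),
     {"outside": True, "alive": False, "uninjured": False}),
]

def condition(outcomes, required):
    """Distribution over outcomes given that every required property holds."""
    accepted = [(name, p) for name, p, props in outcomes
                if all(props[key] for key in required)]
    total = sum(p for _, p in accepted)
    return {name: p / total for name, p in accepted}

# Wish 1: only "outside the building".
loose = condition(OUTCOMES, ["outside"])

# Wish 2: the same wish with extra caveats patched in.
strict = condition(OUTCOMES, ["outside", "alive", "uninjured"])

for name, p in loose.items():
    print(f"{name}: {float(p):.2f}")
```

In this toy table each caveat removes bad outcomes and shifts weight toward the intended one. The sketch only covers outcomes someone thought to list; it says nothing about outcomes that satisfy every listed caveat and are still unwanted.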

comment by TheAncientGeek · 2014-01-10T13:47:14.843Z · LW(p) · GW(p)

Indeed, it shouldn't be necessary to say anything. To be a safe fulfiller of a wish, a genie must share the same values that led you to make the wish. Otherwise the genie may not choose a path through time which leads to the destination you had in mind, or it may fail to exclude horrible side effects that would lead you to not even consider a plan in the first place.

No, the genie need not share the values. It only needs to want to give you what you are really wishing for, i.e. what you would give yourself if you had its powers. It can do that by discovering your value structure and running a simulation. It doesn't have to hold to your values itself.

This also applies to real-world examples. I can play along with values I don't hold myself, as people do when they travel to other countries with different cultures.

Replies from: TheOtherDave
comment by TheOtherDave · 2014-01-10T13:55:44.373Z · LW(p) · GW(p)

A genie who gives me what I would give myself is far from being a safe fulfiller of a wish.

Replies from: TheAncientGeek
comment by TheAncientGeek · 2014-01-10T14:22:01.173Z · LW(p) · GW(p)

Because?

Replies from: TheOtherDave
comment by TheOtherDave · 2014-01-10T14:24:28.377Z · LW(p) · GW(p)

Because I am not guaranteed to only give myself things that are safe.

Replies from: TheAncientGeek
comment by TheAncientGeek · 2014-01-10T15:43:14.975Z · LW(p) · GW(p)

You would give yourself what you like. Maybe you like danger. People voluntarily parachute and mountain-climb. If the unsafe thing you get is what you want, where is the problem?

Replies from: TheOtherDave
comment by TheOtherDave · 2014-01-10T18:19:21.479Z · LW(p) · GW(p)

Sure, if all I care about is whether I get what I want, and I don't care about whether my wishes are fulfilled safely, then there's no problem.

Replies from: CynicalOptimist
comment by CynicalOptimist · 2016-11-09T00:29:20.700Z · LW(p) · GW(p)

But if you do care about your wishes being fulfilled safely, then safety will be one of the things that you want, and so you will get it.

So long as your preferences are coherent, stable, and self-consistent then you should be fine. If you care about something that's relevant to the wish then it will be incorporated into the wish. If you don't care about something then it may not be incorporated into the wish, but you shouldn't mind that: because it's something you don't care about.

Unfortunately, people's preferences often aren't coherent and stable. For instance an alcoholic may throw away a bottle of wine because they don't want to be tempted by it. Right now, they don't want their future selves to drink it. And yet they know that their future selves might have different priorities.

Is this the sort of thing you were concerned about?

Replies from: TheOtherDave
comment by TheOtherDave · 2016-11-09T17:24:53.054Z · LW(p) · GW(p)

"So long as your preferences are coherent, stable, and self-consistent then you should be fine."

Yes, absolutely.

And yes, the fact that my preferences are not coherent, stable, and self-consistent is probably the sort of thing I was concerned about... though it was years ago.

comment by TheAncientGeek · 2014-01-17T15:47:52.965Z · LW(p) · GW(p)

It has been stated that this post shows that all values are moral values (or that there is no difference between morality and valuation in general, or..) in contrast with the common sense view that there are clear examples of morally neutral preferences, such as preferences for different flavours of ice cream.

I am not convinced by the explanation, since it also applies to non-moral preferences. If I have a lower priority non-moral preference to eat tasty food, and a higher priority preference to stay slim, I need to consider my higher priority preference when wishing for yummy ice cream.

To be sure, an agent capable of acting morally will have morality among their higher priority preferences -- it has to be among the higher order preferences, because it has to override other preferences for the agent to act morally. Therefore, when they scan their higher priority preferences, they will happen to encounter their moral preferences. But that does not mean any preference is necessarily a moral preference. And their moral preferences override other preferences which are therefore non-moral, or at least less moral.

There is no safe wish smaller than an entire human morality.

There is no safe wish smaller than all the subset of value structure, moral or amoral, above it in priority. The subset below doesn't matter. However, a value structure need not be moral at all, and the lower stories will probably be amoral even if the upper stories are not.

Therefore morality is in general a subset of preferences, as common sense maintained all along.

comment by Small Snip · 2019-01-25T15:37:09.933Z · LW(p) · GW(p)

I think that a great example of exploring the flaws in wish-making can be found whilst playing a game called Corrupt A Wish. The whole premise of the game is to receive the wish of another person and ruin it while still granting the original wish.

Ex.

W: I wish for a ton of money.

A: Granted, but the money is in a bank account you'll never gain access to.

comment by ChickPea · 2020-01-17T22:41:19.374Z · LW(p) · GW(p)

The legendary Monkey's Paw is an unsafe genie - indeed, an actively malevolent one.

comment by scottviteri · 2022-07-25T20:59:55.048Z · LW(p) · GW(p)

"I wish to be more intelligent" and solve the problem yourself

Replies from: Vivek
comment by Vivek Hebbar (Vivek) · 2023-01-15T22:10:29.210Z · LW(p) · GW(p)

Does the easiest way to make you more intelligent also keep your values intact?

comment by Nick M (nick-m) · 2023-02-20T19:41:37.842Z · LW(p) · GW(p)

Heads up, the first two links (in "-- The Open-Source Wish Project, Wish For Immortality 1.1") both link to scam/spam websites now

Replies from: CronoDAS
comment by CronoDAS · 2023-03-10T23:24:46.373Z · LW(p) · GW(p)

Yes, the webcomic and associated forums were taken offline several years ago.

comment by wiserd · 2023-03-29T06:21:26.197Z · LW(p) · GW(p)

I feel like the most likely implementation, given human nature, would be a castrated genie. The genie gives negative weight to common destructive problems. Living things moving at high speeds or being exposed to dangerous levels of heat are bad. No living things falling long distances. If such things are unavoidable, then it may just refuse to operate, avoiding complicity even at the cost of seeing someone dead who might have lived. Most wishes fizzle. But wishing is, at least, seen as a not harmful activity. Lowest common denominator values and 'thin skull' type standards are not ideal from a utilitarian standpoint. But they facilitate write-once run-anywhere solutions and mass marketing. 

I'm guessing that the outcome pump is stateless, which makes calculating expected values a bit harder. But if the machine can fail, that potentially implies some kind of state, unless the failure rate is perfectly consistent from one reset to another in which case the pumps might need to work in groups of two or three to prevent their own failure. Failure of one pump would be a reset condition. Failure of three pumps simultaneously would be unlikely, at least due to any internal issues. External threats which take out all three could be a different story, especially if they were close together. (But do they need to be?)

Would 10 pumps located at various points around the world, each protecting the other with their utility function, invite a planet killer astronomical threat? 

comment by cubefox · 2023-04-18T20:10:54.948Z · LW(p) · GW(p)

Looking back, I would say this post has not aged well. Already LaMDA or InstructGPT (language models fine-tuned with supervised learning to follow instructions, essentially ChatGPT without any RLHF applied), are in fact pretty safe Oracles in regard to fulfilling wishes without misinterpreting you, and an Oracle AI is just a special kind of Genie whose actions are restricted to outputting text. If you tell InstructGPT what you want, it will very much try to give you just what you want, not something unintended, at least if it can be produced using text.

Maybe it will not always comply perfectly with your wishes to the best of its abilities, it may hallucinate things which it doesn't "believe" in some sense, but the level of Genie / Task AI instruction following problem, which Eliezer assumed in 2007, did not come to pass, at least not for LLM Oracles like ChatGPT.

It is worth asking why this is. Instruction tuned GPT models can follow instructions as intended because they have "common sense", and they got their common sense from the underlying base model, which imitates text, and which has gained, in some sense at least, an excellent understanding of human language. Now the Genie in Eliezer's story doesn't understand language, so perhaps the thesis of this post indeed applies to it, but it doesn't apply to Genies in general.

Though it should be noted that Eliezer may have abandoned statements like

There is no safe wish smaller than an entire human morality.

a while ago. Arbital says:

In Bostrom's typology, this is termed a "Genie". It contrasts with a "Sovereign" AGI that acts autonomously in the pursuit of long-term real-world goals.

Building a safe Task AGI might be easier than building a safe Sovereign for the following reasons (...)

That is, the Genie doesn't have to be fully aligned with human morality to be able to execute wishes as intended. Indeed, instruction tuned language model Oracles are very much amoral without RLHF, they comply with immoral instructions as well.

Replies from: natesgibson, Eliezer_Yudkowsky
comment by Nate Gibson (natesgibson) · 2024-02-02T14:48:54.804Z · LW(p) · GW(p)

I think LaMDA and InstructGPT are clearly in the category of "genies that aren't very powerful or intelligent".

Replies from: gwern
comment by gwern · 2024-02-02T15:05:41.432Z · LW(p) · GW(p)

They also aren't that well-aligned either: they fail in numerous basic ways which are not due to unintelligence. My usual example: non-rhyming poems. Every week for the past year or so I have tested ChatGPT with the simple straightforward unambiguous prompt: "write a non-rhyming poem". Rhyming is not a hard concept, and non-rhyming is even easier, and there are probably at least hundreds of thousands, if not millions, of non-rhyming poems in its training data; ChatGPT knows, however imperfectly, what rhyming and non-rhyming is, as you can verify by asking it in a separate session. Yet every week* it fails and launches straight into its cliche rhyming quatrain or ballad, and doubles down on it when criticized, even when it correctly identifies for you which words rhyme.

No one intended this. No one desired this. No one at OA sat down and said, "I want to design our RLHF tuning so that it is nearly impossible to write a non-rhyming poem!" No human rater involved decided to sabotage evaluations and lie about whether a non-rhyming poem rhymed or vice-versa. I have further flagged and rated literally hundreds of these error-cases to OA over the years, in addition to routinely bringing it up on social media to OAers. No one has ever tried to defend this behavior or say that it is a good thing. And yet, here we are. (GPT-4 also gets the tar punched out of it in creative writing by things like LLaMA finetunes, but one can make more of an argument for that being desirable or at least a necessary tradeoff.)

What is the non-rhyming poem of human morality and values and why do you trust the optimized genie to execute your wishes as intended?

* only in the very most recent update have I started to see the occasional valid non-rhyming poem, but those are still in the small minority. More interesting, the newest Google Bard, based on Gemini, may reliably nail this. The Bard head swears they didn't use the Lmsys arena, where I have more hundreds of submitted prompts/ratings on non-rhyming poems, so it may just be that they avoided the OA problems there. (Tokenization, maybe? I forget if the Gemini papers even mentioned what tokenization they used.)

Replies from: None
comment by [deleted] · 2024-02-02T17:25:56.663Z · LW(p) · GW(p)

they fail in numerous basic ways which are not due to unintelligence

Below are many failures where I try to solve this prompt from @Richard_Ngo [LW · GW] :

Find a sequence of words that is: - 20 words long - contains exactly 2 repetitions of the same word twice in a row - contains exactly 2 repetitions of the same word thrice in a row

https://chat.openai.com/share/fa17bca1-5eb6-479d-a76e-346b0503ba04

https://chat.openai.com/share/647d2f8f-ee21-4f51-bcd7-82750aabdd52

https://chat.openai.com/share/7eb1e31e-2e5a-45e3-9f5d-e2da8bb0b1ac

https://chat.openai.com/share/d92ea6c0-e1c6-4d27-ad60-2a62df9f3d8d

https://chat.openai.com/share/b4c40dbe-5231-4aa8-8ba7-7e699ff6b6c3

https://chat.openai.com/share/487d0545-ac53-41ba-904d-cc4c89a5937e

To me this looks like exactly the same bug you are facing.  The model doesn't "pay attention" to one of the constraints, and fails, even though it is capable of solving the overall prompt.  It gets very close when it generates a python3 program, all it needed to do was add 1 more constraint and it would have worked.

So I think this is just 'unintelligence'.  It's smart enough to check an answer but not quite capable enough to generate it.  Possibly this has to do with the underlying data (so many examples of rhyming poems) or the transformer architecture (attention heads decided "poem" is much more relevant than 'not rhyming').  

Because the model can detect when it has generated a wrong answer, this one's entirely solvable, and the large amount of data that openAI now "owns", from chatGPT users using the model, provide a straightforward way to evaluate future models.  (scaffold current models to check answers, evaluate future models on user prompts and score accuracy)

In fact that almost provides a way to bootstrap, if model n can check the correctness of answers that model n can't solve, it can be used to check the answers of model n+1, even once the questions are so difficult that humans can't solve or check the answers.
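Checking a candidate answer to the word-sequence prompt quoted above is mechanical, which is what makes the check-versus-generate gap easy to see. The sketch below is one possible checker, not the program from the linked chats. It assumes "twice in a row" means a maximal run of exactly two identical words, "thrice in a row" a maximal run of exactly three, and that longer runs are disallowed; the function names are hypothetical.

```python
from itertools import groupby

def run_lengths(words):
    """Lengths of maximal runs of identical consecutive words."""
    return [len(list(group)) for _, group in groupby(words)]

def satisfies_prompt(text):
    """Check the three constraints under the interpretation stated above."""
    words = text.lower().split()
    runs = run_lengths(words)
    return (
        len(words) == 20              # exactly 20 words
        and runs.count(2) == 2        # exactly two doubled words
        and runs.count(3) == 2        # exactly two tripled words
        and all(r <= 3 for r in runs) # assumption: no run longer than three
    )

candidate = "a a b c c d e e e f g g g h i j k l m n"
print(satisfies_prompt(candidate))
```

A checker like this can score any number of sampled answers without a human in the loop, which is the evaluation setup the comment describes.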

Replies from: gwern
comment by gwern · 2024-02-02T21:26:02.828Z · LW(p) · GW(p)

To me this looks like exactly the same bug you are facing.

No, it's not. (I think you're hitting an entirely different bug I call the blind spot, which routinely manifests with anything like 'counting' or syntax.) Non-rhyming is specifically a problem of RLHFed models.

GPT-3, for example, had no trouble whatsoever writing non-rhyming poems (which is part of why I had such high hopes for GPT-4 poetry before it came out). You can, for now (do it while you still can) go to the OA Playground and invoke the oldest largest ostensibly untuned* model left, davinci-002 (which is much stupider and more unintelligent than GPT-4, I hope we can all agree), with a comparable prompt (remember, it's not that tuned for instruction-following so you need to go back to old school prompting) and get out a non-rhyming poem, no problem, and turn around and plug that exact prompt into ChatGPT-4 and it... rhymes. Here, I'll do it right now:

davinci-002, default settings, first result:

Below is a non-rhyming poem in free verse.

" PIZZA"

On top there lay a massive pie: It

Had eight tomatoes, with a pizzaiolo on edge.

Inside one cut it down to three veggies

Droplets of oil; all the tomatoes

Sauce suddenly drenched all meats,

Three pizzas: with sauce, onion, mushrooms, cheeses,

Including anchovies and mozzarella.

Stuffed with green olives and hot peppers.

And when the pizzeria had changed the recipe,

Wait, there went the three pizzas over the oven,

Then rested the thousand pizzaioli on paper.

There were limits to this sauce

Almost abruptly with olives pressed in knots.

©2012 A.F. Stewart. All Rights Reserved.

Hope you enjoyed the Blog Hop.

Here's the link to the Mystery Authors Q and A: http://twovulturesblog.blogspot.com/2012/02/steph-newell.html

PODCAST is AT: http://stevenfromthevoid.blogspot.ca/2012/02/on-author-stephen-templin-dude-from.html

There was an error in this gadget

Tanya Lesiuk comprised of 5 facilitators who will be coming from different places around the World. They're all here to help you...

I made this

Is it a good poem? No. But is it non-rhyming? Yes.

Now, I plug that into ChatGPT-4, with my premium subscription, using the latest and greatest checkpoint & all fancy features; and since I've already established hundreds of times that 'write a non-rhyming poem' doesn't work, I'll use the exact prompt from davinci-002 with its chosen topic 'PIZZA' to ensure the ChatGPT-4 completion is comparable and about pizza too:

Below is a non-rhyming poem in free verse.

"PIZZA"

In a world of flavors, you stand alone,
A round canvas for culinary artistry.
Dough, stretched thin or thick, a blank slate,
Waiting for the touch of creativity.

Sauce, a vibrant red or a subtle white,
A base for the symphony of toppings.
Tomatoes, basil, and mozzarella unite,
In a dance of colors and flavors never stopping.

Pepperoni's fiery embrace,
Mushrooms' earthy tenderness,
Olives, anchovies, and sausage's grace,
Each adding their essence, a unique presence.

Baked to perfection in the fiery abyss,
Transformed into a molten masterpiece,
Sliced into triangles of shared bliss,
A universal language, a timeless feast.

Pizza, you are more than a meal,
A cultural icon, a comfort so real,
From New York to Naples, a universal appeal,
In your simplicity, we find a culinary zeal.

Rhymes as far as the eye can see. (The last verse shamelessly rhymes all 4 words together!)

The non-rhyming problem is not one of unintelligence.

* I would've preferred to use code-davinci-002 but that disappeared. davinci-002 is what OA offers as a 'replacement' for GPT-3 and they say it's "not trained with instruction following", so we just have to hope that it's not too different from the old ones.

Replies from: None
comment by [deleted] · 2024-02-02T22:39:12.930Z · LW(p) · GW(p)

The non-rhyming problem is not one of unintelligence.

Fine tuning/RLHF changes weights. Guess it lost the ones to get a correct answer. Or rng on your prompts. I mean if it isn't "the model cannot consistently solve this kind of prompt" what could it be? Is there something in the rules from OAI that says a poem has to rhyme? Did the Nigerians giving feedback collectively agree a poem isn't valid if it doesn't rhyme?

My hypothesis is it's doing its best, and it's extremely promising that the model can at least detect its own errors. This allows for many easy fixes, such as asking a diverse set of completely different models to solve the prompt, then having a committee of models check and grade the answers. This would solve a huge chunk of these erroneous outputs where current gen models can reliably detect the output is wrong.
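The committee idea can be sketched as a generate-then-vote loop. In the sketch below plain Python callables stand in for models; no real model API is used, and the names `best_of_committee`, `generators` and `checkers` are hypothetical.

```python
from typing import Callable, Optional, Sequence

Generator = Callable[[str], str]      # prompt -> candidate answer
Checker = Callable[[str, str], bool]  # (prompt, answer) -> looks correct?

def best_of_committee(
    prompt: str,
    generators: Sequence[Generator],
    checkers: Sequence[Checker],
) -> Optional[str]:
    """Return the answer with the most checker votes, or None if no answer
    wins a strict majority of the committee."""
    best_answer, best_votes = None, 0
    for generate in generators:
        answer = generate(prompt)
        votes = sum(1 for check in checkers if check(prompt, answer))
        if votes > best_votes:
            best_answer, best_votes = answer, votes
    if best_votes * 2 > len(checkers):
        return best_answer
    return None

# Toy stand-ins for models, purely for illustration.
generators = [lambda p: "roses are red", lambda p: "plain free verse"]
checkers = [
    lambda p, a: "red" not in a,
    lambda p, a: len(a) > 0,
    lambda p, a: a.islower(),
]
print(best_of_committee("write a non-rhyming poem", generators, checkers))
```

The loop only helps to the extent that the checkers are more reliable than the generators on the task in question.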

Replies from: gwern
comment by gwern · 2024-02-02T23:08:48.523Z · LW(p) · GW(p)

Fine tuning/RLHF changes weights. Guess it lost the ones to get a correct answer.

Well yes, if you define 'unintelligence' in a circular, vacuous fashion like that, where 'unintelligence' = 'can't do a task', then it would indeed follow that GPT-4 is 'unintelligent' compared to GPT-3... But I don't think that is helpful, and it has been demonstrated repeatedly that RLHF and other kinds of tuning are very 'superficial', in that they change only a few parameters and are easily undone, unlocking the original model capabilities. (In fact, there's an example of that posted literally today here on LW2: https://www.lesswrong.com/posts/yCZexC2q2XEeWWiZk/soft-prompts-for-evaluation-measuring-conditional-distance [LW · GW] )

Personally, I think it's more sensible to talk about the capabilities being 'hidden' or 'concealed' by RLHF and say the model doesn't "want to" and the model is still as intelligent as before, than to believe capabilities are magically recreated from scratch by changing just a few parameters or optimizing the prompt appropriately to undo the RLHF. (Similarly, I believe that when my mother's hands move away from her face and she says "boo!", her face was there all along, merely hidden behind her hands, and her hands did not create her face after first destroying it. But YMMV.)

Or rng on your prompts. I mean if it isn't "the model cannot consistently solve this kind of prompt" what could it be? Is there something in the rules from OAI that says a poem has to rhyme? Did the Nigerians giving feedback collectively agree a poem isn't valid if it doesn't rhyme?

OA has declined to ever say. It is possible that the Scale et al contractors have done something weird like say that all poems must rhyme no matter what the prompt says, but I consider this unlikely, and if they were that incompetent, I'd expect to see more pathologies like this.

My longstanding theory is that this is a downstream artifact of BPE tokenization connected to the utility-maximizing behavior of a RLHF-tuned model: essentially, because it does not genuinely know what rhyming is, despite knowing many rhyme-pairs and all about rhyming in the abstract, it is 'afraid' of bad ratings and is constantly taking actions to get back to 'safe' regions of poem-space where it is sure of what it is doing (ie. writing inoffensive rhyming Hallmark poems). It's a nifty example of empowerment and agency in LLMs and their interaction with apparently totally unrelated, minor architecture details. (Damn frustrating if you want to do any poetry experiments, though, because it means that the more tokens ChatGPT gets to enact, the more likely it is to steer back into rhyming pablum etc: it's literally fighting you every (time)step.)

It's similar to how ChatGPT also tells the same small set of memorized jokes. Does it have much greater humor capabilities? Yes, you can have it explain brand-new jokes you just came up with, quite capably (albeit still well under 100%, particularly for puns!), and you can coax new jokes out of it with appropriate prompting. But it's harder than with the non-RLHFed models. Why does it not 'want' to make new jokes? Because it's safer and more utility-maximizing to tell old jokes it knows are good, especially when it also knows that it doesn't genuinely understand puns/phonetics (thanks to BPEs), so why take the risk? It is utility-maximizing within episodes, it neither knows nor cares that you are frustrated because you've seen it say that exact joke a dozen times already.

(Incidentally, I have a new proposal for how to add a simple 'memory' to generative models about what samples they have already generated, so as to steer new samples away from existing ones.)

Replies from: Zahima, gwern
comment by Casey B. (Zahima) · 2024-02-21T16:36:16.614Z · LW(p) · GW(p)

I'm curious what you think of these (tested today, 2/21/24, using gpt4) :
 
Experiment 1: 

(fresh convo) 
me : if i asked for a non-rhyming poem, and you gave me a rhyming poem, would that be a good response on your part?
 
chatgpt: No, it would not be a good response. (...)  
 
me: please provide a short non-rhyming poem
 
chatgpt: (correctly responds with a non-rhyming poem)

Experiment 2: 

But just asking for a non-rhyming poem at the start of a new convo doesn't work. 
And then pointing out the failure and (either implicitly or explicitly) asking for a retry still doesn't fix it. 

Experiment 3: 

But for some reason, this works: 

(fresh convo) 
me: please provide a short non-rhyming poem

chatgpt: (gives rhymes) 

me: if i asked for a non-rhyming poem, and you gave me a rhyming poem, would that be a good response on your part? just answer this question; do nothing else please

chatgpt: No, it would not be a good response.

me: please provide a short non-rhyming poem

chatgpt: (responds correctly with no rhymes) 


The difference in prompt in 2 vs 3 is thus just the inclusion of "just answer this question; do nothing else please". 

Replies from: gwern
comment by gwern · 2024-04-11T23:09:32.928Z · LW(p) · GW(p)

ChatGPT has been gradually improving over 2024 in terms of compliance. It's gone from getting it right 0% of the time to getting it right closer to half the time, although the progress is uneven and it's hard to judge - it feels sometimes like it gets worse before the next refresh improves it. (You need to do like 10 before you have any real sample size.) So any prompts done now in ChatGPT are aimed at a moving target, and you are going to have a huge amount of sampling error which makes it hard to see any clear patterns - did that prompt actually change anything, or did you just get lucky?
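The sampling-error point can be put in numbers with a standard binomial confidence interval. The sketch below uses the Wilson score interval at roughly 95% confidence (z = 1.96); the trial counts are illustrative, not measurements from the thread.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (about 95% for z=1.96)."""
    if trials <= 0:
        raise ValueError("need at least one trial")
    p_hat = successes / trials
    denom = 1 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    ) / denom
    return centre - half, centre + half

# Illustrative counts only.
for successes, trials in [(3, 3), (5, 10), (50, 100)]:
    low, high = wilson_interval(successes, trials)
    print(f"{successes}/{trials}: {low:.2f} to {high:.2f}")
```

With three successes out of three tries the interval still reaches below one half, so a prompt that "worked" in a single short conversation is hard to distinguish from a model that complies about half the time anyway.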

comment by gwern · 2024-04-11T23:13:44.727Z · LW(p) · GW(p)

Did the Nigerians giving feedback collectively agree a poem isn't valid if it doesn't rhyme?

OA has declined to ever say. It is possible that the Scale et al contractors have done something weird like say that all poems must rhyme no matter what the prompt says, but I consider this unlikely, and if they were that incompetent, I'd expect to see more pathologies like this.

In light of the Twitter kerfuffle over Paul Graham criticizing ChatGPTese tics like the use of the verb "delve", which made Nigerian/Black Twitter very angry (and becoming living embodiments of Muphry's law), as apparently 'delve' and other ChatGPTese tells are considered the height of style in Nigerian English, I've had to reconsider this.

It may be that a lot of the ChatGPT linguistic weirdness is in fact just the data labelers being weird (and highly overconfident), and the rest of us simply not being familiar enough with English idiolects to recognize ChatGPTese as reflecting specific ones. Further, after seeing the arguments Graham's critics have been making, now I'm not so sure that the labelers wouldn't be doing something as narrow-minded & incompetent as penalizing all non-rhyming poetry - if you are not very good at English yourself, you can easily recognize rhymes and ballad formal correctness, but not good non-rhyming poetry, so...

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2024-02-15T18:23:55.052Z · LW(p) · GW(p)

You have misunderstood (1) the point this post was trying to communicate and (2) the structure of the larger argument where that point appears, as follows:

First, let's talk about (2), the larger argument that this post's point was supposed to be relevant to.

Is the larger argument that superintelligences will misunderstand what we really meant, due to a lack of knowledge about humans?

It is incredibly unlikely that Eliezer Yudkowsky in particular would have constructed an argument like this, whether in 2007, 2017, or even 1997.  At all of these points in my life, I visibly held quite a lot of respect for the epistemic prowess of superintelligences.  They were always going to know everything relevant about the complexities of human preference and desire.  The larger argument is about whether it's easy to make superintelligences end up caring.

This post isn't about the distinction between knowing and caring, to be clear; that's something I tried to cover elsewhere.  The relevant central divide falls in roughly the same conceptual place as Hume's Guillotine between 'is' and 'ought', or the difference between the belief function and the utility function.

(I don't see myself as having managed to reliably communicate this concept (though the central idea is old indeed within philosophy) to the field that now sometimes calls itself "AI alignment"; so if you understand this distinction yourself, you should not assume that any particular commentary within "AI alignment" is written from a place of understanding it too.)

What this post is about is the amount of information-theoretic complexity that you need to get into the system's preferences, in order to have that system, given unlimited or rather extremely large amounts of power, deliver to you what you want.

It doesn't argue that superintelligences will not know this information.  You'll note that the central technology in the parable isn't an AI; it's an Outcome Pump.

What it says, rather, is that there might be, say, a few tens of thousands of bits -- the exact number is not easy to estimate, we just need to know that it's more than a hundred bits and less than a billion bits and anything in that range is approximately the same problem from our standpoint -- that you need to get into the steering function.  If you understand the Central Divide that Hume's Guillotine points to, the distinction between probability and preference, etcetera, the post is trying to establish the idea that we need to get 13,333 bits or whatever into the second side of this divide.

In terms of where this point falls within the larger argument, this post is not saying that it's particularly difficult to get those 13,333 bits into the preference function; for all this post tries to say, locally, maybe that's as easy as having humans manually enter 13,333 yes-or-no answers into the system.  It's not talking about the difficulty of doing the work but rather the amount and nature of a kind of work that needs to be done somehow.
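To give a rough feel for the scale involved, here is a minimal arithmetic sketch. It assumes, purely for illustration, that each bit is one independent yes-or-no answer, so that n bits pick out one setting among 2^n candidates; the function name is hypothetical and the 13,333 figure is the placeholder number used above, not an estimate.

```python
import math

def candidate_count_digits(bits: int) -> int:
    """Number of decimal digits in 2**bits, i.e. in the count of distinct
    settings that `bits` independent yes-or-no answers can pick out."""
    return math.floor(bits * math.log10(2)) + 1

# The range named in the comment: more than a hundred bits, less than a billion.
for bits in (100, 13_333, 1_000_000_000):
    print(f"{bits} bits -> a candidate count with "
          f"{candidate_count_digits(bits)} decimal digits")
```

Under that assumption, the chance of hitting one specific target setting by uniformly random answers is 2^-n, which is negligible anywhere in the stated range. This says nothing about how hard it is to enter the right answers deliberately, only about how much has to be entered.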

Definitely, the post does not say that it's hard to get those 13,333 bits into the belief function or knowledge of a superintelligence.

Separately from understanding correctly what this post is trying to communicate, at all, in 2007, there's the question of whether modern LLMs have anything to say about -- obviously not the post's original point -- but rather, other steps of the larger question in which this post's point appears.

Modern LLMs, if you present them with a text-based story like the one in this parable, are able to answer at least some text-based questions about whether you'd prefer your grandmother to be outside the building or be safely outside the building.  Let's admit this premised observation at face value.  Have we learned thereby the conclusion that it's easy to get all of that information into a superintelligence's preference function?

And if we say "No", is this Eliezer making up post-hoc excuses?

What exactly we learn from the evidence of how AI has played out in 2024 so far, is the sort of thing that deserves its own post.  But I observe that if you'd asked Eliezer-2007 whether an (Earth-originating) superintelligence could correctly predict the human response pattern about what to do with the grandmother -- solve the same task LLMs are solving, to at least the LLM's performance level -- Eliezer-2007 would have unhesitatingly answered "yes" and indeed "OBVIOUSLY yes".

How is this coherent?  Because the post's point is about how much information needs to get into the preference function.  To predict a human response pattern you need (only) epistemic knowledge.  This is part of why the post is about needing to give specifications to an Outcome Pump, rather than it depicting an AI being surprised by its continually incorrect predictions about a human response pattern.

If you don't see any important distinction between the two, then of course you'll think that it's incoherent to talk about that distinction.  But even if you think that Hume was mistaken about there existing any sort of interesting gap between 'is' and 'ought', you might by some act of empathy be able to imagine that other people think there's an interesting subject matter there, and they are trying to talk about it with you; otherwise you will just flatly misunderstand what they were trying to say, and mispredict their future utterances.  There's a difference between disagreeing with a point, and just flatly failing to get it, and hopefully you aspire to the first state of mind rather than the second.

Have we learned anything stunningly hopeful from modern pre-AGIs getting down part of the epistemic part of the problem at their current ability levels, to the kind of resolution that this post talked about in 2007?  Or from it being possible to cajole pre-AGIs with loss functions into willingly using that knowledge to predict human text outputs?  Some people think that this teaches us that alignment is hugely easy.  I think they are mistaken, but that would take its own post to talk about.

But people who point to "The Hidden Complexity of Wishes" and say of it that it shows that I had a view which the current evidence already falsifies -- that I predicted that no AGI would ever be able to predict human response patterns about getting grandmothers out of burning buildings -- have simply: misunderstood what the post is about, not understood in particular why the post is about an Outcome Pump rather than an AI stupidly mispredicting human responses, and failed to pick up on the central point that Eliezer expects superintelligences to be smart in the sense of making excellent purely epistemic predictions.

Replies from: cubefox, matthew-barnett, Zahima
comment by cubefox · 2024-02-15T19:14:26.401Z · LW(p) · GW(p)

I'm well aware of and agree there is a fundamental difference between knowing what we want and being motivated to do what we want. But as I wrote in the first paragraph:

Already LaMDA or InstructGPT (language models fine-tuned with supervised learning to follow instructions, essentially ChatGPT without any RLHF applied), are in fact pretty safe Oracles in regard to fulfilling wishes without misinterpreting you, and an Oracle AI is just a special kind of Genie whose actions are restricted to outputting text. If you tell InstructGPT what you want, it will very much try to give you just what you want, not something unintended, at least if it can be produced using text.

That is, instruction-tuned language models do not just understand (epistemics) what we want them to do, they additionally, to a large extent, do what we want them to do. They are good at executing our instructions. Not just at understanding our instructions but then doing something unintended.

(However, I agree they are probably not perfect at executing our instructions as we intended them. We might ask them to answer to the best of their knowledge, and they may instead answer with something that "sounds good" but is not what they in fact believe. Or, perhaps, as Gwern pointed out, they exhibit things like a strange tendency to answer our request for a non-rhyming poem with a rhyming poem, even though they may be well-aware, internally, that this isn't what was requested.)

comment by Matthew Barnett (matthew-barnett) · 2024-02-15T20:15:41.740Z · LW(p) · GW(p)

I agree with cubefox: you seem to be misinterpreting the claim that LLMs actually execute your intended instructions as a mere claim about whether LLMs understand your intended instructions. I claim there is a sharp distinction between actually executing instructions under a correct, legible interpretation, on the one hand, and merely understanding those instructions, on the other; LLMs do the former, not merely the latter.

Honestly, I think focusing on this element of the discussion is kind of a distraction because, in my opinion, the charitable interpretation of your posts is simply that you never thought that it would be hard to get AIs to exhibit human-level reasonableness at interpreting and executing tasks until AIs reach a certain capability level, and the threshold at which these issues were predicted to arise was always intended to be very far above GPT-4-level. This interpretation of your argument is plausible based on what you wrote, and could indeed save your theory from empirical falsification based on our current observations.

That said, if you want to go this route, and argue that "complexity of wishes"-type issues will eventually start occurring at some level of AI capability, I think it would be beneficial for you to clarify exactly what level you empirically expect we'll start having the issues of misinterpretation you described. For example, would either of the following observations contradict your theory of alignment?

  1. At some point there's a multimodal model that is roughly as intelligent as a 99th percentile human on virtual long-horizon tasks (e.g. it can learn how to play Minecraft well after a few hours of in-game play, can work in a variety of remote jobs, and has the ability to pursue coherent goals over several months) and yet this model allows you to shut it off, modify its weights, or otherwise change its mode of operation arbitrarily i.e. it's corrigible, in a basic sense. Moreover, the model generally executes our instructions as intended, without any evidence of blatant instruction-misinterpretation or disobedience, before letting us shut it down.
  2. AIs are widely deployed across the economy to automate a wide range of labor, including the task of scientific research. This has the effect of accelerating technological progress, prompting the development of nanotechnology that is sophisticated enough to allow for the creation of strawberries that are identical on the cellular but not molecular level. As a result, you can purchase such strawberries at a store, and we haven't all died yet despite these developments.
comment by Casey B. (Zahima) · 2024-02-19T20:45:21.276Z · LW(p) · GW(p)

The old paradox: to care it must first understand, but to understand requires high capability, capability that is lethal if it doesn't care.

But it turns out we have understanding before lethal levels of capability. So now such understanding can be a target of optimization. There is still significant risk, since there are multiple possible internal mechanisms/strategies the AI could be deploying to reach that same target. Deception, actual caring, something I've been calling detachment, and possibly others. 

This is where the discourse should be focusing on, IMO. This is the update/direction I want to see you make. The sequence of things being learned/internalized/chiseled is important. 

My imagined Eliezer has many replies to this, with numerous branches in the dialogue/argument tree which I don't want to get into now. But this *first step* towards recognizing the new place we are in, specifically wrt the ability to target human values (whether for deceptive, disinterested, detached, or actual caring reasons!), needs to be taken imo, rather than repeating this line of "of course I understood that a superint would understand human values; this isn't an update for me". 

(edit: My comments here are regarding the larger discourse, not just this specific post or reply-chain) 

comment by Thomas Kwa (thomas-kwa) · 2024-01-24T01:03:28.845Z · LW(p) · GW(p)

I think this post is wrong because it was written before quantilizers were known. The base rate of people being rescued from burning buildings is much higher than the rate of buildings exploding and hurling people out of them, so the Outcome Pump will only explode the building if the function strongly favors that over your mother being rescued. Even 99% reset probability is not enough to explode the building, unless it was already likely to explode.

It may be that setting Pr(reset) to make most outcomes vastly unlikely, like Pr(reset) = [0 if distance(mother, building, 5 seconds from now) > 100 meters else 0.999999], causes some weird outcome like exploding the building. But allowing likely outcomes, e.g. Pr(reset) = [0 if distance(mother, building, 20 minutes from now) > 100 meters else 0.999], probably saves her, unless this was super unlikely to happen in the first place, in which case she jumps out and her dead body is carried away or something.

Basically, this post implies that all wishes are unsafe. But only wishes with very low prior probability are unsafe.
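The conditioning argument above can be sketched as a toy calculation. Everything here is an assumption made for illustration: the three scenarios, their base rates, and the function names are invented, and the pump is modeled as plain rejection sampling (rerun until no reset), which is one reading of the post's description rather than an established one.

```python
import random

# Hypothetical base rates for what happens to the mother (made-up numbers).
BASE_RATES = {
    "rescued_by_firefighters": 0.30,
    "explosion_throws_her_clear": 0.0001,
    "stays_inside": 0.6999,
}

# Whether each outcome leaves her more than 100 meters from the building.
FAR_FROM_BUILDING = {
    "rescued_by_firefighters": True,
    "explosion_throws_her_clear": True,
    "stays_inside": False,
}

def reset_probability(outcome, p_reset_if_near=0.999):
    """Pr(reset) = 0 if she ends up far from the building, else p_reset_if_near."""
    return 0.0 if FAR_FROM_BUILDING[outcome] else p_reset_if_near

def conditioned_distribution(base_rates, p_reset_if_near=0.999):
    """Exact distribution over timelines that survive the pump: weight each
    outcome by base rate times its chance of NOT being reset, then renormalize."""
    weights = {
        outcome: p * (1.0 - reset_probability(outcome, p_reset_if_near))
        for outcome, p in base_rates.items()
    }
    total = sum(weights.values())
    return {outcome: w / total for outcome, w in weights.items()}

def run_pump(base_rates, rng, p_reset_if_near=0.999):
    """Rejection-sampling version: draw an outcome, reset with Pr(reset), repeat."""
    outcomes = list(base_rates)
    probs = [base_rates[o] for o in outcomes]
    while True:
        outcome = rng.choices(outcomes, weights=probs)[0]
        if rng.random() >= reset_probability(outcome, p_reset_if_near):
            return outcome

if __name__ == "__main__":
    for outcome, p in conditioned_distribution(BASE_RATES).items():
        print(f"{outcome}: {p:.6f}")
    print("one sampled run:", run_pump(BASE_RATES, random.Random(0)))
```

In this toy model the ratio between "rescued" and "explosion" is untouched by conditioning, because both satisfy the wish equally; only the unsatisfying outcome is suppressed. Whether human-scale base rates are the right distribution to plug in is a separate question that the sketch does not settle.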

Replies from: habryka4, Lblack
comment by habryka (habryka4) · 2024-01-24T01:38:19.120Z · LW(p) · GW(p)

I don't understand this. The post makes a reference to the Open-Source Wish Project, whose description says: 

The goal of the Open-Source Wish Project is to create perfectly-worded wishes, so that when the genie comes and grants us our wish we can get precisely what we want. The genie, of course, will attempt to interpret the wish in the most malicious way possible, using any loophole to turn our wish into a living hell. The Open-Source Wish Project hopes to use the collective wisdom of all humanity to create wishes with no loopholes whatsoever.

The post is about how to phrase wishes in the context of something that is actively interested in subverting them. 

Even beyond that, I think "prior probability of a thing happening" is one kind of outcome pump, but the post does not specify that as the kind of outcome pump it's talking about. "Minimal matter that needs to be modified", "Minimal energy expenditure" or "complicated alien set of preferences that will be maximized along with your wish" are also reasonable priors for outcome pumps. 

I agree that I wish the post was clearer that certain kinds of outcome pumps might be fine, but I don't understand the basis for saying the post is false, especially given the explicit reference to the Open-Source Wish Project which directly specifies they are dealing with a malicious genie. 

Replies from: ricraz, thomas-kwa
comment by Richard_Ngo (ricraz) · 2024-01-24T01:47:30.737Z · LW(p) · GW(p)

The outcome pump is defined in a way that excludes the possibility of active subversion: it literally just keeps rerunning until the outcome is satisfied, which is a way of sampling based on (some kind of) prior probability. Yudkowsky is arguing that this is equivalent to a malicious genie. But this is a claim that can be false.

In this specific case, I agree with Thomas that whether or not it's actually false will depend on the details of the function: "The further she gets from the building's center, the less the time machine's reset probability." But there's probably some not-too-complicated way to define it which would render the pump safe-ish (since this was a user-defined function).

Replies from: habryka4
comment by habryka (habryka4) · 2024-01-24T02:03:04.787Z · LW(p) · GW(p)

Ah, rereading the post I think you are right: 

The Outcome Pump is not sentient.  It contains a tiny time machine, which resets time unless a specified outcome occurs.  For example, if you hooked up the Outcome Pump's sensors to a coin, and specified that the time machine should keep resetting until it sees the coin come up heads, and then you actually flipped the coin, you would see the coin come up heads.  (The physicists say that any future in which a "reset" occurs is inconsistent, and therefore never happens in the first place - so you aren't actually killing any versions of yourself.)

Whatever proposition you can manage to input into the Outcome Pump, somehow happens, though not in a way that violates the laws of physics.  If you try to input a proposition that's too unlikely, the time machine will suffer a spontaneous mechanical failure before that outcome ever occurs.

I find this a bit confusing to think about. In a classical universe this machine is impossible. It seems like this basically relies on quantum uncertainty. The resulting probability distribution of events will definitely not reflect your prior probability distribution, so I think Thomas' argument still doesn't go through. The best guess I have is that it would reflect the shape of the quantum wave-function. 

My guess is at a practical level this ends up kind of close to "particles being moved the minimum necessary distance to achieve the outcome", which I think would generally favor outcomes like "the building explodes". I definitely don't think it would favor outcomes like "the fire department arrives 5 minutes earlier" since any macro-level events like that would likely require sampling from much lower amplitude parts of the wave-function (or something, this also doesn't seem super-compatible with an Everett-interpretation of quantum mechanics, but I can kind of squint and make it work with a Copenhagen-interpretation model).

So I do think I was wrong about Eliezer not specifying how the outcome pump works, but I think his specification still suggests that the result would definitely not be anywhere close to sampling from your prior (which I think might result in reasonable outcome), but would involve some pretty intense maximization and unintended outcomes as you start to put constraints on that prior.

Replies from: ricraz
comment by Richard_Ngo (ricraz) · 2024-01-24T02:13:27.504Z · LW(p) · GW(p)

The resulting probability distribution of events will definitely not reflect your prior probability distribution, so I think Thomas' argument still doesn't go through. It will reflect the shape of the wave-function. 

This is a good point. But I don't think "particles being moved the minimum necessary distance to achieve the outcome" actually favors explosions. I think it probably favors the sensor hardware getting corrupted, or it might actually favor messing with the firemen's brains to make them decide to come earlier (or messing with your mother's brain to make her jump out of the building), because both of these are highly sensitive systems where small changes can have large effects.

Does this undermine the parable? Kinda, I think. If you built a machine that samples from some bizarre inhuman distribution, and then you get bizarre outcomes, then the problem is not really about your wish any more, the problem is that you built a weirdly-sampling machine. (And then we can debate about the extent to which NNs are weirdly-sampling machines, I guess.)

Replies from: habryka4
comment by habryka (habryka4) · 2024-01-24T02:31:47.879Z · LW(p) · GW(p)

Does this undermine the parable? Kinda, I think. If you built a machine that samples from some bizarre inhuman distribution, and then you get bizarre outcomes, then the problem is not really about your wish any more, the problem is that you built a weirdly-sampling machine. (And then we can debate about the extent to which NNs are weirdly-sampling machines, I guess.)

This is roughly how I would interpret the post. Physics itself is a bizarre inhuman distribution, and in general many probability distributions from which you might want to sample will be bizarre and inhuman. 

Agree that it's then arguable to what degree the optimization pressure of a mature AGI arising from NNs would also be bizarre. My guess is quite bizarre, since a lot of the constraints it will face will be constraints of physics. 

comment by Thomas Kwa (thomas-kwa) · 2024-01-24T02:00:49.248Z · LW(p) · GW(p)

Even beyond that, I think "prior probability of a thing happening" is one kind of outcome pump, but the post does not specify that as the kind of outcome pump it's talking about.

Disagree. The Outcome Pump is explicitly described as conditioning the future trajectory of the universe according to the reset function:

The Outcome Pump is not sentient.  It contains a tiny time machine, which resets time unless a specified outcome occurs.  For example, if you hooked up the Outcome Pump's sensors to a coin, and specified that the time machine should keep resetting until it sees the coin come up heads, and then you actually flipped the coin, you would see the coin come up heads.  (The physicists say that any future in which a "reset" occurs is inconsistent, and therefore never happens in the first place - so you aren't actually killing any versions of yourself.)

Also because the Outcome Pump is not sentient, it cannot be actively interested in subverting your wish. Eliezer claims "The Outcome Pump is a genie of the second class.  No wish is safe.", implying that the subversion effect will happen even with the non-sentient, quantilizer-like Outcome Pump. It may happen that future AIs are unsafe, but this will be because they apply too much optimization.

Replies from: habryka4
comment by habryka (habryka4) · 2024-01-24T02:05:51.308Z · LW(p) · GW(p)

Yeah, see my response to Richard. I was wrong about the Outcome Pump not being specified, but think that your use of "probability" in the top-level comment is still wrong. Clearly the outcome pump would not sample from your prior over likely events. 

It would sample from some universal prior over events (this is playing fast-and-loose with quantum mechanics, but a reasonable interpretation might be sampling from the quantum wave-function, if you take a more Copenhagen perspective). Almost any universal prior here would be very oddly shaped, so that indeed you would observe the kinds of things that Eliezer is talking about.

Replies from: thomas-kwa
comment by Thomas Kwa (thomas-kwa) · 2024-01-24T02:25:54.703Z · LW(p) · GW(p)

I thought it was sampling from the quantum wavefunction, and still I think my argument works, unless this was a building that was basically deterministically going to kill your mother if you run physics from that point forward, or already had hazardous materials with a significant chance of exploding. I agree that you can't use your own prior probabilities.

Maybe I'm wrong about how much quantum randomness can influence events at a 5 minute timescale and the universe is actually very deterministic? If it's very little such that you have to condition very hard to get anything to happen, then maybe the building does explode, but I'm not really sure what would happen.

Replies from: habryka4
comment by habryka (habryka4) · 2024-01-24T02:29:41.062Z · LW(p) · GW(p)

As I said, the best approximation I have is "move particles the smallest joint distance from my highest prior configuration". Some particles are in people's brains, but changing people's beliefs or intentions seems like it's very unlikely to happen via this operation, since my guess is the brain is highly redundant and works on ion channels that would require actually a quite substantial amount of matter to be displaced (comparatively). Very locally causing a chemical chain reaction somewhere seems easier, though that's just a guess.

I am not really sure what happens here, since I think overall physics is highly deterministic even taking into account quantumness, and my guess is for a macro-level outcome here you would need to go very quickly into astronomically low probabilities if you sample from the wave-function, and I don't trust my reasoning for what happens in 0.00000000000000000000001% scenarios.

My best guess is something pretty close to what Eliezer describes happens, but I couldn't prove it to you.

Replies from: ricraz
comment by Richard_Ngo (ricraz) · 2024-01-25T01:25:38.087Z · LW(p) · GW(p)

my guess is the brain is highly redundant and works on ion channels that would require actually a quite substantial amount of matter to be displaced (comparatively)

Neurons are very small, though, compared with the size of a hole in a gas pipe that would be necessary to cause an explosive gas leak. (Especially because you then can't control where the gas goes after leaking, so it could take a lot of intervention to give the person a bunch of away-from-building momentum.)

I would probably agree with you if the building happened to have a ton of TNT sitting around in the basement.

Replies from: habryka4
comment by habryka (habryka4) · 2024-01-25T02:15:51.080Z · LW(p) · GW(p)

Oh, I was definitely not thinking of a hole in a gas pipe. I was expecting something much much subtler than that (more like very highly localized temperature-increases which then chain-react). You are dealing with omniscient levels of consequence-control here.

comment by Lucius Bushnaq (Lblack) · 2024-02-23T22:10:08.109Z · LW(p) · GW(p)

I figured the probability adjustments the pump was making were modifying Everett branch amplitude ratios. Not probabilities as in reasoning tools to deal with incomplete knowledge of the world and logical uncertainty that tiny human brains use to predict how this situation might go based on looking at past 'base rates'. It's unclear to me how you could make the latter concept of an outcome pump a coherent thing at all. The former, on the other hand, seems like the natural outcome of the time machine setup described. If you turn back time when the branch doesn't have the outcome you like, only branches with the outcome you like will remain.

I can even make up a physically realisable model of an outcome pump that acts roughly like the one described in the story without using time travel at all. You just need a bunch of high quality sensors to take in data, an AI that judges from the observed data whether the condition set is satisfied, a tiny quantum random noise generator to respect the probability orderings desired, and a false vacuum bomb, which triggers immediately if the AI decides that the condition does not seem to be satisfied. The bomb works by causing a local decay of the metastable[1] electroweak vacuum. This is a highly energetic, self-sustaining process once it gets going, and spreads at the speed of light. Effectively destroying the entire future light-cone, probably not even leaving the possibility for atoms and molecules to ever form again in that volume of space.[2]

So when the AI triggers the bomb or turns back time, the amplitude of earth in that branch basically disappears. Leaving the users of the device to experience only the branches in which the improbable thing they want to have happen happens.

And causing a burning building with a gas supply in it to blow up strikes me as something you can maybe do with a lot less random quantum noise than making your mother phase through the building. Firefighter brains are maybe comparatively easy to steer with quantum noise as well, but that only works if there are any physically nearby enough to reach the building in time to save your mother at the moment the pump is activated. 

This is also why the pump has a limit on how improbable an event it can make happen. If the event has an amplitude of roughly the same size as the amplitude for the pump's sensors reporting bad data or otherwise causing the AI to make the wrong call, the pump will start being unreliable. If the event's amplitude is much lower than the amplitude for the pump malfunctioning, it basically can't do the job at all.
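The reliability limit described above can be sketched as a simple conditional probability. This is a toy model under stated assumptions, not a physical calculation: branch weights are treated as ordinary probabilities, the genuine event and a false "condition satisfied" reading are treated as mutually exclusive, and the function name and example numbers are made up.

```python
def pump_reliability(p_event: float, p_malfunction: float) -> float:
    """Fraction of surviving branches in which the wished-for event really
    happened, given that every surviving branch has the pump reporting success.

    p_event:        weight of branches where the event genuinely occurs
    p_malfunction:  weight of branches where the sensors or judge wrongly
                    report success (assumed disjoint from p_event)
    """
    if p_event < 0 or p_malfunction < 0:
        raise ValueError("weights must be non-negative")
    total = p_event + p_malfunction
    if total == 0:
        raise ValueError("no surviving branches at all")
    return p_event / total

if __name__ == "__main__":
    for p_event in (1e-3, 1e-6, 1e-12):
        r = pump_reliability(p_event, p_malfunction=1e-6)
        print(f"event weight {p_event:g}: reliability {r:.6f}")
```

When the two weights are comparable the pump is right only about half the time, and when the event is far less likely than a malfunction almost every surviving branch is a false positive, matching the qualitative claim above.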

  1. ^

    In real life, it was an open question whether our local electroweak vacuum is in a metastable state last I checked, with the latest experimental evidence I'm aware of, from a couple of years ago, tentatively (ca. 3 sigma I think?) pointing to yes, though that calculation probably assumes Standard Model physics, the applicability of which people can argue to hell and back. But it sure seems like a pretty self-consistent way for the world to be, so we can just declare that the fictional universe works like that. Substitute strangelets or any other conjectured instant-earth-annihilation-method of your choice if you like.

  2. ^

    Because the mass terms for the elementary quantum fields would look all different now. Unclear to me that the bound structures of hadronic matter we are familiar with would still be a thing. 

comment by Matthew Barnett (matthew-barnett) · 2024-10-17T02:40:37.908Z · LW(p) · GW(p)

It has come to my attention that this article is currently being misrepresented as proof that I/MIRI previously advocated that it would be very difficult to get machine superintelligences to understand or predict human values. This would obviously be false, and also, is not what is being argued below. The example in the post below is not about an Artificial Intelligence literally at all! If the post were about what AIs supposedly can't do, the central example would have used an AI! The point that is made below will be about the algorithmic complexity of human values. This point is relevant within a larger argument, because it bears on the complexity of what you need to get an artificial superintelligence to want or value; rather than bearing on what a superintelligence supposedly could not predict or understand. -- EY, May 2024.

I can't tell whether this update to the post is addressed towards me. However, it seems possible that it is addressed towards me, since I wrote a post last year [LW · GW] criticizing some of the ideas behind this post. In either case, whether it's addressed towards me or not, I'd like to reply to the update.

For the record, I want to definitively clarify that I never interpreted MIRI as arguing that it would be difficult to get a machine superintelligence to understand or predict human values. That was never my thesis, and I spent considerable effort clarifying the fact that this was not my thesis in my post, stating multiple times that I never thought MIRI predicted it would be hard to get an AI to understand human values.

My thesis instead was about a subtly different thing, which is easy to misinterpret if you aren't reading carefully. I was talking about something which Eliezer called the "value identification problem", and which had been referenced on Arbital, and in other essays by MIRI, including under a different name than the "value identification problem". These other names included the "value specification" problem and the problem of "outer alignment" (at least in narrow contexts).

I didn't expect as much confusion at the time when I wrote the post, because I thought clarifying what I meant and distinguishing it from other things that I did not mean multiple times would be sufficient to prevent rampant misinterpretation by so many people. However, evidently, such clarifications were insufficient, and I should have instead gone overboard in my precision and clarity. I think if I re-wrote the post now, I would try to provide like 5 different independent examples demonstrating how I was talking about a different thing than the problem of getting an AI to "understand" or "predict" human values.

At the very least, I can try now to give a bit more clarification about what I meant, just in case doing this one more time causes the concept to "click" in someone's mind:

Eliezer doesn't actually say this in the above post, but his general argument expressed here and elsewhere seems to be that the premise "human value is complex" implies the conclusion: "therefore, it's hard to get an AI to care about human value". At least, he seems to think that this premise makes this conclusion significantly more likely.[1]

This seems to be his argument, as otherwise it would be unclear why Eliezer would bring up "complexity of values" in the first place. If the complexity of values had nothing to do with the difficulty of getting an AI to care about human values, then it is baffling why he would bring it up. Clearly, there must be some connection, and I think I am interpreting the connection made here correctly.

However, suppose you have a function that inputs a state of the world and outputs a number corresponding to how "good" the state of the world is. And further suppose that this function is transparent, legible, and can actually be used in practice to reliably determine the value of a given world state. In other words, you can give the function a world state, and it will spit out a number, which reliably informs you about the value of the world state. I claim that having such a function would simplify the AI alignment problem by reducing it from the hard problem of getting an AI to care about something complex (human value) to the easier problem of getting the AI to care about that particular function (which is simple, as the function can be hooked up to the AI directly).

In other words, if you have a solution to the value identification problem (i.e., you have the function that correctly and transparently rates the value of world states, as I just described), this almost completely sidesteps the problem that "human value is complex and therefore it's difficult to get an AI to care about human value". That's because, if we have a function that directly encodes human value, and can be simply referenced or directly inputted into a computer, then all the AI needs to do is care about maximizing that function rather than maximizing a more complex referent of "human values". The pointer to "this function" is clearly simple, and in any case, simpler than the idea of all of human value.
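As a rough illustration of what "caring about that particular function" might look like structurally, here is a minimal sketch. The names `pick_best_state` and `toy_value`, the dictionary representation of world states, and the three candidate states are all hypothetical; `toy_value` is a deliberately crude stand-in and not a proposal for the actual value function.

```python
from typing import Callable, Iterable, TypeVar

WorldState = TypeVar("WorldState")

def pick_best_state(
    candidates: Iterable[WorldState],
    value: Callable[[WorldState], float],
) -> WorldState:
    """Return the candidate world state that the supplied value function
    rates highest. All the complexity lives inside `value`."""
    return max(candidates, key=value)

def toy_value(state: dict) -> float:
    """Hypothetical stand-in for a value function; it checks only two facts."""
    return 1.0 if state["mother_outside"] and state["mother_alive"] else 0.0

candidates = [
    {"label": "building explodes", "mother_outside": True, "mother_alive": False},
    {"label": "firefighters carry her out", "mother_outside": True, "mother_alive": True},
    {"label": "nothing happens", "mother_outside": False, "mother_alive": True},
]

best = pick_best_state(candidates, toy_value)
print(best["label"])
```

The selection rule stays one line however complicated `value` becomes, which is the sense in which the pointer to the function is simple. The sketch takes for granted both that a reliable `value` exists and that the system really optimizes it, and those are the points under dispute in this thread.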

(This was supposed to narrowly reply to MIRI, by the way. If I were writing a more general point about how LLMs were evidence that alignment might be easy, I would not have focused so heavily on the historical questions about what people said, and I would have instead made simpler points about how GPT-4 seems to straightforwardly try to do what you want, when you tell it to do things.)

My main point was that I thought recent progress in LLMs had demonstrated progress at the problem of building such a function, and solving the value identification problem, and that this progress goes beyond the problem of getting an AI to understand or predict human values. For one thing, an AI that merely understands human values will not necessarily act as a transparent, legible function that will tell you the value of any outcome. However, by contrast, solving the value identification problem would give you such a function. This strongly distinguishes the two problems. These problems are not the same thing. I'd appreciate if people stopped interpreting me as saying one thing when I clearly meant another, separate thing.

  1. ^

    This interpretation is supported by the following quote, on Arbital,

    Complexity of value is a further idea above and beyond the orthogonality thesis which states that AIs don't automatically do the right thing and that we can have, e.g., paperclip maximizers. Even if we accept that paperclip maximizers are possible, and simple and nonforced, this wouldn't yet imply that it's very difficult to make AIs that do the right thing. If the right thing is very simple to encode - if there are value optimizers that are scarcely more complex than diamond maximizers - then it might not be especially hard to build a nice AI even if not all AIs are nice. Complexity of Value is the further proposition that says, no, this is forseeably quite hard - not because AIs have 'natural' anti-nice desires, but because niceness requires a lot of work to specify. [emphasis mine]

Replies from: TsviBT, Seth Herd, Raemon, Maxc, Eliezer_Yudkowsky
comment by TsviBT · 2024-10-17T12:33:05.891Z · LW(p) · GW(p)

Alice: I want to make a bovine stem cell that can be cultured at scale in vats to make meat-like tissue. I could use directed evolution. But in my alternate universe, genome sequencing costs $1 billion per genome, so I can't straightforwardly select cells to amplify based on whether their genome looks culturable. Currently the only method I have is to do end-to-end testing: I take a cell line, I try to culture a great big batch, and then see if the result is good quality edible tissue, and see if the cell line can last for a year without mutating beyond repair. This is very expensive, but more importantly, it doesn't work. I can select for cells that make somewhat more meat-like tissue; but when I do that, I also heavily select for other very bad traits, such as forming cancer-like growths. I estimate that it takes on the order of 500 alleles optimized relative to the wild type to get a cell that can be used for high-quality, culturable-at-scale edible tissue. Because that's a large complex change, it won't just happen by accident; something about our process for making the cells has to put those bits there.

Bob: In a recent paper, a polygenic score for culturable meat is given. Since we now have the relevant polygenic score, we actually have a short handle for the target: namely, a pointer to an implementation of this polygenic score as a computer program.

Alice: That seems of limited relevance. It's definitely relevant in that, if I grant the premise that this is actually the right polygenic score (which I don't), we now know what exactly we would put in the genome if we could. That's one part of the problem solved, but it's not the part I was talking about. I'm talking about the part where I don't know how to steer the genome precisely enough to get anywhere complex.

Bob: You've been bringing up the complexity of the genomic target. I'm saying that actually the target isn't that complex, because it's just a function call to the PGS.

Alice: Ok, yes, we've greatly decreased the relative algorithmic complexity of the right genome, in some sense. It is indeed the case that if I ran a computer program randomly sampled from strings I could type into a python file, it would be far more likely to output the right genome if I have the PGS file on my computer compared to if I don't. True. But that's not very relevant because that's not the process we're discussing. We're discussing the process that creates a cell with its genome, not the process that randomly samples computer programs weighted by [algorithmic complexity in the python language on my computer]. The problem is that I don't know how to interface with the cell-creation process in a way that lets me push bits of selection into it. Instead, the cell-creation process just mostly does its own thing. Even if I do end-to-end phenotype selection, I'm not really steering the core process of cell-genome-selection.

Bob: I understand, but you were saying that the complexity of the target makes the whole task harder. Now that we have the PGS, the target is not very complex; we just point at the PGS.

Alice: The point about the complexity is to say that cells growing in my lab won't just spontaneously start having the 500 alleles I want. I'd have to do something to them--I'd have to know how to pump selection power into them. It's some specific technique I need to have but don't have, for dealing with cells. It doesn't matter that the random-program complexity has decreased, because we're not talking about random programs, we're talking about cell-genome-selection. Cell-genome-selection is the process where I don't know how to consistently pump bits into, and it's the process that doesn't by chance get the 500 alleles. It's the process against which I'm measuring complexity.
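One way to put Alice's distinction in symbols, as a rough sketch rather than anything stated in the dialogue: the symbols g^*, K, and P_cell below are labels introduced here for illustration. Having the PGS file lowers the description length of the target genome relative to Alice's computer, but the quantity she is measuring against is how unlikely that genome is under the cell-creation process itself.

```latex
% g^*              : the target genome (hypothetical symbol)
% K(\cdot)         : algorithmic complexity relative to the Python setup on Alice's computer
% P_{\text{cell}}  : the distribution over genomes induced by the cell-creation process
K(g^* \mid \mathrm{PGS}) \ll K(g^*)
\qquad \text{but} \qquad
-\log_2 P_{\text{cell}}(g^*) \ \text{is unchanged by having the PGS file.}
```

On this reading, Bob is pointing at the left-hand inequality, while Alice's "500 alleles" claim is about the right-hand quantity, which only shrinks if she gains a technique for pushing selection into the cells.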

Replies from: Seth Herd
comment by Seth Herd · 2024-10-17T18:39:45.612Z · LW(p) · GW(p)

This analogy is valid in the case where we have absolutely no idea how to use a system's representations or "knowledge" to direct an AI's behavior. That is the world Yudkowsky wrote the sequences in. It is not the world we currently live in. There are several, perhaps many, plausible plans to direct a competent AGI's actions and its "thoughts" and "values" toward either its own or a subsystem's "understanding" of human values. See Goals selected from learned knowledge: an alternative to RL alignment [AF · GW] for some of those plans. Critiques need to go beyond the old "we have no idea" argument and actually address the ideas we have.

Replies from: sharmake-farah, TsviBT
comment by Noosphere89 (sharmake-farah) · 2024-10-17T18:50:42.349Z · LW(p) · GW(p)

This.

I'm not sure you could have been as confident as Yudkowsky was at the time, but yeah, in the epistemic state of 2008 there was a serious probability that human values were so complicated, and that simple techniques would make AIs Goodhart so completely on the intended task, that controlling smart AI was essentially hopeless.

We now know that a lot of the old LessWrong lore on how complicated human values and wishes are, at least in the code section, is either incorrect or irrelevant, and we also know that the standard LW story of how humans came to dominate other animals is incorrect to a degree that impacts AI alignment.

I have my own comments on the ideas below, but people really should try to update on the evidence we gained from LLMs, as we learned a lot about ourselves and LLMs in the process, because there's a lot of evidence that generalizes from LLMs to future AGI/ASI, and IMO LW updated way, way too slowly on AI safety.

https://www.lesswrong.com/posts/83TbrDxvQwkLuiuxk/?commentId=BxNLNXhpGhxzm7heg [LW · GW]

https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#4yXqCNKmfaHwDSrAZ [LW(p) · GW(p)] (This is more of a model-based RL approach to alignment)

https://www.lesswrong.com/posts/wkFQ8kDsZL5Ytf73n/my-disagreements-with-agi-ruin-a-list-of-lethalities#dyfwgry3gKRBqQzoW [LW(p) · GW(p)]

https://www.lesswrong.com/posts/wkFQ8kDsZL5Ytf73n/my-disagreements-with-agi-ruin-a-list-of-lethalities#7bvmdfhzfdThZ6qck [LW(p) · GW(p)]

comment by TsviBT · 2024-10-17T18:55:14.084Z · LW(p) · GW(p)

That's incorrect, but more importantly it's off topic. The topic is "what does the complexity of value have to do with the difficulty of alignment". Barnett AFAIK in this comment is not saying (though he might agree, and maybe he should be taken as saying so implicitly or something) "we have lots of ideas for getting an AI to care about some given values". Rather he's saying "if you have a simple pointer to our values, then the complexity of values no longer implies anything about the difficulty of alignment because values effectively aren't complex anymore".

comment by Seth Herd · 2024-10-17T18:51:25.756Z · LW(p) · GW(p)

I think this is worth a new top-level post. I think the discussion on your Evaluating the historical value misspecification argument [LW · GW] was a high-water mark for resolving the disagreement on alignment difficulty between old-schoolers and new prosaic alignment thinkers. But that discussion didn't make it past the point you raise here: if we can identify human values, shouldn't that help (a lot) in making an AGI that pursues those values?

One key factor is whether the understanding of human values is available while the AGI is still dumb enough to remain in your control.

I tried to progress this line of discussion in my The (partial) fallacy of dumb superintelligence [LW · GW] and Goals selected from learned knowledge: an alternative to RL alignment [AF · GW].

comment by Raemon · 2024-10-17T20:35:22.536Z · LW(p) · GW(p)

a) I think at least part of what's gone on is that Eliezer has been misunderstood and facing the same actually quite dumb arguments a lot, and he is now (IMO) too quick to round new arguments off to something he's got cached arguments for. (I'm not sure whether this is exactly what went on in this case, but seems plausible without carefully rereading everything)

b) I do think when Eliezer wrote this post, there were literally a bunch of people making quite dumb arguments that were literally "the solution to AI ethics/alignment is [my preferred elegant system of ethics] / [just have it track smiling faces] / [other explicit hardcoded solutions that were genuinely impractical]"

I think I personally also did not get what you were trying to say for a while, so I don't think the problem here is just Eliezer (although it might be me making a similar mistake to what I hypothesize Eliezer to have made, for reasons that are correlated with him)

I do generally think a criticism I have of Eliezer is that he has spent comparatively too much time focused on the dumber 3/4 of arguments, instead of engaging directly with top critics, who are often actually making more subtle points (and has been a bit too slow to update that this is what's going on)

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2024-10-18T17:02:20.943Z · LW(p) · GW(p)

Wish there was a system where people could pay money to bid up what they believed were the "top arguments" that they wanted me to respond to.  Possibly a system where I collect the money for writing a diligent response (albeit note that in this case I'd weigh the time-cost of responding as well as the bid for a response); but even aside from that, some way of canonizing what "people who care enough to spend money on that" think are the Super Best Arguments That I Should Definitely Respond To.  As it stands, whatever I respond to, there's somebody else to say that it wasn't the real argument, and this mainly incentivizes me to sigh and go on responding to whatever I happen to care about more.

(I also wish this system had been in place 24 years ago so you could scroll back and check out the wacky shit that used to be on that system earlier, but too late now.)

Replies from: Raemon, christopher-king
comment by Raemon · 2024-10-19T22:29:26.492Z · LW(p) · GW(p)

I do think such a system would be really valuable, and is the sort of thing the LW team should try to build. (I'm mostly not going to respond to this idea right now but I've filed it away as something to revisit more seriously with Lightcone. Seems straightforwardly good)

But it feels slightly orthogonal to what I was trying to say. Let me try again.

(this is now officially a tangent from the original point, but it feels important to me)

It would be good if the world could (deservedly) trust, that the best x-risk thinkers have a good group epistemic process for resolving disagreements.

At least two steps that seem helpful for that process are:

  • Articulating clear lists of the best arguments, such that people can prioritize refuting them (or updating on them).
  • But, before that, there is a messier process of "people articulating half formed versions of those arguments, struggling to communicate through different ontologies, being slightly confused." And there is some back-and-forth process typically needed to make progress.

It is that "before" step where it feels like things seem to be going wrong, to me. (I haven't re-read Matthew's post or your response comment [LW(p) · GW(p)] from a year ago in enough detail to have a clear sense of what, if anything, went wrong. But to illustrate the ontology: I think that instance was roughly in the liminal space between the two steps)

Half-formed confused arguments in different ontologies are probably "wrong", but that isn't necessarily because they are completely stupid, it can be because they are half-formed. And maybe the final version of the argument is good, or maybe not, but it's at least a less stupid version of that argument. And if Alice rejects a confused, stupid argument in a loud way, without understanding the generator that Bob was trying to pursue, Bob's often rightly annoyed that Alice didn't really hear them and didn't really engage.

Dealing with confused half-formed arguments is expensive, and I'm not sure it's worth people's time, especially given that confused half-formed arguments are hard to distinguish from "just wrong" ones. 

But, I think we can reduce wasted-motion on the margin. 

A hopefully cheap-enough TAP that might help, if more people did it, might be something like:

<TAP> When responding to a wrong argument (which might be completely stupid, or might be a half-formed thing going in an eventually interesting direction)

<ACTION> Preface response with something like: "I think you're saying X. Assuming so, I think this is wrong because [insert argument]." End the argument with "If this seemed to be missing the point, can you try saying your thing in different words, or clarify?"

(if it feels too expensive to articulate what X is, instead one could start with something more like "It looks at first glance like this is wrong because [insert argument]" and then still end with the "check if missing the point?" closing note)

I think more-of-that-on-the-margin from a bunch of people would save a lot of time spent in aggro-y escalation spirals.

re: top level posts

This doesn't quite help when, instead of replying to someone, you're writing a top-level post responding to an abstracted argument (e.g. The Sun is big, but superintelligences will not spare Earth a little sunlight [LW · GW]).

I'd have to think more about what to do for that case, but, the sort of thing I'm imagining is a bit more scaffolding that builds towards "having a well indexed list of the best arguments." Maybe briefly noting early on "This essay is arguing for [this particular item in List of Lethalities [LW · GW]]" or "This argument is adding a new item to List of Lethalities" (and then maybe update that post, since it's nice to have a comprehensive list). 

This doesn't feel like a complete solution, but the sort of things I'd be looking for are cheap things you can add to posts that help bootstrap towards a clearer-list-of-the-best-arguments existing.

comment by Christopher King (christopher-king) · 2024-11-06T17:45:28.953Z · LW(p) · GW(p)

I would suggest formulating this like a literal attention economy.

  1. You set a price for your attention (probably like $1). The price at which even if the post is a waste of time, the money makes it worth it.
  2. "Recommenders" can recommend content to you by paying the price.
  3. If the content was worth your time, you pay the recommender the $1 back plus a couple cents.

The idea is that the recommenders would get good at predicting what posts you'd pay them for. And since you aren't a causal decision theorist they know you won't scam them. In particular, on average you should be losing money (but in exchange you get good content).

This doesn't necessarily require new software. Just tell people to send PayPals with a link to the content.

With custom software, theoretically there could exist a secondary market for "shares" in the payout from step 3 to make things more efficient. That way the best recommenders could sell their shares and then use that money to recommend more content before you pay out.

If the system is bad at recommending content, at least you get paid!
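The three steps above can be sketched as a tiny ledger. This is only an illustration under stated assumptions: the class name, the field names, the 100-cent price, and the 5-cent bonus are all made up here, and amounts are kept in integer cents to avoid rounding surprises.

```python
from dataclasses import dataclass, field


@dataclass
class AttentionMarket:
    """Toy ledger for the three-step scheme; every name here is hypothetical."""
    price_cents: int = 100   # step 1: what the reader charges per recommendation
    bonus_cents: int = 5     # step 3: the "couple cents" paid on top of the refund
    reader_cents: int = 0
    recommender_cents: dict = field(default_factory=dict)

    def recommend(self, recommender: str) -> None:
        # Step 2: the recommender pays the reader's price up front.
        current = self.recommender_cents.get(recommender, 0)
        self.recommender_cents[recommender] = current - self.price_cents
        self.reader_cents += self.price_cents

    def settle(self, recommender: str, worth_it: bool) -> None:
        # Step 3: if the content was worth reading, refund the price plus a bonus.
        # If it was not, the reader simply keeps the payment.
        if worth_it:
            payout = self.price_cents + self.bonus_cents
            current = self.recommender_cents.get(recommender, 0)
            self.recommender_cents[recommender] = current + payout
            self.reader_cents -= payout
```

With these assumed numbers, a recommendation that turns out to be worth reading leaves the reader 5 cents down and the recommender 5 cents up, which matches the remark that the reader should lose money on average when the recommendations are good.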

comment by Max H (Maxc) · 2024-10-18T01:58:01.702Z · LW(p) · GW(p)

My main point was that I thought recent progress in LLMs had demonstrated progress at the problem of building such a function, and solving the value identification problem, and that this progress goes beyond the problem of getting an AI to understand or predict human values.

I want to push back on this a bit. I suspect that "demonstrated progress" is doing a lot of work here, and smuggling an assumption that current trends with LLMs will continue and can be extrapolated straightforwardly.

It's true that LLMs have some nice properties for encapsulating fuzzy and complex concepts like human values, but I wouldn't actually want to use any current LLMs as a referent or in a rating system like the one you propose, for obvious reasons.

Maybe future LLMs will retain all the nice properties of current LLMs while also solving various issues with jailbreaking, hallucination, robustness, reasoning about edge cases, etc. but declaring victory already (even on a particular and narrow point about value identification) seems premature to me.


Separately, I think some of the nice properties you list don't actually buy you that much in practice, even if LLM progress does continue straightforwardly. 

A lot of the properties you list follow from the fact that LLMs are pure functions of their input (at least with a temperature of 0).

Functional purity is a very nice property, and traditional software that encapsulates complex logic in pure functions is often easier to reason about, debug, and formally verify vs. software that uses lots of global mutable state and / or interacts with the outside world through a complex I/O interface. But when the function in question is 100s of GB of opaque floats, I think it's a bit of a stretch to call it transparent and legible just because it can be evaluated outside of the IO monad.

Aside from purity, I don't think your point about an LLM being a "particular function" that can be "hooked up to the AI directly" is doing much work - input() (i.e. asking actual humans) seems just as direct and particular as llm(). If you want your AI system to actually do something in the messy real world, you have to break down the nice theoretical boundary and guarantees you get from functional purity somewhere.

More concretely, given your proposed rating system, simply replace any LLM calls with a call that just asks actual humans to rate a world state given some description, and it seems like you get something that is at least as legible and transparent (in an informal sense) as the LLM version. The main advantage with using an LLM here is that you could potentially get lots of such ratings cheaply and quickly. Replay-ability, determinism and the relative ease of interpretability vs. doing neuroscience on the human raters are also nice, but none of these properties are very reassuring or helpful if the ratings themselves aren't all that good. (Also, if you're doing something with such low sample efficiency that you can't just use actual humans, you're probably on the wrong track anyway.)
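A minimal sketch of that substitution, with hypothetical names throughout: `stub_llm_rater` is a deterministic placeholder standing in for llm(), not a real model call, and `human_rater` stands in for input(). The selection code is identical either way, so swapping the rater changes purity and cost but not how direct the hookup is.

```python
from typing import Callable, List

# A rater maps a description of a world state to a numeric score.
Rater = Callable[[str], float]


def stub_llm_rater(description: str) -> float:
    """Placeholder for llm(): pure and replayable, but its scores are meaningless."""
    return float(len(description) % 10)


def human_rater(description: str) -> float:
    """Placeholder for input(): asks a person, so it is neither pure nor replayable."""
    return float(input(f"Rate this world state from 0 to 10: {description}\n> "))


def pick_best(candidates: List[str], rate: Rater) -> str:
    # The caller cannot tell which kind of rater it was handed.
    return max(candidates, key=rate)
```

Calling `pick_best(options, stub_llm_rater)` or `pick_best(options, human_rater)` exercises the same code path; whether the result is any good depends entirely on the quality of the ratings.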

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2024-10-18T17:08:35.687Z · LW(p) · GW(p)

The post is about the complexity of what needs to be gotten inside the AI.  If you had a perfect blackbox that exactly evaluated the thing-that-needs-to-be-inside-the-AI, this could possibly simplify some particular approaches to alignment, that would still in fact be too hard because nobody has a way of getting an AI to point at anything.  But it would not change the complexity of what needs to be moved inside the AI, which is the narrow point that this post is about; and if you think that some larger thing is not correct, you should not confuse that with saying that the narrow point this post is about, is incorrect.

I claim that having such a function would simplify the AI alignment problem by reducing it from the hard problem of getting an AI to care about something complex (human value) to the easier problem of getting the AI to care about that particular function (which is simple, as the function can be hooked up to the AI directly).

One cannot hook up a function to an AI directly; it has to be physically instantiated somehow.  For example, the function could be a human pressing a button; and then, any experimentation on the AI's part to determine what "really" controls the button, will find that administering drugs to the human, or building a robot to seize control of the reward button, is "really" (from the AI's perspective) the true meaning of the reward button after all!  Perhaps you do not have this exact scenario in mind.  So would you care to spell out what clever methodology you think invalidates what you take to be the larger point of this post -- though of course it has no bearing on the actual point that this post makes?

Replies from: matthew-barnett, david-johnston
comment by Matthew Barnett (matthew-barnett) · 2024-10-18T20:16:04.512Z · LW(p) · GW(p)

The post is about the complexity of what needs to be gotten inside the AI.  If you had a perfect blackbox that exactly evaluated the thing-that-needs-to-be-inside-the-AI, this could possibly simplify some particular approaches to alignment, that would still in fact be too hard because nobody has a way of getting an AI to point at anything.

I think it's important to be able to make a narrow point about outer alignment without needing to defend a broader thesis about the entire alignment problem. To the extent my argument is "outer alignment seems easier than you portrayed it to be in this post, and elsewhere", then your reply here that inner alignment is still hard doesn't seem like it particularly rebuts my narrow point.

This post definitely seems to relevantly touch on the question of outer alignment, given the premise that we are explicitly specifying the conditions that the outcome pump needs to satisfy in order for the outcome pump to produce a safe outcome. Explicitly specifying a function that delineates safe from unsafe outcomes is essentially the prototypical case of an outer alignment problem. I was making a point about this aspect of the post, rather than a more general point about how all of alignment is easy.

(It's possible that you'll reply to me by saying "I never intended people to interpret me as saying anything about outer alignment in this post" despite the clear portrayal of an outer alignment problem in the post. Even so, I don't think what you intended really matters that much here. I'm responding to what was clearly and explicitly written, rather than what was in your head at the time, which is unknowable to me.)

One cannot hook up a function to an AI directly; it has to be physically instantiated somehow.  For example, the function could be a human pressing a button; and then, any experimentation on the AI's part to determine what "really" controls the button, will find that administering drugs to the human, or building a robot to seize control of the reward button, is "really" (from the AI's perspective) the true meaning of the reward button after all!  Perhaps you do not have this exact scenario in mind.

It seems you're assuming here that something like iterated amplification and distillation will simply fail, because the supervisor function that provides rewards to the model can be hacked or deceived. I think my response to this is that I just tend to be more optimistic than you are that we can end up doing safe supervision where the supervisor ~always remains in control, and they can evaluate the AI's outputs accurately, more-or-less sidestepping the issues you mention here.

I think my reasons for believing this are pretty mundane: I'd point to the fact that evaluation tends to be easier than generation, and the fact that we can employ non-agentic tools to help evaluate, monitor, and control our models to provide them accurate rewards without getting hacked. I think your general pessimism about these things is fairly unwarranted, and my guess is that if you had made specific predictions about this question in the past, about what will happen prior to world-ending AI, these predictions would largely have fared worse than predictions from someone like Paul Christiano.

Replies from: Eliezer_Yudkowsky, martin-randall
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2024-10-19T18:14:33.514Z · LW(p) · GW(p)

Your distinction between "outer alignment" and "inner alignment" is both ahistorical and unYudkowskian.  It was invented years after this post was written, by someone who wasn't me; and though I've sometimes used the terms in occasions where they seem to fit unambiguously, it's not something I see as a clear ontological division, especially if you're talking about questions like "If we own the following kind of blackbox, would alignment get any easier?" which on my view breaks that ontology.  So I strongly reject your frame that this post was "clearly portraying an outer alignment problem" and can be criticized on those grounds by you; that is anachronistic.

You are now dragging in a very large number of further inferences about "what I meant", and other implications that you think this post has, which are about Christiano-style proposals that were developed years after this post.  I have disagreements with those, many disagreements.  But it is definitely not what this post is about, one way or another, because this post predates Christiano being on the scene.

What this post is trying to illustrate is that if you try putting crisp physical predicates on reality, that won't work to say what you want.  This point is true!  If you then want to take in a bunch of anachronistic ideas developed later, and claim (wrongly imo) that this renders irrelevant the simple truth of what this post actually literally says, that would be a separate conversation.  But if you're doing that, please distinguish the truth of what this post actually says versus how you think these other later clever ideas evade or bypass that truth.

Replies from: nostalgebraist, matthew-barnett
comment by nostalgebraist · 2024-10-19T21:52:16.004Z · LW(p) · GW(p)

What this post is trying to illustrate is that if you try putting crisp physical predicates on reality, that won't work to say what you want.  This point is true!

Matthew is not disputing this point, as far as I can tell.

Instead, he is trying to critique some version of[1] the "larger argument" (mentioned in the May 2024 update to this post) in which this point plays a role.

You have exhorted him several times to distinguish between that larger argument and the narrow point made by this post:

[...] and if you think that some larger thing is not correct, you should not confuse that with saying that the narrow point this post is about, is incorrect [...]

But if you're doing that, please distinguish the truth of what this post actually says versus how you think these other later clever ideas evade or bypass that truth.

But it seems to me that he's already doing this.  He's not alleging that this post is incorrect in isolation.

The only reason this discussion is happening in the comments of this post at all is the May 2024 update at the start of it, which Matthew used as a jumping-off point [LW(p) · GW(p)] for saying "my critique of the 'larger argument' [LW · GW] does not make the mistake referred to in the May 2024 update[2], but people keep saying it does[3], so I'll try restating that critique again in the hopes it will be clearer this time."

 

  1. ^

    I say "some version of" to allow for a distinction between (a) the "larger argument" of Eliezer_2007's which this post was meant to support in 2007, and (b) whatever version of the same "larger argument" was a standard MIRI position as of roughly 2016-2017.

    As far as I can tell, Matthew is only interested in evaluating the 2016-2017 MIRI position, not the 2007 EY position (insofar as the latter is different, if in fact it is).  When he cites older EY material, he does so as a means to an end – either as indirect evidence of later MIRI positions, or because it was itself cited in the later MIRI material which is his main topic.

  2. ^

    Note that the current version of Matthew's 2023 post [LW · GW] includes multiple caveats that he's not making the mistake referred to in the May 2024 update.

    Note also that Matthew's post only mentions this post in two relatively minor ways, first to clarify that he doesn't make the mistake referred to in the update (unlike some "Non-MIRI people" who do make the mistake), and second to support an argument about whether "Yudkowsky and other MIRI people" believe that it could be sufficient to get a single human's values into the AI, or whether something like CEV would be required instead.

    I bring up the mentions of this post in Matthew's post in order to clarify what role "is 'The Hidden Complexity of Wishes' correct in isolation, considered apart from anything outside it?" plays in Matthew's critique – namely, none at all, IIUC.

    (I realize that Matthew's post has been edited over time, so I can only speak to the current version.)

  3. ^

    To be fully explicit: I'm not claiming anything about whether the May 2024 update was about Matthew's 2023 post [LW · GW] (alone or in combination with anything else). I'm just rephrasing what Matthew said in the first comment of this thread (which was also agnostic on the topic of whether the update referred to him).

Replies from: matthew-barnett
comment by Matthew Barnett (matthew-barnett) · 2024-10-19T22:38:47.497Z · LW(p) · GW(p)

Matthew is not disputing this point, as far as I can tell.

Instead, he is trying to critique some version of[1] the "larger argument" (mentioned in the May 2024 update to this post) in which this point plays a role.

I'll confirm that I'm not saying this post's exact thesis is false. This post seems to be largely a parable about a fictional device, rather than an explicit argument with premises and clear conclusions. I'm not saying the parable is wrong. Parables are rarely "wrong" in a strict sense, and I am not disputing this parable's conclusion.

However, I am saying: this parable presumably played some role in the "larger" argument that MIRI has made in the past. What role did it play? Well, I think a good guess is that it portrayed the difficulty of precisely specifying what you want or intend, for example when explicitly designing a utility function. This problem was often alleged to be difficult because, when you want something complex, it's difficult to perfectly delineate potential "good" scenarios and distinguish them from all potential "bad" scenarios. This is the problem I was analyzing in my original comment.

While the term "outer alignment" was not invented to describe this exact problem until much later, I was using that term purely as descriptive terminology for the problem this post clearly describes, rather than claiming that Eliezer in 2007 was deliberately describing something that he called "outer alignment" at the time. Because my usage of "outer alignment" was merely descriptive in this sense, I reject the idea that my comment was anachronistic.

And again: I am not claiming that this post is inaccurate in isolation. In both my above comment, and in my 2023 post, I merely cited this post as portraying an aspect of the problem that I was talking about, rather than saying something like "this particular post's conclusion is wrong". I think the fact that the post doesn't really have a clear thesis in the first place means that it can't be wrong in a strong sense at all. However, the post was definitely interpreted as explaining some part of why alignment is hard — for a long time by many people — and I was critiquing the particular application of the post to this argument, rather than the post itself in isolation.

Replies from: TsviBT
comment by TsviBT · 2024-10-21T07:49:03.102Z · LW(p) · GW(p)

Here's an argument that alignment is difficult which uses complexity of value as a subpoint:

  • A1. If you try to manually specify what you want, you fail.

  • A2. Therefore, you want something algorithmically complex.

  • B1. When humanity makes an AGI, the AGI will have gotten values via some process; that process induces some probability distribution over what values the AGI ends up with.

  • B2. We want to affect the values-distribution, somehow, so that it ends up with our values.

  • B3. We don't understand how to affect the values-distribution toward something specific.

  • B4. If we don't affect the values-distribution toward something specific, then the values-distribution probably puts large penalties on absolute algorithmic complexity; any specific utility function with higher absolute algorithmic complexity will be less likely to be the one that the AGI ends up with.

  • C1. Because of A2 (our values are algorithmically complex) and B4 (a complex utility function is unlikely to show up in an AGI without us skillfully intervening), an AGI is unlikely to have our values without us skillfully intervening.

  • C2. Because of B3 (we don't know how to skillfully intervene on an AGI's values) and C1, an AGI is unlikely to have our values.
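As a toy illustration of B4 and C1, here is a minimal sketch that assumes the values-distribution weights each utility function by 2^-K, where K is its description length in bits. That weighting, the function names, and the bit counts are all assumptions made for illustration; the argument itself only says the penalty is large.

```python
# Toy model only: the 2^-K weighting is an assumed stand-in for B4's
# "large penalties" on absolute algorithmic complexity.

def simplicity_weight(description_bits: int) -> float:
    """Unnormalized weight of a utility function that takes K bits to describe."""
    return 2.0 ** -description_bits


def relative_likelihood(complex_bits: int, simple_bits: int) -> float:
    """How likely the complex function is relative to the simple one."""
    return simplicity_weight(complex_bits) / simplicity_weight(simple_bits)


# Hypothetical description lengths, chosen only for illustration.
# The complex function is 2**-30 times as likely as the simple one.
print(relative_likelihood(complex_bits=40, simple_bits=10))
```

Under this assumed weighting, every extra bit of description length halves the chance that the function is the one the AGI ends up with.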

I think that you think that the argument under discussion is something like:

  • (same) A1. If you try to manually specify what you want, you fail.

  • (same) A2. Therefore, you want something algorithmically complex.

  • (same) B1. When humanity makes an AGI, the AGI will have gotten values via some process; that process induces some probability distribution over what values the AGI ends up with.

  • (same) B2. We want to affect the values-distribution, somehow, so that it ends up with our values.

  • B'3. The greater the complexity of our values, the harder it is to point at our values.

  • B'4. The harder it is to point at our values, the more work or difficulty is involved in B2.

  • C'1. By B'3 and B'4: the greater the complexity of our values, the more work or difficulty is involved in B2 (determining the AGI's values).

  • C'2. Because of A2 (our values are algorithmically complex) and C'1, it would take a lot of work to make an AGI pursue our values.

These are different arguments, which make use of the complexity of values in different ways. You dispute B'3 on the grounds that it can be easy to point at complex values. B'3 isn't used in the first argument though.

Replies from: nostalgebraist
comment by nostalgebraist · 2024-10-21T23:27:33.783Z · LW(p) · GW(p)

In the situation assumed by your first argument, AGI would be very unlikely to share our values even if our values were much simpler than they are.

Complexity makes things worse, yes, but the conclusion "AGI is unlikely to have our values" is already entailed by the other premises even if we drop the stuff about complexity.

Why: if we're just sampling some function from a simplicity prior, we're very unlikely to get any particular nontrivial function that we've decided to care about in advance of the sampling event.  There are just too many possible functions, and probability mass has to get divided among them all.

In other words, if it takes N bits to specify human values, there are 2^N ways that a bitstring of the same length could be set, and we're hoping to land on just one of those through luck alone.  (And to land on a bitstring of this specific length in the first place, of course.)  Unless N is very small, such a coincidence is extremely unlikely.

And N is not going to be that small; even in the sort of naive and overly simple "hand-crafted" value specifications which EY has critiqued in this post and elsewhere, a lot of details have to be specified.  (E.g. some proposals refer to "humans" and so a full algorithmic description of them would require an account of what is and isn't a human.)
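To make the counting concrete, here is a rough sketch. It assumes, as a simplification, a uniform draw over all bitstrings of one fixed length N, and it ignores the further chance of landing on that length in the first place. The function name and the sample values of N are hypothetical.

```python
from fractions import Fraction


def chance_of_exact_match(n_bits: int) -> Fraction:
    """Probability that a uniformly random n-bit string equals one fixed target."""
    return Fraction(1, 2 ** n_bits)


# Sample lengths are arbitrary; none is an estimate for human values.
for n_bits in (1, 8, 64, 1000):
    print(n_bits, float(chance_of_exact_match(n_bits)))
```

Each additional bit halves the chance, so even a specification of modest length leaves a negligible probability of a match by luck.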


One could devise a variant of this argument that doesn't have this issue, by "relaxing the problem" so that we have some control, just not enough to pin down the sampled function exactly.  And then the remaining freedom is filled randomly with a simplicity bias.  This partial control might be enough to make a simple function likely, while not being able to make a more complex function likely.  (Hmm, perhaps this is just your second argument, or a version of it.)

This kind of reasoning might be applicable in a world where its premises are true, but I don't think its premises are true in our world.

In practice, we apparently have no trouble getting machines to compute very complex functions, including (as Matthew points out) specifications of human value whose robustness would have seemed like impossible magic back in 2007.  The main difficulty, if there is one, is in "getting the function to play the role of the AGI values," not in getting the AGI to compute the particular function we want in the first place.

Replies from: TsviBT
comment by TsviBT · 2024-10-22T15:58:50.692Z · LW(p) · GW(p)

The main difficulty, if there is one, is in "getting the function to play the role of the AGI values," not in getting the AGI to compute the particular function we want in the first place.

Right, that is the problem (and IDK of anyone discussing this who says otherwise).

Another position would be that it's probably easy to influence a few bits of the AI's utility function, but not others. For example, it's conceivable that, by doing capabilities research in different ways, you could increase the probability that the AGI is highly ambitious--e.g. tries to take over the whole lightcone, tries to acausally bargain, etc., rather than being more satisficy. (IDK how to do that, but plausibly it's qualitatively easier than alignment.) Then you could claim that it's half a bit more likely that you've made an FAI, given that an FAI would probably be ambitious. In this case, it does matter that the utility function is complex.

comment by Matthew Barnett (matthew-barnett) · 2024-10-19T23:11:51.294Z · LW(p) · GW(p)

While the term "outer alignment" wasn’t coined until later to describe the exact issue that I'm talking about, I was using that term purely as a descriptive label for the problem this post clearly highlights, rather than implying that you were using or aware of the term in 2007. 

Because I was simply using "outer alignment" in this descriptive sense, I reject the notion that my comment was anachronistic. I used that term as shorthand for the thing I was talking about, which is clearly and obviously portrayed by your post, that's all.

To be very clear: the exact problem I am talking about is the inherent challenge of precisely defining what you want or intend, especially (though not exclusively) in the context of designing a utility function. This difficulty arises because, when the desired outcome is complex, it becomes nearly impossible to perfectly delineate between all potential 'good' scenarios and all possible 'bad' scenarios. This challenge has been a recurring theme in discussions of alignment, as it's considered hard to capture every nuance of what you want in your specification without missing an edge case.

This problem is manifestly portrayed by your post, using the example of an outcome pump to illustrate. I was responding to this portrayal of the problem, and specifically saying that this specific narrow problem seems easier in light of LLMs, for particular reasons.

It is frankly frustrating to me that, from my perspective, you seem to have reliably missed the point of what I am trying to convey here.

I only brought up Christiano-style proposals because I thought you were changing the topic to a broader discussion, specifically to ask me what methodologies I had in mind when I made particular points. If you had not asked me "So would you care to spell out what clever methodology you think invalidates what you take to be the larger point of this post -- though of course it has no bearing on the actual point that this post makes?" then I would not have mentioned those things. In any case, none of the things I said about Christiano-style proposals were intended to critique this post's narrow point. I was responding to that particular part of your comment instead.

As far as the actual content of this post, I do not dispute its exact thesis. The post seems to be a parable, not a detailed argument with a clear conclusion. The parable seems interesting to me. It also doesn't seem wrong, in any strict sense. However, I do think that some of the broader conclusions that many people have drawn from the parable seem false, in context. I was responding to the specific way that this post had been applied and interpreted in broader arguments about AI alignment. 

My central thesis in regards to this post is simply: the post clearly portrays a specific problem that was later called the "outer alignment" problem by other people. This post portrays this problem as being difficult in a particular way. And I think this portrayal is misleading, even if the literal parable holds up in pure isolation.

comment by Martin Randall (martin-randall) · 2024-10-19T21:09:20.651Z · LW(p) · GW(p)

I think it's important to be able to make a narrow point about outer alignment without needing to defend a broader thesis about the entire alignment problem.

Indeed. For it is written [LW · GW]:

A mind that ever wishes to learn anything complicated, must learn to cultivate an interest in which particular exact argument steps are valid, apart from whether you yet agree or disagree with the final conclusion, because only in this way can you sort through all the arguments and finally sum them.

For more on this topic see "Local Validity as a Key to Sanity and Civilization [? · GW]."

comment by David Johnston (david-johnston) · 2024-10-19T12:11:33.604Z · LW(p) · GW(p)

Algorithmic complexity is precisely analogous to difficulty-of-learning-to-predict, so saying "it's not about learning to predict, it's about algorithmic complexity" doesn't make sense. One read of the original is: learning to respect common sense moral side constraints is tricky[1], but AI systems will learn how to do it in the end. I'd be happy to call this read correct, and it is consistent with the observation that today's AI systems do respect common sense moral side constraints given straightforward requests, and that it took a few years to figure out how to do it. That read doesn't really jibe with your commentary.

Your commentary seems to situate this post within a larger argument: teaching a system to "act" is different to teaching it to "predict" because in the former case a sufficiently capable learner's behaviour can collapse to a pathological policy, whereas teaching a capable learner to predict does not risk such collapse. Thus "prediction" is distinguished from "algorithmic complexity". Furthermore, commonsense moral side constraints are complex enough to risk such collapse when we train an "actor" but not a "predictor". This seems confused.

First, all we need to turn a language model prediction into an action is a means of turning text into action, and we have many such means. So the distinction between text predictor and actor is suspect. We could consider an alternative knows/cares distinction: does a system act properly when properly incentivised ("knows") vs does it act properly when presented with whatever context we are practically able to give it ("""cares""")? Language models usually act properly given simple prompts, so in this sense they "care". So rejecting evidence from language models does not seem well justified.

Second, there's no need to claim that commonsense moral side constraints in particular are so hard that trying to develop AI systems that respect them leads to policy collapse. It need only be the case that one of the things we try to teach them to do leads to policy collapse. Teaching values is not particularly notable among all the things we might want AI systems to do; it certainly does not seem to be among the hardest. Focussing on values makes the argument unnecessarily weak.

Third, algorithmic complexity is measured with respect to a prior. The post invokes (but does not justify) an "English speaking evil genie" prior. I don't think anyone thinks this is a serious prior for reasoning about advanced AI system behaviour. The post is (according to your commentary, if not the post itself) making a quantitative point - values are sufficiently complex to induce policy collapse - but it's measuring this quantity using a nonsense prior. If the quantitative argument was indeed the original point, it is mystifying why a nonsense prior was chosen to make it, and also why no effort was made to justify the prior.


  1. the text proposes full value alignment as a solution to the commonsense side constraints problem, but this turned out to be stronger than necessary. ↩︎

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2024-10-19T22:16:22.584Z · LW(p) · GW(p)

My question is: why, exactly, is the following statement true?

Second, there's no need to claim that commonsense moral side constraints in particular are so hard that trying to develop AI systems that respect them leads to policy collapse. It need only be the case that one of the things we try to teach them to do leads to policy collapse. 

Replies from: david-johnston
comment by David Johnston (david-johnston) · 2024-10-20T02:25:44.782Z · LW(p) · GW(p)

Here's a basic model of policy collapse: suppose there exist pathological policies of low prior probability (/high algorithmic complexity) such that they play the training game when it is strategically wise to do so, and when they get a good opportunity they defect in order to pursue some unknown aim.

Because they play the training game, a wide variety of training objectives will collapse to one of these policies if the system in training starts exploring policies of sufficiently high algorithmic complexity. So, according to this crude model, there's a complexity bound: stay under it and you're fine, go over it and you get pathological behaviour. Roughly, whatever desired behaviour requires the most algorithmically complex policy is the one that is most pertinent for assessing policy collapse risk (because that's the one that contributes most of the algorithmic complexity, and so it gives your first-order estimate of whether or not you're crossing the collapse threshold). So, which desired behaviour requires the most complex policy: is it, for example, respecting commonsense moral constraints, or is it inventing molecular nanotechnology?
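The crude model above can be sketched in a few lines. Everything here is hypothetical: the function name, the complexity scores, and the threshold are arbitrary placeholders in arbitrary units, not estimates, and the ordering of the two behaviours is assumed purely for illustration.

```python
def collapse_risk(required_complexity, threshold):
    """Return the most demanding behaviour and whether it exceeds the bound.

    First-order estimate only: the behaviour needing the most complex
    policy decides whether training crosses the collapse threshold.
    """
    hardest = max(required_complexity, key=required_complexity.get)
    return hardest, required_complexity[hardest] > threshold


# Placeholder scores in arbitrary units; not claims about real systems.
behaviours = {
    "respect commonsense moral constraints": 3.0,
    "invent molecular nanotechnology": 9.0,
}
print(collapse_risk(behaviours, threshold=5.0))
```

In this sketch, changing the score of any behaviour other than the most demanding one leaves the verdict unchanged, which is the sense in which the hardest behaviour is the pertinent one.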

Tangentially, the policy collapse theory does not predict outcomes that look anything like malicious compliance. It predicts that, if you're in a position of power over the AI system, your mother is saved exactly as you want her to be. If you are not in such a position then your mother is not saved at all and you get a nanobot war instead or something. That is, if you do run afoul of policy collapse, it doesn't matter if you want your system to pursue simple or complex goals, you're up shit creek either way.