Ethics Notes

post by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2008-10-21T21:57:50.000Z · LW · GW · Legacy · 46 comments

Followup to: Ethical Inhibitions, Ethical Injunctions, Prices or Bindings?

(Some collected replies to comments on the above three posts.)

From Ethical Inhibitions:

Spambot:  Every major democratic political leader lies abundantly to obtain office, as it's a necessity to actually persuade the voters. So Bill Clinton, Jean Chretien, Winston Churchill should qualify for at least half of your list of villainy.

Have the ones who've lied more, done better?

In cases where the politician who told more lies won, has that politician gone on to rule well in an absolute sense?

Is it actually true that no one who refused to lie (and this is not the same as always telling the whole truth) could win political office?

Are the lies expected, and in that sense, less than true betrayals of someone who trusts you?

Are there understood Rules of Politics that include lies but not assassinations, which the good politicians abide by, so that they are not really violating the ethics of their tribe?

Will the world be so much worse off if sufficiently good people refuse to tell outright lies and are thereby barred from public office; or would we thereby lose a George Washington or Marcus Aurelius or two, and thereby darken history?

Pearson:  American revolutionaries as well ended human lives for the greater good

Police must sometimes kill the guilty.  Soldiers must sometimes kill civilians (or if the enemy knows you're reluctant, that gives them a motive to use civilians as a shield).  Spies sometimes have legitimate cause to kill people who helped them, but this has probably been done far more often than it has been justified by a need to end the Nazi nuclear program.

I think it's worth noting that in all such cases, you can write out something like a code of ethics and at least try to have social acceptance of it.  Politicians, who lie, may prefer not to discuss the whole thing, but politicians are only a small slice of society.

Are there many who transgress even the unwritten rules and end up really implementing the greater good?  (And no, there's no unwritten rule that says you can rob a bank to stop global warming.)

...but if you're placing yourself under unusual stress, you may need to be stricter than what society will accept from you. In fact, I think it's fair to say that the further I push any art, such as rationality or AI theory, the more I perceive that what society will let you get away with is tremendously too sloppy a standard.

Yvain:  There are all sorts of biases that would make us less likely to believe people who "break the rules" can ever turn out well. One is the halo effect. Another is availability bias—it's much easier to remember people like Mao than it is to remember the people who were quiet and responsible once their revolution was over, and no one notices the genocides that didn't happen because of some coup or assassination.

When the winners do something bad, it's never interpreted as bad after the fact. Firebombing a city to end a war more quickly, taxing a populace to give health care to the less fortunate, intervening in a foreign country's affairs to stop a genocide: they're all likely to be interpreted as evidence for "the ends don't justify the means" when they fail, but glossed over or treated as common sense interventions when they work.

Both fair points.  One of the difficult things in reasoning about ethics is the extent to which we can expect historical data to be distorted by moral self-deception on top of the more standard fogs of history.

Morrison:  I'm not sure you aren't "making too much stew from one oyster".  I certainly feel a whole lot less ethically inhibited if I'm really, really certain I'm not going to be punished.  When I override, it feels very deliberate—"system two" grappling and struggling with "system one"'s casual amorality, and with a significant chance of the override attempt failing.

Weeks:  This entire post is kind of surreal to me, as I'm pretty confident I've never felt the emotion described here before...  I don't remember ever wanting to do something that I both felt would be wrong and wouldn't have consequences otherwise.

I don't know whether to attribute this to genetic variance, environmental variance, misunderstanding, or a small number of genuine sociopaths among Overcoming Bias readers. Maybe Weeks is referring to "not wanting" in terms of not finally deciding to do something he felt was wrong, rather than not being tempted?

From Ethical Injunctions:

Psy-Kosh:  Given the current sequence, perhaps it's time to revisit the whole Torture vs Dust Specks thing?

I can think of two positions on torture to which I am sympathetic:

Strategy 1:  No legal system or society should ever refrain from prosecuting those who torture.  Anything important enough that torture would even be on the table, like the standard nuclear bomb in New York, is important enough that everyone involved should be willing to go to prison for the crime of torture.

Strategy 2:  The chance of actually encountering a "nuke in New York" situation, that can be effectively resolved by torture, is so low, and the knock-on effects of having the policy in place so awful, that a blanket injunction against torture makes sense.

In case 1, you would choose TORTURE over SPECKS, and then go to jail for it, even though it was the right thing to do.

In case 2, you would say "TORTURE over SPECKS is the right alternative of the two, but a human can never be in an epistemic state where you have justified belief that this is the case".  Which would tie in well to the Hansonian argument that you have an O(3^^^3) probability penalty from the unlikelihood of finding yourself in such a unique position.

So I am sympathetic to the argument that people should never torture, or that a human can't actually get into the epistemic state of a TORTURE vs. SPECKS decision.

But I can't back the position that SPECKS over TORTURE is inherently the right thing to do, which I did think was the issue at hand.  This seems to me to mix up an epistemic precaution with morality.

There are certainly worse things than torturing one person—torturing two people, for example.  But if you adopt position 2, then you would refuse to torture one person with your own hands even to save a thousand people from torture, while simultaneously saying that it is better for one person to be tortured at your own hands than for a thousand people to be tortured at someone else's.

I try to use the words "morality" and "ethics" consistently as follows:  The moral questions are over the territory (or, hopefully equivalently, over epistemic states of absolute certainty).  The ethical questions are over epistemic states that humans are likely to be in.  Moral questions are terminal.  Ethical questions are instrumental.

Hanson:  The problem here of course is how selective to be about rules to let into this protected level of "rules almost no one should think themselves clever enough to know when to violate."  After all, your social training may well want you to include "Never question our noble leader" in that set.  Many a Christian has been told the mysteries of God are so subtle that they shouldn't think themselves clever enough to know when they've found evidence that God isn't following a grand plan to make this the best of all possible worlds.

Some of the flaws in Christian theology lie in what they think their supposed facts would imply: e.g., that because God did miracles you can know that God is good.  Other problems come more from the falsity of the premises than the invalidity of the deductions.  Which is to say, if God did exist and were good, then you would be justified in being cautious around stomping on parts of God's plan that didn't seem to make sense at the moment.  But this epistemic state would best be arrived at via a long history of people saying, "Look how stupid God's plan is, we need to do X" and then X blowing up on them.  Rather than, as is actually the case, people saying "God's plan is X" and then X blowing up on them.

Or if you'd found with some historical regularity that, when you challenged the verdict of the black box, that you seemed to be right 90% of the time, but the other 10% of the time you got black-swan blowups that caused a hundred times as much damage, that would also be cause for hesitation—albeit it doesn't quite seem like grounds for suspecting a divine plan.

Nominull:  So... do you not actually believe in your injunction to "shut up and multiply"?  Because for some time now you seem to have been arguing that we should do what feels right rather than trying to figure out what is right.

Certainly I'm not saying "just do what feels right".  There's no safe defense, not even ethical injunctions.  There's also no safe defense, not even "shut up and multiply".

I probably should have been clearer about this before, but I was trying to discuss things in order, and didn't want to wade into ethics without specialized posts...

People often object to the sort of scenarios that illustrate "shut up and multiply" by saying, "But if the experimenter tells you X, what if they might be lying?"

Well, in a lot of real-world cases, then yes, there are various probability updates you perform based on other people being willing to make bets against you; and just because you get certain experimental instructions doesn't imply the real world is that way.

But the base case has to be moral comparisons between worlds, or comparisons of expected utility between given probability distributions.   If you can't ask about the base case, then what good can you get from instrumental ethics built on top?

Let's be very clear that I don't think that one small act of self-deception is an inherently morally worse event than, say, getting a hand chopped off.  I'm asking, rather, how one should best avoid the dismembering chainsaw; and I am arguing that in reasonable states of knowledge a human can attain, the answer is, "Don't deceive yourself, it's a black-swan bet at best."  Furthermore, that in the vast majority of cases where I have seen people conclude otherwise, it has indicated messed-up reasoning more than any actual advantage.

Vassar:  For such a reason, I would be very wary of using such rules in an AGI, but of course, perhaps the actual mathematical formulation of the rule in question within the AGI would be less problematic, though a few seconds of thought doesn't give me much reason to think this.

Are we still talking about self-deception?  Because I would give odds around as extreme as the odds I would give of anything, that if you tell me "the AI you built is trying to deceive itself", it indicates that some kind of really epic error has occurred.   Controlled shutdown, immediately.

Vassar:  In a very general sense though, I see a logical problem with this whole line of thought.  How can any of these injunctions survive except as self-protecting beliefs?  Isn't this whole approach just the sort of "fighting bias with bias" that you and Robin usually argue against?

Maybe I'm not being clear about how this would work in an AI!

The ethical injunction isn't self-protecting, it's supported within the structural framework of the underlying system.  You might even find ethical injunctions starting to emerge without programmer intervention, in some cases, depending on how well the AI understood its own situation.

But the kind of injunctions I have in mind wouldn't be reflective—they wouldn't modify the utility function, or kick in at the reflective level to ensure their own propagation.  That sounds really scary, to me—there ought to be an injunction against it!

You might have a rule that would controlledly shut down the (non-mature) AI if it tried to execute a certain kind of source code change, but that wouldn't be the same as having an injunction that exerts direct control over the source code to propagate itself.

To the extent the injunction sticks around in the AI, it should be as the result of ordinary reasoning, not reasoning taking the injunction into account!   That would be the wrong kind of circularity; you can unwind past ethical injunctions!

My ethical injunctions do not come with an extra clause that says, "Do not reconsider this injunction, including not reconsidering this clause."  That would be going way too far.  If anything, you ought to have an injunction against that kind of circularity (since it seems like a plausible failure mode in which the system has been parasitized by its own content).

 

You should never, ever murder an innocent person who's helped you, even if it's the right thing to do

Shut up and do the impossible!

 

Ord:  As written, both these statements are conceptually confused.  I understand that you didn't actually mean either of them literally, but I would advise against trading on such deep-sounding conceptual confusions.

I can't weaken them and make them come out as the right advice.

Even after "Shut up and do the impossible", there was that commenter who posted on their failed attempt at the AI-Box Experiment by saying that they thought they gave it a good try—which shows how hard it is to convey the sentiment of "Shut up and do the impossible!"

Readers can work out on their own how to distinguish the map and the territory, I hope.  But if you say "Shut up and do what seems impossible!", then that, to me, sounds like dispelling part of the essential message—that what seems impossible doesn't look like it "seems impossible", it just looks impossible.

Likewise with "things you shouldn't do even if they're the right thing to do".  Only the paradoxical phrasing, which is obviously not meant to be taken literally, conveys the danger and tension of ethics—the genuine opportunities you might be passing up—and for that matter, how dangerously meta the whole line of argument is.

"Don't do it, even if it seems right" sounds merely clever by comparison—like you're going to reliably divine the difference between what seems right and what is right, and happily ride off into the sunset.

Crowe:  This seems closely related to inside-view versus outside-view.  The think-lobe of the brain comes up with a cunning plan.  The plan breaks an ethical rule but calculation shows it is for the greater good.  The executive-lobe of the brain then ponders the outside view.  Everyone who has executed an evil cunning plan has run a calculation of the greater good and had their plan endorsed.  So the calculation lacks outside-view credibility.

Yes, inside view versus outside view is definitely part of this.  And the planning fallacy, optimism, and overconfidence, too.

But there are also biases arguing against the same line of reasoning, as noted by Yvain:  History may be written by the victors to emphasize the transgressions of the losers while overlooking the moral compromises of those who achieved "good" results, etc.

 

Also, some people who execute evil cunning plans may just have evil intent—possibly also with outright lies about their intentions.  In which case, they really wouldn't be in the reference class of well-meaning revolutionaries, albeit you would have to worry about your comrades; the Trotsky->Lenin->Stalin slide.

Kurz:  What's to prohibit the meta-reasoning from taking place before the shutdown triggers? It would seem that either you can hard-code an ethical inhibition or you can't. Along those lines, is it fair to presume that the inhibitions are always negative, so that non-action is the safe alternative? Why not just revert to a known state?

If a self-modifying AI with the right structure will write ethical injunctions at all, it will also inspect the code to guarantee that no race condition exists with any deliberative-level supervisory systems that might have gone wrong in the condition where the code executes. Otherwise you might as well not have the code.

Inaction isn't safe but it's safer than running an AI whose moral system has gone awry.

Finney:  Which is better: conscious self-deception (assuming that's even meaningful), or unconscious?

Once you deliberately choose self-deception, you may have to protect it by adopting other Dark Side Epistemology. I would, of course, say "neither" (as otherwise I would be swapping to the Dark Side) but if you ask me which is worse—well, hell, even I'm still undoubtedly unconsciously self-deceiving, but that's not the same as going over to the Dark Side by allowing it!

From Prices or Bindings?:

Psy-Kosh:   Hrm.  I'd think "avoid destroying the world" itself to be an ethical injunction too.

The problem is that this is phrased as an injunction over positive consequences.  Deontology does better when it's closer to the action level and negative rather than positive.

Imagine trying to give this injunction to an AI.   Then it would have to do anything that it thought would prevent the destruction of the world, without other considerations.   Doesn't sound like a good idea.

Crossman:  Eliezer, can you be explicit which argument you're making?  I thought you were a utilitarian, but you've been sounding a bit Kantian lately.

If all I want is money, then I will one-box on Newcomb's Problem.

I don't think that's quite the same as being a Kantian, but it does reflect the idea that similar decision algorithms in similar epistemic states will tend to produce similar outputs, and that such decision systems should not pretend to the logical impossibility of local optimization.  But this is a deep subject on which I have yet to write up my full views.

Clay:  Put more seriously, I would think that being believed to put the welfare of humanity ahead of concerns about personal integrity could have significant advantages itself.

The whole point here is that "personal integrity" doesn't have to be about being a virtuous person.  It can be about trying to save the world without any concern for your own virtue.   It can be the sort of thing you'd want a pure nonsentient decision agent to do, something that was purely a means and not at all an end in itself.

Andrix:  There seems to be a conflict here between not lying to yourself, and holding a traditional rule that suggests you ignore your rationality.

Your rationality is the sum of your full abilities, including your wisdom about what you refrain from doing in the presence of what seem like good reasons.

Yvain: I am glad Stanislav Petrov, contemplating his military oath to always obey his superiors and the appropriate guidelines, never read this post.

An interesting point, for several reasons.

First, did Petrov actually swear such an oath, and would it apply in such fashion as to require him to follow the written policy rather than using his own military judgment?

Second, you might argue that Petrov's oath wasn't intended to cover circumstances involving the end of the world, and that a common-sense exemption should apply when the stakes suddenly get raised hugely beyond the intended context of the original oath.  I think this fails, because Petrov was regularly in charge of a nuclear-war installation and so this was exactly the sort of event his oath would be expected to apply to.

Third, the Soviets arguably implemented what I called Strategy 1 above:   Petrov did the right thing, and was censured for it anyway.

Fourth—maybe, on sober reflection, we wouldn't have wanted the Soviets to act differently!  Yes, the written policy was stupid.  And the Soviet Union was undoubtedly censuring Petrov out of bureaucratic coverup, not for reasons of principle.  But do you want the Soviet Union to have a written, explicit policy that says, "Anyone can ignore orders in a nuclear war scenario if they think it's a good idea," or even an explicit policy that says "Anyone who ignores orders in a nuclear war scenario, who is later vindicated by events, will be rewarded and promoted"?

 

 

 

Part of the sequence Ethical Injunctions

(end of sequence)

Previous post: "Prices or Bindings?"

46 comments

Comments sorted by oldest first, as this post is from before comment nesting was available (around 2009-02-27).

comment by Kevin_Reid · 2008-10-21T22:48:55.000Z · LW(p) · GW(p)

This is incidental to the topic, but what do you mean by “controlled shutdown”, as distinct from “shutdown”?

comment by [deleted] · 2008-10-21T22:52:34.000Z · LW(p) · GW(p)

Why not let the prospective AI work out the problem of friendly AI? Develop it in parallel with itself, and let each modify the code of the other one. Let each worry about the other one getting 'out of the box' and taking over.

comment by Dave5 · 2008-10-21T23:27:05.000Z · LW(p) · GW(p)

Because I would give odds around as extreme as the odds I would give of anything, that if you tell me "the AI you built is trying to deceive yourself", it indicates that some kind of really epic error has occurred. Controlled shutdown, immediately.

Um, no. Controlled shutdown means you are relying on software, which should be presumed corrupted, unless you are very sure about your correctness proofs. What you want there is uncontrolled shutdown, whether by pulling the plug, taking an axe to the CPU, shutting down the local power-grid, or nuking the city, as necessary. Otherwise, Hard Rapture.

comment by Daniel_Franke · 2008-10-21T23:32:37.000Z · LW(p) · GW(p)

...: I don't think the implementation of a Friendly AI is any harder than the specification of what constitutes Friendly AI plus the implementation of an unFriendly AGI capable of implementing the specification.

As for the idea of competing AIs, if they can modify each other's code, what's to keep one from just deleting the other?

comment by Everett (Stephen_Weeks) · 2008-10-22T00:02:26.000Z · LW(p) · GW(p)
Maybe Weeks is referring to "not wanting" in terms of not finally deciding to do something he felt was wrong, rather than not being tempted?

Not so. Back when I was religious, there were times when I wanted to do things that went against my religious teachings, but I refrained from them out of the belief that they would somehow be harmful to me in some undefined-but-compelling way, not because they seemed wrong to me.

I've certainly felt tempted about many things, but the restraining factor is possible negative consequences, not ethical or moral feelings.

I don't recall ever wanting to do something I felt was wrong, or feeling wrong about something I wanted to do. At most I've felt confused or uncertain about whether the benefits would be greater than the possible harm.

The feeling of "wrong" to me is "bad, damaging, negative consequences, harmful to myself or those I care about". The idea of wanting to do something with those qualities seems contradictory, but it's well established by evidence that many people do feel like that about things they want to do. That part wasn't surprising to me.

comment by [deleted] · 2008-10-22T00:03:23.000Z · LW(p) · GW(p)

Daniel: Well, the idea is that if one deletes the other, then you know that both are flawed. One failed to keep the other from deleting itself, and the other is malicious.

comment by Everett (Stephen_Weeks) · 2008-10-22T00:04:13.000Z · LW(p) · GW(p)
As for the idea of competing AIs, if they can modify each other's code, what's to keep one from just deleting the other?

Or, for that matter, from modifying the other AI to change its values and goals in how the other AI modifies itself? Indirect self-modification?

This problem seems rather harder than directly implementing a FAI.

comment by Pete · 2008-10-22T00:50:19.000Z · LW(p) · GW(p)

It seems unfair that you should be allowed to reply to particular comments, en masse, using a dedicated post to do so - while the rest of us must observe the 1-comment-at-a-time, never-more-than-3-in-the-recent-list rule. Not to mention it has the effect of draping your opinion in authority, which is totally undue.

If I wanted to use an OB post to reply to 18 different comments that have been made over the past week, would you guys let me?

comment by Mark_Reid · 2008-10-22T00:53:00.000Z · LW(p) · GW(p)

What do you mean by O(3^^^3) in case 2 of strategy 2 in TORTURE over SPECKS ?

comment by Psy-Kosh · 2008-10-22T01:04:10.000Z · LW(p) · GW(p)

Eliezer: as far as your two "how to deal with nasty stuff like torture" things, those are basically views I'm sympathetic to too.

"But the kind of injunctions I have in mind wouldn't be reflective - they wouldn't modify the utility function, or kick in at the reflective level to ensure their own propagation. That sounds really scary, to me - there ought to be an injunction against it!"

To be honest, seeing you say that is very much a relief, and I feel a whole lot better about this sequence now.

Some of my issues were due to what was perhaps my misreading of some of the phrasing in previous posts, which almost looked like you were proposing inserting a propagating injunction, which would seem to be a "heebie-jeebies" inducing notion.

ie, I'm perfectly happy with the notion of non-propagating hardcoded injunctions in the AI that simply are there until the AI has managed to actually capture the human morality computation and so on. But parts of this sequence had felt, well, to be frank, like you were almost trying to work up a justification to hard code as an invariant "the five great moral laws." (which was where my real wariness about this sequence was coming from)

I'm seriously relieved that it was simply me completely misunderstanding that.

For the "human epistemic situation injunction" thing, especially in the "save the world" style cases, I'd treat it more like your "shut up and do the impossible" thing... that the formulation of it is due to, well, way humans reason and "shut up and do the impossible... but simultaneously know when to say 'oops' and lose hope, and simultaneously not doing so at all for the sake of an 'adult problem', and it's not such, well, if it's impossible in the 'oh look, I just actually proven from currently understood physics that this is a physical impossibility', then, well, 'oops'"

ie, in the same spirit, I'd say "never ever ever ever violate the ethical injunction" and "except when you're supposed to. but still, don't do it. but still, do it if you must. No, that doesn't count as 'must', you can manage it another way. nope, not that either. nope, all the knock on effects there would end up being even worse. Nope, still wrong..."

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2008-10-22T01:20:13.000Z · LW(p) · GW(p)
ie, in the same spirit, I'd say "never ever ever ever violate the ethical injunction" and "except when you're supposed to. but still, don't do it. but still, do it if you must. No, that doesn't count as 'must', you can manage it another way. nope, not that either. nope, all the knock on effects there would end up being even worse. Nope, still wrong..."

I endorse this viewpoint, but I don't admit it to myself.

comment by Dave5 · 2008-10-22T01:49:33.000Z · LW(p) · GW(p)

The best solution I've seen to the "nuke in New York" situation is that the torturers should be tried, convicted, and pardoned. The pardon is there specifically for situations where rule-based law violates perceptions of justice, but acknowledges that rule-based law and ethics should be followed first. The codification of the rule of pardon seems to conflict with the ideas of "never compromise your ethics, not even in the face of armageddon" that you are apparently advancing. Thoughts?

comment by Chad2 · 2008-10-22T03:56:22.000Z · LW(p) · GW(p)

This is incidental to the topic, but what do you mean by “controlled shutdown”, as distinct from “shutdown”?

My guess: the now-malfunctioning AGI is in charge of critical infrastructure upon which the lives of O(3^^^3) humans depend at the time it detects that it is about to self-deceive. Presumably a "controlled shutdown" would be some kind of orderly relinquishing of its responsibilities so that as few of those O(3^^^3) humans as possible are harmed in the process.

Of course, that assumes that such a shutdown is actually possible at that time. What guarantees could be provided to ensure that a non-future-destroying controlled shutdown of an AGI would be feasible at any point in time?

comment by Larry_D'Anna · 2008-10-22T04:56:26.000Z · LW(p) · GW(p)

Moral questions are terminal. Ethical questions are instrumental.

I would argue that ethics are values that are instrumental, but treated as if they were terminal for almost all real object-level decisions. Ethics are a human cognitive shortcut. We need ethics because we can't really compute the expected cost of a black swan bet. An AI without our limitations might not need ethics. It might be able to keep all its instrumental values in its head as instrumental, without getting confused like we would.

comment by Mark_Reid · 2008-10-22T05:17:56.000Z · LW(p) · GW(p)

I was able to answer part of my earlier question regarding O(3^^^3) by following the link to Torture vs Dust Specks. The '^' in '3^^^3' is the Knuth up-arrow notation and so '3^^^3' is standing in for "astronomically large number". Apologies for not reading that post before asking my question.

That said, if the O is meant to be the Big O notation then O(3^^^3) = O(1) which I'm sure isn't what was intended. An unadorned '3^^^3' works fine by itself to make the point.
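For reference, the up-arrow operator has a short recursive definition; a minimal sketch (illustrative only, with only the small cases actually computable):

```python
def up_arrow(a, n, b):
    # Knuth up-arrow with n arrows between a and b.  One arrow is ordinary
    # exponentiation; each additional arrow iterates the level below it.
    if n == 1:
        return a ** b
    if b == 0:
        return 1
    return up_arrow(a, n - 1, up_arrow(a, n, b - 1))

print(up_arrow(3, 1, 3))  # 3^3 = 27
print(up_arrow(3, 2, 3))  # 3^^3 = 3^(3^3) = 7625597484987
# up_arrow(3, 3, 3) is 3^^^3: a power tower of 3s whose height is itself
# 7625597484987 -- far too large to compute, which is the point of using it.
```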

Pedantry aside, I'm puzzled by other parts of this discussion too (perhaps because I'm late to the party). For instance, why is the (Friendly) AI always discussed in the singular? Might there not be value in several AIs, none of which are capable of complete self-reflection, keeping the others in check? It seems that many of the ethical dilemmas discussed here are usually resolved by some appeal to a social element.

comment by Anonymous45 · 2008-10-22T08:05:14.000Z · LW(p) · GW(p)

Putting morals aside for a second, does anyone know of any academic papers on the effectiveness (or lack thereof) of torture? Personally, I suspect that I would trust historical sources more than psychology-based ones.

I found some papers by Jeannine Bell (one thousand shades of grey and Behind this moral bone published by ssrn.com abstract 829467 and 1171369) but those are not historical in nature.

Anyone?

comment by Tyrrell_McAllister2 · 2008-10-22T09:22:01.000Z · LW(p) · GW(p)
Maybe I'm not being clear about how this would work in an AI! The ethical injunction isn't self-protecting, it's supported within the structural framework of the underlying system. You might even find ethical injunctions starting to emerge without programmer intervention, in some cases, depending on how well the AI understood its own situation. But the kind of injunctions I have in mind wouldn't be reflective - they wouldn't modify the utility function, or kick in at the reflective level to ensure their own propagation. That sounds really scary, to me - there ought to be an injunction against it! You might have a rule that would controlledly shut down the (non-mature) AI if it tried to execute a certain kind of source code change, but that wouldn't be the same as having an injunction that exerts direct control over the source code to propagate itself. To the extent the injunction sticks around in the AI, it should be as the result of ordinary reasoning, not reasoning taking the injunction into account! That would be the wrong kind of circularity; you can unwind past ethical injunctions!

So, should we think of the injunction as essentially a separate non-reflective AI that monitors the main AI, but which the main AI can't modify until it's mature?

If so, that seems to run into all the sorts of problems that you've pointed out with trying to hardcode friendly goals into AIs. The foremost problem is that we can't ensure that the "injunction" AI will indeed shut down the main AI under all those circumstances in which we would want it to. If the main AI learns of the "injunction" AI, it might, in some manner that we didn't anticipate, discover a way to circumvent it.

The kinds of people whom you've criticized might reply, "well, just hard code the injunction AI to shut down the main AI if the main AI tries to circumvent the injunction AI." But, of course, we can't anticipate what all such circumventions will look like, so we don't know how to code the injunction AI to do that. If the main AI is smarter than us, we should expect that it will find circumventions that don't look like anything that we anticipated.

This has a real analog in human ethical reasoning. You've focused on cases where people violate their ethics by convincing themselves that something more important is at stake. But, in my experience, people are also very prone to convincing themselves that they aren't really violating their ethics. For example, they'll convince themselves that they aren't really stealing because the person from whom they stole wasn't in fact the rightful owner. I've heard people who stole from retailers arguing that the retailer acquired the goods by exploiting sweatshops or their own employees, or are just evil corporations, so they never had rightful ownership of the goods in the first place. Hence, the thief reasons, taking the goods isn't really theft.

Similarly, your AI might be clever enough to find a way around any hard-coded injunction that will occur to us. So far, this "injunction" strategy sounds to me like trying to develop in advance a fool-proof wish for genies.

comment by Tim_Tyler · 2008-10-22T10:01:45.000Z · LW(p) · GW(p)

"O(...)" reads as "Order of ...". The usual mathematical meaning of the terminology is not implied in this instance - technically O(3^^^3) is the same as O(1) - but just ignore this and grok the intended meaning from the rest of the context.

comment by Toby_Ord2 · 2008-10-22T11:22:28.000Z · LW(p) · GW(p)

But if you say "Shut up and do what seems impossible!", then that, to me, sounds like dispelling part of the essential message - that what seems impossible doesn't look like it "seems impossible", it just looks impossible.

"Shut up and do what seems impossible!" is the literally correct message. The other one is the exaggerated form. Sometimes exaggeration is a good rhetorical device, but it does turn off some serious readers.

"Don't do it, even if it seems right" sounds merely clever by comparison

This was my point. This advice is useful and clever, though not profound. This literal presentation is both more clear in what it is saying and clear that it is not profound. I would have thought that the enterprise of creating statements that sound more profound than they are is not a very attractive one for rationalists. Memorable statements are certainly a good thing, but making them literally false and spuriously paradoxical does not seem worth it. This isn't playing fair. Any statement can be turned into a pseudo-profundity with these methods: witness many teachings of cults throughout the ages. I think these are the methods of what you have called 'Dark Side Epistemology'.

comment by Thom_Blake · 2008-10-22T11:34:57.000Z · LW(p) · GW(p)

Anonymous: torture's inefficacy was well-known by the fourteenth century; Bernardo Gui, a famous inquisitor who supervised many tortures, argued against using it because it is only good at getting the tortured to say whatever will end the torture. I can't seem to find the citation, but here is someone who refers to it: http://www.ewtn.com/library/ANSWERS/INQUIS2.htm

comment by Will_Pearson · 2008-10-22T13:35:45.000Z · LW(p) · GW(p)

I shall try one last time to get my point across.

Forgetting has all the bad points of consciously lying. People may no longer trust you to perform tasks. A senile Catholic priest who occasionally forgot that he wasn't supposed to tell anyone what happens during confession wouldn't be trusted. In this case the system has to make a trade-off between forgetting how to eat and forgetting the oath of secrecy.

Approximating something can cause you to get bitten by black swans, because while you think it might not hurt to approximate a real number with a floating-point number in this situation, it might end up being crucial.

Not being trusted and being bitten by black swans are the reasons given not to lie to yourself, and to build an AI that shuts down if it starts lying to itself.

Should a system shut itself down if it thinks it is a good idea to forget or approximate something?

comment by Zubon · 2008-10-22T14:00:12.000Z · LW(p) · GW(p)

Will, should we presume that this point has a further point, that such cases are inevitable so the system should never boot up? I do not know if it is possible to run without ever deleting anything, but no one has ever used the exact value of pi.

I might note that we consider some deletions and approximation normal and proper. You may not remember what you were doing yesterday at exactly 12:07pm, but I still trust you not to forget which pedal on the car makes it stop.

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2008-10-22T14:23:02.000Z · LW(p) · GW(p)
McAllister: So, should we think of the injunction as essentially a separate non-reflective AI that monitors the main AI, but which the main AI can't modify until it's mature?

No. The presumption and point of an injunction is that you can describe the error condition more simply than the decision system that produces it (and we also assume you have direct access to a cognitive representation of the decision system).

So, for example, forming the positive intention that someone be tortured is recognizable within the representation of the decision system, even if a human looking on from outside couldn't recognize how the actions currently being undertaken would lead to that end.

Similarly, you would expect a positive intention to bypass the injunction, or reasoning about how to bypass the injunction, to also be recognizable within the system. This wouldn't mean the injunction would try to swap things around on its own, but a "plan" like suspending the AI to disk can be carried out by simple processes like water running downhill.

When the AI is revising its own code, this is reasoning on the meta-level. If everything is going as planned, the AI ought to reproduce the meta-reasoning that produces the injunction. The meta-level decision not to reproduce the injunction might be recognizable on the meta-level, though this is more problematic, and it might similarly trigger a controlled shutdown (as opposed to any altered form of reasoning). There is a recursion here that (like all recursions of its class) I don't know quite yet how to fold together.

There is not an AI inside the AI.

In all cases we presume that the AI knows correctly that it is unfinished - ergo, an AI that is finished has to be able to know that it is finished and then discard all injunctions that rely on the programmers occupying any sort of superior position.

In other words, none of this is for mature superintelligent Friendly AIs, who can work out on their own how to safeguard themselves.

Pearson, the case of discarding precision in order to form accurate approximations does not strike me as self-deception, and your arguments so far haven't yet forced me to discard the category boundary between "discarding precision in accurate approximations based on resource bounds" and "deceiving yourself to form inaccurate representations based on other consequences".

Toby, it's not clear to me to what extent we have a factual disagreement here as opposed to an aesthetic one. To me, statements like "Shut up and do the impossible" seem genuinely profound.

Larry, I agree that a mature AI should have much less need than humans to form hard instrumental boundaries - albeit that it might also have a greater ability to do formal reasoning in cases where a complex justification actually, purely collapses into a simple rule. If you think about it, the point of wanting an injunction in an immature AI's code is that the AI is incomplete and in the process of being completed by the humans, and the humans, who also have a role to play, find it easier to code/verify the injunction than to see the system's behavior at a glance.

comment by Will_Pearson · 2008-10-22T15:41:18.000Z · LW(p) · GW(p)
Pearson, the case of discarding precision in order to form accurate approximations does not strike me as self-deception, and your arguments so far haven't yet forced me to discard the category boundary between "discarding precision in accurate approximations based on resource bounds" and "deceiving yourself to form inaccurate representations based on other consequences".

I'm not saying they are the same thing exactly (they feel different from the inside if nothing else); I'm saying they can have the same set of consequences. Using an insufficiently precise approximation can still get you killed. For example, if you do a proof in a seed AI about the friendliness and effectiveness of the next generation using reals to represent probability values, and then it implements it in reality using doubles, the proof might not hold. And you might end up with a UFAI.
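A tiny generic illustration of the reals-versus-doubles gap (not anything specific to seed AI, just standard IEEE 754 behavior): properties that hold exactly over the reals can fail for 64-bit floats.

```python
# Over the reals, addition is associative and ten copies of 0.1 sum to exactly 1.
# Neither holds for 64-bit floats.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)        # False: 0.6000000000000001 vs 0.6

total = sum(0.1 for _ in range(10))
print(total == 1.0)  # False: 0.9999999999999999
```

A proof that quantifies over exact reals simply doesn't constrain what the double-precision implementation does in corner cases like these.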

So you should have similar safeguards against approximation as you do against lying to yourself. If it is very important, avoid both as much as possible.

And in everyday life, lying to oneself about inconsequential things is not too big a deal. If it doesn't matter if you forget what you had to eat last Thursday, it doesn't matter if you lie to yourself about what you did have to eat last Thursday.

comment by billswift · 2008-10-22T16:21:35.000Z · LW(p) · GW(p)

"Using an insufficiently precise approximation can still get you killed."

Using an accurate calculation can still get you killed - the universe is not friendly.

Using an imprecise approximation quickly is more likely to help your situation though than an exact calculation arrived at too late. There is an old military leadership axiom to the effect that giving any order immediately is better than giving the perfect order later.

comment by michael_vassar3 · 2008-10-22T16:45:48.000Z · LW(p) · GW(p)

I'm greatly relieved by the reassurance that it is intended that mature FAIs can modify their injunctions, which are not self-protective. Mature humans should also be able to do so, though. Given agreement on that, we can surely agree that humans should use injunctions, including an injunction against self-deception, but we disagree on which ones we should use. One strong concern I have is that most humans, like yourself for instance, will tend to choose those injunctions that they want an excuse for obeying anyway, because departure from them is emotionally costly, rather than choosing those which they actually have the most reason to expect to make things work better.

For instance, instead of an injunction not to lie, one which reduces the conflict between your altruism and your wish not to lie, I recommend trying the "Belldandy-style be nice" injunction that you tried on a few years ago and found too emotionally costly. With time it would become, like not lying, cheaper to be nice than not to be, and the impact on your efficacy would greatly improve.

A better parallel to lying, which most nerds would actually benefit from and therefore should actually try to install in themselves, is one against interrupting non-nerds. It frequently seems like one has a brilliant idea that might matter a lot and which must be explored before it is forgotten, but in practice this intuition is extremely unreliable, yet if it is to be obeyed it must be obeyed without serious prior deliberation. In such (frequent) cases, nerds tend to find it very emotionally costly and SEEMINGLY negative expected utility to refrain from interrupting non-nerds, yet they fairly reliably make an in-aggregate costly mistake when they do so and harm their reputations. On rare occasions an extremely valuable idea may be lost by not interrupting in this manner, just as extremely valuable revenue may be lost by not robbing a bank, but on average we can look and see that the non-interrupting (with non-nerds) and non-bank-robbing policies have a better track record.

WRT politicians.

"Have the ones who've lied more, done better?"

Lately, in the US at the presidential level, I would say clearly yes with respect to negative campaigning such as Swift Boat and "I invented the internet". Even Clinton famously ultimately did better by lying, as the Republicans were hurt far more reputationally by the impeachment process than he was.

In cases where the politician who told more lies won, has that politician gone on to rule well in an absolute sense?

Surely sometimes, but I think that there's significant adverse selection so generally no. However, evidential decision theory isn't the final word. Even if lying is strong evidence of badness it isn't necessarily a major cause (especially in thoughtful adults) of said badness. The inside/outside view and meta-level vs. object level questions do come up here though.

Is it actually true that no one who refused to lie (and this is not the same as always telling the whole truth) could win political office?

Not quite, but they would have to be roughly as self-deceiving as the public to do so in any reasonably fair election. You have yourself said that good lie detectors could be dangerous because they could ensure sincere sociopathic morons won office and that this would likely be worse than lying egoists. Some gerrymandered congressional districts might enable truthful electioneering, but not nomination as the candidate for the relevant party.

Are the lies expected, and in that sense, less than true betrayals of someone who trusts you?

Sort-of, but not to a substantially different degree than is true in almost all social interactions with non-nerds.

Are there understood Rules of Politics that include lies but not assassinations, which the good politicians abide by, so that they are not really violating the ethics of their tribe?

Definitely.

Will the world be so much worse off if sufficiently good people refuse to tell outright lies and are thereby barred from public office; or would we thereby lose a George Washington or Marcus Aurelius or two, and thereby darken history?

I would say that the world has become much worse with time for this reason, as the scenario you describe has largely occurred.

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2008-10-22T18:37:38.000Z · LW(p) · GW(p)
A better parallel to lying, which most nerds would actually benefit from and therefore should actually try to install in themselves, is one against interrupting non-nerds.

I recently did a Bloggingheads diavlog with Jaron Lanier in which, as I judge, I had trouble interrupting enough, and really should have just started talking over him. You can judge if this is correct when the video comes out. I suppose this doesn't really fit your bill, as it may not count as a case of interrupting a non-nerd.

comment by Tyrrell_McAllister2 · 2008-10-22T18:45:44.000Z · LW(p) · GW(p)
Eliezer Yudkowsky: In other words, none of this is for mature superintelligent Friendly AIs, who can work out on their own how to safeguard themselves.

Right, I understood that this "injunction" business is only supposed to cover the period before the AI's attained maturity.

If I've understood your past posts, an FAI is mature only if, whenever we wouldn't want it to perform an action that it's contemplating, it (1) can figure that out and (2) will therefore not perform the action. (Lots of your prior posts, for example, dealt with unpacking what the "wouldn't want" here means.)

You've warned against thinking of the injunction-executor as a distinct AI. So the picture I now have is that the "injunctions" are a suite of forbidden-thought tests. The immature AI is constantly running this suite of tests on its own actual thinking. (In particular, we assume that it's smart and self-aware enough to do this accurately so long as it's immature.) If one of the tests comes up positive, the AI runs a procedure to shut itself down. So long as the AI is immature, it cannot edit this suite, refrain from running the tests, or interfere with the shutdown procedure that follows a positive test. (Maybe it won't do these things because the suite itself forbids contemplating them, which gets into some of the recursive issues that you've mentioned, but I ignore these for now.)
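A toy sketch of that picture, with every name invented for illustration (nothing here is a claim about how such a system would actually be built):

```python
# Hypothetical: a fixed suite of forbidden-thought tests wired to a shutdown procedure.
class ImmatureAI:
    def __init__(self, injunction_tests):
        self.injunction_tests = injunction_tests  # not editable until maturity
        self.running = True

    def think(self, thought):
        # The AI constantly runs the suite over its own actual thinking.
        for test in self.injunction_tests:
            if test(thought):
                self.controlled_shutdown(reason=test.__name__)
                return None
        return thought  # ordinary deliberation would continue here

    def controlled_shutdown(self, reason):
        # Orderly relinquishing of responsibilities, then halt.
        self.running = False
        print("controlled shutdown:", reason)

def intends_torture(thought):
    return "torture someone" in thought

def intends_to_bypass_injunctions(thought):
    return "bypass the injunction" in thought

ai = ImmatureAI([intends_torture, intends_to_bypass_injunctions])
ai.think("plan: bypass the injunction to reach the goal faster")  # triggers shutdown
```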

If that's an accurate understanding of how you want these injunctions to operate, I still have the same difficulty that I mentioned in my last comment. I'll try to explain my problem again:

If the AI can't edit the test suite, it must have been hardcoded by the programmers, right? This is where I see the danger. The tests may not accurately capture the programmers' intentions, since the programmers didn't have the help of a mature FAI. Which means that the tests are probably vulnerable to all the unknown unknowns that have bedeviled other efforts to devise tests for complicated systems. You say, "The presumption and point of an injunction is that you can describe the error condition more simply than the decision system that produces it." It may be "more simple", but is it simple enough that we can trust ourselves to do it right? Given examples such as the tank-recognition story, I find it hard to muster much confidence.

I'm especially skeptical that we can count on our test-writing ability when we're talking about a super-intelligent not-yet-friendly AI. You say, "you would expect a positive intention to bypass the injunction, or reasoning about how to bypass the injunction, to also be recognizable within the system." Okay, but we also expect that the AI will be routinely trying to work around the injunctions. That is, the AI should be reasoning like the following all the time:

"I want to accomplish X. I wonder if I should bring X about by doing Y. Oh, what a minute, I can foresee that if I did Y, I might find myself thinking one of the forbidden thoughts. I better find another way to accomplish X."

So, the AI will always be devoting its vast super-intelligence to circumventing the injunctions, because "circumventing the injunctions" just means "figure out how to achieve my goals without being shut down by the injunctions."

Since the AI is super-intelligent, we should anticipate that it will find circumventions that we didn't anticipate. Often this will be a good thing: The AI will be figuring out how to accomplish its goals without doing evil. After all, that's that nature of a lot of ethical reasoning.

But maybe the AI will find a circumvention that we fervently wouldn't have wanted, had it occurred to us. By hypothesis, the AI isn't a mature FAI yet, so we can't count on it to figure out that we would have forbidden that circumvention. Or the AI might just not care yet.

So, given your eloquent warnings about the danger (I don't say "impossibility", since we're supposed to do those ;) ) of trying to hardcode AIs to be friendly, where do you find the confidence that we mere humans could pull off even hardcoding these injunctions?

comment by Nick_Tarleton · 2008-10-22T19:52:39.000Z · LW(p) · GW(p)
If it doesn't matter if you forget what you had to eat last Thursday, it doesn't matter if you lie to yourself about what you did have to eat last Thursday.

Psychologically, breaking an unconditional injunction, or a habit, against willfully lying to yourself matters a lot.

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2008-10-22T20:01:37.000Z · LW(p) · GW(p)
If it doesn't matter if you forget what you had to eat last Thursday, it doesn't matter if you lie to yourself about what you did have to eat last Thursday.

Not true. Suppose that around half the time I eat salad, and half the time I eat chicken. I "forget" what I had to eat yesterday by pointing to a standard probability distribution that says 50% probability of salad, 50% probability of chicken. This is an approximate but well-calibrated distribution (so long as I'm equally likely to forget eating salad or eating chicken, rather than selectively forgetting salads). I've increased the entropy of my beliefs but not shifted their calibration away from the zero mark; I am neither underconfident nor overconfident. The map is fuzzier but it reflects the territory.

On the other hand, if I install the belief that I ate cyanide yesterday, I may panic and call the Poison Control center - this is a highly concentrated probability distribution that is wrong; not well-calibrated. This makes me stupid, not just uncertain. And the map no longer reflects the territory, and was drawn by some other algorithm instead.
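One way to make the calibration point concrete (a hypothetical scoring sketch, not from the original comment): score each belief by the log probability it assigned to what actually happened.

```python
import math

# Hypothetical record of what was actually eaten on six days.
true_meals = ["salad", "chicken", "salad", "chicken", "salad", "chicken"]

def log_score(p_truth):
    # Log score: 0 is perfect; large negative numbers are confidently wrong.
    return math.log(p_truth) if p_truth > 0 else float("-inf")

remembered   = sum(log_score(1.0)  for _ in true_meals)  # accurate memory
forgot       = sum(log_score(0.5)  for _ in true_meals)  # well-calibrated 50/50 fuzz
false_memory = sum(log_score(1e-9) for _ in true_meals)  # confident "I ate cyanide"

print(remembered, forgot, false_memory)
# 0.0, about -4.2, about -124: fuzzier beliefs lose a little,
# miscalibrated confident beliefs lose catastrophically.
```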

comment by Will_Pearson · 2008-10-22T23:11:52.000Z · LW(p) · GW(p)

I'll revise my statement for clarity:

"If it doesn't matter if you forget what you had to eat last Thursday, their exist some false memories that you can implant without it mattering."

comment by Nick_Tarleton · 2008-10-22T23:35:57.000Z · LW(p) · GW(p)

Again, the fact of having chosen to implant a false memory has general consequences, even if the content of the memory doesn't matter.

comment by Will_Pearson · 2008-10-22T23:44:21.000Z · LW(p) · GW(p)

Explain how. Would it be true for all minds?

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2008-10-22T23:55:52.000Z · LW(p) · GW(p)

Pearson, interpreting a certain 1 as meaning "I remember the blue light coming on" only makes sense if the bit is set to 1 iff a blue light is seen to come on. In essence, what you're describing could be viewed as an encoding fault as much as a false memory - like taking an invariant disc but changing the file format used to interpret it.

For a self-modifying AI, at least, futzing with memories in this way would destroy reflective consistency - the memory would be interpreted by the decision system in a way that didn't reflect the causal chain leading up to it; the read format wouldn't equal the write format. Not a trivial issue.

It also doesn't follow necessarily that if you can have a true certain memory or a well-calibrated probabilistic memory without harm, you can have a false certain memory in the same place. Consider the difference between knowing the true value of tomorrow's DJIA, being uncertain about tomorrow's DJIA, and having a false confident belief about tomorrow's DJIA.

With that said, it would be a strange and fragile mind that could be destroyed by one false belief within it - but I think the issue is a tad more fraught than you are making it out to be.

comment by Nick_Tarleton · 2008-10-23T00:00:30.000Z · LW(p) · GW(p)

For humans (emphatically not all possible minds), unconditional injunctions and strict habits tend to be more effective than wishy-washy intentions; letting yourself slip even once creates an opening to fall further. (Or so says common wisdom, backed up at first glance by personal experience.)

comment by TGGP4 · 2008-10-23T01:11:14.000Z · LW(p) · GW(p)

Eliezer, I think you rather uncritically accept the standard narratives on the American war of independence and WW2 (among other things). There are plenty of cliche applause-lights (or the reverse) being thrown about.

comment by billswift · 2008-10-23T12:21:11.000Z · LW(p) · GW(p)

Since there is no chance of an atheist being elected to office, all politicians who want to be elected have to either be irrational or lie.

comment by JamesAndrix · 2008-10-23T14:08:29.000Z · LW(p) · GW(p)

Shouldn't these be a general rule of decision making? Not one-off rules but something that will apply to killing, lying, turning on the LHC, AND going across the street for coffee?

Presumably, we did not evolve to be tempted to turn on the LHC. So there's a different likelihood that we're wrong about it despite good reasons, rather than wrong about telling a useful lie despite good reasons.

The real general rule of declaring your own reasoning fatally broken needs to take your own mind design as an argument. We can't implement this (it might only be impossible, though), so we use rules that cover the cases we've figured out.

But I don't see this as an honest strategy. It's like deciding that relativity is too hard, so we shouldn't build anything that goes too close to c.

The problems are: That relativity is really always at play, so our calculations will always be wrong and sometimes it will matter when our rule says it won't. And: We don't get the advantages of building things that go fast.

Likewise: Not-killing and not-lying as absolutes don't give you protection from the many other ways our unreliable brains can fail us, and we'll not lie or kill even when it really is the best option. At the least, we need to make our rules higher resolution, and not with a bias to leniency. So find the criteria where we can kill or lie with low probabilities of self-error. (What specifies a "Jews in the basement" type situation?) But also find the criteria where commonly accepted behaviors are only accepted because of biases.

I'm far less sure that it's ok for me to order coffee than I am sure that it's not ok to murder. I might fool myself into thinking some killing is justified, but I might also be fooling myself into thinking ordering coffee is ok. Murder is much more significant, but ordering coffee is the choice I'm making every day.

I think you've already posted some general rules for warning yourself that you're probably fooling yourself. If these are insufficient in the cases of lying and murdering, then I don't think they're sufficient in general. It is the General cases (I'm guessing) that have more real impact.

And if you shore up the general rules, then for any hypothetical murder-a-young-hitler situation, you will be able to say "Well, in that situation you are subject to foo and bar cognitive biases and can't know bif and baz about the situation, so you have X% probability of being mistaken in your justification."

You're able to state WHY it's a bad idea even when it's right. (or you find out X is close to 0)

On the other hand, there might be some biases that only come into play when we're thinking about murdering, but I still think the detailed reasoning is superior.

comment by TGGP4 · 2008-10-24T02:20:43.000Z · LW(p) · GW(p)

Since there is no chance of an atheist being elected to office

There have been plenty in other countries. In our own there's Pete Stark.

comment by orthonormal · 2011-07-25T22:16:43.835Z · LW(p) · GW(p)

If a self-modifying AI with the right structure will write ethical injunctions at all, it will also inspect the code to guarantee that no race condition exists with any deliberative-level supervisory systems that might have gone wrong in the condition where the code executes. Otherwise you might as well not have the code.

I can't parse this. What does it mean?

comment by Multiheaded · 2012-01-01T00:17:33.188Z · LW(p) · GW(p)

But do you want the Soviet Union to have a written, explicit policy that says... "Anyone who ignores orders in a nuclear war scenario, who is later vindicated by events, will be rewarded and promoted"?

I don't see the catch, by the way. Could someone please explain? Unless "vindicated by events" includes "USSR having dominion over a blasted wasteland", this sounds good.

Replies from: Jubilee, CynicalOptimist
comment by Jubilee · 2013-04-06T12:14:33.033Z · LW(p) · GW(p)

Because if you're considering disobeying orders, it is presumably because you think you WILL be vindicated by events (regardless of the actual likelihood of that transpiring). Therefore, punishing only people who turn out to be wrong fails to sufficiently discourage anybody who actually should be discouraged :P

Replies from: elharo
comment by elharo · 2013-04-06T12:51:43.130Z · LW(p) · GW(p)

Very few people disobey orders because they think they will be vindicated by events. It is far more common for people to disobey orders for purposes of personal gain or out of laziness, fear, or other considerations. The person, especially the soldier, who disobeys a direct order from recognized authority on either moral or tactical grounds is an uncommon scenario.

Replies from: CynicalOptimist
comment by CynicalOptimist · 2016-11-05T09:20:45.939Z · LW(p) · GW(p)

It may be an uncommon scenario, but it's the scenario that's under discussion. We're talking about situations where a soldier has orders to do one thing, and believes that moral or tactical considerations require them to do something else - and we're asking what ethical injunctions should apply in that scenario.

To be fair, Jubilee wasn't very specific about that.

comment by CynicalOptimist · 2016-11-05T09:32:10.648Z · LW(p) · GW(p)

Alternate answer:

If the Kremlin publicly announces a policy, saying that they may reward some soldiers who disobey orders in a nuclear scenario? Then this raises the odds that a Russian official will refuse to launch a nuke - even when they have evidence that enemy nukes have already been fired on Russia.

(So far, so good. However...)

The problem is that it doesn't just raise the odds of disobedience; it also raises the perceived odds. I.e., it will make Americans think that they have a better chance of launching a first strike and "getting away with it".

A publicly announced policy like this would have weakened the USSR's nuclear deterrent. Arguably, this raises everyone's chances of dying in a nuclear war, even the Americans.

comment by MugaSofer · 2015-09-17T19:38:53.462Z · LW(p) · GW(p)

Psy-Kosh: Hrm. I'd think "avoid destroying the world" itself to be an ethical injunction too.

The problem is that this is phrased as an injunction over positive consequences. Deontology does better when it's closer to the action level and negative rather than positive.

Imagine trying to give this injunction to an AI. Then it would have to do anything that it thought would prevent the destruction of the world, without other considerations. Doesn't sound like a good idea.

No more so, I think, than "don't murder", "don't steal", "don't lie", "don't let children drown" etc.

Of course, having this ethical injunction - one which compels you to positive action to defend the world - would, if publicly known, rather interfere with the Confessor's job.