Morality is Scary

wei-dai

Morality is Scary

post by Wei Dai (Wei_Dai) · 2021-12-02T06:35:06.736Z · LW · GW · 116 comments

122 comments

I'm worried that many AI alignment researchers and other LWers have a view of how human morality works, that really only applies to a small fraction of all humans (notably moral philosophers and themselves). In this view, people know or at least suspect that they are confused about morality, and are eager or willing to apply reason and deliberation to find out what their real values are, or to correct their moral beliefs. Here's an example [LW · GW] of someone who fits this view:

I’ve written, in the past, about a “ghost” version of myself — that is, one that can float free from my body; which travel anywhere in all space and time, with unlimited time, energy, and patience; and which can also make changes to different variables, and play forward/rewind different counterfactual timelines (the ghost’s activity somehow doesn’t have any moral significance).

I sometimes treat such a ghost kind of like an idealized self. It can see much that I cannot. It can see directly what a small part of the world I truly am; what my actions truly mean. The lives of others are real and vivid for it, even when hazy and out of mind for me. I trust such a perspective a lot. If the ghost would say “don’t,” I’d be inclined to listen.

I'm currently reading The Status Game by Will Storr (highly recommended BTW), and found in it the following description of how morality works in most people, which matches my own understanding of history and my observations of humans around me:

The moral reality we live in is a virtue game. We use our displays of morality to manufacture status. It’s good that we do this. It’s functional. It’s why billionaires fund libraries, university scholarships and scientific endeavours; it’s why a study of 11,672 organ donations in the USA found only thirty-one were made anonymously. It’s why we feel good when we commit moral acts and thoughts privately and enjoy the approval of our imaginary audience. Virtue status is the bribe that nudges us into putting the interests of other people – principally our co-players – before our own.

We treat moral beliefs as if they’re universal and absolute: one study found people were more likely to believe God could change physical laws of the universe than he could moral ‘facts’. Such facts can seem to belong to the same category as objects in nature, as if they could be observed under microscopes or proven by mathematical formulae. If moral truth exists anywhere, it’s in our DNA: that ancient game-playing coding that evolved to nudge us into behaving co-operatively in hunter-gatherer groups. But these instructions – strive to appear virtuous; privilege your group over others – are few and vague and open to riotous differences in interpretation. All the rest is an act of shared imagination. It’s a dream we weave around a status game.

The dream shifts as we range across the continents. For the Malagasy people in Madagascar, it’s taboo to eat a blind hen, to dream about blood and to sleep facing westwards, as you’ll kick the sunrise. Adolescent boys of the Marind of South New Guinea are introduced to a culture of ‘institutionalised sodomy’ in which they sleep in the men’s house and absorb the sperm of their elders via anal copulation, making them stronger. Among the people of the Moose, teenage girls are abducted and forced to have sex with a married man, an act for which, writes psychologist Professor David Buss, ‘all concerned – including the girl – judge that her parents giving her to the man was a virtuous, generous act of gratitude’. As alien as these norms might seem, they’ll feel morally correct to most who play by them. They’re part of the dream of reality in which they exist, a dream that feels no less obvious and true to them than ours does to us.

Such ‘facts’ also change across time. We don’t have to travel back far to discover moral superstars holding moral views that would destroy them today. Feminist hero and birth control campaigner Marie Stopes, who was voted Woman of the Millennium by the readers of The Guardian and honoured on special Royal Mail stamps in 2008, was an anti-Semite and eugenicist who once wrote that ‘our race is weakened by an appallingly high percentage of unfit weaklings and diseased individuals’ and that ‘it is the urgent duty of the community to make parenthood impossible for those whose mental and physical conditions are such that there is well-nigh a certainty that their offspring must be physically and mentally tainted’. Meanwhile, Gandhi once explained his agitation against the British thusly: ‘Ours is one continual struggle against a degradation sought to be inflicted upon us by the Europeans, who desire to degrade us to the level of the raw Kaffir [black African] … whose sole ambition is to collect a certain number of cattle to buy a wife with and … pass his life in indolence and nakedness.’ Such statements seem obviously appalling. But there’s about as much sense in blaming Gandhi for not sharing our modern, Western views on race as there is in blaming the Vikings for not having Netflix. Moral ‘truths’ are acts of imagination. They’re ideas we play games with.

The dream feels so real. And yet it’s all conjured up by the game-making brain. The world around our bodies is chaotic, confusing and mostly unknowable. But the brain must make sense of it. It has to turn that blizzard of noise into a precise, colourful and detailed world it can predict and successfully interact with, such that it gets what it wants. When the brain discovers a game that seems to make sense of its felt reality and offer a pathway to rewards, it can embrace its rules and symbols with an ecstatic fervour. The noise is silenced! The chaos is tamed! We’ve found our story and the heroic role we’re going to play in it! We’ve learned the truth and the way – the meaning of life! It’s yams, it’s God, it’s money, it’s saving the world from evil big pHARMa. It’s not like a religious experience, it is a religious experience. It’s how the writer Arthur Koestler felt as a young man in 1931, joining the Communist Party:

‘To say that one had “seen the light” is a poor description of the mental rapture which only the convert knows (regardless of what faith he has been converted to). The new light seems to pour from all directions across the skull; the whole universe falls into pattern, like stray pieces of a jigsaw puzzle assembled by one magic stroke. There is now an answer to every question, doubts and conflicts are a matter of the tortured past – a past already remote, when one lived in dismal ignorance in the tasteless, colourless world of those who don’t know. Nothing henceforth can disturb the convert’s inner peace and serenity – except the occasional fear of losing faith again, losing thereby what alone makes life worth living, and falling back into the outer darkness, where there is wailing and gnashing of teeth.’

I hope this helps further explain why I think even solving (some versions of) the alignment problem probably won't be enough to ensure a future that's free from astronomical waste or astronomical suffering. A part of me is actually more scared of many futures in which "alignment is solved", than a future where biological life is simply wiped out by a paperclip maximizer.

116 comments

Comments sorted by top scores.

comment by TekhneMakre · 2021-12-02T12:41:30.809Z · LW(p) · GW(p)

> All the rest is an act of shared imagination. It’s a dream we weave around a status game.
> They’re part of the dream of reality in which they exist, a dream that feels no less obvious and true to them than ours does to us.
> Moral ‘truths’ are acts of imagination. They’re ideas we play games with.

IDK, I feel like you could say the same sentences truthfully about math, and if you "went with the overall vibe" of them, you might be confused and mistakenly think math was "arbitrary" or "meaningless", or doesn't have a determinate tendency, etc. Like, okay, if I say "one element of moral progress is increasing universalizability", and you say "that's just the thing your status cohort assigns high status", I'm like, well, sure, but that doesn't mean it doesn't also have other interesting properties, like being a tendency across many different peoples; like being correlated with the extent to which they're reflecting, sharing information, and building understanding; like resulting in reductionist-materialist local outcomes that have more of material local things that people otherwise generally seem to like (e.g. not being punched, having food, etc.); etc. It could be that morality has tendencies, but not without hormesis and mutually assured destrubtion and similar things that might be removed by aligned AI.

Replies from: fourier, SDM

↑ comment by fourier · 2021-12-12T19:11:33.699Z · LW(p) · GW(p)

Like, okay, if I say "one element of moral progress is increasing universalizability", and you say "that's just the thing your status cohort assigns high status", I'm like, well, sure, but that doesn't mean it doesn't also have other interesting properties, like being a tendency across many different peoples; like being correlated with the extent to which they're reflecting, sharing information, and building understanding; like resulting in reductionist-materialist local outcomes that have more of material local things that people otherwise generally seem to like (e.g. not being punched, having food, etc.);

"Morality" is totally unlike mathematics where the rules can first be clearly defined, and we operate with that set of rules.

I believe "increasing universalizability" is a good example to prove OPs point. I don't think it's a common belief among "many different peoples" in any meaningful sense. I don't even really understand what it entails. There may be a few nearly universal elements like "wanting food", but destructive aspects are fundamental to our lives so you can't just remove them without fundamentally altering our nature as human beings. Like a lot of people, I don't mind being punched a little as long as (me / my family / my group) wins and gains more resources. I really want to see the people I hate being harmed, and would sacrifice a lot for it, that's a very fundamental aspect of being human.

Replies from: TekhneMakre, TekhneMakre

↑ comment by TekhneMakre · 2021-12-12T20:19:40.982Z · LW(p) · GW(p)

I really want to see the people I hate being harmed, and would sacrifice a lot for it, that's a very fundamental aspect of being human.

Are you pursuing this to any great extent? If so, remind me to stay away from you and avoid investing in you.

Replies from: fourier

↑ comment by fourier · 2021-12-12T21:05:15.934Z · LW(p) · GW(p)

Why are you personally attacking me for discussing the topic at hand? I'm discussing human nature and giving myself as a counter-example, but I clearly meant that it applies to everyone in different ways. I will avoid personal examples since some people have a hard time understanding. I believe you are ironically proving my point by signaling against me based on my beliefs which you dislike.

Replies from: TekhneMakre

↑ comment by TekhneMakre · 2021-12-13T01:03:36.074Z · LW(p) · GW(p)

Attacking you? I said I don't want to be around you and don't want to invest in you. I said it with a touch of snark ("remind me").

> I clearly meant that it applies to everyone in different ways

Not clear to me. I don't think everyone "would sacrifice a lot" to "see the people [they] hate being harmed". I wouldn't. I think behaving that way is inadvisable for you and harmful to others, and will tend to make you a bad investment opportunity.

↑ comment by TekhneMakre · 2021-12-12T20:18:23.686Z · LW(p) · GW(p)

"Morality" is totally unlike mathematics where the rules can first be clearly defined, and we operate with that set of rules.

By that description, mathematics is fairly unlike mathematics.

I don't even really understand what it entails.

It entails that behavior that people consider moral, tends towards having the property that if everyone behaved like that, things would be good. Rule of law, equality before the law, Rawlsian veil of ignorance, stare decisis, equality of opportunity, the golden rule, liberty, etc. Generally, norms that are symmetric across space, time, context, and person. (Not saying we actually have these things, or that "most people" explicitly think these things are good, just that people tend to update in favor of these things.)

Replies from: fourier

↑ comment by fourier · 2021-12-12T21:04:24.649Z · LW(p) · GW(p)

It entails that behavior that people consider moral, tends towards having the property that if everyone behaved like that, things would be good

This is just circular. What is "good"?

Rule of law, equality before the law, Rawlsian veil of ignorance, stare decisis, equality of opportunity, the golden rule, liberty, etc. Generally, norms that are symmetric across space, time, context, and person. (Not saying we actually have these things, or that "most people" explicitly think these things are good, just that people tend to update in favor of these things.)

Evidence that "most people" update in favor of these things? It seems like a very current western morality centric view, and you could probably get people to update in the opposite direction (and they did, many times in history).

Replies from: TekhneMakre

↑ comment by TekhneMakre · 2021-12-13T01:07:01.008Z · LW(p) · GW(p)

>Evidence that "most people" update in favor of these things? It seems like a very current western morality centric view,

Yeah, I think you're right that it's biased towards Western. I think you can generate the obvious examples (e.g. law systems developing; e.g. various revolutions in the name of liberty and equality and against tyranny), and I'm not interested enough right now to come up with more comprehensive treatment of the evidence, and I'm not super confident. It could be interesting to see how this plays out in places where these tendencies seem least present. Is China such a place? (What do most people living in China really think of non-liberty, non-Rawlsianism, etc.?)

↑ comment by Sammy Martin (SDM) · 2021-12-02T18:32:48.284Z · LW(p) · GW(p)

The above sentences, if taken (as you do) as claims about human moral psychology rather than normative ethics, are compatible with full-on moral realism. I.e. everyone's moral attitudes are pushed around by status concerns, luckily we ended up in a community that ties status to looking for long-run implications of your beliefs and making sure they're coherent, and so without having fundamentally different motivations to any other human being we were better able to be motivated by actual moral facts.

I know the OP is trying to say loudly and repeatedly that this isn't the case because 'everyone else thought that as well, don't you know?' with lots of vivid examples, but if that's the only argument it seems like modesty epistemology - i.e. "most people who said the thing you said were wrong, and also said that they weren't like all those other people who were wrong in the past for all these specific reasons, so you should believe you're wrong too".

I think a lot of this thread confuses moral psychology with normative ethics - most utilitarians know and understand that they aren't solely motivated by moral concerns, and are also motivated by lots of other things. They know they don't morally endorse those motivations in themselves, but don't do anything about it, and don't thereby change their moral views.

If Peter Singer goes and buys a coffee, it's no argument at all to say "aha, by revealed preferences, you must not really think utilitarianism is true, or you'd have given the money away!" That doesn't show that when he does donate money, he's unmotivated by moral concerns.

Probably even this 'pure' motivation to act morally in cases where empathy isn't much of an issue is itself made up of e.g. a desire not to be seen believing self-contradictory things, cognitive dissonance, basic empathy and so on. But so what? If the emotional incentives work to motivate people to form more coherent moral views, it's the reliability of the process of forming the views that matter, not the motivation. I'm sure you could tell a similar story about the motivations that drive mathematicians to check their proofs are valid.

comment by Vanessa Kosoy (vanessa-kosoy) · 2021-12-02T12:29:20.489Z · LW(p) · GW(p)

You sound like you're positing the existence of two type of people: type I people who have morality based on "reason" and type II people who have morality based on the "status game". In reality, everyone's nearly everyone's morality is based on something like the status game (see also: 1 [LW · GW] 2 [LW · GW] 3 [LW · GW]). It's just that EAs and moral philosophers are playing the game in a tribe which awards status differently.

The true intrinsic values of most people do place a weight on the happiness of other people (that's roughly what we call "empathy"), but this weight is very unequally distributed [LW(p) · GW(p)].

There are definitely thorny questions regarding the best way to aggregate the values of different people in TAI. But, I think that given a reasonable solution, a lower bound on the future is imagining that the AI will build a private utopia for every person, as isolated from the other "utopias" as that person wants it to be. Probably some people's "utopias" will not be great, viewed in utilitarian terms. But, I still prefer that over paperclips (by far). And, I suspect that most people do (even if they protest it in order to play the game).

Replies from: Wei_Dai, Duncan_Sabien, Avnix

↑ comment by Wei Dai (Wei_Dai) · 2021-12-02T15:40:04.977Z · LW(p) · GW(p)

It’s just that EAs and moral philosophers are playing the game in a tribe which awards status differently.

Sure, I've said as much in recent comments, including this one [LW(p) · GW(p)]. ETA: Related to this, I'm worried about AI disrupting "our" status game in an unpredictable and possibly dangerous way. E.g., what will happen when everyone uses AI advisors to help them play status games, including the status game of moral philosophy?

The true intrinsic values of most people do place a weight on the happiness of other people (that’s roughly what we call “emapthy”), but this weight is very unequally distributed.

What do you mean by "true intrinsic values"? (I couldn't find any previous usage of this term by you.) How do you propose finding people's true intrinsic values?

These weights, if low enough relative to other "values", haven't prevented people from committing atrocities on each other in the name of morality.

There are definitely thorny questions regarding the best way to aggregate the values of different people in TAI. But, I think that given a reasonable solution, a lower bound on the future is imagining that the AI will build a private utopia for every person, as isolated from the other “utopias” as that person wants it to be.

This implies solving a version of the alignment problem that includes reasonable value aggregation between different people (or between AIs aligned to different people), but at least some researchers don't seem to consider that part of "alignment".

Given that playing status games and status competition between groups/tribes/status games constitute a huge part of people's lives, I'm not sure how private utopias that are very isolated from each other would work. Also, I'm not sure if your solution would prevent people from instantiating simulations of perceived enemies / "evil people" in their utopias and punishing them, or just simulating a bunch of low status people to lord over.

Probably some people’s “utopias” will not be great, viewed in utilitarian terms. But, I still prefer that over paperclips (by far).

I concede that a utilitarian would probably find almost all "aligned" futures better than paperclips. Perhaps I should have clarified that by "parts of me" being more scared, I meant the selfish and NU-leaning parts. The utilitarian part of me is just worried about the potential waste caused by many or most "utopias" being very suboptimal in terms of value created per unit of resource consumed.

Replies from: vanessa-kosoy, jacob_cannell, TekhneMakre

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-12-02T16:33:00.157Z · LW(p) · GW(p)

What do you mean by "true intrinsic values"? (I couldn't find any previous usage of this term by you.) How do you propose finding people's true intrinsic values?

I mean the values relative to which a person seems most like a rational agent, arguably formalizable along these [AF(p) · GW(p)] lines.

These weights, if low enough relative to other "values", haven't prevented people from committing atrocities on each other in the name of morality.

Yes.

This implies solving a version of the alignment problem that includes reasonable value aggregation between different people (or between AIs aligned to different people), but at least some researchers don't seem to consider that part of "alignment".

Yes. I do think multi-user alignment is an important problem (and occasionally spend some time thinking about it), it just seems reasonable to solve single user alignment first. Andrew Critch is an example of a person who seems to be concerned about this.

Given that playing status games and status competition between groups/tribes/status games constitute a huge part of people's lives, I'm not sure how private utopias that are very isolated from each other would work.

I meant that each private utopia can contain any number of people created by the AI, in addition to its "customer". Ofc groups that can agree on a common utopia can band together as well.

Also, I'm not sure if your solution would prevent people from instantiating simulations of perceived enemies / "evil people" in their utopias and punishing them, or just simulating a bunch of low status people to lord over.

They are prevented from simulating other pre-existing people without their consent, but can simulate a bunch of low status people to lord over. Yes, this can be bad. Yes, I still prefer this (assuming my own private utopia) over paperclips. And, like I said, this is just a relatively easy to imagine lower bound, not necessarily the true optimum.

Perhaps I should have clarified that by "parts of me" being more scared, I meant the selfish and NU-leaning parts.

The selfish part, at least, doesn't have any reason to be scared as long as you are a "customer".

Replies from: Wei_Dai, Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2021-12-03T07:39:20.098Z · LW(p) · GW(p)

They are prevented from simulating other pre-existing people without their consent

Why do you think this will be the result of the value aggregation (or a lower bound on how good the aggregation will be)? For example, if there is a big block of people who all want to simulate person X in order to punish that person, and only X and a few other people object, why won't the value aggregation be "nobody pre-existing except X (and Y and Z etc.) can be simulated"?

Replies from: vanessa-kosoy, vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-12-04T10:47:51.090Z · LW(p) · GW(p)

Given some assumptions about the domains of the utility functions, it is possible to do better than what I described in the previous comment [AF · GW]. Let be the space of possible experience histories^[1] of user $i$ and $Y$ the space of everything else the utility functions depend on (things that nobody can observe directly). Suppose that the domain of the utility functions is $Z := \prod_{i} X_{i} \times Y$ . Then, we can define the "denosing^[2] operator" $D_{i} : C (Z) \to C (Z)$ for user $i$ by

$(D_{i} u) (x_{i}, x_{- i}, y) := max x^{'} \in \prod_{j \neq i} X_{j} u (x_{i}, x^{'}, y)$

Here, $x_{i}$ is the argument of $u$ that ranges in $X_{i}$ , $x_{- i}$ are the arguments that range in $X_{j}$ for $j \neq i$ and $y$ is the argument that ranges in $Y$ .

That is, $D_{i}$ modifies a utility function by having it "imagine" that the experiences of all users other than $i$ have been optimized, for the experiences of user $i$ and the unobservables held constant.

Let $u_{i} : Z \to R$ be the utility function of user $i$ , and $d^{0} \in R^{n}$ the initial disagreement point (everyone dying), where $n$ is the number of users. We then perform cooperative bargaining on the denosed utility functions $D_{i} u_{i}$ with disagreement point $d^{0}$ , producing some outcome $μ_{0} \in Δ (Z)$ . Define $d^{1} \in R^{n}$ by $d_{i}^{1} := E_{μ} [u_{i}]$ . Now we do another cooperative bargaining with $d^{1}$ as the disagreement point and the original utility functions $u_{i}$ . This gives us the final outcome $μ_{1}$ .

Among other benefits, there is now much less need to remove outliers. Perhaps, instead of removing them we still want to mitigate them by applying "amplified denosing" to them which also removes the dependence on $Y$ .

For this procedure, there is a much better case that the lower bound will be met.

In the standard RL formalism this is the space of action-observation sequences $(A \times O)^{ω}$ . ↩︎
From the expression "nosy preferences", see e.g. here. ↩︎

Replies from: Joe_Collman

↑ comment by Joe Collman (Joe_Collman) · 2021-12-04T18:45:18.432Z · LW(p) · GW(p)

This is very interesting (and "denosing operator" is delightful).

Some thoughts:

If I understand correctly, I think there can still be a problem where user wants an experience history such that part of the history is isomorphic to a simulation of user $j$ suffering ( $i$ wants to fully experience $j$ suffering in every detail).

Here a fixed $x_{i}$ may entail some fixed $x_{j}$ for (some copy of) some $j$ .

It seems the above approach can't then avoid leaving one of $i$ or $j$ badly off:
If $i$ is permitted to freely determine the experience of the embedded $j$ copy, the disagreement point in the second bargaining will bake this in: $j$ may be horrified to see that $i$ wants to experience its copy suffer, but will be powerless to stop it (if $i$ won't budge in the bargaining).

Conversely, if the embedded $j$ is treated as a user which $i$ will imagine is exactly to $i$ 's liking, but who actually gets what $j$ wants, then the selected $μ_{0}$ will be horrible for $i$ (e.g. perhaps $i$ wants to fully experience Hitler suffering, and instead gets to fully experience Hitler's wildest fantasies being realized).

I don't think it's possible to do anything like denosing to avoid this.

It may seem like this isn't a practical problem, since we could reasonably disallow such embedding. However, I think that's still tricky since there's a less exotic version of the issue: my experiences likely already are a collection of subagents' experiences. Presumably my maximisation over $x_{j o e}$ is permitted to determine all the $x_{s u b j o e}$ .

It's hard to see how you draw a principled line here: the ideal future for most people may easily be transhumanist to the point where today's users are tomorrow's subpersonalities (and beyond).

A case that may have to be ruled out separately is where $i$ wants to become a suffering $j$ . Depending on what I consider 'me', I might be entirely fine with it if 'I' wake up tomorrow as suffering $j$ (if I'm done living and think $j$ deserves to suffer).
Or perhaps I want to clone myself $10^{10}$ times, and then have all copies convert themselves to suffering $j$ s after a while. [in general, it seems there has to be some mechanism to distribute resources reasonably - but it's not entirely clear what that should be]

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-12-04T19:49:10.341Z · LW(p) · GW(p)

I think that a rigorous treatment of such issues will require some variant of IB physicalism [LW · GW] (in which the monotonicity problem has been solved, somehow). I am cautiously optimistic that a denosing operator exists there which dodges these problems. This operator will declare both the manifesting and evaluation of the source codes of other users to be "out of scope" for a given user. Hence, a preference of to observe the suffering of $j$ would be "satisfied" by observing nearly anything, since the maximization can interpret anything as a simulation of $j$ .

The "subjoe" problem is different: it is irrelevant because "subjoe" is not a user, only Joe is a user. All the transhumanist magic that happens later doesn't change this. Users are people living during the AI launch, and only them. The status of any future (trans/post)humans is determined entirely according to the utility functions of users. Why? For two reasons: (i) the AI can only have access and stable pointers to existing people (ii) we only need the buy-in of existing people to launch the AI. If existing people want future people to be treated well, then they have nothing to worry about since this preference is part of the existing people's utility functions.

Replies from: Joe_Collman

↑ comment by Joe Collman (Joe_Collman) · 2021-12-04T23:27:03.982Z · LW(p) · GW(p)

Ah - that's cool if IB physicalism might address this kind of thing (still on my to-read list).

Agreed that the subjoe thing isn't directly a problem. My worry is mainly whether it's harder to rule out experiencing a simulation of $x_{s u b j - s u f f e r i n g}$ , since sub $j$ isn't a user. However, if you can avoid the suffering $j$ s by limiting access to information, the same should presumably work for relevant sub- $j$ s.

If existing people want future people to be treated well, then they have nothing to worry about since this preference is part of the existing people's utility functions.

This isn't so clear (to me at least) if:

Most, but not all current users want future people to be treated well.
Part of being "treated well" includes being involved in an ongoing bargaining process which decides the AI's/future's trajectory.

For instance, suppose initially 90% of people would like to have an iterated bargaining process that includes future (trans/post)humans as users, once they exist. The other 10% are only willing to accept such a situation if they maintain their bargaining power in future iterations (by whatever mechanism).

If you iterate this process, the bargaining process ends up dominated by users who won't relinquish any power to future users. 90% of initial users might prefer drift over lock-in, but we get lock-in regardless (the disagreement point also amounting to lock-in).

Unless I'm confusing myself, this kind of thing seems like a problem. (not in terms of reaching some non-terrible lower bound, but in terms of realising potential)
Wherever there's this kind of asymmetry/degradation over bargaining iterations, I think there's an argument for building in a way to avoid it from the start - since anything short of 100% just limits to 0 over time. [it's by no means clear that we do want to make future people users on an equal footing to today's people; it just seems to me that we have to do it at step zero or not at all]

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-12-05T12:36:00.840Z · LW(p) · GW(p)

Ah - that's cool if IB physicalism might address this kind of thing

I admit that at this stage it's unclear because physicalism brings in the monotonicity principle that creates bigger problems than what we discuss here. But maybe some variant can work.

For instance, suppose initially 90% of people would like to have an iterated bargaining process that includes future (trans/post)humans as users, once they exist. The other 10% are only willing to accept such a situation if they maintain their bargaining power in future iterations (by whatever mechanism).

Roughly speaking, in this case the 10% preserve their 10% of the power forever. I think it's fine because I want the buy-in of this 10% and the cost seems acceptable to me. I'm also not sure there is any viable alternative which doesn't have even bigger problems.

Replies from: Joe_Collman

↑ comment by Joe Collman (Joe_Collman) · 2021-12-08T03:52:19.986Z · LW(p) · GW(p)

Sure, I'm not sure there's a viable alternative either. This kind of approach seems promising - but I want to better understand any downsides.

My worry wasn't about the initial 10%, but about the possibility of the process being iterated such that you end up with almost all bargaining power in the hands of power-keepers.

In retrospect, this is probably silly: if there's a designable-by-us mechanism that better achieves what we want, the first bargaining iteration should find it. If not, then what I'm gesturing at must either be incoherent, or not endorsed by the 10% - so hard-coding it into the initial mechanism wouldn't get the buy-in of the 10% to the extent that they understood the mechanism.

In the end, I think my concern is that we won't get buy-in from a large majority of users:
In order to accommodate some proportion with odd moral views it seems likely you'll be throwing away huge amounts of expected value in others' views - if I'm correctly interpreting your proposal (please correct me if I'm confused).

Is this where you'd want to apply amplified denosing?
So, rather than filtering out the undesirable , for these $i$ you use:

$(D_{i} u) (x_{i}, x_{- i}, y) := {max x^{'} \in \prod_{j \neq i} X_{j}, y^{'} \in Y} u (x_{i}, x^{'}, y^{'})$ [i.e. ignoring y and imagining it's optimal]

However, it's not clear to me how we'd decide who gets strong denosing (clearly not everyone, or we don't pick a $y$ ). E.g. if you strong-denose anyone who's too willing to allow bargaining failure [everyone dies] you might end up filtering out altruists who worry about suffering risks.
Does that make sense?

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-12-08T18:30:03.762Z · LW(p) · GW(p)

My worry wasn't about the initial 10%, but about the possibility of the process being iterated such that you end up with almost all bargaining power in the hands of power-keepers.

I'm not sure what you mean here, but also the process is not iterated: the initial bargaining is deciding the outcome once and for all. At least that's the mathematical ideal we're approximating.

In the end, I think my concern is that we won't get buy-in from a large majority of users: In order to accommodate some proportion with odd moral views it seems likely you'll be throwing away huge amounts of expected value in others' views

I don't think so? The bargaining system does advantage large groups over small groups.

In practice, I think that for the most part people don't care much about what happens "far" from them (for some definition of "far", not physical distance) so giving them private utopias is close to optimal from each individual perspective. Although it's true they might pretend to care more than they do for the usual reasons, if they're thinking in "far-mode".

I would certainly be very concerned about any system that gives even more power to majority views. For example, what if the majority of people are disgusted by gay sex and prefer it not the happen anywhere? I would rather accept things I disapprove of happening far away from me than allow other people to control my own life.

Ofc the system also mandates win-win exchanges. For example, if Alice's and Bob's private utopias each contain something strongly unpalatable to the other but not strongly important to the respective customer, the bargaining outcome will remove both unpalatable things.

E.g. if you strong-denose anyone who's too willing to allow bargaining failure [everyone dies] you might end up filtering out altruists who worry about suffering risks.

I'm fine with strong-denosing negative utlitarianists who would truly stick to their guns about negative utilitarianism (but I also don't think there are many).

Replies from: Joe_Collman

↑ comment by Joe Collman (Joe_Collman) · 2021-12-09T22:51:33.418Z · LW(p) · GW(p)

Ah, I was just being an idiot on the bargaining system w.r.t. small numbers of people being able to hold it to ransom. Oops. Agreed that more majority power isn't desirable.
[re iteration, I only meant that the bargaining could become iterated if the initial bargaining result were to decide upon iteration (to include more future users). I now don't think this is particularly significant.]

I think my remaining uncertainty (/confusion) is all related to the issue I first mentioned (embedded copy experiences). It strikes me that something like this can also happen where minds grow/merge/overlap.

This operator will declare both the manifesting and evaluation of the source codes of other users to be "out of scope" for a given user. Hence, a preference of to observe the suffering of $j$ would be "satisfied" by observing nearly anything, since the maximization can interpret anything as a simulation of $j$ .

Does this avoid the problem if $i$ 's preferences use indirection? It seems to me that a robust pointer to $j$ may be enough: that with a robust pointer it may be possible to implicitly require something like source-code-access without explicitly referencing it. E.g. where $i$ has a preference to "experience $j$ suffering in circumstances where there's strong evidence it's actually $j$ suffering, given that these circumstances were the outcome of this bargaining process".

If $i$ can't robustly specify things like this, then I'd guess there'd be significant trouble in specifying quite a few (mutually) desirable situations involving other users too. IIUC, this would only be any problem for the denosed bargaining to find a good $d^{1}$ : for the second bargaining on the true utility functions there's no need to put anything "out of scope" (right?), so win-wins are easily achieved.

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-12-03T12:11:24.277Z · LW(p) · GW(p)

I'm imagining cooperative bargaining between all users, where the disagreement point is everyone dying^[1]^[2] (this is a natural choice assuming that if we don't build aligned TAI we get paperclips). This guarantees that every user will receive an outcome that's at least not worse than death.

With Nash bargaining, we can still get issues for (in)famous people that millions of people want to do unpleasant things to. Their outcome will be better than death, but maybe worse than in my claimed "lower bound".

With Kalai-Smorodinsky bargaining things look better, since essentially we're maximizing a minimum over all users. This should admit my lower bound, unless it is somehow disrupted by enormous asymmetries in the maximal payoffs of different users.

In either case, we might need to do some kind of outlier filtering: if e.g. literally every person on Earth is a user, then maybe some of them are utterly insane in ways that cause the Pareto frontier to collapse.

[EDIT: see improved solution [AF(p) · GW(p)]]

Bargaining assumes we can access the utility function. In reality, even if we solve the value learning problem in the single user case, once you go to the multi-user case it becomes a mechanism design problem: users have incentives to lie / misrepresent their utility functions. A perfect solution might be impossible, but I proposed mitigating this by assigning each user a virtual "AI lawyer" that provides optimal input on their behalf into the bargaining system. In this case they at least have no incentive to lie to the lawyer, and the outcome will not be skewed in favor of users who are better in this game, but we don't get the optimal bargaining solution either.

All of this assumes the TAI is based on some kind of value learning. If the first-stage TAI is based on something else, the problem might become easier or harder. Easier because the first-stage TAI will produce better solutions to the multi-user problem for the second-stage TAI. Harder because it can allow the small group of people controlling it to impose their own preferences.

For IDA-of-imitation, democratization seems like a hard problem because the mechanism by which IDA-of-imitation solves AI risk is precisely by empowering a small group of people over everyone else (since the source of AI risk comes from other people launching unaligned TAI). Adding transparency can entirely undermine safety.

For quantilized debate [LW(p) · GW(p)], adding transparency opens us to an attack vector where the AI manipulates public opinion. This significantly lowers the optimization pressure bar for manipulation, compared to manipulating the (carefully selected) judges, which might undermine the key assumption that effective dishonest strategies are harder to find than effective honest strategies.

This can be formalized by literally having the AI consider the possibility of optimizing for some unaligned utility function. This is a weird and risky approach but it works to 1st approximation. ↩︎
An alternative choice of disagreement point is maximizing the utility of a randomly chosen user. This has advantages and disadvantages. ↩︎

Replies from: Wei_Dai, Joe_Collman

↑ comment by Wei Dai (Wei_Dai) · 2021-12-05T00:52:33.552Z · LW(p) · GW(p)

Bargaining assumes we can access the utility function. In reality, even if we solve the value learning problem in the single user case, once you go to the multi-user case it becomes a mechanism design problem: users have incentives to lie / misrepresent their utility functions. A perfect solution might be impossible, but I proposed mitigating this by assigning each user a virtual “AI lawyer” that provides optimal input on their behalf into the bargaining system. In this case they at least have no incentive to lie to the lawyer, and the outcome will not be skewed in favor of users who are better in this game, but we don’t get the optimal bargaining solution either.

Assuming each lawyer has the same incentive to lie as its client, it has an incentive to misrepresent that some preferable-to-death outcomes are "worse-than-death" (in order to force those outcomes out of the set of "feasible agreements" in hope of getting a more preferred outcome as the actual outcome), and this at equilibrium is balanced by the marginal increase in the probability of getting "everyone dies" as the outcome (due to feasible agreements becoming a null set) caused by the lie. So the probability of "everyone dies" in this game has to be non-zero.

(It's the same kind of problem as in the AI race or tragedy of commons: people not taking into account the full social costs of their actions as they reach for private benefits.)

Of course in actuality everyone dying may not be a realistic consequence of failure to reach agreement, but if the real consequence is better than that, and the AI lawyers know this, they would be more willing to lie since the perceived downside of lying would be smaller, so you end up with a higher chance of no agreement.

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-12-05T12:48:57.024Z · LW(p) · GW(p)

Yes, it's not a very satisfactory solution. Some alternative/complementary solutions:

Somehow use non-transformative AI to do my mind uploading, and then have the TAI to learn by inspecting the uploads. Would be great for single-user alignment as well.
Somehow use non-transformative AI to create perfect lie detectors, and use this to enforce honesty in the mechanism. (But, is it possible to detect self-deception?)
Have the TAI learn from past data which wasn't affected by the incentives created by the TAI. (But, is there enough information there?)
Shape the TAI's prior about human values in order to rule out at least the most blatant lies.
Some clever mechanism design I haven't thought of. The problem with this is, most mechanism designs rely on money and money that doesn't seem applicable, whereas when you don't have money there are many impossibility theorems.

↑ comment by Joe Collman (Joe_Collman) · 2021-12-03T20:28:26.016Z · LW(p) · GW(p)

In either case, we might need to do some kind of outlier filtering: if e.g. literally every person on Earth is a user, then maybe some of them are utterly insane in ways that cause the Pareto frontier to collapse.

This seems near guaranteed to me: a non-zero amount of people will be that crazy (in our terms), so filtering will be necessary.

Then I'm curious about how we draw the line on outlier filtering. What filtering rule do we use? I don't yet see a good principled rule (e.g. if we want to throw out people who'd collapse agreement to the disagreement point, there's more than one way to do that).

Replies from: None

↑ comment by [deleted] · 2021-12-04T08:27:14.406Z · LW(p) · GW(p)

Replies from: Joe_Collman

↑ comment by Joe Collman (Joe_Collman) · 2021-12-04T16:57:23.210Z · LW(p) · GW(p)

Maybe crazy behaviour correlates with less intelligence

Depending what we mean by 'crazy' I think that's unlikely - particularly when what we care about here are highly unusual moral stances. I'd see intelligence as a multiplier, rather than something which points you in the 'right' direction. Outliers will be at both extremes of intelligence - and I think you'll get a much wider moral variety on the high end.

For instance, I don't think you'll find many low-intelligence antinatalists - and here I mean the stronger, non-obvious claim: not simply that most people calling themselves antinatalists, or advocating for antinatalism will have fairly high intelligence, but rather that most people with such a moral stance (perhaps not articulated) will have fairly high intelligence.

Generally, I think there are many weird moral stances you might think your way into that you'd be highly unlikely to find 'naturally' (through e.g. absorption of cultural norms).
I'd also expect creativity to positively correlate with outlier moralities. Minds that habitually throw together seven disparate concepts will find crazier notions than those which don't get beyond three.

Replies from: None

↑ comment by [deleted] · 2021-12-04T18:00:10.279Z · LW(p) · GW(p)

Replies from: Joe_Collman

↑ comment by Joe Collman (Joe_Collman) · 2021-12-05T02:53:08.520Z · LW(p) · GW(p)

First, I think we want to be thinking in terms of [personal morality we'd reflectively endorse] rather than [all the base, weird, conflicting... drivers of behaviour that happen to be in our heads].

There are things most of us would wish to change about ourselves if we could. There's no sense in baking them in for all eternity (or bargaining on their behalf), just because they happen to form part of what drives us now. [though one does have to be a bit careful here, since it's easy to miss the upside of qualities we regard as flaws]

With this in mind, reflectively endorsed antinatalism really is a problem: yes, some people will endorse sacrificing everything just to get to a world where there's no suffering (because there are no people).

Note that the kinds of bargaining approach Vanessa is advocating are aimed at guaranteeing a lower bound for everyone (who's not pre-filtered out) - so you only need to include one person with a particularly weird view to fail to reach a sensible bargain. [though her most recent version [LW(p) · GW(p)] should avoid this]

↑ comment by Wei Dai (Wei_Dai) · 2021-12-04T17:56:49.002Z · LW(p) · GW(p)

Yes, I still prefer this (assuming my own private utopia) over paperclips.

For a utilitarian, this doesn't mean much. What's much more important is something like, "How close is this outcome to an actual (global) utopia (e.g., with optimized utilitronium filling the universe), on a linear scale?" For example, my rough expectation (without having thought about it much) is that your "lower bound" outcome is about midway between paperclips and actual utopia on a logarithmic scale. In one sense, this is much better than paperclips, but in another sense (i.e., on the linear scale), it's almost indistinguishable from paperclips, and a utilitarian would only care about the latter and therefore be nearly as disappointed by that outcome as paperclips.

Replies from: vanessa-kosoy, vanessa-kosoy, sil-ver

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-12-07T12:09:43.848Z · LW(p) · GW(p)

I want to add a little to my stance on utilitarianism. A utilitarian superintelligence would probably kill me and everyone I love, because we are made of atoms that could be used for minds that are more hedonic^[1]^[2]^[3]. Given a choice between paperclips and utilitarianism, I would still choose utilitarianism. But, if there was a utilitarian TAI project along with a half-decent chance to do something better (by my lights), I would actively oppose the utilitarian project. From my perspective, such a project is essentially enemy combatants.

One way to avoid it is by modifying utilitarianism to only place weight on currently existing people. But this is already not that far from my cooperative bargaining proposal (although still inferior to it, IMO). ↩︎
Another way to avoid it is by postulating some very strong penalty on death (i.e. discontinuity of personality). But this is not trivial to do, especially without creating other problems. Moreover, from my perspective this kind of thing is hacks trying to work around the core issue, namely that I am not a utilitarian (along with the vast majority of people). ↩︎
A possible counterargument is, maybe the superhedonic future minds would be sad to contemplate our murder. But, this seems too weak to change the outcome, even assuming that this version of utilitarianism mandates minds who would want to know the truth and care about it, and that this preference is counted towards "utility". ↩︎

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2021-12-08T06:43:52.542Z · LW(p) · GW(p)

A utilitarian superintelligence would probably kill me and everyone I love, because we are made of atoms that could be used for minds that are more hedonic

This seems like a reasonable concern about some types of hedonic utilitarianism. To be clear, I'm not aware of any formulation of utilitarianism that doesn't have serious issues, and I'm also not aware of any formulation of any morality that doesn't have serious issues.

But, if there was a utilitarian TAI project along with a half-decent chance to do something better (by my lights), I would actively oppose the utilitarian project. From my perspective, such a project is essentially enemy combatants.

Just to be clear, this isn't in response to something I wrote, right? (I'm definitely not advocating any kind of "utilitarian TAI project" and would be quite scared of such a project myself.)

Moreover, from my perspective this kind of thing is hacks trying to work around the core issue, namely that I am not a utilitarian (along with the vast majority of people).

So what are you (and them) then? What would your utopia look like?

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-12-08T18:59:04.834Z · LW(p) · GW(p)

Just to be clear, this isn't in response to something I wrote, right? (I'm definitely not advocating any kind of "utilitarian TAI project" and would be quite scared of such a project myself.)

No! Sorry, if I gave that impression.

So what are you (and them) then? What would your utopia look like?

Well, I linked my toy model of partiality [LW(p) · GW(p)] before. Are you asking about something more concrete?

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2021-12-16T04:29:55.021Z · LW(p) · GW(p)

Well, I linked my toy model of partiality before. Are you asking about something more concrete?

Yeah, I mean aside from how much you care about various other people, what concrete things do you want in your utopia?

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-12-25T11:13:50.370Z · LW(p) · GW(p)

I have low confidence about this, but my best guess personal utopia would be something like: A lot of cool and interesting things are happening. Some of them are good, some of them are bad (a world in which nothing bad ever happens would be boring). However, there is a limit on how bad something is allowed to be (for example, true death, permanent crippling of someone's mind and eternal torture are over the line), and overall "happy endings" are more common than "unhappy endings". Moreover, since it's my utopia (according to my understanding of the question, we are ignoring the bargaining process and acausal cooperation here), I am among the top along those desirable dimensions which are zero-sum (e.g. play an especially important / "protagonist" role in the events to the extent that it's impossible for everyone to play such an important role, and have high status to the extent that it's impossible for everyone to have such high status).

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-12-04T18:48:09.908Z · LW(p) · GW(p)

First, you wrote "a part of me is actually more scared of many futures in which alignment is solved, than a future where biological life is simply wiped out by a paperclip maximizer." So, I tried to assuage this fear for a particular class of alignment solutions.

Second... Yes, for a utilitarian this doesn't mean "much". But, tbh, who cares? I am not a utilitarian. The vast majority of people are not utilitarians. Maybe even literally no one is an (honest, not self-deceiving) utilitarian. From my perspective, disappointing the imaginary utilitarian is (in itself) about as upsetting as disappointing the imaginary paperclip maximizer.

Third, what I actually want from multi-user alignment is a solution that (i) is acceptable to me personally (ii) is acceptable to the vast majority of people (at least if they think through it rationally and are arguing honestly and in good faith) (iii) is acceptable to key stakeholders (iv) as much as possible, doesn't leave any Pareto improvements on the table and (v) sufficiently Schelling-pointy to coordinate around. Here, "acceptable" means "a lot better than paperclips and not worth starting an AI race/war to get something better".

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2021-12-04T20:42:18.662Z · LW(p) · GW(p)

Second… Yes, for a utilitarian this doesn’t mean “much”. But, tbh, who cares? I am not a utilitarian. The vast majority of people are not utilitarians. Maybe even literally no one is an (honest, not self-deceiving) utilitarian. From my perspective, disappointing the imaginary utilitarian is (in itself) about as upsetting as disappointing the imaginary paperclip maximizer.

I'm not a utilitarian either, because I don't know what my values are or should be. But I do assign significant credence to the possibility that something in the vincinity of utilitarianism is the right values (for me, or period). Given my uncertainties, I want to arrange the current state of the world so that (to the extent possible), whatever I end up deciding my values are, through things like reason, deliberation, doing philosophy, the world will ultimately not turn out to be a huge disappointment according to those values. Unfortunately, your proposed solution isn't very reassuring to this kind of view.

It's quite possible that I (and people like me) are simply out of luck, and there's just no feasible way to do what we want to do, but it sounds like you think I shouldn't even want what I want, or at least that you don't want something like this. Is it because you're already pretty sure what your values are or should be, and therefore think there's little chance that millennia from now you'll end up deciding that utilitarianism (or NU, or whatever) is right after all, and regret not doing more in 2021 to push the world in the direction of [your real values, whatever they are]?

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-12-04T21:26:33.995Z · LW(p) · GW(p)

I'm moderately sure what my values are, to some approximation. More importantly, I'm even more sure that, whatever my values are, they are not so extremely different from the values of most people that I should wage some kind of war against the majority instead of trying to arrive at a reasonable compromise. And, in the unlikely event that most people (including me) will turn out to be some kind of utilitarians after all, it's not a problem: value aggregation will then produce a universe which is pretty good for utilitarians.

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2021-12-04T21:56:54.278Z · LW(p) · GW(p)

I’m moderately sure what my values are, to some approximation. More importantly, I’m even more sure that, whatever my values are, they are not so extremely different from the values of most people [...]

Maybe you're just not part of the target audience of my OP then... but from my perspective, if I determine my values through the kind of process described in the first quote, and most people determine their values through the kind of process described in the second quote, it seems quite likely that the values end up being very different.

[...] that I should wage some kind of war against the majority instead of trying to arrive at a reasonable compromise.

The kind of solution I have in mind is not "waging war" but for example, solving metaphilososphy and building an AI that can encourage philosophical reflection in humans or enhance people's philosophical abilities.

And, in the unlikely possibility that most people (including me) will turn out to be some kind of utilitarians after all, it’s not a problem: value aggregation will then produce a universe which is pretty good for utilitarians.

What if you turn out to be some kind of utilitarian but most people don't (because you're more like the first group in the OP and they're more like the second group), or most people will eventually turn out to be some kind of utilitarian in a world without AI, but in a world with AI, this [LW(p) · GW(p)] will happen?

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-12-04T22:28:28.044Z · LW(p) · GW(p)

I don't think people determine their values through either process. I think that they already have values, which are to a large extent genetic and immutable. Instead, these processes determine what values they pretend to have for game-theory reasons. So, the big difference between the groups is which "cards" they hold and/or what strategy they pursue, not an intrinsic difference in values.

But also, if we do model values as the result of some long process of reflection, and you're worried about the AI disrupting or insufficiently aiding this process, then this is already a single-user alignment issue and should be analyzed in that context first. The presumed differences in moralities are not the main source of the problem here.

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2021-12-04T22:46:03.594Z · LW(p) · GW(p)

I don’t think people determine their values through either process. I think that they already have values, which are to a large extent genetic and immutable. Instead, these processes determine what values they pretend to have for game-theory reasons. So, the big difference between the groups is which “cards” they hold and/or what strategy they pursue, not an intrinsic difference in values.

This is not a theory that's familiar to me. Why do you think this is true? Have you written more about it somewhere or can link to a more complete explanation?

But also, if we do model values as the result of some long process of reflection, and you’re worried about the AI disrupting or insufficiently aiding this process, then this is already a single-user alignment issue and should be analyzed in that context first. The presumed differences in moralities are not the main source of the problem here.

This seems reasonable to me. (If this was meant to be an argument against something I said, there may have been anther miscommuncation, but I'm not sure it's worth tracking that down.)

Replies from: vanessa-kosoy, adrian-arellano-davin

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-12-08T18:05:06.183Z · LW(p) · GW(p)

This is not a theory that's familiar to me. Why do you think this is true? Have you written more about it somewhere or can link to a more complete explanation?

I considering writing about this for a while, but so far I don't feel sufficiently motivated. So, the links I posted upwards in the thread are the best I have, plus vague gesturing in the directions of Hansonian signaling theories, Jaynes' theory of consciousness and Yudkowsky's belief in belief [LW · GW].

↑ comment by mukashi (adrian-arellano-davin) · 2021-12-04T23:02:38.762Z · LW(p) · GW(p)

Isn't this the main thesis of "The righteous mind"?

↑ comment by Rafael Harth (sil-ver) · 2021-12-04T18:30:56.125Z · LW(p) · GW(p)

This comment seems to be consistent with the assumption that the outcome 1 year after the singularity is locked in forever. But the future we're discussing here is one where humans retain autonomy (?), and in that case, they're allowed to change their mind over time, especially if humanity has access to a superintelligent aligned AI. I think a future where we begin with highly suboptimal personal utopias and gradually transition into utilitronium is among the more plausible outcomes. Compared with other outcomes where Not Everyone Dies, anyway. Your credence may differ if you're a moral relativist.

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2021-12-04T18:37:39.864Z · LW(p) · GW(p)

But the future we’re discussing here is one where humans retain autonomy (?), and in that case, they’re allowed to change their mind over time, especially if humanity has access to a superintelligent aligned AI.

What if the humans ask the aligned AI to help them be more moral, and part of what they mean by "more moral" is having fewer doubts about their current moral beliefs? This is what a "status game" view of morality seems to predict, for the humans whose status games aren't based on "doing philosophy", which seems to be most of them.

Replies from: sil-ver

↑ comment by Rafael Harth (sil-ver) · 2021-12-04T19:31:01.822Z · LW(p) · GW(p)

I don't have any reason why this couldn't happen. My position is something like "morality is real, probably precisely quantifiable; seems plausible that in the scenario of humans with autonomy and aligned AI, this could lead to an asymmetry where more people tend toward utilitronium over time". (Hence why I replied, you didn't seem to consider that possibility.) I could make up some mechanisms for this, but probably you don't need me for that. Also seems plausible that this doesn't happen. If it doesn't happen, maybe the people who get to decide what happens with the rest of the universe tend toward utilitronium. But my model is widely uncertain and doesn't rule out futures of highly suboptimal personal utopias that persist indefinitely.

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2021-12-04T20:15:06.216Z · LW(p) · GW(p)

I could make up some mechanisms for this, but probably you don’t need me for that.

I'm interested in your view on this, plus what we can potentially do to push the future in this direction.

Replies from: sil-ver

↑ comment by Rafael Harth (sil-ver) · 2021-12-04T22:12:15.837Z · LW(p) · GW(p)

I strongly believe that (1) well-being is objective, (2) well-being is quantifiable, and (3) Open Individualism is true (i.e., the concept of identity isn't well-defined, and you're subjectively no less continuous with the future self if any other person than your own future self).

If (1-3) are all true, then utilitronium is the optimal outcome for everyone even if they're entirely selfish. Furthermore, I expect an AGI to figure this out, and to the extent that it's aligned, it should communicate that if it's asked. (I don't think an AGI will therefore decide to do the right thing, so this is entirely compatible with everyone dying if alignment isn't solved.)

In the scenario where people get to talk to the AGI freely and it's aligned, two concrete mechanisms I see are (a) people just ask the AGI what is morally correct and it tells them, and (b) they get some small taste of what utilitronium would feel like, which would make it less scary. (A crucial piece is that they can rationally expect to experience this themselves in the utilitronium future.)

In the scenario where people don't get to talk to the AGI, who knows. It's certainly possible that we have singleton scenario with a few people in charge of the AGI, and they decide to censor questions about ethics because they find the answers scary.

The only org I know of that works on this and shares my philosophical views is QRI. Their goal is to (a) come up with a mathematical space (probably a topological one, mb a Hilbert space) that precisely describes the subjective experience of someone, (b) find a way to put someone in the scanner and create that space, and (c) find a property of that space that corresponds to their well-being in that moment. The flag ship theory is that this property is symmetry. Their model is stronger than (1-3), but if it's correct, you could get hard evidence on this before AGI since it would make strong testable predictions about people's well-being (and they think it could also point to easy interventions, though I don't understand how that works). Whether it's feasible to do this before AGI is a different question. I'd bet against it, but I think I give it better odds than any specific alignment proposal. (And I happen to know that Mike agrees that the future is dominated by concerns about AI and thinks this is the best thing to work on.)

So, I think their research is the best bet for getting more people on board with utilitronium since it can provide evidence on (1) and (2). (Also has the nice property that it won't work if (1) or (2) are false, so there's low risk of outrage.) Other than that, write posts arguing for moral realism and/or for Open Individualism.

Quantifying suffering before AGI would also plausibly help with alignment, since at least you can formally specify a broad space of outcomes you don't want. though it certainly doesn't solve it, e.g. because of inner optimizers.

↑ comment by jacob_cannell · 2021-12-14T05:19:58.528Z · LW(p) · GW(p)

This implies solving a version of the alignment problem that includes reasonable value aggregation between different people (or between AIs aligned to different people),

We already have a solution to this: money. It's also the only solution that satisfies some essential properties such as sybil orthogonality (especially important for posthuman/AGI societies).

↑ comment by TekhneMakre · 2021-12-02T20:36:46.676Z · LW(p) · GW(p)

at least some researchers don't seem to consider that part of "alignment".

It's part of alignment. Also, it seems mostly separate from the part about "how do you even have consequentialism powerful enough to make, say, nanotech, without killing everyone as a side-effect?", and the latter seems not too related to the former.

↑ comment by Duncan Sabien (Deactivated) (Duncan_Sabien) · 2021-12-02T19:58:23.005Z · LW(p) · GW(p)

In reality, everyone's morality is based on something like the status game (see also: 1 [LW · GW] 2 [LW · GW] 3 [LW · GW])

... I really wanted to say [citation needed], but then you did provide citations, but then the citations were not compelling to me.

I'm pretty opposed to such universal claims being made about humans without pushback, because such claims always seem to me to wish-away the extremely wide variation in human psychology and the difficulty establishing anything like "all humans experience X."

There are people who have no visual imagery, people who do not think in words, people who have no sense of continuity of self, people who have no discernible emotional response to all sorts of "emotional" stimuli, and on and on and on.

So, I'll go with "it makes sense to model people as if every one of them is motivated by structures built atop the status game." And I'll go with "it seems like the status architecture is a physiological near-universal, so I have a hard time imagining what else people's morality might be made of." And I'll go with "everyone I've ever talked to had morality that seemed to me to cash out to being statusy, except the people whose self-reports I ignored because they didn't fit the story I was building in my head."

But I reject the blunt universal for not even pretending that it's interested in making itself falsifiable.

Replies from: Wei_Dai, vanessa-kosoy

↑ comment by Wei Dai (Wei_Dai) · 2021-12-03T03:03:25.445Z · LW(p) · GW(p)

Kind of frustrating that this high karma reply to a high karma comment on my post is based on a double misunderstanding/miscommunication:

First Vanessa understood me as claiming that a significant number of people's morality is not based on status games. I tried to clarify in an earlier comment already, but to clarify some more: that's not my intended distinction between the two groups. Rather the distinction is that the first group "know or at least suspect that they are confused about morality, and are eager or willing to apply reason and deliberation to find out what their real values are, or to correct their moral beliefs" (they can well be doing this because of the status game that they're playing) whereas this quoted description doesn't apply to the second group.
Then you (Duncan) understood Vanessa as claiming that literally everyone's morality is based on status games, when (as the subsequent discussion revealed) the intended meaning was more like "the number of people whose morality is not based on status games is a lot fewer than (Vanessa's misunderstanding of) Wei's claim".

Replies from: Duncan_Sabien

↑ comment by Duncan Sabien (Deactivated) (Duncan_Sabien) · 2021-12-03T03:38:19.523Z · LW(p) · GW(p)

I think it's important and valuable to separate out "what was in fact intended" (and I straightforwardly believe Vanessa's restatement as a truer explanation of her actual position) from "what was originally said, and how would 70+ out of 100 readers tend to interpret it."

I think we've cleared up what was meant. I still think it was bad that [the perfectly reasonable thing that was meant] was said in a [predictably misleading fashion].

But I think we've said all that needs to be said about that, too.

Replies from: SaidAchmiz

↑ comment by Said Achmiz (SaidAchmiz) · 2021-12-04T00:59:59.671Z · LW(p) · GW(p)

This is a tangent (so maybe you prefer to direct this discussion elsewhere), but: what’s with the brackets? I see you using them regularly; what do they signify?

Replies from: Duncan_Sabien

↑ comment by Duncan Sabien (Deactivated) (Duncan_Sabien) · 2021-12-04T03:47:26.956Z · LW(p) · GW(p)

I use them where I'm trying to convey a single noun that's made up of many words, and I'm scared that people will lose track of the overall sentence while in the middle of the chunk. It's an attempt to keep the overall sentence understandable. I've tried hyphenating such phrases and people find that more annoying.

Replies from: SaidAchmiz

↑ comment by Said Achmiz (SaidAchmiz) · 2021-12-04T04:53:09.394Z · LW(p) · GW(p)

Hmm, I see, thanks.

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-12-02T20:29:57.845Z · LW(p) · GW(p)

It's not just that the self-reports didn't fit the story I was building, the self-reports didn't fit the revealed preferences. Whatever people say about their morality, I haven't seen anyone who behaves like a true utilitarian.

IMO, this is the source of all the gnashing of teeth about how much % of your salary you need to donate: the fundamental contradiction between the demands of utilitarianism and how much people are actually willing to pay for the status gain. Ofc many excuses were developed ("sure I still need to buy that coffee or those movie tickets, otherwise I won't be productive") but they don't sound like the most parsimonious explanation.

This is also the source of paradoxes in population ethics and its vicinity: those abstractions are just very remote from actual human minds, so there's no reason they should produce anything sane in edge cases. Their only true utility is as an approximate guideline for making group decisions, for sufficiently mundane scenarios. Once you get to issues with infinities it becomes clear utilitarianism is not even mathematically coherent, in general.

You're right that there is a lot of variation in human psychology. But it's also an accepted practice to phrase claims as universal when what you actually mean is, the exceptions are negligible for our practical purpose. For example, most people would accept "humans have 2 arms and 2 legs" as a true statement in many contexts, even though some humans have less. In this case, my claim is that the exceptions are much rarer than the OP seems to imply (i.e. most people the OP classifies as exceptions are not really exceptions).

I'm all for falsifiability, but it's genuinely hard to do falsifiability in soft topics like this, where no theory makes very sharp predictions and collecting data is hard. Ultimately, which explanation is more reasonable is going to be at least in part an intuitive judgement call based on your own experience and reflection. So, yes, I certainly might be wrong, but what I'm describing is my current best guess.

Replies from: Duncan_Sabien, Gunnar_Zarncke

↑ comment by Duncan Sabien (Deactivated) (Duncan_Sabien) · 2021-12-02T21:48:21.346Z · LW(p) · GW(p)

But it's also an accepted practice to phrase claims as universal when what you actually mean is, the exceptions are negligible for our practical purpose. For example, most people would accept "humans have 2 arms and 2 legs" as a true statement in many contexts, even though some humans have less.

The equivalent statement would be "In reality, everyone has 2 arms and 2 legs."

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-12-02T22:43:03.904Z · LW(p) · GW(p)

Well, if the OP said something like "most people have 2 eyes but enlightened Buddhists have a third eye" and I responded with "in reality, everyone have 2 eyes" then, I think my meaning would be clear even though it's true that some people have 1 or 0 eyes (afaik maybe there is even a rare mutation that creates a real third eye). Not adding all possible qualifiers is not the same as "not even pretending that it's interested in making itself falsifiable".

Replies from: Duncan_Sabien

↑ comment by Duncan Sabien (Deactivated) (Duncan_Sabien) · 2021-12-03T20:41:22.202Z · LW(p) · GW(p)

I think your meaning would be clear, but "everyone knows what this straightforwardly false thing that I said really meant" is insufficient for a subculture trying to be precise and accurate and converge on truth. Seems like more LWers are on your side than on mine on that question, but that's not news. ¯\_(ツ)_/¯

It's a strawman to pretend that "please don't say a clearly false thing" is me insisting on "please include all possible qualifiers." I just wish you hadn't said a clearly false thing, is all.

Replies from: vanessa-kosoy, Vladimir_Nesov

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-12-03T21:06:34.284Z · LW(p) · GW(p)

Natural language is not math, it's inherently ambiguous and it's not realistically possible to always be precise without implicitly assuming anything about the reader's understanding of the context. That said, it seems like I wasn't sufficiently precise in this case, so I edited my comment. Thank you for the correction.

↑ comment by Vladimir_Nesov · 2021-12-03T21:49:23.618Z · LW(p) · GW(p)

insufficient for a subculture trying to be precise and accurate and converge on truth

The tradeoff is with verbosity and difficulty of communication, it's not always a straightforward Pareto improvement. So in this case I fully agree with dropping "everyone" or replacing it with a more accurate qualifier. But I disagree with a general principle that would discount ease [LW · GW] for a person who is trained and talented in relevant ways. New habits of thought that become intuitive [LW · GW] are improvements, checklists and other deliberative rituals that slow down thinking need merit that overcomes their considerable cost.

↑ comment by Gunnar_Zarncke · 2021-12-02T21:27:27.131Z · LW(p) · GW(p)

I haven't seen anyone who behaves like a true utilitarian.

That looks like a No True Scotsman argument to me. Just because the extreme doesn't exist doesn't mean that all of the scale can be explained by status games.

Replies from: vanessa-kosoy, Ratios

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-12-02T22:35:56.059Z · LW(p) · GW(p)

What does it have to do with "No True Scotsman"? NTS is when you redefine your categories to justify your claim. I don't think I did that anywhere.

Just because the extreme doesn't exist doesn't mean that all of the scale can be explained by status games.

First, I didn't say all the scale is explained by status games, I did mention empathy as well.

Second, that by itself sure doesn't mean much. Explaining all the evidence would require an article, or maybe a book (although I hoped the posts I linked explain some of it). My point here is that there is an enormous discrepancy between the reported morality and the revealed preferences, so believing self-reports is clearly a non-starter. How do you build an explanation not from self-reports is a different (long) story.

Replies from: Gunnar_Zarncke

↑ comment by Gunnar_Zarncke · 2021-12-02T23:28:41.110Z · LW(p) · GW(p)

I agree that there is an enormous discrepancy.

↑ comment by Ratios · 2021-12-02T21:58:51.790Z · LW(p) · GW(p)

If you try to quantify it, humans on average probably spend over 95% (Conservative estimation) of their time and resources on non-utilitarian causes. True utilitarian behavior Is extremely rare and all other moral behaviors seem to be either elaborate status games or extended self-interest [1]. The typical human is way closer under any relevant quantified KPI to being completely selfish than being a utilitarian.

[1] - Investing in your family/friends is in a way selfish, from a genes/alliances (respectively) perspective.

↑ comment by Sweetgum (Avnix) · 2022-10-25T02:16:43.855Z · LW(p) · GW(p)

But, I still prefer that over paperclips (by far). And, I suspect that most people do (even if they protest it in order to play the game).

What does this even mean? If someone says they don't want X, and they never take actions that promote X, how can it be said that they "truly" want X? It's not their stated preference or their revealed preference!

comment by Viliam · 2021-12-02T21:46:36.395Z · LW(p) · GW(p)

Feminist hero and birth control campaigner Marie Stopes, who was voted Woman of the Millennium by the readers of The Guardian and honoured on special Royal Mail stamps in 2008, was an anti-Semite and eugenicist

My conclusion from this is more like "successful politicians are not moral paragons". More generally, trying to find morally virtuous people by a popular vote is not going to produce great results, because the popularity plays much greater role than morality.

I googled for "woman of the year" to get more data points; found this list, containing: 2019 Greta Thunberg, 2016 Hillary Clinton, 2015 Angela Merkel, 2010 Nancy Pelosi, 2008 Michelle Obama, 1999 Madeleine Albright, 1990 Aung San Suu Kyi... clearly, being a politician dramatically increases your chances of winning. Looking at their behavior, Aung San Suu Kyi later organized a genocide.

The list also includes 2009 Malala Yousafzai, who as far as I know is an actual hero with no dark side. But that's kinda my point, that putting Malala Yousafzai on the same list as Greta Thunberg and Hillary Clinton just makes the list confusing. And if you had to choose one of them as the "woman of the millenium", I would expect most readers to vote for someone representing their political tribe. But to me that does not mean that people have no sense of morality, only that they can easily get politically mindkilled.

For the Malagasy people in Madagascar, it’s taboo [...] to sleep facing westwards, as you’ll kick the sunrise.

And this sounds silly to us, because we know that "kicking the sunrise" is impossible, because Sun is a planet, it is far away, and your kicking has no impact on it.

So, we should distinguish between people having different moral feelings, and having different models of the world. If you actually believed that kicking the Sun is possible and can have astronomical consequences, you would probably also perceive people sleeping westwards as criminally negligent, possibly psychopathic.

Kinda like being angry at people who don't wear face masks only makes sense under the assumption that the face masks prevent spreading of a potentially deadly disease. Without this context, anger towards people with no face masks would just be silly.

Replies from: Wei_Dai, None, fourier

↑ comment by Wei Dai (Wei_Dai) · 2021-12-12T20:41:11.542Z · LW(p) · GW(p)

And this sounds silly to us, because we know that “kicking the sunrise” is impossible, because Sun is a planet, it is far away, and your kicking has no impact on it.

I think a lot of contemporary cultures back then would have found "kicking the sunrise" to be silly, because it was obviously impossible even given what they knew at the time, i.e., you can only kick something if you physically touch it with your foot, and nobody has ever even gotten close to touching the sun, and it's even more impossible while you're asleep.

So, we should distinguish between people having different moral feelings, and having different models of the world. If you actually believed that kicking the Sun is possible and can have astronomical consequences, you would probably also perceive people sleeping westwards as criminally negligent, possibly psychopathic.

Why did the Malagasy people have such a silly belief? Why do many people have very silly beliefs today? (Among the least politically risky ones to cite, someone I've known for years who otherwise is intelligent and successful, currently believes, or at least believed in the recent past, that 2/3 of everyone will die as a result of taking the COVID vaccines.) I think the unfortunate answer is that people are motivated to or are reliably caused to have certain false beliefs, as part of the status games that they're playing. I wrote about one such dynamic [LW(p) · GW(p)], but that's probably not a complete account.

↑ comment by [deleted] · 2021-12-06T16:06:26.977Z · LW(p) · GW(p)

morally virtuous people

I feel like your definition of "morally virtuous" is missing at least 2 parameters: the context that the person is in, and the definition of "morally virtuous". You seem to treat both as fixed or not contributing to the outcome, but in my experience they're equally if not more important than the person. Your example of Aung San Suu Kyi is a good example of that. She was "good" in 1990 given her incentives in 1990 and the popular definition of "good" in 1990. Not so much later.

Replies from: Viliam

↑ comment by Viliam · 2021-12-06T17:58:16.445Z · LW(p) · GW(p)

Moral virtue seems to involve certain... inflexibility to incentives.

If someone says "I would organize the genocide of Rohingya if and only if organizing such genocide is profitable, and it so happens that today it would be unprofitable, therefore today I oppose the genocide", we would typically not call this person moral.

Of course, people usually do not explain their decision algorithms in detail, so the person described above would probably only say "I oppose the genocide", which would seem quite nice of them.

With most people, we will never know what they would do in a parallel universe, where organizing a genocide could give them a well-paid job. Without evidence to contrary, we usually charitably assume that they would refuse... but of course, perhaps this is unrealistically optimistic.

(This only addresses the objection about "context". The problem of definition is more complicated.)

↑ comment by fourier · 2021-12-12T19:39:11.771Z · LW(p) · GW(p)

> and this sounds silly to us, because we know that "kicking the sunrise" is impossible, because sun is a planet, it is far away, and your kicking has no impact on it.

No, the reason it sounds silly to you is not because it's not true, but because it's not part of your own sacred beliefs. There is no fundamental reason for people to support things you are taking for granted as moral facts, like women's right or racial rights.

In fact, given an accurate model of the world, a lot of things that make the most sense you may find distasteful based on your current unusual "moral" fashions.

For example, exterminating opposing groups is common in human societies historically. Often groups are competing for resources, since one group wants more resources for them and their progeny, exterminating the other group makes the most sense.

And if the fundamental desire for survival and dominance -- drilled into us by evolution -- isn't moral, then the concept just seems totally meaningless.

Replies from: Viliam

↑ comment by Viliam · 2021-12-14T20:56:16.594Z · LW(p) · GW(p)

And if the fundamental desire for survival and dominance -- drilled into us by evolution -- isn't moral, then the concept just seems totally meaningless.

A concept is "totally meaningless" just because it does not match some evolutionary strategies? First, concepts are concepts, regardless of what is their relation to evolution. Second, there are many strategies in evolution, including things like cooperation or commitments, which intuitively seem more aligned with morality.

Humans are a social species, where the most aggresive one with most muscles is not necessarily a winner. Sometimes it is actually a loser, who gets beaten by the cops and thrown in jail. Another example: Some homeless people are quite scary and they can survive things that I probably cannot imagine; yet, from the evolutionary perspective, they are usually less successful than me.

Even if a group wants to exterminate another group, it is usually easier if they befriend a different group first, and then attack together. But you usually don't make friends by being a backstabbing asshole. And "not being a backstabbing asshole" is kinda what morality is about.

There is no fundamental reason for people to support things you are taking for granted as moral facts, like women's right or racial rights.

Here we need to decouple moral principles from factual beliefs. On the level of moral principles, many people accept "if some individual is similar to me, they should be treated with some basic respect" as a moral rule. Not all of them, of course. If someone does not accept this moral rule, then... de gustibus non est disputandum, I guess. (I suspect that ethics is somehow downstream of aesthetics, but I may be confused about this.) But even if someone accepts this rule, the actual application will depend on their factual beliefs about who is "similar to me".

I believe it is a statement about the world (not just some kind of sacred belief) that approval of women's rights is positively correlated with the belief that (mentally) women are similar to men. Similarly, the approval of racial rights is positively correlated with the belief that people of different races are (mentally) similar to each other. This statement should be something that both people who approve and who disapprove of the aforementioned rights should agree upon.

At least it seems to me that historically, people who promoted these rights often argued about similarity; and people who opposed these rights often argued about dissimilarity. For example, if you believe that women are inherently incapable of abstract thinking, then of course it does not make any sense to let them study at universities. Or if you believe that black people enjoy being slaves, and actually slavery is much better for them than freedom, then of course abolitionists are just evil fanatics. But if it turns of that these beliefs are factually wrong, then this belief update has moral consequences. If does not effect which moral principles you accept; but if you already accept some moral principles (and many people do) it can effect what these moral principles apply to. You can become an X rights proponent not by adopting a new moral principle, but by learning that your already existing moral principle actually also applies to group X (and then it requires some moral pressure to overcome compartmentalization).

This agains if different from the question what is the right meaning of the word "similar" in sentence "people similar to me should be treated with respect". What kinds of similarity matter? Is the color of the eyes important? Or is it more about being sentient, capable of feeling pain, and such stuff? Again, it seems to me that if someone decides that the color of the eyes is ultimately unimportant, that person is not making a completely random decision, but someone builds on the already existing underlying moral feelings (perhaps combining them with some factual beliefs about how the eye color is or isn't related to other things that matter).

comment by moridinamael · 2021-12-03T16:04:52.036Z · LW(p) · GW(p)

I am also scared of futures where "alignment is solved" under the current prevailing usage of "human values."

Humans want things that we won't end up liking, and prefer things that we will regret getting relative to other options that we previously dispreferred. We are remarkably ignorant of what we will, in retrospect, end up having liked, even over short timescales. Over longer timescales, we learn to like new things that we couldn't have predicted a priori, meaning that even our earnest and thoughtfully-considered best guess of our preferences in advance will predictably be a mismatch for what we would have preferred in retrospect.

And this is not some kind of bug, this is centrally important to what it is to be a person; "growing up" requires a constant process of learning that you don't actually like certain things you used to like and now suddenly like new things. This truth ranges over all arenas of existence, from learning to like black coffee to realizing you want to have children.

I am personally partial to the idea of something like Coherent Extrapolated Volition. But it seems suspicious that I've never seen anybody on LW sketch out how a decision theory ought to behave in situations where the agents utility function will have predictably changed by the time the outcome arrives so the "best choice" is actually a currently dispreferred choice. (In other words, situations where the "best choice" in retrospect, and in expectation, do not match.) It seems dangerous to throw ourselves into a future where "best-in-retrospect" wins every time, because I can imagine many alterations to my utility function that I definitely wouldn't want to accept in advance, but which would make me "happier" in the end. And it also seems awful to accept a process by which "best-in-expectation" wins every time, because I think a likely result is that we are frozen into whatever our current utility function looks like forever. And I do not see any principled and philosophically obvious method by which we ought to arbitrate between in-advance and in-retrospect preferences.

Another way of saying the above is that it seems that "wanting" and "liking" ought to cohere but how they ought to cohere seems tricky to define without baking in some question-begging assumptions.

Replies from: Astor, Vladimir_Nesov, None

↑ comment by Astor · 2021-12-04T11:13:43.677Z · LW(p) · GW(p)

I thought a solved alignment problem would implicate a constant process of changing the values of the AI in regard to the most recent human values. So if something does not lead to the expected terminal goals of the human (such as enjoyable emotions), then the human can indicate that outcome to the AI and the AI would adjust its own goals accordingly.

Replies from: moridinamael

↑ comment by moridinamael · 2021-12-04T18:24:50.597Z · LW(p) · GW(p)

The idea that the AI should defer to the "most recent" human values is an instance of the sort of trap I'm worried about. I suspect we could be led down an incremental path of small value changes in practically any direction, which could terminate in our willing and eager self-extinction or permanent wireheading. But how much tyranny should present-humanity be allowed to have over the choices of future humanity?

I don't think "none" is as wise an answer as it might sound at first. To answer "none" implies a kind of moral relativism that none of us actually hold, and which would make us merely the authors of a process that ultimately destroys everything we currently value.

But also, the answer of "complete control by the future by the present" seems obviously wrong, because we will learn about entirely new things worth caring about that we can't predict now, and sometimes it is natural to change what we like.

More fundamentally, I think the assumption that there exist "human terminal goals" presumes too much. Specifically, it's an assumption that presumes that our desires, in anticipation and in retrospect, are destined to fundamentally and predictably cohere. I would bet money that this isn't the case.

↑ comment by Vladimir_Nesov · 2021-12-03T22:25:03.789Z · LW(p) · GW(p)

The implication of doing everything that AI could do at once is unfortunate. The urgent objective of AI alignment is prevention of AI risk, where a minimal solution is to take away access to unrestricted compute from all humans in a corrigible way that would allow eventual desirable use of it. All other applications of AI could follow much later through corrigibility of this urgent application.

Replies from: None

↑ comment by [deleted] · 2021-12-04T08:51:04.169Z · LW(p) · GW(p)

↑ comment by [deleted] · 2021-12-04T08:33:25.113Z · LW(p) · GW(p)

Replies from: moridinamael

↑ comment by moridinamael · 2021-12-04T18:12:38.419Z · LW(p) · GW(p)

Yes, there is a broad class of wireheading solutions that we would want to avoid, and it is not clear how to specify a rule that distinguishes them from outcomes that we would want. When I was a small child I was certain that I would never want to move away from home. Then I grew up, changed my mind, and moved away from home. It is important that I was able to do something which a past version of myself would be horrified by. But this does not imply that there should be a general rule allowing all such changes. Understanding which changes to your utility function are good or bad is, as far as decision theory is concerned, undefined.

comment by Jon Garcia · 2021-12-02T18:45:13.601Z · LW(p) · GW(p)

Even if moralities vary from culture to culture based on the local status games, I would suggest that there is still some amount of consequentialist bedrock to why certain types of norms develop. In other words, cultural relativism is not unbounded.

Generally speaking, norms evolve over time, where any given norm at one point didn't yet exist if you go back far enough. What caused these norms to develop? I would say the selective pressures for norm development come from some combination of existing culturally-specific norms and narratives (such as the sunrise being an agent that could get hurt when kicked) along with more human-universal motivations (such as empathy + {wellbeing = good, suffering = bad} -> you are bad for kicking the sunrise -> don't sleep facing west) or other instrumentally-convergent goals (such as {power = good} + "semen grants power" -> institutionalized sodomy). At every step along the evolution of a moral norm, every change needs to be justifiable (in a consequentialist sense) to the members of the community who would adopt it. Moral progress is when the norms of society come to better resonate with both the accepted narratives of society (which may come from legends or from science) and the intrinsic values of its members (which come from our biology / psychology).

In a world where alignment has been solved to most everyone's satisfaction, I think that the status-game / cultural narrative aspect of morality will necessarily have been taken into account. For example, imagine a post-Singularity world kind of like Scott Alexander's Archipelago, where the ASI cooperates with each sub-community to create a customized narrative for the members to participate in. It might then slowly adjust this narrative (over decades? centuries?) to align better with human flourishing in other dimensions. The status-game aspect could remain in play as long as status becomes sufficiently correlated with something like "uses their role in life to improve the lives of others within their sphere of control". And I think everyone would be better off if each narrative also becomes at least consistent with what we learn from science, even though the stories that define the status game will be different from one culture to another in other ways.

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2021-12-03T03:24:31.150Z · LW(p) · GW(p)

Upvoted for some interesting thoughts.

In a world where alignment has been solved to most everyone’s satisfaction, I think that the status-game / cultural narrative aspect of morality will necessarily have been taken into account. For example, imagine a post-Singularity world kind of like Scott Alexander’s Archipelago, where the ASI cooperates with each sub-community to create a customized narrative for the members to participate in. It might then slowly adjust this narrative (over decades? centuries?) to align better with human flourishing in other dimensions.

Can you say more about how you see us getting from here to there?

Replies from: Jon Garcia

↑ comment by Jon Garcia · 2021-12-03T17:57:06.232Z · LW(p) · GW(p)

Getting from here to there is always the tricky part with coordination problems, isn't it? I do have some (quite speculative) ideas on that, but I don't see human society organizing itself in this way on its own for at least a few centuries given current political and economic trends, which is why I postulated a cooperative ASI.

So assuming that either an aligned ASI has taken over (I have some ideas on robust alignment, too, but that's out of scope here) or political and economic forces (and infrastructure) have finally pushed humanity past a certain social phase transition, I see humanity undergoing an organizational shift much like what happened with the evolution of multicellularity and eusociality. This would look at first mostly the same as today, except that national borders have become mostly irrelevant due to advances in transportation and communication infrastructure. Basically, imagine the world's cities and highways becoming something like the vascular system of dicots or the closed circulatory system of vertebrates, with the regions enclosed by network circuits acting as de facto states (or organs/tissues, to continue the biological analogy). Major cities and the regions along the highways that connect them become the de facto arbiters of international policy, while the major cities and highways within each region become the arbiters of regional policy, and so on in a hierarchically embedded manner.

Within this structure, enclosed regions would act as hierarchically embedded communities that end up performing a division of labor for the global network, just as organs divide labor for the body (or like tissues divide labor within an organ, or cells within a tissue, or organelles within a cell, if you're looking within regions). Basically, the transportation/communication/etc. network edges would come to act as Markov blankets for the regions they encapsulate, and this organization would extend hierarchically, just like in biological systems, down to the level of local communities. (Ideally, each community would become locally self-sufficient, with broader networks taking on a more modulatory role, but that's another discussion.)

Anyway, once this point is reached, or even as the transition is underway, I see the ASI and/or social pressures facilitating the movement of people toward communities of shared values and beliefs (i.e., shared narratives, or at least minimally conflicting narratives), much like in Scott Alexander's Archipelago. Each person or family unit should move so as to minimize their displacement while maximizing the marginal improvement they could make to their new community (and the marginal benefit they could receive from the new community).

In the system that emerges, stories would become something of a commodity, arising within communities as shared narratives that assign social roles and teach values and lessons (just like the campfire legends of ancient hunter-gatherer societies). Stories with more universal resonance would propagate up hierarchical layers of the global network and then get disseminated top-down toward other local communities within the broader regions. This would provide a narrative-synchronization effect at high levels and across adjacent regions while also allowing for local variations. The status games / moralities of the international level would eventually attain a more "liberal" flavor, while those at more local levels could be more "conservative" in nature.

Sorry, that was long. And it probably involved more idealizational fantasy than rational prediction of future trends. But I have a hunch that something like this could work

comment by jacob_cannell · 2021-12-14T05:19:31.632Z · LW(p) · GW(p)

I honestly have a difficult time understanding the people (such as your "AI alignment researchers and other LWers, Moral philosophers") who actually believe in Morality with a capital M. I believe they are misguided at best, potentially dangerous at worst.

I hadn't heard of the Status Game book you quote, but for a long time now it's seemed obvious to me that there is no objective true Morality, it's purely a cultural construct, and mostly a status game. Any deep reading of history, cultures, and religions, leads one to this conclusion.

Humans have complex values, and that is all.

We humans cooperate and compete to optimize the universe according to those values, as we always have, as our posthuman descendants will, even without fully understanding them.

Replies from: Ape in the coat

↑ comment by Ape in the coat · 2021-12-14T12:01:11.692Z · LW(p) · GW(p)

I think you are misunderstanding what Wei_Dai meant by "AI alignment researchers and other LWers, Moral philosophers" perspective on morality. It's not about capital letters or "objectivity" of our morality. It's about that exact fact that humans have complex values and whether we can understand them and translate them into one course of action according to which we are going to optimize the universe.

Basically, as I understand it, the difference is between people who try to resolve the confilcts between their different values and generally think about them as an approximation of some coherent utility function, and those who don't.

Replies from: jacob_cannell

↑ comment by jacob_cannell · 2021-12-14T17:07:10.041Z · LW(p) · GW(p)

It's about that exact fact that humans have complex values and whether we can understand them and translate them into one course of action according to which we are going to optimize the universe.

If we agree humans have complex subjective values, then optimizing group decisions (for a mix of agents with different utility functions) is firmly a question for economic mechanism design - which is already a reasonably mature field.

Replies from: JenniferRM, Ape in the coat

↑ comment by JenniferRM · 2022-04-15T08:10:23.498Z · LW(p) · GW(p)

A problem here, however, is the Myerson–Satterthwaite result which suggests that auction runners, to enable clean and helpful auctions for others, risk being hurt when they express and seek their own true preferences, or (if they take no such risks) become bad auctioneers for others.

The thing that seems like it might just be True here is that Good Governance requires personal sacrifice by leaders, which I mostly don't expect to happen, given normal human leaders, unless those leaders are motivated by, essentially: "altruistic" "moral sentiment".

It could be that I'm misunderstanding some part of the economics or the anthropology or some such?

But it looks to me like if someone says that there is no such thing as moral sentiment, it implies that they themselves do not have such sentiments, and so perhaps those specific people should not be given power or authority or respect in social processes that are voluntary, universal, benevolent, and theoretically coherent.

The reasonableness of this conclusion goes some way to explain to me how there is so much "social signaling" and also goes to explaining why so much of this signaling is fake garbage transmitted into the social environment by power-hungry psychos.

↑ comment by Ape in the coat · 2021-12-15T06:46:23.990Z · LW(p) · GW(p)

Well, that's one way to do it. With it's own terrible consequences, but lets not focus on them for now.

What's more important is that this solution is very general, while all human values belong to the same cluster. So there may be more preferable, more human-specific solution for the problem.

comment by Shmi (shminux) · 2021-12-02T08:13:49.622Z · LW(p) · GW(p)

To repost my comment from a couple of weeks back [LW(p) · GW(p)], which seems to say roughly the same thing, not as well:

I don't believe alignment is possible. Humans are not aligned with other humans, and the only thing that prevents an immediate apocalypse is the lack of recursive self-improvement on short timescales. Certainly groups of humans happily destroy other groups of humans, and often destroy themselves in the process of maximizing something like the number of statues. Best we can hope for that whatever takes over the planet after meatbags are gone has some of the same goals that the more enlightened meatbags had, where "enlightened" is a very individual definition. Maybe it is a thriving and diverse Galactic civilization, maybe it is the word of God spread to the stars, maybe it is living quietly on this planet in harmony with the nature. There is no single or even shared vision of the future that can be described as "aligned" by most humans.

Replies from: Charlie Steiner, Ratios

↑ comment by Charlie Steiner · 2021-12-03T01:12:48.973Z · LW(p) · GW(p)

Do you think there are changes to the current world that would be "aligned"? (E.g. deleting covid) Then we could end up with a world that is better than our current one, even without needing all humans to agree on what's best.

Another option: why not just do everything at once? Have some people living in a diverse Galactic civilization, other people spreading the word of god, and other people living in harmony with nature, and everyone contributing a little to everyone else's goals? Yes, in principle people can have different values such that this future sounds terrible to everyone - but in reality it seems more like people would prefer this to our current world, but might merely feel like they were missing out relative to their own vision of perfection.

↑ comment by Ratios · 2021-12-02T17:28:16.277Z · LW(p) · GW(p)

I have also made a similar comment a few weeks ago [LW(p) · GW(p)], In fact, this point seems to me so trivial yet corrosive that I find it outright bizarre it's not being tackled/taken seriously by the AI alignment community.

comment by WalterL · 2021-12-03T19:56:47.728Z · LW(p) · GW(p)

I'm not sure what you mean by 'astronomical waste or astronomical suffering'. Like, you are writing that everything forever is status games, ok, sure, but then you can't turn around and appeal to a universal concept of suffering/waste, right?

Whatever you are worried about is just like Gandhi worrying about being too concerned with cattle, plus x years, yeah? And even if you've lucked into a non status games morality such that you can perceive 'Genuine Waste' or what have you...surely by your own logic, we who are reading this are incapable of understanding, aside from in terms of status games.

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2021-12-03T20:35:24.115Z · LW(p) · GW(p)

I'm suggesting that maybe some of us lucked into a status game where we use "reason" and "deliberation" and "doing philosophy" to compete for status, and that somehow "doing philosophy [LW · GW]" etc. is a real thing that eventually leads to real answers about what values we should have (which may or may not depend on who we are). Of course I'm far from certain about this, but at least part of me wants to act as if it's true, because what other choice does it have?

Replies from: None

↑ comment by [deleted] · 2021-12-06T16:08:47.491Z · LW(p) · GW(p)

The alternative is egoism. To the extent that we are allies, I'd be happy if you adopted it.

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2021-12-06T21:31:09.532Z · LW(p) · GW(p)

I don't think that's a viable alternative, given that I don't believe that egoism is certainly right (surely the right way to treat moral uncertainty can't be to just pick something and "adopt it"?), plus I don't even know how to adopt egoism if I wanted to:

https://www.lesswrong.com/posts/Nz62ZurRkGPigAxMK/where-do-selfish-values-come-from [LW · GW]
https://www.lesswrong.com/posts/c73kPDr8pZGdZSe3q/solving-selfishness-for-udt [LW · GW] (which doesn't really solve the problem despite the title)

comment by TekhneMakre · 2021-12-02T12:55:10.689Z · LW(p) · GW(p)

So on the one hand you have values that are easily, trivially compatible, such as "I want to spend 1000 years climbing the mountains of Mars" or "I want to host blood-sports with my uncoerced friends with the holodeck safety on".

On the other hand you have insoluble, or at least apparently insoluble, conflicts: B wants to torture people, C wants there to be no torture anywhere at all. C wants to monitor everyone everywhere forever to check that they aren't torturing anyone or plotting to torture anyone, D wants privacy. E and F both want to be the best in the universe at quantum soccer, even if they have to kneecap everyone else to get that. Etc.

It's simply false that you can just put people in the throne as emperor of the universe, and they'll justly compromise about all conflicts. Or even do anything remotely like that.

How many people have conflictual values that they, effectively, value lexicographically more than their other values? Does decision theory imply that compromise will be chosen by sufficiently well-informed agents who do not have lexicographically valued conflictual values?

comment by Vladimir_Nesov · 2021-12-02T10:18:16.644Z · LW(p) · GW(p)

I'm leaning towards the more ambitious version of the project of AI alignment being about corrigible anti-goodharting, with the AI optimizing towards good trajectories within scope of relatively well-understood values, preventing overoptimized weird/controversial situations, even at the cost of astronomical waste. Absence of x-risks, including AI risks, is generally good. Within this environment, the civilization might be able to eventually work out more about values, expanding the scope of their definition and thus allowing stronger optimization. Here corrigibility is in part about continually picking up the values and their implied scope from the predictions of how they would've been worked out some time in the future.

Replies from: Wei_Dai, Ratios

↑ comment by Wei Dai (Wei_Dai) · 2021-12-05T10:50:24.799Z · LW(p) · GW(p)

I’m leaning towards the more ambitious version of the project of AI alignment being about corrigible anti-goodharting, with the AI optimizing towards good trajectories within scope of relatively well-understood values

Please say more about this? What are some examples of "relatively well-understood values", and what kind of AI do you have in mind that can potentially safely optimize "towards good trajectories within scope" of these values?

Replies from: Vladimir_Nesov

↑ comment by Vladimir_Nesov · 2022-01-08T10:38:03.750Z · LW(p) · GW(p)

My point is that the alignment (values) part of AI alignment is least urgent/relevant to the current AI risk crisis. It's all about corrigibility and anti-goodharting. Corrigibility is hope for eventual alignment, and anti-goodharting makes inadequacy of current alignment and imperfect robustness of corrigibility less of a problem. I gave the relevant example of relatively well-understood values, preference for lower x-risks. Other values are mostly relevant in how their understanding determines the boundary of anti-goodharting, what counts as not too weird for them to apply, not in what they say is better. If anti-goodharting holds (too weird and too high impact situations are not pursued in planning and possibly actively discouraged), and some sort of long reflection is still going on, current alignment (details of what the values-in-AI prefer, as opposed to what they can make sense of) doesn't matter in the long run.

I include maintaining a well-designed long reflection somewhere into corrigibility, for without it there is no hope for eventual alignment, so a decision theoretic agent that has long reflection within its preference is corrigible in this sense. Its corrigibility depends on following a good decision theory, so that there actually exists a way for the long reflection to determine its preference so that it causes the agent to act as the long reflection wishes. But being an optimizer it's horribly not anti-goodharting, so can't be stopped and probably eats everything else.

An AI with anti-goodharting turned to the max is the same as AI with its stop button pressed. An AI with minimal anti-goodharting is an optimizer, AI risk incarnate. Stronger anti-goodharting is a maintenance mode, opportunity for fundamental change, weaker anti-goodharting makes use of more developed values to actually do the things. So a way to control the level of anti-goodharting in an AI is a corrigibility technique. The two concepts work well with each other.

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2022-01-08T20:42:19.614Z · LW(p) · GW(p)

This seems interesting and novel to me, but (of course) I'm still skeptical.

I gave the relevant example of relatively well-understood values, preference for lower x-risks.

Preference for lower x-risk doesn't seem "well-understood" to me, if we include in "x-risk" things like value drift/corruption, premature value lock-in, and other highly consequential AI-enabled decisions (potential existential mistakes) that depend on hard philosophical questions. I gave some specific examples in this recent comment [LW(p) · GW(p)]. What do you think about the problems on that list? (Do you agree that they are serious problems, and if so how do you envision them being solved or prevented in your scenario?)

↑ comment by Ratios · 2021-12-02T17:49:13.543Z · LW(p) · GW(p)

The fact that AI alignment research is 99% about control, and 1% (maybe less?) about metaethics (In the context of how do we even aggregate the utility function of all humanity) hints at what is really going on, and that's enough said.

Replies from: daniel-kokotajlo

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-12-05T11:34:48.837Z · LW(p) · GW(p)

Have you heard about CEV and Fun Theory? In an earlier, more optimistic time, this was indeed a major focus. What changed is we became more pessimistic and decided to focus more on first things first -- if you can't control the AI at all, it doesn't matter what metaethics research you've done. Also, the longtermist EA community still thinks a lot about metaethics relative to literally every other community I know of, on par with and perhaps slightly more than my philosophy grad student friends. (That's my take at any rate, I haven't been around that long.)

Replies from: Ratios

↑ comment by Ratios · 2021-12-05T21:19:30.870Z · LW(p) · GW(p)

CEV was written in 2004, fun theory 13 years ago. I couldn't find any recent MIRI paper that was about metaethics (Granted I haven't gone through all of them). The metaethics question is important just as much as the control question for any utilitarian (What good will it be to control an AI only for it to be aligned with some really bad values, an AI-controlled by a sadistic sociopath is infinitely worse than a paper-clip-maximizer). Yet all the research is focused on control, and it's very hard not to be cynical about it. If some people believe they are creating a god, it's selfishly prudent to make sure you're the one holding the reigns to this god. I don't get why having some blind trust in the benevolence of Peter Thiel (who finances this) or other people who will suddenly have godly powers to care for all humanity seems naive with all we know about how power corrupts and how competitive and selfish people are. Most people are not utilitarians, so as a quasi-utilitarian I'm pretty terrified of what kind of world will be created with an AI-controlled by the typical non-utilitarian person.

Replies from: daniel-kokotajlo, Mitchell_Porter

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-12-05T23:25:31.086Z · LW(p) · GW(p)

My claim was not that MIRI is doing lots of work on metaethics. As far as I know they are focused on the control/alignment problem. This is not because they think it's the only problem that needs solving; it's just the most dire, the biggest bottleneck, in their opinion.

You may be interested to know that I share your concerns about what happens after (if) we succeed at solving alignment. So do many other people in the community, I assure you. (Though I agree on the margin more quiet awareness-raising about this would plausibly be good.)

↑ comment by Mitchell_Porter · 2021-12-06T05:26:43.089Z · LW(p) · GW(p)

http://www.metaethical.ai is the state of the art as far as I'm concerned...

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2023-01-15T03:26:54.732Z · LW(p) · GW(p)

I think this post makes an important point -- or rather, raises a very important question, with some vivid examples to get you started. On the other hand, I feel like it doesn't go further, and probably should have -- I wish it e.g. sketched a concrete scenario in which the future is dystopian not because we failed to make our AGIs "moral" but because we succeeded, or e.g. got a bit more formal and complemented the quotes with a toy model (inspired by the quotes) of how moral deliberation in a society might work, under post-AGI-alignment conditions, and how that could systematically lead to dystopia unless we manage to be foresightful and set up the social conditions just right.

I recommend not including this post, and instead including this one [LW(p) · GW(p)] and Wei Dai's exchange in the comments. [LW(p) · GW(p)]

comment by [deleted] · 2021-12-06T16:01:28.363Z · LW(p) · GW(p)

If with "morality" you mean moral realism, then yes, I agree that it is scary.
I'm most scared by the apparent assumption that we have solved the human alignment problem.
Looking at history, I don't feel like our current situation of relative peace is very stable.
My impression is that "good" behavior is largely dependent on incentives, and so is the very definition of "good".
Perhaps markets are one of the more successful tools of creating aligned behaviour in humans, but even in that case it only seems to work if the powers of the market participants are balanced, which is not a luxury we have in alignment work.

comment by Ben123 · 2021-12-03T03:20:44.030Z · LW(p) · GW(p)

You could read the status game argument the opposite way: Maybe status seeking causes moral beliefs without justifying them, in the same way that it can distort our factual beliefs about the world. If we can debunk moral beliefs by finding them to be only status-motivated, the status explanation can actually assist rational reflection on morality.

Also the quote from The Status Game conflates purely moral beliefs and factual beliefs in a way that IMO weakens its argument. It's not clear that many of the examples of crazy value systems would survive full logical and empirical information.

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2021-12-03T03:34:31.271Z · LW(p) · GW(p)

The point I was trying to make with the quote is that many people are not motivated to do "rational reflection on morality" or examine their value systems to see if they would "survive full logical and empirical information". In fact they're motivated to do the opposite, to protect their value systems against such reflection/examination. I'm worried that alignment researchers are not worried enough that if an alignment scheme causes the AI to just "do what the user wants", that could cause a lock-in of crazy value systems that wouldn't survive full logical and empirical information.

comment by Gunnar_Zarncke · 2021-12-02T23:12:26.650Z · LW(p) · GW(p)

There is no unique eutopia.

Sentient beings that collaborate outcompete ones that don't (not considering here, inner competition in a singleton). Collaboration means that interests between beings are traded/compromised. Better collaboration methods have a higher chance to win. We see this over the course of history. This is a messy evolutionary process. But I think there is a chance that this process itself can be improved e.g. with FAI. Think of an interactive "AlphaValue" that does Monte-Carlo Tree Search over collaboration opportunities. It will not converge on a unique best CEV but result in one of many possible eutopias.

comment by c.trout (ctrout) · 2022-04-14T16:54:20.521Z · LW(p) · GW(p)

I don't follow the reasoning. How do you get from "most people's moral behaviour is explainable in terms of them 'playing' a status game" to "solving (some versions of) the alignment problem probably won't be enough to ensure a future that's free from astronomical waste or astronomical suffering"?

More details:
Regarding the quote from The Status Game: I have not read the book, so I'm not sure what the intended message is but this sounds like some sort of unwarranted pessimism about ppl's moral standing (something like a claim like "the vast majority of ppl are morally ugly in this way"). There is all the difference between being able to explain most ppl's behaviour in terms of "playing a status game" (in some sense of the word "playing"), and claiming that most ppl's conscious motivation to act morally is to win at the status game. The latter claim could plausibly warrant the pessimism; the former, not. But I don't see an argument for the latter. Why is the former claim not evidence for ugliness? For the same reason that the claim "a mother's love for their child is genetically 'hard-wired'" is not evidence that a mother's love for their child is ugly (or fake, not genuine etc). Explaining the underlying causal mechanisms of a given moral behaviour is not (in general) enough to warrant a given moral judgment. (If instead the book is arguing for some sort of meta-ethical anti-realism, well, then the discussion needs to be much longer...)

Regarding your fear about morality: is the worry that if we just aggregated everyone's values we would get a lock-in to some sort of "ugly" status game? Again, we need more details on its proposed implementation before we can judge whether its ugly (something to be scared of).

But also, why are we assuming some sort of aggregation of first-order human value preferences (no matter the method of aggregation)? Assuming we're talking about AGI (and not CAIS), I always thought it strange to think we need to make sure it shares our own idealized preferences, as opposed to merely the preferences we would hope for in, say, a benevolent god or something. I don't see any a priori reason to believe that the preferences/goals of a benevolent shepherd are likely to be shared with/or strongly aligned with those of the shepherd's flock (no matter how you aggregate the preferences/goals of the flock). (I suppose it depends on the nature of the species we're talking about, but whether it's sheep or humans, I maintain my skepticism). In any case, I agree with you that a lot more meta-ethics needs to be discussed in the alignment research community.

comment by ThomasMore · 2021-12-03T13:26:20.798Z · LW(p) · GW(p)

Great post, thanks! Widespread value pluralism a la 'well that's just, like, your opinion man' is now a feature of modern life. Here are a pair of responses from political philosophy which may be of some interest

(1) Rawls/Thin Liberal Approach. Whilst we may not be able to agree on what 'the good life' is, we can at least agree on a basic system which ensures all participants can pursue their own idea of the good life. So,(1) Protect a list of political liberties and freedoms and (2) degree of economic levelling. Beyond that, it is up to the individual what concept of the good they pursue. Scott Alexander's Archipelago is arguably a version of this theory, albeit with a plurality of communities rather than a single state. Note it is 'thin' but not nonexistent - obviously certain concepts of the good, such as 'killing/enslaving everyone for my God', are incompatible and excluded.

(2) Nussbaum 'Capacity' Approach. Bit like Liberal+ approach. You take the liberal approach then beef it up by adding some more requirements: you need to protect the capacity of people to achieve wellbeing. Basically - protect life, environment, health/bodily integrity, education (scientific & creative), practical reason, being able to play, hold property, form emotional and social attachments. The main difference with (2) is that it is a thicker conception of the 'good life' - it will deny various traditional forms of life on the basis they do not educate their children or give them critical thinking skills. Hence, Nussbaum champions the notion of 'universal values.'

Going from (1) to (2) depends on how comfortable you are with an objective notion of flourishing. IMO it's not totally implausible given commonalities across cultures of values (which Nussbaum points out - moral relativism is often exaggerated) and various aspects of human experience.

comment by Tee Canker (tee-canker) · 2021-12-02T07:19:55.975Z · LW(p) · GW(p)

Don't you need AI to go through the many millions of experiences that it might take to develop a good morality strategy?

I'm entranced by Jordan Peterson's descriptions, which seem to light up the evolutionary path of morality for humans. Shouldn't AI be set up to try to grind through the same progress?

Replies from: andrew-mcknight

↑ comment by Andrew McKnight (andrew-mcknight) · 2021-12-02T22:12:53.402Z · LW(p) · GW(p)

I think the main thing you're missing here is that an AI is not generally going to share common learning facilities with humans. An AI growing up as a human will make it wildly different from a normal human because they aren't built precisely to learn from those experiences the way a human does.

comment by Flaglandbase · 2022-04-15T08:51:04.675Z · LW(p) · GW(p)

What's truly scary is how much the beliefs and opinions of normal people make them seem like aliens to me.

comment by anon312 · 2021-12-06T05:35:20.640Z · LW(p) · GW(p)

I find the paragraph beginning with these two sentences, its examples misleading and unconvincing in the point about moral disagreement across time it tries to make:

Such ‘facts’ also change across time. We don’t have to travel back far to discover moral superstars holding moral views that would destroy them today.

I shall try to explain why, because such evidence seemed persuasive to me before I thought about it more; I made this account just for this comment after being a lurker for a while -- I have found your previous posts about moral uncertainty quite interesting.

When comparing moral views across time to see if there is disagreement or agreement, we should see if there is an underlying principle that is stated by the person or one that can be inferred, which is different from the specific contextual judgement.

The application of the same principles across time to various concrete matters will widely vary depending on the context and information of the person applying them, that person's capabilities. Therefore when seeking to determine moral agreement or disagreement with a specific judgement, one should not give examples of specific practical moral judgements without also giving information about the underlying motivation and factual view of the world associated with it.

One can see how much utilitarians with the same theory can diverge on what to do. But they disagree sometimes not in moral theory, but on things that are un-ambiguously empirical matters. So people can disagree on a specific moral judgment without that disagreement being of a moral nature. Because people across time differ quite a lot in their non-moral factual understanding of world, one would need to do research and interpretation from case to case to uncover the actual principles at play in various moral judgements which seem strange or disagreeable to us. Without trying to consider the moral rules or principles being applied, and the understanding of the world they are being applied to, how can we determine if we disagree with someone about a moral fact, or another fact about the world?

To give a simple example about how moral judgements and factual judgements are related: it is morally easy for us to see that killing witches was wrong, because there were no witches. The communities that burned witches may have differed morally from us, but there is enough moral agreement between us and them that if they had not believed in witches or those specific false instances of witches, they would not have burned them (even when there was lying about witches involved in historical contexts of witch burning, that lying took place in a community that believed in witches and wanted to punish them accordingly).

It would not make sense to say something like

But there’s about as much sense in blaming Gandhi for not sharing our modern, Western views on race as there is in blaming the Vikings for not having Netflix.

about the communities that burned witches. I think we can correctly say they were objectively mistaken to burn those witches. I'm not very sure about much involving morality, but that is the sort of modest 'moral realism' I am confident having when judging the historic moral decisions of past communities and individuals (I am not sure about the criteria for 'blaming' but I am confident we can say they were objectively wrong to do that, like about any other factual matter). The loosely held shared moral principle would be something like 'do not kill people for entirely mistaken reasons'. I am not saying that will be the case, locating a shared moral principle, for every past moral decision we might disagree with. But for more complicated cases we really do have to research before determining if there is a genuinely moral disagreement IMO, rather than just stating various specific judgements we seem to disagree with.

So in the case of Marie Stopes and Gandhi, among other examples, I am unable to tell if I morally disagree with them based on the limited information given in the excerpt. I would need to know more about their view of the facts of the matters mentioned and try to understand and infer from what they have shared elsewhere, the moral views being applied (it is especially true in the case of innovative moral activists, that they will articulate their principles, so it is odd that the author does not tell us them).

I hope I have been convincing and reasonable about why we should be wary of telling what someone's underlying moral view is and our agreement or disagreement with it from limited examples of the sort the author gives here; this is even more true for pre-modern cultures with very different understandings of the world, like the Malagasy and their taboos. How tightly wound implicit moral views and empirical views can be is clear with the case of communism, another example given. There is enough moral feeling, belief in common that many unusually-bad-to-us cases of moral behavior are also tied to a very different empirical view of the world.

One final note, regarding status games and morality. I think the fact that many people are urged into behavior considered moral by the expectations of others who find moral behavior praiseworthy, shows that the concept of morality can be objectively applied in scenarios well outside specific people having an autonomous intrinsically moral motivation. Not everyone wants to act morally even if they recognize an objective moral ideal, but often instrumentally is driven to engage in acts considered moral, or meeting the respectable moral minimum at least, by society for their own amoral reasons.

We can even imagine a world where no one acts for intrinsically moral reasons, but people still sometimes shape their actions around a concept of objective morality that arises because people are able to recognize their own selfish interests and recognize, want to reward someone 'altruistically' contributing to their interests or at least respecting them enough not to hurt them. I don't think that world is ours, but I do think in our world many people do not have positive moral motivations for much of their behavior and moral reformers, saint-types are known as such for having unusually strong intrinsic moral motivations and for driving people to further live up to the moral ideals they merely superficially meet for instrumental reasons most of the time.

One can imagine 'moral saints' negotiating among various people's selfish interests to find or push people into the positive sum outcome they themselves could not selfishly have gotten into. I do not think primarily moral agents and 'enlightened' selfish ones are necessarily opposed at all (this sort of negotiation is how one can picture a broader human cooperation / alignment in the context of powerful moral human agents, in relation to the power of an aligned artificial intelligence).

The desire to appear more moral than one really is adds a lot of treachery to moral discourse in all communities; moral theorizing can offend people through intruding on illusions about how selfish they are and this trickiness is surely one reason for less progress on clarifying moral objectivity than there would be otherwise. It seems likely that due to human moral imperfection there is ineliminable suboptimality in our actions and actual desires relative to all of our ideal moral concepts (even if they can genuinely partially motivate us in a more moral direction through their attraction as pure ideals), and that this shall be transferred to any agents that we create to satisfy us. It will probably be like Kant said, the learning of human values by an AI:

Out of the crooked timber of humanity, no straight thing was ever made

comment by Ape in the coat · 2021-12-05T07:23:29.672Z · LW(p) · GW(p)

It seems that our morality consists of two elements. First is bias, based on game theoretical environment of our ancestors. Humans developed complex feelings around activities that promoted inclusive genetic fitness and now we are intrinsically and authentically motivated to do them for their own sake.

There is also a limited capability for moral updates. That's what we use to resolve contradictions in our moral intuitions. And that's also what allow us to persuade ourselves that doing some status promoting thing is actually moral. One the one hand, the fact that it's so easy to change our ethics due to status reasons is kind of scary. On the other, whole ability to morally update was probably developed exactly for this, so it's kind of a miracle that we can use it differently at all.

I don't think that we can update into genuinely feeling that maximizing paperclips for its own sake is the right thing to do. All possible human minds occupy only a small part of all possible minds space. We can consider alignment to be somwhat solved if TAI guarantee us optimization in the direction of some neighbourhood of our moral bias. However, I think it's possible to do better and we do not need all humans to be moral philosophers for that. It will be enough if TAI itself is a perfect moral philosopher able to deduct our coherent extrapolated volition and become an optimization process in that direction.

comment by romeostevensit · 2021-12-02T07:44:28.647Z · LW(p) · GW(p)

You may not be interested in mutually exclusive compression schemas, but mutually exclusive compression schemas are interested in you. One nice thing is that given that the schemas use an arbitrary key to handshake with there is hope that they can be convinced to all get on the same arbitrary key without loss of useful structure.

comment by Tee Canker (tee-canker) · 2021-12-02T07:07:12.139Z · LW(p) · GW(p)

comment by Tee Canker (tee-canker) · 2021-12-02T07:06:24.281Z · LW(p) · GW(p)

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2021-12-02T07:16:59.136Z · LW(p) · GW(p)

Can you please say something about how these videos are relevant to my post?

Replies from: tee-canker

↑ comment by Tee Canker (tee-canker) · 2021-12-02T07:20:42.454Z · LW(p) · GW(p)

Sorry, I'm not used to this particular interface.

I gave comments and did my best.

Thank you for sharing!!!

Morality is Scary

Contents

116 comments