Alignment: "Do what I would have wanted you to do"
post by Oleg Trott (oleg-trott) · 2024-07-12T16:47:24.090Z · LW · GW · 48 comments
Yoshua Bengio writes[1]:
nobody currently knows how such an AGI or ASI could be made to behave morally, or at least behave as intended by its developers and not turn against humans
I think I do[2]. I believe that the difficulties of alignment arise from trying to control something that can manipulate you. And I think you shouldn't try.
Suppose you have a good ML algorithm (not the stuff we have today that needs 1000x more data than humans do), and you train it as an LM.
There is a way to turn a (very good) LM into a goal-driven chatbot via prompt engineering alone, which I'll assume readers can figure out. You give it a goal: "Do what (pre-ASI) X, having considered this carefully for a while, would have wanted you to do".
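For concreteness, one possible (and deliberately simplistic) construction is sketched below: a minimal loop that wraps an LM in a fixed goal prompt and asks it for the next action each turn. The `complete` callable is a hypothetical stand-in for whatever text-completion interface the LM exposes, and this is not necessarily the construction hinted at above.

```python
from typing import Callable

GOAL = ("Do what (pre-ASI) X, having considered this carefully for a while, "
        "would have wanted you to do.")

def run_agent(complete: Callable[[str], str], observation: str, history: list[str]) -> str:
    """Ask the LM for its next action, conditioning on the fixed goal,
    the interaction history so far, and the latest observation."""
    prompt = (
        f"You are an agent whose only goal is: {GOAL}\n\n"
        + "\n".join(history)
        + f"\nObservation: {observation}\nNext action:"
    )
    action = complete(prompt)  # hypothetical LM completion call
    history.append(f"Observation: {observation}\nAction: {action}")
    return action

# Usage with a stand-in "LM" that just echoes; a real completion function would go here.
print(run_agent(lambda p: "(some action)", "The user asked a question.", []))
```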
Whoever builds this AGI will choose what X will be[3]. If it's a private project with investors, they'll probably have a say, as an incentive to invest.
Note that the goal is in plain natural language, not a product of rewards and punishments. And it doesn't say "Do what X wants you to do now".
Suppose this AI becomes superhuman. Its understanding of languages will also be perfect. The smarter it becomes, the better it will understand the intended meaning.
Will it turn everyone into paperclips? I don't think so. That's not what (pre-ASI) X would have wanted, presumably, and the ASI will be smart enough to figure this one out.
Will it manipulate its creators into giving it rewards? No. There are no "rewards".
Will it starve everyone, while obeying all laws and accumulating wealth? Not what I, or any reasonable human, would have wanted.
Will it resist being turned off? Maybe. Depends on whether it thinks that this is what (pre-ASI) X would have wanted it to.
- ^
- ^ I'm not familiar with the ASI alignment literature, but presumably he is. I googled "would have wanted" + "alignment" on this site, and this didn't seem to turn up much. If this has already been proposed, please let me know in the comments.
- ^ Personally, I'd probably want to hedge against my own (expected) fallibility a bit, and include more people that I respect. But this post is just about aligning the AGI with its creators.
48 comments
comment by johnswentworth · 2024-07-12T19:20:26.092Z · LW(p) · GW(p)
Let's assume a base model (i.e. not RLHF'd), since you asserted a way to turn the LM into a goal-driven chatbot via prompt engineering alone. So you put in some prompt, and somewhere in the middle of that prompt is a part which says "Do what (pre-ASI) X, having considered this carefully for a while, would have wanted you to do", for some X.
The basic problem is that this hypothetical language model will not, in fact, do what X, having considered this carefully for a while, would have wanted it to do. What it will do is output text which statistically looks like it would come after that prompt, if the prompt appeared somewhere on the internet.
↑ comment by sunwillrise (andrei-alexandru-parfeni) · 2024-07-13T12:36:29.726Z · LW(p) · GW(p)
The basic problem is that this hypothetical language model will not, in fact, do what X, having considered this carefully for a while, would have wanted it to do. What it will do is output text which statistically looks like it would come after that prompt, if the prompt appeared somewhere on the internet.
The Waluigi effect [LW · GW] seems relevant here. From the perspective of Simulator Theory [? · GW], the prompt is meant to summon a careful simulacrum that follows the instruction to a T, but in reality, this works only if "on the actual internet characters described with that particular [prompt] are more likely to reply with correct answers."
Things can get even weirder and the model can collapse into the complete antithesis of the nice, friendly, aligned persona:
- Rules normally exist in contexts in which they are broken.
- When you spend many bits-of-optimisation locating a character, it only takes a few extra bits to specify their antipode.
- There's a common trope in plots of protagonist vs antagonist.
↑ comment by Oleg Trott (oleg-trott) · 2024-07-12T21:10:04.560Z · LW(p) · GW(p)
Technically true. But you could similarly argue that humans are just clumps of molecules following physical laws. Talking about human goals is a charitable interpretation.
And if you are in a charitable mood, you could interpret LMs as absorbing the explicit and tacit knowledge of millions of Internet authors. A superior ML algorithm would just be doing this better (and maybe it wouldn't need lower-quality data).
↑ comment by johnswentworth · 2024-07-12T22:06:43.933Z · LW(p) · GW(p)
That is not how this works. Let's walk through it for both the "human as clumps of molecules following physics" and the "LLM as next-text-on-internet predictor".
Humans as clumps of molecules following physics
Picture a human attempting to achieve some goal - for concreteness, let's say the human is trying to pick an apple from a high-up branch on an apple tree. Picture what that human does: they maybe get a ladder, or climb the tree, or whatever. They manage to pluck the apple from the tree and drop it in a basket.
Now, imagine a detailed low-level simulation of the exact same situation: that same human trying to pick that same apple. Modulo quantum noise, what happens in that simulation? What do we see when we look at its outputs? Well, it looks like a human attempting to achieve some goal: the clump of molecules which is a human gets another clump which is a ladder, or climbs the clump which is the tree, or whatever.
LLM as next-text-on-internet predictor
Now imagine finding the text "Notes From a Prompt Factory" on the internet, today (because the LLM is trained on text from ~today). Imagine what text would follow that beginning on the internet today.
The text which follows that beginning on the internet today is not, in fact, notes from a prompt factory. Instead, it's fiction about a fictional prompt factory. So that's the sort of thing we should expect a highly capable LLM to output following the prompt "Notes From a Prompt Factory": fiction. The more capable it is, the more likely it is to correctly realize that that prompt precedes fiction.
It's not a question of whether the LLM is absorbing the explicit and tacit knowledge of internet authors; I'm perfectly happy to assume that it is. The issue is that the distribution of text on today's internet which follows the prompt "Notes From a Prompt Factory" is not the distribution of text which would result from actual notes on an actual prompt factory. The highly capable LLM absorbs all that knowledge from internet authors, and then uses that knowledge to correctly predict that what follows the text "Notes From a Prompt Factory" will not be actual notes from an actual prompt factory.
↑ comment by Oleg Trott (oleg-trott) · 2024-07-13T01:02:55.861Z · LW(p) · GW(p)
"Some content on the Internet is fabricated, and therefore we can never trust LMs trained on it"
Is this a fair summary?
↑ comment by johnswentworth · 2024-07-13T16:26:30.912Z · LW(p) · GW(p)
No, because we have tons of information about what specific kinds of information on the internet is/isn't usually fabricated. It's not like we have no idea at all which internet content is more/less likely to be fabricated.
Information about, say, how to prove that there are infinitely many primes is probably not usually fabricated. It's standard basic material, there's lots of presentations of it, it's not the sort of thing which people usually troll about. Yes, the distribution of internet text about the infinitude of primes contains more-than-zero trolling and mistakes and the like, but that's not the typical case, so low-temperature sampling from the LLM should usually work fine for that use-case.
On the other end of the spectrum, "fusion power plant blueprints" on the internet today will obviously be fictional and/or wrong, because nobody currently knows how to build a fusion power plant which works. This generalizes to most use-cases in which we try to get an LLM to do something (using only prompting on a base model) which nobody is currently able to do. Insofar as the LLM is able to do such things, that actually reflects suboptimal next-text prediction on its part.
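As a concrete illustration of what "low-temperature sampling" does mechanically (a toy sketch with made-up logits, not anything specific to any particular model): dividing the logits by a temperature below 1 sharpens the distribution toward the modal continuation, which is why it helps for well-attested material and does nothing when the modal continuation is itself fiction.

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float, rng=None) -> int:
    """Sample one token index from temperature-scaled logits."""
    rng = rng or np.random.default_rng()
    scaled = logits / max(temperature, 1e-8)   # low temperature -> sharper distribution
    probs = np.exp(scaled - scaled.max())      # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

toy_logits = np.array([2.0, 1.0, 0.5, -1.0])       # made-up next-token scores
print(sample_token(toy_logits, temperature=0.1))   # almost always index 0
print(sample_token(toy_logits, temperature=1.5))   # spread across indices
```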
comment by RogerDearnaley (roger-d-1) · 2024-07-12T23:21:15.518Z · LW(p) · GW(p)
Congratulations! You reinvented from scratch (a single-person version of) Coherent Extrapolated Volition [? · GW] (i.e. without the Coherent part). That's a widely considered candidate solution to the Outer Alignment [? · GW] problem (I believe first proposed by MIRI [? · GW] well over a decade ago).
However, what I think Yoshua was also, or even primarily, talking about is the technical problem of "OK, you've defined a goal — how do you then build a machine that you're certain will faithfully attempt to carry out that goal, rather than something else?", which is often called the Inner Alignment [? · GW] problem. (Note that the word "certain" becomes quite important in a context where a mistake could drive the human race extinct.) Proposals tend to involve various combinations of Reinforcement Learning [? · GW] and/or Stochastic Gradient Descent [? · GW] and/or Good Old-Fashioned AI and/or Bayesian [? · GW] Learning, all of which people (who don't want to go extinct) have concerns about. After that, there's also the problem of: OK, you built a machine that genuinely wants to figure out what you would have wanted to do, and then do it — how do you ensure that it's actually good at figuring that out correctly? This is often, on Less Wrong, called the Human Values [? · GW] problem — evidence suggests that modern LLMs are actually pretty good at at least the base encyclopedic factual knowledge part of that.
Roughly speaking, you have to define the right goal (which to avoid oversimplifications generally requires defining it at a meta level as something such as "the limit as some resources tend to infinity of the outcomes of a series of processes like this"), you have to make the machine care about that and not anything else, and then you have to make the machine capable of carrying out the process, to a sufficiently good approximation.
Anyway, welcome down the rabbit-hole: there's a lot to read here.
↑ comment by sunwillrise (andrei-alexandru-parfeni) · 2024-07-13T01:07:35.409Z · LW(p) · GW(p)
I think it is worthwhile to balance out the links you have included in your comment with the following, which refer to discussions and analyses (listed in no particular order) that have been posted on LessWrong and which cast very significant doubt on both the theoretical soundness and the practical viability of CEV [? · GW]:
- Marcello, commenting 16 years ago during Eliezer's Metaethics Sequence [? · GW], pointed out [LW(p) · GW(p)] that there is no particular reason to expect extrapolation to be coherent at all, because of butterfly effects and the manner in which "the mood you were in upon hearing them and so forth could influence which of the arguments you came to trust." (By the way, Marcello's objection, as far as I know, has never been convincingly or even concretely addressed by CEV-proponents. Even the official CEV document and all the writing that came after it mentioned its concerns in a few paragraphs and just... moved on without any specific explanations)
- Wei Dai, in the same thread, questioned [LW(p) · GW(p)] why we ought to expect any coherence of the godshatter [LW · GW] that our human values are to exist and explained [LW(p) · GW(p)] how Eliezer's response was inadequate.
- jbash, addressing the idea [LW(p) · GW(p)] that we need not figure out what human values are (since we can instead outsource everything to an AI that will enact "CEV"), questioned [LW(p) · GW(p)] what basis there is for thinking that CEV even exists, explaining that there is no reason to think humans must converge to the same volition, that the fact that there is no "transcendent, ineffable" [LW(p) · GW(p)], objective morality inherent in the universe which goes beyond what humans determine to be worthwhile [LW · GW] means "there's no particularly defensible way of resolving any real difference", and echoed previously-mentioned concerns by arguing that "even finding any given person's individual volition may be arbitrarily path-dependent."
- Charlie Steiner, in his excellent Reducing Goodhart sequence [? · GW], explained [? · GW] that "Humans don't have our values written in Fortran on the inside of our skulls, we're collections of atoms that only do agent-like things within a narrow band of temperatures and pressures. It's not that there's some pre-theoretic set of True Values hidden inside people and we're merely having trouble getting to them - no, extracting any values at all from humans is a theory-laden act of inference, relying on choices like "which atoms exactly count as part of the person" and "what do you do if the person says different things at different times?""
- Richard Ngo, as part of his comprehensive overview of Realism about Rationality [LW · GW] and his skepticism of it, pointed out [LW(p) · GW(p)] the very important fact that "there is no canonical way to scale [a human] up", which means that you need to make arbitrary choices (with no real guiding hand available to you) of which modifications to make to a human if you want to make him/her more intelligent, knowledgeable, coherent etc.
- Wei Dai again, commenting on one of Rohin Shah's early posts in the Value Learning [? · GW] sequence, explained that because humans will have problems with distributional shifts [LW · GW] regardless of how carefully we design a virtual environment in which an AI is to learn and study their values (as part of an Ambitious Value Learning [? · GW] paradigm, for example), we will have to contend with the very serious problem that "if you give someone a different set of life experiences, they're going to end up a different person with different values, so it seems impossible to learn a complete and consistent utility function by just placing someone in various virtual environments with fake memories of how they got there and observing what they do." Rohin agreed that "if you think that a different set of life experiences means that you are a different person with different values, then that's a really good reason to assume that the whole framework of getting the true human utility function is doomed. Not just ambitious value learning, _any_ framework that involves an AI optimizing some expected utility would not work."
- Steven Byrnes, at the conclusion of his excellent and comprehensive post [LW · GW] covering, from a neuroscientific perspective, the meaning, creation, and functioning of "valence" in the human brain, called into question the very foundation of CEV [? · GW]-thought that has become an entrenched part of LW thinking. As I mentioned in my short review [LW(p) · GW(p)] of his post, "highlights of Steve's writing include crucial observations such as the fact that there is no clean, natural boundary separating "I want to sit down because my legs hurt" and "I want the oppressed masses to obtain justice" (because both of these desires are made of the same brain-stuff and come about from the same structure [? · GW] and reinforcement of innate, base drives through a learning algorithm), that valence is at the root of all normative thinking, and that a human-independent "true morality" would have no reason to emerge as a convergent destination of this process."
- Joe Carlsmith, while detailing "the Ignorance of normative realism bot" [LW · GW], made similar points to Steve by explaining how normative realism, "as its proponents widely acknowledge, [concedes that] normative stuff has no causal interaction with the natural world" and leaves beings that are assumed to have a specific value system in a "totally screwed" scenario when they are unable to access the supposed normative box from an epistemic perspective.
- Joe Carlsmith again, in his outstanding description of "An even deeper atheism" [LW · GW], questioned "how different, exactly, are human hearts from each other? And in particular: are they sufficiently different that, when they foom, and even "on reflection," they don't end up pointing in exactly the same direction?", explaining that optimism about this type of "human alignment" is contingent on the "claim that most humans will converge, on reflection, to sufficiently similar values that their utility functions won't be "fragile" relative to each other." But Joe then keyed in on the crucial point that "while it's true that humans have various important similarities to each other (bodies, genes, cognitive architectures, acculturation processes) that do not apply to the AI case, nothing has yet been said to show that these similarities are enough to overcome the "extremal Goodhart" argument for value fragility", mentioning that when we "systematize" and "amp them up to foom", human desires decohere significantly. (This is the same point that Scott Alexander made in his classic post on the tails coming apart.) Ultimately, the essay concludes by claiming that it's perfectly plausible for most humans to be "paperclippers relative to each other [in the supposed reflective limit]", which is a position of "yet-deeper atheism" that goes beyond Eliezer's unjustified humanistic trust in human hearts.
- MikkW explained [LW(p) · GW(p)] the basis of how the feedback loops implicit in the structure of the brain [? · GW] cause reward and punishment signals to "release chemicals that induce the brain to rearrange itself" in a manner closely analogous to and clearly reminiscent of a continuous and (until death) never-ending micro-scale brain surgery, changing the fundamental basis of human identity and the meaning of morality (as fixed computation) [LW · GW] from their perspective.
- Wei Dai [LW · GW] (yet again) and Stuart Armstrong [LW · GW] explained how there doesn't seem to be a principled basis to expect "beliefs" and "values" to ultimately make sense as distinct and coherent concepts that carve reality at the joints [LW · GW], and also how inferring a human's preferences merely from their actions is impossible unless you make specific assumptions about their rationality and epistemic beliefs about the state of the world, respectively. Paul Christiano went into detail on why this means that even the "easy" goal inference problem [? · GW], meaning the (entirely fantastical and unrealistically optimistic) set-up in which we have access to an infinite amount of compute and to the "complete human policy (a lookup table of what a human would do after making any sequence of observations)", and we must then come up with "any reasonable representation of any reasonable approximation to what that human wants," is in fact tremendously hard.
- The basic and common (at least on LW) representation of "values" through a utility function, typically justified on the basis of notions that "there exist theorems which imply that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies" [LW(p) · GW(p)] (such as in "Coherent decisions imply consistent utilities" [LW · GW], "Sufficiently optimized agents appear coherent" etc), has been challenged (either in whole or in part) by EJT in his post on how "There are no coherence theorems" [LW · GW], in his definition and analysis [LW · GW] of incomplete preferences and preferential gaps as an alignment strategy, by Rohin Shah's explanation [LW · GW] of how consequentialist agents optimizing for universe-histories rather than world-states can display any external behavior whatsoever, by Said Achmiz's explanations of what reasonable yet incomplete preferences can look like (1 [LW(p) · GW(p)], 2 [LW(p) · GW(p)], 3 [LW(p) · GW(p)]), by John Wentworth's analysis of (not maximally coherent) subagents [LW · GW] (note: John Wentworth later argued against [LW · GW] subagents), by Sami Petersen, who argued Wentworth was wrong by illustrating [LW · GW] why incomplete preferences need not be vulnerable, by Steve Byrnes's exploration [LW · GW] of corrigibility in a consequentialist frame that cares about world-trajectories or other kinds of (not current-world-state) preferences, and finally by me in my own question [LW · GW] and comments on my post (1 [LW(p) · GW(p)], 2 [LW(p) · GW(p)], 3 [LW(p) · GW(p)], 4 [LW(p) · GW(p)], 5 [LW(p) · GW(p)]).
I am no doubt missing a ton of links and references to other posts and comments that have made similar or related points over the years. Sadly, I don't quite have the energy to look for more than what I've amassed here. In any case, I, myself, have made similar points before, such as in explaining why I agreed [LW(p) · GW(p)] with Seth Herd that instruction-following AGI was far more likely than value-aligned AGI [LW · GW] (while making the stronger claim that "I consider the concept of a value aligned AGI to be confused and ultimately incoherent"). I also wrote a rather long and comprehensive comment [LW(p) · GW(p)] to Wei Dai expressing my philosophical concerns and confusions about these issues:
Whenever I see discourse about the values or preferences of beings embedded [LW · GW] in a physical universe that goes beyond the boundaries of the domains (namely, low-specificity conversations dominated by intuition) in which such ultimately fake frameworks [LW · GW] function reasonably well, I get nervous and confused. I get particularly nervous if the people participating in the discussions are not themselves confused about these matters (I am not referring to you in particular here, since you have already signaled an appropriate level of confusion [LW · GW] about this). Such conversations stretch our intuitive notions past their breaking point by trying to generalize them out of distribution [LW · GW] without the appropriate level of rigor and care.
What counts as human "preferences"? Are these utility function-like orderings of future world states, or are they ultimately about universe-histories [LW · GW], or maybe a combination of those [LW(p) · GW(p)], or maybe something else entirely [LW · GW]? Do we actually have any good reason to think [LW · GW] that (some form of) utility maximization explains real-world behavior, or are the conclusions broadly converged upon on LW ultimately a result of intuitions [LW(p) · GW(p)] about what powerful cognition must be like whose source is a set of coherence arguments that do not stretch as far as they were purported to [LW · GW]? What do we do with the fact that humans don't seem to have utility functions [LW · GW] and yet lingering confusion about this [LW · GW] remained as a result of many incorrect and misleading statements [LW(p) · GW(p)] by influential members of the community?
How can we use such large sample spaces when it becomes impossible for limited beings like humans or even AGI to differentiate between those outcomes and their associated events? After all, while we might want an AI to push the world towards a desirable state instead of just misleading us into thinking it has done so [LW · GW], how is it possible for humans (or any other cognitively limited agents) to assign a different value, and thus a different preference ranking, to outcomes that they (even in theory) cannot differentiate (either on the basis of sense data or through thought)?
In any case, are they indexical [LW · GW] or not? If we are supposed to think about preferences in terms of revealed preferences [LW(p) · GW(p)] only, what does this mean in a universe (or an Everett branch, if you subscribe to that particular interpretation of QM) that is deterministic [LW · GW]? Aren't preferences thought of as being about possible worlds, so they would fundamentally need to be parts of the map as opposed to the actual territory [LW · GW], meaning we would need some canonical [LW(p) · GW(p)] framework of translating [LW · GW] the incoherent and yet supposedly very complex and multidimensional [? · GW] set of human desires into something that actually corresponds to reality [LW · GW]? What additional structure [LW(p) · GW(p)] must be grafted upon the empirically-observable behaviors in order for "what the human actually wants" to be well-defined?
On the topic of agency, what exactly does that refer to in the real world [LW · GW]? Do we not "first need a clean intuitively-correct mathematical operationalization of what "powerful agent" even means" [LW(p) · GW(p)]? Are humans even agents, and if not [LW · GW], what exactly are we supposed to get out of approaches that are ultimately all about agency [LW · GW]? How do we actually get from atoms to agents [? · GW]? (note that the posts in that eponymous sequence do not even come close to answering this question) More specifically, is a real-world being actually the same as the abstract computation its mind embodies [LW · GW]? Rejections of souls and dualism, alongside arguments for physicalism, do not prove [LW(p) · GW(p)] the computationalist thesis to be correct, as physicalism-without-computationalism is not only possible but also (as the very name implies) a priori far more faithful to the standard physicalist worldview.
What do we mean by morality as fixed computation [LW · GW] in the context of human beings who are decidedly not fixed and whose moral development through time is almost certainly so path-dependent [LW(p) · GW(p)] (through sensitivity to butterfly effects and order dependence [LW · GW]) that a concept like "CEV" [? · GW] probably doesn't make sense? The feedback loops implicit in the structure of the brain [? · GW] cause reward and punishment signals to "release chemicals that induce the brain to rearrange itself" [LW(p) · GW(p)] in a manner closely analogous to and clearly reminiscent of a continuous and (until death) never-ending micro-scale brain surgery. To be sure, barring serious brain trauma, these are typically small-scale changes, but they nevertheless fundamentally modify the connections in the brain and thus the computation it would produce in something like an emulated [? · GW] state (as a straightforward corollary, how would an em that does not "update" its brain chemistry the same way that a biological being does be "human" in any decision-relevant way?). We can think about a continuous personal identity through the lens of mutual information about memories, personalities etc [LW(p) · GW(p)], but our current understanding of these topics is vastly incomplete and inadequate, and in any case the naive (yet very widespread, even on LW) interpretation of "the utility function is not up for grabs" [LW · GW] as meaning that terminal values [LW · GW] cannot be changed (or even make sense as a coherent concept) seems totally wrong.
The way communities make progress on philosophical matters is by assuming that certain answers are correct [LW(p) · GW(p)] and then moving on. After all, you can't ever get to the higher levels that require a solid foundation if you aren't allowed to build such a foundation in the first place. But I worry, for reasons that have been stated before [LW(p) · GW(p)], that the vast majority of the discourse by "lay lesswrongers" [LW(p) · GW(p)] (and, frankly, even far more experienced members of the community working directly on alignment research; as a sample illustration, see a foundational report [LW · GW]'s failure to internalize the lesson of "Reward is not the optimization target" [LW · GW]) is based on conclusions reached through informal and non-rigorous intuitions [LW(p) · GW(p)] that lack the feedback loops necessary to ground themselves to reality [LW · GW] because they do not do enough "homework problems" [LW(p) · GW(p)] to dispel misconceptions and lingering confusions about complex and counterintuitive matters.
I think the fact that CEV-like ideas are still prevalent in this community (as inferred from posts like OP's [LW · GW]) means it would probably be worthwhile for someone (maybe me, if I have the time or energy at some point in the near-term future) to collect all of these ideas and references into one top-level post that conclusively makes the argument for why CEV doesn't make sense and won't work (beyond the mere practical detail of "we won't get to implement it in these inscrutable floating-point vectors [LW · GW] making up modern state-of-the-art ML systems") and why continuing to think about it as if it did causes confusions and incorrect thinking more than it dissolves [LW · GW] questions and creates knowledge.
↑ comment by habryka (habryka4) · 2024-07-13T16:42:09.294Z · LW(p) · GW(p)
Marcello, commenting 16 years ago during Eliezer's Metaethics Sequence [? · GW], pointed out [LW(p) · GW(p)] that there is no particular reason to expect extrapolation to be coherent at all, because of butterfly effects and the manner in which "the mood you were in upon hearing them and so forth could influence which of the arguments you came to trust." (By the way, Marcello's objection, as far as I know, has never been convincingly or even concretely addressed by CEV-proponents. Even the official CEV document and all the writing that came after it mentioned its concerns in a few paragraphs and just... moved on without any specific explanations)
I don't understand the writing in italics here. Eliezer and others responded pretty straightforwardly:
It seems to me that if you build a Friendly AI, you ought to build it to act where coherence exists and not act where it doesn't.
Or orthonormal says more precisely:
We can consider a reference class of CEV-seeking procedures; one (massively-underspecified, but that's not the point) example is "emulate 1000 copies of Paul Christiano living together comfortably and immortally and discussing what the AI should do with the physical universe; once there's a large supermajority in favor of an enactable plan (which can include further such delegated decisions), the AI does that".
I agree that this is going to be chaotic, in the sense that even slightly different elements of this reference class might end up steering the AI to different basins of attraction.
I assert, however, that I'd consider it a pretty good outcome overall if the future of the world were determined by a genuinely random draw from this reference class, honestly instantiated. (Again with the massive underspecification, I know.)
CEV may be underdetermined and many-valued, but that doesn't mean paperclipping is as good an answer as any.
↑ comment by sunwillrise (andrei-alexandru-parfeni) · 2024-07-13T16:45:17.337Z · LW(p) · GW(p)
It seems to me that if you build a Friendly AI, you ought to build it to act where coherence exists and not act where it doesn't.
But the very point, as seen from the other links and arguments, is that we don't have good reason to believe (and in fact should probably disbelieve) the idea that there is any substantial part "where coherence exists". That's the essence of the disagreement, not whether there is some extremely-small-measure (in potential idea-space) set of situations where humans behave incoherently.
↑ comment by habryka (habryka4) · 2024-07-13T17:26:53.853Z · LW(p) · GW(p)
I mean, send me all the money in your bank account right now then. You seem to claim you have no coherent preferences or are incapable of telling which of your preferences are ones you endorse, so seems like you wouldn't mind.
(Or insert any of the other standard reductio arguments here. You clearly care about some stuff. In as much as you do, you have a speck of coherence in you. If you don't, I don't know how to help you in any way and it seems like we don't have any trade opportunities, and like, maybe I should just take your stuff because you don't seem to mind)
↑ comment by sunwillrise (andrei-alexandru-parfeni) · 2024-07-13T17:47:46.077Z · LW(p) · GW(p)
Honestly, this doesn't seem like a good-faith response that does the slightest bit of interpretive labor. It's the type of "gotcha" comment that I really wouldn't have expected from you of all people. I'm not sure it's even worthwhile to continue this conversation.
I have preferences right now; this statement makes sense in the type of low-specificity conversation dominated by intuition [LW(p) · GW(p)] where we talk about such words as though they referred to real concepts that point [LW · GW] to specific areas of reality. Those preferences are probably not coherent, in the sense that I can probably be money-pumped by an intelligent enough agent that sets up a strange-to-my-current-self scenario. But they still exist, and one of them is to maintain a sufficient amount of money in my bank account to continue living a relatively high-quality life. Whether I "endorse" those preferences or not is entirely irrelevant to whether I have them right now; perhaps you could offer a rational argument [LW · GW] to eventually convince me that you would make much better use of all my money, and then I would endorse giving you that money, but I don't care about any of that right now. My current, unreflectively-endorsed self doesn't want to part with what's in my bank account, and that's what's guiding my actions, not an idealized, reified future version.
None of this means anything conclusive about me ultimately endorsing these preferences in the reflective limit, about those preferences being stable under ontology shifts [LW · GW] that reveal how my current ontology is hopelessly confused and reifies the analogues of ghosts [LW · GW], about there being any nonzero intersection between the end states of a process that tries to find my individual volition [LW(p) · GW(p)], or about changes to my physical and neurological make-up keeping my identity the same [LW(p) · GW(p)] (in a decision-relevant sense relative to my values) when my memories and path through history change [LW(p) · GW(p)].
↑ comment by habryka (habryka4) · 2024-07-13T17:56:48.721Z · LW(p) · GW(p)
It's not a gotcha, I just really genuinely don't get how the model you are explaining doesn't just collapse into nothingness.
Like, you currently clearly think that some of your preferences are more stable under reflection. And you have guesses and preferences over the type of reflection that makes your preferences better by your own lights. So seems like you want to apply one to the other. Doing that intellectual labor is the core of CEV.
If you really have no meta level preferences (though I have no idea what that would mean since it's part of everyday life to balance and decide between conflicting desires) then CEV outputs something at least as coherent as you are right now, which is plenty coherent given that you probably acquire resources and have goals. My guess is you can do a bunch better. But I don't see any way for CEV to collapse into nothingness. It seems like it has to output something at least as coherent as you are now.
So when you say "there is no coherence" that just seems blatantly contradicted by you standing before me and having coherent preferences, and not wanting to collapse into a puddle of incoherence.
↑ comment by sunwillrise (andrei-alexandru-parfeni) · 2024-07-13T18:06:24.683Z · LW(p) · GW(p)
I just really genuinely don't get how the model you are explaining doesn't just collapse into nothingness.
I mean, I literally quoted the Wei Dai post in my previous comment which made the point that it could be possible for the process of extrapolated volition to output "nothingness" as its answer. I don't necessarily think that's very likely, but it is a logically possible alternative to CEV.
In any case, suppose you are right about this, about the fact that the model I am explaining collapses into nothingness. So what? "X, if true, leads to bad consequences, so X is false" is precisely the type of appeal-to-consequences reasoning that's not a valid line of logic.
This conversation, from my perspective, has gone like this:
Me: [specific positive, factual claim about CEV not making conceptual sense, and about the fact that there might be no substantial part of human morality that's coherent after applying the processes discussed here]
You: If your model is right, then you just collapse into nothingness.
Me: ... So then I collapse into nothingness, right? The fact that it would be bad to do so according to my current, rough, unreflectively-endorsed "values" obviously doesn't causally affect the truth value of the proposition about what CEV would output, which is a positive rather than normative question.
↑ comment by habryka (habryka4) · 2024-07-13T18:14:07.193Z · LW(p) · GW(p)
I think you misunderstood what I meant by "collapse to nothingness". I wasn't referring to you collapsing into nothingness under CEV. I meant your logical argument outputting a contradiction (where the contradiction would be that you prefer to have no preferences right now).
The thing I am saying is that I am pretty confident you don't have meta preferences that when propagated will cause you to stop wanting things, because like, I think it's just really obvious to both of us that wanting things is good. So in as much as that is a preference, you'll take it into account in a reasonable CEV set up.
We clearly both agree that there are ways to scale you up that are better or worse by your values. CEV is the process of doing our best to choose the better ways. We probably won't find the very best way, but there are clearly ways through reflection space that are better than others and that we endorse more going down.
You might stop earlier than I do, or might end up in a different place, but that doesn't change the validity of the process that much, and clearly doesn't result in you suddenly having no wants or preferences anymore (because why would you want that, and if you are worried about that, you can just make a hard commit at the beginning to never change in ways that cause that).
And yeah, maybe some reflection process will cause us to realize that actually everything is meaningless in a way that I would genuinely endorse. That seems fine but it isn't something I need to weigh from my current vantage point. If it's true, nothing I do matters anyways, but also it honestly seems very unlikely because I just have a lot of things I care about and I don't see any good arguments that would cause me to stop caring about them.
↑ comment by sunwillrise (andrei-alexandru-parfeni) · 2024-07-13T18:23:53.662Z · LW(p) · GW(p)
clearly doesn't result in you suddenly having no wants or preferences anymore
Of course, but I do expect the process of extrapolation to result in wants and preferences that are extremely varied and almost arbitrary, depending on the particular choice of scale-up [LW(p) · GW(p)] and the path [LW(p) · GW(p)] taken through it, and which have little to do with my current (incoherent, not-reflectively-endorsed) values.
Moreover, while I expect extrapolation to output desires and values at any point we hit the "stop" button, I do not expect (and this is another reframing of precisely the argument that I have been making the entire time) those values to be coherent themselves. You can scale me up as much as you want, but when you stop and consider the being that emerges, my perspective is that this being can also be scaled up now to have an almost arbitrary set of new values.
I think your most recent comment fails to disambiguate between "the output of the extrapolation process", which I agree will be nonempty (similarly to how my current set of values is nonempty), and "the coherent output of the extrapolation process", which I think might very well be empty, and in any case will most likely be very small in size (or measure) compared to the first one.
↑ comment by habryka (habryka4) · 2024-07-13T18:33:39.796Z · LW(p) · GW(p)
I think your most recent comment fails to disambiguate between "the output of the extrapolation process", which I agree will be nonempty (similarly to how my current set of values is nonempty), and "the coherent output of the extrapolation process", which I think might very well be empty, and in any case will most likely be very small in size (or measure) compared to the first one.
Hmm, this paragraph maybe points to some linguistic disagreement we have (and my guess is causing confusion in other cases).
I feel like you are treating "coherent" as a binary, when I am treating it more as a sliding scale. Like, I think various embedded agency issues prevent an agent from being fully coherent (for example, a classical bayesian formulation of coherence requires logical omniscience and is computationally impossible), but it's also clearly the case that when I notice a dutch-book against me, and I update in a way that avoids future dutch-bookings, that I have in a meaningful way (but not in a way I could formalize) become more coherent.
So what I am saying is something like "CEV will overall increase the degree of coherence of your values". I totally agree that it will not get you all the way (whatever that means), and I also think we don't have a formalization of coherence that we can talk about fully formally (though I think we have some formal tools that are useful).
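As a toy illustration of the dutch-booking point (a made-up example with arbitrary items and fees, not anything formal): an agent with cyclic strict preferences will pay a small fee for each "upgrade", so a trader who walks it around the cycle extracts money indefinitely while the agent ends up holding exactly what it started with.

```python
# Cyclic strict preferences: the agent "prefers" B over A, C over B, and A over C,
# and will pay a small fee for any trade up its preference ordering.
prefers = {("B", "A"), ("C", "B"), ("A", "C")}   # (offered, held) pairs the agent accepts
FEE = 1.0

def run_money_pump(cycles: int) -> float:
    holding, extracted = "A", 0.0
    for offer in ["B", "C", "A"] * cycles:       # walk the agent around the cycle
        if (offer, holding) in prefers:          # agent strictly prefers the offer
            holding, extracted = offer, extracted + FEE
    return extracted

print(run_money_pump(cycles=10))   # 30.0 extracted; the agent still holds "A"
```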
This gives me some probability we don't disagree that much, but my sense is you are throwing out the baby with the bathwater in your response to Roger, and that that points to a real disagreement.
Like, yes, I think for many overdetermined reasons we will not get something that looks like a utility function out of CEV (because computable utility functions over world-histories aren't even a thing that can meaningfully exist in the real world). But it seems to me like something like "The Great Reflection" would be extremely valuable and should absolutely be the kind of thing we aim for with an AI, since I sure have updated a lot on what I reflectively endorse, and would like to go down further that path by learning more true things and getting smarter in ways that don't break me.
↑ comment by sunwillrise (andrei-alexandru-parfeni) · 2024-07-13T19:29:54.550Z · LW(p) · GW(p)
Hmm, this paragraph maybe points to some linguistic disagreement we have (and my guess is causing confusion in other cases).
I feel like you are treating "coherent" as a binary, when I am treating it more as a sliding scale.
Alright, so, on the one hand, this is definitely helpful for this conversation because it has allowed me to much better understand what you're saying. On the other hand, the sliding scale of coherence is much more confusing to me than the prior operationalizations of this concept were. I understand, at a mathematical level, what Eliezer [LW · GW] (and you [LW(p) · GW(p)]) mean by coherence, when viewed as binary. I don't think the same is true when we have a sliding scale instead. This isn't your fault, mind you; these are probably supposed to be confusing topics given our current, mostly-inadequate, state of understanding of them, but the ultimate effect is still what it is.
I expect we would still have a great deal of leftover disagreement about where on that scale CEV would take a human when we start extrapolating. I'm also somewhat confident that no effective way of resolving that disagreement is available to us currently.
This gives me some probability we don't disagree that much, but my sense is you are throwing out the baby with the bathwater in your response to Roger, and that that points to a real disagreement.
Well, yes, I suppose. The standard response to critiques of CEV, from the moment they started appearing, is some version of Stuart Armstrong's [LW · GW] "we don't need a full solution, just one good enough," and there probably is some disagreement over what is "enough" and over how well CEV is likely to work.
But there is also another side of this, which I mentioned at the very end of my initial comment that sparked this entire discussion, namely my expectation that overly optimistic [1] thinking about the "theoretical soundness and practical viability" of CEV "causes confusions and incorrect thinking more than it dissolves [LW · GW] questions and creates knowledge." I think this is another point of genuine disagreement, which is downstream of not just the object-level questions discussed here but also stuff like the viability of Realism about rationality [LW · GW], overall broad perspectives about rationality and agentic cognition [LW(p) · GW(p)], and other related things. These are broader philosophical issues, and I highly doubt much headway can be made to resolve this dispute through a mere series of comments.
But it seems to me like something like "The Great Reflection" would be extremely valuable
Incidentally, I do agree [LW(p) · GW(p)] with this to some extent:
I do not have answers to the very large set of questions I have asked and referenced in this comment. Far more worryingly, I have no real idea of how to even go about answering them or what framework to use or what paradigm [LW · GW] to think through. Unfortunately, getting all this right seems very important [LW · GW] if we want to get to a great future. Based on my reading of the general pessimism you [Wei Dai] have been signaling throughout your recent posts and comments, it doesn't seem like you have answers to (or even a great path forward on) these questions either despite your great interest in and effort spent on them, which bodes quite terribly for the rest of us.
Perhaps if a group of really smart philosophy-inclined people who have internalized the lessons of the Sequences [? · GW] without being wedded to the very specific set of conclusions MIRI has reached about what AGI cognition must be like [LW(p) · GW(p)] and which seem to be contradicted by [LW · GW] the modularity, lack of agentic activity, moderate effectiveness of RLHF etc (overall just the empirical information) coming from recent SOTA models were to be given a ton of funding and access and 10 years to work on this problem as part of a proto-Long Reflection [LW(p) · GW(p)], something interesting would come out. But that is quite a long stretch at this point.
- ^ From my perspective, of course.
↑ comment by dxu · 2024-07-13T23:03:12.421Z · LW(p) · GW(p)
Can we not speak of apparent coherence relative to a particular standpoint? If a given system seems to be behaving in such a way that you personally can't see a way to construct for it a Dutch book, a series of interactions with it such that energy/negentropy/resources can be extracted from it and accrue to you, that makes the system inexploitable with respect to you, and therefore at least as coherent as you are. The closer to maximal coherence a given system is, the less it will visibly depart from the appearance of coherent behavior, and hence utility function maximization; the fact that various quibbles can be made about various coherence theorems does not seem to me to negate this conclusion.
Humans are more coherent than mice, and there are activities and processes which individual humans occasionally undergo in order to emerge more coherent than they did going in; in some sense this is the way it has to be, in any universe where (1) the initial conditions don't start out giving you fully coherent embodied agents, and (2) physics requires continuity of physical processes, so that fully formed coherent embodied agents can't spring into existence where there previously were none; there must be some pathway from incoherent, inanimate matter from which energy may be freely extracted, to highly organized configurations of matter from which energy may be extracted only with great difficulty, if it can be extracted at all.
If you expect the endpoint of that process to not fully accord with the von Neumann-Morgenstern axioms, because somebody once challenged the completeness axiom, independence axiom, continuity axiom, etc., the question still remains as to whether departures from those axioms will give rise to exploitable holes in the behavior of such systems, from the perspective of much weaker agents such as ourselves. And if the answer is "no", then it seems to me the search for ways to make a weaker, less coherent agent into a stronger, more coherent agent is well-motivated, and necessary—an appeal to consequences in a certain sense, yes, but one that I endorse!
↑ comment by habryka (habryka4) · 2024-07-13T16:37:21.408Z · LW(p) · GW(p)
Maybe I am missing some part of this discussion, but I don't get the last paragraph. It's clear there are a lot of issues with CEV, but I also have no idea what the alternative to something like CEV as a point of comparison is supposed to be. In as much as I am a godshatter of wants, and I want to think about my preferences, I need to somehow come to a conclusion about how to choose between different features, and the basic shape of CEV feels like the obvious (and approximately only) option that I see in front of me.
I agree there is no "canonical" way to scale me up, but that doesn't really change the need for some kind of answer to the question of "what kind of future do I want and how good could it be?".
How does "instruction-following AI" have anything to do with this? Like, OK, now you have an AI that in some sense follows your instructions. What are you going to do with it?
My best guess is you are going to do something CEV like, where you figure out what you want, and you have it help you reflect on your preferences and then somehow empower you to realize more of them. Ideally it would fully internalize that process so it doesn't need to rely on your slow biological brain and weak body, though of course you want to be very careful with that since changes to values under reflection seem very sensitive to small changes in initial conditions.
There also seems to me to be a relatively broad consensus on LW that you should not aim for CEV as the first thing to do with an AGI. It's a thing you will do eventually, and aiming for it early does indeed seem doomed, but like, that's not really what the concept or process is about. It's about setting a target for what you want to eventually allow AI systems to help you with.
The Arbital article is also very clear about this:
CEV is meant to be the literally optimal or ideal or normative thing to do with an autonomous superintelligence, if you trust your ability to perfectly align a superintelligence on a very complicated target. (See below.)
CEV is rather complicated and meta and hence not intended as something you’d do with the first AI you ever tried to build. CEV might be something that everyone inside a project agreed was an acceptable mutual target for their second AI. (The first AI should probably be a Task AGI.)
↑ comment by sunwillrise (andrei-alexandru-parfeni) · 2024-07-13T16:59:23.550Z · LW(p) · GW(p)
It's clear there are a lot of issues with CEV, but I also have no idea what the alternative to something like CEV as a point of comparison is supposed to be.
This reads like an invalid appeal-to-consequences argument. The basic point is that "there are no good alternatives to CEV", even if true, does not provide meaningful evidence one way or another about whether CEV makes sense conceptually and gives correct and useful intuitions [LW(p) · GW(p)] about these issues.
In as much as I am a godshatter of wants, and I want to think about my preferences, I need to somehow come to a conclusion about how to choose between different features
I mean, one possibility (unfortunate and disappointing as it would be if true) is what Wei Dai described [LW · GW] 12 years ago:
By the way, I think nihilism often gets short changed around [? · GW] here [? · GW]. Given that we do not actually have at hand a solution to ontological crises in general or to the specific crisis that we face, what's wrong with saying that the solution set may just be null? Given that evolution doesn't constitute a particularly benevolent and farsighted designer, perhaps we may not be able to do much better than that poor spare-change collecting robot? If Eliezer is worried [? · GW] that actual AIs facing actual ontological crises could do worse than just crash, should we be very sanguine that for humans everything must "add up to moral normality"?
To expand a bit more on this possibility, many people have an aversion against moral arbitrariness, so we need at a minimum a utility translation scheme that's principled enough to pass that filter. But our existing world models are a hodgepodge put together by evolution so there may not be any such sufficiently principled scheme, which (if other approaches to solving moral philosophy also don't pan out) would leave us with legitimate feelings of "existential angst" and nihilism. One could perhaps still argue that any current such feelings are premature, but maybe some people have stronger intuitions than others that these problems are unsolvable?
So it's not like CEV is the only logical possibility in front of us, or the only one we have enough evidence to raise [LW · GW] to the level of relevant hypothesis. As such, I see this as still being of the appeal-to-consequences form. It might very well be the case that CEV, despite all the challenges and skepticism, nonetheless remains the best or most dignified [LW · GW] option to pursue (as a moonshot of sorts), but again, this has no impact on the object-level claims in my earlier comment.
How does "instruction-following AI" have anything to do with this? Like, OK, now you have an AI that in some sense follows your instructions. What are you going to do with it?
I think you're talking at a completely different level of abstraction and focus than me. I made no statements about the normative desirability of instruction-following AI in my comment [LW(p) · GW(p)] on Seth's post. Instead, I simply claimed, as a positive, descriptive, factual matter, that I was confident value-aligned AGI would not come about (and likely could not come about because of what I thought were serious theoretical problems).
There also seems to me to be a relatively broad consensus on LW that you should not aim for CEV as the first thing to do with an AGI.
I don't think any relevant part of my comment is contingent on the timing of when you aim for CEV? Whether it's the first thing you do with an AGI or not.
↑ comment by Seth Herd · 2024-07-13T21:44:54.676Z · LW(p) · GW(p)
I was confused for a moment. You start out by saying there's no alternative to CEV, then end up by saying there's a consensus that CEV isn't a good first alignment target.
Doesn't that mean that whether or how to pursue CEV isn't relevant to whether we live or die? It seems like we should focus on the alignment targets we'll pursue first, and leave CEV and the deeper nature of values and preferences for the Long Reflection - if we can arrange to get one.
I certainly hope you're right that there's a de-facto consensus that CEV/value alignment probably isn't relevant for our first do-or-die shots at alignment. It sure looks that way to me, so I'd like to see more LW brainpower going toward detailed analyses of the alignment schemes on which we're most likely to bet the future.
↑ comment by habryka (habryka4) · 2024-07-13T22:10:26.489Z · LW(p) · GW(p)
I think it's still relevant because it creates a rallying point around what to do after you made substantial progress aligning AGI, which helps coordination in the run up to it, but I agree that most effort should go into other approaches.
↑ comment by Seth Herd · 2024-07-13T20:02:01.647Z · LW(p) · GW(p)
That's quite a collection of relevant work. I'm bookmarking this as the definitive collection on the topic; I haven't seen better, and I assume you would have linked it if it existed.
I think you should just go ahead and make this a post. When you do, we can have a whole discussion in a proper place, because this deserves more discussion.
Prior to you writing that post, here are some thoughts:
I think it's pretty clearly correct that CEV couldn't produce a single best answer, for the reasons you give and cite arguments for. Human values are quite clearly path-dependent. Given different experiences (and choices/stochastic brain activity/complex interactions between initial conditions and experiences), people will wind up valuing fairly different things.
However, this doesn't mean that something like CEV or ambitious value learning couldn't produce a pretty good result. Of all the many worlds that humans as a whole would absolutely love (compared to the nasty, brutish and short lives we now live), you could just pick one at random and I'd call that a dang good outcome.
I think your stronger claim, that the whole idea of values and beliefs is incoherent, should be kept separate. I think values and beliefs are pretty fuzzy and changeable, but real by the important meanings of those words. Whatever its ontological status, I prefer outcomes I prefer to ones I'd hate, and you could call those my values even if it's a very vague and path-dependent collection.
But I don't think that's probably a major component of this argument, so that stronger claim should probably be mostly set aside while considering whether anything like CEV/value learning could work.
Again, I hope you'll make this a post, but I'd be happy to continue the discussion here as well as there.
Replies from: andrei-alexandru-parfeni↑ comment by sunwillrise (andrei-alexandru-parfeni) · 2024-07-13T20:22:36.881Z · LW(p) · GW(p)
I haven't seen better, and I assume you would have linked it if it existed.
Yeah, I'm not aware of any other comprehensive compilation of arguments against CEV. That being said, I am confident that my list above is missing at least a few really interesting and relevant comments that I recall seeing here but just haven't been able to find again.
Again, I hope you'll make this a post
I will try to. This whole discussion, while necessary and useful, is a little bit off-topic to what Oleg Trott meant for this post to be about, and I think deserves a post of its own.
Replies from: roger-d-1↑ comment by RogerDearnaley (roger-d-1) · 2024-07-13T22:34:12.252Z · LW(p) · GW(p)
FWIW, my personal guess is that the kind of extrapolation process described by CEV is fairly stable (in the sense of producing a pretty consistent extrapolation direction) as you start to increase the cognitive resources applied (something twice as smart as a human, thinking for ten times as long, with access to ten times as much information, say), but may well still not have a single well-defined limit as the cognitive resources used for the extrapolation tend to infinity. Using a (loose, not exact) analogy to a high-dimensional SGD or simulated-annealing optimization problem, the situation may be a basin/valley that looks approximately convex at a coarse scale (when examined with low resources), but actually has many distinct local minima that optimization with increasing resources could converge to.
So the correct solution may be some form of satisficing: use CEV with a moderately superhuman amount of computational resources applied to it, in a region where it still gives a sensible result. So I view CEV as more a signpost saying "head that way" than a formal description of a mathematical limiting process that clearly has a single well-defined limit.
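A minimal numerical sketch of this analogy (purely illustrative; the toy objective, and the use of local averaging as a stand-in for "low resources", are my own choices):

```python
import math

def objective(x):
    # Coarse structure: one broad valley, approximately convex at low resolution.
    coarse = x ** 2
    # Fine structure: small ripples that create many distinct local minima.
    fine = 0.1 * math.cos(40 * x)
    return coarse + fine

def smoothed(f, x, width=math.pi / 4, n=50):
    # "Low-resource" view: average f over a window spanning whole ripple periods,
    # which washes out the fine structure and leaves only the coarse valley.
    return sum(f(x - width / 2 + width * i / n) for i in range(n)) / n

def num_grad(f, x, h=1e-4):
    return (f(x + h) - f(x - h)) / (2 * h)

def descend(f, x0, lr=0.005, steps=4000):
    # Plain gradient descent with a numerical gradient.
    x = x0
    for _ in range(steps):
        x -= lr * num_grad(f, x)
    return round(x, 2)

if __name__ == "__main__":
    starts = (-2.0, -1.0, 1.0, 2.0)
    # Low resources: every starting point lands near the bottom of the coarse valley.
    print([descend(lambda x: smoothed(objective, x), s) for s in starts])
    # High resources: the ripples are resolved, and different starting points
    # settle into different nearby local minima.
    print([descend(objective, s) for s in starts])
```

Run as-is, the first print should show all four starting points agreeing on roughly the same answer, while the second should show them settling into different local minima, which is the sense in which "more resources" need not converge to a single well-defined limit.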
As for human values being godshatter of evolution, that's a big help: where they are manifestly becoming inconsistent with each other or with reality, you can use maximizing actual evolutionary fitness (which is a clear, well-defined concept) as a tie-breaker or sanity check. [Obviously, we don't want to take that to the point where the human population is growing fast (unless we're doing it by spreading through space, in which case, go for it).]
↑ comment by cubefox · 2024-07-13T17:00:56.900Z · LW(p) · GW(p)
By chance, did you, in the meantime, have any more thoughts on our debate [LW(p) · GW(p)] on moral (anti-)realism, on the definability of terms like "good"?
Replies from: andrei-alexandru-parfeni↑ comment by sunwillrise (andrei-alexandru-parfeni) · 2024-07-13T17:06:47.951Z · LW(p) · GW(p)
I think I still endorse essentially all of what I said in that thread. Is there anything in particular you wanted me to talk about?
Replies from: cubefox↑ comment by cubefox · 2024-07-13T17:20:46.227Z · LW(p) · GW(p)
Your central claim seemed to be that words like "good" have no associated anticipated experience, with which I disagreed in the comment linked above. You haven't yet replied to that.
Replies from: andrei-alexandru-parfeni↑ comment by sunwillrise (andrei-alexandru-parfeni) · 2024-07-13T18:39:35.682Z · LW(p) · GW(p)
Well, the claim was the following:
That might well be evidence (in the Bayesian sense) that a given act, value, or person belongs to a certain category which we slap the label "good" onto. But it has little to do with my initial question. We have no reason to care about the property of "goodness" at all if we do not believe that knowing something is "good" gives us powerful evidence that allows us to anticipate experiences [LW · GW] and to constrain the territory around us [LW · GW]. Otherwise, "goodness" is just an arbitrary bag of things that is no more useful than the category of "bleggs" [LW · GW] that is generated for no coherent reason whatsoever, or the random category "r398t"s that I just made up and contains only apples, weasels, and Ron Weasley. Indeed, we would not even have enough reason to raise the question of what "goodness" is in the first place [LW · GW].
Yes, knowing that something is (in the moral-cognitivist, moral-realist, observer-independent sense) "good" allows you to anticipate that it... fulfills the preconditions of being "good" (one of which is "increased welfare", in this particular conception of it). At a conceptual level, that doesn't provide you relevant anticipated experiences that go beyond the category of "good and everything it contains"; it doesn't constrain the territory beyond statements that ultimately refer back to goodness itself. It holds the power of anticipated experience only in so much as it is self-referential in the end, which doesn't provide meaningful evidence that it's a concept which carves reality at the joints [LW · GW].
It's helpful to recall how the entire discussion began. You said, in response to Steven Byrnes's post [LW · GW]:
This is tangential to the point of the post, but "moral realism" is a much weaker claim than you seem to think. Moral realism only means that some moral claims are literally true. Popular uncontroversial examples: "torturing babies for fun is wrong" or "ceteris paribus, suffering is bad". It doesn't mean that someone is necessarily motivated by those claims if they believe they are true. It doesn't imply that anyone is motivated to be good just from believing that something is good.
When Seth Herd questioned [LW(p) · GW(p)] what you meant by good and "moral claims", you said [LW(p) · GW(p)] that you "don't think anyone needs to define what words used in ordinary language mean."
Now, in standard LW-thought, the meaning of "X is true", as explained by Eliezer a long time ago [LW · GW], is that it represents the correspondence [? · GW] between reality (the territory) and an observer's beliefs about reality (the map). Beliefs which are thought to be true pay rent in anticipated experiences [LW · GW] about the world. Taking the example of a supposed "moral fact" X, the labeling of it as "fact" (because it fulfills some conditions of membership in this category) implies it must pay rent.
But if the only way it does that is because it then allows you to claim that "X fulfills the conditions of membership", then this is not a useful category. It is precisely an arbitrary subset, analogous to the examples I gave in the comment I quoted above. If moral realism is viewed through the lens mentioned by [LW · GW] Roko, which does imply specific factual anticipated experiences about the world (which go beyond the definition of "moral realism" itself), namely that "All (or perhaps just almost all) beings, human, alien or AI, when given sufficient computing power and the ability to learn science and get an accurate map-territory morphism, will agree on what physical state the universe ought to be transformed into, and therefore they will assist you in transforming it into this state," then it's no longer arbitrary.
But you specifically disavowed this interpretation, even going so far as to say [LW(p) · GW(p)] that "I can believe that I shouldn't eat meat, or that eating meat is bad, without being motivated to stop eating meat." So your version of "moral realism" is just choosing a specific set of things you define to be "moral", without requiring anyone who agrees that this is moral to act in accordance with it (which would indeed be an anticipated experience about the outside world) and without any further explanation of why this choice pays any rent in experiences about the world that's not self-referential. This is a narrow and shallow definition of realism, and by itself it doesn't explain why these ideas were even brought up in the first place [LW · GW].
I really don't know if what I've written here is going to be helpful for this conversation. Look, if someone tells me that "X is a very massive star," which they define as "a star that's very massive," then what I mean by anticipated experiences [1] is not "X is very massive" or "X is a star", because these are already strictly included in (and logically implied by, at a tautological level) the belief about X, but rather stuff like "if there is any planet Y in the close vicinity of X, I expect to see Y rotating around a point inside or just slightly outside X." The latter contains a reason to care [LW · GW] about whether "X is very massive."
- ^
In this specific context.
↑ comment by cubefox · 2024-07-14T19:13:44.075Z · LW(p) · GW(p)
Yes, knowing that something is (in the moral-cognitivist, moral-realist, observer-independent sense) "good" allows you to anticipate that it... fulfills the preconditions of being "good" (one of which is "increased welfare", in this particular conception of it). At a conceptual level, that doesn't provide you relevant anticipated experiences that go beyond the category of "good and everything it contains"; it doesn't constrain the territory beyond statements that ultimately refer back to goodness itself. It holds the power of anticipated experience only in so much as it is self-referential in the end, which doesn't provide meaningful evidence that it's a concept which carves reality at the joints.
I disagree with that. When we expect something to be good, we have some particular set of anticipated experiences (e.g. about increased welfare, extrapolated desires) that are consistent with our expectation, and some other set that is inconsistent with it. We do not merely "expect" a tautology, like "expecting" that good things are good (or that chairs are chairs etc). We can see this by the fact that we may very well see evidence that is inconsistent with our expectation, e.g. evidence that something instead leads to suffering and thus doesn't increase welfare, and hence isn't good. Believing something to be good therefore pays rent in anticipated experiences [LW · GW].
Moreover, we can wonder (ask ourselves) whether some particular thing is good or not (like e.g., recycling plastic), and this is not like "wondering" whether chairs are chairs. We are asking a genuine question, not a tautological one.
When Seth Herd questioned what you meant by good and "moral claims", you said that you "don't think anyone needs to define what words used in ordinary language mean."
To be clear, what I said was this: "I don't think anyone needs to define what words used in ordinary language mean because the validity of any attempt of such a definition would itself have to be checked against the intuitive meaning of the word in common usage."
But if the only way it does that is because it then allows you to claim that "X fulfills the conditions of membership", then this is not a useful category.
I think I have identified the confusion here. Assume you don't know what "bachelor" means, and you ask me which evidence I associate with that term. And I reply: If I believe something is a bachelor, I anticipate evidence that confirms that it is an unmarried man. Now you could reply that this is simply saying "'bachelor' fulfills the conditions of membership". But no, I have given you a non-trivial definition of the term, and if you already knew what "unmarried" and "man" meant (what evidence to expect if those terms apply), you now also know what to anticipate for "bachelor" -- what the term "bachelor" means. Giving a definition for X is not the same as merely saying "X fulfills the conditions of membership".
If moral realism is viewed through the lens mentioned by Roko, which does imply specific factual anticipated experiences about the world (which go beyond the definition of "moral realism" itself), namely that "All (or perhaps just almost all) beings, human, alien or AI, when given sufficient computing power and the ability to learn science and get an accurate map-territory morphism, will agree on what physical state the universe ought to be transformed into, and therefore they will assist you in transforming it into this state," then it's no longer arbitrary.
Roko relies here on the assumption that moral beliefs are inherently motivating ("moral internalism", as discussed by EY here), which is not a requirement for moral realism.
But you specifically disavowed this interpretation, even going so far as to say that "I can believe that I shouldn't eat meat, or that eating meat is bad, without being motivated to stop eating meat." So your version of "moral realism"
It is not just my interpretation, that is how the term "moral realism" is commonly defined in philosophy, e.g. in the SEP.
is just choosing a specific set of things you define to be "moral"
Well, I specifically don't need to propose any definition. What matters for any proposal for a definition (such as EY's "good ≈ maximizes extrapolated volition") is that it captures the natural language meaning of the term.
without requiring anyone who agrees that this is moral to act in accordance with it (which would indeed be an anticipated experience about the outside world)
I say that's confused. If I believe, for example, that raising taxes is bad, then I do have anticipated experiences associated with this belief. I may expect that raising taxes is followed by a weaker economy, more unemployment, less overall wealth, in short: decreased welfare. This expectation does not at all require that anyone agrees with me, nor that anyone is motivated to not raise taxes.
I really don't know if what I've written here is going to be helpful for this conversation.
The central question here is whether (something like) EY's ethical theory is sound. If it is, CEV could make sense as an alignment target, even if it is not clear how we get there.
Replies from: andrei-alexandru-parfeni↑ comment by sunwillrise (andrei-alexandru-parfeni) · 2024-07-14T19:30:44.389Z · LW(p) · GW(p)
I will try, one more[1] time, and I will keep this brief.
I think I have identified the confusion here. Assume you don't know what "bachelor" means, and you ask me which evidence I associate with that term. And I reply: If I believe something is a bachelor, I anticipate evidence that confirms that it is an unmarried man. Now you could reply that this is simply saying "'bachelor' fulfills the conditions of membership". But no, I have given you a non-trivial definition of the term, and if you already knew what "unmarried" and "man" meant (what evidence to expect if those terms apply), you now also know what to anticipate for "bachelor" -- what the term "bachelor" means. Giving a definition for X is not the same as merely saying "X fulfills the conditions of membership".
But why do you care about the concept of a bachelor? What makes you pick it out of the space of ideas and concepts [LW · GW] as worthy of discussion and consideration [LW · GW]? In my conception, it is the fact that you believe it carves reality at the joints [LW · GW] by allowing you to have relevant and useful anticipated experiences [LW · GW] about the world outside of what is contained inside the very definition or meaning of the word. If we did not know, due to personal experience, that it was useful [LW · GW] to know whether someone was a bachelor[2], we would not talk about it; it would be just as arbitrary and useless a subset of idea-space as "the category of "bleggs" that is generated for no coherent reason whatsoever, or the random category "r398t"s that I just made up and contains only apples, weasels, and Ron Weasley." [LW(p) · GW(p)]
It is not just my interpretation, that is how the term "moral realism" is commonly defined in philosophy, e.g. in the SEP.
The SEP entry for "moral realism" is, unfortunately, not sufficient to resolve issues regarding what it means or how useful a concept it is. I would point you to the very introduction of the SEP entry on moral anti-realism:
It might be expected that it would suffice for the entry for “moral anti-realism” to contain only some links to other entries in this encyclopedia. It could contain a link to “moral realism” and stipulate the negation of the view described there. Alternatively, it could have links to the entries “anti-realism” and “morality” and could stipulate the conjunction of the materials contained therein. The fact that neither of these approaches would be adequate—and, more strikingly, that following the two procedures would yield substantively non-equivalent results—reveals the contentious and unsettled nature of the topic.
“Anti-realism,” “non-realism,” and “irrealism” may for most purposes be treated as synonymous. Occasionally, distinctions have been suggested for local pedagogic reasons (see, e.g., Wright 1988; Dreier 2004), but no such distinction has generally taken hold. (“Quasi-realism” denotes something very different, to be described below.) All three terms are to be defined in opposition to realism, but since there is no consensus on how “realism” is to be understood, “anti-realism” fares no better. Crispin Wright (1992: 1) comments that “if there ever was a consensus of understanding about ‘realism’, as a philosophical term of art, it has undoubtedly been fragmented by the pressures exerted by the various debates—so much so that a philosopher who asserts that she is a realist about theoretical science, for example, or ethics, has probably, for most philosophical audiences, accomplished little more than to clear her throat.”
- ^
and possibly final
- ^
because of reasons that go beyond knowing how to answer the question "is he a bachelor?" or "does he have the properties tautologically contained within the status of bachelors?"
↑ comment by cubefox · 2024-07-14T21:06:19.323Z · LW(p) · GW(p)
But why do you care about the concept of a bachelor? What makes you pick it out of the space of ideas and concepts as worthy of discussion and consideration?
Well, "bachelor" was just an example of a word for which you don't know the meaning, but want to know the meaning. The important thing here is that it has a meaning, not how useful the concept is.
But I think you actually want to talk about the meaning of terms like "good". Apparently you now concede that they are meaningful (are associated with anticipated experiences) and instead claim that the concept of "good" is useless. That is surprising. There is arguably nothing more important than ethics; than the world being in a good state or trajectory. So it is obvious that the term "good" is useful. Especially because it is exactly what an aligned superintelligence should be targeted at. After all, it's not an accident that EY came up with extrapolated volition as an ethical theory for solving the problem of what a superintelligence should be aligned to. An ASI shouldn't do bad things and should do good things, and the problem is making the ASI care for being good rather than for something else, like making paperclips.
Regarding the SEP quote: It doesn't argue that moral internalism is part of moral realism, which was what you originally were objecting to. But we need not even use the term "moral realism", we only need the claim that statements on what is good or bad have non-trivial truth values, i.e. aren't purely subjective, or mere expressions of applause, or meaningless, or the like. This is a semantic question about what terms like "good" mean.
Replies from: abandon↑ comment by dirk (abandon) · 2024-07-19T01:12:50.030Z · LW(p) · GW(p)
For moral realism to be true in the sense which most people mean when they talk about it, "good" would have to have an observer-independent meaning. That is, it would have to not only be the case that you personally feel that it means some particular thing, but also that people who feel it to mean some other thing are objectively mistaken, for reasons that exist outside of your personal judgement of what is or isn't good.
(Also, throughout this discussion and the previous one you've misunderstood what it means for beliefs to pay rent in anticipated experiences. For a belief to pay rent, it should not only predict some set of sensory experiences but predict a different set of sensory experiences than would a model not including it. Let me bring in the opening paragraphs of the post [LW · GW]:
Thus begins the ancient parable:
If a tree falls in a forest and no one hears it, does it make a sound? One says, “Yes it does, for it makes vibrations in the air.” Another says, “No it does not, for there is no auditory processing in any brain.”
If there’s a foundational skill in the martial art of rationality, a mental stance on which all other technique rests, it might be this one: the ability to spot, inside your own head, psychological signs that you have a mental map of something, and signs that you don’t.
Suppose that, after a tree falls, the two arguers walk into the forest together. Will one expect to see the tree fallen to the right, and the other expect to see the tree fallen to the left? Suppose that before the tree falls, the two leave a sound recorder next to the tree. Would one, playing back the recorder, expect to hear something different from the other? Suppose they attach an electroencephalograph to any brain in the world; would one expect to see a different trace than the other?
Though the two argue, one saying “No,” and the other saying “Yes,” they do not anticipate any different experiences. The two think they have different models of the world, but they have no difference with respect to what they expect will happen to them; their maps of the world do not diverge in any sensory detail.
If you call increasing-welfare "good" and I call honoring-ancestors "good", our models do not make different predictions about what will happen, only about which things should be assigned the label "good". That is what it means for a belief to not pay rent.)
Replies from: cubefox↑ comment by cubefox · 2024-07-19T02:34:06.597Z · LW(p) · GW(p)
For moral realism to be true in the sense which most people mean when they talk about it, "good" would have to have an observer-independent meaning. That is, it would have to not only be the case that you personally feel that it means some particular thing, but also that people who feel it to mean some other thing are objectively mistaken, for reasons that exist outside of your personal judgement of what is or isn't good.
That would only be a case of ambiguity (one word used with two different meanings). If by "good" you mean the same thing that people usually mean by "chair", this doesn't imply anti-realism, just likely misunderstandings.
Assume you are a realist about rocks, but call them trees. That wouldn't be a contradiction. Realism has nothing to do with "observer-independent meaning".
For a belief to pay rent, it should not only predict some set of sensory experiences but predict a different set of sensory experiences than would a model not including it.
This doesn't make sense. A model doesn't have beliefs, and if there is no belief, there is nothing it (the belief) predicts. Instead, for a belief to "pay rent" it is necessary and sufficient that it makes different predictions than believing its negation.
If you call increasing-welfare "good" and I call honoring-ancestors "good", our models do not make different predictions about what will happen, only about which things should be assigned the label "good". That is what it means for a belief to not pay rent.
Compare:
If you call a boulder a "tree" and I call a plant with a woody trunk a "tree", our models do not make different predictions about what will happen, only about which things should be assigned the label "tree". That is what it means for a belief to not pay rent.
Of course our beliefs pay rent here; they just pay different rent. If we both express our beliefs with "There is a tree behind the house", then we simply have two different beliefs, because we expect different experiences. That has nothing to do with anti-realism about trees.
↑ comment by dxu · 2024-07-13T01:18:37.210Z · LW(p) · GW(p)
I seem to recall hearing a phrase I liked, which appears to concisely summarize the concern as: "There's no canonical way to scale me up."
Does that sound right to you?
Replies from: andrei-alexandru-parfeni↑ comment by sunwillrise (andrei-alexandru-parfeni) · 2024-07-13T01:23:53.030Z · LW(p) · GW(p)
I mentioned it above :)
- Richard Ngo, as part of his comprehensive overview of Realism about Rationality [LW · GW] and his skepticism of it, pointed out [LW(p) · GW(p)] the very important fact that "there is no canonical way to scale [a human] up", which means that you need to make arbitrary choices (with no real guiding hand available to you) of which modifications to make to a human if you want to make him/her more intelligent, knowledgeable, coherent etc.
↑ comment by Tapatakt · 2024-07-13T12:06:25.621Z · LW(p) · GW(p)
Wow! Thank you for the list!
I noticed you write a lot of quite high-effort comments with a lot of links to other discussions of a topic. Do you "just" devote a lot of time and effort to this, or do you, for example, make some creative use of LLMs?
Replies from: andrei-alexandru-parfeni↑ comment by sunwillrise (andrei-alexandru-parfeni) · 2024-07-13T12:15:28.751Z · LW(p) · GW(p)
I write everything myself from scratch. I don't really use LLMs that much beyond coding assistance, where asking GPT-4o for help is often much faster than reading the documentation of a function or module I'm unfamiliar with.
The comment above actually only took ~ 1 hour to write. Of course, that's mostly because all of the ideas behind it (including the links and what I thought about each one) had been ruminating in my head for long enough [1] that I already knew everything I wanted to write; now I just had to write it (and type in all the links).
- ^
Mostly in the back of my mind as I was doing other things.
↑ comment by Oleg Trott (oleg-trott) · 2024-07-13T02:13:33.432Z · LW(p) · GW(p)
Quoting from the CEV link:
The main problems with CEV include, firstly, the great difficulty of implementing such a program - “If one attempted to write an ordinary computer program using ordinary computer programming skills, the task would be a thousand lightyears beyond hopeless.” Secondly, the possibility that human values may not converge. Yudkowsky considered CEV obsolete almost immediately after its publication in 2004.
Neither problem seems relevant to what I'm proposing. My implementation is just a prompt. And there is no explicit optimization (after the LM has been trained).
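For concreteness, a minimal sketch of what "the implementation is just a prompt" could look like; `complete` below is a hypothetical stand-in for whatever base-LM completion interface is available, and nothing here is meant as the actual proposed system:

```python
def complete(prompt: str) -> str:
    """Hypothetical base-LM completion call -- substitute any sufficiently good model."""
    raise NotImplementedError

# The entire "alignment target" is this natural-language goal; there are no
# rewards, no fine-tuning, and no explicit optimization after pre-training.
GOAL = (
    "Do what (pre-ASI) X, having considered this carefully for a while, "
    "would have wanted you to do."
)

def act(observation: str) -> str:
    # Goal-driven behaviour via prompt engineering alone: condition the LM on
    # the goal plus the current observation and return its proposed action.
    prompt = f"Goal: {GOAL}\nObservation: {observation}\nAction:"
    return complete(prompt)
```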
Has anyone proposed exactly what I'm proposing? (slightly different wording is OK, of course)
Replies from: Seth Herd↑ comment by Seth Herd · 2024-07-13T11:29:20.765Z · LW(p) · GW(p)
I don't think anyone has proposed this. I think the most similar proposal is my instruction-following [LW · GW] AGI (particularly since I'm also mostly thinking of just such a text prompt in a language model agent as the implementation).
My proposal, with its emphasis on checking, is aimed more at the range where the AGI is human-level and above, whereas yours seems aimed more at the truly superintelligent range. Mine keeps the human in charge of figuring out what they would've wanted, in case the AGI gets that wrong.
Other related work is linked in that post.
The above objections to CEV partly apply to your proposal. There is probably not just one thing X would've wanted with more consideration, since conclusions may depend on circumstances.
I'm not sure that breaks the proposal; it could be that any of the several things X might've wanted would serve adequately.
comment by JuliaHP · 2024-07-12T18:38:51.890Z · LW(p) · GW(p)
The step from "tell AI to do Y" to "AI does Y" is a big part of the entire alignment problem. The reason chatbots might seem aligned in this sense is that the thing you ask for often lives in a continuous space, and when not-too-strong optimization pressure is applied, asking for Y and getting Y+epsilon is good enough. This ceases to be the case when your Y is complicated and high optimization pressure is applied, UNLESS you can find a Y which has a strong continuity property in the sense you care about, and I am not aware of anyone who knows how to do that.
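One way to make that continuity property precise (this formalization is mine, and just a sketch): write $\mathrm{opt}_n(Y)$ for the outcome produced by applying $n$ units of optimization pressure to the stated target $Y$, and $U$ for how good outcomes are by the intended standard. The property would then be, roughly:

$$\forall \varepsilon > 0\ \exists \delta > 0\ \forall n:\quad d(Y', Y) < \delta \;\Longrightarrow\; U\big(\mathrm{opt}_n(Y')\big) \ge U\big(\mathrm{opt}_n(Y)\big) - \varepsilon$$

i.e., a small error in the specified target costs only a small amount of intended value, uniformly in how much optimization pressure is applied. The worry above is that, without that uniformity in $n$, a specification error that is harmless under weak optimization can become arbitrarily costly under strong optimization.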
Not to mention that "Do what (pre-ASI) X, having considered this carefully for a while, would have wanted you to do" does not narrow down behaviour to a small enough space. There will be many interpretations that look reasonable to you, many of which will allow for satisfaction, while still allowing the AI to kill everyone.
↑ comment by Seth Herd · 2024-07-13T11:38:27.876Z · LW(p) · GW(p)
It seems like all of the many correct answers to what X would've wanted might not include the AGI killing everyone.
Wrt the continuity property, I think Max Harms's corrigibility proposal has that, without suffering as obviously from the multiple interpretations you mention. Ambitious value learning is intended to have it as well, but has more of that problem. Roger Dearnaley's alignment as a basin of attraction addresses that stability property more directly. Sorry I don't have links handy.
Replies from: JuliaHP↑ comment by JuliaHP · 2024-07-13T12:28:31.894Z · LW(p) · GW(p)
>It seems like all of the many correct answers to what X would've wanted might not include the AGI killing everyone.
Yes, but if it wants to kill everyone it would pick one which does. The space "all possible actions" also contains some friendly actions.
>Wrt the continuity property, I think Max Harm's corrigibility proposal has that
I think it understands this and is aiming to have that, yeah. It looks like a lot of work needs to be done to flesh it out.
I don't have a good enough understanding of ambitious value learning & Roger Dearnaley's proposal to properly comment on these. Skimming + priors put fairly low odds on them dealing with this in the proper manner, but I could be wrong.
↑ comment by Seth Herd · 2024-07-13T21:10:48.619Z · LW(p) · GW(p)
I don't think Dearnaley's proposal is detailed enough to establish whether or not it would really, in practice, have a "basin of attraction". I take it to be roughly the same idea as ambitious value learning and CEV. All of them might be said to have a basin of attraction (and therefore your continuity property) for this reason: if they initially misunderstand what humans want (a form of your delta), they should work to understand it better and make sure they understand it, as a byproduct of having their goal be not a certain set of outcomes, but a variable standing for the outcomes humans prefer, where the exact value of that variable can remain unknown and be refined as one possible sub-goal.
Another related thing that springs to mind: all goals may have your continuity property with a slightly different form of delta. If an AGI has one main goal, and a few other less important goals/values, those might (in some decision-making processes) be eliminated in favor of the more important goal (if continuing to have those minor goals would hurt its ability to achieve the more important goal).
The other important piece to note about the continuity property is that we don't know how large a delta would be ruinous. It's been said that "value is fragile", but the post But exactly how complex and fragile? [LW · GW] got almost zero meaningful discussion. Nobody knows until we get around to working that out. It could be that a small delta in some AGI architectures would just result in a world with slightly more things like dance parties and slightly fewer things like knitting circles, disappointing to knitters but not at all catastrophic. I consider that another important unresolved issue.
Back to your initial point: I agree that other preferences could interact disastrously with the indeterminacy of something like CEV. But it's hard for me to imagine an AGI whose goal is to do what humanity wants but that also has a preference for wiping out humanity. It's not impossible, though. I guess with the complexity of pseudo-goals in a system like an LLM, it's probably something we should be careful of.
↑ comment by Oleg Trott (oleg-trott) · 2024-07-13T19:56:21.736Z · LW(p) · GW(p)
many of which will allow for satisfaction, while still allowing the AI to kill everyone.
This post is just about alignment of AGI's behavior with its creator's intentions, which is what Yoshua Bengio was talking about.
If you wanted to constrain it further, you'd say that in the prompt. But I feel that rigid constraints are probably unhelpful, the way The Three Laws of Robotics are. For example, anyone could threaten suicide and force the AGI to do absolutely anything short of killing other people.
comment by quila · 2024-07-12T21:23:25.897Z · LW(p) · GW(p)
I'm not familiar with the ASI alignment literature, but presumably he is. I googled "would have wanted" + "alignment" on this site, and this didn't seem to turn up much. If this has already been proposed, please let me know in the comments.
related: https://www.lesswrong.com/tag/outer-alignment [? · GW], https://www.lesswrong.com/tag/coherent-extrapolated-volition [? · GW], https://carado.moe/surprise-you-want.html