A problem shared by many different alignment targets
post by ThomasCederborg · 2025-01-15T14:22:12.754Z · LW · GW · 2 comments
The first section describes problems with a few different alignment targets. The second section argues that it is useful to view all of them as variations of a single alignment target: building an AI that does what a Group wants that AI to do. The post then goes on to argue that all of the individual problems described in the first section can be viewed as special cases of an underlying problem. A problem that is inherent in the core concept of a Group AI.
The implication of the first section is that the specific alignment targets analysed there should not be implemented. The rest of the post is more ambitious. The following sections argue that there exists an underlying problem that many alignment targets suffer from. The implication of this argument is that no alignment target that suffers from this underlying problem should be implemented. This more ambitious argument is related to the argument about Membrane formalisms that was outlined here [LW · GW].
A previous post [LW · GW] argued that the type of Alignment Target Analysis (ATA) done in the present post needs to happen now. In brief: an alignment target with a hidden problem getting successfully implemented is a serious risk. There might not be a lot of time to do ATA later (for example because shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure [LW · GW]). And risk reduction is feasible, because one can reduce risks without finding any good alignment target.
ATA can be roughly described as asking what would happen if a given Sovereign AI proposal were to be successfully implemented. This means that there are very serious AI risks that no level of ATA progress can reduce (including most extinction risks related to misalignment). But there are other serious AI risks that can be reduced by ATA progress. ATA is complementary with several other risk mitigation strategies. The risk reductions that can be achieved by ATA are not easily replaced by other strategies. See this post [LW · GW] and this comment [LW(p) · GW(p)] for a more detailed description of ATA research.
An analysis of a few alignment targets, and the problems that they suffer from
Besides the discussion of specific alignment targets, this section also tries to draw attention to the fact that intuitions built up by analysing concepts in other contexts, might not work in the AI context.
Condorcet AIs and solid majorities
Let's start by exploring a concept from ordinary politics: the Condorcet Criterion from voting theory. If there exists an option that beats every other option in a one-on-one vote, then that option will be chosen by any voting rule that conforms to the Condorcet Criterion. In Scott Garrabrant's related Lottery Condorcet Criterion [LW · GW], options are lotteries over outcomes instead of deterministic outcomes. Let's write Condorcet AI (CAI) for any powerful AI that (i): decides what to do based on what billions of extrapolated humans want it to do, and that (ii): picks outcomes in a way that satisfies the Condorcet Criterion or the Lottery Condorcet Criterion.
Now consider the case with a 51% solid majority (a majority that agrees about everything). They think that, all else equal, it would be sort of nice if the CAI were to hurt the other 49% as much as possible. In this scenario any CAI would hurt the 49% as much as possible. This follows directly from the definition of the Condorcet Criterion, and Garrabrant's definition of the Lottery Condorcet Criterion [LW · GW] (where the chosen option would be a “lottery” that always results in the outcome preferred by the majority). The majority does not have to feel strongly about this according to any reasonable definition of “feels strongly”. (The fact that all Condorcet AIs share this feature was first mentioned here [LW · GW]).
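To make this concrete, here is a minimal sketch in Python (my own toy illustration, not code from any referenced proposal). The `condorcet_winner` helper, the option names, and the 51/49 ballot counts are all made-up illustrative assumptions; the point is simply that ranked ballots carry no intensity information, so a solid majority's mild preference defeats the minority's desperate one.

```python
def condorcet_winner(options, ballots):
    """Return the option that beats every other option in a pairwise majority
    vote, or None if no such option exists.
    Each ballot is a dict mapping option -> rank (lower rank = more preferred)."""
    for a in options:
        if all(
            sum(b[a] < b[c] for b in ballots) > sum(b[c] < b[a] for b in ballots)
            for c in options
            if c != a
        ):
            return a
    return None


options = ["hurt_the_49_percent", "leave_them_alone"]

# 51% mildly prefer hurting the other 49%; the 49% desperately want to avoid it.
# Ranked ballots carry no information about intensity, so any rule that satisfies
# the Condorcet Criterion must pick the majority's option.
majority = [{"hurt_the_49_percent": 0, "leave_them_alone": 1}] * 51
minority = [{"hurt_the_49_percent": 1, "leave_them_alone": 0}] * 49

print(condorcet_winner(options, majority + minority))  # -> hurt_the_49_percent
```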
This issue is not a problem in the context that the Condorcet Criterion concept was developed for: humans electing other humans that go on to hold constitutionally defined positions in a polity with both formal and informal checks and balances. In ordinary politics there does not exist any entity that can suddenly start oppressing a 49% minority without cost or risk. If a narrow majority barely care about the persecution, they will not vote for a policy that would lead to civil war. (In addition to this difference: elements such as extrapolation and clever AIs also impact issues like coordination. For example by making it more important to take things like solid majorities into account. Especially if it turns out that many individuals intrinsically value having the same values as the majority).
It is possible to bring the AI scenario closer to the ordinary politics context by removing the certainty of victory for the majority (for example by having extrapolated delegates act and negotiate, based on the assumption that votes are settled stochastically). We will return to this idea below.
For now however, this example can illustrate a more general phenomenon related to transferring a concept to the AI context: that intuitions built up by examining how a concept behaves in one context, might fail when the concept is transferred to the AI context. There is nothing unusual about this phenomenon. It's just another example of intuitions about a concept breaking, when that concept is used in novel contexts. But it is a phenomenon that is useful to keep in mind during the rest of this post.
Maximising the sum of individual utility functions
This subsection will analyse a type of AI that maximises the sum of normalised utility functions, where each function is designed to represent the wishes of one human in a group of billions of humans. A thought experiment illustrates a problem with an AI that uses the Variance, MADAM, or MD normalisations discussed in this paper. The paper discusses both the interpersonal and the intertheoretic case (focusing on the intertheoretic case). But the problem described below is only present when the interpersonal version of the Variance, MD, or MADAM proposals is used in an AI context. The fact that the problem only shows up in the AI context is interesting, and will be discussed further below (it illustrates another instance of intuitions breaking when concepts are transferred to the AI context).
Let's focus on the Variance normalisation method and write VarAI for this AI (the three normalisation methods mentioned above all suffer from the same problem, and the thought experiment outlined below will also work for the other two options). Now let's do a short walkthrough of the basic Alignment Target Analysis (ATA) process. Let's start by optimistically assuming that for each person, a utility function is found that perfectly represents what this person wants (making optimistic assumptions allows us to focus on problems that remain, despite these assumptions).
VarAI uses a flawless method to obtain a utility function for each person, that perfectly describes everything that matters to her. VarAI then rescales these functions so that they all have the same variance. And then VarAI maximises the sum. One obvious initial question to ask about any proposed alignment target is: what type of person would gain a disproportionate amount of influence over this AI? (In the CAI case discussed above the answer was: any member of a narrow but solid majority).
When summing utility functions, a person with a very spiky utility function can in general outvote several people that have more even utility functions. So let's check how VarAI would treat a person with the most spiky function possible. Bob assigns max utility to exactly one option. And min utility to every other option. If we look at the definitions in detail, we see that in a group consisting of Bob and billions of normal human individuals, Bob would gain complete control over VarAI. As explained below, Bob would get his exact outcome implemented regardless of what he wants VarAI to do.
If interpersonal utility comparisons are used to define a goal for a clever and powerful AI, then there exists only a single decision situation that matters to Bob: choosing what successor AI to create. And for Bob, the only option without min utility is the option where the exact successor AI that Bob wants is created as soon as possible. And given complete control as soon as possible.
For any ordinary human individual, there exists a large number of potential successor AI designs that are close to the best possible design. And a large number of potential successor AI designs that are close to the worst possible design. Consider for example the set of AI designs such that: (i): each individual human is mapped to a clever AI delegate representing her interests, (ii): a different mapping is used for each individual, (iii): each delegate gets between 0 and 10^20 votes, and (iv): votes are aggregated in some reasonable sounding way.
For any normal human individual, the set of potential successor AI designs that satisfy (i)-(iv) includes a large number of close-to-best designs. And a large number of close-to-worst designs (and this set is of course just a tiny subset of possible successor AI designs).
The same conclusion is reached if one instead considers all the different ways in which a successor AI might reorganise the world. For any normal human individual there will be a large number of close-to-best worlds. And a large number of close-to-worst worlds.
For example: if Steve's best option is a world where large numbers of individuals lead pleasant lives, then there would be a large number of ways to slightly modify the circumstances and the people involved such that the outcome would remain close-to-optimal. If Steve's worst possible outcome is a world where a large number of people suffer, then there would also exist a large number of close-to-worst outcomes, corresponding to slight variations to people and circumstances. This remains true even if Steve only cares about biological (non-uploaded) humans that have descended from current humans. It also remains true if all of these people are instead simulated copies of Steve with slight variations.
So, regardless of how one counts successor AI options, any normal human would have a large number of close-to-best options. And a large number of close-to-worst options.
But for Bob there exists only one option that does not have min utility. After rescaling, the utility of Bob's one good option would be a large number. Bob would easily outvote billions of normal human individuals, regardless of what he would like the AI to do. In other words: if Bob wants to implement an outcome, and this outcome is seen as literally the worst possible outcome by billions of ordinary human individuals, then Bob wins the vote by a large margin. Bob also wins if an AI is using the MADAM or the MD normalisation methods.
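A toy numerical sketch of this effect (my own construction; the uniform-random utilities, the `normalise` helper, and the specific option and voter counts are illustrative assumptions, not anything taken from the normalisation paper). After variance normalisation, Bob's margin between his single good option and everything else grows like the square root of the number of options, while an ordinary person's margin stays roughly constant:

```python
import numpy as np

rng = np.random.default_rng(0)

n_options = 1_000_000   # stand-in for the vastly larger space of possible successor-AI designs
n_normal_people = 100   # kept small so the toy example runs quickly; see the final comment

def normalise(u):
    """Rescale a utility function to zero mean and unit variance over the options."""
    return (u - u.mean()) / u.std()

# Bob: max utility on exactly one option, min utility on every other option.
bob = np.zeros(n_options)
bob[0] = 1.0
nb = normalise(bob)
bob_margin = nb[0] - nb[1]      # roughly sqrt(n_options), i.e. ~1000 here

# An ordinary person: option 0 (Bob's pick) is literally her worst outcome,
# option 1 is her favourite, everything else is spread out in between.
person = rng.random(n_options)
person[0] = 0.0
person[1] = 1.0
pn = normalise(person)
person_margin = pn[1] - pn[0]   # roughly 3.5, independent of n_options

print(f"Bob's normalised margin:            {bob_margin:,.0f}")
print(f"Everyone else's combined margin:    {person_margin * n_normal_people:,.0f}")

# Even if all n_normal_people rank Bob's pick as their literal worst outcome and
# agree on a single favourite, their combined margin loses to Bob's single spike.
# Bob's margin scales as sqrt(n_options) while the opposition only scales linearly
# with the number of people, and the space of possible successor AIs is
# astronomically larger than any human population.
```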
It is worth noting that this problem is not present in the intertheoretic context. In the intertheoretic context a single human individual is using a mapping from a weighted set of moral theories and a situation, to a subset of actions that are already being considered by that human (this subset is then given the label: Morally Permissible). For a start: the number of actions is very different.
In the AI case, there is also a risk that Bob will ask the AI to hurt people (it seems difficult to predict what such an unusual mind would ask an AI to do, but this is certainly one possibility). This is not really an issue with moral theories. It would not be surprising to learn that a given human will ask an AI to hurt other humans (especially not a highly unusual type of pathological mind). It would however be very surprising to learn that a human puts credence in a moral theory that wants an AI to hurt another moral theory that this human also puts credence in. There is nothing incoherent about moral theories that refer to each other, and that want to hurt each other. But in practice this seems to be a lot more common with humans (this issue will be discussed further at the end of the next subsection).
Returning to the idea of removing the certainty of victory from a CAI
One feature of a CAI from the introductory thought experiment was that a narrow majority was certain of complete victory. Therefore they had no reason to negotiate or compromise, even if a large minority felt very strongly about something (including in the extreme case where a 49% minority would strongly prefer to avoid a scenario where a clever and powerful AI is hurting them as much as possible). A majority that barely cares would abandon any plans of oppressing such a minority, if trying to achieve such an outcome came with some form of non-negligible risk or cost.
Consider a modification to a CAI proposal where extrapolated delegates of individuals negotiate and vote under the assumption that votes are settled stochastically. This removes the issue described in the CAI thought experiment, because the majority in the CAI thought experiment would back down instantly. Even large majorities would be forced to negotiate with small minorities to some degree. If the concerns of a small minority are completely ignored, then they will vote for whatever outcome they prefer (without taking any of the considerations of the majority into account). In order to avoid the possibility of their concerns being fully ignored in the eventual outcome, the majority would need to negotiate. The amount of concessions a minority would be able to extract increases with the size of the minority.
If this sounds familiar, it might be because it is a description of the most recently published version of the proposal to build an AI that implements the Coherent Extrapolated Volition of Humanity: Parliamentarian CEV (PCEV). This proposal also has problems. Let's optimistically assume that each human is represented in negotiations by a delegate that perfectly represents her interests (this will allow us to focus on a problem that remains despite this assumption).
It turns out that PCEV gives extra power to anyone that wants PCEV to harm other humans. Consider Bill who believes that every other human deserves to suffer. Since delegates negotiate under the assumption that votes are settled stochastically, Bill voting for an outcome where PCEV makes everyone suffer would be seen as bad in expectation. How bad it is in expectation would depend on the seriousness of the harm that Bill wants PCEV to inflict. The only way to convince Bill to not vote for an outcome where PCEV makes everyone suffer, is to give Bill something. If Bill wants PCEV to inflict a minor form of harm on everyone, then he will only be able to ask for something minor. If Bill wants PCEV to inflict a more serious form of harm, then he will be able to ask for something larger. In other words: the more harm Bill wants to inflict, the more power he will get over the actions of PCEV.
In yet other words: people that want to inflict harm on others get a negotiation advantage in PCEV. People that want to inflict serious harm get a bigger advantage than people that want to inflict less serious harm. The largest possible advantage goes to any group that wants PCEV to hurt everyone else as much as possible. (Yudkowsky's description of this feature of PCEV can be found by searching the CEV arbital page for: ADDED 2023).
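A toy model of this dynamic (my own simplification, not the actual PCEV parliament mechanics; the `concession_budget` helper, the utility numbers, and the assumption that a non-Bill voter otherwise gets her favourite outcome are all illustrative). Under a random-dictator baseline, each delegate's fallback is the lottery in which every voter, including Bill, gets a small chance of dictating the outcome. The worse Bill's favoured outcome is for everyone else, the worse that fallback lottery is for them, and the more they are willing to concede to keep it off the table:

```python
def concession_budget(n_voters, harm_bill_wants):
    """How much total utility the other voters would rationally give up to avoid
    the random-dictator lottery, in a toy setting where each non-Bill voter gets
    utility 1 from her own favourite outcome and -harm_bill_wants from Bill's.

    Baseline for a non-Bill voter: with probability 1/n_voters Bill dictates,
    otherwise (in this simplified model) she gets her favourite outcome."""
    p_bill = 1 / n_voters
    baseline = (1 - p_bill) * 1 + p_bill * (-harm_bill_wants)
    per_person_gap = 1 - baseline   # what she loses in expectation at the baseline
    return per_person_gap * (n_voters - 1)

n = 1_000_000_000   # a billion voters; Bill is one of them
for harm in (0, 1, 1_000, 1_000_000_000):
    print(f"harm Bill wants to inflict: {harm:>13,}   "
          f"total concessions available to Bill: {concession_budget(n, harm):,.2f}")
```

In this toy model Bill's leverage grows linearly with the amount of harm he credibly wants PCEV to inflict, which is the negotiation advantage described above.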
The fact that PCEV would be dominated by whatever group of people wants to inflict the most serious harm on others, means that a successfully implemented PCEV would be very dangerous. A powerful and clever AI dominated by such people would in fact lead to a very bad outcome [LW · GW].
PCEV is another example where intuitions built up in one context do not survive the transfer to the AI context. Nick Bostrom proposed to use a Moral Parliament to construct a mapping from a set of weighted ethical theories and a situation, to a set of actions. Bostrom suggested that an individual could use this mapping to obtain a set of actions. And Bostrom further suggested these actions could then be given the label: morally permissible. Yudkowsky proposed to use that same Parliament procedure as a component, when constructing a mapping from billions of humans to an AI. In other words: Yudkowsky proposed PCEV. These two mappings are very different types of things.
In both cases the idea is to resolve disagreements (in the original proposal, between moral theories; in PCEV, between human individuals). Moral theories can disagree on how disagreements should be resolved. And human individuals can also disagree on how disagreements should be resolved. But moral theories and human individuals are very different types of things (at least if a normal human individual puts credence in all moral theories that are being considered). For example: if a given human individual puts credence in some set of theories, then it would be very surprising to discover that one of those theories wants to hurt the other theories. But for a large number of human individuals this is guaranteed to happen.
In other words: technically the problem does exist in the intertheoretic / moral uncertainty context. There is in principle nothing stopping a moral theory from referring to some other moral theory, and demanding that this other theory must be hurt as much as possible. And such a moral theory would in fact dominate Bostrom's intertheoretic aggregation method. A mind that puts credence in such a set of moral theories is not an impossible mind. But this is not a set of moral theories that one would expect to find in a human that is trying to resolve moral uncertainty. The issue is with intuitions. In the moral uncertainty context, the intuition that it is safe to ignore this type of hostility is just far more reasonable. More generally: these are just very different types of mappings (the difference between various types of mappings will be discussed further below).
Ways of removing the problematic negotiation advantage from PCEV
One proposed way of patching PCEV is to keep the random dictator negotiation baseline used in PCEV but restrict the options that delegates are able to vote for to AI actions that are Pareto Improvements [LW · GW]. No group of people would be allowed to vote for an outcome where the AI hurts everyone else as much as possible. Because being hurt as much as possible by a powerful and clever AI is not an improvement for the people being hurt. This removes the problematic negotiation advantage issues that PCEV suffers from. If this sounds like a coherent proposal, then it is another example of intuitions breaking during the transfer to the AI context.
For a large group of humans, any conceivable action by a clever and powerful AI (including inaction) will make things worse for someone. Thus, in the AI context, no possible action will ever be a Pareto Improvement [LW(p) · GW(p)]. The proposal in question [LW · GW] is thus based on a definition of acceptable actions, according to which all conceivable actions are unacceptable. Due to the fact that the set of actions that delegates are supposed to vote on is always empty, this is not a coherent proposal. (All other proposals discussed in the present post are variations of the idea of building an AI that does what a group wants the AI to do. But the Pareto-AI idea is not a coherent proposal. Which in turn means that it is not a Group AI proposal).
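A small probabilistic sketch of why this is so (my own toy model, and a weaker probabilistic stand-in for the deterministic claim above; the 99% per-person agreement rate and the population and action counts are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

n_people = 10_000     # stand-in for billions of humans
n_actions = 1_000     # candidate AI actions (including variants of "do nothing")
p_better_off = 0.99   # chance that a given person judges a given action an improvement

# An action is a Pareto Improvement only if *every single person* judges it an
# improvement. Even with 99% agreement per person, that essentially never happens.
judgements = rng.random((n_actions, n_people)) < p_better_off
pareto_improvements = judgements.all(axis=1).sum()

print(f"Pareto improvements found: {pareto_improvements} out of {n_actions}")
print(f"Probability a given action helps all {n_people} people: {p_better_off ** n_people:.2e}")
```

With billions of people and any realistic rate of disagreement, every conceivable action fails this all-of-humanity test, so the restricted set of votable actions is empty.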
This is another case where intuitions that were built up about a concept in one context fail in the AI context. (Since this last claim is a claim about intuitions, it might be false for some readers. Some readers might for example find it intuitively obvious that no possible action can ever be a Pareto Improvement for a large group of humans. For such readers the proposal described above was presumably instantly recognised as incoherent. And for such readers, intuitions about the Pareto Improvement concept did not fail when it was transferred to the AI context. The same is true for all claims made about intuitions throughout the present post, including in places where this caveat is not explicitly spelled out).
Let's look at another way of patching PCEV. Even a tiny group of fanatics can completely dominate PCEV if they are hateful enough and determined enough [LW · GW]. One way of dealing with this situation would be to remove the stochastic negotiation baseline from PCEV. This is a coherent proposal. And it would prevent such a tiny group from dominating the outcome. But this just brings us back to the issue with solid majorities, illustrated by the initial CAI thought experiment. This non-stochastic-PCEV would allow literally any solid majority (no matter how narrow) to do literally anything they want to everyone else. In other words: all proposals along the lines of non-stochastic-PCEV suffer from a serious problem, for the reasons illustrated by the CAI thought experiment at the beginning of the present section.
We have now come full circle, and it is time to switch to a more ambitious argument. The next section argues that these are all just slight variations of a single proposal: to build an AI that does what a group would like this AI to do. It will be further argued that all problems mentioned so far are due to an underlying problem. A problem that is inherent in the core concept of this Group AI alignment target proposal. This underlying problem (that the rest of the post will be focused on) can be roughly described as: the core concept of a Group AI is incompatible with giving each individual meaningful influence over the way in which the Group AI adopts preferences that refer to her.
Describing these proposals as slight variations of a single proposal: to build an AI that is describable as doing what a group wants
Let's take the perspective of Steve: an ordinary human individual that will not be given any special influence over an AI project. This AI project seeks to implement one of the proposals mentioned above (where the AI in question will get its goal from a group of billions of humans, that include Steve). The rest of this post will argue that from his perspective, the AIs mentioned above all look like slight variations of the same proposal. And further argue that this proposal is bad for Steve. (We will not make a lot of assumptions about Steve. So this conclusion should be relevant for many different readers).
A very rough and informal description of the argument that will be outlined in the rest of the post: A clever AI would be very good at coming up with clever ways of making sure that Steve will have the type of outcome that this AI wants for Steve. Thus, the central question for Steve would be which Steve-referring preferences this AI will have. If Steve has no meaningful influence over the adoption of these preferences, then it would be rational for Steve to strongly oppose the AI project in question. One could give each person such influence. But none of the proposals discussed so far does this. And doing so is incompatible with the core concept of a Group AI (as will be shown in the next subsection).
This argument (that the Group AI alignment target proposal is bad for Steve) will assume that Steve is vulnerable to a clever AI that wants to hurt Steve. But this does not correspond to any dramatic assumptions about Steve. In particular: it does not imply an assumption that Steve is selfish. An AI that wants to hurt Steve can find clever ways of targeting whatever it is that Steve happens to care about. And a clever AI would be good at finding clever ways of getting around whatever constraints it is operating under (such as also wanting specific things for billions of other people).
A brief summary of the argument of the present section: a clever AI would be good at getting around various constraints. So what really matters for Steve is what this AI wants for Steve. But if the design is a Group AI, then the designers have made the decision to implement a design where Steve will be denied any meaningful influence over what this clever AI will want for Steve. This feature seems far more relevant than any of the details that distinguish the proposals mentioned so far. So, from Steve's perspective, it would be useful to view them all as slight variations of the same proposal. In other words: from Steve's perspective the real problem with all of the proposals in the first section is this lack of influence over the adoption of Steve-referring preferences. Not the details of the various mappings that define the Group entity that will determine Steve's fate.
For all proposals mentioned in the first section, (i): the AI will adopt preferences that refer to Steve, (ii): the AI will give Steve a negligible amount of influence regarding which Steve-preferences the AI will adopt, (iii): a Group entity is defined as the output of a mapping that maps sets of billions of human individuals to entities of the type that can be said to want things, (iv): the AI will adopt all of its Steve-preferences from the result of this mapping, when the mapping is pointed at the current human population (a population that includes many different types of unusual minds, for example along the lines of Bob the determined from the VarAI thought experiment; and a population where it is common to intrinsically value hurting other humans, for example along the lines of thinking that heretics deserve eternal torture in hell), (v): the AI will have a very large ability to hurt Steve, and finally (vi): there are absolutely no restrictions on what this AI can do to Steve.
The set of alignment targets that have all of these properties is a tiny fraction of AI design space. It is the result of several, very specific, choices. And when viewed from the perspective of Steve (who has no special influence over any AI project) this is a very surprising set of choices (if Steve assumes that the designers care about his well being). In other words: if we are trying to build an AI that would be good for Steve, then this sounds like a very counterintuitive set of choices. These proposals are all members of a very small, highly specific, and very counterintuitive, class of proposals. From Steve's perspective, it is difficult to see why the details would matter. If one actually thinks about what (i)-(vi) means for Steve (who has no special influence over a given AI project), then it should not be surprising to discover that this proposed AI is very bad for Steve in expectation (for reasons that are not related to any specific detail of any specific variant of this alignment target proposal).
Let's denote any AI that is describable as doing what a group wants that AI to do as a Group AI, or a GAI. All proposals covered so far are GAI proposals. One way of summarising the point made in the present section would be: the problem is not in the details of any given GAI proposal. The problem is that there is simply no reason to expect some arbitrarily defined Group entity to be good for individuals. There is simply no reason to think that whatever G entity is implied by an arbitrarily chosen set of definitions would have Steve-referring preferences that are safe for Steve. So, if Steve has no meaningful influence regarding the adoption of those preferences that refer to Steve, then Steve has no reason to think that a GAI will want to help Steve, as opposed to wanting to hurt Steve.
All individuals could be given such influence. But, as will be shown in the next subsection, doing so is incompatible with the core concept of building an AI that is describable as doing what a group wants the AI to do. In other words: giving individuals such influence means that the proposal is not a GAI.
Let's introduce some notation for an alignment target feature. Iff an alignment target gives each person meaningful influence regarding the adoption of those preferences that refer to her, then we say that this alignment target has the Self Preference Adoption Decision Influence (SPADI) feature. In some cases it will be unclear if a given proposal should be described as having the SPADI feature. And in some cases it will be unclear if a given proposal should be described as a Group AI (GAI). While plenty of unclear cases exist, there also exist plenty of clear cases. It will be shown below that if an alignment target is possible to describe as a GAI, then it is not possible to describe it as having the SPADI feature. (The SPADI feature was previously discussed here [LW(p) · GW(p)], and is related to the EM feature discussed here [LW · GW]).
A trivial scenario that illustrates why no GAI design is describable as having the SPADI feature
This subsection outlines a scenario, illustrating why the SPADI feature is incompatible with the Group AI alignment target proposal. The scenario involves a decision. Every GAI will choose one option. No AI with the SPADI feature will choose this option. This is sufficient for proving the non-existence of a GAI design with the SPADI feature. (Some readers might find this scenario to be unsatisfactory as an intuitive explanation of the source of this incompatibility. Therefore the scenario is then followed by a longer discussion of the underlying reasons for the incompatibility.)
Consider an AI that is getting its goal entirely from a group of billions of individuals that include Dave, who cracks his knuckles. Dave wants to continue being the type of person who likes to crack his knuckles. Everyone else would like Dave to stop being the type of person who likes to crack his knuckles. The AI is able to modify Dave in a way that has no relevant side effects (no one would, for example, see the intervention as an impermissible violation of Dave's rights). Dave's knuckle cracking is the only source of disagreement (they all, for example, agree on how much resources each individual should get). They also all agree that everyone feels equally strongly about Dave's knuckle cracking. Everyone agrees that the will of the group is to put a stop to the knuckle cracking. None of them have any relevant meta preferences. For example: no one has a meta preference about individual autonomy. No one has a meta preference about doing what the group wants. They only disagree on this one issue: whether or not Dave should continue to be the type of person that likes to crack his knuckles.
If a proposed AI design implies an AI that would leave Dave alone when faced with this group of people, then it is simply not describable as a GAI proposal. Such an AI does not do what this Group wants it to do, by any reasonable set of definitions. There exist cases where it is unclear whether or not a given design is describable as a GAI proposal. But this case is clear. No AI that leaves Dave alone is describable as a GAI. All GAI designs choose the option of modifying Dave.
If a proposal implies an AI that would turn Dave into someone that no longer likes to crack his knuckles, then it is simply not possible to describe this proposal as having the SPADI feature. Dave does not want to be modified. Modifying Dave is not instrumental in any way (in this scenario, such a modification would not be done as a way of achieving any type of second order effect). So if Dave is modified, it is because the AI adopted preferences that refer to Dave, in a way that Dave had no control over. There exist cases where it is unclear whether or not a given design is describable as having the SPADI feature. But this case is clear. No AI that modifies Dave is describable as having the SPADI feature.
In other words: all designs that are describable as a GAI choose one option (they all modify Dave). And no design that is describable as having the SPADI feature chooses this option (they all leave Dave alone). Therefore there exists no AI that is describable as a GAI with the SPADI feature. In yet other words: the SPADI feature is incompatible with the core concept of building a GAI. While this is sufficient for establishing the incompatibility, this incompatibility will still be discussed further in the rest of this section, as well as in the following section (in a way that is hopefully more intuitive). It will then be argued that the SPADI feature is a necessary feature of an alignment target (similar to how the EM feature is a necessary feature of a Membrane formalism [LW · GW]). (The argument against building a GAI is based on establishing that the SPADI feature is both necessary, and also incompatible with building a GAI).
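For readers who prefer the argument compressed into symbols (my own notation; $D$ stands for the knuckle-cracking scenario, and $A(D)$ for the option that a design $A$ picks in it):

$$\forall A:\ \text{GAI}(A) \Rightarrow A(D) = \text{modify}, \qquad \forall A:\ \text{SPADI}(A) \Rightarrow A(D) = \text{leave}$$

$$\text{modify} \neq \text{leave} \;\Longrightarrow\; \neg\,\exists A:\ \text{GAI}(A) \wedge \text{SPADI}(A)$$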
There are of course scenarios where a given AI with the SPADI feature will act in the same way as a given GAI. But the same can be said about the set of AIs that are describable as Cheesecake Maximising AIs (CMAI). In a scenario where all individuals care monomaniacally about maximising Cheesecake, we would expect an AI to be maximising Cheesecake, regardless of whether it is a GAI, a CMAI, or an AI with the SPADI feature that gets its goal entirely from these people. This does not imply the existence of an AI design that is a member of more than one of these three sets. No CMAI can be described as a GAI, or as having the SPADI feature (even though a given CMAI will sometimes respond to a given situation in the same way as a given GAI, or a given AI with the SPADI feature).
Some readers might find the above to be an unsatisfying explanation of this incompatibility. So the next subsection will attempt to discuss the underlying cause of the incompatibility. In other words: some readers might have intuitions regarding an AI that does what a group wants. These intuitions might object to this incompatibility. One type of implicit assumption that would explain such an intuition would be some version of the following: "If an AI is supposed to do what a group of humans wants it to do, and this AI ends up hurting individual humans, then surely there exists someone, out there, that will object to this. Some Group entity, with a separate existence, that will object to any action that is so obviously wrong". The reason that such an implicit assumption leads to incorrect intuitions is that no such entity exists.
Groups and Individuals really are completely different types of things
Let's introduce an analogy with human individuals and cells, as a way to counter any implicit assumptions along the lines that doing what is good for a group must be good for individuals. It would not be particularly surprising to discover that some reasonable way of defining what it means for an AI to do what Gregg wants an AI to do, will lead to every one of Gregg's cells dying (for example because Gregg wants to be uploaded and would prefer that his cells are not left alive after uploading). This outcome is not an indication of any form of problem with the extrapolation method.
In other words: when evaluating an extrapolation method, the death of every one of Gregg's cells is not a red flag. Gregg and one of Gregg's cells are just two completely different types of things. The interests of Gregg and one of Gregg's cells might of course coincide in many everyday situations (wearing a seatbelt might be good for both). But there is nothing surprising about discovering that doing what one type of thing wants, is bad for a completely different type of thing.
Now consider Bruce, who suffers from a confusion that is similar to the implicit assumption of an independently existing Group entity. Bruce makes the same basic type of mistake, but in terms of individuals and cells. Specifically: Bruce believes that if a given way of extrapolating Gregg leads to all of Gregg's cells dying, then there must be some form of problem with the extrapolation method. Bruce assumes that Gregg's cells would be fine if a better extrapolation method could be found. Bruce makes the implicit assumption that a “collection of cells entity” exists, and that this entity shares Bruce's definition of what is good for cells. The problem is again that no such entity exists. One can define any number of mappings from sets of cells to the type of entity that can be said to want things. When pointed at Gregg, these mappings all result in completely different such entities. They will all want completely different things.
One can try to construct an AI that does what an individual wants it to do (based on the implicit assumption that this will, by definition, be good for cells if done correctly). One can then point it at Gregg (and interpret any implied outcome where Gregg's cells die as indicating a bad solution to a well defined technical problem). Alternatively, one can instead try to construct an AI that ensures the survival of cells. These are two very different design problems. Trying to solve these two very different problems will result in very different types of proposals.
Similarly: groups and individuals are also completely different types of things. Trying to construct an AI that does what a Group entity wants the AI to do is one design problem. Trying to construct an AI that will be safe for individuals is a very different design problem. If one is trying to solve this latter design problem, then it becomes natural to ask how one might prevent the scenario where an AI wants to hurt an individual. And if one is trying to figure out how to prevent this scenario, then it would be natural to focus on the part of the process where the AI adopts preferences that refer to this individual. But if one is instead trying to construct an AI that implements the will of some independently existing Group entity, then the adoption of self-referring-preferences is a detail. Focusing on this particular detail is just a distraction from the real goal: trying to describe the will of this free floating G entity. Implementing a special rule to deal with this detail would subvert the will of the G entity.
If Allan implicitly assumes the separate existence of a Group entity, then Allan might intuitively view the choice of GAI definitions as an attempt to find the correct way of representing the wishes of this G entity. Allan might falsely think that definitions can be arrived at by solving a technical problem (the technical problem of describing the will of this separately existing G entity). This matters when Allan forms intuitions regarding various proposals. Especially proposals to delegate the choice of those definitions that will imply a specific G entity (for example proposals to delegate this choice to an extrapolated human or a Corrigible AI). It matters because Allan implicitly assumes that the task that has been delegated, can in some sense be viewed as solving a technical problem.
Talking to Allan about the GAI alignment target proposal, might be a bit like talking to Bruce about the proposal to build an AI that does what Gregg wants the AI to do. One problem with both perspectives is the implicit assumption of a non existing entity. Another problem is the failure to understand that it would not be at all surprising to discover that doing what one type of thing wants, is very bad for a completely different type of thing.
Unless they get rid of their implicit assumptions about non-existing entities, the intuitions of Bruce and Allan will strongly reject any reasonable description of the situations they face. This confusion in turn means that Bruce will not design an AI that protects cells. And Allan will not design an AI that protects individuals. Allan will continue to propose AI designs that are bad for individuals. Because Allan is not trying to design an AI that is good for individuals. Allan is trying to design an AI that implements the will of a free floating G entity (that does not exist). Allan is not actually failing to find an AI design that is safe for individuals. Because that is not what Allan is actually trying to find.
To Bruce one might try saying: A human individual is a completely different type of thing than the cells that make up an individual. There is no particular reason to expect that doing what a given individual would want, will be good for the cells in that individual.
To Allan one might try saying: You arbitrarily define an abstract entity, called a "group", and then you propose to do what this entity would like done. Since this entity is a completely different type of thing than the individuals that make up a group, there is no particular reason to expect that doing what a given group would want, will be good for the individuals in that group.
The next section will focus on a more general issue: The fact that concepts often have counterintuitive implications when they are used in a new context. In particular: one should expect intuitions built up while thinking about extrapolating the will of an individual to fail completely for proposals to extrapolate the will of a group. As will be outlined in the next section: the main issue is that extrapolating the will of an individual, and extrapolating the will of a group, are simply two completely different types of mappings. They do map to similar entities. But they map from completely different types of things.
Two very different types of mappings and a problematic implicit assumption
The Coherent Extrapolated Volition of a human Individual (CEVI) is a completely different type of thing, than the Coherent Extrapolated Volition of Humanity (CEVH). Both are mappings to an entity of the type that can be said to want things. But only CEVI is a mapping from an entity of the type that can be said to want things (the original human). CEVH does not map from such an entity. CEVH only maps to such an entity. A group of billions of human individuals can only be seen as such an entity, if one already has a specific way of resolving disagreements, amongst individuals that disagree on how to resolve disagreements. Such a disagreement resolution rule is one necessary part of the definition of any CEVH mapping.
In the CEVI case, there already exists someone that can coherently object to the resulting mind. Thus, a claim that some specific CEVI mapping is bad, can sometimes be described as referring to something coherent and real. However, let's explore what it would mean to claim that a CEVH mapping is bad. What would it mean to claim that Humanity would object to some version of CEVH? Basically, it would correspond to an implicit assumption that some Group entity already exists. A G entity that exists completely independently of any mapping. A G entity of the type that can be said to want things. A pre-existing, free floating, G entity of the type that can coherently be said to object to a bad way of describing what it wants.
There exists a large number of reasonable sounding ways of mapping billions of humans to such a Group entity. These mappings all result in completely different G entities. They all want completely different things. Some G entities are dominated by Bob the determined from the VarAI thought experiment. Other G entities are dominated by any solid majority, no matter how narrow. Still other G entities are dominated by a tiny group of individuals that intrinsically value hurting other individuals [LW · GW]. Etc, etc, etc, etc.
In other words: for a given group of billions of humans, there exists a very large number of such G entities. They all want completely different things. There is no objectively correct G entity out there, with a separate existence, that can coherently object to its interests not being represented properly. Because before one has specified some specific way of resolving disagreements, amongst individuals that disagree on how to resolve disagreements, billions of human individuals simply cannot speak with one voice. The only mind that could object to a given mapping, is the mind that is created by that very mapping. This is a very different scenario, compared to the case when one takes the CEV of an Individual (CEVI). It would thus be a remarkable coincidence, if intuitions built up while thinking about extrapolation in the CEVI context, would turn out to be appropriate in the CEVH context.
Another way to see this difference is to examine the proposal to pick a mapping that does nothing. For CEVI this results in an entity of the type that can be said to want things (the original human). For CEVH however, this proposal is a nonsense proposal. Because it does not result in an entity of the type that can be said to want things. So, for a CEVI mapping, there exists a natural place to start. A natural mapping that one can take as the start of an iterative design process. For a CEVH mapping however, the idea of a natural place to start is nonsense. This is an important difference between the CEVI mapping and the CEVH mapping.
The intuition that there exists a natural place to start modifying a CEVH mapping from, corresponds to an intuition that there exists some objectively correct G entity, out there, with a separate existence, that can object to a bad way of representing what it wants the AI to do. (There exist many places where one can start with a CEVH mapping. Each place to start corresponds to a different way of resolving disagreements. And thus implies a different initial G entity. These initial G entities all want completely different things. Some are completely dominated by any solid majority, no matter how narrow. Others are dominated by a tiny minority that intrinsically value hurting others. Still others are dominated by Bob the determined from the VarAI thought experiment. Etc, etc, etc, etc).
In other words: there is no reason to believe that intuitions built up while thinking about CEVI mappings would be appropriate for CEVH mappings. In yet other words: it would be a remarkable coincidence if intuitions developed while thinking about extrapolating individuals, would turn out to be appropriate when thinking about the proposal to build an AI that implements the Coherent Extrapolated Volition of Humanity.
As with most things, one can coherently make claims along the lines of: “this CEVH mapping is bad for me”, or “this CEVH mapping is bad for Jeff”. That is however a very different type of claim, compared to claims along the lines of: “this CEVH mapping is a bad way of representing the will of the free floating G entity, that exists independently from any choice of definitions”. We will return to this topic in the next subsection where we will explore three related proposals, all based on the idea of evaluating a given CEVH mapping relative to some specific person: Corrigibility, a last judge, and the idea of letting an extrapolated person pick a set of definitions.
But let’s first try to reason about this situation, without the intuition that there exists some Group entity, “out there”, that can coherently object to “bad CEVH mappings”. After abandoning this intuition, it becomes entirely unsurprising to learn that a given Group AI will want to hurt individuals. Why would it not? And it becomes natural to think about how to protect individuals from a powerful AI. So, how does one protect an individual (that gets no special treatment) from an AI that gets its goal from billions of humans?
It is difficult to protect someone once a powerful and clever AI has already decided that it wants to hurt that individual. So the most natural way to protect an individual is to look at the point at which preferences that refer to her are adopted by an AI. Unless specifically given such influence, no individual in a large group will have any meaningful influence over this decision. And without such influence there is no reason to think that the resulting AI will want to help, as opposed to want to hurt (if we remember to disregard all intuitions built on top of implicit assumptions about an independently existing G entity that will come to the rescue). Therefore, such influence is a necessary feature of a safe alignment target. (In other words: while the SPADI feature is not sufficient for safety, it is necessary. The role that a feature like this can play in a design process was previously discussed in the context of Membrane formalisms [LW · GW]).
Every individual can be given such influence. Doing so means that the resulting AI is no longer describable as a Group AI. Because no GAI is describable as having the SPADI feature (as was illustrated above by Dave and his knuckle cracking). Therefore, the core concept of the Group AI alignment target proposal is incompatible with the necessary SPADI feature.
If one remembers to disregard all intuitions that rely on the implicit assumption of a separately existing G entity that must be catered to, then abandoning the idea of a Group AI should not be a dramatic suggestion. Without this implicit assumption about a free floating G entity, the fact that the SPADI feature is incompatible with building a GAI, is no longer a negative aspect of the SPADI feature. The adoption of self-referring preferences is no longer a detail. Defining a special rule for this step is no longer subverting the will of any free floating G entity that must be catered to. Giving individuals influence over the adoption of self-referring preferences is no longer interfering with an attempt to solve the technical problem of describing the will of the independently existing G entity.
In other words: insisting on the SPADI feature is not dividing by zero. It is not blasphemy. It is not breaking the rules. It is not going against the will of anything that actually exists. For a given set of billions of humans, there exists a large number of possible G entities, implied by different implementable CEVH mappings (such as PCEV, VarAI, non-stochastic-PCEV, etc, etc, etc). These entities all want completely different things. Every conceivable action, including inaction, thus goes against the wishes of a large number of such entities (including every conceivable proposed AI project. And the proposal to not build any AI at all). There is thus no possible action (including inaction) that anyone can ever take, that would avoid going against the wishes of a large number of such entities. Insisting on the SPADI feature is thus not special in this regard.
When one insists on the SPADI feature, what one is actually doing is protecting individuals. And this is a set of minds that actually does exist, in a way that is separate from any specific proposed mapping. This is a type of mind that one is causally connected to in a much stronger sense. On the other hand: the connection to any specific G entity goes through some arbitrarily chosen set of CEVH mapping definitions (even if one assumes the successful implementation of the GAI alignment target proposal, then this does not imply any specific G entity).
In yet other words: it would be nonsense to say that a GAI project is trying to describe what some free floating G entity wants an AI to do. Because one step of a GAI project would be to choose a set of definitions for a mapping from billions of individuals to an entity of the type that can be said to want things. Thus, any reasoning that implicitly assumes that such a G entity already exists, is known to be fundamentally confused (it is not flawed in the sense of being false. But flawed in the sense of meaning nothing. Two people debating how best to describe the will of the free floating Group entity, are just as confused as two people debating how big the spherical triangle is).
Three coherent proposals where someone can reject a given CEVH mapping
This subsection will discuss three related proposals. In each case there will be someone that can coherently be said to have an opinion about a given CEVH mapping: (i): Corrigibility, (ii): a last judge, (iii): letting an extrapolated person pick definitions. These three ideas are distinct and will be discussed separately below. But they all share two important features. They can all be used by many different AI projects, for example projects aiming for a GAI, or projects aiming for an AI with the SPADI feature. So none of these ideas offers much help when trying to decide what alignment target an AI project should aim at. And in all three cases, an effort to implement the idea might fail. So none of these ideas removes the risk involved with initiating an AI project aiming at a bad alignment target (because aiming at a bad alignment target remains dangerous, even if efforts are made to implement one of these ideas).
Some versions of these ideas can be seen as proposing general purpose methods for hitting alignment targets (either directly or by allowing the designers to adopt a trial and error approach to AI design). They are basically proposed ways of reducing the probability of extinction from a misaligned AI. Such methods are out of scope of the present text. They could be used to implement PCEV or VarAI. They could also be used to implement some AI with the SPADI feature. In other words: they would not reduce the risk of someone successfully implementing a bad alignment target. And they would not help much when trying to figure out what alignment target to aim at. They are in fact complementary risk mitigation strategies (see this comment [LW(p) · GW(p)] for a longer discussion). The present subsection is instead discussing versions of these three proposals that are supposed to be able to prevent scenarios where someone successfully implements a bad alignment target.
Let's start with the idea to build a Corrigible AI that would allow a human designer to go through some form of trial and error process while defining a given mapping. In this case the person that rejects a CEVH mapping is a human designer that has discussed the mapping with a clever and Corrigible AI. This type of Corrigible AI proposal is importantly different from the proposal to build a Corrigible AI that can only be used for limited tasks, such as shutting down competing AI projects. (The connection between ATA and the idea of using a limited AI to shut down competing AI projects was previously discussed here [LW · GW] and here [LW(p) · GW(p)]). The type of Corrigibility that we are exploring in the present subsection however, is the type that might prevent you from successfully hitting a bad alignment target. In other words: if you start an AI project aiming at a bad alignment target, the result might be a clever AI explaining to you that your chosen alignment target implies an outcome that you would not like.
While an effort to build a Corrigible AI could prevent a bad alignment target from getting successfully implemented, it is also possible that the effort to make an AI Corrigible will fail on some implementation step. Additionally, an effort to make an AI Corrigible might also fail even if everything is successfully implemented [LW · GW]. It is also possible that an AI might be Corrigible when discussing some topics, but not Corrigible when discussing other topics [LW(p) · GW(p)]. To end up with a Corrigible AI, some human has to successfully define what it means for an AI to be Corrigible. If this definition step fails, then there will be no Corrigible AI available to allow a second attempt (even if everything is implemented successfully, the resulting AI will not be Corrigible). In other words: this idea does not remove the risk from an AI project that is aiming at a bad alignment target. Which in turn means that the idea does not remove the need to analyse alignment targets. If one does manage to construct some Corrigibility method that looks like it might work, then this can be used for an AI project aiming for a GAI. It can also be used for an AI project aiming for an alignment target with the SPADI feature. So the idea is not very helpful when deciding what alignment target to aim at.
Let's now turn to the last judge idea. The proposal is to give the extrapolated version of a person a veto over one specific mapping. This can prevent bad outcomes if it works. However, since the implementation of such an add on might fail, it does not remove the risk from an AI project that is aiming at a bad alignment target. Besides implementation issues, the idea also depends on having a good definition of what it means to extrapolate a human. Extrapolation is a tricky concept, and the definitions of an extrapolation dynamic must be chosen by a non extrapolated designer. Even a successfully implemented last judge might be ineffective due to a problem with the extrapolation definition [LW · GW]. Such an add on can be added to an AI project aiming for a GAI. It can also be added to an AI project aiming for an AI design with the SPADI feature. So the idea is not very helpful when deciding what alignment target to aim at.
Delegating the choice of definitions to an extrapolated human
An extrapolated human can be asked to pick the definitions of a GAI. Alternatively, an extrapolated human can be asked to pick the definitions of an AI with the SPADI feature. So this proposal also does not help us much when choosing between a GAI and an AI with the SPADI feature. But let's explore what it would mean to define a GAI in this way. Let's again take the perspective of Steve, who has no special influence over an AI project. An extrapolation method will be aimed at Tim, and the result of the extrapolation will choose definitions for an AI. Let's refer to the outcome of this human-defined extrapolation method as Extrapolated Tim (ET). And let's remember that we really do not know much about what to expect from a superintelligent mind that comes out at the other end of a human-defined extrapolation method (even if we assume successful implementation).
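To keep the moving parts of this proposal separate, the delegation structure can be sketched as follows. This is again a hypothetical illustration, not a design: `extrapolation_method`, `pick_definitions`, and the template argument are all placeholders.

```python
# Hypothetical sketch of delegating definitions to an extrapolated human
# (illustrative only; all names are placeholders).

def delegate_definitions(tim, extrapolation_method, alignment_target_template):
    # Step 1: a human-defined extrapolation method is aimed at Tim. We know
    # very little about the mind that comes out the other end, even if we
    # assume that the method is implemented exactly as defined.
    et = extrapolation_method(tim)

    # Step 2: ET picks the definitions for the chosen template, which could
    # be a GAI or an AI with the SPADI feature.
    definitions = et.pick_definitions(alignment_target_template)

    # Note what is absent: Steve appears nowhere in this pipeline. If the
    # template is a GAI, nothing here gives Steve any influence over the
    # adoption of preferences that refer to Steve.
    return definitions
```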
For this to be a coherent idea, the proposal must be something other than a plan to ask ET to find the correct description of what an independently existing Group entity wants an AI to do (because no such G entity exists). In other words: ET is not being asked to solve a technical problem. If ET had been asked to solve a technical problem, then it might have been enough for the extrapolation method to result in an intelligent mind that is honestly attempting to solve that technical problem. But the plan is either to ask ET to do something other than solve a technical problem, or the plan is hopelessly confused. In particular: if ET picks a set of definitions that implies a mind that wants to hurt Steve, then there does not exist any coherent sense in which ET has failed at a technical task. The task that ET has been given has many valid solutions. Some of these valid solutions are very bad for Steve. PCEV is one such valid solution.
The fact that there exists no coherent sense in which PCEV is a bad solution to the task that ET has been given matters a great deal. It matters when we are trying to estimate what would be required for a given plan to be safe for Steve, and it is very important to keep in mind when evaluating a plan where some specific extrapolation method will be used to pick definitions for a given alignment target. There exists no free-floating G entity that can be counted on to want to help Steve. So if the plan is to ask ET to pick definitions for the GAI alignment target, then the choice of alignment target does not provide any protection for Steve, because this alignment target choice means that Steve will have no influence regarding the adoption of those preferences that refer to Steve.
In other words: the task that ET has been given is not a well-defined technical problem with a solution that can be found by being smart enough. The mapping that ET is defining does not map from an entity of the type that can be said to want things (it only maps to such an entity). Thus, any intuition along the lines of there existing a G entity that can object, in the sense that an individual could object, is simply confused, because the specific decision that has been delegated to ET is the decision of how to construct such an entity. This means that an issue with the extrapolation dynamic cannot be counted on to self-correct. There is no well-defined technical problem to ground things in.
(As a tangent, it is worth noting that if the extrapolation of Tim's values goes off the rails, then there exists no objectively correct solution that might steer an intelligent and well-meaning ET back towards objectively good technical solutions. ET finding a valid solution to the task that ET has been assigned offers no meaningful protection even to Tim, because the chosen alignment target means that no one will have any meaningful influence regarding the adoption of self-preferences. So the fact that the solution found by ET is valid does not imply any form of safety. On the contrary: it is the very fact that the solution is valid that ensures that Tim will have no influence regarding the adoption of those preferences that refer to Tim.)
In yet other words: ET's job is to define a mapping from billions of individuals to an entity of the type that can be said to want things. This is simply not a well-defined technical task that can be solved by being smart and well-intentioned. So what Steve should expect from a successful plan is for ET to pick a very reasonable-sounding and clever CEVH mapping, and then for Steve to be subjected to the will of whatever mind this mapping implies. Steve has no reason to expect this mind to want to help Steve, as opposed to want to hurt Steve, because the one thing that Steve actually does know is that Steve will have no influence regarding the adoption of those preferences that refer to Steve (because if the plan succeeds, then ET will pick the definitions of a GAI).
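One way to see why validity offers no protection here is to compare the shape of a well-posed technical task with the shape of the task that ET has been given. The sketch below is purely illustrative, and every name in it is made up: a well-posed task comes with a correctness check that a wrong answer can fail, while ET's task is to choose one of many type-correct mappings, with no correctness check to consult.

```python
from typing import Callable, List, Protocol


class Agent(Protocol):
    """An entity of the type that can be said to want things."""
    def wants(self, outcome: object) -> float: ...


# A well-posed technical task comes with a correctness criterion:
# a wrong answer can, at least in principle, be detected and rejected.
def solve_technical_task(candidate: object,
                         is_correct: Callable[[object], bool]) -> bool:
    return is_correct(candidate)


# ET's task has a different shape. ET is asked to choose a mapping from
# billions of individuals (a set, which is not itself an Agent) to an Agent.
GroupMapping = Callable[[List[Agent]], Agent]

def ets_task(pick_mapping: Callable[[], GroupMapping],
             individuals: List[Agent]) -> Agent:
    # Any mapping ET picks counts as a "valid" answer; PCEV is one such
    # valid choice. There is no is_correct() here for a bad choice to fail.
    mapping = pick_mapping()
    return mapping(individuals)  # the mind that everyone is then subject to
```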
(Any intuition along the lines that ET could solve the well-defined technical problem of locating the mapping that an idealised version of Tim would pick is either based on nonsense or trivially circular, because ET is, literally, the chosen definition of what it means to be an idealised version of Tim.)
To summarise: this method can be used to pick the definitions of a GAI, or to pick the definitions of an AI with the SPADI feature. Delegating the choice of definitions to the outcome of a human-defined extrapolation procedure also does not remove the risks involved in aiming for a bad alignment target. Picking extrapolation definitions is tricky even if we assume successful implementation. And when ET is asked to pick definitions for a GAI, the delegated task is not the type of task that might steer a badly extrapolated ET back to safe solutions by virtue of ET being good at solving technical tasks. When ET picks definitions for the GAI alignment target proposal, validity does not imply safety. On the contrary: validity then ensures a lack of control over the adoption of self-preferences.
In other words, and more generally: adopting this method does not imply that the AI project in question is aiming at any specific alignment target. It also does not remove the risk from an AI project that is aiming at a bad alignment target, which in turn means that it still matters what alignment target is aimed at.
In yet other words, and even more generally: unless this method is assumed to always result in an AI that is functionally equivalent to an AI that does whatever Tim wants that AI to do, it will continue to matter what alignment target ET is asked to pick definitions for. And it will continue to be rational for Steve to insist on an alignment target such that successful implementation implies safety for Steve.
2 comments
Comments sorted by top scores.
comment by Seth Herd · 2025-01-15T21:00:30.901Z · LW(p) · GW(p)
I very much agree with your top-level claim: analyzing different alignment targets well before we use them is a really good idea.
But I don't think those are the right alignment targets to analyze. I think none of those are very likely to actually be deployed as alignment targets for the first real AGIs. I think that Instruction-following AGI is easier and more likely than value aligned AGI [LW · GW], or, roughly equivalently (and better-framed for the agent foundations crowd), that Corrigibility as Singular Target [LW · GW] is far superior to anything else. I think it's so superior that anyone sitting down and thinking about the topic, for instance just before launching something they viscerally believe might actually be able to learn and self-improve, will likely see it the same way.
On top of that logic, the people actually building the stuff would rather have it aligned to their goals than everyone's.
comment by Noosphere89 (sharmake-farah) · 2025-01-15T21:25:35.102Z · LW(p) · GW(p)
The Coherent Extrapolated Volition of a human Individual (CEVI) is a completely different type of thing than the Coherent Extrapolated Volition of Humanity (CEVH). Both are mappings to an entity of the type that can be said to want things. But only CEVI is a mapping from an entity of the type that can be said to want things (the original human). CEVH does not map from such an entity. CEVH only maps to such an entity. A group of billions of human individuals can only be seen as such an entity if one already has a specific way of resolving disagreements amongst individuals that disagree on how to resolve disagreements. Such a disagreement resolution rule is one necessary part of the definition of any CEVH mapping.
I like to state this as the issue that all versions of CEV/group alignment that want to aggregate the values of thousands of people or more require implicitly resolving disagreements in values, which in turn requires value-laden choices, and at that point you are essentially doing value-alignment to what you think is good, and the nominal society is just a society of you.
I basically agree with Seth Herd here, in that instruction following is both the most likely and the best alignment target for purposes of AI safety (at least assuming offense-defense balance issues aren't too severe).