A problem shared by many different alignment targets
post by ThomasCederborg · 2025-01-15T14:22:12.754Z · LW · GW · 2 comments
The first section describes problems with a few different alignment targets. The second section argues that it is useful to view all of them as variations of a single alignment target: building an AI that does what a Group wants that AI to do. The post then goes on to argue that all of the individual problems described in the first section can be viewed as special cases of an underlying problem. A problem that is inherent in the core concept of a Group AI.
The implication of the first section is that the specific alignment targets analysed there should not be implemented. The rest of the post is more ambitious. The following sections argue that there exists an underlying problem that many alignment targets suffer from. The implication of this argument is that no alignment target that suffers from this underlying problem should be implemented. This more ambitious argument is related to the argument about Membrane formalisms that was outlined here [LW · GW].
A previous post [LW · GW] argued that the type of Alignment Target Analysis (ATA) done in the present post needs to happen now. In brief: an alignment target with a hidden problem getting successfully implemented is a serious risk. There might not be a lot of time to do ATA later (for example because shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure [LW · GW]). And risk reduction is feasible, because one can reduce risks without finding any good alignment target.
ATA can be roughly described as asking what would happen if a given Sovereign AI proposal were to be successfully implemented. This means that there are very serious AI risks that no level of ATA progress can reduce (including most extinction risks related to misalignment). But there are other serious AI risks that can be reduced by ATA progress. ATA is complementary with several other risk mitigation strategies. The risk reductions that can be achieved by ATA are not easily replaced by other strategies. See this post [LW · GW] and this comment [LW(p) · GW(p)] for a more detailed description of ATA research.
An analysis of a few alignment targets, and the problems that they suffer from
Besides the discussion of specific alignment targets, this section also tries to draw attention to the fact that intuitions built up by analysing concepts in other contexts, might not work in the AI context.
Condorcet AIs and solid majorities
Let's start by exploring a concept from ordinary politics: the Condorcet Criterion from voting theory. If there exists an option that beats every other option in a one-on-one vote, then that option will be chosen by any voting rule that conforms to the Condorcet Criterion. In Scott Garrabrant's related Lottery Condorcet Criterion [LW · GW], options are lotteries over outcomes instead of deterministic outcomes. Let's write Condorcet AI (CAI) for any powerful AI that (i): decides what to do based on what billions of extrapolated humans want it to do, and that (ii): picks outcomes in a way that satisfies the Condorcet Criterion or the Lottery Condorcet Criterion.
Now consider the case with a 51% solid majority (a majority that agrees about everything). They think that, all else equal, it would be sort of nice if the CAI were to hurt the other 49% as much as possible. In this scenario any CAI would hurt the 49% as much as possible. This follows directly from the definition of the Condorcet Criterion, and Garrabrant's definition of the Lottery Condorcet Criterion [LW · GW] (where the chosen option would be a “lottery” that always results in the outcome preferred by the majority). The majority does not have to feel strongly about this according to any reasonable definition of “feels strongly”. (The fact that all Condorcet AIs share this feature was first mentioned here [LW · GW]).
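To make this concrete, here is a minimal sketch in Python (my own toy illustration, not code from any referenced proposal). The `condorcet_winner` helper, the option names, and the 51/49 ballot counts are all made-up illustrative assumptions; the point is simply that ranked ballots carry no intensity information, so a solid majority's mild preference defeats the minority's desperate one.

```python
def condorcet_winner(options, ballots):
    """Return the option that beats every other option in a pairwise majority
    vote, or None if no such option exists.
    Each ballot is a dict mapping option -> rank (lower rank = more preferred)."""
    for a in options:
        if all(
            sum(b[a] < b[c] for b in ballots) > sum(b[c] < b[a] for b in ballots)
            for c in options
            if c != a
        ):
            return a
    return None


options = ["hurt_the_49_percent", "leave_them_alone"]

# 51% mildly prefer hurting the other 49%; the 49% desperately want to avoid it.
# Ranked ballots carry no information about intensity, so any rule that satisfies
# the Condorcet Criterion must pick the majority's option.
majority = [{"hurt_the_49_percent": 0, "leave_them_alone": 1}] * 51
minority = [{"hurt_the_49_percent": 1, "leave_them_alone": 0}] * 49

print(condorcet_winner(options, majority + minority))  # -> hurt_the_49_percent
```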
This issue is not a problem in the context that the Condorcet Criterion concept was developed for: humans electing other humans that go on to hold constitutionally defined positions in a polity with both formal and informal checks and balances. In ordinary politics there does not exist any entity that can suddenly start oppressing a 49% minority without cost or risk. If a narrow majority barely care about the persecution, they will not vote for a policy that would lead to civil war. (In addition to this difference: elements such as extrapolation and clever AIs also impact issues like coordination. For example by making it more important to take things like solid majorities into account. Especially if it turns out that many individuals intrinsically value having the same values as the majority).
It is possible to bring the AI scenario closer to the ordinary politics context by removing the certainty of victory for the majority (for example by having extrapolated delegates act and negotiate, based on the assumption that votes are settled stochastically). We will return to this idea below.
For now however, this example can illustrate a more general phenomenon related to transferring a concept to the AI context: that intuitions built up by examining how a concept behaves in one context, might fail when the concept is transferred to the AI context. There is nothing unusual about this phenomenon. It's just another example of intuitions about a concept breaking, when that concept is used in novel contexts. But it is a phenomenon that is useful to keep in mind during the rest of this post.
Maximising the sum of individual utility functions
This subsection will analyse a type of AI that maximises the sum of normalised utility functions, where each function is designed to represent the wishes of one human in a group of billions of humans. A thought experiment illustrates a problem with an AI that uses the Variance, MADAM, or MD normalisations discussed in this paper. The paper discusses both the interpersonal and the intertheoretic case (focusing on the intertheoretic case). But the problem described below is only present when the interpersonal version of the Variance, MD, or MADAM proposals is used in an AI context. The fact that the problem only shows up in the AI context is interesting, and will be discussed further below (it illustrates another instance of intuitions breaking when concepts are transferred to the AI context).
Let's focus on the Variance normalisation method and write VarAI for this AI (the three normalisation methods mentioned above all suffer from the same problem, and the thought experiment outlined below will also work for the other two options). Now let's do a short walkthrough of the basic Alignment Target Analysis (ATA) process. Let's start by optimistically assuming that for each person, a utility function is found that perfectly represents what this person wants (making optimistic assumptions allows us to focus on problems that remain, despite these assumptions).
VarAI uses a flawless method to obtain a utility function for each person, that perfectly describes everything that matters to her. VarAI then rescales these functions so that they all have the same variance. And then VarAI maximises the sum. One obvious initial question to ask about any proposed alignment target is: what type of person would gain a disproportionate amount of influence over this AI? (In the CAI case discussed above the answer was: any member of a narrow but solid majority).
When summing utility functions, a person with a very spiky utility function can in general outvote several people that have more even utility functions. So let's check how VarAI would treat a person with the most spiky function possible. Bob assigns max utility to exactly one option. And min utility to every other option. If we look at the definitions in detail, we see that in a group consisting of Bob and billions of normal human individuals, Bob would gain complete control over VarAI. As explained below, Bob would get his exact outcome implemented regardless of what he wants VarAI to do.
If interpersonal utility comparisons are used to define a goal for a clever and powerful AI, then there exists only a single decision situation that matters to Bob: choosing what successor AI to create. And for Bob, the only option without min utility is the option where the exact successor AI that Bob wants is created as soon as possible. And given complete control as soon as possible.
For any ordinary human individual, there exists a large number of potential successor AI designs that are close to the best possible design. And a large number of potential successor AI designs that are close to the worst possible design. Consider for example the set of AI designs such that: (i): each individual human is mapped to a clever AI delegate representing her interests, (ii): a different mapping is used for each individual, (iii): each delegate gets between 0 and 10^20 votes, and (iv): votes are aggregated in some reasonable sounding way.
For any normal human individual, the set of potential successor AI designs that satisfy (i)-(iv) includes a large number of close-to-best designs. And a large number of close-to-worst designs (and this set is of course just a tiny subset of possible successor AI designs).
The same conclusion is reached if one instead considers all the different ways in which a successor AI might reorganise the world. For any normal human individual there will be a large number of close-to-best worlds. And a large number of close-to-worst worlds.
For example: if Steve's best option is a world where large numbers of individuals lead pleasant lives, then there would be a large number of ways to slightly modify the circumstances and the people involved such that the outcome would remain close-to-optimal. If Steve's worst possible outcome is a world where a large number of people suffer, then there would also exist a large number of close-to-worst outcomes, corresponding to slight variations to people and circumstances. This remains true even if Steve only cares about biological (non-uploaded) humans that have descended from current humans. It also remains true if all of these people are instead simulated copies of Steve with slight variations.
So, regardless of how one counts successor AI options, any normal human would have a large number of close-to-best options. And a large number of close-to-worst options.
But for Bob there exists only one option that does not have min utility. After rescaling, the utility of Bob's one good option would be a large number. Bob would easily outvote billions of normal human individuals, regardless of what he would like the AI to do. In other words: if Bob wants to implement an outcome, and this outcome is seen as literally the worst possible outcome by billions of ordinary human individuals, then Bob wins the vote by a large margin. Bob also wins if an AI is using the MADAM or the MD normalisation methods.
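A toy numerical sketch of this effect (my own construction; the uniform-random utilities, the `normalise` helper, and the specific option and voter counts are illustrative assumptions, not anything taken from the normalisation paper). After variance normalisation, Bob's margin between his single good option and everything else grows like the square root of the number of options, while an ordinary person's margin stays roughly constant:

```python
import numpy as np

rng = np.random.default_rng(0)

n_options = 1_000_000   # stand-in for the vastly larger space of possible successor-AI designs
n_normal_people = 100   # kept small so the toy example runs quickly; see the final comment

def normalise(u):
    """Rescale a utility function to zero mean and unit variance over the options."""
    return (u - u.mean()) / u.std()

# Bob: max utility on exactly one option, min utility on every other option.
bob = np.zeros(n_options)
bob[0] = 1.0
nb = normalise(bob)
bob_margin = nb[0] - nb[1]      # roughly sqrt(n_options), i.e. ~1000 here

# An ordinary person: option 0 (Bob's pick) is literally her worst outcome,
# option 1 is her favourite, everything else is spread out in between.
person = rng.random(n_options)
person[0] = 0.0
person[1] = 1.0
pn = normalise(person)
person_margin = pn[1] - pn[0]   # roughly 3.5, independent of n_options

print(f"Bob's normalised margin:            {bob_margin:,.0f}")
print(f"Everyone else's combined margin:    {person_margin * n_normal_people:,.0f}")

# Even if all n_normal_people rank Bob's pick as their literal worst outcome and
# agree on a single favourite, their combined margin loses to Bob's single spike.
# Bob's margin scales as sqrt(n_options) while the opposition only scales linearly
# with the number of people, and the space of possible successor AIs is
# astronomically larger than any human population.
```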
It is worth noting that this problem is not present in the intertheoretic context. In the intertheoretic context a single human individual is using a mapping from a weighted set of moral theories and a situation, to a subset of actions that are already being considered by that human (this subset is then given the label: Morally Permissible). For a start: the number of actions is very different.
In the AI case, there is also a risk that Bob will ask the AI to hurt people (it seems difficult to predict what such an unusual mind would ask an AI to do, but this is certainly one possibility). This is not really an issue with moral theories. It would not be surprising to learn that a given human will ask an AI to hurt other humans (especially not a highly unusual type of pathological mind). It would however be very surprising to learn that a human puts credence in a moral theory that wants an AI to hurt another moral theory that this human also puts credence in. There is nothing incoherent about moral theories that refer to each other, and that want to hurt each other. But in practice this seems to be a lot more common with humans (this issue will be discussed further at the end of the next subsection).
Returning to the idea of removing the certainty of victory from a CAI
One feature of a CAI from the introductory thought experiment was that a narrow majority was certain of complete victory. Therefore they had no reason to negotiate or compromise, even if a large minority felt very strongly about something (including in the extreme case where a 49% minority would strongly prefer to avoid a scenario where a clever and powerful AI is hurting them as much as possible). A majority that barely cares would abandon any plans of oppressing such a minority, if trying to achieve such an outcome came with some form of non-negligible risk or cost.
Consider a modification to a CAI proposal where extrapolated delegates of individuals negotiate and vote under the assumption that votes are settled stochastically. This removes the issue described in the CAI thought experiment, because the majority in the CAI thought experiment would back down instantly. Even large majorities would be forced to negotiate with small minorities to some degree. If the concerns of a small minority are completely ignored, then they will vote for whatever outcome they prefer (without taking any of the considerations of the majority into account). In order to avoid the possibility of their concerns being fully ignored in the eventual outcome, the majority would need to negotiate. The amount of concessions a minority would be able to extract increases with the size of the minority.
If this sounds familiar, it might be because it is a description of the most recently published version of the proposal to build an AI that implements the Coherent Extrapolated Volition of Humanity: Parliamentarian CEV (PCEV). This proposal also has problems. Let's optimistically assume that each human is represented in negotiations by a delegate that perfectly represents her interests (this will allow us to focus on a problem that remains despite this assumption).
It turns out that PCEV gives extra power to anyone that wants PCEV to harm other humans. Consider Bill who believes that every other human deserves to suffer. Since delegates negotiate under the assumption that votes are settled stochastically, Bill voting for an outcome where PCEV makes everyone suffer would be seen as bad in expectation. How bad it is in expectation would depend on the seriousness of the harm that Bill wants PCEV to inflict. The only way to convince Bill to not vote for an outcome where PCEV makes everyone suffer, is to give Bill something. If Bill wants PCEV to inflict a minor form of harm on everyone, then he will only be able to ask for something minor. If Bill wants PCEV to inflict a more serious form of harm, then he will be able to ask for something larger. In other words: the more harm Bill wants to inflict, the more power he will get over the actions of PCEV.
In yet other words: people that want to inflict harm on others get a negotiation advantage in PCEV. People that want to inflict serious harm get a bigger advantage than people that want to inflict less serious harm. The largest possible advantage goes to any group that wants PCEV to hurt everyone else as much as possible. (Yudkowsky's description of this feature of PCEV can be found by searching the CEV arbital page for: ADDED 2023).
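A toy model of this dynamic (my own simplification, not the actual PCEV parliament mechanics; the `concession_budget` helper, the utility numbers, and the assumption that a non-Bill voter otherwise gets her favourite outcome are all illustrative). Under a random-dictator baseline, each delegate's fallback is the lottery in which every voter, including Bill, gets a small chance of dictating the outcome. The worse Bill's favoured outcome is for everyone else, the worse that fallback lottery is for them, and the more they are willing to concede to keep it off the table:

```python
def concession_budget(n_voters, harm_bill_wants):
    """How much total utility the other voters would rationally give up to avoid
    the random-dictator lottery, in a toy setting where each non-Bill voter gets
    utility 1 from her own favourite outcome and -harm_bill_wants from Bill's.

    Baseline for a non-Bill voter: with probability 1/n_voters Bill dictates,
    otherwise (in this simplified model) she gets her favourite outcome."""
    p_bill = 1 / n_voters
    baseline = (1 - p_bill) * 1 + p_bill * (-harm_bill_wants)
    per_person_gap = 1 - baseline   # what she loses in expectation at the baseline
    return per_person_gap * (n_voters - 1)

n = 1_000_000_000   # a billion voters; Bill is one of them
for harm in (0, 1, 1_000, 1_000_000_000):
    print(f"harm Bill wants to inflict: {harm:>13,}   "
          f"total concessions available to Bill: {concession_budget(n, harm):,.2f}")
```

In this toy model Bill's leverage grows linearly with the amount of harm he credibly wants PCEV to inflict, which is the negotiation advantage described above.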
The fact that PCEV would be dominated by whatever group of people wants to inflict the most serious harm on others, means that a successfully implemented PCEV would be very dangerous. A powerful and clever AI dominated by such people would in fact lead to a very bad outcome [LW · GW].
PCEV is another example where intuitions built up in one context do not survive the transfer to the AI context. Nick Bostrom proposed to use a Moral Parliament to construct a mapping from a set of weighted ethical theories and a situation, to a set of actions. Bostrom suggested that an individual could use this mapping to obtain a set of actions. And Bostrom further suggested these actions could then be given the label: morally permissible. Yudkowsky proposed to use that same Parliament procedure as a component, when constructing a mapping from billions of humans to an AI. In other words: Yudkowsky proposed PCEV. These two mappings are very different types of things.
In both cases the idea is to resolve disagreements (in the original proposal, between moral theories; in PCEV, between human individuals). Moral theories can disagree on how disagreements should be resolved. And human individuals can also disagree on how disagreements should be resolved. But moral theories and human individuals are very different types of things (at least if a normal human individual puts credence in all moral theories that are being considered). For example: if a given human individual puts credence in some set of theories, then it would be very surprising to discover that one of those theories wants to hurt the other theories. But for a large number of human individuals this is guaranteed to happen.
In other words: technically the problem does exist in the intertheoretic / moral uncertainty context. There is in principle nothing stopping a moral theory from referring to some other moral theory, and demanding that this other theory must be hurt as much as possible. And such a moral theory would in fact dominate Bostrom's intertheoretic aggregation method. A mind that puts credence in such a set of moral theories is not an impossible mind. But this is not a set of moral theories that one would expect to find in a human that is trying to resolve moral uncertainty. The issue is with intuitions. In the moral uncertainty context, the intuition that it is safe to ignore this type of hostility is just far more reasonable. More generally: these are just very different types of mappings (the difference between various types of mappings will be discussed further below).
Ways of removing the problematic negotiation advantage from PCEV
One proposed way of patching PCEV is to keep the random dictator negotiation baseline used in PCEV but restrict the options that delegates are able to vote for to AI actions that are Pareto Improvements [LW · GW]. No group of people would be allowed to vote for an outcome where the AI hurts everyone else as much as possible. Because being hurt as much as possible by a powerful and clever AI is not an improvement for the people being hurt. This removes the problematic negotiation advantage issues that PCEV suffers from. If this sounds like a coherent proposal, then it is another example of intuitions breaking during the transfer to the AI context.
For a large group of humans, any conceivable action by a clever and powerful AI (including inaction) will make things worse for someone. Thus, in the AI context, no possible action will ever be a Pareto Improvement [LW(p) · GW(p)]. The proposal in question [LW · GW] is thus based on a definition of acceptable actions, according to which all conceivable actions are unacceptable. Due to the fact that the set of actions that delegates are supposed to vote on is always empty, this is not a coherent proposal. (All other proposals discussed in the present post are variations of the idea of building an AI that does what a group wants the AI to do. But the Pareto-AI idea is not a coherent proposal. Which in turn means that it is not a Group AI proposal).
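A small probabilistic sketch of why this is so (my own toy model, and a weaker probabilistic stand-in for the deterministic claim above; the 99% per-person agreement rate and the population and action counts are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

n_people = 10_000     # stand-in for billions of humans
n_actions = 1_000     # candidate AI actions (including variants of "do nothing")
p_better_off = 0.99   # chance that a given person judges a given action an improvement

# An action is a Pareto Improvement only if *every single person* judges it an
# improvement. Even with 99% agreement per person, that essentially never happens.
judgements = rng.random((n_actions, n_people)) < p_better_off
pareto_improvements = judgements.all(axis=1).sum()

print(f"Pareto improvements found: {pareto_improvements} out of {n_actions}")
print(f"Probability a given action helps all {n_people} people: {p_better_off ** n_people:.2e}")
```

With billions of people and any realistic rate of disagreement, every conceivable action fails this all-of-humanity test, so the restricted set of votable actions is empty.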
This is another case where intuitions that were built up about a concept in one context fail in the AI context. (Since this last claim is a claim about intuitions, it might be false for some readers. Some readers might for example find it intuitively obvious that no possible action can ever be a Pareto Improvement for a large group of humans. For such readers the proposal described above was presumably instantly recognised as incoherent. And for such readers, intuitions about the Pareto Improvement concept did not fail when it was transferred to the AI context. The same is true for all claims made about intuitions throughout the present post, including in places where this caveat is not explicitly spelled out).
Let's look at another way of patching PCEV. Even a tiny group of fanatics can completely dominate PCEV if they are hateful enough and determined enough [LW · GW]. One way of dealing with this situation would be to remove the stochastic negotiation baseline from PCEV. This is a coherent proposal. And it would prevent such a tiny group from dominating the outcome. But this just brings us back to the issue with solid majorities, illustrated by the initial CAI thought experiment. This non-stochastic-PCEV would allow literally any solid majority (no matter how narrow) to do literally anything they want to everyone else. In other words: all proposals along the lines of non-stochastic-PCEV suffer from a serious problem, for the reasons illustrated by the CAI thought experiment at the beginning of the present section.
We have now come full circle, and it is time to switch to a more ambitious argument. The next section argues that these are all just slight variations of a single proposal: to build an AI that does what a group would like this AI to do. It will be further argued that all problems mentioned so far are due to an underlying problem. A problem that is inherent in the core concept of this Group AI alignment target proposal. This underlying problem (that the rest of the post will be focused on) can be roughly described as: the core concept of a Group AI is incompatible with giving each individual meaningful influence over the way in which the Group AI adopts preferences that refer to her.
Describing these proposals as slight variations of a single proposal: to build an AI that is describable as doing what a group wants
Let's take the perspective of Steve: an ordinary human individual that will not be given any special influence over an AI project. This AI project seeks to implement one of the proposals mentioned above (where the AI in question will get its goal from a group of billions of humans, that include Steve). The rest of this post will argue that from his perspective, the AIs mentioned above all look like slight variations of the same proposal. And further argue that this proposal is bad for Steve. (We will not make a lot of assumptions about Steve. So this conclusion should be relevant for many different readers).
A very rough and informal description of the argument that will be outlined in the rest of the post: A clever AI would be very good at coming up with clever ways of making sure that Steve will have the type of outcome that this AI wants for Steve. Thus, the central question for Steve would be which Steve-referring preferences this AI will have. If Steve has no meaningful influence over the adoption of these preferences, then it would be rational for Steve to strongly oppose the AI project in question. One could give each person such influence. But none of the proposals discussed so far does this. And doing so is incompatible with the core concept of a Group AI (as will be shown in the next subsection).
This argument (that the Group AI alignment target proposal is bad for Steve) will assume that Steve is vulnerable to a clever AI that wants to hurt Steve. But this does not correspond to any dramatic assumptions about Steve. In particular: it does not imply an assumption that Steve is selfish. An AI that wants to hurt Steve can find clever ways of targeting whatever it is that Steve happens to care about. And a clever AI would be good at finding clever ways of getting around whatever constraints it is operating under (such as also wanting specific things for billions of other people).
A brief summary of the argument of the present section: a clever AI would be good at getting around various constraints. So what really matters for Steve is what this AI wants for Steve. But if the design is a Group AI, then the designers have made the decision to implement a design where Steve will be denied any meaningful influence over what this clever AI will want for Steve. This feature seems far more relevant than any of the details that distinguish the proposals mentioned so far. So, from Steve's perspective, it would be useful to view them all as slight variations of the same proposal. In other words: from Steve's perspective the real problem with all of the proposals in the first section is this lack of influence over the adoption of Steve-referring preferences. Not the details of the various mappings that define the Group entity that will determine Steve's fate.
For all proposals mentioned in the first section, (i): the AI will adopt preferences that refer to Steve, (ii): the AI will give Steve a negligible amount of influence regarding which Steve-preferences the AI will adopt, (iii): a Group entity is defined as the output of a mapping that maps sets of billions of human individuals to entities of the type that can be said to want things, (iv): the AI will adopt all of its Steve-preferences from the result of this mapping, when the mapping is pointed at the current human population (a population that includes many different types of unusual minds, for example along the lines of Bob the determined from the VarAI thought experiment; and a population where it is common to intrinsically value hurting other humans, for example along the lines of thinking that heretics deserve eternal torture in hell), (v): the AI will have a very large ability to hurt Steve, and finally (vi): there are absolutely no restrictions on what this AI can do to Steve.
The set of alignment targets that have all of these properties is a tiny fraction of AI design space. It is the result of several, very specific, choices. And when viewed from the perspective of Steve (who has no special influence over any AI project) this is a very surprising set of choices (if Steve assumes that the designers care about his well being). In other words: if we are trying to build an AI that would be good for Steve, then this sounds like a very counterintuitive set of choices. These proposals are all members of a very small, highly specific, and very counterintuitive, class of proposals. From Steve's perspective, it is difficult to see why the details would matter. If one actually thinks about what (i)-(vi) means for Steve (who has no special influence over a given AI project), then it should not be surprising to discover that this proposed AI is very bad for Steve in expectation (for reasons that are not related to any specific detail of any specific variant of this alignment target proposal).
Let's denote any AI that is describable as doing what a group wants that AI to do as a Group AI, or a GAI. All proposals covered so far are GAI proposals. One way of summarising the point made in the present section would be: the problem is not in the details of any given GAI proposal. The problem is that there is simply no reason to expect some arbitrarily defined Group entity to be good for individuals. There is simply no reason to think that whatever G entity is implied by an arbitrarily chosen set of definitions would have Steve-referring preferences that are safe for Steve. So, if Steve has no meaningful influence regarding the adoption of those preferences that refer to Steve, then Steve has no reason to think that a GAI will want to help Steve, as opposed to wanting to hurt Steve.
All individuals could be given such influence. But, as will be shown in the next subsection, doing so is incompatible with the core concept of building an AI that is describable as doing what a group wants the AI to do. In other words: giving individuals such influence means that the proposal is not a GAI.
Let's introduce some notation for an alignment target feature. Iff an alignment target gives each person meaningful influence regarding the adoption of those preferences that refer to her, then we say that this alignment target has the Self Preference Adoption Decision Influence (SPADI) feature. In some cases it will be unclear if a given proposal should be described as having the SPADI feature. And in some cases it will be unclear if a given proposal should be described as a Group AI (GAI). While plenty of unclear cases exist, there also exist plenty of clear cases. It will be shown below that if an alignment target is possible to describe as a GAI, then it is not possible to describe it as having the SPADI feature. (The SPADI feature was previously discussed here [LW(p) · GW(p)], and is related to the EM feature discussed here [LW · GW]).
A trivial scenario that illustrates why no GAI design is describable as having the SPADI feature
This subsection outlines a scenario, illustrating why the SPADI feature is incompatible with the Group AI alignment target proposal. The scenario involves a decision. Every GAI will choose one option. No AI with the SPADI feature will choose this option. This is sufficient for proving the non-existence of a GAI design with the SPADI feature. (Some readers might find this scenario to be unsatisfactory as an intuitive explanation of the source of this incompatibility. Therefore the scenario is then followed by a longer discussion of the underlying reasons for the incompatibility.)
Consider an AI that is getting its goal entirely from a group of billions of individuals that include Dave, who cracks his knuckles. Dave wants to continue being the type of person who likes to crack his knuckles. Everyone else would like Dave to stop being the type of person who likes to crack his knuckles. The AI is able to modify Dave in a way that has no relevant side effects (no one would, for example, see the intervention as an impermissible violation of Dave's rights). Dave's knuckle cracking is the only source of disagreement (they all, for example, agree on how much resources each individual should get). They also all agree that everyone feels equally strongly about Dave's knuckle cracking. Everyone agrees that the will of the group is to put a stop to the knuckle cracking. None of them have any relevant meta preferences. For example: no one has a meta preference about individual autonomy. No one has a meta preference about doing what the group wants. They only disagree on this one issue: whether or not Dave should continue to be the type of person that likes to crack his knuckles.
If a proposed AI design implies an AI that would leave Dave alone when faced with this group of people, then it is simply not describable as a GAI proposal. Such an AI does not do what this Group wants it to do, by any reasonable set of definitions. There exist cases where it is unclear whether or not a given design is describable as a GAI proposal. But this case is clear. No AI that leaves Dave alone is describable as a GAI. All GAI designs choose the option of modifying Dave.
If a proposal implies an AI that would turn Dave into someone that no longer likes to crack his knuckles, then it is simply not possible to describe this proposal as having the SPADI feature. Dave does not want to be modified. Modifying Dave is not instrumental in any way (in this scenario, such a modification would not be done as a way of achieving any type of second order effect). So if Dave is modified, it is because the AI adopted preferences that refer to Dave, in a way that Dave had no control over. There exist cases where it is unclear whether or not a given design is describable as having the SPADI feature. But this case is clear. No AI that modifies Dave is describable as having the SPADI feature.
In other words: all designs that are describable as a GAI choose one option (they all modify Dave). And no design that is describable as having the SPADI feature chooses this option (they all leave Dave alone). Therefore there exists no AI that is describable as a GAI with the SPADI feature. In yet other words: the SPADI feature is incompatible with the core concept of building a GAI. While this is sufficient for establishing the incompatibility, this incompatibility will still be discussed further in the rest of this section, as well as in the following section (in a way that is hopefully more intuitive). It will then be argued that the SPADI feature is a necessary feature of an alignment target (similar to how the EM feature is a necessary feature of a Membrane formalism [LW · GW]). (The argument against building a GAI is based on establishing that the SPADI feature is both necessary, and also incompatible with building a GAI).
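For readers who prefer the argument compressed into symbols (my own notation; $D$ stands for the knuckle-cracking scenario, and $A(D)$ for the option that a design $A$ picks in it):

$$\forall A:\ \text{GAI}(A) \Rightarrow A(D) = \text{modify}, \qquad \forall A:\ \text{SPADI}(A) \Rightarrow A(D) = \text{leave}$$

$$\text{modify} \neq \text{leave} \;\Longrightarrow\; \neg\,\exists A:\ \text{GAI}(A) \wedge \text{SPADI}(A)$$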
There are of course scenarios where a given AI with the SPADI feature will act in the same way as a given GAI. But the same can be said about the set of AIs that are describable as Cheesecake Maximising AIs (CMAI). In a scenario where all individuals care monomaniacally about maximising Cheesecake, we would expect an AI to be maximising Cheesecake, regardless of whether it is a GAI, a CMAI, or an AI with the SPADI feature that gets its goal entirely from these people. This does not imply the existence of an AI design that is a member of more than one of these three sets. No CMAI can be described as a GAI, or as having the SPADI feature (even though a given CMAI will sometimes respond to a given situation in the same way as a given GAI, or a given AI with the SPADI feature).
Some readers might find the above to be an unsatisfying explanation of this incompatibility. So the next subsection will attempt to discuss the underlying cause of the incompatibility. In other words: some readers might have intuitions regarding an AI that does what a group wants. These intuitions might object to this incompatibility. One type of implicit assumption that would explain such an intuition would be some version of the following: "If an AI is supposed to do what a group of humans wants it to do, and this AI ends up hurting individual humans, then surely there exists someone, out there, that will object to this. Some Group entity, with a separate existence, that will object to any action that is so obviously wrong". The reason that such an implicit assumption leads to incorrect intuitions is that no such entity exists.
Groups and Individuals really are completely different types of things
Let's introduce an analogy with human individuals and cells, as a way to counter any implicit assumptions along the lines that doing what is good for a group must be good for individuals. It would not be particularly surprising to discover that some reasonable way of defining what it means for an AI to do what Gregg wants an AI to do, will lead to every one of Gregg's cells dying (for example because Gregg wants to be uploaded and would prefer that his cells are not left alive after uploading). This outcome is not an indication of any form of problem with the extrapolation method.
In other words: when evaluating an extrapolation method, the death of every one of Gregg's cells is not a red flag. Gregg and one of Gregg's cells are just two completely different types of things. The interests of Gregg and one of Gregg's cells might of course coincide in many everyday situations (wearing a seatbelt might be good for both). But there is nothing surprising about discovering that doing what one type of thing wants, is bad for a completely different type of thing.
Now consider Bruce, who suffers from a confusion that is similar to the implicit assumption of an independently existing Group entity. Bruce makes the same basic type of mistake, but in terms of individuals and cells. Specifically: Bruce believes that if a given way of extrapolating Gregg leads to all of Gregg's cells dying, then there must be some form of problem with the extrapolation method. Bruce assumes that Gregg's cells would be fine if a better extrapolation method could be found. Bruce makes the implicit assumption that a “collection of cells entity” exists, and that this entity shares Bruce's definition of what is good for cells. The problem is again that no such entity exists. One can define any number of mappings from sets of cells to the type of entity that can be said to want things. When pointed at Gregg, these mappings all result in completely different such entities. They will all want completely different things.
One can try to construct an AI that does what an individual wants it to do (based on the implicit assumption that this will, by definition, be good for cells if done correctly). One can then point it at Gregg (and interpret any implied outcome where Gregg's cells die as indicating a bad solution to a well defined technical problem). Alternatively, one can instead try to construct an AI that ensures the survival of cells. These are two very different design problems. Trying to solve these two very different problems will result in very different types of proposals.
Similarly: groups and individuals are also completely different types of things. Trying to construct an AI that does what a Group entity wants the AI to do is one design problem. Trying to construct an AI that will be safe for individuals is a very different design problem. If one is trying to solve this latter design problem, then it becomes natural to ask how one might prevent the scenario where an AI wants to hurt an individual. And if one is trying to figure out how to prevent this scenario, then it would be natural to focus on the part of the process where the AI adopts preferences that refer to this individual. But if one is instead trying to construct an AI that implements the will of some independently existing Group entity, then the adoption of self-referring-preferences is a detail. Focusing on this particular detail is just a distraction from the real goal: trying to describe the will of this free floating G entity. Implementing a special rule to deal with this detail would subvert the will of the G entity.
If Allan implicitly assumes the separate existence of a Group entity, then Allan might intuitively view the choice of GAI definitions as an attempt to find the correct way of representing the wishes of this G entity. Allan might falsely think that definitions can be arrived at by solving a technical problem (the technical problem of describing the will of this separately existing G entity). This matters when Allan forms intuitions regarding various proposals. Especially proposals to delegate the choice of those definitions that will imply a specific G entity (for example proposals to delegate this choice to an extrapolated human or a Corrigible AI). It matters because Allan implicitly assumes that the task that has been delegated, can in some sense be viewed as solving a technical problem.
Talking to Allan about the GAI alignment target proposal, might be a bit like talking to Bruce about the proposal to build an AI that does what Gregg wants the AI to do. One problem with both perspectives is the implicit assumption of a non existing entity. Another problem is the failure to understand that it would not be at all surprising to discover that doing what one type of thing wants, is very bad for a completely different type of thing.
Unless they get rid of their implicit assumptions about non-existing entities, the intuitions of Bruce and Allan will strongly reject any reasonable description of the situations they face. This confusion in turn means that Bruce will not design an AI that protects cells. And Allan will not design an AI that protects individuals. Allan will continue to propose AI designs that are bad for individuals. Because Allan is not trying to design an AI that is good for individuals. Allan is trying to design an AI that implements the will of a free floating G entity (that does not exist). Allan is not actually failing to find an AI design that is safe for individuals. Because that is not what Allan is actually trying to find.
To Bruce one might try saying: A human individual is a completely different type of thing than the cells that make up an individual. There is no particular reason to expect that doing what a given individual would want, will be good for the cells in that individual.
To Allan one might try saying: You arbitrarily define an abstract entity, called a "group", and then you propose to do what this entity would like done. Since this entity is a completely different type of thing than the individuals that make up a group, there is no particular reason to expect that doing what a given group would want, will be good for the individuals in that group.
The next section will focus on a more general issue: The fact that concepts often have counterintuitive implications when they are used in a new context. In particular: one should expect intuitions built up while thinking about extrapolating the will of an individual to fail completely for proposals to extrapolate the will of a group. As will be outlined in the next section: the main issue is that extrapolating the will of an individual, and extrapolating the will of a group, are simply two completely different types of mappings. They do map to similar entities. But they map from completely different types of things.
Two very different types of mappings and a problematic implicit assumption
The Coherent Extrapolated Volition of a human Individual (CEVI) is a completely different type of thing, than the Coherent Extrapolated Volition of Humanity (CEVH). Both are mappings to an entity of the type that can be said to want things. But only CEVI is a mapping from an entity of the type that can be said to want things (the original human). CEVH does not map from such an entity. CEVH only maps to such an entity. A group of billions of human individuals can only be seen as such an entity, if one already has a specific way of resolving disagreements, amongst individuals that disagree on how to resolve disagreements. Such a disagreement resolution rule is one necessary part of the definition of any CEVH mapping.
In the CEVI case, there already exists someone that can coherently object to the resulting mind. Thus, a claim that some specific CEVI mapping is bad, can sometimes be described as referring to something coherent and real. However, let's explore what it would mean to claim that a CEVH mapping is bad. What would it mean to claim that Humanity would object to some version of CEVH? Basically, it would correspond to an implicit assumption that some Group entity already exists. A G entity that exists completely independently of any mapping. A G entity of the type that can be said to want things. A pre-existing, free floating, G entity of the type that can coherently be said to object to a bad way of describing what it wants.
There exists a large number of reasonable sounding ways of mapping billions of humans to such a Group entity. These mappings all result in completely different G entities. They all want completely different things. Some G entities are dominated by Bob the determined from the VarAI thought experiment. Other G entities are dominated by any solid majority, no matter how narrow. Still other G entities are dominated by a tiny group of individuals that intrinsically value hurting other individuals [LW · GW]. Etc, etc, etc, etc.
In other words: for a given group of billions of humans, there exists a very large number of such G entities. They all want completely different things. There is no objectively correct G entity out there, with a separate existence, that can coherently object to its interests not being represented properly. Because before one has specified some specific way of resolving disagreements, amongst individuals that disagree on how to resolve disagreements, billions of human individuals simply cannot speak with one voice. The only mind that could object to a given mapping, is the mind that is created by that very mapping. This is a very different scenario, compared to the case when one takes the CEV of an Individual (CEVI). It would thus be a remarkable coincidence, if intuitions built up while thinking about extrapolation in the CEVI context, would turn out to be appropriate in the CEVH context.
Another way to see this difference is to examine the proposal to pick a mapping that does nothing. For CEVI this results in an entity of the type that can be said to want things (the original human). For CEVH however, this proposal is a nonsense proposal. Because it does not result in an entity of the type that can be said to want things. So, for a CEVI mapping, there exists a natural place to start. A natural mapping that one can take as the start of an iterative design process. For a CEVH mapping however, the idea of a natural place to start is nonsense. This is an important difference between the CEVI mapping and the CEVH mapping.
The intuition that there exists a natural place to start modifying a CEVH mapping from, corresponds to an intuition that there exists some objectively correct G entity, out there, with a separate existence, that can object to a bad way of representing what it wants the AI to do. (There exist many places where one can start with a CEVH mapping. Each place to start corresponds to a different way of resolving disagreements. And thus implies a different initial G entity. These initial G entities all want completely different things. Some are completely dominated by any solid majority, no matter how narrow. Others are dominated by a tiny minority that intrinsically value hurting others. Still others are dominated by Bob the determined from the VarAI thought experiment. Etc, etc, etc, etc).
In other words: there is no reason to believe that intuitions built up while thinking about CEVI mappings would be appropriate for CEVH mappings. In yet other words: it would be a remarkable coincidence if intuitions developed while thinking about extrapolating individuals, would turn out to be appropriate when thinking about the proposal to build an AI that implements the Coherent Extrapolated Volition of Humanity.
As with most things, one can coherently make claims along the lines of: “this CEVH mapping is bad for me”, or “this CEVH mapping is bad for Jeff”. That is however a very different type of claim, compared to claims along the lines of: “this CEVH mapping is a bad way of representing the will of the free floating G entity, that exists independently from any choice of definitions”. We will return to this topic in the next subsection where we will explore three related proposals, all based on the idea of evaluating a given CEVH mapping relative to some specific person: Corrigibility, a last judge, and the idea of letting an extrapolated person pick a set of definitions.
But let’s first try to reason about this situation, without the intuition that there exists some Group entity, “out there”, that can coherently object to “bad CEVH mappings”. After abandoning this intuition, it becomes entirely unsurprising to learn that a given Group AI will want to hurt individuals. Why would it not? And it becomes natural to think about how to protect individuals from a powerful AI. So, how does one protect an individual (that gets no special treatment) from an AI that gets its goal from billions of humans?
It is difficult to protect someone once a powerful and clever AI has already decided that it wants to hurt that individual. So the most natural way to protect an individual is to look at the point at which preferences that refer to her are adopted by an AI. Unless specifically given such influence, no individual in a large group will have any meaningful influence over this decision. And without such influence there is no reason to think that the resulting AI will want to help, as opposed to want to hurt (if we remember to disregard all intuitions built on top of implicit assumptions about an independently existing G entity that will come to the rescue). Therefore, such influence is a necessary feature of a safe alignment target. (In other words: while the SPADI feature is not sufficient for safety, it is necessary. The role that a feature like this can play in a design process was previously discussed in the context of Membrane formalisms [LW · GW]).
Every individual can be given such influence. Doing so means that the resulting AI is no longer describable as a Group AI. Because no GAI is describable as having the SPADI feature (as was illustrated above by Dave and his knuckle cracking). Therefore, the core concept of the Group AI alignment target proposal is incompatible with the necessary SPADI feature.
If one remembers to disregard all intuitions that rely on the implicit assumption of a separately existing G entity that must be catered to, then abandoning the idea of a Group AI should not be a dramatic suggestion. Without this implicit assumption about a free floating G entity, the fact that the SPADI feature is incompatible with building a GAI, is no longer a negative aspect of the SPADI feature. The adoption of self-referring preferences is no longer a detail. Defining a special rule for this step is no longer subverting the will of any free floating G entity that must be catered to. Giving individuals influence over the adoption of self-referring preferences is no longer interfering with an attempt to solve the technical problem of describing the will of the independently existing G entity.
In other words: insisting on the SPADI feature is not dividing by zero. It is not blasphemy. It is not breaking the rules. It is not going against the will of anything that actually exists. For a given set of billions of humans, there exists a large number of possible G entities, implied by different implementable CEVH mappings (such as PCEV, VarAI, non-stochastic-PCEV, etc, etc, etc). These entities all want completely different things. Every conceivable action, including inaction, thus goes against the wishes of a large number of such entities (including every conceivable proposed AI project. And the proposal to not build any AI at all). There is thus no possible action (including inaction) that anyone can ever take, that would avoid going against the wishes of a large number of such entities. Insisting on the SPADI feature is thus not special in this regard.
When one insists on the SPADI feature, what one is actually doing is protecting individuals. And this is a set of minds that actually does exist, in a way that is separate from any specific proposed mapping. This is a type of mind that one is causally connected to in a much stronger sense. On the other hand: the connection to any specific G entity goes through some arbitrarily chosen set of CEVH mapping definitions (even if one assumes the successful implementation of the GAI alignment target proposal, then this does not imply any specific G entity).
In yet other words: it would be nonsense to say that a GAI project is trying to describe what some free floating G entity wants an AI to do. Because one step of a GAI project would be to choose a set of definitions for a mapping from billions of individuals to an entity of the type that can be said to want things. Thus, any reasoning that implicitly assumes that such a G entity already exists, is known to be fundamentally confused (it is not flawed in the sense of being false. But flawed in the sense of meaning nothing. Two people debating how best to describe the will of the free floating Group entity, are just as confused as two people debating how big the spherical triangle is).
Three coherent proposals where someone can reject a given CEVH mapping
This subsection will discuss three related proposals. In each case there will be someone that can coherently be said to have an opinion about a given CEVH mapping: (i): Corrigibility, (ii): a last judge, (iii): letting an extrapolated person pick definitions. These three ideas are distinct and will be discussed separately below. But they all share two important features. They can all be used by many different AI projects, for example projects aiming for a GAI, or projects aiming for an AI with the SPADI feature. So none of these ideas offers much help when trying to decide what alignment target an AI project should aim at. And in all three cases, an effort to implement the idea might fail. So none of these ideas removes the risk involved with initiating an AI project aiming at a bad alignment target (because aiming at a bad alignment target remains dangerous, even if efforts are made to implement one of these ideas).
Some versions of these ideas can be seen as proposing general purpose methods for hitting alignment targets (either directly or by allowing the designers to adopt a trial and error approach to AI design). They are basically proposed ways of reducing the probability of extinction from a misaligned AI. Such methods are out of scope of the present text. They could be used to implement PCEV or VarAI. They could also be used to implement some AI with the SPADI feature. In other words: they would not reduce the risk of someone successfully implementing a bad alignment target. And they would not help much when trying to figure out what alignment target to aim at. They are in fact complementary risk mitigation strategies (see this comment [LW(p) · GW(p)] for a longer discussion). The present subsection is instead discussing versions of these three proposals that are supposed to be able to prevent scenarios where someone successfully implements a bad alignment target.
Let's start with the idea to build a Corrigible AI that would allow a human designer to go through some form of trial and error process while defining a given mapping. In this case the person that rejects a CEVH mapping is a human designer that has discussed the mapping with a clever and Corrigible AI. This type of Corrigible AI proposal is importantly different from the proposal to build a Corrigible AI that can only be used for limited tasks, such as shutting down competing AI projects. (The connection between ATA and the idea of using a limited AI to shut down competing AI projects was previously discussed here [LW · GW] and here [LW(p) · GW(p)]). The type of Corrigibility that we are exploring in the present subsection however, is the type that might prevent you from successfully hitting a bad alignment target. In other words: if you start an AI project aiming at a bad alignment target, the result might be a clever AI explaining to you that your chosen alignment target implies an outcome that you would not like.
While an effort to build a Corrigible AI could prevent a bad alignment target from getting successfully implemented, it is also possible that the effort to make an AI Corrigible will fail on some implementation step. Additionally, an effort to make an AI Corrigible might also fail even if everything is successfully implemented [LW · GW]. It is also possible that an AI might be Corrigible when discussing some topics, but not Corrigible when discussing other topics [LW(p) · GW(p)]. To end up with a Corrigible AI, some human has to successfully define what it means for an AI to be Corrigible. If this definition step fails, then there will be no Corrigible AI available to allow a second attempt (even if everything is implemented successfully, the resulting AI will not be Corrigible). In other words: this idea does not remove the risk from an AI project that is aiming at a bad alignment target. Which in turn means that the idea does not remove the need to analyse alignment targets. If one does manage to construct some Corrigibility method that looks like it might work, then this can be used for an AI project aiming for a GAI. It can also be used for an AI project aiming for an alignment target with the SPADI feature. So the idea is not very helpful when deciding what alignment target to aim at.
Let's now turn to the last judge idea. The proposal is to give the extrapolated version of a person a veto over one specific mapping. This can prevent bad outcomes if it works. However, since the implementation of such an add on might fail, it does not remove the risk from an AI project that is aiming at a bad alignment target. Besides implementation issues, the idea also depends on having a good definition of what it means to extrapolate a human. Extrapolation is a tricky concept, and the definitions of an extrapolation dynamic must be chosen by a non extrapolated designer. Even a successfully implemented last judge might be ineffective due to a problem with the extrapolation definition [LW · GW]. Such an add on can be added to an AI project aiming for a GAI. It can also be added to an AI project aiming for an AI design with the SPADI feature. So the idea is not very helpful when deciding what alignment target to aim at.
Delegating the choice of definitions to an extrapolated human
An extrapolated human can be asked to pick the definitions of a GAI. Alternatively, an extrapolated human can be asked to pick the definitions of an AI with the SPADI feature. So this proposal also does not help us much when choosing between a GAI and an AI with the SPADI feature. But let's explore what it would mean to define a GAI in this way. Let's again take the perspective of Steve, who has no special influence over an AI project. An extrapolation method will be aimed at Tim, and the result of the extrapolation will choose definitions for an AI. Let's refer to the outcome of this human-defined extrapolation method as Extrapolated Tim (ET). And let's remember that we really do not know much about what to expect from a superintelligent mind that comes out at the other end of a human-defined extrapolation method (even if we assume successful implementation).
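To keep the moving parts of this proposal separate, the delegation structure can be sketched as follows. This is again a hypothetical illustration, not a design: `extrapolation_method`, `pick_definitions`, and the template argument are all placeholders.

```python
# Hypothetical sketch of delegating definitions to an extrapolated human
# (illustrative only; all names are placeholders).

def delegate_definitions(tim, extrapolation_method, alignment_target_template):
    # Step 1: a human-defined extrapolation method is aimed at Tim. We know
    # very little about the mind that comes out the other end, even if we
    # assume that the method is implemented exactly as defined.
    et = extrapolation_method(tim)

    # Step 2: ET picks the definitions for the chosen template, which could
    # be a GAI or an AI with the SPADI feature.
    definitions = et.pick_definitions(alignment_target_template)

    # Note what is absent: Steve appears nowhere in this pipeline. If the
    # template is a GAI, nothing here gives Steve any influence over the
    # adoption of preferences that refer to Steve.
    return definitions
```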
For this to be a coherent idea, the proposal must be something other than a plan to ask ET to find the correct description of what an independently existing Group entity wants an AI to do (because no such G entity exists). In other words: ET is not being asked to solve a technical problem. If ET had been asked to solve a technical problem, then it might have been enough for the extrapolation method to result in an intelligent mind that is honestly attempting to solve that technical problem. But the plan is either to ask ET to do something other than solve a technical problem, or the plan is hopelessly confused. In particular: if ET picks a set of definitions that implies a mind that wants to hurt Steve, then there does not exist any coherent sense in which ET has failed at a technical task. The task that ET has been given has many valid solutions. Some of these valid solutions are very bad for Steve. PCEV is one such valid solution.
The fact that there exists no coherent sense in which PCEV is a bad solution to the task that ET has been given matters a great deal. It matters when we are trying to estimate what would be required for a given plan to be safe for Steve, and it is very important to keep in mind when evaluating a plan where some specific extrapolation method will be used to pick definitions for a given alignment target. There exists no free-floating G entity that can be counted on to want to help Steve. So if the plan is to ask ET to pick definitions for the GAI alignment target, then the choice of alignment target does not provide any protection for Steve, because this alignment target choice means that Steve will have no influence regarding the adoption of those preferences that refer to Steve.
In other words: the task that ET has been given is not a well-defined technical problem with a solution that can be found by being smart enough. The mapping that ET is defining does not map from an entity of the type that can be said to want things (it only maps to such an entity). Thus, any intuition along the lines of there existing a G entity that can object, in the sense that an individual could object, is simply confused, because the specific decision that has been delegated to ET is the decision of how to construct such an entity. This means that an issue with the extrapolation dynamic cannot be counted on to self-correct. There is no well-defined technical problem to ground things in.
(As a tangent, it is worth noting that if the extrapolation of Tim's values goes off the rails, then there exists no objectively correct solution that might steer an intelligent and well-meaning ET back towards objectively good technical solutions. ET finding a valid solution to the task that ET has been assigned offers no meaningful protection even to Tim, because the chosen alignment target means that no one will have any meaningful influence regarding the adoption of self-preferences. So the fact that the solution found by ET is valid does not imply any form of safety. On the contrary: it is the very fact that the solution is valid that ensures that Tim will have no influence regarding the adoption of those preferences that refer to Tim.)
In yet other words: ET's job is to define a mapping from billions of individuals to an entity of the type that can be said to want things. This is simply not a well-defined technical task that can be solved by being smart and well-intentioned. So what Steve should expect from a successful plan is for ET to pick a very reasonable-sounding and clever CEVH mapping, and then for Steve to be subjected to the will of whatever mind this mapping implies. Steve has no reason to expect this mind to want to help Steve, as opposed to want to hurt Steve, because the one thing that Steve actually does know is that Steve will have no influence regarding the adoption of those preferences that refer to Steve (because if the plan succeeds, then ET will pick the definitions of a GAI).
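One way to see why validity offers no protection here is to compare the shape of a well-posed technical task with the shape of the task that ET has been given. The sketch below is purely illustrative, and every name in it is made up: a well-posed task comes with a correctness check that a wrong answer can fail, while ET's task is to choose one of many type-correct mappings, with no correctness check to consult.

```python
from typing import Callable, List, Protocol


class Agent(Protocol):
    """An entity of the type that can be said to want things."""
    def wants(self, outcome: object) -> float: ...


# A well-posed technical task comes with a correctness criterion:
# a wrong answer can, at least in principle, be detected and rejected.
def solve_technical_task(candidate: object,
                         is_correct: Callable[[object], bool]) -> bool:
    return is_correct(candidate)


# ET's task has a different shape. ET is asked to choose a mapping from
# billions of individuals (a set, which is not itself an Agent) to an Agent.
GroupMapping = Callable[[List[Agent]], Agent]

def ets_task(pick_mapping: Callable[[], GroupMapping],
             individuals: List[Agent]) -> Agent:
    # Any mapping ET picks counts as a "valid" answer; PCEV is one such
    # valid choice. There is no is_correct() here for a bad choice to fail.
    mapping = pick_mapping()
    return mapping(individuals)  # the mind that everyone is then subject to
```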
(Any intuition along the lines that ET could solve the well-defined technical problem of locating the mapping that an idealised version of Tim would pick is either based on nonsense or trivially circular, because ET is, literally, the chosen definition of what it means to be an idealised version of Tim.)
To summarise: this method can be used to pick the definitions of a GAI, or to pick the definitions of an AI with the SPADI feature. Delegating the choice of definitions to the outcome of a human-defined extrapolation procedure also does not remove the risks involved in aiming for a bad alignment target. Picking extrapolation definitions is tricky even if we assume successful implementation. And when ET is asked to pick definitions for a GAI, the delegated task is not the type of task that might steer a badly extrapolated ET back to safe solutions by virtue of ET being good at solving technical tasks. When ET picks definitions for the GAI alignment target proposal, validity does not imply safety. On the contrary: validity then ensures a lack of control over the adoption of self-preferences.
In other words, and more generally: adopting this method does not imply that the AI project in question is aiming at any specific alignment target. It also does not remove the risk from an AI project that is aiming at a bad alignment target, which in turn means that it still matters what alignment target is aimed at.
In yet other words, and even more generally: unless this method is assumed to always result in an AI that is functionally equivalent to an AI that does whatever Tim wants that AI to do, it will continue to matter what alignment target ET is asked to pick definitions for. And it will continue to be rational for Steve to insist on an alignment target such that successful implementation implies safety for Steve.
2 comments
Comments sorted by top scores.
comment by Seth Herd · 2025-01-15T21:00:30.901Z · LW(p) · GW(p)
I very much agree with your top-level claim: analyzing different alignment targets well before we use them is a really good idea.
But I don't think those are the right alignment targets to analyze. I think none of those are very likely to actually be deployed as alignment targets for the first real AGIs. I think that Instruction-following AGI is easier and more likely than value aligned AGI [LW · GW], or, roughly equivalently (and better-framed for the agent foundations crowd), that Corrigibility as Singular Target [LW · GW] is far superior to anything else. I think it's so superior that anyone sitting down and thinking about the topic, for instance just before launching something they viscerally believe might actually be able to learn and self-improve, will likely see it the same way.
On top of that logic, the people actually building the stuff would rather have it aligned to their goals than everyone's.
comment by Noosphere89 (sharmake-farah) · 2025-01-15T21:25:35.102Z · LW(p) · GW(p)
The Coherent Extrapolated Volition of a human Individual (CEVI) is a completely different type of thing than the Coherent Extrapolated Volition of Humanity (CEVH). Both are mappings to an entity of the type that can be said to want things. But only CEVI is a mapping from an entity of the type that can be said to want things (the original human). CEVH does not map from such an entity. CEVH only maps to such an entity. A group of billions of human individuals can only be seen as such an entity if one already has a specific way of resolving disagreements amongst individuals that disagree on how to resolve disagreements. Such a disagreement resolution rule is one necessary part of the definition of any CEVH mapping.
I like to state this as the issue that all versions of CEV/group alignment that want to aggregate the values of thousands of people or more require implicitly resolving disagreements in values, which in turn requires value-laden choices, and at that point you are essentially doing value-alignment to what you think is good, and the nominal society is just a society of you.
I basically agree with Seth Herd here, in that instruction following is both the most likely and the best alignment target for purposes of AI safety (at least assuming offense-defense balance issues aren't too severe).