Posts

Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure 2024-10-03T00:01:34.011Z
The case for more Alignment Target Analysis (ATA) 2024-09-20T01:14:41.411Z
A necessary Membrane formalism feature 2024-09-10T21:33:09.508Z
Corrigibility could make things worse 2024-06-11T00:55:21.596Z
The proposal to add a ``Last Judge'' to an AI, does not remove the urgency, of making progress on the ``what alignment target should be aimed at?'' question. 2023-11-22T18:59:44.389Z
Making progress on the ``what alignment target should be aimed at?'' question, is urgent 2023-10-05T12:55:06.736Z
A problem with the most recently published version of CEV 2023-08-23T18:05:46.658Z

Comments

Comment by ThomasCederborg on The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind · 2024-10-23T05:28:29.951Z

I thought that your Cosmic Block proposal would only block information regarding things going on inside a given Utopia. I did not think that the Cosmic Block would subject every person to forced memory deletion. As far as I can tell, this would mean removing a large portion of all memories (details below). I think that memory deletion on the implied scale would seriously complicate attempts to define an extrapolation dynamic. It also does not seem to me that it would actually patch the security hole illustrated by the thought experiment in my original comment (details below).

The first section argues that (unless Bob's basic moral framework has been dramatically changed by the memory deletion) no level of memory deletion will prevent BPA from wanting to find and hurt Steve. In brief: BPA will still be subject to the same moral imperative to find and hurt any existing heretics (including Steve).

The second section argues that BPA is likely to find Steve. In brief: BPA is a clever AI and the memory deletion is a human constructed barrier (the Advocates are extrapolations of people who have already been subjected to these memory wipes, so Advocates cannot be involved in negotiating the rules governing these memory wipes). BPA would still have access to a lot of different information sources that it can use to find Steve.

The third section argues that if BPA finds Steve, then BPA would be able to hurt Steve. In brief: creating OldSteve is still not prevented by any rule or constraint that you have mentioned so far.

The fourth section argues that the side effects of memory deletion would be severe. In brief: memories of every conversation about any deleted person would also be deleted. Besides all direct memories involving any deleted person, many indirect memories would also be deleted. This would seriously complicate extrapolation. (Extrapolation is already a very tricky definitional problem. And this definitional problem cannot be delegated to extrapolated Advocates, since they are the result of the extrapolation dynamic being defined).

The last section deals with your proposed resource destruction mechanism. In brief: in one thought experiment (that did not involve threats) almost half the population would very strongly prefer to destroy ELYSIUM. This disproved your claim that such a situation would not arise. Also: no thought experiment involved anyone trying to gain resources. And no thought experiment involved anyone issuing any form of threat that would not be followed up on. People burned resources that they did not need. They did this either to hurt people directly, or to issue genuinely credible threats.

 

BPA will still want to find and hurt heretics

It would not be surprising if a comprehensive, human defined, memory deletion operation would completely rewrite someone's basic moral framework. But I'm guessing that your proposed memory deletion is not supposed to be done in a way that changes a person's basic moral framework. So let's reason from the assumption that it does not.

This means that Bob still considers BPA to be bound by a non negotiable moral imperative. So BPA still wants to find and punish any heretic that might exist.

In other words: the Cosmic Block is not the type of mechanism that might prevent BPA from wanting to find and hurt Steve. It is instead a practical obstacle that BPA needs to overcome (which is something very different). It is a human constructed practical barrier that is supposed to protect Steve from a clever AI that wants to find and hurt Steve.

 

BPA will likely be able to find Steve

Unless the details of your proposed Cosmic Block are constructed by an AI that prevents All Bad Things, these rules must come from somewhere else. AI assisted negotiations cannot be done by the Advocates. Advocates are the result of extrapolating memory wiped people (otherwise the whole point of the Cosmic Block is lost). So the Advocates cannot be involved in defining the memory wipe rules.

In other words: unless the memory wipe rules are negotiated by a completely separate set of (previously unmentioned) AIs, the memory wipe rules will be human defined.

This means that a human constructed barrier must hold against a clever AI trying to get around it. Even if we were to know that a human defined barrier has no humanly-findable security holes, this does not mean that it will actually hold against a clever AI. A clever AI can find security holes that are not humanly-findable.

The specific situation that BPA will find itself in does not seem to be described in sufficient detail for it to be possible to outline a specific path along which BPA finds Steve. But from the currently specified rules, we do know that BPA has access to several ways of gathering information about Steve.

People can pool resources (as described in your original proposal). So Advocates can presumably ask other Advocates about potential partners for cohabitation. Consider the case where BPA is negotiating with other Advocates regarding who will be included in a potential shared environment. This decision will presumably involve information about potential candidates. Whether or not a given person is accepted, would presumably depend on detailed personal information.

Advocates can also engage in mutual resource destruction to prevent computations happening within other Utopias. You describe this mechanism as involving negotiations between Advocates, regarding computations happening within other people's Utopias. Such negotiations would primarily be between the Advocates of people that have very different values. This is another potential information source about Steve.

Steve would also have left a lot of effects on the world, besides effects on people's memories. Steve might for example have had a direct impact on what type of person someone else has turned into. Deleting this impact would be even more dramatic than deleting memories.

Steve might also have had a significant impact on various group dynamics (for example: his family, the friend groups that he has been a part of, different sets of coworkers and classmates, online communities, etc). Unless all memories regarding the general group dynamics of every group that Steve has been a part of are deleted, Steve's life would have left behind many visible effects.

The situation is thus that a clever AI is trying to find and hurt Steve. There are many different types of information sources that can be combined in clever ways to find Steve. The rules of all barriers between this AI and Steve are human constructed. Even with perfect enforcement of all barriers, this still sounds like a scenario where BPA will find Steve (for the same reason that a clever AI is likely to find its way out of a human constructed box, or around a human constructed Membrane).

 

There is still nothing protecting Steve from BPA

If BPA locates Steve, then there is nothing preventing BPA from using OldSteve to hurt Steve. What is happening to OldSteve is still not prevented by any currently specified rule. The suffering of OldSteve is entirely caused by internal dynamics. OldSteve never lacks any form of information. And the harm inflicted on OldSteve is not in any sense marginal.

I do not see any strong connections between the OldSteve thought experiment and your Scott Alexander quote (which is concerned with the question of what options and information a government run by humans should provide to children raised by other humans). More generally: scenarios that include a clever AI that is specifically trying to hurt someone have a lot of unique properties (important properties that are not present in scenarios that lack such an AI). I think that these scenarios are dangerous. And I think that they should be avoided (as opposed to first created and then mitigated). (Avoiding such scenarios is a necessary, but definitely not sufficient, feature of an alignment target).

 

Memory wipes would complicate extrapolation

All deleted memories must be so thoroughly wiped that a clever AI will be unable to reconstruct them (otherwise the whole point of the Cosmic Block is negated). Deleting all memories of a single important negative interpersonal relationship would be a huge modification. Even just deleting all memories of one famous person that served as a role model would be significant.

Thoroughly deleting your memory of a person would also impact your memory of every conversation that you have ever had about this person. Including conversations with people that are not deleted. Most long term social relationships involve a lot of discussions of other people (one person describing past experiences to the other, discussions of people that both know personally, arguments over politicians or celebrities, etc, etc). Thus, the memory deletion would significantly alter the memories of essentially all significant social relationships. This is not a minor thing to do to a person. (That every person would be subjected to this is not obviously implied by the text in The ELYSIUM Proposal.)

In other words: even memories of non deleted people would be severely modified. For example: every discussion or argument about a deleted person would be deleted. Two people (that do not delete each other) might suddenly have no idea why they almost cut all contact a few years ago, and why their interactions have been so different for the last few years. Either their Advocates can reconstruct the relevant information (in which case the deletion does not serve its purpose). Or their Advocates must try to extrapolate them while lacking a lot of information.

Getting the definitions involved in extrapolation right seems like it will be very difficult even under ordinary circumstances. Wide ranging and very thorough memory deletion would presumably make extrapolation even more tricky. This is a major issue.

 

Your proposed resource destruction mechanism

No one in any of my thought experiments was trying to get more resources. The 55 percent majority (and the group of 10 people) have a lot of resources that they do not care much about. They want to create some form of existence for themselves. This only takes a fraction of available resources to set up. They can then burn the rest of their resources on actions within the resource destruction mechanism. They either burn these resources to directly hurt people, or they risk these resources by making threats that are completely credible. In the thought experiments where someone does issue a threat, the threat is issued because: a person giving in > burning resources to hurt someone who refuses > leaving someone that refuses alone. They are perfectly ok with an outcome where resources are spent on hurting someone that refuses to comply (they are not self modifying as a negotiation strategy. They just think that this is a perfectly ok outcome).

Preventing this type of threat would be difficult because (i): negotiations are allowed, and (ii): in any scenario where threats are prevented, the threatened action would simply be taken (for non strategic reasons). There is no difference in behaviour between scenarios where threats are prevented, and scenarios where threats are ignored.

The thought experiment where a majority burns resources to hurt a minority was a simple example scenario where almost half of the population would very strongly prefer to destroy ELYSIUM (or strongly prefer that ELYSIUM was never created). It was a response to your claim that your resource destruction mechanisms would prevent such a scenario. This thought experiment did not involve any form of threat or negotiation.

Let's call a rule that prevents the majority from hurting the minority a Minority Protection Rule (MPR). There are at least two problems with your claim that a pre-AI majority would prevent the creation of a version of ELYSIUM that has an MPR.

First: without an added MPR, the post-AI majority is able to hurt the minority without giving up anything that they care about (they burn resources they don't need). So there is no reason to think that an extrapolated post-AI majority would want to try to prevent the creation of a version of ELYSIUM with an MPR. They would prefer the case without an MPR. This does not imply that they care enough to try to prevent the creation of a version of ELYSIUM with an MPR. Doing so would presumably be very risky, and they don't gain anything that they care much about. When hurting the minority does not cost them anything that they care about, they do it. That does not imply that this is an important issue for the majority.

More importantly however: you are conflating, (i): a set of un-extrapolated and un-coordinated people living in a pre-AI world, with (ii): a set of clever AI Advocates representing these same people, operating in a post-AI world. There is nothing unexpected about humans opposing / supporting an AI that would be good / bad for them (from the perspective of their extrapolated Advocates). That is the whole point of having extrapolated Advocates.

Comment by ThomasCederborg on The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind · 2024-10-19T15:23:11.232Z

Implementing The ELYSIUM Proposal would lead to the creation of a very large, and very diverse, set of clever AIs that want to hurt people: the Advocates of a great variety of humans who want to hurt others in a wide variety of ways, for a wide variety of reasons. Protecting billions of people from this set of clever AIs would be difficult. As far as I can tell, nothing that you have mentioned so far would provide any meaningful amount of protection from a set of clever AIs like this (details below). I think that it would be better to just not create such a set of AIs in the first place (details below).

 

Regarding AI assisted negotiations

I don't think that it is easy to find a negotiation baseline for AI-assisted negotiations that results in a negotiated settlement that actually deals with such a set of AIs. Negotiation baselines are non-trivial. Reasonable sounding negotiation baselines can have counterintuitive implications. They can imply power imbalance issues that are not immediately obvious. For example: the random dictator negotiation baseline in PCEV gives a strong negotiation advantage to people who intrinsically value hurting other humans. This went unnoticed for a long time. (It has been suggested that it might be possible to find a negotiation baseline (a BATNA) that can be viewed as having been acausally agreed upon by everyone. However, it turns out that this is not actually possible for a group of billions of humans).

 

The proposal to have a simulated war that destroys resources

10 people without any large resource needs could use this mechanism to kill 9 people they don't like at basically no cost (defining C as any computation done within the Utopia of the person they want to kill). Consider 10 people that just want to live a long life, and that do not have any particular use for most of the resources they have available. They can destroy all computational resources of 9 people without giving up anything that they care about. This also means that they can make credible threats. Especially if they like the idea of killing someone for refusing to modify the way that she lives her life. They can do this with person after person, until they have run into 9 people who prefer death to compliance. Doing this costs them basically nothing.

This mechanism does not rule out scenarios where a lot of people would strongly prefer to destroy ELYSIUM. A trivial example would be a 55 percent majority (that does not have a lot of resource needs) burning 90 percent of all resources in ELYSIUM to fully disenfranchise everyone else. And then using the remaining resources to hurt the minority. In this scenario almost half of all people would very strongly prefer to destroy ELYSIUM. Such a majority could alternatively credibly threaten the minority and force them to modify the way they live their lives. The threat would be especially credible if the majority likes the scenario where a minority is punished for refusing to conform.

In other words: this mechanism seems to be incompatible with your description of personalised Utopias as the best possible place to be (subject only to a few non intrusive ground rules).

 

The Cosmic Block and a specific set of tests

This relies on a set of definitions. And these definitions would have to hold up against a set of clever AIs trying to break them. None of the rules that you have proposed so far would prevent the strategy used by BPA to punish Steve, outlined in my initial comment. OldSteve is hurt in a way that is not actually prevented by any rule that you have described so far. For example: the ``is torture happening here'' test would not trigger for what is happening to OldSteve. So even if Steve does in principle have the ability to stop this by using some resource destruction mechanism, Steve will not be able to do so. Because Steve will never become aware of what Bob is doing to OldSteve. Steve considers OldSteve to be himself in a relevant sense. So, according to Steve's worldview, Steve will experience a lot of very unpleasant things. But the only version of Steve that would be able to pay resources to stop this, would not be able to do so.

So the security hole pointed out by me in my original thought experiment is still not patched. And patching this security hole would not be enough. To protect Steve, one would need to find a set of rules that preemptively patches every single security hole that one of these clever AIs could ever find.

 

I think that it would be better to just not create such a set of AIs

Let's reason from the assumption that Bob's Personal Advocate (BPA) is a clever AI that will be creating Bob's Personalised Utopia. Let's now again take the perspective of ordinary human individual Steve, who gets no special treatment. I think the main question that determines Steve's safety in this scenario is how BPA is adopting Steve-referring-preferences. I think this is far more important for Steve's safety than the question of what set of rules will govern Bob's Personalised Utopia. The question of what BPA wants to do to Steve seems to me to be far more important for Steve's safety than the question of what set of rules will constrain the actions of BPA.

Another way to look at this is to think in terms of avoiding contradictions. And in terms of making coherent proposals. A proposal that effectively says that everyone should be given everything that they want (or effectively says that everyone's values should be respected) is not a coherent proposal. These things are necessarily defined in some form of outcome or action space. Trying to give everyone overlapping control over everything that they care about in such spaces introduces contradictions.

This can be contrasted with giving each individual influence over the adoption (by any clever AI) of those preferences that refer to her. Since this is defined in preference adoption space, it cannot guarantee that everyone will get everything that they want. But it also means that it does not imply contradictions (see this post for a discussion of these issues in the context of Membrane formalisms). Giving everyone such influence is a coherent proposal.

It also happens to be the case that if one wants to protect Steve from a far superior intellect, then preference adoption space seems to be a lot more relevant than any form of outcome or action space. Because if a superior intellect wants to hurt Steve, then one has to defeat a superior opponent in every single round of a near infinite definitional game (even under the assumption of perfect enforcement, winning every round in such a definitional game against a superior opponent seems hopeless). In other words: I don't think that the best way to approach this is to ask how one might protect Steve from a large set of clever AIs that want to hurt Steve for a wide variety of reasons. I think a better question is to ask how one might prevent the situation where such a set of AIs wants to hurt Steve.

Comment by ThomasCederborg on The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind · 2024-10-18T05:29:31.200Z

My thought experiment assumed that all rules and constraints described in the text that you linked to had been successfully implemented. Perfect enforcement was assumed. This means that there is no need to get into issues such as relative optimization power (or any other enforcement related issue). The thought experiment showed that the rules described in the linked text do not actually protect Steve from a clever AI that is trying to hurt Steve (even if these rules are successfully implemented / perfectly enforced).

If we were reasoning from the assumption that some AI will try to prevent All Bad Things, then relative power issues might have been relevant. But there is nothing in the linked text that suggests that such an AI would be present (and it contains no proposal for how one might arrive at some set of definitions that would imply such an AI).

In other words: there would be many clever AIs trying to hurt people (the Advocates of various individual humans). But the text that you link to does not suggest any mechanism, that would actually protect Steve from a clever AI trying to hurt Steve.

There is a ``Misunderstands position?'' react to the following text:

The scenario where a clever AI wants to hurt a human that is only protected by a set of human constructed rules ...

In The ELYSIUM Proposal, there would in fact be many clever AIs trying to hurt individual humans (the Advocates of various individual humans). So I assume that the issue is with the protection part of this sentence. The thought experiment outlined in my comment assumes perfect enforcement (and my post that this sentence is referring to also assumes perfect enforcement). It would have been redundant, but I could have instead written:

The scenario where a clever AI wants to hurt a human that is only protected by a set of perfectly enforced human constructed rules ...

I hope that this clarifies things.

The specific security hole illustrated by the thought experiment can of course be patched. But this would not help. Patching all humanly findable security holes would also not help (it would prevent the publication of further thought experiments. But it would not protect anyone from a clever AI trying to hurt her. And in The ELYSIUM Proposal, there would in fact be many clever AIs trying to hurt people). The analogy with an AI in a box is apt here. If it is important that an AI does not leave a human constructed box (analogous to: an AI hurting Steve), then one should avoid creating a clever AI that wants to leave the box (analogous to: avoid creating a clever AI that wants to hurt Steve). In other words: Steve's real problem is that a clever AI is adopting preferences that refer to Steve, using a process that Steve has no influence over.

(Giving each individual influence over the adoption of those preferences that refer to her would not introduce contradictions, because such influence would be defined in preference adoption space, not in any form of action or outcome space. In The ELYSIUM Proposal however, no individual would have any influence whatsoever over the process by which billions of clever AIs would adopt preferences that refer to her.)

Comment by ThomasCederborg on The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind · 2024-10-17T04:56:24.389Z

Let's optimistically assume that all rules and constraints described in The ELYSIUM Proposal are successfully implemented. Let's also optimistically assume that every human will be represented by an Advocate that perfectly represents her interests. This will allow us to focus on a problem that remains despite these assumptions.

Let's take the perspective of ordinary human individual Steve. Many clever and powerful AIs would now adopt preferences that refer to Steve (the Advocates of humans that have preferences that refer to Steve). Steve has no influence regarding the adoption of these Steve-Preferences. If one of these clever and powerful AIs wants to hurt Steve, then Steve is only protected by a set of human constructed rules.

The scenario where a clever AI wants to hurt a human that is only protected by a set of human constructed rules was previously discussed here. That post is about Membrane formalisms. But the argument is similar to the argument presented below. Both arguments are also similar to arguments about the difficulty of keeping a clever AI contained in a human constructed box (if it is important that an AI stays in a human constructed box, then one should not build a clever AI that wants to leave the box. If a clever AI wants to leave the box, then plugging all human-findable security holes is not enough, because the clever AI can find a security hole that is not humanly-findable). Very briefly: this general situation is dangerous for Steve, because the AI that wants to hurt Steve is more clever than the humans that constructed the rules that are supposed to protect Steve.

Let's explore one specific example scenario where a clever AI finds a way around the specific rules outlined in the text of The ELYSIUM Proposal. Bob does not want to hurt anyone. Bob certainly does not want to use his Utopia as a weapon. However, it is important for Bob that Bob's Utopia was constructed by an ethical AI. A moral imperative for such an AI is that it must punish heretics (if such heretics exist). Bob would prefer a world where no one is a heretic and no one suffers. But unfortunately Steve is a heretic. And the moral imperative to punish Steve is more important than the avoidance of suffering. So Bob's Personal Advocate (BPA) will try to punish Steve.

Steve now faces a clever AI trying to hurt him, and his only protection against this AI is a set of human constructed rules. Even if no human is able to find a way around some specific set of human constructed rules, BPA will be able to think up strategies that no human is able to comprehend (this more serious problem would remain, even if the security hole described below is fully patched). The real problem faced by Steve is that a clever AI has adopted Steve-referring-preferences. And Steve had no influence regarding the decision of which Steve-preferences would be adopted by this clever AI. But let's now return to discussing one specific strategy that BPA can use to hurt Steve without breaking any of the rules described in this specific text.

BPA is constrained by the requirement that all created minds must enthusiastically consent to being created. The other constraint is that BPA is not allowed to torture any created mind. The task of BPA is thus to construct a mind that (i): would enthusiastically consent to being created, and (ii): would suffer in ways that Steve would find horrific, even though no one is torturing this mind.

The details will depend on Steve's worldview. The mind in question will be designed specifically to hurt Steve. One example mind that could be created is OldSteve. OldSteve is what Steve would turn into, if Steve were to encounter some specific set of circumstances. Steve considers OldSteve to be a version of himself in a relevant sense (if Steve did not see things in this way, then BPA would have designed some other mind). OldSteve has adopted a worldview that makes it a moral obligation to be created. So OldSteve would enthusiastically consent to being created by BPA. Another thing that is true of OldSteve is that he would suffer horribly due to entirely internal dynamics (OldSteve was designed by a clever AI that was specifically looking for a type of mind that would suffer due to internal dynamics).

So OldSteve is created by BPA. And OldSteve suffers in a way that Steve finds horrific. Steve does not share the moral framework of OldSteve. In particular: Steve does not think that OldSteve had any obligation to be created. In general, Steve does not see the act of creating OldSteve as a positive act in any way. So Steve is just horrified by the suffering. BPA can create a lot of copies of OldSteve with slight variations, and keep them alive for a long time.

(This comment is an example of Alignment Target Analysis (ATA). This post argued that doing ATA now is important, because there might not be a lot of time to do ATA later (for example because Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure). There are many serious AI risks that cannot be reduced by any level of ATA progress. But ATA progress can reduce the probability of a bad alignment target getting successfully implemented. A risk reduction focused ATA project would be tractable, because risks can be reduced even if one is not able to find any good alignment target. This comment discusses which subset of AI risks can (and cannot) be reduced by ATA. This comment is focused on a different topic but it contains a discussion of a related concept (towards the end it discusses the importance of having influence over the adoption of self-referring-preferences by clever AIs).)

Comment by ThomasCederborg on Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure · 2024-10-11T01:14:29.986Z

I changed the title from: ``A Pivotal Act AI might not buy a lot of time'' to: ``Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure''.

As explained by Martin Randall, the statement: ``something which does not buy ample time is not a pivotal act'' is false (based on the Arbital Guarded Definition of Pivotal Act). Given your ``Agreed react'' to that comment, this issue seems to be settled. In the first section of the present comment, I explain why I still think that the old title was a mistake. The second section outlines a scenario that better illustrates that a Pivotal Act AI might not buy a lot of time.

Why the old title was a mistake

The old title implied that launching the LAI was a very positive event. With the new title, launching the LAI may or may not have been a positive event. This was the meaning that I intended.

Launching the LAI drastically increased the probability of a win by shutting down all competing AI projects. It however also increased risks from scenarios where someone successfully hits a bad alignment target. This can lead to a massively worse than extinction outcome (for example along the lines of the outcome implied by PCEV). In other words: launching LAI may or may not have been a positive event. Thus, launching the LAI may or may not have been a Pivotal Act according to the Arbital Guarded Definition (which requires the event to be very positive).

The old title does not seem to be incompatible with the actual text of the post. But it is incompatible with my intended meaning. I didn't intend to specify whether or not LAI was a positive event. Because the argument about the need for Alignment Target Analysis (ATA) goes through regardless of whether or not launching LAI was a good idea. Regardless of whether or not launching LAI was a positive event, ATA work needs to start now to reduce risks. Because in both cases, ATA progress is needed to reduce risks. And in both cases, there is not a lot of time to do ATA later. (ATA is in fact more important in scenarios where launching the LAI was a terrible mistake)

As I show in my other reply: there is a well established convention of using the term Pivotal Act as a shorthand for shutting down all competing AI projects. As can be seen by looking at the scenario in the post: this might not buy a lot of time. That is how I was using the term when I picked the old title.

A scenario that better illustrates why a Pivotal Act AI might not buy a lot of time

This section outlines a scenario where an unambiguous Pivotal Act is instantly followed by a very severe time crunch. It is possible to see that a Pivotal Act AI might not buy a lot of time by looking at the scenario in the post. But the present section will outline a scenario that better illustrates this fact. (In other words: this section outlines a scenario for which the old title would actually be a good title.) In this new scenario, a Pivotal Act dramatically reduces the probability of extinction by shutting down all unauthorised AI projects. It also completely removes the possibility of anything worse than extinction. Right after the Pivotal Act, there is a frenzied race against the clock to make enough progress on ATA before time runs out. Failure results in a significant risk of extinction.

Consider the case where Dave launches Dave's AI (DAI). If DAI had not been launched, everyone would have almost certainly been killed by some other AI. DAI completely and permanently shuts down all competing AI projects. DAI also reliably prevents all scenarios where designers fail to hit the alignment target that they are aiming at. Due to Internal Time Pressure, a Sovereign AI must then be launched very quickly (discussions of Internal Time Pressure can be found here, and here, and here). There is very little time to decide what alignment target to aim at. (The point made in this section is not sensitive to who gave Dave permission to launch DAI. Or sensitive to who DAI will defer to for the choice of alignment target. But for the sake of concreteness, let's say that the UN security council authorised DAI. And that DAI defers to a global electorate regarding the choice of alignment target).

By the time Dave launches DAI, work on ATA has already progressed very far. There already exist many alignment targets that would in fact lead to an unambiguous win (somehow, describing these outcomes as a win is objectively correct). Only one of the many proposed alignment targets still has an unnoticed problem. And this problem is not nearly as severe as the problem with PCEV. People take the risks of unnoticed problems very seriously. But due to severe Internal Time Pressure, there is not much they can do with this knowledge. The only option is to use their limited time to analyse all alignment targets that are being considered. (many very optimistic assumptions are made regarding both DAI and the level of ATA progress. This is partly to make sure that readers will agree that the act of launching DAI should count as a Pivotal Act. And partly to show that ATA might still be needed, despite these very optimistic assumptions).

The only alignment target that is not a clear win, is based on maximising the sum of re-normalised utility functions. The proposed AI includes a proposed way of mapping a human to a utility function. This always results in a perfect representation of what the human wants. (And there are no definitional issues with this mapping). These functions are then renormalised to have the same variance (as discussed here). Let's write VarAI for this AI. VarAI maximises the sum of the renormalised functions. The aggregation method described above has a problem that is obvious in retrospect. If that problem is explained, then it is clear that VarAI is an unacceptable alignment target. However, in this scenario, no one has noticed this problem. The question is now whether or not anyone will notice the problem (before an alignment target needs to be settled on).
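
The scenario does not pin down the details of the renormalisation. A minimal sketch of this kind of aggregation, assuming each utility function is rescaled relative to its mean and variance over some fixed reference lottery L over outcomes (the reference lottery, and the mean subtraction, are assumptions made here purely for illustration), could look like:

\[
U_{\mathrm{VarAI}}(o) \;=\; \sum_{i=1}^{n} \frac{u_i(o) - \mu_i}{\sigma_i},
\qquad
\mu_i = \mathbb{E}_{o' \sim L}\!\left[u_i(o')\right],
\quad
\sigma_i^2 = \mathrm{Var}_{o' \sim L}\!\left[u_i(o')\right]
\]

VarAI would then pick whatever outcome (or policy) maximises this sum. The point of dividing by \(\sigma_i\) is that each renormalised function has the same variance over L, so no function dominates merely because it happens to be expressed on a larger scale.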

Due to the problem with this aggregation method, VarAI would create a Soft Maximisation version of a Suffering Reducing AI (SMSRAI) as a successor AI (if VarAI is successfully implemented and pointed at the population in this thought experiment). So, if the proponents of VarAI happen to win the political fight, then the result would be SMSRAI. SMSRAI would work to reduce both average suffering and population size (without doing anything drastic). The quality of most people's lives would increase very dramatically. Many people would choose to spend centuries living a very nice life. But after some conversations with SMSRAI, few would choose to live for millennia. No one would choose to live for many millennia. Humanity would very gradually go extinct in a few dozen generations. By the time it happens, no one would find this problematic in any way. Changes in attitudes are very gradual, and no one ever feels pressured or manipulated by SMSRAI (for example: if you ask SMSRAI to leave you alone, it backs off for a couple of centuries before politely asking again if you want to talk).

If anyone finds the problem with the aggregation method in time, then VarAI will be dropped as a potential alignment target. (Once the problem is explained, it is retrospectively obvious why VarAI is completely unacceptable as an alignment target). If no one finds this problem in time however, it is possible that VarAI will win the political fight.

In order to avoid accidental interference, Dave designed DAI to not discuss ATA. (As a general strategy for reducing risks from definitional issues, Dave actually made sure that DAI only does things that are absolutely necessary for the completion of a carefully chosen set of well defined tasks). So the only way to remove the possibility of extinction, is to reach the needed level of ATA understanding before it's too late. (If the problem with this aggregation method is not explained to them, their only path to understanding is to try doing ATA).

Despite severe Internal Time Pressure, the possibility of an unnoticed problem in an alignment target is taken very seriously. So a last judge safety add-on is implemented. The last judge is extrapolated using a different extrapolation method than the one used in VarAI (in order to give the last judge a chance to veto a bad extrapolation method. The VarAI extrapolation definitions are in fact somehow objectively correct. But there is no way of verifying that fact). The result of the last judge extrapolation method turns out to be a very anxious mind: a mind that is in general not comfortable with objecting to things (such as the extrapolation method that it is the output of, or the outcome implied by an alignment target). This mind is very reluctant to veto a scenario where no one is suffering, and where almost everyone is very happy with all aspects of how things turn out (SMSRAI very gradually, over many generations, ``helps people realise'' that the outcome is actually a good outcome. And people genuinely are having a very nice time, for a lot longer than most people expected). So the off switch is not triggered.

If Dave had not launched DAI, all humans would very likely have been killed very soon by some other AI. So I think a lot of people would consider launching DAI to be a Pivotal Act. (It completely upset the game board. It drastically increased the probability of a win. It was a very positive event according to a wide range of value systems). But if someone wants humanity to go on existing (or wants to personally live a super long life), then there is not a lot of time to find the problem with VarAI (because without sufficient ATA progress, there still exists a significant probability of extinction). So, launching DAI was a Pivotal Act. And launching DAI did not result in a lot of time to work on ATA. Which demonstrates that a Pivotal Act AI might not buy a lot of time.

One can use this scenario as an argument in favour of starting ATA work now. It is one specific scenario that exemplifies a general class of scenarios: scenarios where starting ATA work now, would further reduce an already small risk of a moderately bad outcome. It is a valid argument. But it is not the argument that I was trying to make in my post. I was thinking of something a lot more dangerous. I was imagining a scenario where a bad alignment target is very likely to get successfully implemented unless ATA progresses to the needed levels of insight before it is too late. And I was imagining an alignment target that implied a massively worse than extinction outcome (for example along the lines of the outcome implied by PCEV). I think this is a stronger argument in favour of starting work on ATA now. And this interpretation was ruled out by the old title (which is why I changed the title).

(a brief tangent: if someone expects everything to turn out well, but would like to work on ATA in order to further reduce a small probability of something going moderately bad, then I would be very happy to collaborate with such a person in a future ATA project. Having very different perspectives in an ATA project sounds like a great idea. An ATA project is very different from a technical design project where a team is trying to get something implemented that will actually work. There is really no reason for people to have similar worldviews or even compatible ontologies. It is a race against time to find a conceptual breakthrough of an unknown type. It is a search for an unnoticed implicit assumption of an unknown type. So genuinely different perspectives sound like a great idea)

In summary: ``A Pivotal Act AI might not buy a lot of time'' is in fact a true statement. And it is possible to see this by looking at the scenario outlined in the post. But it was a mistake to use this statement as the title for this post. Because it implies things about the scenario that I did not intend to imply. So I changed the title and outlined a scenario that is better suited for illustrating that a Pivotal Act AI might not buy a lot of time.

 

PS:

I upvoted johnswentworth's comment. My original title was a mistake. And the comment helped me realise my mistake. I hope that others will post similar comments on my posts in the future. The comment deserves upvotes. But I feel like I should ask about these agreement votes.

The statement: ``something which does not buy ample time is not a pivotal act'' is clearly false. Martin Randall explained why the statement is false (helpfully pulling out the relevant quotes from the texts that johnswentworth cited). And then johnswentworth did an ``Agreed reaction'' on Martin Randall's explanation of why the statement is false. After this however, johnswentworth's comment (with the statement that had already been determined to be false) was agree voted to plus 7. That seemed odd to me. So I wanted to ask about it. (My posts sometimes question deeply entrenched assumptions. And johnswentworth's comment sort of looks like criticism (at least if one only skims the post and the discussion). So maybe there is no great mystery here. But I still wanted to ask about this. Mostly in case someone has noticed an object level error in my post. But I am also open to terminology feedback)

Comment by ThomasCederborg on Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure · 2024-10-04T15:11:07.627Z

I will change the title.

However: you also seem to be using the term Pivotal Act as a synonym for removing all time pressure from competing AI projects (which the AI in my post does). Example 3 of the arbital page that you link to also explicitly refers to an act that removes all time pressure from competing AI projects as a Pivotal Act. This usage is also present in various comments by you, Yudkowsky, and others (see links and quotes below). And there does not seem to exist any other established term for an AI that: (i): completely removes all time pressure from competing AI projects by uploading a design team and giving them infinite time to work, (ii): keeps the designers calm, rational, sane, etc indefinitely (with all definitional issues of those terms fully solved), and (iii): removes all risks from scenarios where someone fails to hit an alignment target. What other established term exists for such an AI? I think people would generally refer to such an AI as a Pivotal Act AI. And as demonstrated in the post: such an AI might not buy a lot of time.

Maybe using the term Pivotal Act as a synonym for an act that removes all time pressure from competing AI projects is a mistake? (Maybe the scenario in my post should be seen as showing that this usage is a mistake?). But it does seem to be a very well established way of using the term. And I would like to have a title that tells readers what the post is about. I think the current title probably did tell you what the post is about, right? (that the type of AI actions that people tend to refer to as Pivotal Acts might not buy a lot of time in reality)

In the post I define new terms. But if I use a novel term in the title before defining this term, the title will not tell you what the post is about. So I would prefer to avoid doing that.

But I can see why you might want to have Pivotal Act be a protected term for something that is actually guaranteed to buy a lot of time (which I think is what you would like to do?). And perhaps it is possible to maintain (or re-establish?) this usage. And I don't want to interfere with your efforts to do this. So I will change the title.

If we can't find a better solution I will change the title to: Internal Time Pressure. It does not really tell you what the post will be about. But at least it is accurate and not terminologically problematic. And even though the term is not commonly known, Internal Time Pressure is actually the main topic of the post (Internal Time Pressure is the reason that the AI mentioned above, that does all the nice things mentioned, might not actually buy a lot of time).


Regarding current usage of the term Pivotal Act:

It seems to me like you and many others are actually using the term as a shorthand for an AI that removes time pressure from competing AI projects. I can take many examples of this usage just from the discussion that faul_sname links to in the other reply to your comment.

In the second last paragraph of part 1 of the linked post, Andrew_Critch writes:

Overall, building an AGI development team with the intention to carry out a “pivotal act” of the form “forcibly shut down all other A(G)I projects” is probably going to be a rough time, I predict.

No one seems to be challenging that usage of Pivotal Act (even though many other parts of the post are challenged). And it is not just this paragraph. The tl;dr also treats a Pivotal Act as interchangeable with: shut down all other AGI projects, using safe AGI. There are other examples in the post.

In this comment on the post, it seems to me that Scott Alexander is using a Pivotal Act AI as a direct synonym for an AI capable of destroying all competing AI projects.

In this comment it seems to me like you are using Pivotal Act interchangeably with shutting down all competing AI projects. In this comment, it seems to me that you accept the premise that uploading a design team and running them very quickly would be a Pivotal Act (but you question the plan on other grounds). In this comment, it seems to me that you are equating successful AI regulation with a Pivotal Act (but you question the feasibility of regulation).

In this comment, Yudkowsky seems to me to be accepting the premise that preventing all competing AI projects would count as a Pivotal Act. He says that the described strategy for preventing all competing AI projects is not feasible. But he also says that he will change the way he speaks about Pivotal Acts if the strategy actually does work (and this strategy is to shut down competing AI projects with EMPs. The proposed strategy does nothing else to buy time, other than shutting down competing AI projects). (It is not an unequivocal case of using Pivotal Act as a direct synonym for reliably shutting down all competing AI projects. But it really does seem to me like Yudkowsky is treating Pivotal Act as a synonym for: preventing all competing AI projects. Or at least that he is assuming that preventing all competing AI projects would constitute a Pivotal Act).

Consider also example 3 in the arbital page that you link to. Removing time pressure from competing AI projects by uploading a design team is explicitly defined as an example of a Pivotal Act. And the LAI in my post does exactly this. And the LAI in my post also does a lot of other things that increase the probability of a win (such as keeping the designers sane and preventing them from missing an aimed for alignment target).

This usage points to a possible title along the lines of: The AI Actions that are Commonly Referred to as Pivotal Acts, are not Actually Pivotal Acts (or: Shutting Down all Competing AI Projects is not Actually a Pivotal Act). This is longer and less informative about what the post is about (the post is about the need to start ATA work now, because there might not be a lot of time to do ATA work later, even if we assume the successful implementation of a very ambitious AI, whose purpose was to buy time). But this title would not interfere with an effort to maintain (or re-establish?) the meaning of Pivotal Act as a synonym for an act that is guaranteed to buy lots of time (which I think is what you are trying to do?). What do you think about these titles?


PS:

(I think that technically the title probably does conform to the specific text bit that you quote. It depends on what the current probability of a win is. And how one defines: drastically increase the probability of a win. But given the probability that Yudkowsky currently assigns to a win, I expect that he would agree that the launch of the described LAI would count as drastically increasing the probability of a win. (In the described scenario, there are many plausible paths along which the augmented humans actually do reach the needed levels of ATA progress in time. They are however not guaranteed to do this. The point of the post is that doing ATA now increases the probability of this happening). The statement that the title conforms to the quoted text bit is however only technically true in an uninteresting sense. And the title conflicts with your efforts to guard the usage of the term. So I will change the title as soon as a new title has been settled on. If nothing else is agreed on, I will change the title to: Internal Time Pressure)

Comment by ThomasCederborg on The case for more Alignment Target Analysis (ATA) · 2024-09-25T15:22:12.686Z

Your comment makes me think that I might have been unclear regarding what I mean by ATA. The text below is an attempt to clarify.


Summary

Not all paths to powerful autonomous AI go through methods from the current paradigm. It seems difficult to rule out the possibility that a Sovereign AI will eventually be successfully aligned to some specific alignment target. At current levels of progress on ATA this would be very dangerous (because understanding an alignment target properly is difficult, and a seemingly-nice proposal can imply a very bad outcome). It is difficult to predict how long it would take to reach the level of understanding needed to prevent scenarios where a project successfully hits a bad alignment target. And there might not be a lot of time to do ATA later (for example because a tool-AI shuts down all unauthorised AI projects, but does not buy a lot of time due to internal time pressure). So a research effort should start now.

Therefore ATA is one of the current priorities. There are definitely very serious risks that ATA cannot help with (for example misaligned tool-AI projects resulting in extinction). There are also other important current priorities (such as preventing misuse). But ATA is one of the things that should be worked on now.

The next section outlines a few scenarios designed to clarify how I use the term ATA. The section after that outlines a scenario designed to show why I think that ATA work should start now.


What I mean by Alignment Target Analysis (ATA)

The basic idea with ATA is to try to figure out what would happen if a given AI project were to successfully align an autonomously acting AI Sovereign to a given alignment target. The way I use the term, there are very severe risks that cannot be reduced in any way, by any level of ATA progress (including some very serious misalignment and misuse risks). But there are also risks that can and should be reduced by doing ATA now. There might not be a lot of time to do ATA later, and it is not clear how long it will take to advance to the level of understanding that will be needed. So ATA should be happening now. But let's start by clarifying the term ATA, by outlining a couple of dangerous AI projects where ATA would have nothing to say.

Consider Bill, who plans to use methods from the current paradigm to build a tool-AI. Bill plans to use this tool AI to shut down competing AI projects and then decide what to do next. ATA has nothing at all to say about this situation. Let's say that Bill's project plan would lead to a powerful misaligned AI that would cause extinction. No level of ATA progress would reduce this risk.

Consider Bob who also wants to build a tool-AI. But Bob's AI would work. If the project would go ahead, then Bob would gain a lot of power. And Bob would use that power to do some very bad things. ATA has nothing to say about this project and ATA cannot help reduce this risk.

Now let's introduce an unusual ATA scenario, just barely within the limits of what ATA can be used for (the next section will give an example of the types of scenarios that make me think that ATA should be done now. This scenario is meant to clarify what I mean by ATA). Consider Dave who wants to use methods from the current paradigm to implement PCEV. If the project plan moves forwards, then the actual result would be a powerful misaligned AI: Dave's Misaligned AI (DMAI). DMAI would not care at all what Dave is trying to do, and would cause extinction (for reasons that are unrelated to what Dave was aiming at). One way to reduce the extinction risk from DMAI would be to tell Dave that his plan would lead to DMAI. But it would also be valid to let Dave know that if his project were to successfully hit the alignment target that he is aiming for, then the outcome would be massively worse than extinction.

Dave assumes that he might succeed. So, when arguing against Dave's project, it is entirely reasonable to argue from the assumption that Dave's project will lead to PCEV. Pointing out that success would be extremely bad is a valid argument against Dave's plan, even if success is not actually possible.

You can argue against Dave's project by pointing out that the project will in fact fail. Or by pointing out that success would be very bad. Both of these strategies can be used to reduce the risk of extinction. And both strategies are cooperative (if Dave is a well meaning and reasonable person, then he would thank you for pointing out either of these aspects of his plan). While both strategies can prevent extinction in a fully cooperative way, they are also different in important ways. It might be the case that only one of these arguments is realistically findable in time. It might for example be the case that Dave is only willing to publish one part of his plan (meaning that there might not be sufficient public information to construct an argument about the other part of the plan). And even if valid arguments of both types are constructed in time, it might still be the case that Dave will only accept one of these arguments. (similar considerations are also relevant for less cooperative situations. For example if one is trying to convince a government to shut down Dave's project. Or if one is trying to convince an electorate to vote no on a referendum that Dave needs to win in order to get permission to move forwards)

The audience in question (Dave, bureaucrats, voters, etc) are only considering the plan because they believe that it might result in PCEV. Therefore it is entirely valid to reason from the assumption that Dave's plan will result in PCEV (when one is arguing against the plan). There is no logical reason why such an argument would interfere with attempts to argue that Dave's plan would in fact result in DMAI.

Now let's use an analogy from the 2004 CEV document to clarify what role I see an ATA project playing. In this analogy, building an AI Sovereign is analogous to taking power in a political revolution. So (in the analogy) Dave proposes a political revolution. One way a revolution can end in disaster is that the revolution leads to a destructive civil war that the revolutionaries lose (analogous to DMAI causing extinction). Another way a revolution can end in disaster is that ISIS takes power after the government is overthrown (analogous to the outcome implied by PCEV).

It is entirely valid to say to Dave: ``if you actually do manage to overthrow the government, then ISIS will seize power'' (assuming that this conditional is true). One can do this regardless of whether or not one thinks that Dave has any real chance of overthrowing the government. (Which in turn means that one can actually say this to Dave, without spending a lot of time trying to determine the probability that the revolution will in fact overthrow the government. Which in turn means that people with wildly different views on how difficult it is to overthrow the government can cooperate while formulating such an argument)

(this argument can be made separately from an argument along the lines of: ``our far larger neighbour has a huge army and would never allow the government of our country to be overthrown. Your revolution will fail even if every single soldier in our country joins you instantly. Entirely separately: the army of our country is in fact fiercely loyal to the government and you don't have enough weapons to defeat it. In addition to these two points: you are clearly bad at strategic thinking and would be outmanoeuvred in a civil war by any semi-competent opponent''. This line of argument can also prevent a hopeless civil war. The two arguments can be made separately and there is no logical reason for them to interfere with each other)

Analysing revolutionary movements in terms of what success would mean can only help in some scenarios. It requires a non-vague description of what should happen after the government falls. In general: this type of analysis cannot reduce the probability of lost civil wars, in cases where the post-revolutionary strategy is either (i): too vaguely described to analyse, or (ii): actually sound (meaning that the only problem with the revolution in question is that it has no chance of success). Conversely however: arguments based on revolutions failing to overthrow the government cannot prevent revolutions that would actually end with ISIS in charge (analogous to AI projects that would successfully hit a bad alignment target). Scenarios that end in a bad alignment target getting successfully hit are the main reason that I think that ATA should happen now (in the analogy, the main point would be to reduce the probability of ISIS gaining power). Now let's leave the revolution analogy and outline one such scenario.


A tool-AI capable of shutting down all unauthorised AI projects might not buy a lot of time

It is difficult to predict who might end up controlling a tool-AI. But one obvious compromise would be to put it under the control of some group of voters (for example a global electorate). Let's say that the tool-AI is designed such that one needs a two-thirds majority in a referendum to be allowed to launch a Sovereign AI. There exists a Sovereign AI proposal that a large majority thinks sounds nice. A small minority would however prefer a different proposal.

In order to prevent inadvertent manipulation risks, the tool-AI was designed to only discuss topics that are absolutely necessary for the process of shutting down unauthorised AI projects. Someone figures out how to make the tool-AI explain how to implement Sovereign AI proposals (and the Explanation / Manipulation related definitions happen to hold for such discussions). But no one figures out how to get it to discuss any topic along the lines of ATA. The original plan was to take an extended period of time to work on ATA before implementing a Sovereign AI.

Both alignment targets use the same method for extrapolating people and for resolving disagreements. The difference is in terms of who is part of the initial group. The two proposals have different rules with respect to things like: animals, people in cryo, foetuses, artificial minds, etc. It doesn't actually matter which proposal gets implemented: the aggregation method leads to the same horrific outcome in both cases (due to an issue along the lines of the issue that PCEV suffers from. But more subtle and difficult to notice). (All proposed alignment targets along the lines of ``build an AI Sovereign that would do whatever some specific individual wants it to do'' are rejected out of hand by almost everyone).

In order to avoid making the present post political, let's say that political debates center around what to do with ecosystems. One side cares about nature and wants to protect ecosystems. The other side wants to prevent animal suffering (even if the cost of such prevention is the total destruction of every ecosystem on earth). It is widely assumed that including animals in the original group will lead to an outcome where animal suffering is prevented at the expense of ecosystems. (in order to make the following scenario more intuitive, readers that have an opinion regarding what should be done with ecosystems, can imagine that the majority shares this opinion)

The majority has enough support to launch their Sovereign AI. But the minority is rapidly and steadily gaining followers due to ordinary political dynamics (sometimes attitudes on a given issue change steadily in a predictable direction). So the ability to get the preferred alignment target implemented can disappear permanently at any moment (the exact number of people that would actually vote yes in a referendum is difficult to estimate. But it is clearly shrinking rapidly). In this case the majority might act before they lose the ability to act. Part of the majority would however hesitate if the flaw with the aggregation method is noticed in time.

After the tool-AI was implemented, a large number of people started to work on ATA. There are also AI assistants that contribute to conceptual progress (they are tolerated by the tool-AI because they are not smarter than humans. And they are useful because they contribute a set of unique non-human perspectives). However, it turns out that ATA progress works sort of like math progress. It can be sped up significantly by lots of people working on it in parallel. But the main determinant of progress is how long people have been working on it. In other words: it turns out that there is a limit to how much the underlying conceptual progress can be sped up by throwing large numbers of people at ATA. So the question of whether or not the issue with the Sovereign AI proposal is noticed in time, is to a large degree determined by how long a serious ATA research project has been going on at the time that the tool-AI is launched (in other words: doing ATA now reduces the risk of a bad alignment target ending up getting successfully hit in this scenario).

(the idea is not that this exact scenario will play out as described. The point of this section was to give a detailed description of one specific scenario. For example: the world will presumably not actually be engulfed by debates about the Prime Directive from Star Trek. And a tool-AI controlled by a messy coalition of governments might lead to a time crunch due to dynamics that are more related to Realpolitik than any form of ideology. This specific scenario is just one example of a large set of similar scenarios)


PS:

On a common sense level I simply don't see how one can think that it is safe to stay at our current level of ATA progress (where it is clearly not possible to reliably tell a good alignment target from an alignment target that implies an outcome massively worse than extinction). The fact that there exists no research project dedicated to improving this situation seems like a mistake. Intuitively this seems like a dangerous situation. At the very least it seems like some form of positive argument would be needed before concluding that this is safe. And it seems like such an argument should be published so that it can be checked for flaws before one starts acting based on the assumption that the current situation is safe. Please don't hesitate to contact me with theories / questions / thoughts / observations / etc regarding what people actually believe about this.

Comment by ThomasCederborg on The case for more Alignment Target Analysis (ATA) · 2024-09-21T14:40:49.177Z · LW · GW

I interpret your comment as a prediction regarding where new alignment target proposals will come from. Is this correct?


I also have a couple of questions about the linked text:

How do you define the difference between explaining something and trying to change someone's mind? Consider the case where Bob is asking a factual question. An objectively correct straightforward answer would radically change Bob's entire system of morality, in ways that the AI can predict. A slightly obfuscated answer would result in far less dramatic changes. But those changes would be in a completely different direction (compared to the straightforward answer). Refusing to answer, while being honest about the reason for refusal, would send Bob into a tailspin. How certain are you that you can find a definition of Acceptable Forms of Explanation that holds up in a large number of messy situations along these lines? See also this.

And if you cannot define such things in a solid way, how do you plan to define ``benefit humanity''? PCEV was an effort to define ``benefit humanity''. And PCEV has been found to suffer from at least one difficult-to-notice problem. How certain are you that you can find a definition of  ``benefit humanity'' that does not suffer from some difficult-to-notice problem?

 

PS:

Speculation regarding where novel alignment target proposals are likely to come from is very welcome. It is a prediction of things that will probably be fairly observable fairly soon. And it is directly relevant to my work. So I am always happy to hear this type of speculation.

Comment by ThomasCederborg on The case for more Alignment Target Analysis (ATA) · 2024-09-21T14:35:57.457Z · LW · GW

Let's reason from the assumption that you are completely right. Specifically, let's assume that every possible Sovereign AI Project (SAIP) would make things worse in expectation. And let's assume that there exists a feasible Better Long Term Solution (BLTS).

In this scenario ATA would still only be a useful tool for reducing the probability of one subset of SAIPs (even if all SAIPs are bad, some designers might be unresponsive to arguments, some flaws might not be realistically findable, etc). But it seems to me that ATA would be one complementary tool for reducing the overall probability of a SAIP being launched. And this tool would not be easy to replace with other methods. ATA could convince the designers of a specific SAIP that their particular project should be abandoned. If ATA results in the description of necessary features, then it might even help a (member of a) design team see that it would be bad if a secret project were to successfully hit a completely novel, unpublished alignment target (for example along the lines of this necessary Membrane formalism feature).

ATA would also be a project where people can collaborate despite almost opposite viewpoints on the desirability of SAIP. Consider Bob, who mostly just wants to get some SAIP implemented as fast as possible. But Bob still recognizes the unlikely possibility of dangerous alignment targets with hidden flaws (but he does not think that this risk is anywhere near large enough to justify waiting to launch a SAIP). You and Bob clearly have very different viewpoints regarding how the world should deal with AI. But there is actually nothing preventing you and Bob from cooperating on a risk reduction focused ATA project.

This type of diversity of perspectives might actually be very productive for such a project. You are not trying to build a bridge on a deadline. You are not trying to win an election. You do not have to be on the same page to get things done. You are trying to make novel conceptual progress, looking for a flaw of an unknown type.

Basically: reducing the probability of outcomes along the lines of the outcome implied by PCEV is useful according to a wide range of viewpoints regarding how the world should deal with AI. (there is nothing unusual about this general state of affairs. Consider for example Dave and Gregg who are on opposite sides of a vicious political trench war over the issue of pandemic lockdowns. There is nothing on the object level that prevents them from collaborating on a vaccine research effort. So this feature is certainly not unique. But I still wanted to highlight the fact that a risk mitigation focused ATA project does have this feature)

Comment by ThomasCederborg on The case for more Alignment Target Analysis (ATA) · 2024-09-21T14:29:10.324Z · LW · GW

I think I see your point. Attempting to design a good alignment target could lead to developing intuitions that would be useful for ATA. A project trying to design an alignment target might result in people learning skills that allow them to notice flaws in alignment targets proposed by others. Such projects can therefore contribute to the type of risk mitigation that I think is lacking. I think that this is true. But I do not think that such projects can be a substitute for an ATA project with a risk mitigation focus.


Regarding Orthogonal:

It is difficult for me to estimate how much effort Orthogonal spends on different types of work. But it seems to me that your published results are mostly about methods for hitting alignment targets. This also seems to me to be the case for your research goals. If you are successful, it seems to me that your methods could be used to hit almost any alignment target (subject to constraints related to finding individuals that want to hit specific alignment targets).

I appreciate you engaging on this, and I would be very interested in hearing more about how the work done by Orthogonal could contribute to the type of risk mitigation effort discussed in the post. I would, for example, be very happy to have a voice chat with you about this.

Comment by ThomasCederborg on The case for more Alignment Target Analysis (ATA) · 2024-09-20T07:31:05.255Z · LW · GW

The proposed research project would indeed be focused on a certain type of alignment target. For example proposals along the lines of PCEV. But not proposals along the lines of a tool-AI. Referring to this as Value-Alignment Target Analysis (VATA) would also be a possible notation. I will adopt this notation for the rest of this comment.


The proposed VATA research project would be aiming for risk mitigation. It would not be aiming for an answer:

There is a big difference between proposing an alignment target on the one hand, and pointing out problems with alignment targets on the other hand. For example: it is entirely possible to reduce risks from a dangerous alignment target without having any idea how one might find a good alignment target. One can actually reduce risks without having any idea of what it even means for an alignment target to be a good alignment target.

The feature of PCEV mentioned in the post is an example of this. The threat posed by PCEV has presumably been mostly removed. This did not require anything along the lines of an answer. The analysis of Condorcet AI (CAI) is similar. The analysis simply describes a feature shared by all CAI proposals (the feature that a barely caring solid majority can do whatever they want with everyone else). Pointing this out presumably reduces the probability that a CAI will be launched by designers that never considered this feature. All claims made in the post about a VATA research project being tractable refer to this type of risk mitigation being tractable. There is definitely no claim that a VATA research project can (i): find a good alignment target, (ii): somehow verify that this alignment target does not have any hidden flaws, and (iii): convince whoever is in charge to launch this target.
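
As a side note, here is a minimal Python sketch of the CAI feature mentioned above (the outcomes, voter counts, and intensity numbers are made up for illustration, not taken from the original analysis): pairwise majority voting ignores preference intensity, so an outcome that a solid majority prefers by a tiny margin can beat an outcome that a large minority cares about enormously.

```python
# A toy illustration (hypothetical outcomes and numbers) of the CAI feature described
# above: pairwise majority voting ignores preference intensity, so a barely caring
# solid majority beats a minority that cares enormously.

# 51 voters barely prefer outcome "A" (catastrophic for the minority) over "B" (a
# compromise). 49 voters strongly prefer "B". Intensity is recorded only to show
# that it plays no role in the pairwise tally.
voters = [{"ranking": ["A", "B"], "intensity": 0.01}] * 51 \
       + [{"ranking": ["B", "A"], "intensity": 1000.0}] * 49
outcomes = ["A", "B"]

def pairwise_winner(x, y, voters):
    """Return the outcome preferred by a strict majority in the contest x vs y."""
    x_votes = sum(1 for v in voters if v["ranking"].index(x) < v["ranking"].index(y))
    return x if x_votes > len(voters) / 2 else y

def condorcet_winner(outcomes, voters):
    """Return the outcome (if any) that wins every pairwise contest."""
    for candidate in outcomes:
        if all(pairwise_winner(candidate, other, voters) == candidate
               for other in outcomes if other != candidate):
            return candidate
    return None

print(condorcet_winner(outcomes, voters))  # prints A: intensity never mattered
```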

One can also go a bit beyond analysis of individual proposals, even if one does not have any idea how to find an answer. One can mitigate risk by describing necessary features (for example along the lines of this necessary Membrane formalism feature). This reduces risks from all proposals that clearly do not have such a necessary feature.

(and just to be extra clear: the post is not arguing that launching a Sovereign AI is a good idea. The post is assuming an audience that agrees that it is possible that a Sovereign AI might be launched. And then the post is arguing that if this does happen, then there is a risk that such a Sovereign AI project will be aiming at a bad value alignment target. The post then further argues that this particular risk can be reduced by doing VATA)


Regarding people being skeptical of Value Alignment Target proposals:

If someone ends up with the capability to launch a Sovereign AI, then I certainly hope that they will be skeptical of proposed Value Alignment Targets. Such skepticism can avert catastrophe even if the proposed alignment target has a flaw that no one has noticed.

The issue is that a situation might arise where (i): someone has the ability to launch a Sovereign AI, (ii): there exists a Sovereign AI proposal that no one can find any flaws with, and (iii): there is a time crunch.


Regarding the possibility that there exists people trying to find an answer without telling anyone:

I'm not sure how to estimate the probability of this. From a risk mitigation standpoint, this is certainly not the optimal way of doing things (if a proposed alignment target has a flaw, then it will be a lot easier to notice that flaw if the proposal is not kept secret). I really don't think that this is a reasonable way of doing things. But I think that you have a point. If Bob is about to launch an AI Sovereign with some critical flaw that would lead to some horrific outcome, then secretly working Steve might be able to notice this flaw. And if Bob is just about to launch his AI, and speaking up is the only way for Steve to prevent Bob from causing a catastrophe, then Steve will presumably speak up. In other words: the existence of people like secretly working Steve would indeed offer some level of protection. It would mean that the lack of people with relevant intuitions is not as bad as it appears (and when allocating resources, this possibility would indeed point to fewer resources for VATA). But I think that what is really needed is at least some people doing VATA with a clear risk mitigation focus, and discussing their findings with each other. This does not appear to exist.


Regarding other risks, and the issue that findings might be ignored:

A VATA research project would not help with misalignment. In other words: even if the field of VATA was somehow completely solved tomorrow, AI could still lead to extinction. So the proposed research project is definitely not dealing with all risks. The point of the post is that the field of VATA is basically empty. I don't know of anyone that is doing VATA full time with a clear risk mitigation focus. And I don't know if you personally should switch to focusing on VATA. It would not surprise me at all if some other project is a better use of your time. It just seems like there should exist some form of VATA research project with a clear risk mitigation focus.

It is also possible that a VATA finding will be completely ignored (by leading labs, or by governments, or by someone else). It is possible that a Sovereign AI will be launched, leading to catastrophe, even though it has a known flaw (because the people launching it are just refusing to listen). But finding a flaw at least means that it is possible to avert catastrophe.

 

PS:

Thanks for the links! I will look into this. (I think that there are many fields of research that are relevant to VATA. It's just that one has to be careful. A concept can behave very differently when it is transferred to the AI context)

Comment by ThomasCederborg on A problem with the most recently published version of CEV · 2024-09-18T04:41:31.039Z · LW · GW

Regarding the political feasibility of PCEV:

PCEV gives a lot of extra power to some people, specifically because those people intrinsically value hurting other humans. This presumably makes PCEV politically impossible in a wide range of political contexts (including negotiations between a few governments). More generally: now that it has been pointed out that PCEV has this feature, the risks from scenarios where PCEV gets successfully implemented have presumably been mostly removed. Because PCEV is probably off the table as a potential alignment target, pretty much regardless of who ends up deciding what alignment target to aim an AI Sovereign at (the CEO of a tech company, a design team, a few governments, the UN, a global electorate, etc).

PCEV is however just one example of a bad alignment target. Let's take the perspective of Steve, an ordinary human individual with no special influence over an AI project. The reason that PCEV is dangerous for Steve is that PCEV (i): adopts preferences that refer to Steve, (ii): in a way that gives Steve no meaningful influence over the decision of which Steve-referring preferences PCEV will adopt. PCEV is just one possible AI that would adopt preferences about Steve, in a way that Steve would have no meaningful influence over. So, even fully removing all the risks associated with PCEV in particular does not remove all risks from this more general class of dangerous alignment targets. From Steve's perspective, the PCEV thought experiment is illustrating a more general danger: risks from scenarios where an AI will adopt preferences that refer to Steve, in a way that Steve will have no meaningful influence over.

Even more generally: scenarios where someone successfully implements some type of bad alignment target still pose a very real risk. Alignment Target Analysis (ATA) is still at a very early stage of development, and these risks are not well understood. ATA is also a very neglected field of research. In other words: there are serious risks that could be mitigated. But those risks are not currently being mitigated. (As a tangent, I think that the best way of looking at ATA is: risk mitigation through the identification of necessary features. As discussed here, identifying features that are necessary can be a useful risk mitigation tool, even if those features are far from sufficient, and even if one is not close to any form of solution)

Comment by ThomasCederborg on A necessary Membrane formalism feature · 2024-09-14T01:49:04.231Z · LW · GW

Regarding Corrigibility as an alternative safety measure:

I think that exploring the Corrigibility concept sounds like a valuable thing to do. I also think that Corrigibility formalisms can be quite tricky (for similar reasons that Membrane formalisms can be tricky: I think that they are both vulnerable to difficult-to-notice definitional issues). Consider a powerful and clever tool-AI. It is built using a Corrigibility formalism that works very well when the tool-AI is used to shut down competing AI projects. This formalism relies on a definition of Explanation that is designed to prevent any form of undue influence. When talking with this tool-AI about shutting down competing AI projects, the definition of Explanation holds up fine. In this scenario, it could be the case that asking this seemingly corrigible tool-AI about a Sovereign AI proposal is essentially equivalent to implementing that proposal.

Any definition of Explanation will necessarily be built on top of a lot of assumptions. Many of these will be unexamined implicit assumptions that the designers will not be aware of. In general, it would not be particularly surprising if one of these assumptions turns out to hold when discussing things along the lines of shutting down competing AI projects. But turns out to break when discussing a Sovereign AI proposal.

Let's take one specific example. Consider the case where the tool-AI will try to Explain any topic that it is asked about, until the person asking Understands the topic sufficiently. When asked about a Sovereign AI proposal, the tool-AI will ensure that two separate aspects of the proposal will be Understood, (i): an alignment target, and (ii): a normative moral theory according to which this alignment target is the thing that a Sovereign AI project should aim at. It turns out that Explaining a normative moral theory until the person asking Understands it, is functionally equivalent to convincing the person to adopt this normative moral theory. If the tool-AI is very good at convincing, then the tool-AI could be essentially equivalent to an AI that will implement whatever Sovereign AI proposal it is first asked to explain (with a few extra steps).

(I discussed this issue with Max Harms here)

Comment by ThomasCederborg on A necessary Membrane formalism feature · 2024-09-13T04:53:32.010Z · LW · GW

I agree that focus should be on preventing the existence of a Sovereign AI that seeks to harm people (as opposed to trying to deal with such an AI after it has already been built). The main reason for trying to find necessary features is actually that it might stop a dangerous AI project from being pursued in the first place. In particular: it might convince the design team to abandon an AI project that clearly lacks a feature that has been found to be necessary, and that would (if successfully implemented) result in an AI Sovereign that would seek to harm people. For example a Sovereign AI that wants to respect a Membrane, but where the Membrane formalism does not actually prevent the AI from wanting to hurt individuals, because the formalism lacks a necessary feature.

One reason we might end up with a Sovereign AI that seeks to harm people is that someone makes two separate errors. Let's say that Bob gains control over a tool-AI, and uses it to shut down unauthorised AI projects (Bob might for example be a single individual, or a design team, or a government, or a coalition of governments, or the UN, or a democratic world government, or something else along those lines). Bob gains the ability to launch a Sovereign AI. And Bob settles on a specific Sovereign AI design: Bob's Sovereign AI (BSAI).

Bob knows that BSAI might contain a hidden flaw. And Bob is not being completely reckless about launching BSAI. So Bob designs a Membrane, whose function is to protect individuals (in case BSAI does have a hidden flaw). And Bob figures out how to make sure that BSAI will want to avoid piercing this Membrane (in other words: Bob makes sure that the Membrane will be internal to BSAI).

Consider the case where both BSAI, and the Membrane formalism in question, each have a hidden flaw. If both BSAI and the Membrane are successfully implemented, then the result would be a Sovereign AI that seeks to harm people (the resulting AI would want to both (i): harm people, and (ii): respect the Membrane of every individual). One way to reduce the probability that such a project would go ahead is to describe necessary features.

For example: if it is clear that the Membrane that Bob is planning to use, does not have the necessary Extended Membrane feature described in the post, then Bob should be able to see that this Membrane will not offer reliable protection from BSAI (which Bob knows might be needed, because Bob knows that BSAI might be flawed).

For a given AI project, it is not certain that there exists a realistically findable necessary feature, that can be used to illustrate the dangers of the project in question. And even if such a feature is found, it is not certain that Bob will listen. But looking for necessary features is still a tractable way of reducing the probability of a Sovereign AI that seeks to harm people.

A project to find necessary features is not really a quest for a solution to AI. It is more informative to see such a project as analogous to a quest to design a bulletproof vest for Bob, who will be going into a gunfight (and who might decide to put on the vest). Even if very successful, the bulletproof vest project will not offer full protection (Bob might get shot in the head). A vest is also not a solution. Whether Bob is a medic trying to evacuate wounded people from the gunfight, or Bob is a soldier trying to win the gunfight, the vest cannot be used to achieve Bob's objective. Vests are not solutions. Vests are still very popular amongst people who know that they will be going into a gunfight.

So if you will share the fate of Bob, and if you might fail to persuade Bob to avoid a gunfight, then it makes sense to try to design a bulletproof vest for Bob (because if you succeed, then he might decide to wear it. And that would be very good if he ends up getting shot in the stomach). (the vest in this analogy is analogous to descriptions of necessary features, that might be used to convince designers to abandon a dangerous AI project. The vest in this analogy is not analogous to a Membrane)

Comment by ThomasCederborg on A necessary Membrane formalism feature · 2024-09-11T22:31:05.381Z · LW · GW

Thanks for the feedback! I see what you mean and I edited the post. (I turned a single paragraph abstract into a three paragraph Summary section. The text itself has not been changed)

Comment by ThomasCederborg on Corrigibility could make things worse · 2024-06-12T19:29:59.585Z · LW · GW

Thank you for engaging. If this was unclear for you, then I'm sure it was also unclear for others.

The post outlined a scenario where a Corrigibility method works perfectly for one type of AI (an AI that does not imply an identifiable outcome, for example a PAAI). The same Corrigibility method fails completely for another type of AI (an AI that does imply an identifiable outcome, for example PCEV). So the second AI, that does have an IO, is indeed not corrigible.

This Corrigibility method leads to an outcome that is massively worse than extinction. This bad outcome is the result of two things being true, (i): the fully Corrigible first AI made this outcome possible to reach, and (ii): since the Corrigibility method worked perfectly for the first AI, the designers falsely believed that the Corrigibility method would also work for the second AI.

The second AI wants many things. It wants to get an outcome, as close as possible to IO. The Corrigibility method resulted in the second AI also wanting many additional things (such as wanting all explanations it gives to count as AE, even if this makes the explanations less efficient. And wanting to avoid implementing anything, unless informed designers want that thing to be implemented). But in practice the Corrigibility method does not change the outcome in any way (it just adds an ``explanation step''). So I think it makes sense to say that the second AI has ``zero Corrigibility''. The first AI is completely corrigible. And if the designers had only used the Corrigibility method for the first AI, then the Corrigibility method would have worked perfectly.

This is what I was trying to communicate with the first sentence of the post:  ``A Corrigibility method that works for a Pivotal Act AI (PAAI) but fails for a CEV style AI could make things worse.''. I could have used that sentence as a title, but I decided against trying to include everything in the title. (I think it is ok to leave information out of the title, as long as the title is accurate. And the title is, in fact, accurate (because the first AI is genuinely corrigible. And things would have turned out a lot better, in the counterfactual situation where no one had developed any form of Corrigibility).)

One possible source of confusion, could be that you are perhaps interpreting this post as referring to some specific example scenario from your sequence. My post was mostly written before you posted your sequence. It is not meant as a comment on any specific AI in your sequence (which is why I don't link to your sequence). But given that you had just published your sequence, maybe you were expecting my scenario to contain a single, partially corrigible AI (without an IO). That is not the scenario that I was describing in my post.

However, I could actually make the same point using a scenario with a single AI (without an IO), that is partially Corrigible. (there exists a more general danger here, that is not strongly related to the number of AI designs involved). So, here is an attempt to make the same point using such a scenario instead. A possible title for such an alternative post would be: ``a partially corrigible AI could make things worse''. (this is also a standalone scenario. It is also not meant as a response to anything specific in your sequence)

I think that one could reasonably describe Corrigibility as being context dependent. A given AI could be fully Corrigible in one context (such as preventing competing AI projects), and not Corrigible at all in another context (such as discussing Alignment Target Analysis). I think that one could reasonably refer to such an AI as being partially Corrigible. And, as will be shown below, such an AI could lead to a massively worse than extinction outcome.

Summary: Consider the case where a design team uses a Corrigibility method to build an AI Assistant (AIA). The resulting AIA does not have an IO. When the design team try to use the AIA to prevent competing AI projects, everything works perfectly. However, when they try to use the AIA to understand Alignment Target Analysis (ATA), the Corrigibility method fails completely. Let's try two very rough analogies. Talking to the AIA about shutting down competing AI projects is very roughly analogous to using a djinn that grants wishes, while caring fully about intentions in exactly the right way. But talking to the AIA about ATA is very roughly analogous to using a djinn that grants wishes, while not caring about intentions at all (so, not a djinn with any form of preferred outcome. And not any form of ``malicious story djinn''. But also not a safe djinn).

The AIA always wants to interact in ways that counts as Acceptable Explanation (AE). When the designers ask it to shut down all hardware that is capable of running a powerful AI, it disobeys the order, and explains that human brains can, in theory, be used to run a powerful AI. When they ask it to shut down all non biological hardware that is capable of running a powerful AI, it first asks if it should delete itself, or if it should transfer itself to biological hardware. Etc. In short: it is Corrigible in this context. In particular: while talking about this topic, the definition of AE holds up.

When one of the designers asks the AIA to explain PCEV however, the definition of AE does not hold up. When the AIA was discussing the potential action of shutting down hardware, there were two components that it wanted the designers to understand, (i): the reasons for shutting down this hardware, and (ii): the effects of shutting down this hardware. In the hardware case, the designers already understood (i) reasonably well. So there was no need to explain it. When AIA is asked about PCEV, there are again two components that AIA wants the designers to understand, (i): the reasons for building PCEV, and (ii): the effects of building PCEV. PCEV is a two component thing. It is simultaneously an alignment target, and also a normative moral theory. It turns out that in this case, the designers do not actually understand (i) at all. They do not understand the normative moral theory behind PCEV. So the AIA explains this normative moral theory to the designers. If the AE definition had been perfect, this would not have been a problem. In other words: if the AIA had been completely Corrigible, then this would not have been a problem.

But it turns out that the people that designed the Corrigibility method did not in fact have a sufficiently good understanding of concepts along the lines of: ``normative moral theories'', ``explanations'', ``understanding'', etc (understanding these concepts sufficiently well was a realistic outcome. But in this scenario the designers failed to do so). As a result, the AE definition is not perfect, and the AIA is only partly Corrigible. So, the AIA ``explains'' the ``normative moral theory of PCEV'' until the designers ``understand'' it (using an explanation that counts as AE). This results in designers that feel a moral obligation to implement PCEV, regardless of what the result is. This new moral framework is robust to learning what happened. So, the result is a set of fully informed designers that are fully committed to implementing PCEV. So, the outcome is massively worse than extinction. (one way to reduce the probability of scenarios along these lines, is to make progress on ATA. Thus: even the existence of an AI that seems to be completely corrigible, is not an argument against the need to make progress on ATA. This also works as a way of gesturing at the more general point, that I was trying to gesture at in the first post)

(Wei Dai has been talking about related things for quite a while)

Comment by ThomasCederborg on Corrigibility could make things worse · 2024-06-11T01:22:39.194Z · LW · GW

The first AI is genuinely Corrigible. The second AI is not Corrigible at all. This leads to a worse outcome, compared to the case where there was no Corrigible AI. Do you disagree with the statement that the first AI is genuinely Corrigible? Or do you disagree with the statement that the outcome is worse, compared to the case where there was no Corrigible AI?

Comment by ThomasCederborg on A list of core AI safety problems and how I hope to solve them · 2024-05-03T21:44:39.585Z · LW · GW

Thank you for the clarification. This proposal is indeed importantly different from the PCEV proposal. But since some people see hurting heretics as a moral imperative, any AI that allows heretics to escape punishment will also be seen as unacceptable by at least some people. This means that the set of Pareto improvements is empty.

In other words: hurting heretics is indeed off the table in your proposal (which is an important difference compared to PCEV). However, any scenario that includes the existence of an AI that allows heretics to escape punishment is also off the table. The existence of such an AI would be seen as intrinsically bad by people that see hurting heretics as a moral imperative (for example: Gregg really does not want a world where Gregg has agreed to tolerate the existence of an unethical AI that disregards its moral duty to punish heretics). More generally: anything that improves the lives of heretics is off the table. If an outcome improves the lives of heretics (compared to the no-AI-baseline), then this outcome is also not a Pareto improvement. Because improving the lives of heretics makes things worse from the point of view of those that are deeply committed to hurting heretics.

In yet other words: it only takes two individuals to rule out any outcome that contains any improvement for any person. Gregg and Jeff are both deeply committed to hurting heretics. But their definitions of ``heretic'' differ. Every individual is seen as a heretic by at least one of them. So, any outcome that makes life better for any person is off the table. Gregg and Jeff do have to be very committed to the moral position that the existence of any AI that neglects its duty to punish heretics is unacceptable. It must for example be impossible to get them to agree to tolerate the existence of such an AI in exchange for increased influence over the far future. But a population of billions only has to contain two such people for the set of Pareto improvements to be empty.
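
Here is a minimal sketch (with made-up utility functions, names, and numbers, not from the original comment) of why the set of Pareto improvements is empty in this situation: if Gregg and Jeff each lose utility whenever anyone they consider a heretic is made better off, and between them they consider every person a heretic, then no outcome that improves any life can be a Pareto improvement over the baseline.

```python
# A toy model (hypothetical names and numbers) of the empty-Pareto-improvement
# argument: Gregg and Jeff each lose utility whenever a heretic (by their own
# definition) is made better off, and between them everyone is someone's heretic.

people = ["Gregg", "Jeff", "Steve", "Dave"]
heretics_of = {"Gregg": {"Jeff", "Steve"},   # Gregg's heretics (hypothetical)
               "Jeff": {"Gregg", "Dave"}}    # Jeff's heretics (hypothetical)

baseline = {p: 0.0 for p in people}          # welfare in the no-AI outcome

def utility(person, welfare):
    """Own welfare, minus (for Gregg and Jeff) any welfare gains of their heretics."""
    u = welfare[person]
    for heretic in heretics_of.get(person, set()):
        u -= max(welfare[heretic] - baseline[heretic], 0.0)
    return u

def is_pareto_improvement(welfare):
    return (all(utility(p, welfare) >= utility(p, baseline) for p in people)
            and any(utility(p, welfare) > utility(p, baseline) for p in people))

# Improving anyone's life (here: Steve's) fails the check, because it hurts Gregg:
candidate = dict(baseline, Steve=1.0)
print(is_pareto_improvement(candidate))  # False
```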

Another way to approach this would be to ask: What would have happened, if someone had successfully implemented a Gatekeeper AI, built on top of a set of definitions, such that the set of Pareto improvements is empty?

For the version of the random dictator negotiation baseline that you describe, this comment might actually be more relevant than the PCEV thought experiment. It is a comment on the suggestion by Andrew Critch, that it might be possible to view a Boundaries / Membranes based BATNA as having been agreed to acausally. It is impossible to reach such an acausal agreement when a group includes people like Gregg and Jeff, for the same reason that it is impossible to find an outcome that is a Pareto improvement when a group includes people like Gregg and Jeff. (that comment also discusses ideas for how one might deal with the dangers that arise when one combines people like Gregg and Jeff with a powerful and clever AI)

Another way to look at this would be to consider what it would mean to find a Pareto improvement with respect to only Bob and Dave. Bob wants to hurt heretics, and Bob considers half of all people to be heretics. Dave is an altruist that just wants people to have as good a life as possible. The set of Pareto improvements would now be made up entirely of different variations of the general situation: make the lives of non-heretics much better, and make the lives of heretics much worse. For Bob to agree, heretics must be punished. And for Dave to agree, Dave must see the average life quality as an improvement on the ``no superintelligence'' outcome. If the ``no superintelligence'' outcome is bad for everyone, then the lives of heretics in this scenario could get very bad.

More generally: people like Bob (with aspects of morality along the lines of: ``heretics deserve eternal torture in hell'') will have dramatically increased power over the far future when one uses this type of negotiation baseline (assuming that things have been patched in a way that results in a non-empty set of Pareto improvements). If everyone is included in the calculation of what counts as Pareto improvements, then the set of Pareto improvements is empty (due to people like Gregg and Jeff). And if everyone is not included, then the outcome could get very bad for many people (compared to whatever would have happened otherwise).

(adding the SPADI feature to your proposal would remove these issues, and would prevent people like Dave from being dis-empowered, relative to people like Bob. The details are importantly different from PCEV, but it is no coincidence that adding the SPADI feature removes this particular problem, for both proposals. The common denominator is that from the perspective of Steve, it is in general dangerous to encounter an AI, that has taken ``unwelcome'' or ``hostile'' preferences about Steve into account)

Also: my general point about the concept of ``fair Pareto improvements'' having counterintuitive implications in this novel context still applies (it is not related to the details of any specific proposal).

Comment by ThomasCederborg on A list of core AI safety problems and how I hope to solve them · 2024-04-28T01:48:59.279Z · LW · GW

There is a serious issue with your proposed solution to problem 13. Using a random dictator policy as a negotiation baseline is not suitable for the situation, where billions of humans are negotiating about the actions of a clever and powerful AI. One problem with using this solution, in this contexts, is that some people have strong commitments to moral imperatives, along the lines of ``heretics deserve eternal torture in hell''. The combination of these types of sentiments, and a powerful and clever AI (that would be very good at thinking up effective ways of hurting heretics), leads to serious problems when one uses this negotiation baseline. A tiny number of people with sentiments along these lines, can completely dominate the outcome.

Consider a tiny number of fanatics with this type of morality. They consider everyone else to be heretics, and they would like the AI to hurt all heretics as much as possible. Since a powerful and clever AI would be very good at hurting a human individual, this tiny number of fanatics can completely dominate negotiations. People that would be hurt as much as possible (by a clever and powerful AI) in a scenario where one of the fanatics is selected as dictator, can be forced to agree to very unpleasant negotiated positions if one uses this negotiation baseline (since agreeing to such an unpleasant outcome can be the only way to convince a group of fanatics to agree to not ask the AI to hurt heretics as much as possible, in the event that a fanatic is selected as dictator).
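
A minimal sketch (with made-up numbers, not from the original comment) of the bargaining dynamic described above: under a random dictator negotiation baseline, an ordinary person's disagreement point is an expectation over who gets picked as dictator, and even a tiny probability of a fanatic dictator (who would have the AI hurt all heretics as much as possible) can drag that expectation far below the no-AI status quo.

```python
# A toy calculation (made-up numbers) of an ordinary person's disagreement point
# under a random dictator negotiation baseline, when a tiny number of fanatics would
# (if selected as dictator) have the AI hurt all heretics as much as possible.

population = 1_000_000
fanatics = 10                        # tiny number of fanatics (hypothetical)

u_status_quo = 0.0                   # ordinary person's utility with no AI at all
u_ordinary_dictator = 1.0            # utility if some ordinary person is dictator
u_fanatic_dictator = -1_000_000.0    # utility if a fanatic dictator is selected

p_fanatic = fanatics / population
baseline = p_fanatic * u_fanatic_dictator + (1 - p_fanatic) * u_ordinary_dictator

print(f"random dictator baseline: {baseline:.2f}")   # about -9.00
print(f"no-AI status quo:         {u_status_quo:.2f}")
# The negotiated outcome only needs to beat the -9.00 baseline, not the status quo,
# so the fanatics can extract large concessions just by being part of the lottery.
```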

This post explores these issues in the context of the most recently published version of CEV: Parliamentarian CEV (PCEV). PCEV has a random dictator negotiation baseline. The post shows that PCEV results in an outcome massively worse than extinction (if PCEV is successfully implemented, and pointed at billions of humans).

Another way to look at this is to note that the concept of ``fair Pareto improvements'' has counterintuitive implications when the question is about AI goals, and some of the people involved have this type of morality. The concept was not designed with this aspect of morality in mind. And it was not designed to apply to negotiations about the actions of a clever and powerful AI. So, it should not be very surprising to discover that the concept has counterintuitive implications when used in this novel context. If some change in the world improves the lives of heretics, then this is making the world worse from the perspective of those people that would ask an AI to hurt all heretics as much as possible. For example: reducing the excruciating pain of a heretic, in a way that does not affect anyone else in any way, is not a ``fair Pareto improvement'' in this context. If every person is seen as a heretic by at least one group of fanatics, then the concept of ``fair Pareto improvements'' has some very counterintuitive implications when it is used in this context.

Yet another way of looking at this is to take the perspective of human individual Steve, who will have no special influence over an AI project. In the case of an AI that is describable as doing what a group wants, Steve has a serious problem (and this problem is present regardless of the details of the specific Group AI proposal). From Steve's perspective, the core problem is that an arbitrarily defined abstract entity will adopt preferences that are about Steve. But, if this is any version of CEV (or any other Group AI), directed at a large group, then Steve has had no meaningful influence regarding the adoption of those preferences that refer to Steve. Just like every other decision, the decision of what Steve-preferences the AI will adopt is determined by the outcome of an arbitrarily defined mapping that maps large sets of human individuals into the space of entities that can be said to want things. Different sets of definitions lead to completely different such ``Group entities''. These entities all want completely different things (changing one detail can for example change which tiny group of fanatics will end up dominating the AI in question). Since the choice of entity is arbitrary, there is no way for an AI to figure out that the mapping ``is wrong'' (regardless of how smart this AI is). And since the AI is doing what the resulting entity wants, the AI has no reason to object when that entity wants the AI to hurt an individual. Since Steve does not have any meaningful influence regarding the adoption of those preferences that refer to Steve, there is no reason for him to think that such an AI will want to help him, as opposed to want to hurt him. Combined with the vulnerability of a human individual to a clever AI that tries to hurt that individual as much as possible, this means that any group AI would be worse than extinction, in expectation.

Discovering that doing what a group wants is bad for human individuals in expectation should not be particularly surprising. Groups and individuals are completely different types of things. So, this should be no more surprising than discovering that any reasonable way of extrapolating Dave will lead to the death of every single one of Dave's cells. Doing what one type of thing wants might be bad for a completely different type of thing. And aspects of human morality along the lines of ``heretics deserve eternal torture in hell'' show up throughout human history. They are found across cultures, and religions, and continents, and time periods. So, if an AI project is aiming for an alignment target that is describable as ``doing what a group wants'', then there is really no reason for Steve to think that the result of a successful project would want to help him, as opposed to want to hurt him. And given the large ability of an AI to hurt a human individual, the success of such a project would be massively worse than extinction (in expectation).

The core problem, from the perspective of Steve, is that Steve has no control over the adoption of those preferences, that refer to Steve. One can give each person influence over this decision, without giving anyone any preferential treatment (see for example MPCEV in the post about PCEV, mentioned above). Giving each person such influence, does not introduce contradictions, because this influence is defined in ``AI preference adoption space'', not in any form of outcome space. This can be formulated as an alignment target feature that is necessary, but not sufficient, for safety. Let's refer to this feature as the: Self Preference Adoption Decision Influence (SPADI) feature. (MPCEV is basically what happens, if one adds the SPADI feature to PCEV. Adding the SPADI feature to PCEV, solves the issue, illustrated by that thought experiment)

The SPADI feature is obviously very underspecified. There will be lots of border cases whose classification will be arbitrary. But there still exist many cases where it is in fact clear that a given alignment target does not have the SPADI feature. Since the SPADI feature is necessary, but not sufficient, these clear negatives are actually the most informative cases. In particular, if an AI project is aiming for an alignment target that clearly does not have the SPADI feature, then the success of this AI project would be worse than extinction, in expectation (from the perspective of a human individual that is not given any special influence over the AI project). While there are many border cases regarding what alignment targets could be described as having the SPADI feature, CEV is an example of a clear negative (in other words: there exists no reasonable set of definitions according to which there exists a version of CEV that has the SPADI feature). This is because building an AI that is describable as ``doing what a group wants'' is inherent in the core concept of building an AI that is describable as: ``implementing the Coherent Extrapolated Volition of Humanity''.

In other words: the field of alignment target analysis is essentially an open research question. This question is also (i): very unintuitive, (ii): very underexplored, and (iii): very dangerous to get wrong. If one is focusing on necessary, but not sufficient, alignment target features, then it is possible to mitigate dangers related to someone successfully hitting a bad alignment target, even if one does not have any idea of what it would mean for an alignment target to be a good alignment target. This comment outlines a proposed research effort aimed at mitigating this type of risk.

These ideas also have implications for the Membrane concept, as discussed here and here.

(It is worth noting explicitly that the problem is not strongly connected to the specific aspect of human morality discussed in the present comment (the ``heretics deserve eternal torture in hell'' aspect). The problem is about the lack of meaningful influence regarding the adoption of self-referring preferences. In other words, it is about the lack of the SPADI feature. It just happens to be the case that this particular aspect of human morality is both (i): ubiquitous throughout human history, and also (ii): well suited for constructing thought experiments that illustrate the dangers of alignment target proposals that lack the SPADI feature. If this aspect of human morality disappeared tomorrow, the basic situation would not change (the illustrative thought experiments would change. But the underlying problem would remain. And the SPADI feature would still be necessary for safety).)

Comment by ThomasCederborg on Agent membranes/boundaries and formalizing “safety” · 2024-03-23T14:48:10.238Z · LW · GW

I think it is very straightforward to hurt human individual Steve without piercing Steve's Membrane. Just create and hurt minds that Steve cares about. But don't tell him about it (in other words: ensure that there is zero effect on predictions of things inside the membrane). If Bob knew Steve before the Membrane enforcing AI was built, and Bob wants to hurt Steve, then Bob presumably knows Steve well enough to know what minds to create (in other words: there is no need to have any form of access, to any form of information, that is within Steve's Membrane). And if it is possible to build a Membrane enforcing AI, it is presumably possible to build an AI that looks at Bob's memories of Steve, and creates some set of minds, whose fate Steve would care about. This does not involve any form of blackmail or negotiation (and definitely nothing acausal). Just Bob who wants to hurt Steve, and remembers things about Steve from before the first AI launch.

One can of course patch this. But I think there is a deeper issue in one specific case that I think is important. Specifically: the case where the Membrane concept is supposed to protect Steve from a clever AI that wants to hurt Steve. Such an AI can think up things that humans cannot think up. In this case, patching all human-findable security holes in the Membrane concept will probably be worthless for Steve. It's like trying to keep an AI in a box by patching all human-findable security holes. Even if it were known that all human-findable security holes had been fully patched, I don't think that it changes things from the perspective of Steve, if a clever AI tries to hurt him (whether the AI is inside a box, or Steve is inside a Membrane). This matters if the end goal is to build CEV. Specifically, it means that if CEV wants to hurt Steve, then the Membrane concept can't help him.

Let's consider a specific scenario. Someone builds a Membrane AI, with all human-findable safety holes fully patched. Later, someone initiates an AI project whose ultimate goal is to build an AI that implements the Coherent Extrapolated Volition of Humanity. This project ends up successfully hitting the alignment target that it is aiming for. Let's refer to the resulting AI as CEV.

One common aspect of human morality, is often expressed in theological terms, along the lines of ``heretics deserve eternal torture in hell''. A tiny group of fanatics, with morality along these lines, can end up completely dominating CEV. I have outlined a thought experiment, where this happens to the most recently published version of CEV (PCEV). It is probably best to first read this comment, that clarifies some things talked about in the post that describes the thought experiment (including important clarifications regarding the topic of the post, and regarding the nature of the claims made, as well as clarifications regarding the terminology used).

So, in this scenario, PCEV will try to implement some outcome along the lines of LP. Now the Membrane concept has to protect Steve from a very clever attacker that can presumably easily just go around whatever patches were used to plug the human-findable safety holes. Against such an attacker, it's difficult to see how a Membrane will ever offer Steve anything of value (similar to how it is difficult to see how putting PCEV in a human constructed box would offer Steve anything of value).

I like the Membrane concept. But I think that the intuitions that seems to be motivating it, should instead be used to find an alternative to CEV. In other words, I think the thing to aim for, is an alignment target, such that Steve can feel confident, that the result of a successful project, will not want to hurt Steve. One could for example use these underlying intuitions, to try to explore alignment targets along the lines of MPCEV, mentioned in the above post (MPCEV is based on giving each individual, meaningful influence, regarding the adoption, of those preferences, that refer to her. The idea being that Steve needs to have meaningful influence, regarding the decision, of which Steve-preferences, an AI will adopt). Doing so means that one must abandon the idea of building an AI, that is describable as doing what a group wants (which in turn, means that one must give up on CEV as an alignment target).

In the case of any AI that is describable as doing what a group wants, Steve has a serious problem (and this problem is present regardless of the details of the specific Group AI proposal). From Steve's perspective, the core problem is that an arbitrarily defined abstract entity will adopt preferences that are about Steve. But if this is any version of CEV (or any other Group AI) directed at a large group, then Steve has had no meaningful influence regarding the adoption of those preferences that refer to Steve. Just like every other decision, the decision of which Steve-preferences to adopt is determined by the outcome of an arbitrarily defined mapping, which maps large sets of human individuals into the space of entities that can be said to want things. Different sets of definitions lead to completely different such ``Group entities''. These entities all want completely different things (changing one detail can for example change which tiny group of fanatics ends up dominating the AI in question). Since the choice of entity is arbitrary, there is no way for an AI to figure out that the mapping ``is wrong'' (regardless of how smart this AI is). And since the AI is doing what the resulting entity wants, the AI has no reason to object when that entity wants to hurt an individual. Since Steve does not have any meaningful influence regarding the adoption of those preferences that refer to Steve, there is no reason for him to think that this AI will want to help him, as opposed to want to hurt him. Combined with the vulnerability of a human individual to a clever AI that tries to hurt that individual as much as possible, this means that any Group AI would be worse than extinction in expectation. Discovering that doing what humanity wants is bad for human individuals in expectation should not be particularly surprising. Groups and individuals are completely different types of things. So this should be no more surprising than discovering that any reasonable way of extrapolating Dave will lead to the death of every single one of Dave's cells.

One can of course give every individual meaningful influence regarding the adoption of those preferences that refer to her (as in MPCEV, mentioned in the linked post). So Steve can be given this form of protection without giving Steve any form of special treatment. But this means that one has to abandon the core concept of CEV.

I like the membrane concept on the intuition level. On the intuition level, it sort of rhymes with the MPCEV idea of giving each individual meaningful influence regarding the adoption of those preferences that refer to her. I'm just noting that it does not actually protect Steve from an AI that already wants to hurt Steve. However, if the underlying intuition that seems to me to be motivating this work is instead used to look for alternative alignment targets, then I think it might be very useful for safety (by finding an alignment target such that a successful project would result in an AI that does not want to hurt Steve in the first place). So I don't think the Membrane concept can protect Steve from a successfully implemented CEV, in the unsurprising event that CEV will want to hurt Steve. But if CEV is dropped as an alignment target, and the underlying intuition behind this work is directed towards looking for alternative alignment targets, then I think the intuitions that seem to be motivating this work would fit very well with the proposed research effort described in the comment linked above.

(this is a comment about dangers related to successfully hitting a bad alignment target. It is for example not a comment about dangers related to a less powerful AI, or dangers related to projects that fail to hit an alignment target. Those are very different types of dangers. So my proposed idea of using the underlying intuitions to look for alternative alignment targets should be seen as complementary: it can be done in addition to looking for Membrane related safety measures that can protect against other forms of AI dangers. In other words: if some scenario does not involve a clever AI that already wants to hurt Steve, then nothing I have said implies that the Membrane concept will be insufficient for protecting Steve. In yet other words: using the Membrane concept as a basis for constructing safety measures might be useful in general. It will however not help Steve if a clever AI is actively trying to hurt Steve)

Comment by ThomasCederborg on «Boundaries», Part 1: a key missing concept from utility theory · 2024-03-23T14:25:15.631Z · LW · GW

For a set of typical humans who are trying to agree on what an AI should do, there does not exist any fallback option that is acceptable to almost everyone. For each fallback option, there exists a large number of people who will find that option completely unacceptable on moral grounds. In other words: when trying to agree on what an AI should do, there exists no place that people can walk away to that will be seen as safe or acceptable by a large majority of people.

Consider the common aspect of human morality that is sometimes expressed in theological terms as: ``heretics deserve eternal torture in hell''. This is a common normative position (an aspect of morality) that shows up throughout human history. It is found across cultures, religions, regions, and time periods. Consider Steve, for whom this is a central aspect of his morality. His morality is central to his self image, and he classifies most people as heretics. The scenario where the world is organised by an AI that does not punish heretics is thus seen as a moral abomination. In other words, such a scenario is a completely unacceptable fallback option for Steve (in other words, Steve would reject any AI where this is a negotiation baseline). Hurting heretics is a non-negotiable moral imperative for Steve. In yet other words: if Steve learns that the only realistic path to heretics being punished is for the AI to do the punishing, then no fallback position where the AI allows them to avoid punishment is acceptable.

Bob has an even more common position: not wanting to be subjected to a clever AI that tries to hurt Bob as much as possible (in many cases, purely normative considerations would be sufficient for strong rejection).

There is simply no way to create a well defined fallback option that is acceptable to both Steve and Bob. When implementing an AI that gets its goal from a set of negotiating human individuals, different types of bargaining or negotiation rules imply a wide variety of different BATNAs. No such AI will be acceptable to both Steve and Bob, because none of the negotiation baselines will be acceptable to both Steve and Bob. If Bob is not tortured in the BATNA, then the BATNA is completely unacceptable to Steve. And if Bob is tortured, then it is completely unacceptable to Bob. In both cases, the rejection is made on genuinely held, fully normative, non-strategic grounds. In both cases, this normative rejection cannot be changed by any veil of ignorance (unless that veil transforms people into something that the original person would find morally abhorrent).

In yet other words: there exists no BATNA that a set of humans would agree to under a veil of ignorance. If the BATNA involves Bob getting tortured, then Bob will refuse to agree. If the BATNA does not involve Bob getting tortured, then Steve will refuse to agree. Thus, for each possible BATNA, there exists a large number of humans who will refuse to agree to it (as a basis for AI negotiations), under any coherent veil of ignorance variant.
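To make this incompatibility concrete, here is a minimal toy sketch (my own construction, not taken from any of the linked posts). The two acceptability predicates, the ``bob_is_tortured'' flag, and the candidate fallback options are all hypothetical stand-ins; the only point is that the two acceptability sets do not intersect, so no candidate BATNA survives both checks.

```python
# Toy illustration (hypothetical names and options, not from the original
# argument): enumerate candidate fallback options and check each against two
# hard normative constraints. The acceptability sets do not intersect.

def acceptable_to_bob(option):
    # Bob rejects any fallback option in which Bob is tortured.
    return not option["bob_is_tortured"]

def acceptable_to_steve(option):
    # Steve's (assumed) non-negotiable moral imperative: heretics must be
    # punished, and Bob counts as a heretic in this toy setup.
    return option["bob_is_tortured"]

candidate_batnas = [
    {"name": "AI leaves heretics alone", "bob_is_tortured": False},
    {"name": "AI punishes heretics", "bob_is_tortured": True},
]

for option in candidate_batnas:
    print(
        option["name"],
        "| acceptable to Bob:", acceptable_to_bob(option),
        "| acceptable to Steve:", acceptable_to_steve(option),
    )

# No candidate is acceptable to both. A veil of ignorance does not change
# this, on the assumption that the veil leaves both basic moral frameworks
# intact.
```

Adding more candidate options, or more people with moralities along the lines of Steve's, only makes the intersection harder to achieve.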

This conclusion is reached by just skimming the surface of the many, many different types of minds that exist within a set of billions of humans. The aspect in question is one of the most common aspects of human morality, and it is completely incompatible with the existence of a fallback position that is acceptable to a large majority. So I don't think there is any hope of finding a set of patches that will work for every unusual type of mind that exists in a population of billions (even if one is willing to engage in some very creative definitional acrobatics regarding what counts as agreement).

My analysis here is that this aspect of human morality implies a necessary (but not sufficient) feature that any alignment target must have for that alignment target to be preferable to extinction. Specifically: the AI in question must give each person meaningful influence regarding the adoption of those preferences that refer to her. We can provisionally refer to this feature as: Self Preference Adoption Decision Influence (SPADI). This is obviously very underspecified, and there will be lots of border cases whose classification will be arbitrary. But there still exist many cases where it is in fact clear that a given alignment target does not have the SPADI feature. Since the feature is necessary but not sufficient, these clear negatives are actually the most informative cases. In particular, if an AI project is aiming for an alignment target that clearly does not have the SPADI feature, then the success of this AI project would be worse than extinction in expectation (from the perspective of a typical human individual who does not share Steve's type of morality, and who is not given any special influence over the AI project).
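As a rough illustration of what the SPADI feature is pointing at (and only as an illustration: this is my own toy formalisation, and it models ``meaningful influence'' as a blunt per-person veto, which is cruder than anything argued for here), consider the following sketch. All names, preferences, and the approval rule are hypothetical.

```python
# A minimal sketch of the SPADI feature, assuming a blunt veto model of
# ``meaningful influence''. All names, preferences, and the approval rule
# below are hypothetical stand-ins.

from dataclasses import dataclass

@dataclass
class Preference:
    description: str
    refers_to: str    # the person this preference is about
    proposed_by: str

def spadi_filter(preferences, approves):
    """Keep only those preferences whose referenced person approves of them.

    `approves(person, preference)` stands in for ``meaningful influence''.
    Modelling it as a simple veto is cruder than anything the comment
    commits to, but it makes the structure of the feature visible.
    """
    return [p for p in preferences if approves(p.refers_to, p)]

# Hypothetical example usage:
prefs = [
    Preference("Steve should be hurt as much as possible", "Steve", "Bob"),
    Preference("Steve's stated requests should be respected", "Steve", "Steve"),
]

def approves(person, pref):
    # Toy rule: a person vetoes any preference about her that aims to hurt her.
    return "hurt" not in pref.description

print([p.description for p in spadi_filter(prefs, approves)])
# Only the second preference survives the filter.
```

The border cases mentioned above would show up here as disagreements about what the approval function should be; the clear negatives are the designs where no such function exists at all, because no individual is consulted about the preferences that refer to her.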

If a project has the SPADI feature, then this implies a BATNA that will be completely unacceptable to Steve on moral grounds (because any given heretic will presumably veto the adoption of those preferences that demand that she be hurt as much as possible). But I think that disappointing Steve is unavoidable when constructing a non-bad alignment target. Steve is determined to demand that any AI must hurt most people as much as possible. And this is a deeply held normative position that is core to Steve's self image. As long as Steve is still Steve in any coherent sense, Steve will hold on to this rejection, regardless of what veils of ignorance one puts him under. So, if an AI is implemented that does satisfy Steve (in any sense), then the outcome is known to be massively worse than extinction for essentially any human individual that is classified as a heretic by Steve (in other words: for most people). Thus, we should not look for solutions that satisfy Steve. In fact, we can rephrase this as a necessary feature that any BATNA must have: Steve must find this BATNA morally abhorrent, and Steve must categorically reject it regardless of what type of coherent veil of ignorance is employed. I think the SPADI feature is more informative, but if one is listing necessary features of a BATNA, then this is one such feature (and this feature can perhaps be useful for illustrating the fact that we are looking for features that are not supposed to be sufficient).

Another way to approach this is to note that there exist many different definitions of heretic. If Gregg and Steve see each other as heretics, satisfying both is simply not possible (the specific definition of heretic is also central to the self image of both Steve and Gregg, so no coherent formulation of a veil of ignorance will help Steve and Gregg agree on a BATNA). Satisfying any person with a morality along the lines of Steve's implies an outcome far worse than extinction for most people. Satisfying all people along the lines of Steve is also impossible, even in principle. Thus, it is difficult to see what options there are, other than simply giving up on trying to give Steve an acceptable fallback option (in other words: we should look for a set of negotiation rules that imply a BATNA that is completely unacceptable to Steve on fully normative, moral grounds. In yet other words: a necessary feature of any veil of ignorance mediated, acausal agreement is that it is strongly rejected by every person with a morality along the lines of Steve's). Thus, the fact that Steve would find any AI project with the SPADI feature morally abhorrent is not an argument against the SPADI feature. (both Steve and Gregg would obviously reject any notion that they have similar moralities. This is an honest, non-strategic, and strong rejection. And this rejection would remain under any coherent veil of ignorance. But there is not much that could, or should, be done about this)

The SPADI feature is incompatible with building an AI that is describable as ``doing what a group wants''. Thus, the SPADI feature is incompatible with the core concept of CEV. In other words: the SPADI feature is incompatible with building an AI that, in any sense, is describable as implementing the Coherent Extrapolated Volition of Humanity. So, accepting this feature means abandoning CEV as an alignment target. In yet other words: if some AI gives each individual meaningful influence regarding the decision of which preferences to adopt that refer to her, then we know that this AI is not a version of CEV. In still other words: while there are many border cases regarding which alignment targets could be described as having the SPADI feature, CEV is an example of a clear negative (because doing what a group wants is inherent in the core concept of building an AI that is describable as implementing the Coherent Extrapolated Volition of Humanity).

Discovering that building an AI that does what a group wants the AI to do would be bad for the individuals involved should in general not be particularly surprising (even before taking any facts about those individuals into account), because groups and individuals are completely different types of things. There is no reason to be surprised when doing what one type of thing wants turns out to be bad for a completely different type of thing. It would for example not be particularly surprising to discover that any reasonable way of extrapolating Dave will lead to all of Dave's cells dying. In other words, there is no reason to be surprised if one discovers that Dave would prefer that Dave's cells not survive.

Similarly, there is no reason to be surprised when one discovers that all reasonable ways of defining ``doing what a group wants'' are bad for the individuals involved. A group is an arbitrarily defined abstract entity. Such an entity is pointed at using an arbitrarily defined mapping from billions of humans into the set of entities that can be said to want things. Different mappings imply completely different entities that all want completely different things (a slight definition change can for example lead to a different BATNA, which in turn leads to a different group of fanatics dominating the outcome). Since the choice of which specific entity to point to is fully arbitrary, no AI can discover that the mapping pointing to such an entity ``is incorrect'' (regardless of how smart this AI is). That doing what this entity wants is bad for individuals is not particularly surprising (because groups and individuals are completely different types of things). And an AI that does what such an entity wants it to do has no reason whatsoever to object if that entity wants the AI to hurt individuals. So, discovering that doing what such an entity wants is bad for individuals should in general not be surprising (even before we learn anything at all about the individuals involved).
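A toy sketch of the ``arbitrary mapping'' point (again my own construction, with made-up utility numbers): two aggregation rules that differ in one seemingly minor detail, raw summation versus per-person rescaling, point at ``group entities'' that want different outcomes. Neither mapping is more correct than the other, which is the sense in which no AI could discover that one of them ``is wrong''.

```python
# Toy illustration (my construction, not the author's): two superficially
# similar ways of mapping a set of individual utility functions onto a single
# ``group entity'' can end up wanting different outcomes.

outcomes = ["X", "Y", "Z"]

# Hypothetical individual utilities over the outcomes.
individual_utilities = {
    "person_1": {"X": 10.0, "Y": 1.0, "Z": 0.0},
    "person_2": {"X": 0.0,  "Y": 1.0, "Z": 0.9},
    "person_3": {"X": 0.0,  "Y": 1.0, "Z": 0.9},
}

def group_entity_sum(utilities):
    # Mapping 1: the group wants whatever maximises the raw sum of utilities.
    return max(outcomes, key=lambda o: sum(u[o] for u in utilities.values()))

def group_entity_normalised(utilities):
    # Mapping 2: the same idea, except each person's utilities are first
    # rescaled to the [0, 1] range (one arbitrary-seeming detail changed).
    def rescale(u):
        lo, hi = min(u.values()), max(u.values())
        return {o: (u[o] - lo) / (hi - lo) for o in outcomes}
    rescaled = {p: rescale(u) for p, u in utilities.items()}
    return max(outcomes, key=lambda o: sum(u[o] for u in rescaled.values()))

print(group_entity_sum(individual_utilities))         # prints X
print(group_entity_normalised(individual_utilities))  # prints Y
```

The same arbitrariness is present when a mapping of this general type is applied to billions of people; it is just harder to see.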

We now add three known facts about humans and AI designs. (i): a common aspect of human morality is the moral imperative to hurt other humans (if the AI discovers that no one else will hurt heretics, then presumably the moral imperative passes to the AI; an obvious way of translating ``eternal torture in hell'' into the real world is to interpret it as a command to hurt as much as possible). (ii): a human individual is very vulnerable to a clever AI trying to hurt her as much as possible (and this is true for both selfish and selfless humans). (iii): if the AI is describable as a Group AI, then no human individual has any meaningful influence regarding the adoption of those preferences that refer to her (if the group is large, and if the individual in question is not given any special treatment). These three facts very strongly imply that any Group AI would be far worse than extinction, for essentially any human individual, in expectation.

I have outlined a thought experiment that might help make things a bit more concrete. It shows that a successful implementation of the most recently published version of CEV (PCEV) would lead to an outcome that would be far, far worse than extinction. It is probably best to first read this comment, which clarifies some things discussed in the post that describes the thought experiment (the comment includes important clarifications regarding the topic of the post and the nature of the claims made, as well as clarifications regarding the terminology used).

To me, it looks like the intuitions that are motivating you to explore the membrane concept are very compatible with the MPCEV proposal in the linked post (which modifies PCEV in a way that gives each individual meaningful influence regarding the adoption of those preferences that refer to her). If CEV is abandoned, and the Membrane concept is used to describe or look for alternative alignment targets, then I think this perspective might fit very well with my proposed research effort (see the comment mentioned above for details on this proposed research effort).

Comment by ThomasCederborg on A problem with the most recently published version of CEV · 2024-03-23T14:02:05.896Z · LW · GW

This comment is trying to clarify what the post is about, and by extension clarify which claims are made. Clarifying terminology is an important part of this. Both the post and my research agenda are focused on the dangers of successfully hitting a bad alignment target. This is one specific subset of the existential threats that humanity faces from powerful AI. Let's distinguish the danger being focused on from other types of dangers by looking at a thought experiment with an alignment target that is very obviously bad. A well intentioned designer named Bill, who shares your values, sincerely believes that it is a good idea to start an AI project aiming for a Suffering Maximising AI (SMAI). Bill's project might result in an AI that wants to do something along the lines of building lots of tiny pictures of sad faces, due to a failure to get the AI to want to maximise a reasonable definition of Suffering. This type of scenario is not what I am focused on. No claim made in the post refers to any scenario along those lines. In this case, Bill's project is described as having failed. In other words: the aimed for alignment target was not hit. These types of dangers are thus out of scope of both the post and my research agenda. It is a very different type of danger compared to the dangers that come from scenarios where Bill succeeds, and successfully hits the SMAI alignment target. Success would result in an AI that wants to maximise something that can reasonably be described as suffering (and little pictures of sad faces is not a reasonable way of defining suffering). It is also possible that Bill's project results in an AI that wants to build a lot of molecular Squiggles, due to an issue that has nothing to do with any definition of Suffering (this is also a failure, and it is also out of scope).

Dangers along the lines of Squiggles and pictures of sad faces can be dealt with by improving alignment techniques, or by explaining to Bill that he does not currently know how to hit an alignment target. All strategies along those lines are out of scope, simply because nothing along those lines can ever be used to mitigate the dangers that come from successfully hitting a bad alignment target. Nothing along those lines can help in scenarios where it would be bad for the project to succeed (which is the case for Bill's SMAI project). Showing Bill that he does not know how to hit an alignment target stops working when Bill learns how to hit an alignment target. And helping Bill improve his alignment techniques dramatically increases the dangers that come from successfully hitting a bad alignment target. The best way of dealing with such dangers is to show Bill that it would be bad for his project to succeed. In other words: to show Bill that his favoured alignment target is a bad alignment target. My research effort, which the post is a part of, is focused on analysing alignment targets. The reason that such analysis is needed is that not all bad alignment targets are as obviously bad as SMAI. The post can be used to illustrate the fact that alignment target analysis can be genuinely unintuitive.

In other words: the purpose of the proposed research effort is not to help Bill successfully implement SMAI, or to help Bill find a good description of SMAI, or to convince Bill to aim for a different version of SMAI. Instead, the purpose of the proposed alignment target analysis research effort is to help Bill see that he is aiming for a bad alignment target, so that he will choose to abandon his SMAI project.

One could add a safety measure to Bill's SMAI project that might give Bill a chance to try again in case of a bad outcome. This can reduce many different types of dangers, including dangers from successfully hitting a bad alignment target (less formally: one can add a do over button). But since no such safety measure is guaranteed to work, no such measure can make it ok for Bill to start his SMAI project. And, crucially, no such safety measure can be relevant to the question of whether or not SMAI is a bad alignment target. Thus, analysing such safety measures does not count as analysing alignment targets. Since they are not relevant to the proposed research effort, analysing such safety measures is also out of scope. It would not be particularly surprising if analysing such safety measures turns out to be more important than analysing alignment targets. But this does not change the fact that proposed safety measures are irrelevant when analysing alignment targets. In other words: they are a separate, and complementary, method of reducing the type of danger that I am focused on. (less formally: a do over button might save everyone from an outcome massively worse than extinction. So do over buttons are definitely not useless in general. They are just useless for the specific task of analysing alignment targets. And since they can never be guaranteed to work, they can never replace alignment target analysis. A do over button can also be added to many different proposed alignment targets, which means that it is not very helpful when one is comparing two alignment targets)

I don't use the term alignment target for proposals along the lines of a Pivotal Act AI (PAAI), or for any other AI whose function is similar to the typical functions of proposed treaties (such as buying time). So, CEV is an alignment target, but no proposal along the lines of a PAAI is referred to as an alignment target. To get a bit more concrete regarding what counts as an alignment target in my terminology, let's say that a set of designers have an AI proposal called AIX. If AIX is the type of AI that implies a decision regarding which goal to give to a successor AI, then AIX counts as an alignment target. The act of building AIX implies a choice of alignment target. So, to build AIX, one must be able to analyse alignment targets.

If, on the other hand, the designers do not plan for AIX to play a role in the decision process regarding what goal to give to a successor AI, then AIX does not count as an alignment target. In this latter case, building AIX might be possible without being able to analyse alignment targets (and building AIX at a point in time when one does not know how to analyse alignment targets might be a good idea). But the need to analyse alignment targets would remain, even if this latter type of project is successful.

If AIX is an alignment target, then adding something along the lines of a last judge off switch to AIX does not change the fact that AIX is an alignment target. An AI project might add such a feature specifically because the designers know that the aimed for alignment target might be bad. In this case, the project is still described as aiming for a bad alignment target. Such designers are not certain about the alignment target in question, and they have taken precautions in case it turns out to be a bad alignment target. So it might not be possible to discover anything that would be very surprising to the designers. But it is still possible to discover that the aimed for alignment target is bad. This might still help a group of not-very-surprised designers reach the not-very-surprising conclusion that they should abandon the project (since this type of add on might fail, no such add on can ever make it reasonable to start a project that is aiming for a bad alignment target. In other words: it was wrong of these designers to start this project. But in this scenario, the designers started the project while thinking that the alignment target might be good. So there is nothing in this scenario that strongly indicates that it will be impossible to reason with them, in case one discovers that their alignment target is bad).

A PAAI project might be one part of a larger project plan that at a later stage calls for some other AI project, which will be trying to hit an agreed upon alignment target. This is similar to how a treaty might be part of such a larger project plan. This larger project counts as aiming for an alignment target. But the alignment target is solely determined by the details of the second AI project. The details of any treaty / PAAI / etc are, in my terminology, irrelevant to the question: what alignment target is the larger project aiming for?

Dangers related to successfully hitting a bad alignment target are a different class of dangers, compared to dangers related to a project that fails to hit a given alignment target. Preventing dangers related to successfully hitting a bad alignment target requires a dedicated research effort, because some important measures that can reduce this risk do not reduce other AI related risks (and are thus unlikely to be found without a dedicated effort). Specifically, one way to reduce this specific risk is to analyse alignment targets with a view to finding some set of features that are necessary for safety. I propose to take the perspective of one individual who is not given any special treatment. One can then look for a set of necessary features that the alignment target of a given project must have, for it to be rational for such an individual to support the project. This should be done while making as few assumptions as possible regarding this individual. Such a set would help someone who is trying to construct a new alignment target. Since these features are all of the necessary-but-not-sufficient type, the fact that a project has them all is not enough to convince anyone to support the project. But they can help an individual (for whom none of the assumptions made are false) decide to oppose a project that is aiming for an alignment target that lacks one of the features in such a set. The post describes one such necessary feature: that each individual is given meaningful influence regarding the adoption of those preferences that refer to her. In many cases, it will be unclear whether or not some proposal has this feature. But in some cases, it will be clear that a given proposal does not have this feature. And since it is a necessary but not sufficient feature, it is these clear negatives that are the most useful. It is for example clear that no version of CEV can have this feature. This feature is incompatible with the core concept of building an AI that is in any way describable as: implementing the Coherent Extrapolated Volition of Humanity. In other words: in order to build an AI that has this feature, one has to choose some other alignment target. This seems to be a somewhat controversial feature, possibly because of this incompatibility.

Such features are central to the proposed research effort. And the feature mentioned is the most important part of the post. But, in order to avoid controversies and distractions, this comment will not involve any further discussion of CEV. I will instead make the distinctions that I need to make, by relying exclusively on examples with other alignment targets (whose status as bad alignment targets is not being disputed).

A simple, well understood, and uncontroversial example of a bad alignment target is a Suffering Reducing AI (SRAI). SRAI cares about what happens to humans, and will not treat humans in an uncaring, strategic way. For example: SRAI would not accept an offer where the solar system is left alone in exchange for resources (with the only intervention being to prevent any other AI from being built). This rejection is not related to any specific detail or any specific definition. Any reasonable definition of suffering leads to an AI that will reject such an offer, because killing all humans is inherent in the core concept of reducing suffering. For any reasonable definition of Suffering, accepting the offer would lead to the continuation of a lot of Suffering. Unless there exists some way of using the offered resources to reduce some other, larger, source of Suffering, the offer will be rejected (for the rest of this text, it will be assumed that no such opportunity exists).

Regardless of which reasonable set of definitions is used, a successful SRAI project simply leads to an AI that rejects all such offers (and that kills everyone). In other words, if a SRAI project results in an AI that rejects the offer and kills everyone, then this behaviour is very consistent with the project having succeeded. The core issue with any SRAI project is that the project would be aiming for a bad alignment target. The path towards a successfully implemented SRAI project is simply not relevant to the question of whether or not this is a bad alignment target. If some treaty, or a Pivotal Act AI, or some other plan for buying time, or some other clever trick, was involved at some point, then this has no impact whatsoever on the question: is SRAI a bad alignment target? Similarly: if the path to a successful SRAI implementation involved a last judge off switch, which successfully prevented an issue that would have led to an AI that would have been glad to accept the offer mentioned above (because that AI would have been able to use the offered resources to build lots of molecular squiggles), then this is equally irrelevant to the question: is SRAI a bad alignment target? If some other type of do over button happens to continue to work until the project is abandoned, then this is equally irrelevant to the question: is SRAI a bad alignment target? More generally: no try-again safety measure can be relevant to this question. And no other detail of the larger project plan can be relevant to this question. (it is possible that a failed SRAI project will result in some AI that rejects the offer. Only one positive claim is made: if the resulting AI accepts the offer, then the project did not successfully hit the aimed for alignment target. Or, in other words: if the project does successfully hit the aimed for alignment target, then the resulting AI will reject the offer)

When saying that SRAI is a bad alignment target, there is no intended implication that an AI project aiming for the SRAI alignment target will lead to a bad outcome. It could be that some set of safety measures are implemented, and these could hold up until the designers see that the aimed for alignment target is a bad alignment target. Such a project is a bad idea. It is an unnecessary risk. And if it successfully hits the alignment target that it is aiming for, then it will get everyone killed. Success of this project leads to extinction (regardless of the path to success, and regardless of the details of the definitions). But there is no implied claim that the act of starting such a project would, in fact, lead to a bad outcome.

It is possible to use other definitions for things such as: ``alignment target'', and ``successfully hitting an alignment target'', and ``the project succeeding''. But the way that these phrases are being defined here, by using the various examples throughout this comment, makes it possible to refer to important concepts in a clear way. The words used to point to these concepts could be changed. But it must be possible to refer to the concepts themselves. Because they are needed in order to make important distinctions.

Now, let's return to the SMAI alignment target mentioned at the beginning of this comment. Unless something very strange happens, the AI resulting from Bill's SMAI project would accept the offer mentioned above (this is probably a good deal for the resulting AI, whether or not the project hits the aimed for alignment target. The offered resources can for example be used to create little pictures of sad faces, or Suffering, or Squiggles). Thus, such an offer cannot be used to distinguish different types of scenarios from each other in the SMAI case. For a SMAI project, it is more informative to ask what it would mean, from a human perspective, to give the resulting AI resources. In the SMAI case, this question is better for illustrating the importance of distinguishing between different types of AI related dangers.

So, consider the case where the outcome of this SMAI project is fully contained within some lifeless region. Now, let's ask what it would mean to give it more resources. Let's say that the result of a SMAI project wants to build lots of little pictures of sad faces (due to a failure to get the AI to want to maximise a reasonable definition of Suffering), or little molecular Squiggles (due to an implementation failure that is unrelated to the definition of Suffering). In these two cases, it seems like it does not matter much whether resources are destroyed or given to this AI (assuming that it is fully contained). Dangers related to outcomes along these lines are out of scope, as they are the results of failed projects. If a SMAI project succeeds, however, then it would be far better to destroy those resources than for SMAI to have them. If a SMAI project successfully hits the SMAI alignment target, the details of the definitions would also matter, possibly a great deal. Successfully hitting the SMAI alignment target can lead to a very wide range of outcomes, depending on the details of the definitions. Different sets of reasonable definitions lead to outcomes that are different in important ways (the SMAI project being successful means that the resulting AI will want to maximise some reasonable definition of suffering. Thus, the outcome of a successful SMAI project will be a bad outcome. But different definitions still lead to importantly different outcomes). The dangers involved with successfully hitting the SMAI alignment target are importantly different from the case where the project fails to hit an alignment target. One very important distinction is that every scenario in this very large set of very diverse types of scenarios is best dealt with by helping Bill see that he is aiming for a bad alignment target.

(it is possible that a failed SMAI project will result in an AI, such that preventing it from having resources is important (even under a strong containment assumption). Only one positive claim is made: If humans are essentially indifferent to giving the resulting AI resources, then the project did not successfully hit the aimed for alignment target. Or, in other words: if the project does successfully hit the aimed for alignment target, then it is important to keep resources away from the resulting AI)

Shifting resources from the result of one successful SMAI project to the result of another successful SMAI project might be very valuable. But issues along those lines are also out of scope of the proposed research effort of analysing alignment targets. In other words: the point of talking to Bill is not to help Bill find a better description of SMAI, but instead to help Bill see that he should aim for a different alignment target. These two objectives are very different. Just as it would be a bad idea to help Bill hit the SMAI alignment target, it would also be a bad idea to help Bill describe SMAI. And while different versions of SMAI might lead to importantly different outcomes, the purpose of the proposed research project is not to switch from one version of SMAI to another version of SMAI. The purpose of the proposed research effort is instead to help Bill see that SMAI is a bad alignment target.

The reason that the effort to analyse alignment targets needs a dedicated effort is that the best way to stop the two projects mentioned above is by pointing out that it would be a bad thing for them to succeed. If it is possible to stop a project in this way, then this is a better option than implementing safety measures that might (or might not) hold up until the project is abandoned. For SRAI and SMAI, the needed insights already exist. So, to stop these two projects, there is no need for further analysis. But if I take an example where there is no consensus that the alignment target in question is in fact bad, then the distinctions that I am trying to make in this comment will get lost in arguments about whether or not the alignment target in question is in fact bad. So I will stick with SRAI and SMAI.

In other words: to deal with bad alignment targets that have not yet been discovered to be bad, a dedicated effort is needed. Efforts that deal with other types of dangers do not need the types of insights that are needed to analyse alignment targets, and such efforts are thus not likely to lead to such insights. It is not known how long it would take a serious, dedicated research effort to advance to the point where the needed insights become possible to see. Partly, this is because it is not known which nice-sounding-but-actually-bad alignment targets will be proposed in the future. We also don't know how long there will be to work on this issue. Even if we reason from the assumption that some alignment target will get successfully hit, there would still be a lot of remaining uncertainty. We would still not know how much time there will be to find the needed insights. Since we don't know how much time it would take, or how much time there will be, there is no particular reason to think that the effort will be completed in time. Combined with the stakes involved, this implies urgency. Despite this urgency, there currently exists no serious research project that is dedicated to alignment target analysis. One positive thing that happens to be true is that, in scenarios where alignment target analysis is needed, there will probably be time to do such analysis. There is no particular reason to think that there will be enough time, but it seems very likely that there will be a significant amount of time. Another positive thing that happens to be true is that one can prevent these types of dangers without arriving at any actual answer (the dangers from a given bad alignment target can for example be mitigated by noticing that it is a bad alignment target. This can be done without having any example of a good alignment target). Yet another positive thing that also happens to be true is that if the needed insights are in fact generated in time, then these insights will probably only have to be explained to the types of people that are capable of hitting an alignment target.

If one views the situation from a higher level of abstraction, then the lack of a dedicated research effort is even more strange. It obviously matters what alignment target an AI project is aiming for. One way to phrase this is that it simply cannot be the case that the aimed for alignment target is both irrelevant to the outcome, and simultaneously supposed to reduce division and unify people towards a common goal. (if an alignment target has been set in stone, then one might be able to reasonably argue that detailed descriptions of this alignment target would not be particularly valuable to a given project. But such arguments are simply not applicable to suggestions that a proposed project should aim for a different alignment target)

If an AI project is aiming at a bad alignment target, then this cannot be fixed by adding some set of safety measures to the project. A set of safety measures is simply not relevant when one tries to determine whether or not the aimed for alignment target is a bad alignment target. And since no safety measure is guaranteed to work, such measures can never make it ok to launch an AI project that is aiming at a bad alignment target. A very careful SRAI project that is implementing a set of very well constructed safety measures might obviously still lead to SRAI. It is still putting everyone in unnecessary danger. One implication of this is that if someone argues that an SRAI project is aiming for a bad alignment target, then this argument cannot be countered by pointing at safety measures. It is important to emphasise that it is completely irrelevant what this set of safety measures is. Such a counterargument can always be safely dismissed out of hand, without having any idea of which specific type of do-over-button is being referred to. Such safety measures simply cannot change the fact that SRAI is a bad alignment target. (if such a do-over-button is added to a SRAI project, and this do-over-button happens to work, then it can save everyone from getting killed by SRAI. So it is definitely not useless in general. It is just useless for the specific task of analysing alignment targets. In other words: it can stop everyone from getting killed. But it cannot be used to build a counterargument when someone points out that SRAI is a bad alignment target. In yet other words: it is only the counterargument that can be safely dismissed. Some specific safety measure that such a counterargument is built around might turn out to be more important than the entire field of alignment target analysis. So dismissing such a safety measure, based on the fact that it is being used as part of an invalid counterargument, is definitely not something that can be safely done)

So far, we have talked about safety measures that might allow a designer to try again. Let's now turn our attention to what might perhaps be roughly described as a ``goal softener'' that could be added to an alignment target. In other words: a type of add on that modifies the behaviour implied by an alignment target (not an add on that allows a designer to abandon an alignment target, if it implies bad behaviour). Let's write sSRAI for a soft Suffering Reducing AI that will determine the actual outcome, and that respects the underlying intentions behind an effort to get it to (for example) ``avoid optimising too hard''. Specifically, sSRAI wants to reduce suffering, but sSRAI wants to do this in a way that respects the intentions behind the designers' attempts to get it to do soft optimisation of the SRAI alignment target.

Let's use an undramatic example scenario to make a few points that have wider applicability. Consider a successfully implemented sSRAI. sSRAI acts in a way that is genuinely in line with the underlying ideas and intentions behind an effort to get the AI to avoid optimising too hard. sSRAI wants to act in accordance with the spirit of the soft optimisation principles that the designers were trying to implement. In other words: this feature is not used as a way to buy time, or as a way to get a second chance in case of failure (less formally: it is not a speed bump, and it is not a do over button). Specifically, sSRAI will always act in the world in a way that is consistent with being uncertain regarding how precisely Suffering should be defined. And sSRAI will avoid any action that the designers would find alarming or disconcerting in any way. And sSRAI will avoid any action that would be perceived as weird or dramatic. And sSRAI would like to avoid any rapid change taking place in society or culture. And sSRAI will avoid any action that would be seen as deceptive or manipulative.

Now, let's explore what sSRAI might do in the world, to see if this has turned the SRAI alignment target into a good alignment target. sSRAI could, for example, offer a process of sSRAI guided self discovery that causes people to settle on some new way of viewing life that is far more fulfilling. There is a wide range of views that people arrive at, but almost all include the idea that it is best to not have more than one child. Every year, asking sSRAI for such guidance becomes slightly more mainstream, and every year some fraction of the population asks for this. sSRAI also declines to create any set of circumstances that would dramatically alter human society. For example, sSRAI declines to create any set of circumstances that would make humans want to go on living indefinitely (as opposed to just wanting to live for a very, very long time). sSRAI also takes actions to increase comfort and happiness, and to reduce suffering. Basically, it takes careful actions that lead to people living very long and very happy lives (this does imply gradual but significant changes. But those societal changes are seen as acceptable, as they are the least dramatic changes that can achieve the objective of significantly reducing suffering in the short term).

At no point on this path is sSRAI deviating from the spirit of the implied instructions. At no point is sSRAI deviating from what it is supposed to do: which is to reduce suffering, while adhering to the underlying intentions of avoiding drastic / weird / etc actions, rapid changes, manipulations, etc, etc. At every point on this path, all of its plans are entirely in line with genuinely wanting to respect the underlying intention of soft optimisation. For example: the process of sSRAI guided self discovery is not some trick to get people to want weird things. It is genuinely a process designed to lead to some viewpoint that in turn leads to less suffering (and it never results in any viewpoint that is a weird outlier in the set of common, pre-AI viewpoints). The process also takes the other aspirations of a given individual into account, and leads to viewpoints that additionally do things like increasing happiness or improving artistic output (since taking people's wishes into account when possible is also implied by some of the underlying intentions that led the designers to try to get the AI to do soft optimisation. Specifically: intentions related to not having tunnel vision). And no one is ever manipulated into asking for it. It is simply offered to everyone, and then word spreads that it is great. For example: sSRAI did not design the process with the design constraint that it must become dominant (including such a constraint would have violated the designers' underlying intentions. Specifically, they would see this as manipulation). But the process leads to outcomes that people will eventually want, once it stops being seen as weird to ask for it. The process also does not cause any dramatic changes to people's lives or personalities.

Additionally, and completely in line with the underlying intentions of the safety measures, sSRAI declines to enable any form of dramatic augmentation procedure (explaining, completely honestly, to anyone who wants dramatic augmentation, that this would lead both to dramatically increased suffering and to rapid societal change). sSRAI will never decide to explain any concept or fact that would result in transformative change or increased suffering (or allow any AI project, or any augmentation project, that would lead to such concepts or facts being discovered). In general, sSRAI will prevent any AI project that would have dramatic impact on society or culture (again straightforwardly in line with the underlying intentions of the designers, because they intended to prevent drastic AI actions in general, not just drastic sSRAI actions). This overall behaviour is a genuinely reasonable way of interpreting: ``exert a pushing force towards the outcome implied by the aimed for alignment target. But make sure that you always act in a way that I would describe as pushing softly''.

Even conditioned on a successful implementation of sSRAI, the specific scenario outlined above would obviously never happen (for the same reason that the specific chess moves that I can come up with are very unlikely to match the actual chess moves of a good chess player whose moves I am trying to predict). A more realistic scenario is that sSRAI would design some other path to a lifeless universe that is a lot more clever, a lot faster, a lot less obvious, and even more in line with underlying intentions. If necessary, this can be some path that simply cannot be explained to any human, or to any result of any of the augmentation procedures that sSRAI will decide to enable. A path that simply cannot be specified in any more detail than: ``I am trying to reduce suffering, for a wide range of definitions of this concept, while adhering to the underlying intentions behind the design team's efforts to prevent me from optimising too hard''. SRAI is simply a bad alignment target. And there does not exist any trick that can change this fact, even in principle (even under an assumption of a successfully implemented, and genuinely clever, trick). If a last judge off switch happens to work, then the project mentioned above might be shut down by some extrapolated version of a last judge, with no harm done (in other words: there is no claim that a sSRAI project will lead to a lifeless universe. Just a claim that the project is a bad idea to try, and that it represents an unnecessary risk, because SRAI is a bad alignment target).

In other words: the core concept of reducing suffering implies killing everyone. And this remains true even if one does manage to get an AI to genuinely want to make sure that everyone stops living using only slow, ethical, legal, non-deceptive, etc, etc, methods. Or to get the AI to use some incomprehensible strategy (that avoids sudden changes, dramatic actions, unpleasant aesthetics, manipulation, etc, etc). SRAI is unsafe because the implied outcome is unsafe. More generally: SRAI is unsafe because SRAI cares about what happens to you. This is dangerous when it is combined with the fact that you never had any meaningful influence over the decision of which you-referring preferences SRAI should adopt. Similar conclusions hold for other bad alignment targets. And not all of the implied outcomes are this undramatic, or this easy to analyse. For example: sSMAI does not have a known, easy to analyse endpoint that sSMAI is known to softly steer the future towards. But the danger implied by a soft push towards a bad future is not reduced by the fact that we are having trouble predicting the details. It makes it harder to describe the danger of a soft push in detail, but it does not reduce this danger.

We are now ready to illustrate the value of analysing alignment targets, by noting the difference between a world organised by sSRAI and a world organised by sSMAI. Let's consider an AI that genuinely wants to respect the designers' underlying intentions that motivated them to try to get the AI to avoid optimising too hard. We can refer to it as a soft AI, or sAI. Let's assume that sAI does not do anything that causes human culture to stray too far away from current value systems and behaviours. It does not take drastic actions, or manipulate, or deceive, etc, etc. It only acts in ways that it would be genuinely reasonable to view as ``pushing softly''. It still seems clearly better to have an AI that softly pushes to reduce suffering, than to have an AI that softly pushes to maximise suffering. In yet other words: analysing alignment targets will continue to matter, even if one is reasoning from some very, very optimistic set of assumptions (the difference is presumably a lot less dramatic than the difference between SMAI and SRAI. But it still seems valuable to switch from sSMAI to sSRAI).

We can actually add even more optimistic assumptions. Let's say that Gregg's Method (GM) is completely guaranteed to turn SRAI into GMSRAI. GMSRAI will create a thriving and ever expanding human civilisation, with human individuals living lives as long as would be best for them. Adding GM to any other alignment target is guaranteed to have equally dramatic effects, in a direction that is somehow guaranteed to be positive. It seems very likely that, even in the GMSRAI case, there will still be a remnant left of the push towards a lifeless universe. So, the value of analysing alignment targets would remain, even if we had Gregg's Method (because a remnant of the SRAI push would still be preferable to a remnant of the SMAI push. So finding a better alignment target would remain valuable, even if we counterfactually had someone like Gregg to help us. In other words: while finding / modifying someone into / building / etc, a Gregg would be great, this would not actually remove the need to analyse alignment targets). (and again: the reason that analysing alignment targets requires a dedicated research effort is that not all alignment targets are as easy to analyse and compare as SRAI and SMAI)

Before leaving the sSRAI thought experiment, it makes sense to make three things explicit. One is that ideas along the lines of soft optimisation have not been shown to be dead ends or bad ideas, because nothing along these lines can stop a bad alignment target from being bad, even in principle. Basically: unless the AI project is secretly trying to implement an AI that does whatever a single designer wants the AI to do (in which case the claimed alignment target is irrelevant; from the perspective of an individual without any special influence, such a project is equivalent to a ``random dictator AI'', regardless of the words used to describe the project), it will continue to matter that SRAI exerts a pushing force towards a lifeless universe. Either this push will impact the outcome, or the alignment target is an empty PR slogan (either it has an influence on the outcome, or it does not have an influence on the outcome). So, nothing said here is incompatible with some concept along the lines of soft maximisation turning out to be a genuinely valuable and important concept (because we knew that it would fail to solve the task of turning a bad alignment target into a good alignment target the second we knew what type of thing was supposed to perform what type of task).

Secondly, the fact that no one in the sSRAI scenario recognises that SRAI implies a push towards a lifeless universe makes things look strange (the same thing makes the various SMAI examples look even stranger). This is unavoidable when one constructs an example with an alignment target that is universally understood to be a bad alignment target (if the audience understands this, then the people in the thought experiment should also understand this). In other words: no sSRAI project poses any actual threat, because SRAI is already known to be a bad alignment target. The goal of analysing alignment targets is to neutralise threats from other alignment targets that are not currently known to be bad.

And finally, it makes sense to make one particular distinction explicit: the sSRAI thought experiment is not discussing a safety measure of the type that is designed to give a designer a chance to try again in case of a bad outcome (those are discussed in other parts of this comment). Thus, we have not shown anything at all regarding that version of the concept. In more straightforward, but a lot less exact, words: the sSRAI thought experiment is about the goal softener version of the concept, not the do over button version. The issue with the goal softener is that it only dilutes the bad parts. And the issue with the do over button is that it might fail. So even using both does not make it ok to start a project that aims at a bad alignment target. Goal softeners and do over buttons cannot change a bad alignment target into a good alignment target. They can however interact with each other. For example: when implementing a last judge off switch, there is a strong need to limit the extrapolation distance, because extrapolation is not intuitive for humans. This need does not mix well with goal softeners, because goal softeners can make it harder to see the bad effects of a bad goal. Seeing the push towards a lifeless universe in the sSRAI case requires no extrapolation at all, because this push is already a universally known feature, and because SRAI is already well understood. For example: everyone already knows exactly why SRAI is fundamentally different from Clippy. But when trying to see a less obvious problem, the benefit of modest extrapolation can be cancelled out by the obscuring effect of a well designed goal softener. Goal softeners can also interact with definitional issues in weird ways. The actions of SRAI are not sensitive to the definition of Suffering: all reasonable definitions lead to the same behaviour. But the actions of sSRAI might be very sensitive to the specific definition of Suffering used. Nothing discussed in this paragraph counts as analysing alignment targets, simply because no topic discussed in this paragraph can ever be relevant to any question along the lines of: is SRAI a bad alignment target? So everything in this paragraph is out of scope of the proposed research agenda.

Being able to discuss the suitability of an alignment target as a separate issue is important. Because if an AI project succeeds, and hits the alignment target that the project is aiming for, then it matters a lot what alignment target the project was aiming for. It is important to be able to discuss this as a separate question, removed from questions of strategies for buying time, or safety measures that might give a project a chance to try again, or measures that might lead some AI to settle for an approximation of the outcome implied by the alignment target, etc, etc. This is why we need the concepts that all of these examples are being used to define (the words used to refer to these concepts can of course be changed. But the concepts themselves must be possible to refer to).

Let's say that two well intentioned designers, named Dave and Bill, both share your values. Dave is proposing an AI project aiming for SRAI, and Bill is proposing an AI project aiming for SMAI. Both projects have a set of very clever safety measures. Both Dave and Bill say that if some specific attempt to describe SRAI / SMAI fails, and leads to a bad outcome, then their safety measures will keep everyone safe, and they will find a better description (and they are known to be honest. And they are known to share your definition of what counts as a bad outcome. But it is of course not certain that their safety measures will actually hold). If you can influence a decision regarding which of these two projects gets funded, then it seems very important to direct the funding away from Bill. Because the idea that clever safety measures make the alignment target irrelevant is just straightforwardly false. Even if something along the lines of soft optimisation is somehow guaranteed to work in a way that fully respects underlying intentions, it would still matter which of these two alignment targets is hit (even under the assumption of a non drastic path to a soft version of the outcome, sSRAI is still preferable to sSMAI).

A very careful sSRAI / sSMAI project can obviously lead to an implemented SRAI / SMAI that does not respect the underlying intentions of soft maximisation (for example because the resulting AI does not care at all about these intentions, due to some technical failure. Or because it only cares about something that turns out to have no significant impact on the outcome). The claim that this should be seen as a very surprising outcome of any real world project would be an extraordinary claim. It would require an extraordinarily solid argument. And even if such a solid-seeming argument were to be provided, the most likely scenario would still be that this argument is wrong in some way. The idea that an actual real world project plan is genuinely solid, in a way that makes a non soft implementation of the aimed for alignment target genuinely unlikely, does not seem plausible (even with a solid seeming argument). It seems a lot more likely that someone has constructed a solid seeming argument on top of an unexamined implicit assumption. I don't think that anyone who takes AI dangers seriously will dispute this. Conditioned on a project aiming for a bad alignment target, it is simply not possible for a careful person to rule out the scenario where the project leads to the outcome that is implied by the aimed for alignment target. Simply assuming that some safety measure will actually work is just straightforwardly incompatible with taking AI risks seriously.

It is worth pointing out that this constitutes a fully separate argument in favour of preferring a genuinely careful sSRAI project to a genuinely careful sSMAI project (an argument that is fully distinct from arguments based on preferring an outcome determined by sSRAI, to an outcome determined by sSMAI). If one acknowledges the value of switching from a genuinely careful sSMAI project to a genuinely careful sSRAI project, then it is difficult to deny the necessity of being able to analyse alignment targets.

From a higher level of abstraction, we can consider the scenario where two novel alignment targets are proposed, neither of which has any obvious flaws. This scenario takes place in the real world, and involves real people as decision makers. Thus, we cannot safely assume that these people will decide to wait indefinitely. In this scenario, it would be valuable if an alignment target analysis effort has advanced to the point where these two proposals can be meaningfully analysed and / or compared to each other (for example concluding that one is better than the other. Or concluding that both must be discarded, since they both lack at least one feature that was found to be necessary for safety while analysing other alignment targets). This remains valuable even if a separate research effort has designed multiple layers of genuinely valuable safety measures.

Conditioned on Bill's SMAI project being inevitable, safety measures might be very useful (they might hold up until Bill finally discovers that he is wrong. Until Bill finally sees that SMAI is a bad alignment target. Until Bill finally realises that the problems are inherent in the core concept. That the problems are not related to the difficulty of finding a good definition of Suffering). But regardless of what safety measures exist, Bill should not initiate an AI project that aims for the SMAI alignment target. This conclusion is completely independent of the specifics of the set of safety measures involved. In other words, it is important to separate the issue of safety measures from the question of whether or not a project is aiming for a bad alignment target. When analysing alignment targets, it makes sense to assume that the target will be successfully hit (in other words, it makes sense to assume that there will be nothing along the lines of molecular Squiggles, nothing along the lines of tiny pictures of human faces, nothing along the lines of soft optimisation, nothing along the lines of a triggered do-over-button, nothing along the lines of someone reconsidering the wisdom of aiming at this alignment target, etc, etc, etc, etc). Because if a project is launched aiming at a bad alignment target, then no person who takes AI risks seriously can dismiss the scenario where the outcome implied by this alignment target ends up getting fully implemented.

(when designing safety measures, it might make sense to imagine the mirror of this. In other words, you could imagine that you are designing these measures for an unpreventable SMAI project. A project led by a very careful designer named Bill, who shares your values, and who is aiming for SMAI due to a sincere misunderstanding. And it might make sense to assume that Bill will be very slow to update regarding the suitability of SMAI as an alignment target. That Bill has severe tunnel vision, has used his very powerful mind to almost completely insulate himself from object level critique, and has closed down essentially every avenue that might force him to admit that he is wrong. And assume that when Bill sees a bad outcome that is prevented by some safety measure that you design, then this will be explained away by Bill as being the result of a failure to describe the SMAI alignment target. One can for example imagine that Bill is the result of an augmentation process that dramatically increased technical ability and persuasion ability, but also led to extreme tunnel vision and very entrenched views on questions of alignment targets. For example because persuasion ability and ability to get things done are the result of a total and unquestioning adherence to a set of assumptions that are, on the whole, far superior to any set of assumptions used by any baseline human (but still not fully free from flaws, in the particular case of alignment target analysis). This line of reasoning is entirely about safety measure design principles. In other words: nothing in this parenthesis counts as analysing alignment targets. Thus, everything in this parenthesis is out of scope of the proposed research agenda)

Comment by ThomasCederborg on Extinction Risks from AI: Invisible to Science? · 2024-02-23T17:32:40.835Z · LW · GW

What about the term uncaring AI? In other words, an AI that would keep humans alive if offered resources to do so. This can be contrasted with a Suffering Reducing AI (SRAI), which would not keep humans alive in exchange for resources. SRAI is an example of successfully hitting a bad alignment target, which is an importantly different class of dangers compared to the dangers of an aiming failure leading to an uncaring AI. While an uncaring AI would happily agree to leave earth alone in exchange for resources, this is not the case for SRAI, because killing humans is inherent in the core concept of reducing suffering. Any reasonable set of definitions simply leads to a version of SRAI that rejects all such offers (assuming that the AI project that was aiming for the SRAI alignment target manages to successfully hit this alignment target).
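To spell out the ``rejects all such offers'' point, here is a toy comparison (entirely my own illustration; the utility functions and the numbers are assumptions made up for this sketch). The uncaring AI's objective leaves room for offered resources to tip the balance, while the SRAI objective contains no term that any payment could ever move.

```python
# Toy comparison: why an uncaring AI accepts a "keep the humans alive" trade,
# while a successfully implemented SRAI rejects it. All utility functions and
# numbers are made-up assumptions.

RESOURCES_OFFERED = 10.0  # what a neighbouring AI pays for keeping humans alive

def uncaring_ai_utility(keeps_humans_alive: bool) -> float:
    # Cares about something orthogonal to humans, so humans only matter
    # strategically: the payment more than covers the upkeep cost.
    upkeep_cost = 1.0 if keeps_humans_alive else 0.0
    payment = RESOURCES_OFFERED if keeps_humans_alive else 0.0
    return payment - upkeep_cost

def srai_utility(keeps_humans_alive: bool) -> float:
    # Cares only about reducing suffering. Resources do not appear in the
    # objective, so no offer can compensate for the suffering of living humans.
    suffering = 5.0 if keeps_humans_alive else 0.0
    return -suffering

for utility in (uncaring_ai_utility, srai_utility):
    accepts = utility(True) > utility(False)
    print(utility.__name__, "accepts the trade:", accepts)
# uncaring_ai_utility accepts the trade: True
# srai_utility accepts the trade: False
```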

The term Uncaring AI is not meant to imply that the AI does not care about anything, just that it does not care about anything that humans care about, such as human lives. This means that the question of extinction (and everything else that humans care about) is entirely determined by strategic considerations. The dangers stemming from the case where an aiming failure leads to an uncaring AI by accident are importantly different from the dangers stemming from a design team that successfully hits a bad alignment target. How about including a footnote, saying that you use Extinction as a shorthand for an outcome where humans are completely powerless, and where the fate of every living human is fully determined by an AI that does not care about anything that any human cares about? (and perhaps capitalise Extinction in the rest of the text) (and perhaps mention, in that same footnote, that if a neighbouring AI will pay such an uncaring AI to keep humans alive, then it would happily do so)

If an AI project succeeds, then it matters a lot what alignment target the project was aiming for. Different bad alignment targets imply different behaviours. Consider a Suffering Maximising AI (SMAI). If an SMAI project successfully hits the SMAI alignment target, the resulting AI would create a lot of people. This is again importantly different from the case where the project fails to provide a reasonable definition of Suffering, and the resulting AI goes on to create little molecular pictures of sad faces or something similar. This molecular-sad-face-AI is, from a human perspective, basically the same as an AI that creates a lot of molecular squiggles due to some other type of failure that is unrelated to the definition of Suffering. They are both uncaring, and will both treat humans in a fully strategic manner. And both of these uncaring AIs lead to the same outcome as a failed SRAI project would lead to (whether the SRAI project fails through definitional issues, or through something that leads to a lot of squiggles). They all treat everything that humans care about in a fully strategic manner (they will all lead to Extinction, as defined in the proposed footnote mentioned above). But a successful SMAI project would be importantly different from a successful SRAI project (which would actually lead to extinction, in the standard usage of that word, but would not lead to Extinction, as defined above). In the case of SMAI, different sets of reasonable definitions also lead to importantly different outcomes (if the definition of suffering is reasonable, then the outcome of a successful SMAI project will be a bad outcome. But different definitions still lead to importantly different outcomes). It is important to separate these dangers from the dangers stemming from an uncaring AI, because doing so allows us to explain why it matters which alignment target an AI project is aiming for.

Let's say that two well intentioned designers, named Dave and Bill, both share your values. Dave is proposing an AI project aiming for SRAI, and Bill is proposing an AI project aiming for SMAI. Both projects have a set of very clever safety measures. Both Dave and Bill say that if some specific attempt to describe SRAI / SMAI fails, and leads to a bad outcome, then their safety measures will keep everyone safe, and they will find a better description (and they are known to be honest. And are known to share your definition of what counts as a bad outcome. But it is of course not certain that their safety measures will actually hold). If you can influence the decision of which project gets funded, then it seems very important to direct the funding away from Bill. Because the idea that clever safety measures make the alignment target irrelevant is just straightforwardly false. An actual project, started in the actual real world, that is fully dedicated to the idea of carefully trying things out with soft maximisation, might of course result in non-soft maximisation of the alignment target being aimed at (and it is better if this happens to Dave's proposed project than to Bill's proposed project). Conditioned on an SMAI project being inevitable, safety measures might be very useful (they might hold up until Bill finally discovers that he is wrong. Until Bill finally sees that SMAI is a bad alignment target). Using your example safety measure, we can phrase this as: even if Bill sincerely intends to use iterative goal specification, it is still of course possible that his proposed SMAI project will end in a successful implementation of the aimed for alignment target: SMAI. It is possible that Bill will successfully use iterative goal specification to avoid catastrophe. But this is not guaranteed. Thus, Bill should still not aim for the SMAI alignment target. This conclusion is completely independent of the specifics of the set of safety measures involved. In other words, it is important to separate the issue of safety measures from the question of whether or not a project is aiming for a bad alignment target.

In yet other words: it seems important to be able to separate dangers related to successfully hitting bad alignment targets from dangers related to aiming failures. Because in the case of a project that successfully hits the alignment target that it is aiming for, it matters a lot which alignment target the project was aiming for (in the case of SMAI, the details of the definitions also influence the outcome in an important way). We need to maintain the ability to, for example, differentiate the type of extinction outcome that SRAI implies from the type of Extinction outcome that you are discussing.

For more on the dangers involved with successfully hitting the wrong alignment target, see:

A problem with the most recently published version of CEV 

(in the context of the linked philpapers paper, this is a word choice comment about the word: extinction. It is not a comment on the set of conditions that the paper identifies as necessary for the purpose of evaluating arguments about these things. If taken extremely literally, your current word choices would imply that you should add at least one more condition to your set: specifically the condition that the model must be able to talk about the strategic incentives that an uncaring AI would face (for example from a neighbouring AI, that an uncaring AI is expecting to possibly run into in the distant future). Specifically, the additional condition is necessary for evaluating A2 and / or A3 (depending on how the text is interpreted). A model without this additional condition is useless for evaluating extinction arguments, in the same sense that a model with a static moon is useless for evaluating arguments about rockets hitting the moon (they both fail on the first step). But I think that the above footnote is probably more in line with what you are trying to do with your paper. In other words: since you presumably do not want to add a strategic-environment-condition to your set of conditions, you will presumably prefer to add a footnote, and switch from extinction to Extinction (since this new condition would be necessary for evaluating arguments about extinction, but would not be necessary for evaluating arguments about Extinction))
 

Comment by ThomasCederborg on Managing risks while trying to do good · 2024-02-14T17:08:40.249Z · LW · GW

If your favoured alignment target suffers from a critical flaw that is inherent in the core concept, then surely it must be useful for you to discover this. So I assume that you agree that, conditioned on me being right about CEV suffering from such a flaw, you want me to tell you about this flaw. In other words, I think that I have demonstrated that CEV suffers from a flaw that is not related to any detail of any specific version, or any specific description, or any specific proxy, or any specific attempt to describe what CEV is, or anything else along those lines. Instead, this flaw is inherent in the core concept of building an AI that is describable as ``doing what a Group wants''. The Suffering Reducing AI (SRAI) alignment target is known to suffer from this type of a core flaw. The SRAI flaw is not related to any specific detail of any specific version, or proxy, or attempt to describe what SRAI is, etc. And the flaw is not connected to any specific definition of ``Suffering''. Instead, the tendency to kill everyone is inherent in the core concept of SRAI. It must surely be possible for you to update the probability that CEV also suffers from a critical flaw of this type (a flaw inherent in the core concept). SRAI sounds good on the surface, but it is known to suffer from such a core flaw. Thus, the fact that CEV sounds good on the surface does not rule out the existence of such a core flaw in CEV.

I do not think that it is possible to justify making no update when discovering that the version of CEV that you linked to implies an outcome that would be far, far worse than extinction. I think that the probability must go up, that CEV contains a critical flaw inherent in the core concept. Outcomes massively worse than extinction are not an inherent feature of any conceivable detailed description of any conceivable alignment target. To take a trivial example, such an outcome is not implied by any given specific description of SRAI. The only way that you can motivate not updating is if you already take the position that any conceivable AI that is describable as ``implementing the Coherent Extrapolated Volition of Humanity'' will lead to an outcome that is far, far, worse than extinction. If this is your position, then you can justify not updating. But I do not think that this is your position (if this were your position, then I don't think that CEV would be your favoured alignment target).

And this is not filtered evidence, where I constructed a version of CEV and then showed problems in that version. It is the version that you link to that would be far, far, worse than extinction. So, from your perspective, this is not filtered. Other designs that I have mentioned elsewhere, like USCEV, or the ``non stochastic version of PCEV'', are versions that other people have viewed as reasonable attempts to describe what CEV is. The fact that you would like AI projects to implement safety measures that would (if they work as intended) protect against these types of dangers is great. I strongly support that. I would not be particularly surprised if a technical insight in this type of work turns out to be completely critical. But this does not allow you to justify not updating on unfiltered data. You simply cannot block off all conceivable paths leading to a situation where you conclude that CEV suffers from the same type of core flaw that SRAI is known to suffer from.

If one were to accept the line of argument that all information of this type can be safely dismissed, then this would have very strange consequences. If Steve is running an SRAI project, then he could use this line of argument to dismiss any finding that a specific version of SRAI leads to everyone dying. If Steve has a great set of safety measures, but simply does not update when presented with the information that a given version of SRAI would kill everyone, then Steve can never reach the point where he says: ``I was wrong. SRAI is not a good alignment target. The issue is not due to any details of any specific version, or any specific definition of suffering, or anything else along those lines. The issue is inherent in the core concept of building an AI that is describable as an SRAI. Regardless of how great some set of safety measures looks to the design team, no one should initiate an SRAI project''. Surely, you do not want to accept a line of argument that would have allowed Steve to indefinitely avoid making such a statement, in the face of any conceivable new information about the outcomes of different SRAI variants.

The alternative to debating specific versions is to make arguments on the level of what one should expect, based on the known properties of a given proposed alignment target. I have tried to do this and I will try again. For example, I wonder how you would answer the question: ``why would an AI, that does what an arbitrarily defined abstract entity wants that AI to do, be good for a human individual?''. One can discover that the Coherent Extrapolated Volition of Steve would lead to the death of all of Steve's cells (according to any reasonable set of definitions). One can similarly discover that the Coherent Extrapolated Volition of ``a Group'' is bad for the individuals in that group (according to any reasonable set of definitions). Neither statement suffers from any logical tension. For humans, this should in fact be the expected conclusion for any ``Group AI'', given that, (i): many humans certainly sound as if they will ask the AI to hurt other humans as much as possible, (ii): a human individual is very vulnerable to a powerful AI that is trying to hurt her as much as possible, and (iii): in a ``Group AI'' no human individual can have any meaningful influence, in the initial dynamic, regarding the adoption of those preferences that refer to her (if the group is large). If you doubt the accuracy of one of these three points, then I would be happy to elaborate on whichever one you find doubtful. None of this has any connection to any specific version, or proxy, or attempt to describe what CEV is, or anything else along those lines. It is all inherent in the core concept of CEV (and any other AI proposal that is describable as ``building an AI that does what a group wants it to do''). If you want, we can restrict all further discussion to this form of argument.

If one has already taken the full implications of (i), (ii), and (iii) into account, then one does not have to make a huge additional update, when observing an unfiltered massively-worse-than-extinction type outcome. But this is only because, when one has taken the full implications of (i), (ii), and (iii) into account, then one has presumably already concluded, that CEV suffers from a critical, core, flaw.

I don't understand your sentence: ``Variants of CEV all having catastrophic flaws is some sort of argument that there is no optimization target that's worth optimizing for.''. The statement ``CEV is not a good alignment target'' does not imply the non-existence of good alignment targets. Right? In other words: it looks to me like you are saying that a rejection of CEV as an alignment target is equivalent to a rejection of all conceivable alignment targets. To me, this sounds like nonsense, so I assume that this is not what you are saying. To take a trivial example: I don't think that SRAI is a good alignment target. But surely a rejection of CEV does not imply a rejection of SRAI. Right? Just to be clear: I am definitely not postulating the non-existence of good alignment targets. Discovering that ``the Coherent Extrapolated Volition of Steve implies the death of all his cells'' does not imply the non-existence of alignment targets where Steve's cells survive. Similarly, discovering that ``the Coherent Extrapolated Volition of Humanity is bad for human individuals'' does not imply the non-existence of alignment targets that are good for human individuals. (I don't think that good alignment targets are easy to find, or easy to describe, or easy to evaluate, etc. But that is a different issue)

I think it's best that I avoid building a whole argument based on a guess regarding what you mean here. But I do want to say that if you are using ``CEV'' as a shorthand for ``the Coherent Extrapolated Volition of a single designer'', then you have to be explicit about this if you want me to understand you. And similarly: if ``CEV'' is simply a label that you assign to any reasonable answer to the ``what alignment target should be aimed at?'' question (provisional or otherwise), then you have to be explicit about this if you want me to understand you. If that is the case, then I would have to phrase my claim as: ``Under no reasonable set of definitions does the Coherent Extrapolated Volition of Humanity deserve the label ``CEV''''. This only sounds odd due to the chosen label. There is no more logical tension in that statement than there is logical tension in the statement: ``Under no reasonable set of definitions does the Coherent Extrapolated Volition of Steve result in any of Steve's cells surviving'' (discovering this about Steve should not be very surprising. And discovering this about Steve does not imply the non-existence of alignment targets where Steve's cells survive).


PS:

I am aware of the fact that you (and Yudkowsky, and Bostrom, and a bunch of other people) cannot reasonably be described as having any form of reckless attitude along the lines of: ``Conditioned on knowing how to hit alignment targets, the thing to do is to just instantly hit some alignment target that sounds good''. I hope that it is obvious that I am aware of this. But I wanted to be explicit about this, just in case it is not obvious to everyone that I am aware of this. Given the fact that there is one of those green leaf thingies next to my username, it is probably best to be explicit about this sort of thing.

Comment by ThomasCederborg on Managing risks while trying to do good · 2024-02-09T18:21:02.679Z · LW · GW

The version of CEV that is described on the page that your CEV link leads to, is PCEV. The acronym PCEV was introduced by me, so this acronym does not appear on that page. But it is PCEV that you link to. (in other words: the proposed design that would lead to the LP outcome cannot be dismissed as some obscure version of CEV. It is the version that your own CEV link leads to. I am aware of the fact that you are viewing PCEV as: ``a proxy for something else'' / ``a provisional attempt to describe what CEV is''. But this fact still seemed noteworthy)

On terminology: If you are in fact using ``CEV'' as a shorthand for ``an AI that implements the CEV of a single human designer'', then I think that you should be explicit about this. After thinking about this, I have decided that, without explicit confirmation that this is in fact your intended usage, I will proceed as if you are using CEV as a shorthand for ``an AI that implements the Coherent Extrapolated Volition of Humanity'' (but I would be perfectly happy to switch terminology if I get such confirmation). (another reading of your text is that ``CEV'' (or: ``something like CEV'') is simply a label that you attach to any good answer to the correct phrasing of the ``what alignment target should be aimed at?'' question. That might actually be a sort of useful shorthand. In that case I would, somewhat oddly, have to phrase my claim as: under no reasonable set of definitions does the Coherent Extrapolated Volition of Humanity deserve the label ``CEV'' / ``something like CEV''. Due to the chosen label(s), the statement looks odd. But there is no more logical tension in the above statement than there is logical tension in the following statement: ``under no reasonable set of definitions does the Coherent Extrapolated Volition of Steve result in the survival of any of Steve's cells'' (which is presumably a true statement for at least some human individuals). Until I hear otherwise, I will however stay with the terminology where ``CEV'' is shorthand for ``an AI that implements the Coherent Extrapolated Volition of Humanity'', or ``an AI that is helping humanity'', or something less precise that is still hinting at something along those lines)

It probably makes sense to clarify my own terminology some more. I think this can be done by noting that I think that CEV sounds like a perfectly reasonable way of helping ``a Group'' (including the PCEV version that you link to, and that implies the LP outcome). I just don't think that helping ``a Group'' (that is made up of human individuals) is good for the (human) individuals that make up that ``Group'' (in expectation). Pointing a specific version of CEV (including PCEV) at a set of individuals might be great for some other type of individuals. Let's consider a large number of ``insatiable, Clippy like maximisers''. Each of them cares exclusively about the creation of a different, specific, complex object. No instances of any of these very complex objects will ever exist unless someone looks at the exact specification of a given individual, and uses this specification to create such objects. In this case PCEV might, from the perspective of each of those individuals, be the best thing that can happen (if special influence is off the table). It is also worth noting that a given human individual might get what she wants if some specific version of CEV is implemented. But CEV, or ``helping humanity'', is not good for human individuals, in expectation, compared to extinction. And why would it be? Groups and human individuals are completely different types of things. And a human individual is very vulnerable to a powerful AI that wants to hurt her. And humanity certainly looks like it contains an awful lot of ``will to hurt'', specifically directed at existing human individuals.

If I zoom out a bit, I would describe the project of ``trying to describe what CEV is'' / ``trying to build an AI that helps humanity'' as: a project that searches for an AI design that helps an arbitrarily defined abstract entity. But this same project is, in practice, evaluating specific proposed AI designs based on how they interact with a completely different type of thing: human individuals. You are for example presumably discarding PCEV because the LP outcome implied by PCEV contains a lot of suffering individuals (when PCEV is pointed at billions of humans). It is however not obvious to me why LP would be a bad way of helping an arbitrarily defined abstract entity (especially considering that the negotiation rules of PCEV simultaneously (i): imply LP, and are also (ii): an important part of the set of definitions that is needed to differentiate the specific abstract entity that is to be helped, from the rest of the vast space of entities that a mapping from billions-of-humans to the ``class-of-entities-that-can-be-said-to-want-things'' can point to). Thus, I suspect that PCEV is not actually being discarded due to being bad at helping an abstract entity (and my guess is that PCEV is actually being discarded because LP is bad for human individuals).

I think that one reasonable way of moving past this situation is to switch perspective. Specifically: adopt the perspective of a single human individual, in a population of billions, and ask: ``without giving her any special treatment compared to other existing humans, what type of AI would want to help her''. And then try to answer this question while making as few assumptions about her as possible (for example making sure that there is no implicit assumption regarding whether she is ``selfish or selfless'', or anything along those lines. Both ``selfless and selfish'' human individuals would strongly prefer to avoid being a Heretic in LP. Thus, discarding PCEV does not contain an implicit assumption related to the ``selfish or selfless'' issue. Discarding PCEV does however involve an assumption that human individuals are not like the ``insatiable Clippy maximisers'' mentioned above. So, such maximisers might justifiably feel ignored when we discard PCEV. But no one can justifiably feel ignored when we discard PCEV, on account of where she is on the ``selfish or selfless'' spectrum). When one adopts this perspective, it becomes obvious to suggest that the initial dynamic should grant this individual meaningful influence regarding the adoption of those preferences that refer to her. Making sure that such influence is included as a core aspect of the initial dynamic is made even more important by the fact that the designers will be unable to consider all implications of a given project, and will be forced to rely on potentially flawed safety measures (for example along the lines of a ``Last Judge'' off switch, which might fail to trigger. Combined with a learned DWIKIM layer, that might turn out to be very literal when interpreting some specific class of statements). If such influence is included in the initial dynamic, then the resulting AI is no longer describable as ``doing what a Group wants it to do''. Thus, the resulting AI cannot be described as a version of CEV. (it might however be describable as ``something like CEV''. Sort of how one can describe an Orca as ``something like a shark'', despite the fact that an Orca is not a type of shark (or a type of fish). I would guess that you would say that an AI that grants such influence, as part of the initial dynamic, is not ``something like CEV''. But I'm not sure about this)

(I should have added ``, in the initial dynamic,'' to the text in my earlier comments. It is explicit in the description of MPCEV, but I should have added this phrase to my comments here too. As a tangent, I agree that the intuition that you were trying to counter with your Boundaries / Membrane mention is probably both common and importantly wrong. Countering this intuition makes sense, and I should have read this part of your comment more carefully. I would however like to note that the description of the LP outcome, in the PCEV thought experiment, actually contains billions of (presumably very different) localities. Each locality is optimised according to very different criteria. Each place is designed to hurt a specific individual human Heretic. And each such location is additionally bound by its own unique ``comprehension constraint'', that refers to the specific individual Heretic being punished in that specific location)

Perhaps a more straightforward way to move this discussion along is to ask a direct question regarding what you would do if you were in the position that I believe I find myself in. In other words: a well intentioned designer called John wants to use PCEV as the alignment target for his project (rejecting any other version of CEV out of hand, by saying: ``if that is indeed a good idea, then it will be the outcome of Parliamentary Negotiations''). When someone points out that PCEV is a bad alignment target, John responds by saying that PCEV cannot, by definition, be a bad alignment target. John claims that any thought experiment where PCEV leads to a bad outcome must be due to a bad extrapolation of human individuals. John says that any given ``PCEV with a specific extrapolation procedure'' is just an attempt to describe what PCEV is. If aiming at a given ``PCEV with a specific extrapolation procedure'' is a bad idea, then it is a badly constructed PCEV. Aiming at PCEV is a good idea, by the intention that defines PCEV. John further says that his project will include features that (if they are implemented successfully, and are not built on top of any problematic unexamined implicit assumption) will let John try again, if a given attempt to ``say what PCEV is'' fails. Do you agree that this project is a bad idea? (compared to achievable alternatives, that start with a different set of, findable, assumptions) If so, what would you say to John? (what you are proposing is different from what John is proposing. I predict that you will say that John is making a mistake. My point is that, to me, it looks like you are making a mistake of the same type as John's mistake. So, I wonder what you would say to John (your behaviour in this exchange is not the same as John's behaviour in this thought experiment. But it looks to me like you are making the same class of mistake as John. So, I'm not asking how you would ``act in a debate, as a response to John's behaviour''. Instead, I'm curious about how you would explain to John that he is making an object level mistake))

Or maybe a better approach is to go less meta, and get into some technical details. So, let's use the terminology in your CEV link to explore some of the technical details in that post. What do you think would happen if the learning algorithm that outputs the DWIKIM layer in John's PCEV project is built on top of an unexamined implicit assumption that turns out to be wrong? Let's say that the DWIKIM layer that pops out interprets the request to build PCEV as a request to implement the straightforward interpretation of PCEV. The DWIKIM layer happens to be very literal when presented with the specific phrasing used in the request. In other words: what if it interprets John as requesting something along the lines of LP? I think this might result in an outcome along the lines of LP (if the problems with the DWIKIM layer stem from a problematic unexamined implicit assumption related to extrapolation, then the exact same problematic assumption might also render something along the lines of a ``Last Judge off switch add on'' ineffective). I think that it would be better if John had aimed at something that does not suffer from known, avoidable, s-risks. Something whose straightforward interpretation is not known to imply an outcome that would be far, far, worse than extinction. For the same reason, I make the further claim that I do not think that it is a good idea to subject everyone to the known, avoidable, s-risks associated with any AI that is describable as ``doing what a Group wants'' (which includes all versions of CEV). Again, I'm certainly not against some feature that might let you try again, or that might re-interpret an unsafe request as a request for something completely different that happens to be safe (such as, for example, a learned DWIKIM layer). I am aware of the fact that you do not have absolute faith in the DWIKIM layer (if this layer were perfectly safe, in the sense of reliably re-interpreting requests that straightforwardly imply LP as something desirable to the designer, then the full architecture would be functionally identical to an AI that simply does whatever the designer wants the AI to do. In that case, you would not care what the request was. You might then just as well have the designer ask the DWIKIM layer for an AI that maximises the number of bilberries. So, I am definitely not implying that you are unaware of the fact that the DWIKIM layer is unable to provide reliable safety).

Zooming out a bit, it is worth noting that the details of the safety measure(s) are actually not very relevant to the points that I am trying to make here. Any conceivable, human implemented, safety measure might fail. And, more importantly, these measures do not help much when one is deciding what to aim at. For example: MPCEV can also be built on top of a (potentially flawed) DWIKIM layer, in the exact same way as you can build CEV on top of a DWIKIM layer (and you can stick a ``Last Judge off switch add on'' to MPCEV too. Etc, etc, etc). Or in yet other words: anything along the lines of a ``Last Judge off switch add on'' can be used by many different projects aiming at many different targets. Thus, the ``Last Judge'' idea, or any other idea along those lines (including a DWIKIM layer), provides very limited help when one decides what to aim at. And even more generally: regardless of what safety measure is used, John is still subjecting everyone to an unnecessary, avoidable, s-risk. I hope we can agree that John should not do that with any version of ``PCEV with a specific extrapolation procedure''. The further claim that I am making is that no one should do that with any ``Group AI'', for similar reasons. Surely, discovering that this further claim is true cannot be, by definition, impossible.

While re-reading our exchange, I realised that I never actually clarified that my primary reason for participating in this exchange (and my primary reason for publishing things on LW) is not actually to stop CEV projects. However, I think that a reasonable person might, based on my comments here, come to believe that my primary goal is to stop CEV projects (which is why the present clarification is needed). My focus is actually on trying to make progress on the ``what alignment target should be aimed at?'' question. In the present exchange, my target is the idea that this question has already been given an answer (and, specifically, that the answer is CEV). The first step to progress on the ``what alignment target should be aimed at?'' question is to show that this question does not currently have an answer. This is importantly different from saying that: ``CEV is the answer, but the details are unknown'' (I think that such a statement is importantly wrong. And I also think that the fact that people still believe things along these lines is standing in the way of getting a project off the ground that is devoted to making progress on the ``what alignment target should be aimed at?'' question).

I think that it is very unlikely that the relevant people will stay committed to CEV until the technology arrives that would make it possible for them to hit CEV as an alignment target (the reason I find this unlikely is that, (i): I believe that I have outlined a sufficient argument to show that CEV is a bad idea, and (ii): I think that such technology will take time to arrive, and (iii): it seems likely that this team of designers, who are by assumption capable of hitting CEV, will be both careful enough to read that argument before reaching the point of no return on their CEV launch, and also capable enough to understand it. Thus, since the argument against CEV already exists, it would not, in my estimate, make sense to focus on s-risks related to a successfully implemented CEV). If that unlikely day ever does arrive, then I might switch focus to trying to prevent direct CEV related s-risk, by arguing against this imminent CEV project. But I don't expect to ever see this happening.

The set of paths that I am actually focused on reducing the probability of can be hinted at by outlining the following specific scenario. Imagine a well intentioned designer, that we can call Dave, who is aiming for Currently Unknown Alignment Target X (CUATX). Due to an unexamined implicit assumption that CUATX is built on top of turning out to be wrong in a critical way, CUATX implies an outcome along the lines of LP. But the issue that CUATX suffers from is far more subtle than the issue that CEV suffers from. And progress on the ``what alignment target should be aimed at?'' question has not yet advanced to the point where this problematic unexamined implicit assumption can be seen. CUATX has all the features that are known at launch time to be necessary for safety (such as the necessary, but very much not sufficient, feature that any safe AI must give each individual meaningful influence regarding the adoption of those preferences that refer to her). Thus, the CUATX idea leads to a CUATX project, which in turn leads to an avoidable outcome along the lines of LP (after some set of human implemented safety measures fail). That is the type of scenario that I am trying to avoid (by trying to make sufficient progress on the ``what alignment target should be aimed at?'' question, in time). My real ``opponent in this debate'' is an implemented CUATX, not the idea of CEV (and very definitely not you. Or anyone else that has contributed, or is likely to contribute, valuable insights related to the ``what alignment target should be aimed at?'' question). It just happens to be the case that the effort to prevent CUATX, that I am trying to get off the ground, starts by showing that CEV is not an answer to the ``what alignment target should be aimed at?'' question. And you just happen to be the only person that is pushing back against this in public (and again: I really appreciate the fact that you chose to engage on this topic).

(I should also note explicitly that I am most definitely not against exploring safety measures. They might stop CUATX. In some plausible scenarios, they might be the only realistic thing that can stop CUATX. And I am not against treaties. And I am open to hearing more about the various human augmentation proposals that have been going around for many years. I am simply noting that a safety measure, regardless of how clever it sounds, cannot fill the function of a substitute for progress on the ``what alignment target should be aimed at?'' question. An attempt to get people to agree to a treaty might fail. Or a successfully implemented treaty might fail to actually prevent a race dynamic for long enough. And similarly, augmented humans might systematically tend towards being: (i): superior at alignment, (ii): superior at persuasion, (iii): well intentioned, and (iv): not better at dealing with the ``what alignment target should be aimed at?'' question than the best baseline humans (but still, presumably, capable of understanding an insight on this question, at least if that insight is well explained). Regardless of augmentation technique, selection for ``technical ability and persuasion ability'' seems like a far more likely de facto outcome to me, due to being far easier to measure. I expect it to be far more difficult to measure the ability to deal with the ``what alignment target should be aimed at?'' question (and it is not obvious that the abilities needed to deal with the ``what alignment target should be aimed at?'' question will be strongly correlated with the thing that I think will, de facto, have driven the trial and error augmentation process of the augments that eventually hit an alignment target: ``technical-ability-and-persuasion-ability-and-ability-to-get-things-done''). Maybe the first augment will be great at making progress on the ``what alignment target should be aimed at?'' question, and will quickly render all previous work on this question irrelevant (and in that case, the persuasion ability is probably good for safety). But assuming that this will happen seems like a very unsafe bet to make. Even more generally: I simply do not think that it is possible to come up with any type of clever sounding trick that makes it safe to skip the ``what alignment target should be aimed at?'' question (to me, the ``revolution-analogy-argument'' in the 2004 CEV text looks like a sufficient argument for the conclusion that it is important to make progress on the ``what alignment target should be aimed at?'' question. But it seems like many people do not consider this to be a sufficient argument for this conclusion. It is unclear to me why this conclusion seems to require such extensive further argument)).

If my overall strategic goal was not clear, then this was probably my fault (in addition to not making this goal explicit, I also seem to have a tendency to lose focus on this larger strategic picture during back and forth technical exchanges).

Two out of my three LW posts are in fact entirely devoted to arguing that making progress on the ``what alignment target should be aimed at?'' question is urgent (in our present discussion, we have only talked about the one post that is not exclusively focused on this). See:

Making progress on the ``what alignment target should be aimed at?'' question, is urgent 

The proposal to add a ``Last Judge'' to an AI, does not remove the urgency, of making progress on the ``what alignment target should be aimed at?'' question. 

(I am still very confused about this entire conversation. But I don't think that re-reading everything, yet again, will help much. I have been continually paying at least some attention to SL4, OB, and LW since around 2002-2003. I can't remember exactly who said what when, or where. However, I have developed a strong intuition that can be very roughly translated as: ``if something sounds strange, then it is very definitely not safe to explain away this strangeness by conveniently assuming that Nesov is confused on the object-level''. I am nowhere near the point where I would consider going against this intuition. So, I expect that I will remain very confused about this exchange until there is some more information available. I don't expect to be able to just think my way out of this one (wild speculation, regarding what it might be that I was missing, is very welcome from anyone that happens to stumble on this comment at any point in the future. For example in a LW comment, or in a LW DM, or in an email))

Comment by ThomasCederborg on Managing risks while trying to do good · 2024-02-03T20:23:46.816Z · LW · GW

I was clearly wrong regarding how you feel about your cells. But surely the question of whether or not an AI that is implementing the CEV of Steve would result in any surviving cells is an empirical question? (which must be settled by referring to facts about Steve. And by trying to figure out what these facts mean, in terms of how the CEV of Steve would treat his cells). It cannot possibly be the case that it is impossible, by definition, to discover that any reasonable way of extrapolating Steve would result in all his cells dying?

Thank you for engaging on this. Reading your description of how you view your own cells was a very informative window into how a human mind can work. (I find it entirely possible that I am very wrong regarding how most people view their cells. Or about how they would view their cells upon reflection. I will probably not try to introspect regarding how I feel about my own cells while this exchange is still fresh)

Zooming out a bit, and looking at this entire conversation, I notice that I am very confused. I will try to take a step back from LW and gain some perspective, before I return to this debate.

Comment by ThomasCederborg on Managing risks while trying to do good · 2024-02-03T15:56:46.531Z · LW · GW

I think that extrapolation is a genuinely unintuitive concept. I would for example not be very surprised if it turns out that you are right, and that it is impossible to reasonably extrapolate you if the AI that is doing the extrapolation is cut off from all information about other humans. I don't think that this fact is in tension with my statement that individuals and groups are completely different types of things. Taking your cell analogy: I think that implementing the CEV of you could lead to the death of every single cell in your body (for example if your mind is uploaded in a way that does not preserve information about any individual cell). I don't think that it is strange in general if an extrapolated version of a human individual is completely fine with the complete annihilation of every cell in her body (and this is true despite the fact that ``hostility towards cells'' is not a common thing). Such an outcome is no indication of any technical failure in an AI project that was aiming for the CEV of that individual. This shows why there is no particular reason to think that doing what a human individual wants would be good for any of her cells (for any reasonable definition of ``doing what a human individual wants''). And this fact remains true even if it is also the case that a given cell would become impossible to understand if that cell was isolated from other cells.

A related tangent: the fact that extrapolation is a genuinely unintuitive concept has, I think, important implications for AI safety. This fact is for example central to my argument about ``Last Judge'' type proposals in my post:

The proposal to add a ``Last Judge'' to an AI, does not remove the urgency, of making progress on the ``what alignment target should be aimed at?'' question. 

(I will try to reduce the commas. I see what you are talking about. I have in the past been forced to do something about an overuse of both footnotes and parentheses. Reading badly written academic history books seems to be making things worse (if one is analysing AI proposals where the AI is getting its goal from humans, then it makes sense to me to at least try to understand humans))

Comment by ThomasCederborg on Managing risks while trying to do good · 2024-02-03T04:32:39.391Z · LW · GW

I think that ``CEV'' is usually used as shorthand for ``an AI that implements the CEV of Humanity''. This is what I am referring to when I say ``CEV''. So, what I mean when I say that ``CEV is a bad alignment target'' is that, for any reasonable set of definitions, it is a bad idea to build an AI that does what ``a Group'' wants it to do (in expectation, from the perspective of essentially any human individual, compared to extinction). Since groups and individuals are completely different types of things, it should not be surprising to learn that doing what one type of thing wants (such as ``a Group'') is bad for a completely different type of thing (such as a human individual). In other words, I think that ``an AI that implements the CEV of Humanity'' is a bad alignment target, in the same sense as I think that SRAI is a bad alignment target.

But I don't think your comment uses ``CEV'' in this sense. I assume that we can agree that aiming for ``the CEV of a chimp'' can be discovered to be a bad idea (for example by referring to facts about chimps, and using thought experiments to see what these facts about chimps imply about likely outcomes). Similarly, it must be possible to discover that aiming for ``the CEV of Humanity'' is also a bad idea (for human individuals). Surely, discovering this cannot be, by definition, impossible. Thus, I think that you are in fact not using ``CEV'' as shorthand for ``an AI that implements the CEV of Humanity''. (I am referring to your sentence: ``If it's not something to aim at, then it's not a properly constructed CEV.'')

Your comment makes perfect sense if I read ``CEV'' as shorthand for ``an AI that implements the CEV of a single human designer''. I was not expecting this terminology. But it is a perfectly reasonable terminology, and I am happy to make my argument using this terminology. If we are using this terminology, then I think that you are completely right about the problem that I am trying to describe being a proxy issue (thus, if this was indeed your intended meaning, then I was completely wrong when I said that I was not referring to a proxy issue. In this terminology, it is indeed a proxy issue). So, using this terminology, I would describe my concerns as: ``an AI that implements the CEV of Humanity'' is a predictably bad proxy for ``an AI that implements the CEV of a single human designer''. Because ``an AI that implements the CEV of Humanity'' is far, far, worse than extinction, from the perspective of essentially any human individual (which, presumably, disqualifies it as a proxy for ``an AI that implements the CEV of a single human designer''. If this does not disqualify it as a proxy, then I think that this particular human designer is a very dangerous person (from the perspective of essentially any human individual)). Using this terminology (and assuming a non unhinged designer), I would say that if your proposed project is to use ``an AI that implements the CEV of Humanity'' as a proxy for ``an AI that implements the CEV of a single human designer'', then this constitutes a predictable proxy failure. Further, I would say that pushing ahead with a project that is trying to implement ``an AI that implements the CEV of Humanity'' (as a proxy), despite this predictable failure, inflicts an unnecessary s-risk on everyone. Thus, I think it would be a bad idea to pursue such a project (from the perspective of essentially any human individual. Presumably including the designer).

If we take the case of Bob and his Suffering Reducing AI (SRAI) project (and everyone has agreed to use this terminology), then we can tell Bob:

SRAI is not a good proxy for ``an AI that implements the CEV of Bob'' (assuming that you, Bob, do not want to kill everyone). Thus, your project will run into a predictable issue when it tries to use SRAI as a proxy for ``an AI that implements the CEV of Bob''. If you implement a safety measure successfully, this will still, at best, lead to your project failing safely. At worst, your safety measure will fail and SRAI will kill everyone. So please don't proceed with your project, given that it would put everyone at risk of being killed by SRAI (and this would be an unnecessary risk, because your project will predictably fail, due to a predictable proxy issue).

By making sufficient progress on the ``what alignment target should be aimed at?'' question before Bob gets started on his SRAI project, it is possible to avoid the unnecessary extinction risks associated with the proxy failure that Bob will predictably run into if his project uses SRAI as a proxy for ``an AI that implements the CEV of Bob''. Similarly, it is possible to avoid the unnecessary s-risks associated with the proxy failure that Dave will predictably run into if Dave uses ``an AI that implements the CEV of Humanity'' as a proxy for ``an AI that implements the CEV of Dave'' (because any ``Group AI'' is very bad for human individuals, including Dave).

Mitigating the unnecessary extinction risks that are inherent in any SRAI project does not require an answer to the ``what alignment target should be aimed at?'' question (it was a long time ago, but if I remember correctly, Yudkowsky did this around two decades ago. It seems likely that anyone careful and capable enough to hit an alignment target will be able to understand that old explanation of why SRAI is a bad alignment target. So generating such an explanation was sufficient for mitigating the extinction risks associated with a successfully implemented SRAI, and generating it did not require an answer to the ``what alignment target should be aimed at?'' question. One can demonstrate that a given bad answer is a bad answer without having any good answer). Similarly, avoiding the unnecessary s-risks that are inherent in any ``Group AI'' project does not require an answer to the ``what alignment target should be aimed at?'' question. (I strongly agree that finding an actual answer to this question is probably very, very difficult. I am simply pointing out that even partial progress on this question can be very useful.)

(I think that there are other issues related to AI projects whose purpose is to aim at ``the CEV of a single human designer''. I will not get into them here, but I thought it made sense to at least mention that such issues exist.)

Comment by ThomasCederborg on Managing risks while trying to do good · 2024-02-02T22:52:15.280Z · LW · GW

I agree that ``the ends justify the means'' thinking has led to a lot of suffering. For this, I would like to switch from the Chinese Cultural Revolution to the French Revolution as an example (I know it better, and I think it fits this attitude better). So, someone wants to achieve something that is today seen as a very reasonable goal, such as ``end serfdom and establish formal equality before the law''. Basically: their goals are positive, and they achieve these goals. But perhaps they could have achieved those goals with fewer side effects, if not for their ``the ends justify the means'' attitude. Serfdom did end, and this change was both lasting and spreading. After things had calmed down, the new economic relations led to dramatically better material conditions for the former serfs (for example, a dramatic increase in life expectancy, due to a dramatic reduction in poverty-related malnutrition). But during the revolutionary wars (and especially the Napoleonic Wars that followed), millions died. It sounds intuitively likely that there would have been less destruction if attitudes along these lines had been less common.

So yes, even when an event has such a large and lasting positive impact that it is still celebrated centuries later (the 14th of July is still a very big thing in France), one might find that this attitude caused concrete harm (millions of dead people must certainly qualify as ``concrete harm'', and the French Revolution must certainly be classified as a celebrated event in any sense of that word, including, but not limited to, the literal ``fireworks and party'' sense).

And you are entirely correct that damage from this type of attitude was missing from my analysis.

Comment by ThomasCederborg on Managing risks while trying to do good · 2024-02-02T22:17:59.909Z · LW · GW

I think that my other comment will hopefully be sufficient to outline what my position actually is. But perhaps a more constructive way forward would be to ask how certain you are that CEV is, in fact, the right thing to aim at. That is: how certain are you that this situation is not symmetrical to the case where Bob thinks that ``a Suffering Reducing AI (SRAI) is the objectively correct thing to aim at''? Bob will diagnose any problem with any specific SRAI proposal as arising from proxy issues, related to the fact that Bob is not able to perfectly define ``Suffering'' and must always rely on a proxy (those proxy issues exist, but they are not the most serious issue with Bob's SRAI project).

I don't think that we should let Bob proceed with an AI project that aims to find the correct description of ``what SRAI is'', even if he is being very careful and is trying to implement a safety measure (one that will, while it continues to work as intended, prevent SRAI from killing everyone). Those safety features might fail, regardless of whether or not someone has pointed out a critical flaw in them before the project reaches the point of no return (this conclusion is not related to Corrigibility; I would reach the exact same conclusion if Bob's SRAI project were using any other safety measure). For the exact same reason, I simply do not think that it is a good idea to proceed with your proposed CEV project (as I understand that project). I think that doing so would represent a very serious s-risk. At best, it will fail in a safe way, for predictable reasons. How confident are you that I am completely wrong about this?

Finally, I should note that I still don't understand your terminology, and I don't think that I will until you specify what you mean by ``something like CEV''. My current comments are responding to my best guess of what you mean (which is that MPCEV, from my linked-to post, would not count as ``something like CEV'' in your terminology). (Does an orca count as ``something like a shark''? If it is very important that some water tank be free of fish, then it is difficult for me to discuss Dave's ``let's put something like a shark in that water tank'' project until I have an answer to my orca question.)

(I assume that this is obvious, but just to be completely sure that it is clear: I very much appreciate that you are engaging on this topic.)

Comment by ThomasCederborg on Managing risks while trying to do good · 2024-02-02T21:34:24.184Z · LW · GW

I don't think that they are all status games. If so, then why did people (for example) include long meditations, in private diaries, regarding whether or not they personally deserve to go to hell? While they were focusing on the ``who is a heretic?'' question, it seems that they were taking for granted the normative position ``if someone is a heretic, then she deserves eternal torture in hell''. But, on the other hand, private diaries are of course sometimes opened while the people who wrote them are still alive (this is not the most obvious thing that someone would like others to read in a stolen diary, but people are not easy to interpret, especially across centuries of distance; maybe for some people, someone else stealing their diary and reading such meditations would be awesome). And people are not perfect liars, so maybe the act of making such entries is mostly an effective way of getting into an emotional state such that one seems genuine when expressing remorse to other people. So maybe any reasonable way of extrapolating a diarist like this will lead to a mind that finds the idea of hell abhorrent. There is a lot of uncertainty here. There is probably also a very, very large diversity among the set of humans who have adopted a normative position along these lines (not just in terms of terminology, or in terms of who counts as a heretic, but also in terms of what was lying underneath the adoption of such normative positions; it would not be very surprising if a given extrapolation procedure leads to different outcomes for two individuals who sound very similar). As long as we agree that any AI design must be robust to the possibility that people mean what they say, then perhaps these issues are not critical to resolve (but, on the other hand, maybe digging into this some more will lead to genuinely important insights). (I agree that there were probably a great number of people, especially early on, who were trying to achieve things that most people today would find reasonable, but whose actions contributed to destructive movements. Such issues are probably a lot more problematic in politics than in the case where an AI is getting its goal from a set of humans.) (None of my reasoning here is done with EAs in mind.)

I think there exists a deeper problem for the proposition that it might be possible to find some version of CEV that is actually safe for human individuals (as opposed to the much easier task of finding a version of CEV such that no one is able to outline a thought experiment, before launch time, that shows why this specific version would lead to an outcome far, far worse than extinction). Specifically, I'm referring to the fact that ``heretics deserve eternal torture in hell'' style fanatics (F1) are just one very specific example of a group of humans that might be granted extreme influence over CEV. In a population of billions, there will exist a very, very large number of never-explicitly-considered types of minds.

Consider for example a different, tiny, group of Fanatics (F2), who (after being extrapolated) have a very strong ``all or nothing'' attitude, and a sacred rule against negotiations (let's explore the case where this attitude is related to a religion, and where one in a thousand humans is part of F2). Unless negotiations deadlock in a very specific way, PCEV will grant F2 exactly zero direct influence. But now consider what happens if another version of CEV is launched: one that first maps each individual to a utility function, and then maximises the sum of those functions (USCEV). During the process where a member of this religion, whom we can call Gregg, ``becomes the person that Gregg wants to be'', the driving aspect of Gregg's personality is a burning desire to become a true believer and become morally pure. This includes becoming the type of person who would never break the sacred set of rules: ``Never accept any compromise regarding what the world should look like! Never negotiate with heretics! Always take whatever action is most likely to result in the world being organised exactly as described in the sacred texts!''. So the only reasonable way to map extrapolated Gregg to a utility function is to assign maximum utility to the Outcome demanded by the Sacred Texts (OST), and minimum utility to every other outcome. Besides the number of people in F2, the bound on how bad OST can be (from the perspective of the non-believers) and still be the implemented outcome, is set by how bad an outcome USCEV is able to think up (technically, the minimum of each non-believer's utility function is not the worst possible outcome, but the worst outcome that USCEV can think up for that specific non-believer). As long as there is a very large difference between OST and the worst thing that USCEV can think up, OST will be the selected outcome. Maybe OST will look ok to a non-superintelligent observer. For example, OST could look like a universe where every currently existing human individual, after an extended period of USCEV-guided self-reflection, converges on the same belief system (and all subsequent children are then brought up in this belief system). Or maybe it will be overtly bad, with everyone forced to convert or die. Or maybe it will be a genuine s-risk, for example along the lines of LP.
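
To make the aggregation dynamic concrete, here is a toy numerical sketch (my own illustration; the population split, the utility numbers, and the normalisation choice are all assumptions, not something taken from any CEV text). The point is only that when each person's utility scale is anchored at the worst outcome the AI can think up for her, the non-believers' differences between OST and their preferred outcome become tiny, and a single all-or-nothing utility function per thousand people can tip the sum:

```python
# Toy sketch of a utility-sum (USCEV-style) aggregation. All numbers are
# hypothetical and chosen only to illustrate the mechanism described above.

non_believers = 999  # people whose worst-conceivable outcome is far below OST
fanatics = 1         # F2 members with all-or-nothing preferences

# Each person's utility is normalised to [0, 1], where 0 is the worst outcome
# that USCEV can think up for that person and 1 is her best outcome.
u_non_believer = {"utopia": 0.9999, "OST": 0.9990}  # OST is bad, but nowhere near the floor
u_fanatic      = {"utopia": 0.0,    "OST": 1.0}     # everything except OST sits at the floor

def utility_sum(outcome: str) -> float:
    return non_believers * u_non_believer[outcome] + fanatics * u_fanatic[outcome]

print("utopia:", round(utility_sum("utopia"), 4))  # 999 * 0.9999 + 0.0 = 998.9001
print("OST:   ", round(utility_sum("OST"), 4))     # 999 * 0.9990 + 1.0 = 999.001 -> OST wins
```

Note how the result depends on the floor of the normalisation: the worse the worst-case outcome that USCEV can think up for a non-believer, the smaller her normalised difference between OST and utopia becomes, and the easier it is for a handful of all-or-nothing utility functions to dominate the sum.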

As far as I can tell, CEV in general, and PCEV in particular, is still the current state of the art in terms of answering the ``what alignment target should be aimed at?'' question (and CEV has been the state of the art for almost two decades now). I find this state of affairs strange and deeply problematic. I'm confused by the relatively low interest in efforts to make further progress on the ``what alignment target should be aimed at?'' question (I think that, for example, the explanation in the original CEV document from 2004 was a very good explanation of why this question matters. And I don't think it is a coincidence that the specific analogy used to make that point was a political revolution. To paraphrase briefly: such a revolution must (i) succeed, and also (ii) lead to a new government that is actually a good government. Similarly, an AI project must (i) hit an alignment target, and also (ii) that alignment target must be a good thing to hit). Maybe I shouldn't be surprised by this relative lack of interest. Maybe humans are just not great, in general, at reacting to ``AI danger''. But it still feels like I'm not seeing, I don't know, ... something (wild speculation, by anyone who at any point happens to stumble upon this comment, regarding what this ... something ... might be, is very welcome. Either in a comment, or in a DM, or in an email).

Comment by ThomasCederborg on Managing risks while trying to do good · 2024-02-02T04:19:35.859Z · LW · GW

It is getting late here, so I will stop after this comment and look at this again tomorrow (I'm in Germany). Please treat the comment below as not fully thought through.

The problem, from my perspective, is that I don't think that the objective you are trying to approximate is a good objective (in other words, I am not referring to problems related to optimising a proxy; those also exist, but they are not the focus of my current comments). I don't think that it is a good idea to do what an abstract entity called ``humanity'' wants (and I think that this is true from the perspective of essentially any human individual). I think that it would be rational for essentially any human individual to strongly oppose the launch of any such ``Group AI''. Human individuals and groups are completely different types of things. So I don't think it should be surprising to learn that doing what a group wants is bad for the individuals in that group. This is a separate issue from problems related to optimising for a proxy.

I give one example of how things can go wrong in the post:

A problem with the most recently published version of CEV 

This is of course just one specific example, and it is meant as an introduction to the dangers involved in building an AI that is describable as ``doing what a group wants''. Showing that a specific version of CEV would lead to an outcome that is far, far worse than extinction does not, on its own, prove that all versions of CEV are dangerous. I do however think that all versions of CEV are very, very dangerous. And I do think that this specific thought experiment can be used to hint at a more general problem. I also hope that this thought experiment will at least be sufficient to convince most readers that there might exist a deeper problem with the core concept. In other words, I hope that it will be sufficient to convince most readers that you might be going after the wrong objective when you are analysing different attempts ``to say what CEV is''.

While I'm not actually talking about implementation, perhaps it would be more productive to approach this from the implementation angle. How certain are you that the concept of Boundaries / Membranes provides reliable safety for individuals from a larger group that contains the type of fanatics described in the linked post? Let's say that it turns out that they do not, in fact, reliably provide such safety for individuals. How certain are you then that the first implemented system that relies on Boundaries / Membranes to protect individuals from such groups will in fact leave you able to try again? I don't think that you can possibly know this with any degree of certainty. (I'm certainly not against safety measures. If anyone attempts to do what you are describing, then I certainly hope that this attempt will involve safety measures.) (I also have nothing against the idea of Boundaries / Membranes.)

An alternative (or parallel) path to trial and error is to try to make progress on the ``what alignment target should be aimed at?'' question. Consider what you would say to Bob, who wants to build a Suffering Reducing AI (SRAI). He is very uncertain of his definition of ``Suffering'', and he is implementing safety systems. He knows that any formal definition of ``Suffering'' that he can come up with will be a proxy for the actually correct definition of Suffering. If it can be shown that some specific implementation of SRAI would lead to a bad outcome (such as an AI that decides to kill everyone), then Bob will respond that the definition of Suffering must be wrong (and that he has prepared safety systems that will let him try to find a better definition of ``Suffering'').

This might certainly end well. Bob's safety systems might continue to work until Bob realises that the core idea of building any AI describable as a SRAI will always lead to an AI that simply kills everyone (in other words: until he realises that he is going after the wrong objective). But I would say that a better alternative is to make enough progress on the ``what alignment target should be aimed at?'' question that it is possible to explain to Bob that he is, in fact, going after the wrong objective (and is not, in fact, dealing with proxy issues). (In the case of SRAI, such progress has of course been around for a while. I think I remember reading an explanation of the ``SRAI issue'', written by Yudkowsky, decades ago. So, to deal with people like Bob, there is no actual need for us to make additional progress. But for people in a world where SRAI is the state of the art in terms of answering the ``what alignment target should be aimed at?'' question, I would advise them to focus on making further progress on this question.)

Alternatively, I could ask what you would say to Bob if he thinks that ``reducing Suffering'' is ``the objectively correct thing to do'', and is convinced that any implementation that leads to bad outcomes (such as an AI that kills everyone) must be a proxy issue. I think that, just as any reasonable definition of ``Suffering'' implies a SRAI that kills everyone, any reasonable set of definitions of ``a Group'' implies a Group AI that is bad for human individuals (in expectation, when that Group AI is pointed at billions of humans, from the perspective of essentially any human individual in the set of humans that the Group AI is pointed at, compared to extinction). In other words, a Group AI is bad for human individuals in expectation, in the same sense that a SRAI kills everyone. I'm definitely not saying that this is true for ``minds in general''. If Dave is able to reliably see all implications of any AI proposal (or if Dave is invulnerable to a powerful AI that is trying to hurt Dave, or if the minds that the Group AI will be pointed at are known to be ``friendly towards Dave'' in some formal sense that is fully understood by Dave), then this might not be true for Dave. But I claim that it is true for human individuals.

Comment by ThomasCederborg on Managing risks while trying to do good · 2024-02-02T00:53:22.119Z · LW · GW

I'm not sure that I agree with this. I think it mostly depends on what you mean by ``something like CEV''. All versions of CEV are describable as ``doing what a Group wants''. It is inherent in the core concept of building an AI that is ``Implementing the Coherent Extrapolated Volition of Humanity''. This rules out proposals where each individual is given meaningful influence regarding the adoption of those preferences that refer to her, for example as in MPCEV (described in the post that I linked to above). I don't see how an AI can be safe for individuals without such influence. Would you say that MPCEV counts as ``something like CEV''?

If so, then I would say that it is possible that ``something like CEV'' might be a good long-term solution. But I don't see how one can be certain about this. How certain are you that this is in fact a good idea for a long-term solution?

Also, how certain are you that the full plan you describe (including short-term solutions, etc.) is actually a good idea?

Comment by ThomasCederborg on Managing risks while trying to do good · 2024-02-01T23:39:29.799Z · LW · GW

In the case of damage from political movements, I think that many truly horrific things have been done by people who are well approximated as ``genuinely trying to do good, and largely achieving their objectives, without major unwanted side effects'' (for example events along the lines of the Chinese Cultural Revolution, which you discuss in the older post linked in your first footnote).

I think our central disagreement might be a difference in how we see human morality. In other words, I think we might have different views regarding what one should expect from a human who is genuinely trying to do good, and who is succeeding. I'm specifically talking about one particular aspect of morality that has been common in many different times and places throughout human history. It is sometimes expressed in theological terms, along the lines of ``heretics deserve eternal torture in hell''. The issue is not the various people who have come up with various stories along the lines of ``hell exists''. Humans are always coming up with stories about ``what is really going on''; there was a lot to choose from. The issue is the large number of people, in many different cultures throughout human history, who have heard stories that assume a morality that fits well with normative statements along the lines of ``heretics deserve eternal torture in hell'', and have thought: ``this story feels right. This is the way that the world should work. On questions of morality, it feels right to defer to the one who set this up''. These types of stories are not the only types of stories that humans have found intuitive, but they are common. The specific aspect of human morality that I am referring to is just one aspect out of many, but it is an important and common one. Many people who are trying to do good are not driven by anything even remotely like this specific aspect of morality. But some are. And I think that such people have done some truly horrific things.

In other words: given that this is one standard aspect of human morality, why would anyone be surprised when the result of a person ``trying to be genuinely good (in a way that does not involve anything along the lines of status or power seeking), and succeeding'' leads to extreme horror? Side effects along the lines of innocents getting hurt, or economic chaos, are presumably unwanted side effects for these types of political movements. But why would one expect them to be seen as major issues by someone who is genuinely trying to do good? Why would anyone be surprised to learn that these side effects were seen as perfectly reasonable and acceptable costs to pay for enforcing moral purity? In the specific event that you refer to in the post linked in your first footnote (the Chinese Cultural Revolution), there were extraordinary levels of economic chaos, suffering, and a very large number of dead innocents. So maybe these extraordinary levels of general disruption and destruction would have been enough to discourage the movement, if they had been predicted. Alternatively, maybe the only thing driving this event was something along the lines of seeking power for the sake of power. But maybe not. Maybe they would have concluded (even if they had predicted the outcome) that enforcing moral purity was more important (enforcing moral purity on a large number of reluctant people is not possible without power, so power-seeking behaviour is not decisive evidence against this interpretation). Humans doing good, and succeeding, are simply not safe for other humans (even under the assumption that they would have proceeded if the side effects had been predicted, and assuming that there is nothing along the lines of ``status seeking'' or ``corrupted by power'' going on). They are not safe, because their preferred outcome is not safe for other humans. So I think that there is an important piece missing from your analysis: the damage done by humans who genuinely try to do good (humans who genuinely do not seek power, or ``status'', or anything similar; humans whose actions are morally pure according to their morality), and who succeed, without causing any deal-breaking side effects. (I know that you have written elsewhere about humans not being safe for other humans. I know that you have said that Morality is Scary. But I think that an important aspect of this issue is still missing. I could obviously be completely wrong about this, but if I had to guess, I would say that it is likely that our disagreements follow from the fact that you do not consider ``safety issues coming from humans'' as being strongly connected to ``humans genuinely trying to do good, and succeeding''.)

More generally, this implies that human morality is not safe for other humans. If it were, then those sentiments that are sometimes expressed in theological terms, along the lines of ``heretics deserve eternal torture in hell'', would not keep popping up throughout human history. A human who genuinely tries to do good, and who succeeds, is a very dangerous thing for other humans. This actually has important AI safety implications. For example: this common aspect of human morality implies a serious s-risk if someone is ever able to successfully implement CEV. See for example my post:

A problem with the most recently published version of CEV 

(The advice in your post sounds good to me if you assume that you are exclusively interacting with people who share your values (and who, in addition to this, are also genuinely trying to do good). My comment is about events along the lines of the Chinese Cultural Revolution (which involved people with values that presumably differ greatly from the values of essentially all readers of this blog). My comment is not about people who share your values and try to follow them (but who might be, subconsciously, also trying to achieve other things, such as ``status''). For people like this, your analysis sounds very reasonable to me. But I think that if one looks at history, a lot of ``damage from people trying to do good'' comes from people who are not well approximated as trying to do good while ``being corrupted by power'', or while ``subconsciously seeking status'', or anything along those lines.)

Comment by ThomasCederborg on A problem with the most recently published version of CEV · 2024-01-29T19:42:06.248Z · LW · GW

I think that these two proposed constraints would indeed remove some bad outcomes. But I don't think that they would help in the thought experiment outlined in the post. These fanatics want all heretics in existence to be punished. This is a normative convention; it is a central aspect of their morality. An AI that deviates from this ethical imperative is seen as an unethical AI. Deleting all heretics from the memory of the fanatics will not change this aspect of their morality. It's genuinely not personal. They think that it would be highly unethical, for any AI, to let heretics go unpunished. They really do not want the fate of the world to be decided by an unethical AI. Any world where such an unethical entity has exerted such power is a dark world. And the LP outcome can be implemented even if the heretics are no longer around.

More generally: the problem, from the perspective of Steve, is that these two constraints do not actually grant Steve any meaningful influence regarding the adoption of those preferences that refer to Steve. I think that such influence is a necessary (but far from sufficient) feature for an AI to be better than extinction (in expectation, from the perspective of essentially any human individual). So my proposal would be to explore various ways of giving each individual meaningful influence regarding the adoption of those preferences that refer to her. One way of doing this would be to explore different ways of modifying PCEV, such that the Modified version of PCEV (MPCEV) does give each individual, in the set of individuals that MPCEV is pointed at, such influence. For example, along the lines of (some version of) the following rule:

If a preference is about Steve, then MPCEV will only take this preference into account if: (i) the preference counts as concern for the well-being of Steve, or (ii) Steve would approve of MPCEV taking this preference into account.
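
To make the shape of this rule concrete, here is a minimal sketch (my own illustration; the predicates for ``is about'', ``counts as concern for well-being'', and ``would approve'' stand in for very hard definitional problems that the rule itself does not solve, and all names are hypothetical):

```python
from dataclasses import dataclass
from typing import Callable, List

# Minimal sketch of the filtering rule quoted above. The three predicates are
# placeholders: defining them well is the hard part, and nothing here solves it.

@dataclass
class Preference:
    is_about: Callable[[str], bool]                  # does this preference refer to the person?
    is_concern_for_wellbeing: Callable[[str], bool]  # clause (i)

def taken_into_account(pref: Preference,
                       people: List[str],
                       would_approve: Callable[[str, Preference], bool]) -> bool:
    """A preference is only taken into account if, for every person it refers to,
    it either counts as concern for her well-being (clause (i)), or she would
    approve of it being taken into account (clause (ii))."""
    for person in people:
        if pref.is_about(person):
            if not (pref.is_concern_for_wellbeing(person) or would_approve(person, pref)):
                return False  # this person's veto applies
    return True

# Hypothetical usage: a preference that refers to Steve, is not concern for his
# well-being, and that Steve would not approve of, is simply not counted.
pref = Preference(is_about=lambda p: p == "Steve",
                  is_concern_for_wellbeing=lambda p: False)
print(taken_into_account(pref, ["Steve", "Bob"], lambda person, pr: False))  # False
```

The only design point this sketch is meant to highlight is that the veto is attached to the individual that a preference refers to, which is exactly the kind of influence that a ``Group AI'' does not provide.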


Even more generally, I think that it is important, and urgent, to make progress on what I call the ``what alignment target should be aimed at?'' question, and that you refer to as Goalcraft (in addition to your past work on CEV variants, it was your Goalcraft post that made me DM you and point you to this post). Exploring different ways of modifying PCEV sounds to me like a promising path towards meaningful progress on this question. I think that s-risk from successfully hitting a bad alignment target is a serious, and very under-explored, issue. I think that there are important differences between this type of s-risk and the type of AI risk associated with ``aiming failures''. In particular, progress on the ``what alignment target should be aimed at?'' question can reduce the former type of s-risk (and this can be done even if one does not find an actual answer). One way of reducing this s-risk is to find problems with existing proposals. Another way is to describe general features that are necessary for safety (for example, along the lines of the ``each individual must have meaningful influence over the adoption of those preferences that refer to her'' feature mentioned above). A third way to reduce the s-risk that comes from successfully hitting the wrong alignment target is to show that the ``what alignment target should be aimed at?'' question is, genuinely, unintuitive.

One very positive thing that happens to be true is that the class of bad outcomes I am trying to prevent would probably involve a very capable design team, one that is careful and clever enough to actually hit what they are aiming for. Explaining insights to such a design team sounds feasible (including meta-insights, such as the fact that this question is genuinely unintuitive). In other words: once insights have been generated, it will probably be relatively easy to communicate them (at least compared to many other ``AI is dangerous'' related communication tasks). First, however, such insights must be generated, and this will probably require some dedicated effort. So the immediate task, as far as I can tell, is to create a community of people who are fully focused on exploring the ``what alignment target should be aimed at?'' question.

Comment by ThomasCederborg on A problem with the most recently published version of CEV · 2024-01-29T19:41:29.456Z · LW · GW

I do think that the outcome would be LP (more below), but I can illustrate the underlying problem using a set of alternative thought experiments that do not require agreement on LP vs MP.

Let's first consider the case where half of the heretics are seen as Mild Heretics (MH) and the other half as Severe Heretics (SH). MH are those who are open to converting as part of a negotiated settlement (and SH are those who are not open to conversion). The Fanatics (F) would still prefer MP, where both MH and SH are hurt as much as possible. But F is willing to agree to a Negotiated Position (NP), where MH escape punishment in exchange for conversion, but where SH are hurt as much as possible, subject to a set of additional constraints. One such constraint would be a limit on what types of minds can be created and tortured as a way of hurting SH.

F prefers MP, and will vote for MP unless MH agree to vote for NP. Thus, agreeing to vote for NP is the only option available to MH that would remove the possibility of them personally being targeted by a powerful AI using all its creativity to think up clever ways of hurting them as much as possible. It would also be their only way of reliably protecting some class of hypothetical future individuals that they care about, and that would be created and hurt in MP. Thus, the negotiated position is NP.

This variant of the thought experiment is perhaps better at illustrating the deeply alien nature of an arbitrarily defined abstract entity (given the label ``a Group'') that each individual would be subjected to, in case of the successful implementation of any AI that is describable as ``doing what a Group wants'' (the class of ``Group AI'' proposals includes all versions of CEV, as well as many other proposals). I think that this is far more dangerous than an uncaring AI. In other words: a ``Group AI'' has preferences that refer to you, but you have no meaningful influence regarding the adoption of those preferences. That decision, just like every other decision, is entirely in the hands of an arbitrarily defined abstract entity (pointed at using an arbitrarily defined mapping that maps sets of billions of human individuals to the class of entities that can be said to want things). My proposed way forward is to explore designs that give each individual meaningful influence regarding the adoption of those preferences that refer to her (doing so results in AI designs that are no longer describable as ``doing what a Group wants''). I say more about this in my response to your second comment on this post. But for the rest of this comment, I want to illustrate that the underlying issue does not actually depend on agreement with either of the two thought experiments discussed so far. Basically: I will argue that the conclusion that PCEV is deeply problematic is not dependent on agreement on the details of these two thought experiments (in other words: I will outline an extended argument for the premise of your second comment).

First, it's worth noting explicitly that the NP outcome is obviously not bad in any ``objective'' sense. If Bob likes the idea of sentient minds being tortured, then Bob will see NP as a good outcome. If Dave only cares about launching an AI as soon as possible (and is fully indifferent to which AI is launched), then Dave will simply not see either of these two thought experiments as relevant in any way. But I think that most readers will agree that NP is a bad outcome.

Let's turn to another class of thought experiments that can be used to illustrate a less severe version of the same problem. Consider Steve, who wants everyone else to be punished. Steve is, however, willing to negotiate, and will agree not to vote for punishment if he gets some extra bonus that does not imply anyone else getting hurt (for example: above-average influence regarding what should be done with distant galaxies, or an above-average amount of resources to dispose of personally). The size of the bonus is now strongly sensitive to the severity of the punishment that Steve wants to inflict on others. The more hateful Steve is, the larger the bonus he gets. Yet again: this feature is not bad in any ``objective'' sense (Bob and Dave, mentioned above, wouldn't see it as problematic in any way). But I hope that most readers will agree that building an AI that behaves like this is a bad idea.
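
One way to make this sensitivity explicit is a toy bargaining sketch (my own illustration, with made-up numbers, and assuming only that the others must at minimum make Steve indifferent between taking the bonus and voting for the punishment he wants):

```python
# Toy model: the smallest bonus that makes Steve weakly prefer "no punishment
# plus bonus" over "punishment happens". All numbers are hypothetical.

def minimum_bonus(value_steve_places_on_punishment: float,
                  value_steve_places_on_one_unit_of_bonus: float = 1.0) -> float:
    """Bonus b such that b * unit_value >= the value Steve places on the punishment."""
    return value_steve_places_on_punishment / value_steve_places_on_one_unit_of_bonus

for severity in (1.0, 10.0, 100.0):  # how much Steve values punishing everyone else
    print(severity, "->", minimum_bonus(severity))
# 1.0 -> 1.0, 10.0 -> 10.0, 100.0 -> 100.0: a more hateful Steve extracts a larger bonus
```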

We can also consider the animal rights activists Todd and Jeff. Both say that they strongly oppose the suffering of any sentient being. Todd actually does oppose all forms of suffering. Jeff is, sort of, telling the truth, but he is operating under the assumption that everyone would want to protect animals if they were just better informed. What Jeff actually wants is moral purity. He wants other people to behave correctly. And, even more importantly, Jeff wants other people to have the correct morality. And when Jeff is faced with the reality that other people will not adopt the correct morality, even when they are informed about the details of factory farming and given time for reflection, then Jeff will decide that they deserve to be punished for lack of moral purity. In a situation where Jeff is in a weak political position, and where he is still able to convince himself that most people are just misinformed, Jeff does not openly take any totalitarian, or hateful, political positions. However, when Jeff finds out that most people, even when fully informed and given time to reflect, would still choose to eat meat (in a counterfactual situation where the AI is unable to provide meat without killing animals), then he wants them punished (as a question of moral principle: they deserve it, because they are bad people, and an AI that lets them avoid punishment is an unethical AI). In a political conversation, it is essentially impossible to distinguish Todd from Jeff. So a reasonable debate rule is that you should conduct debates as if all of your opponents (and all your allies) are like Todd. Accusing Todd of being like Jeff is unfair, since there is absolutely no way for Todd to ever prove that he is not like Jeff. It is also an accusation that can be levelled at most people, for taking essentially any normative or political position. So, having informal debate rules stating that everyone should act as if all people involved in the conversation are like Todd often makes a lot of sense. It is however a mistake to simply assume that all people really are like Todd (or that people will remain like Todd even when they are informed that other people are not simply misinformed, and that value differences are, in fact, a real thing). In particular, when we are considering the question of ``what alignment target should be aimed at?'', it is important to take into account the fact that PCEV would give a single person like Jeff far more power than a large number of people like Todd. Even if a given political movement is completely dominated by people like Todd, the influence within PCEV from the members of this movement would be dominated by people like Jeff. Even worse is the fact that the issues that would end up dominating any PCEV-style negotiation are those issues that attract people along the lines of Jeff (in other words: what people think about the actual issue of animal rights would probably not have any significant impact on the actual outcome of PCEV. If some currently discussed question turned out to matter to the negotiations of extrapolated delegates, then it would probably be the type of issue that tends to interest the ``heretics deserve eternal torture in hell'' crowd). So, while asking the question of ``what alignment target should be aimed at?'', it is actually very important to take the existence of people like Jeff into account.

(I use animal rights as a theme because the texts that introduce PCEV use this theme (and, as far as I can tell, PCEV is the current state of the art in terms of answering the ``what alignment target should be aimed at?'' question). The underlying dynamic is, however, not connected to this particular issue. In fact, sentiments along the lines of ``heretics deserve eternal torture in hell'' have historically not had particularly strong ties to animal rights (such sentiments have been common in many different times and places throughout human history, but they do not seem to be common amongst current animal rights movements). However, the animal rights issue does work to illustrate the point (also: sticking with this existing theme means that I don't have to speculate out loud regarding which existing group of people is most like Jeff). Even though Jeff is a non-standard type of fanatic, it is still perfectly possible to use the power differential between Jeff and Todd in PCEV to illustrate the underlying problematic PCEV feature in question (since this feature of PCEV is not related to the specifics of the normative question under consideration, it is trivial to make the exact same point using essentially any normative question / theme).)


Regarding the validity of the thought experiment in the post:

If humans are mapped to utility functions such that LP is close to maximally bad, then the negotiated outcome would indeed not be LP. However, I don't think that this would be a reasonable mapping, because I think that a clever enough AI would be capable of thinking up something that is far worse than LP (more below).

Regarding Pascal's Mugging: this term is not usually used for these types of probabilities. If one in a hundred humans is a fanatic (or even one in a thousand), then I don't think it makes sense to describe this as Pascal's Mugging. (For a set of individuals such that LP and MP are basically the same, the outcome would indeed not be LP. But I still don't think that it would count as a variant of Pascal's Mugging.) (Perhaps I should not have used the phrase ``tiny number of fanatics''. I did not mean ``tiny number'' in the ``negligible number'' sense; I was using it in the standard English sense.)

I do not think that LP and MP would be even remotely similar. This assessment does not rely on the number of created minds in LP (or the number of years involved). Basically: everything that happens in LP must be comprehensible to a heretic. That is not true for MP. And the comparison between LP and MP is made by an extrapolated delegate.

In MP, the fanatics would ask an AI to hurt the heretics as much as possible. So, for each individual heretic, the outcome in MP is designed by a very clever mind, specifically for the purpose of horrifying that heretic in particular (using an enormous amount of resources). The only constraint is that any mind created by PCEV must also be a heretic. In LP, the scenarios under consideration (that the 10^15 created minds would be subjected to) are limited to the set that the heretic in question is capable of comprehending. Even if LP and MP involved the same number of minds, and the same number of years, I would still expect LP to be the negotiated outcome. MP is still the result of a very clever AI using all of its creativity to think up an outcome specifically designed to horrify this particular heretic. Betting against LP as the negotiated outcome means betting against the ability of a very powerful mind to find a clever solution. In other words: I expect MP to be far worse than LP (and thus LP to be the negotiated outcome), for the same reason that I expect clever AI1 to defeat AI2 (which is equally clever, but is limited to strategies that a human is capable of comprehending) in a war, even if AI2 starts with a lot more tanks. (If the sticking point is the phrase ``the most horrific treatment that this heretic is capable of comprehending'', then perhaps you will agree that the outcome would be LP, if the wording is changed to ``the most horrific treatment that this heretic is capable of coming up with, given time to think, but without help from the AI''.)