Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure

post by ThomasCederborg · 2024-10-03T00:01:34.011Z · LW · GW · 7 comments

Contents

  Summary
  The scenario
  Why this would be dangerous
  Conclusion
7 comments

Summary

A previous post [LW · GW] argued that work on Alignment Target Analysis (ATA) needs to start now in order to reduce the probability of a bad alignment target getting successfully implemented. The present post is focused on one specific scenario where starting ATA work now would reduce the probability of disaster. Different types of time crunch can lead to situations where there will not be much time to do ATA later. We will make some optimistic assumptions in order to focus on one specific type of time crunch that remains despite these assumptions. We will assume a human augmentation project that leads to smarter humans. Then we assume the creation of a Limited AI (LAI) that removes all external time pressure from competing AI projects by uploading the augmented humans. Then we describe a scenario where value drift and internal power struggles lead to internal time pressure. One faction takes a calculated risk and successfully hits an alignment target, despite the fact that this alignment target has never been properly analysed.

The scenario in brief: after creating an LAI that uploads them, the latest and most advanced generation of augmented humans realise that they have undergone value drift. This gives them an incentive to act fast. The earlier generations might discover the value drift at any time and disempower them. Their only chance of being in charge of a Sovereign AI project is thus to launch before the value drift is discovered. In other words: unless the latest generation acts quickly, they permanently lose their ability to influence numerous decisions along the lines of (i): extrapolation definitions, (ii): how to resolve disagreements amongst individuals that disagree on how to resolve disagreements, (iii): how to structure add-ons along the lines of a last judge off switch, etc.

We will also argue that augmented humans with an increased ability to hit alignment targets will not necessarily be good at analysing alignment targets (these are two very different types of skills). This means that an alignment target might get successfully implemented without ever being properly analysed. That would be dangerous, because even serious flaws in well known alignment targets can go undetected for a long time. The most recently published version of CEV is Parliamentarian CEV (PCEV). It turns out that PCEV gives a large amount of extra influence to humans that intrinsically value hurting other humans (search the CEV arbital page for ADDED 2023 for Yudkowsky’s description of the issue). An AI Sovereign dominated by such people would be very dangerous [LW · GW]. This issue went unnoticed for a long time.

What happened with PCEV shows that (i): ATA is difficult and risks from bad alignment targets getting successfully implemented are serious, and (ii): reducing these risks is a tractable research project (because risks can be reduced without finding any good alignment target: simply describing this feature of PCEV presumably removed most of the risk from scenarios where PCEV is successfully implemented). However, there does not exist a single research project dedicated to ATA. The post concludes by arguing that this neglect is a serious mistake.


Thanks to Chi Nguyen [LW · GW] for giving great feedback on an earlier version of this post.


The scenario

Some AI proposals are based on the idea of building an AI that buys time. These proposed AIs are not supposed to reorganise the world. They are instead supposed to perform some Pivotal Act that will remove time pressure from the designers, so that they can take their time while designing another AI. Let's write Limited AI (LAI) for an AI that is designed to do something that reduces external time pressure. This section describes a scenario, designed to show that an LAI that removes all external time pressure might not buy a lot of time due to internal time pressure. The present post elaborates on a scenario that was briefly mentioned in an earlier post [LW · GW], where it was one part of a comprehensive argument for Alignment Target Analysis (ATA) being urgent (the previous post also covered topics such as AI assistants helping with ATA).

Now let's make some very optimistic assumptions, so that we can focus on an issue that remains despite these assumptions. Consider a scenario where a group of augmented humans keep improving some augmentation method. Each new version of the augmentation method is only used on a quarter of the augments (as a precaution against side effects that are not noticed right away). Eventually these augments succeed in launching an LAI that uploads them and gives them infinite time to work (permanently and completely removing the threat from all competing AI projects, without needing to interfere with those projects. In other words: completely removing all external time pressure). Let's also assume that it is somehow common knowledge that the LAI will ensure that everyone stays calm / rational / sane / etc indefinitely (with all related definitional issues known to be fully solved). The LAI also removes all risks related to failure to hit the alignment target being aimed at. If a majority of the augments vote yes, then the LAI will allow them to launch a Sovereign AI. A majority vote can also disenfranchise augments (so that they can deal with augmentation method side effects).

At upload time there exists a popular Sovereign AI proposal that has been around for a while. No one has found any serious problems with it. The original plan was to continue augmenting after uploading, and to take a lot of time to analyse this proposal in depth before deciding what to do next. When the uploaded augments have had some time to reflect, the latest generation realises that the latest version of the augmentation method has resulted in value drift. The others could discover this at any time and disenfranchise the latest generation. If the augmentation project were to proceed, then the next generation would probably experience some difficult-to-predict form of value drift.

Even though all external time pressure has been fully removed, there still exists an internal time pressure: an internal group dynamic that generates an incentive to act quickly. In other words: the most advanced mind in existence now has an incentive to convince the others to move forwards as fast as possible. The rest of this section will argue that a group of value drifted augments might have a strong reason to maintain control (if they don't act fast, they lose control to people with different values). And it will argue that they might succeed at convincing the others to move quickly (they are the most advanced mind around). The next section will argue that this is dangerous. And that ATA work done now can reduce this danger.

The original plan was that they would continue to augment themselves until they felt ready to launch a Sovereign AI. But if this plan is followed, the latest generation would lose influence over the many definitions and other decisions involved in any Sovereign AI project. These value drifted augments would for example be unable to ensure the protection of things that only they care about (for example by designing the rules for something along the lines of a last judge off switch). Thus, the latest generation has an incentive to take a calculated risk, and launch a Sovereign AI as soon as possible. If they move forwards quickly, they know that they face the risk that the Sovereign AI proposal has a flaw that no one has noticed. But if they wait, they lose control to people with different values.

It is not certain that a group of augmented humans would take the risk of moving forwards in this situation. They could decide to come clean about what happened and hope that the outcome would still be acceptable to them. They would know that moving ahead quickly is risky. But if some specific alignment target is seen as reasonably likely to work as planned, then this risk might be viewed as smaller than the risks associated with losing control over implementation details. Especially if one effect of the value drift was to make them care intrinsically about some thing that no one else cares about.

To make this hypothetical more concrete, let's say that they feel a strong aversion towards any scenario where anyone implements any technical plan that has some specific set of features (for example features associated with a lack of robustness). When they reflect on their values, they realise that they have started to consider this to be intrinsically bad. Not just instrumentally bad. They want to prevent all scenarios where anyone follows any such plan, regardless of context (including in contexts where there is no harm in failing, and where trial and error would be fun).

(Readers that would like to make this situation more intuitive may try a perspective shift. First imagine some specific thing that is important to you. Now imagine that only you and a small group of other people care about this thing. You are part of a project that plans to build an AI that will get its goal from billions of less advanced minds, that do not consider this specific thing to be relevant at all (it might for example work to imagine a project that will build an AI that does what billions of chimpanzees want the AI to do). Perhaps this perspective shift can help illustrate why the latest generation might be prepared to take a calculated risk, rather than giving up control to people with different values. From the perspective of a value drifted augment generation, ordinary humans might be the ones that are uncaring in a dangerous way.)

If they do decide to push for a quick launch, they might turn out to be very good at making this happen. They are the result of an augmentation method that selected for a set of abilities that are useful for getting things done. Even if it was not selected for on purpose, persuasion ability will presumably have been de facto selected for (for example as a side effect of selecting for the ability to come up with important novel ideas. It is not easy to distinguish between (i): the ability to explain why an important novel idea is in fact important, and (ii): general persuasion ability). So they might be very capable of getting their Sovereign AI project implemented. While they are not able to achieve arbitrary ends with fully general manipulation techniques, they might be in a good position to convince the others to move forwards with a pre existing Sovereign AI proposal.

One obvious option is to base their persuasion campaign on the idea that waiting is dangerous. They might for example say that interactions with the LAI that runs their upload environment come with manipulation risks (which might very well be a genuine problem [LW(p) · GW(p)]). They might even point out that every new generation comes with a serious danger: value drift might produce a powerful mind with alien values, that might be able to hide this value drift (which in this thought experiment is definitely a real problem). More generally: when a novel type of powerful mind (that is the most advanced mind around) has a strong incentive to get something implemented fast, it is difficult to rule out scenarios where they succeed.


Why this would be dangerous

The danger is that the alignment target in question might have a hidden flaw. One might think that since these augmented humans were smart enough to successfully implement the LAI, then they must be smart enough to notice any flaw in the Sovereign AI proposal (at least if the flaw is humanly findable). In other words: one might think that if these augments are good at hitting alignment targets, then they must also be good at Alignment Target Analysis (ATA).

It is possible that improving the ability to hit an alignment target will dramatically improve the ability to notice flaws in proposed alignment targets. In other words: it is possible that these augmented humans will be so good at ATA, that they will have no use for ATA work done by non-augmented humans. But this is by no means guaranteed. Hitting an alignment target and analysing an alignment target are two very different types of skills. (Putting a man on the moon is one type of skill. Predicting what effects such a project would have on society is a very different type of skill). It is also possible that there are tradeoffs (so that selecting for one type of skill selects against the other type of skill).

One way of thinking might be very useful for designing a technical plan from scratch that will actually work. But that same way of thinking might be counterproductive when trying to find unexamined implicit assumptions in an existing alignment target proposal. One way of doing things is to build systems incrementally from scratch (steadily building towards a known target behaviour by incrementally adding well understood components). An alternative way of doing things is to sketch out lots of complete proposals and then check them for flaws. It could be that minds for whom the former strategy intuitively sounds like the way things should be done, are well suited for hitting alignment targets. But minds for whom the latter strategy intuitively sounds like the way things should be done, are well suited for noticing flaws in existing alignment target proposals. In this case, selecting for ability to hit alignment targets selects against ability to do ATA (because it selects for minds for whom the former way of doing things sounds like the way things should be done. And thus selects against minds for whom the latter way of doing things sounds like the way things should be done).

More generally: some features of a mind might be good for one ability, but bad for the other ability. If that is the case, then selecting for an ability to hit an alignment target might select against an ability to do ATA.

As a separate issue: even if the augmentation method does turn out to increase the ability to do ATA, this might not be enough to make any of them better than the best baseline humans. The best out of a small population of augmented humans might still not be as good at ATA as the best out of billions of baseline humans. Finally, even if they end up better than any baseline human at ATA under ideal conditions, this does not automatically result in de facto better performance. If they find themselves under pressure, they might never actually perform as well as the best baseline humans would perform (if those baseline humans are focused on doing ATA).
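The population-size point can be illustrated with a toy calculation. This is only a sketch under loud assumptions that are not part of the argument above: it pretends that ATA ability is a single number, that it is normally distributed with standard deviation 1, and that augmentation simply shifts the mean. The function name `median_of_best`, the population sizes, and the shift values are all hypothetical. The sketch computes the median of the best score in a population, using the fact that the maximum of n independent draws is below x with probability F(x)^n.

```python
from statistics import NormalDist


def median_of_best(population: int, shift: float = 0.0) -> float:
    """Median of the highest score among `population` independent draws
    from a normal distribution with mean `shift` and standard deviation 1.

    Toy model only: "ability" as one normally distributed number is an
    assumption made for illustration.
    """
    if population < 1:
        raise ValueError("population must be at least 1")
    # P(max <= x) = F(x) ** population, so the median of the maximum
    # is the x that solves F(x) = 0.5 ** (1 / population).
    quantile = 0.5 ** (1.0 / population)
    return shift + NormalDist().inv_cdf(quantile)


# Hypothetical numbers: 100 augments whose average ability is shifted
# by 3 standard deviations, versus 8 billion baseline humans.
best_augment = median_of_best(100, shift=3.0)
best_baseline = median_of_best(8_000_000_000)
print("best augment ahead of best baseline:", best_augment > best_baseline)
```

Under these made-up numbers the typical best augment still falls short of the typical best baseline human, while a larger assumed shift (for example 5 standard deviations) reverses the comparison. The sketch says nothing about which shift is realistic. It only shows that a large average improvement in a small group does not by itself settle the question.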

More generally: there is no particular reason to think that the augments in the above scenario would be able to make significant ATA progress in the time that they have available. Which means that this scenario might lead to an alignment target getting implemented, despite having a flaw that could have been caught by non-augmented humans doing ATA (in other words: it means that doing ATA now reduces the probability of disaster).

As shown in a previous post [LW · GW], successfully hitting a bad alignment target can be very dangerous. In brief: the most recently published version of CEV is Parliamentarian CEV (PCEV). It turns out that PCEV gives a very large advantage to individuals that intrinsically value hurting other individuals. Those that want to inflict a lot of harm get a bigger advantage than those that want to inflict less harm. The largest possible advantage is given to groups that want the AI to hurt everyone else as much as possible. The fact that PCEV would be dominated by this type of people means that a successfully implemented PCEV would be massively worse than extinction.

This issue went undetected for many years, despite PCEV being a fairly prominent proposal (PCEV is the version of CEV that is on the CEV arbital page). So, even if the alignment target that the augments in the above scenario decide to aim at has been around for a while, it might still suffer from an undetected flaw. A flaw that could have been detected by baseline humans doing ATA. If that flaw is detected in time, the latest augment generation might accept the loss of control rather than rush things (at least if the flaw is serious enough). But if the flaw is not detected in time, they might instead take the calculated risk of moving ahead.

One might wonder why this post describes a specific scenario in such detail (given that every detail makes the scenario less likely). The main reason is that without the details, some readers might conclude that the described situation would not actually lead to an alignment target getting successfully implemented without being properly analysed. To conclude that it is safe to stay at the current level of ATA progress, one has to be confident that one has predicted and prevented every scenario like this (every scenario that leads to an alignment target getting successfully implemented without being properly analysed). Saying that current levels of ATA progress are safe is equivalent to saying that no scenario like the one above exists. Thus, outlining one such scenario is a refutation of this safety claim. It is however possible to come up with any number of specific scenarios. To conclude that our current levels of ATA progress are safe, one has to first describe all of these paths. And then one has to reliably prevent all of them.

In other words: there exists a more general problem that this specific scenario is meant to illustrate. To conclude that it is safe to stay at current levels of ATA progress, one would need to deal with this more general problem. Basically: there exists a large number of hard-to-predict paths that end in an alignment target getting successfully implemented, even though it suffers from a realistically findable flaw. Combined with the fact that risk mitigation has been shown to be tractable, it seems like a mistake to act based on the assumption that these risks do not need to be mitigated.


Conclusion

A previous post [LW · GW] outlined a comprehensive case for Alignment Target Analysis (ATA) being urgent. The present post elaborated on one specific scenario from that post: a Limited AI (LAI) removes external time pressure from competing AI projects, but still fails to buy a lot of time for ATA. The scenario illustrated a general problem: internal time pressure. Power struggles amongst whoever ends up in charge of an LAI might lead to an alignment target getting successfully implemented without ever being properly understood (because someone takes a calculated risk). The field of ATA is still at a very early stage. And there does not exist a single research project dedicated to ATA. So if an LAI leads to a situation with internal time pressure, then there is no reason to think that the field will have advanced much from its current state. While the post focused on one specific set of circumstances, the issue with internal time pressure is a general problem.

Let's briefly look at another scenario. Consider an LAI that is instead under the control of a large population of ordinary humans. A large but shrinking majority might act before they lose the ability to act. Let's say that a two thirds majority in a referendum is needed for the LAI to permit the launch of a Sovereign AI. A majority currently has enough support for a specific alignment target. However, a minority of people with different values continues to grow every year (due to ordinary political dynamics). The majority might now decide to launch their favoured type of AI Sovereign before they lose the ability to do so. (In this scenario, the shrinking majority and the growing minority favour different alignment targets due to well known value differences. In other words: in this scenario, the time crunch arises for reasons unrelated to things such as hidden value changes and the wish to influence implementation details. But the basic dynamic is the same: there is an incentive to take a calculated risk and act decisively, before losing control to people with different values). See also section 3 of this comment [LW · GW].
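The deadline that the shrinking majority faces can be made concrete with a small arithmetic sketch. Everything here is an illustrative assumption rather than part of the scenario as described: the function name `years_of_supermajority_left`, the starting support of 70%, and the steady loss of one percentage point per year are all hypothetical. Only the two thirds threshold comes from the text. Exact fractions are used so that the comparison with 2/3 is not affected by floating point rounding.

```python
from fractions import Fraction

TWO_THIRDS = Fraction(2, 3)


def years_of_supermajority_left(
    support: Fraction,
    yearly_loss: Fraction,
    threshold: Fraction = TWO_THIRDS,
) -> int:
    """Number of further whole years after which the majority can still
    win a referendum, assuming support falls by `yearly_loss` each year.

    Hypothetical model: a constant yearly decline is an assumption made
    purely for illustration.
    """
    if yearly_loss <= 0:
        raise ValueError("yearly_loss must be positive")
    if support < threshold:
        raise ValueError("the supermajority has already been lost")
    years = 0
    # Keep counting while next year's support still meets the threshold.
    while support - yearly_loss * (years + 1) >= threshold:
        years += 1
    return years


# Hypothetical numbers: 70% support today, losing 1 point per year.
remaining = years_of_supermajority_left(Fraction(70, 100), Fraction(1, 100))
print("years left to act:", remaining)
```

The point of the sketch is only that such a majority faces a hard and fairly near deadline that it can compute for itself, which is what creates the incentive to act before more ATA work has been done.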

ATA as a risk mitigation tool is tractable, because ATA does not need to result in a good alignment target in order to mitigate risks. Besides noticing problems with specific classes of proposals, one potential risk mitigation tool is to identify features that are necessary. A necessary feature can reduce risks even if it is far from sufficient. Even if it is not always clear whether or not a given proposal can be reasonably described as having the feature in question, identifying it as necessary can still be useful. Because this makes it possible to rule out those proposals that are clearly not describable as having the feature. The role that such a feature can play was discussed in a previous post [LW · GW] (in the context of Membrane formalisms).

It seems like there exists a wide range of reasons for why many people believe that it is safe to stay at our current level of ATA progress. Previous posts have discussed specific such reasons related to Corrigibility [LW · GW], the last judge idea [LW · GW], and other LAI proposals [LW · GW]. If anyone has a reason for believing that staying at our current level of ATA progress is safe (that is not covered by the above posts), then it would be greatly appreciated if those reasons were to be described somewhere. Finally: I'm trying to understand people who act based on the assumption that it is safe to stay at our current level of ATA progress (where it is clearly not possible to reliably tell a good alignment target from an alignment target that implies an outcome massively worse than extinction [LW · GW]). Please don't hesitate to contact me if you have any theories, observations, or questions related to this.

 

(I am also posting this on the EA Forum [EA · GW])

7 comments

Comments sorted by top scores.

comment by johnswentworth · 2024-10-03T16:30:58.718Z · LW(p) · GW(p)

Terminological note: something which does not buy ample time is not a pivotal act. Eliezer introduced the term to mean a specific thing, which he summarized as:

(as of late 2021) researchers use 'pivotal' and 'pivotal act' to refer to good events that upset the current gameboard - events that decisively settle a win, or drastically increase the probability of a win.

That same page also talks a bunch about how sticking to that definition is important, because there will predictably be lots of pressure to water the term down.

Replies from: martin-randall, ThomasCederborg, ThomasCederborg, faul_sname
comment by Martin Randall (martin-randall) · 2024-10-05T00:39:11.440Z · LW(p) · GW(p)

Something which might not buy ample time can still be a pivotal act. From the Arbital page that you link to:

Example 3: Suppose a behaviorist genie is restricted from modeling human minds in any great detail, but is still able to build and deploy molecular nanotechnology. Moreover, the AI is able to understand the instruction, "Build a device for scanning human brains and running them at high speed with minimum simulation error", and is able to work out a way to do this without simulating whole human brains as test cases. The genie is then used to upload a set of, say, fifty human researchers, and run them at 10,000-to-1 speeds.

This accomplishment would not of itself save the world or destroy it - the researchers inside the simulation would still need to solve the alignment problem, and might not succeed in doing so.

But it would (positively) upset the gameboard and change the major determinants of winning, compared to the default scenario where the fifty researchers are in an equal-speed arms race with the rest of the world, and don't have practically-unlimited time to check their work. The event where the genie was used to upload the researchers and run them at high speeds would be a critical event, a hinge where the optimum strategy was drastically different before versus after that pivotal act.

The Limited AI (LAI) scenario in this post is equivalent to this example and therefore qualifies as a Pivotal Act under the Arbital Guarded Definition. Additionally, looking at your specific quote, the LAI would "drastically increase the probability of a win".

comment by ThomasCederborg · 2024-10-04T15:11:07.627Z · LW(p) · GW(p)

I will change the title.

However: you also seem to be using the term Pivotal Act as a synonym for removing all time pressure from competing AI projects (which the AI in my post does). Example 3 of the arbital page that you link to also explicitly refers to an act that removes all time pressure from competing AI projects as a Pivotal Act. This usage is also present in various comments by you, Yudkowsky, and others (see links and quotes below). And there does not seem to exist any other established term for an AI that: (i): completely removes all time pressure from competing AI projects by uploading a design team and giving them infinite time to work, (ii): keeps the designers calm, rational, sane, etc indefinitely (with all definitional issues of those terms fully solved), and (iii): removes all risks from scenarios where someone fails to hit an alignment target. What other established term exists for such an AI? I think people would generally refer to such an AI as a Pivotal Act AI. And as demonstrated in the post: such an AI might not buy a lot of time.

Maybe using the term Pivotal Act as a synonym for an act that removes all time pressure from competing AI projects is a mistake? (Maybe the scenario in my post should be seen as showing that this usage is a mistake?). But it does seem to be a very well established way of using the term. And I would like to have a title that tells readers what the post is about. I think the current title probably did tell you what the post is about, right? (that the type of AI actions that people tend to refer to as Pivotal Acts might not buy a lot of time in reality)

In the post I define new terms. But if I use a novel term in the title before defining this term, the title will not tell you what the post is about. So I would prefer to avoid doing that.

But I can see why you might want to have Pivotal Act be a protected term for something that is actually guaranteed to buy a lot of time (which I think is what you would like to do?). And perhaps it is possible to maintain (or re-establish?) this usage. And I don't want to interfere with your efforts to do this. So I will change the title.

If we can't find a better solution I will change the title to: Internal Time Pressure. It does not really tell you what the post will be about. But at least it is accurate and not terminologically problematic. And even though the term is not commonly known, Internal Time Pressure is actually the main topic of the post (Internal Time Pressure is the reason that the AI mentioned above, that does all the nice things mentioned, might not actually buy a lot of time).


Regarding current usage of the term Pivotal Act:

It seems to me like you and many others are actually using the term as a shorthand for an AI that removes time pressure from competing AI projects. I can take many examples of this usage just from the discussion [LW · GW] that faul_sname links to in the other reply [LW(p) · GW(p)] to your comment.

In the second last paragraph of part 1 [LW · GW] of the linked post, Andrew_Critch writes:

Overall, building an AGI development team with the intention to carry out a “pivotal act” of the form “forcibly shut down all other A(G)I projects” is probably going to be a rough time, I predict.

No one seems to be challenging that usage of Pivotal Act (even though many other parts of the post are challenged). And it is not just this paragraph. The tl;dr also treats a Pivotal Act as interchangeable with: shut down all other AGI projects, using safe AGI. There are other examples in the post.

In this comment [LW(p) · GW(p)] on the post, it seems to me that Scott Alexander is using a Pivotal Act AI as a direct synonym for an AI capable of destroying all competing AI projects.

In this comment [LW(p) · GW(p)] it seems to me like you are using Pivotal Act interchangeably with shutting down all competing AI projects. In this comment [LW(p) · GW(p)], it seems to me that you accept the premise that uploading a design team and running them very quickly would be a Pivotal Act (but you question the plan on other grounds). In this comment [LW(p) · GW(p)], it seems to me that you are equating successful AI regulation with a Pivotal Act (but you question the feasibility of regulation).

In this comment [LW(p) · GW(p)], Yudkowsky seems to me to be accepting the premise that preventing all competing AI projects would count as a Pivotal Act. He says that the described strategy for preventing all competing AI projects is not feasible. But he also says that he will change the way he speaks about Pivotal Acts if the strategy actually does work (and this strategy is to shut down competing AI projects with EMPs. The proposed strategy does nothing else to buy time, other than shutting down competing AI projects). (It is not an unequivocal case of using Pivotal Act as a direct synonym for reliably shutting down all competing AI projects. But it really does seem to me like Yudkowsky is treating Pivotal Act as a synonym for: preventing all competing AI projects. Or at least that he is assuming that preventing all competing AI projects would constitute a Pivotal Act).

Consider also example 3 in the arbital page that you link to. Removing time pressure from competing AI projects by uploading a design team is explicitly defined as an example of a Pivotal Act. And the LAI in my post does exactly this. And the LAI in my post also does a lot of other things that increase the probability of a win (such as keeping the designers sane and preventing them from missing an aimed for alignment target).

This usage points to a possible title along the lines of: The AI Actions that are Commonly Referred to as Pivotal Acts, are not Actually Pivotal Acts (or: Shutting Down all Competing AI Projects is not Actually a Pivotal Act). This is longer and less informative about what the post is about (the post is about the need to start ATA work now, because there might not be a lot of time to do ATA work later, even if we assume the successful implementation of a very ambitious AI, whose purpose was to buy time). But this title would not interfere with an effort to maintain (or re-establish?) the meaning of Pivotal Act as a synonym for an act that is guaranteed to buy lots of time (which I think is what you are trying to do?). What do you think about these titles?


PS:

(I think that technically the title probably does conform to the specific text bit that you quote. It depends on what the current probability of a win is. And how one defines: drastically increase the probability of a win. But given the probability that Yudkowsky currently assigns to a win, I expect that he would agree that the launch of the described LAI would count as drastically increasing the probability of a win. (In the described scenario, there are many plausible paths along which the augmented humans actually do reach the needed levels of ATA progress in time. They are however not guaranteed to do this. The point of the post is that doing ATA now increases the probability of this happening). The statement that the title conforms to the quoted text bit is however only technically true in an uninteresting sense. And the title conflicts with your efforts to guard the usage of the term. So I will change the title as soon as a new title has been settled on. If nothing else is agreed on, I will change the title to: Internal Time Pressure)

Replies from: faul_sname, martin-randall
comment by faul_sname · 2024-10-04T21:13:36.873Z · LW(p) · GW(p)

Shutting Down all Competing AI Projects is not Actually a Pivotal Act

This seems like an excellent title to me.

comment by Martin Randall (martin-randall) · 2024-10-05T00:46:34.009Z · LW(p) · GW(p)

Please do not change the title. You have used the phrase correctly from both a prescriptive and a descriptive approach to language. A title such as "Shutting Down all Competing AI Projects is not Actually a Pivotal Act" would be an incorrect usage and increase confusion.

comment by ThomasCederborg · 2024-10-11T01:14:29.986Z · LW(p) · GW(p)

I changed the title from: ``A Pivotal Act AI might not buy a lot of time'' to: ``Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure''.

As explained by Martin Randall [LW(p) · GW(p)], the statement: ``something which does not buy ample time is not a pivotal act'' is false (based on the Arbital Guarded Definition of Pivotal Act). Given your ``Agreed react'' to that comment, this issue seems to be settled. In the first section of the present comment, I explain why I still think that the old title was a mistake. The second section outlines a scenario that better illustrates that a Pivotal Act AI might not buy a lot of time.

Why the old title was a mistake

The old title implied that launching the LAI was a very positive event. With the new title, launching the LAI may or may not have been a positive event. This was the meaning that I intended.

Launching the LAI drastically increased the probability of a win by shutting down all competing AI projects. It however also increased risks from scenarios where someone successfully hits a bad alignment target. This can lead to a massively worse than extinction outcome (for example along the lines of the outcome implied by PCEV [LW · GW]). In other words: launching LAI may or may not have been a positive event. Thus, launching the LAI may or may not have been a Pivotal Act according to the Arbital Guarded Definition (which requires the event to be very positive).

The old title does not seem to be incompatible with the actual text of the post. But it is incompatible with my intended meaning. I didn't intend to specify whether or not LAI was a positive event. Because the argument about the need for Alignment Target Analysis (ATA) goes through regardless of whether or not launching LAI was a good idea. Regardless of whether or not launching LAI was a positive event, ATA work needs to start now to reduce risks. Because in both cases, ATA progress is needed to reduce risks. And in both cases, there is not a lot of time to do ATA later. (ATA is in fact more important in scenarios where launching the LAI was in fact a terrible mistake)

As I show in my other reply [LW(p) · GW(p)]: there is a well established convention of using the term Pivotal Act as a shorthand for shutting down all competing AI projects. As can be seen by looking at the scenario in the post: this might not buy a lot of time. That is how I was using the term when I picked the old title.

A scenario that better illustrates why a Pivotal Act AI might not buy a lot of time

This section outlines a scenario where an unambiguous Pivotal Act is instantly followed by a very severe time crunch. It is possible to see that a Pivotal Act AI might not buy a lot of time by looking at the scenario in the post. But the present section will outline a scenario that better illustrates this fact. (In other words: this section outlines a scenario for which the old title would actually be a good title.) In this new scenario, a Pivotal Act dramatically reduces the probability of extinction by shutting down all unauthorised AI projects. It also completely removes the possibility of anything worse than extinction. Right after the Pivotal Act, there is a frenzied race against the clock to make enough progress on ATA before time runs out. Failure results in a significant risk of extinction.

Consider the case where Dave launches Dave's AI (DAI). If DAI had not been launched, everyone would have almost certainly been killed by some other AI. DAI completely and permanently shuts down all competing AI projects. DAI also reliably prevents all scenarios where designers fail to hit the alignment target that they are aiming at. Due to Internal Time Pressure, a Sovereign AI must then be launched very quickly (discussions of Internal Time Pressure can be found here [LW · GW], and here [LW · GW], and here [LW · GW]). There is very little time to decide what alignment target to aim at. (The point made in this section is not sensitive to who gave Dave permission to launch DAI. Or sensitive to who DAI will defer to for the choice of alignment target. But for the sake of concreteness, let's say that the UN security council authorised DAI. And that DAI defers to a global electorate regarding the choice of alignment target).

By the time Dave launches DAI, work on ATA has already progressed very far. There already exist many alignment targets that would in fact lead to an unambiguous win (somehow, describing these outcomes as a win is objectively correct). Only one of the many proposed alignment targets still has an unnoticed problem. And this problem is not nearly as severe as the problem with PCEV [LW · GW]. People take the risks of unnoticed problems very seriously. But due to severe Internal Time Pressure, there is not much they can do with this knowledge. The only option is to use their limited time to analyse all alignment targets that are being considered. (many very optimistic assumptions are made regarding both DAI and the level of ATA progress. This is partly to make sure that readers will agree that the act of launching DAI should count as a Pivotal Act. And partly to show that ATA might still be needed, despite these very optimistic assumptions).

The only alignment target that is not a clear win is based on maximising the sum of re-normalised utility functions. The proposed AI includes a proposed way of mapping a human to a utility function. This always results in a perfect representation of what the human wants. (And there are no definitional issues with this mapping). These functions are then renormalised to have the same variance (as discussed here). Let's write VarAI for this AI. VarAI maximises the sum of the renormalised functions. The aggregation method described above has a problem that is obvious in retrospect. If that problem is explained, then it is clear that VarAI is an unacceptable alignment target. However, in this scenario, no one has noticed this problem. The question is now whether or not anyone will notice the problem (before an alignment target needs to be settled on).
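To make the aggregation step concrete, here is a minimal sketch of it. It rests on assumptions that are not part of the scenario: a small finite list of candidate outcomes, variance computed with equal weight on each outcome, and a person who is indifferent between all outcomes contributing zero. The names `rescale_to_unit_variance` and `pick_outcome` are hypothetical. The sketch only shows what ``renormalise to the same variance, then maximise the sum'' could look like. It does not try to exhibit the unnoticed problem, which this comment does not spell out.

```python
from statistics import pvariance


def rescale_to_unit_variance(utilities):
    """Rescale one person's utilities (one value per outcome) to variance 1.

    Assumption: variance is taken over the outcome list with equal weights.
    """
    values = [float(u) for u in utilities]
    var = pvariance(values)
    if var == 0.0:
        # Assumption: someone indifferent between all outcomes adds nothing.
        return [0.0 for _ in values]
    sd = var ** 0.5
    return [v / sd for v in values]


def pick_outcome(population):
    """Return (index of the outcome with the largest summed utility, all totals).

    `population` is a list of utility lists, one list per person,
    all indexed by the same outcomes.
    """
    rescaled = [rescale_to_unit_variance(person) for person in population]
    totals = [sum(column) for column in zip(*rescaled)]
    best = max(range(len(totals)), key=totals.__getitem__)
    return best, totals


# Three people, three candidate outcomes. Raw scales differ a lot,
# but after rescaling each person has the same variance.
population = [
    [0, 1, 2],
    [10, 0, 0],
    [0, 0, 100],
]
best, totals = pick_outcome(population)
print(best, totals)
```

One consequence of the rescaling is that multiplying a person's utility function by a positive constant does not change their influence on the sum: `[0, 1, 2]` and `[0, 100, 200]` rescale to the same values.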

Due to the problem with this aggregation method, VarAI would create a Soft Maximisation version of a Suffering Reducing AI (SMSRAI) as a successor AI (if VarAI is successfully implemented and pointed at the population in this thought experiment). So, if the proponents of VarAI happen to win the political fight, then the result would be SMSRAI. SMSRAI would work to reduce both average suffering and population size (without doing anything drastic). The quality of most people's lives would increase very dramatically. Many people would choose to spend centuries living a very nice life. But after some conversations with SMSRAI, few would choose to live for millennia. No one would choose to live for many millennia. Humanity would very gradually go extinct in a few dozen generations. By the time it happens, no one would find this problematic in any way. Changes in attitudes are very gradual, and no one ever feels pressured or manipulated by SMSRAI (for example: if you ask SMSRAI to leave you alone, it backs off for a couple of centuries before politely asking again if you want to talk).

If anyone finds the problem with the aggregation method in time, then VarAI will be dropped as a potential alignment target. (Once the problem is explained, it is retrospectively obvious why VarAI is completely unacceptable as an alignment target). If no one finds this problem in time however, it is possible that VarAI will win the political fight.

In order to avoid accidental interference [LW(p) · GW(p)], Dave designed DAI to not discuss ATA. (As a general strategy for reducing risks from definitional issues, Dave actually made sure that DAI only does things that are absolutely necessary for the completion of a carefully chosen set of well defined tasks). So the only way to remove the possibility of extinction, is to reach the needed level of ATA understanding before it's too late. (If the problem with this aggregation method is not explained to them, their only path to understanding is to try doing ATA).

Despite severe Internal Time Pressure, the possibility of an unnoticed problem in an alignment target is taken very seriously. So a last judge safety add on is implemented. The last judge is extrapolated using a different extrapolation method than the one used in VarAI (in order to give the last judge a chance to veto a bad extrapolation method. The VarAI extrapolation definitions are in fact somehow objectively correct. But there is no way of verifying that fact). The result of the last judge extrapolation method turns out to be a very anxious mind. The result is a mind that is in general not comfortable with objecting to things (such as the extrapolation method that it is the output of, or the outcome implied by an alignment target). This mind is very reluctant to veto a scenario where no one is suffering, and where almost everyone is very happy with all aspects of how things turn out (SMSRAI very gradually, over many generations, ``helps people realise'' that the outcome is actually a good outcome. And people genuinely are having a very nice time, for a lot longer than most people expected). So the off switch is not triggered.

If Dave had not launched DAI, all humans would very likely have been killed very soon by some other AI. So I think a lot of people would consider launching DAI to be a Pivotal Act. (It completely upset the game board. It drastically increased the probability of a win. It was a very positive event according to a wide range of value systems). But if someone wants humanity to go on existing (or wants to personally live a super long life), then there is not a lot of time to find the problem with VarAI (because without sufficient ATA progress, there still exists a significant probability of extinction). So, launching DAI was a Pivotal Act. And launching DAI did not result in a lot of time to work on ATA. Which demonstrates that a Pivotal Act AI might not buy a lot of time.

One can use this scenario as an argument in favour of starting ATA work now. It is one specific scenario that exemplifies a general class of scenarios: scenarios where starting ATA work now would further reduce an already small risk of a moderately bad outcome. It is a valid argument. But it is not the argument that I was trying to make in my post. I was thinking of something a lot more dangerous. I was imagining a scenario where a bad alignment target is very likely to get successfully implemented unless ATA progresses to the needed levels of insight before it is too late. And I was imagining an alignment target that implied a massively worse than extinction outcome (for example along the lines of the outcome implied by PCEV [LW · GW]). I think this is a stronger argument in favour of starting work on ATA now. And this interpretation was ruled out by the old title (which is why I changed the title).

(a brief tangent: if someone expects everything to turn out well, but would like to work on ATA in order to further reduce a small probability of something going moderately badly, then I would be very happy to collaborate with such a person in a future ATA project. Having very different perspectives in an ATA project sounds like a great idea. An ATA project is very different from a technical design project where a team is trying to get something implemented that will actually work. There is really no reason for people to have similar worldviews or even compatible ontologies. It is a race against time to find a conceptual breakthrough of an unknown type. It is a search for an unnoticed implicit assumption of an unknown type. So genuinely different perspectives sound like a great idea)

In summary: ``A Pivotal Act AI might not buy a lot of time'' is in fact a true statement. And it is possible to see this by looking at the scenario outlined in the post. But it was a mistake to use this statement as the title for this post. Because it implies things about the scenario that I did not intend to imply. So I changed the title and outlined a scenario that is better suited for illustrating that a Pivotal Act AI might not buy a lot of time.

 

PS:

I upvoted johnswentworth's comment. My original title was a mistake. And the comment helped me realise my mistake. I hope that others will post similar comments on my posts in the future. The comment deserves upvotes. But I feel like I should ask about these agreement votes.

The statement: ``something which does not buy ample time is not a pivotal act'' is clearly false. Martin Randall explained [LW(p) · GW(p)] why the statement is false (helpfully pulling out the relevant quotes from the texts that johnswentworth cited). And then johnswentworth did an ``Agreed reaction'' on Martin Randall's explanation of why the statement is false. After this however, johnswentworth's comment (with the statement that had already been determined to be false) was agree voted to plus 7. That seemed odd to me. So I wanted to ask about it. (My posts sometimes question deeply entrenched assumptions. And johnswentworth's comment sort of looks like criticism (at least if one only skims the post and the discussion). So maybe there is no great mystery here. But I still wanted to ask about this. Mostly in case someone has noticed an object level error in my post. But I am also open to terminology feedback)

comment by faul_sname · 2024-10-03T17:05:44.979Z · LW(p) · GW(p)

It does strike me that, to OP's point, "would this act be pivotal" is a question whose answer may not be knowable in advance. See also previous discussion [LW · GW] on pivotal act intentions vs pivotal acts (for the audience, I know you've already seen it and in fact responded to it).