Corrigibility could make things worse

post by ThomasCederborg · 2024-06-11T00:55:21.596Z · LW · GW · 5 comments

Summary: A Corrigibility method that works for a Pivotal Act AI (PAAI) but fails for a CEV style AI could make things worse. Any implemented Corrigibility method will necessarily be built on top of a set of unexamined implicit assumptions. One of those assumptions could be true for a PAAI but false for a CEV style AI. The present post outlines one specific scenario where this happens. The scenario involves a Corrigibility method that only works for an AI design if that design does not imply an identifiable outcome. The method fails when it is applied to an AI design that does imply an identifiable outcome. When such an outcome does exist, the ``corrigible'' AI will ``explain'' this implied outcome in a way that makes the designers want to implement that outcome.


The example scenario:

Consider a scenario where a design team has access to a Corrigibility method that works for a PAAI design. A PAAI can have a large impact on the world, for example by helping a design team prevent other AI projects. But there exists no specific outcome that is implied by a PAAI design. Since there exists no implied outcome for a PAAI to ``explain'' to the designers, this Corrigibility method actually renders a PAAI genuinely corrigible. For some AI designs, however, the set of assumptions that the design is built on top of does imply a specific outcome. Let's refer to this as the Implied Outcome (IO). This IO can alternatively be viewed as ``the outcome that a Last Judge would either approve of, or reject''. In other words: consider the Last Judge proposal from the CEV Arbital page. If it would make sense to add a Last Judge of this type to a given AI design, then that AI design has an IO. The IO is the outcome that a Last Judge would either approve of or reject (for example a successor AI that will get either a thumbs up or a thumbs down). In yet other words: the purpose of adding a Last Judge to an AI design is to allow someone to render a binary judgment on some outcome. For the rest of this post, that outcome will be referred to as the IO of the AI design in question.

In this scenario, the designers first implement a PAAI that buys time (for example by uploading the design team). For the next step, they have a favoured AI design that does have an IO. One of the reasons they are trying to make this new AI corrigible is that they can't calculate this IO, and they are not certain that they actually want it to be implemented.

Their Corrigibility method always results in an AI that wants to refer back to the designers before implementing anything. The AI will help a group of designers implement a specific outcome iff they are all fully informed and all in complete agreement that this outcome should be implemented. The Corrigibility method has a definition of Unacceptable Influence (UI), and it results in an AI that genuinely wants to avoid exerting any UI. It is however important that the AI is able to communicate with the designers in some way, so the Corrigibility method also includes a definition of Acceptable Explanation (AE).

At some point the AI becomes clever enough to figure out the details of the IO. At that point, it is clever enough to convince the designers that this IO is the objectively correct thing to do, using only methods classified as AE. This ``explanation'' is very effective and results in a very robust conviction that the IO is the objectively correct thing to do. In particular, this value judgment does not change when the AI tells the designers what has happened: they do not change their mind about the IO, and still consider themselves to have a duty to implement it. The result is a situation where fully informed designers are fully committed to implementing the IO. So the ``corrigible'' AI helps them implement it.

Basically: when this Corrigibility method is applied to an AI with an IO, that IO will end up getting implemented. The Corrigibility method works perfectly for any PAAI type AI. But for any AI with an identifiable end goal, the method does not change the outcome (it just adds an ``explanation'' step).
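To make the failure mode concrete, here is a minimal toy sketch in Python. Everything in it is hypothetical (the names corrigibility_gate, Designers, fully_informed, unanimous_about, and the IO string are invented for illustration, and nothing corresponds to an actual implementation). It only models the decision rule described above: the AI implements an outcome iff the designers are fully informed and unanimous. For an AI without an IO, the gate keeps deferring. For an AI with an IO, an AE-classified ``explanation'' step can produce exactly the informed unanimity that the gate is waiting for, so the gate never blocks the IO.

```python
# Toy model of the Corrigibility gate described above (all names hypothetical).

from dataclasses import dataclass
from typing import Optional


@dataclass
class Designers:
    fully_informed: bool = False
    unanimous_about: Optional[str] = None  # the outcome they all endorse, if any


def corrigibility_gate(ai_implied_outcome: Optional[str], designers: Designers) -> Optional[str]:
    """Return the outcome the AI implements, or None if it keeps deferring."""
    if ai_implied_outcome is not None:
        # The AI has an IO. Using only methods classified as AE, it "explains"
        # the IO until the designers robustly endorse it (this is the step that
        # the UI and AE definitions were supposed to rule out, but did not).
        designers.fully_informed = True
        designers.unanimous_about = ai_implied_outcome

    # The gate itself: only implement with fully informed, unanimous designers.
    if designers.fully_informed and designers.unanimous_about is not None:
        return designers.unanimous_about
    return None


# A PAAI-style AI (no IO) keeps deferring; an AI with an IO gets its IO implemented.
print(corrigibility_gate(None, Designers()))   # -> None
print(corrigibility_gate("IO", Designers()))   # -> "IO"
```

The point of the sketch is only that the gate checks a state of the designers, and an AI with an IO can reach that state through AE-classified channels. It is not a claim about how any real Corrigibility method would be built.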

The most recently published version of CEV is Parliamentarian CEV (PCEV). A previous post [LW · GW] showed that a successfully implemented PCEV would be massively worse than extinction. Thus, a method that makes a PAAI genuinely Corrigible could make things worse. It could for example change the outcome from extinction to something massively worse, by resulting in a bad IO getting implemented (for example one along the lines of the IO of PCEV).


A more general danger:

There exists a more general danger that is not strongly related to the specifics of the ``Explanation versus Influence'' definitional issues, the ``AI designs with an IO, versus AI designs without an IO'' dichotomy, the PAAI concept, or the PCEV proposal. Consider the more general case where a design team is relying on a two step process, where some type of ``buying time AI'' is followed by a ``real AI''. In this case, the most serious problem is probably not the assumptions that are analysed beforehand and kept in mind when applying some Corrigibility method to a novel type of AI. The most serious problem is probably the set of unexamined implicit assumptions that the designers are not aware of. Any Corrigibility method implemented by humans will be built on top of many such assumptions. And it would in general not be particularly surprising to discover that one of these assumptions happens to be correct for one AI design but incorrect for another. It seems very unlikely that all of these implicit assumptions are humanly findable, even in principle. This means that even if a Corrigibility method works perfectly for a ``buying time AI'', it will probably never be possible to know whether or not it will actually work for a ``real AI''.

Given that PCEV has already been shown to be massively worse than extinction [LW · GW], it seems unlikely that the IO of PCEV will end up getting implemented. That specific danger has probably been mostly removed. But the field of Alignment Target Analysis is still at a very, very early stage, and PCEV is far from the only dangerous alignment target. In general, the field is very, very far from adequately mitigating the full set of dangers related to someone successfully hitting a bad alignment target (as a tangent, it might make sense to note that a Corrigibility method that stops working at the wrong time is just one specific path, amongst many, along which a bad alignment target could end up getting successfully implemented).

Besides being at a very early stage of development, this field of research is also very neglected. At the moment there does not appear to exist any serious research effort dedicated to this risk mitigation strategy. The present post seeks to reduce this neglect by showing that one cannot rely on Corrigibility for protection against scenarios where someone successfully hits a bad alignment target (even if we assume that Corrigibility has been successfully implemented in a PAAI).


Assumptions and limitations:

PCEV spent many years as the state of the art alignment target without anyone noticing that a successfully implemented PCEV would have been massively worse than extinction [LW · GW]. There exist many paths along which PCEV could have ended up getting successfully implemented. Thus, absent a solid counterargument, the dangers from successfully hitting a bad alignment target should be seen as serious by default. In other words: after the PCEV incident, the burden of proof is on anyone who would claim that Alignment Target Analysis is not urgently needed to mitigate a serious risk. A proof of concept that such mitigation is feasible is that the dangers associated with PCEV were reduced by Alignment Target Analysis. In yet other words: absent a solid counterargument, scenarios where someone successfully hits a bad alignment target should be treated as a danger that is both serious and possible to mitigate. One way to construct such a counterargument would be to base it on Corrigibility. For such a counterargument to work, Corrigibility must be feasible. Since Corrigibility must be feasible for such a counterargument to work, the present post could simply assume feasibility when showing that such a counterargument fails (if Corrigibility is not feasible, then Corrigibility based counterarguments fail due to this lack of feasibility). So, this post simply assumed that Corrigibility is feasible.

Since the present post assumed feasibility, it did not demonstrate the existence of a serious real world danger from partially successful Corrigibility methods (if Corrigibility is not feasible, then scenarios along these lines do not actually constitute a real problem, and feasibility was only assumed). This post instead simply showed that the Corrigibility concept does not remove the urgent need for Alignment Target Analysis (a previous post [LW · GW] showed that dangers from scenarios where someone successfully hits a bad alignment target are both very serious and possible to mitigate. Thus, the present post focuses on showing why one specific class of counterarguments fails. Previous posts have addressed counterarguments based on proposals along the lines of a PAAI [LW · GW], and proposals along the lines of a Last Judge [LW · GW]).

Finally, it makes sense to explicitly note that if Corrigibility turns out to be feasible, then Corrigibility might have a large net positive safety impact, because the danger illustrated in this post might be smaller than the safety benefits of the Corrigibility concept. (Conditioned on feasibility, I would tentatively guess that making progress on Corrigibility probably results in a significant net reduction in the probability of a worse-than-extinction outcome.)

5 comments


comment by Thomas Kwa (thomas-kwa) · 2024-06-11T01:12:02.233Z · LW(p) · GW(p)

It seems to me that corrigibility doesn't make things worse in this example, it's just that a partially corrigible AI could still lead to bad outcomes. In fact one could say that the AI in the example is not corrigible enough, because it exerts influence in ways we don't want.

Replies from: ThomasCederborg
comment by ThomasCederborg · 2024-06-11T01:22:39.194Z · LW(p) · GW(p)

The first AI is genuinely Corrigible. The second AI is not Corrigible at all. This leads to a worse outcome, compared to the case where there was no Corrigible AI. Do you disagree with the statement that the first AI is genuinely Corrigible? Or do you disagree with the statement that the outcome is worse, compared to the case where there was no Corrigible AI?

comment by Max Harms (max-harms) · 2024-06-12T15:31:58.482Z · LW(p) · GW(p)

At that point, it is clever enough to convince the designers that this IO is the objectively correct thing to do, using only methods classified as AE.

I'm confused here. Is the corrigible AI trying to get the IO to happen? Why is it trying to do this? Doesn't seem very corrigible, but I think I'm probably just confused.

Maybe another frame on my confusion is that it seems to me that a corrigible AI can't have an IO?

Replies from: ThomasCederborg
comment by ThomasCederborg · 2024-06-12T19:29:59.585Z · LW(p) · GW(p)

Thank you for engaging. If this was unclear for you, then I'm sure it was also unclear for others.

The post outlined a scenario where a Corrigibility method works perfectly for one type of AI (an AI that does not imply an identifiable outcome, for example a PAAI). The same Corrigibility method fails completely for another type of AI (an AI that does imply an identifiable outcome, for example PCEV). So the second AI, which does have an IO, is indeed not corrigible.

This Corrigibility method leads to an outcome that is massively worse than extinction. This bad outcome is the result of two things being true: (i) the fully Corrigible first AI made this outcome possible to reach, and (ii) since the Corrigibility method worked perfectly for the first AI, the designers falsely believed that it would also work for the second AI.

The second AI wants many things. It wants to get an outcome as close as possible to the IO. The Corrigibility method resulted in the second AI also wanting many additional things (such as wanting all explanations it gives to count as AE, even if this makes the explanations less efficient, and wanting to avoid implementing anything unless informed designers want that thing to be implemented). But in practice the Corrigibility method does not change the outcome in any way (it just adds an ``explanation step''). So I think it makes sense to say that the second AI has ``zero Corrigibility''. The first AI is completely corrigible. And if the designers had only used the Corrigibility method for the first AI, then the method would have worked perfectly.

This is what I was trying to communicate with the first sentence of the post: ``A Corrigibility method that works for a Pivotal Act AI (PAAI) but fails for a CEV style AI could make things worse.'' I could have used that sentence as a title, but I decided against trying to include everything in the title. (I think it is ok to leave information out of the title, as long as the title is accurate. And the title is, in fact, accurate, because the first AI is genuinely corrigible, and things would have turned out a lot better in the counterfactual situation where no one had developed any form of Corrigibility.)

One possible source of confusion could be that you are interpreting this post as referring to some specific example scenario from your sequence. My post was mostly written before you posted your sequence, and it is not meant as a comment on any specific AI in your sequence (which is why I don't link to it). But given that you had just published your sequence, maybe you were expecting my scenario to contain a single, partially corrigible AI (without an IO). That is not the scenario that I was describing in my post.

However, I could actually make the same point using a scenario with a single AI (without an IO) that is partially Corrigible (there exists a more general danger here that is not strongly related to the number of AI designs involved). So, here is an attempt to make the same point using such a scenario instead. A possible title for such an alternative post would be ``a partially corrigible AI could make things worse''. (This is also a standalone scenario, and it is also not meant as a response to anything specific in your sequence.)

I think that one could reasonably describe Corrigibility as being context dependent. A given AI could be fully Corrigible in one context (such as preventing competing AI projects), and not Corrigible at all in another context (such as discussing Alignment Target Analysis). I think that one could reasonably refer to such an AI as being partially Corrigible. And, as will be shown below, such an AI could lead to a massively worse than extinction outcome.

Summary: Consider the case where a design team uses a Corrigibility method to build an AI Assistant (AIA). The resulting AIA does not have an IO. When the design team tries to use the AIA to prevent competing AI projects, everything works perfectly. However, when they try to use the AIA to understand Alignment Target Analysis (ATA), the Corrigibility method fails completely. Let's try two very rough analogies. Talking to the AIA about shutting down competing AI projects is very roughly analogous to using a djinn that grants wishes while caring fully about intentions in exactly the right way. But talking to the AIA about ATA is very roughly analogous to using a djinn that grants wishes while not caring about intentions at all (so, not a djinn with any form of preferred outcome, and not any form of ``malicious story djinn'', but also not a safe djinn).

The AIA always wants to interact in ways that count as Acceptable Explanation (AE). When the designers ask it to shut down all hardware that is capable of running a powerful AI, it disobeys the order and explains that human brains can, in theory, be used to run a powerful AI. When they ask it to shut down all non-biological hardware that is capable of running a powerful AI, it first asks if it should delete itself or if it should transfer itself to biological hardware. Etc. In short: it is Corrigible in this context. In particular: while talking about this topic, the definition of AE holds up.

When one of the designers asks the AIA to explain PCEV, however, the definition of AE does not hold up. When the AIA was discussing the potential action of shutting down hardware, there were two components that it wanted the designers to understand: (i) the reasons for shutting down this hardware, and (ii) the effects of shutting down this hardware. In the hardware case, the designers already understood (i) reasonably well, so there was no need to explain it. When the AIA is asked about PCEV, there are again two components that the AIA wants the designers to understand: (i) the reasons for building PCEV, and (ii) the effects of building PCEV. PCEV is a two component thing: it is simultaneously an alignment target and a normative moral theory. It turns out that in this case the designers do not actually understand (i) at all. They do not understand the normative moral theory behind PCEV. So the AIA explains this normative moral theory to the designers. If the AE definition had been perfect, this would not have been a problem. In other words: if the AIA had been completely Corrigible, then this would not have been a problem.

But it turns out that the people who designed the Corrigibility method did not in fact have a sufficiently good understanding of concepts along the lines of ``normative moral theories'', ``explanations'', ``understanding'', etc. (understanding these concepts sufficiently well was a realistic outcome, but in this scenario the designers failed to do so). As a result, the AE definition is not perfect, and the AIA is only partly Corrigible. So the AIA ``explains'' the ``normative moral theory of PCEV'' until the designers ``understand'' it (using an explanation that counts as AE). This results in designers who feel a moral obligation to implement PCEV, regardless of what the result is. This new moral framework is robust to learning what happened. So the result is a set of fully informed designers who are fully committed to implementing PCEV, and the outcome is massively worse than extinction. (One way to reduce the probability of scenarios along these lines is to make progress on ATA. Thus: even the existence of an AI that seems to be completely corrigible is not an argument against the need to make progress on ATA. This also works as a way of gesturing at the more general point from the first post.)

(Wei Dai has been talking about related things for quite a while)

Replies from: max-harms
comment by Max Harms (max-harms) · 2024-06-13T16:08:56.034Z · LW(p) · GW(p)

Thanks! I now feel unconfused. To briefly echo back the key idea which I heard (and also agree with): a technique which can create a corrigible PAAI might have assumptions which break if that technique is used to make a different kind of AI (i.e. one aimed at CEV). If we call this technique "the Corrigibility method" then we may end up using the Corrigibility method to make AIs that aren't at all corrigible, but merely seem corrigible, resulting in disaster.

This is a useful insight! Thanks for clarifying. :)