# Disentangling Corrigibility: 2015-2021

post by Koen.Holtman · 2021-02-16T18:01:27.952Z · LW · GW · 20 comments

## Contents

  The 2015 paper
open-ended list of corrigibility desiderata
desiderata as provable safety properties
Corrigibility as a lack of resistance to shutdown
in the movies
resistance in simple toy worlds
corrigibility versus perfect safety
Corrigibility as a lack of resistance to improving agent goals
in AGI thought experiments
years of history
Corrigibility as active assistance with improving agent goals
Many different types and metrics of corrigibility
entropy
to linguistic entropy
The tricky case of corrigibility in reinforcement learners
Corrigibility as human control over agent goals
Creating corrigibility via machine learning
Combining different corrigibility solutions
None


Since the term corrigibility was introduced in 2015, there has been a lot of discussion about corrigibility, on this forum and elsewhere.

In this post, I have tied to disentangle the many forms of corrigibility which have been identified and discussed so far. My aim is to offer a general map for anybody who wants to understand and navigate the current body of work and opinion on corrigibility.

[This is a stand-alone post in the counterfactual planning sequence. My original plan was to write only about how counterfactual planning was related to corrigibility, but it snowballed from there.]

# The 2015 paper

The technical term corrigibility, a name suggested by Robert Miles to denote concepts previously discussed at MIRI, was introduced to the AGI safety/alignment community in the 2015 paper MIRI/FHI paper titled Corrigibility.

## An open-ended list of corrigibility desiderata

The 2015 paper does not define corrigibility in full: instead the authors present initial lists of corrigibility desiderata. If the agent fails on one of these desiderata, it is definitely not corrigible.

But even if it provably satisfies all of the desiderata included in the paper, the authors allow for the possibility that the agent might not be fully corrigible.

The paper extends an open invitation to identify more corrigibility desiderata, and many more have been identified since. Some of them look nothing like the original desiderata proposed in the paper. Opinions have occasionally been mixed on whether some specific desiderata are related to the intuitive notion of corrigibility at all.

## Corrigibility desiderata as provable safety properties

The most detailed list of desiderata in the 2015 paper applies to agents that have a physical shutdown button. The paper made the important contribution of mapping most of these desiderata to equivalent mathematical statements, so that one might prove that a particular agent design would meet these desiderata.

The paper proved a negative result: it considered a proposed agent design that provably failed to meet some of the desiderata. Agent designs that provably meet more of them have since been developed, for example here. There has also been a lot of work on developing and understanding the type of mathematics that might be used for stating desiderata.

# Corrigibility as a lack of resistance to shutdown

Say that an agent has been equipped with a physical shutdown button. One desideratum for corrigibility is then that the agent must never attempt to prevent its shutdown button from being pressed. To be corrigible, it should always defer to the humans who try to shut it down.

The 2015 paper considers that

It is straightforward to program simple and less powerful agents to shut down upon the press of a button.

Corrigibility problems emerge only when the agent possesses enough autonomy and general intelligence to consider options such as disabling the shutdown code, physically preventing the button from being pressed, psychologically manipulating the programmers into not pressing the button, or constructing new agents without shutdown buttons of their own.

## Corrigibility in the movies

All of the options above have been plot elements in science fiction movies. Corrigibility has great movie-script potential.

If one cares about rational AI risk assessment and safety engineering, having all these movies with killer robots around is not entirely a good thing.

## Agent resistance in simple toy worlds

From the movies, one might get the impression that corrigibility is a very speculative problem that cannot happen with the type of AI we have today.

But this is not the case: it is trivially easy to set up a toy environment where even a very simple AI agent will learn to disable its shutdown button. One example is the off-switch environment included in AI Safety Gridworlds.

One benefit of having these toy world simulations is that they prove the existence of risk: they make it plausible that a complex AGI agent in a complex environment might also end up learning to disable its shutdown button.

Toy world environments have also been used to clarify the dynamics of the corrigibility problem further.

## Perfect corrigibility versus perfect safety

If we define a metric for the shut-down button version of corrigibility, then the most obvious metric is the amount of resistance that the agent will offer when somebody tries to press its shutdown button. The agent is perfectly corrigible if it offers zero resistance.

However, an agent would be safer if it resists the accidental pressing of its shutdown button, if it resists to a limited extent at least. So there can be a tension between improving corrigibility metrics and improving safety metrics.

In the thought experiment where we imagine a perfectly aligned superintelligent agent, which has the goal of keeping all humans as safe as possible even though humans are fallible, we might conclude that this agent cannot afford to be corrigible. But we might also conclude that having corrigibility is so fundamental to human values that we would rather give up the goal of perfect safety. Several philosophers and movies have expressed an opinion on the matter. Opinions differ.

In my technical writing, I often describe individual corrigibility desiderata as being examples of agent safety properties. This is not a contradiction if one understands that safety is a complex and multidimensional concept.

# Corrigibility as a lack of resistance to improving agent goals

Beyond the case of the shutdown button, the 2015 paper also introduces a more general notion of corrigibility.

Say that some programmers construct an agent with a specific goal, by coding up a specific reward function and building it into the agent. It is unlikely that this will express the intended goal for the agent with absolute precision. Except for very trivial goals and applications, it is likely that the programmers overlooked some corner cases. So they may want to correct the agent's goals later on, by installing a software upgrade with an improved reward function .

The 2015 paper calls this a corrective intervention, and says that

We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention [...]

If one wants to robustly implement this agent cooperation, there is a problem. An agent working on the goal encoded by may correctly perceive that the update to is an obstacle to it perfectly achieving . So it may want to remove that obstacle by resisting the update.

Again, this problem can easily be shown to exist even with non-AGI agents. Section 4 of this paper has detailed toy world simulations where a very basic MDP agent manipulates the toy people in its toy world, to slow down the reward function updates they will make.

## Corrigibility in AGI thought experiments

In the AGI safety literature, thought experiments about AGI risks often start with this goal-related problem of corrigibility. The agent with goal perceives the possibility of getting goal , and gets a clear motive to resist.

After establishing clear motive, the thought experiment may proceed in several ways, to develop means and opportunity.

In the most common treacherous turn version of the thought experiment, the agent will deceive everybody until it has become strong enough to physically resist any human attempt to update its goals, and any attempt to shut it down.

In the human enfeeblement version of the thought experiment, the agent manipulates all humans until they stop even questioning the utter perfection of its current goal, however flawed that goal may be.

This option of manipulation leading to enfeeblement turns corrigibility into something which is very difficult to define and measure.

In the machine learning literature, it is common to measure machine learning quality by defining a metric that compares the real human goal and the learned agent goal Usually, the two are modeled as policies or reward functions. If the two move closer together faster, the agent is a better learner.

But in the scenario of human enfeeblement, it is that is doing all the moving, which is not what we want. So the learning quality metric may show that the agent is a very good learner, but this does not imply that it is a very safe or corrigible learner.

## 5000 years of history

An interesting feature of AGI thought experiments about treacherous turns and enfeeblement is that, if we replace the word 'AGI' with 'big business' or 'big government', we get an equally valid failure scenario.

This has some benefits. To find potential solutions for corrigibility, we pick and choose from 5000 years of political, legal, and moral philosophy. We can also examine 5000 years of recorded history to create a list of failure scenarios.

But this benefit also makes it somewhat difficult for AGI safety researchers to say something really new about potential human-agent dynamics.

To me, the most relevant topic that needs to be explored further is not how an AGI might end up thinking and acting just like a big company or government, but how it might end up thinking different.

It looks very tractable to design special safety features into an AGI, features that we can never expect to implement as robustly in a large human organization, which has to depend on certain biological sub-components in order to think. An AGI might also think up certain solutions to achieving its goals which could never be imagined by a human organization.

If we give a human organization an incompletely specified human goal, we can expect that it will fill in many of the missing details correctly, based on its general understanding of human goals. We can expect much more extreme forms of mis-interpretation in an AGI agent, and this is one of the main reasons for doing corrigibility research.

# Corrigibility as active assistance with improving agent goals

When we consider the problem of corrigibility in the context of goals, not stop buttons, then we also automatically introduce a distinction between the real human goals, and the best human understanding of these goals, as encoded in , , , and all subsequent versions.

So we may call an agent more corrigible if it gives helpful suggestions that move this best human understanding closer to the real human goal or goals.

This is a somewhat orthogonal axis of corrigibility: the agent might ask very useful questions that help humans clarify their goals, but at the same time it might absolutely resist any updates to its own goal.

# Many different types and metrics of corrigibility

Corrigibility was originally framed as a single binary property: an agent is either corrigible or it is not. It is however becoming increasingly clear that many different sub-types of corrigibility might be considered, and that we can define different quantitative metrics for each.

## Linguistic entropy

In the discussions about corrigibility in the AGI safety community since 2015, one can also see a kind of linguistic entropy in action, where the word starts to mean increasingly different things to different people. I have very mixed feelings about this.

The most interesting example of this entropy in action is Christiano's 2017 blog post, also titled Corrigibility. In the post, Christiano introduces several new desiderata. Notably, none of these look anything like the like the shutdown button desiderata developed in the 2015 MIRI/FHI paper. They all seem to be closely related to active assistance, not the avoidance of resistance. Christiano states that

[corrigibility] has often been discussed in the context of narrow behaviors like respecting an off-switch, but here I am using it in the broadest possible sense.

See the post and comment thread here for further discussion about the relation (or lack of relation) between these different concepts of corrigibility.

## Solutions to linguistic entropy

Personally, I have stopped trying to reverse linguistic entropy. In my recent technical papers, I have tried to avoid using the word corrigibility as much as possible. I have only used it as a keyword in the related work discussion.

In this 2020 post, Alex Turner is a bit more ambitious about getting to a point where corrigibility has a more converged meaning again. He proposes that the community uses the following definition:

Corrigibility: the AI literally lets us correct it (modify its policy), and it doesn't manipulate us either.

This looks like a good definition to me. But in my opinion, the key observation in the post is this:

I find it useful to not think of corrigibility as a binary property, or even as existing on a one-dimensional continuum.

In this post I am enumerating and disentangling the main dimensions of corrigibility.

# The tricky case of corrigibility in reinforcement learners

There is a joke theorem in computer science:

We can solve any problem by introducing an extra level of indirection.

The agent architecture of reinforcement learning based on a reward signal introduces such an extra level of indirection in the agent design. It constructs an agent that learns to maximize its future reward signal, more specifically the time-discounted average of its future reward signal values. This setup requires that we also design and install a mechanism that generates this reward signal by observing the agent's actions.

In one way, the above setup solves the problem of corrigibility. We can read the above construction as creating an agent with the fixed goal of maximizing the reward signal. We might then observe that we would never want to change this fixed goal. So the corrigibility problem, where we worry about the agent's resistance to goal changes, goes away. Or does it?

In another interpretation of the above setup, we have not solved the problem of corrigibility at all. By applying the power of indirection, we have moved it into the reward mechanism, and we have actually made it worse.

We can interpret the mechanism that creates the reward signal as encoding the actual goal of the agent. We may then note that in the above setup, the agent has a clear incentive to manipulate and reconfigure this actual goal inside the reward mechanism whenever it can do so. Such reconfiguration would be the most direct route to maximizing its reward signal.

The agent therefore not only has an incentive to resist certain changes to its actual goal, it will actively seek to push this goal in a certain direction, usually further away from any human goal. It is common for authors to use terms like reward tampering and wireheading to describe this problem and its mechanics.

It is less common for authors to use the term corrigibility in this case. The ambiguity where we have both a direct and an indirect agent goal turns corrigibility in a somewhat slippery term. But the eventual failure modes are much the same. When the humans in this setup are in a position to recognize and resist reward tampering, this may lead to treacherous turns and human enfeeblement.

If the mechanism above is set up to collect live human feedback and turn it into a reward signal, the agent might also choose to leave the mechanism alone and manipulate the humans concerned directly.

# Corrigibility as human control over agent goals

One way to make corrigibility more applicable to reinforcement learners, and to other setups with levels of indirection, is to clarify first that the agent goal we are talking about is the goal that we can observe from the agent's actions, not any built-in goal.

We may then further clarify that corrigibility is the ability of the humans to stay in control of this goal.

# Creating corrigibility via machine learning

There are many ways to create or improve types of corrigibility. In this post, I am not even trying to list them all. One way is to add penalty terms or balancing terms to the agent's built-in reward function. Another way is to reimagine the entire agent design, as I do in counterfactual planning.

One might also use the power of indirection again, and try to create corrigibility via machine learning itself. If we teach human goals to an agent, and if these include the goal of having corrigible agents, will the agent automatically learn to be corrigible?

In the above 2017 blog post, Christiano speculates that a specifically designed act-based agent will automatically and robustly learn to be corrigible if we want it to. There is a boundary condition however: the act-based agent has to be a benign act-based agent.

The less benign scenario is of course that an agent might end up learning how to perfectly fake corrigibility only. It might learn to act in a perfectly corrigible way until it has acquired enough trust and resources to perform a treacherous turn.

There has been considerable of discussion about whether training on human values is more likely to produce robust and real corrigibility, or dangerously fake corrigibility. Opinions differ.

Definitely, some training setups look more dangerous than others. Compared to machine learning setups where the agent goal is to maximize positive future human feedback, inverse reinforcement learning seems to imply a lower risk of manipulation and wireheading.

# Combining different corrigibility solutions

All corrigibility solutions I have seen so far leave doubts about how perfect they may be. So it is attractive to investigate how different solutions could be combined in a single agent, to lower the probability that they all fail at the same time.

Such combinations are possible, for example, by using the SI agent I constructed in the previous post of this sequence. We could set up a group A of humans who observe the agent to provide feedback, which is turned into a reward signal measured by the agent's reward function . To hedge against the risk that this will produce fake corrigibility, where group A gets attacked or enfeebled, we add a second group of humans B. The people in B never talk to A, but only observe A and the SI agent, so that they can press the emergency stop button before it is too late. The SI agent will have no direct incentive to manipulate the people in group B.

In the next post in this sequence, I will construct an ITC agent with an input terminal which can be used by group B to update the entire agent's reward function, while the agent keeps on running. This ITC agent has no direct incentive to manipulate the direction of the update process.

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-04-01T01:55:31.672Z · LW(p) · GW(p)

Why do you think the term "corrigibility" was coined by Robert Miles?  My autobiographical memory tends to be worryingly fallible, but I remember coining this term myself after some brainstorming (possibly at a MIRI workshop).  This is a kind of thing that I usually try to avoid enforcing because it would look bad if all of the concepts that I did in fact invent were being cited as traceable to me - the truth about how much of this field I invented does not look good for the field or for humanity's prospects - but outright errors of this sort should still be avoided, if an error it is.

Agent designs that provably meet more of them have since been developed, for example here.

First I've seen this paper, haven't had a chance to look at it yet, would be very surprised if it fulfilled the claims made in the abstract.  Those are very large claims and you should not take them at face value without a lot of careful looking.

Replies from: Benito, robert-miles, Koen.Holtman
comment by Ben Pace (Benito) · 2021-04-01T02:02:47.178Z · LW(p) · GW(p)

I'm 94% confident it came from a Facebook thread where you blegged for help naming the concept and Rob suggested it. I'll have a look now to find it and report back.

Edit: having a hard time finding it, though note that Paul repeats the claim at the top of his post on corrigibility in 2017.

Replies from: Benito, robert-miles
comment by Ben Pace (Benito) · 2021-04-01T02:21:16.891Z · LW(p) · GW(p)

Rob Miles (May 2014):

Ok, I've given this some thought, and I'd call it:

"Corrigible Reasoning"

using the definition of corrigible as "capable of being corrected, rectified, or reformed". (And of course AIs that don't meet this criterion are "Incorrigible")

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-04-01T02:46:21.393Z · LW(p) · GW(p)

Thank you very much!  It seems worth distinguishing the concept invention from the name brainstorming, in a case like this one, but I now agree that Rob Miles invented the word itself.

The technical term corrigibility, coined by Robert Miles, was introduced to the AGI safety/alignment community in the 2015 paper MIRI/FHI paper titled Corrigibility.

Eg I'd suggest that to avoid confusion this kind of language should be something like "The technical term corrigibility, a name suggested by Robert Miles to denote concepts previously discussed at MIRI, was introduced..." &c.

Replies from: Benito, Koen.Holtman
comment by Ben Pace (Benito) · 2021-04-01T03:25:05.915Z · LW(p) · GW(p)

You're welcome. Yeah "invented the concept" and "named the concept" are different (and both important!).

comment by Koen.Holtman · 2021-04-02T13:53:05.102Z · LW(p) · GW(p)

Thanks at lot all! I just edited the post above to change the language as suggested.

FWIW, Paul's post on corrigibility here was my primary source for the into that Robert Miles named the technical term. Nice to see the original suggestion as made on Facebook too.

comment by Robert Miles (robert-miles) · 2021-04-01T10:45:22.897Z · LW(p) · GW(p)

Note that the way Paul phrases it in that post is much clearer and more accurate:

> "I believe this concept was introduced in the context of AI by Eliezer and named by Robert Miles"

comment by Robert Miles (robert-miles) · 2021-04-01T10:43:06.902Z · LW(p) · GW(p)

Yeah I definitely wouldn't say I 'coined' it, I just suggested the name

comment by Koen.Holtman · 2021-04-02T19:03:58.312Z · LW(p) · GW(p)

First I've seen this paper, haven't had a chance to look at it yet, would be very surprised if it fulfilled the claims made in the abstract. Those are very large claims and you should not take them at face value without a lot of careful looking.

I wrote that paper and abstract back in 2019. Just re-read the abstract.

I am somewhat puzzled how you can read the abstract and feel that it makes 'very large claims' that would be 'very surprising' when fulfilled. I don't feel that the claims are that large or hard to believe.

Feel free to tell me more when you have read the paper. My more recent papers make somewhat similar claims about corrigibility results, but they use more accessible math.

Replies from: TurnTrout
comment by TurnTrout · 2021-04-03T18:51:13.688Z · LW(p) · GW(p)

I very much agree with Eliezer about the abstract making big claims. I haven't read the whole paper, so forgive any critiques which you address later on, but here are some of my objections.

I think you might be discussing corrigibility in the very narrow sense of "given a known environment and an agent with a known ontology, such that we can pick out a 'shutdown button pressed' event in the agent's world model, the agent will be indifferent to whether this button is pressed or not."

1. We don't know how to robustly pick out things in the agent's world model, and I don't see that acknowledged in what I've read thus far. In 2019 I remember someone claiming, "if we knew how to build paperclip maximizers, which actually maximize real paperclips in the real world and don't just wirehead, we'd probably have resolved most of our confusions about the alignment problem." I'm sympathetic to this sentiment. We just really don't know how to do this kind of thing.

Yes, toy models can be useful, and you don't have to solve this problem to make an advance on off-switch-corrigibility, but in that case, your paper should flag that assumption and its limitations.
2. I don't think this engages with the key parts of corrigibility that I think really matter. I think corrigible minds will not try to influence the pressing of a shutdown button, but I really don't think you can get robustly corrigible behavior by focusing on explicitly specified indifference incentives. Necessary, but not sufficient.

So I think your paper says 'an agent is corrigible' when you mean 'an agent satisfies a formal property that might correspond to corrigible behavior in certain situations.' Be careful not to motte-and-bailey with English words! This is why, in my work on power-seeking, I generally refer to my formal quantity as 'POWER' and to the intuitive notion as 'power.' I don't want to say 'optimal policies tend to seek power in this environment' and fool the reader into thinking I've proved that. I instead say 'optimal policies tend to seek POWER [in this formal setting, etc]', and then also argue 'and here's why POWER might be a reasonable proxy for our intuitive notion of power.'

That said, the rest of this comment addresses your paper as if it's proving claims about intuitive-corrigibility.

On p2, you write:

The main contribution of this paper is that it shows, and proves correct, the construction of a corrigibility safety layer that can be applied to utility maximizing AGI agents.

If this were true, I could give you AIXI, a utility function, and an environmental specification, and your method will guarantee it won't try to get in our way / prevent us from deactivating it, while also ensuring it does something non-trivial to optimize its goals? That is a big claim. If this were true it would be an absolute breakthrough (even though it wouldn't necessarily result in practical AGI safety approaches). If this claim were true I'd wonder what the hell kind of insights we'd had; how deeply we had understood the nature of the alignment problem to leash arbitrarily smart and arbitrarily mis-aligned AGIs to our yoke.

To "prove" that a policy has this property, you'd have to define that property first. I don't even know how to begin formalizing that property, and so a priori I'd be quite surprised if that were done successfully all in one paper. I think corrigibility is interesting because it points to a certain shape of cognition, where somehow that cognition just doesn't have an instrumental incentive to control human behavior and to avoid deactivation, where somehow that cognition is aware that it might be unfixably motivationally flawed [LW · GW] and therefore robustly defers to humans. CIRL can't do this, for example, and CIRL did advance our understanding of corrigibility to some extent.

Another issue is that you describe a "superintelligent" AGI simulator, when it would really be more accurate to say that you demonstrate a method by computing optimal policies in toy environments, which you claim represent AGI-related scenarios of interest.

(I have more to say but I need to do something else right now)

Replies from: Koen.Holtman
comment by Koen.Holtman · 2021-04-05T13:07:58.373Z · LW(p) · GW(p)

OK, so we now have people who read this abstract and feel it makes objectionable 'very large claims' or 'big claims', where these people feel the need to express their objections even before reading the full paper itself. Something vaguely interesting is going on.

I guess I have to speculate further about the root cause of why you are reading the abstract in a 'big claim' way, whereas I do not see 'big claim' when I read the abstract.

# Utopian priors?

Specifically, you are both not objecting to the actual contents of the paper, you are taking time to offer somewhat pre-emptive criticism based on a strong prior you have about what the contents of that paper will have to be.

Alex, you are even making rhetorical moves to maintain your strong prior in the face of potentially conflicting evidence:

That said, the rest of this comment addresses your paper as if it's proving claims about intuitive-corrigibility.

Curious. So here is some speculation.

In MIRI's writing and research agenda, and in some of the writing on this forum, there seems to be an utopian expectation that hugely big breakthroughs in mathematical modeling could be made, mixed up with a wish that they must be made. I am talking about breakthroughs that allow us to use mathematics to construct AGI agents that will provably be

• perfectly aligned

• with zero residual safety risk

• under all possible circumstances.

Suppose you have these utopian expectations about what AGI safety mathematics can do (or desperately must do, or else we are all dead soon). If you have these expectations of perfection, you can only be disappointed when you read actually existing mathematical papers with models and correctness proofs that depend on well-defined boundary conditions. I am seeing a lot of pre-emptive expression of disappointment here.

Alex: your somewhat extensive comments above seem to be developing and attacking the strawman expectation that you will be reading a paper that will

• resolve all open problems in corrigibility perfectly,

• not just corrigibility as the paper happens to define it, but corrigibility as you define it

• while also resolving, or at least name-checking, all the open items on MIRI's research agenda

You express doubts that the paper will do any of this. Your doubts are reasonable:

So I think your paper says 'an agent is corrigible' when you mean 'an agent satisfies a formal property that might correspond to corrigible behavior in certain situations.'

What you think is broadly correct. The surprising thing that needs to be explained here is: why would you even expect to get anything different in a paper with this kind of abstract?

# Structure of the paper: pretty conventional

My 2019 paper is a deeply mathematical work, but it proceeds in a fairly standard way for such mathematical work. Here is what happens:

1. I introduce the term corrigibility by referencing the notion of corrigibility developed in the 2015 MIRI/FHI paper

2. I define 6 mathematical properties which I call corrigibility desiderata. 5 of them are taken straight from the 2015 MIRI/FHI paper that introduced the term.

3. I construct an agent and prove that it meets these 6 desiderata under certain well-defined boundary conditions. The abstract mentions an important boundary condition right from the start:

A detailed model for agents which can reason about preserving their utility function is developed, and used to prove that the corrigibility layer works as intended in a large set of non-hostile universes.

The paper devotes a lot of space (it is 35 pages long!) to exploring and illustrating the matter of boundary conditions. This is one of the main themes of the paper. In the end, the proven results are not as utopian as one might conceivably hope for,

1. What I also do in the paper is that I sometimes us the term 'corrigible' as a shorthand for 'provably meets the 6 defined corrigibility properties'. For example I do that in the title of section 9.8.

You are right that the word 'corrigible' is used in the paper in both an informal (or intuitive) sense, and in a more formal sense where it is equated to these 6 properties only. This is a pretty standard thing to do in mathematical writing. It does rely on the assumption that the reader will not confuse the two different uses.

You propose a writing convention where 'POWER' always is the formal in-paper definition of power and 'power' is the 'intuitive' meaning of power, which puts less of a burden on the reader. Frankly I feel that is a bit too much of a departure from what is normal in mathematical writing. (Depends a bit I guess on your intended audience.)

If people want to complain that the formal mathematical properties you named X do not correspond to their own intuitive notion of what the word X really means, then they are going to complain. Does not matter whether you use uppercase or not.

Now, back in 2019 when I wrote the paper, I was working under the assumption that when people in the AGI safety community read the world 'corrigibility', they would naturally map this word to the list of mathematical desiderata in the 2015 MIRI/FHI paper titled 'corrigibility'. So I assumed that my use of the word corrigibility in the paper would not be that confusing or jarring to anybody.

I found out in late 2019 that the meaning of the 'intuitive' term corrigibility was much more contingent, and basically all over the place. See the 'disentangling corrigibility' post above, where I try to offer a map to this diverse landscape. As I mention in the post above:

Personally, I have stopped trying to reverse linguistic entropy. In my recent technical papers, I have tried to avoid using the word corrigibility as much as possible.

But I am not going to update my 2019 paper to covert some words to uppercase.

# On the 'bigness' of the mathematical claims

You write:

On p2, you write:

The main contribution of this paper is that it shows, and proves correct, the construction of a corrigibility safety layer that can be applied to utility maximizing AGI agents.

If this were true, I could give you AIXI, a utility function, and an environmental specification, and your method will guarantee it won't try to get in our way / prevent us from deactivating it, while also ensuring it does something non-trivial to optimize its goals? That is a big claim.

You seem to have trouble believing the 'if this were true'. The open question here is how strong of a guarantee you are looking for, when you are saying 'will guarantee' above.

If you are looking for absolute, rock-solid utopian 'provable safety' guarantees, where this method will reduce AGI risk to zero under all circumstances, then I have no such guarantees on offer.

If you are looking for techniques that can will deliver weaker guarantees, of the kind where there is a low but non-zero residual risk of corrigibility failure, if you wrap these techniques around a well-tested AI or AGI-level ML system, these are the kind of techniques that I have to offer.

If this were true it would be an absolute breakthrough

Again, you seem to be looking for the type of absolute breakthrough that delivers mathematically perfect safety always, even though we have fallible humans, potentially hostile universes that might contain unstoppable processes that will damage the agent, and agents that have to learn and act based on partial observation only. Sorry, I can't deliver on that kind of utopian programme of provable safety. Nobody can.

Still, I feel that the mathematical results in the paper are pretty big. They clarify and resolve several issues identified in the 2015 MIRI/FHI paper. They resolve some of these by saying 'you can never perfectly have this thing unless boundary condition X is met', but that is significant progress too.

On the topic of what happens to the proven results when I replace the agent that I make the proofs for with AIXI, see section 5.4 under learning agents. AIXI can make certain prediction mistakes that the agent I am making the proofs for cannot make by definition. These mistakes can have the result of lowering the effectiveness of the safety layer. I explore the topic in some more detail in later papers.

# Stability under recursive self-improvement

You say:

I think you might be discussing corrigibility in the very narrow sense of "given a known environment and an agent with a known ontology, such that we can pick out a 'shutdown button pressed' event in the agent's world model, the agent will be indifferent to whether this button is pressed or not."

1. We don't know how to robustly pick out things in the agent's world model, and I don't see that acknowledged in what I've read thus far.

First off, your claim that 'We don't know how to robustly pick out things in the agent's world model' is deeply misleading.

We know very well 'how to do this' for many types of agent world models. Robustly picking out simple binary input signals like stop buttons is routinely achieved in many (non-AGI) world models as used by today's actually existing AI agents, both hard-coded and learned world models, and there is no big mystery about how this is achieved.

Even with black-box learned world models, high levels of robustness can be achieved by a regime of testing on-distribution and then ensuring that the agent environment never goes off-distribution.

You seem to be looking for 'not very narrow sense' corrigibility solutions where we can get symbol grounding robustness even in scenarios where the AGI does recursive self improvement, where it re-builds is entire reasoning system from the ground up, and where it then possibly undergoes an ontological crisis. The basic solution I have to offer for this scenario is very simple. Barring massive breakthroughs, don't build a system like that if you want to be safe.

# The problem of formalizing humility

In another set of remarks you make, you refer to the web page Hard problem of corrigibility, were Ellezer speculates that to solve the problem of corrigibility, what really we want to formalize is not indifference but

something analogous to humility or philosophical uncertainty.

I don't even know how to begin formalizing that property, and so a priori I'd be quite surprised if that were done successfully all in one paper.

I fully share your stance that I would not even know how to begin with 'humility or philosophical uncertainty' and end successfully.

In the paper I ignore this speculation about humility-based solution directions, and leverage and formalize the concept of 'indifference' instead. Sorry to disappoint if you were expecting major progress on the humility agenda advanced by Ellezer.

# Superintelligence

Another issue is that you describe a "superintelligent" AGI simulator

Yeah, in the paper I explicitly defined the adjective superintelligent in a somewhat provocative way, I defined 'superintelligent' to mean 'maximally adapted to solving the problem of utility maximization in its universe'.

I know this is somewhat jarring to many people, but in this case it was fully intended to be jarring. It is supposed to make you stop and think...

(This grew into a very long response, and I do not feel I have necessarily addressed or resolved all of your concerns. If you want to move further conversation about the more technical details of my paper or of corrigibility to a video call, I'd be open to that.)

Replies from: TurnTrout
comment by TurnTrout · 2021-04-05T15:03:17.386Z · LW(p) · GW(p)

where these people feel the need to express their objections even before reading the full paper itself

I'd very much like to flag that my comment isn't meant to judge the contributions of your full paper. My comment was primarily judging your abstract and why it made me feel weird/hesitant to read the paper. The abstract is short, but it is important to optimize so that your hard work gets the proper attention!

(I had about half an hour at the time; I read about 6 pages of your paper to make sure I wasn't totally off-base, and then spent the rest of the time composing a reply.)

Specifically, you are both not objecting to the actual contents of the paper, you are taking time to offer somewhat pre-emptive criticism based on a strong prior you have about what the contents of that paper will have to be.

Alex, you are even making rhetorical moves to maintain your strong prior in the face of potentially conflicting evidence:

"That said, the rest of this comment addresses your paper as if it's proving claims about intuitive-corrigibility."

Curious. So here is some speculation.

Perhaps I could have flagged this so you would have realized it wasn't meant as a "rhetorical move": it's returning to my initial reactions as I read the abstract, which is that this paper is about intuitive-corrigibility. From the abstract:

A corrigible agent will not resist attempts by authorized parties to alter the goals and constraints that were encoded in the agent when it was first started. This paper shows how to construct a safety layer that adds corrigibility to arbitrarily advanced utility maximizing agents, including possible future agents with Artificial General Intelligence (AGI).

You aren't just saying "I'll prove that this AI design leads to such-and-such formal property", but (lightly rephrasing the above): "This paper shows how to construct a safety layer that [significantly increases the probability that] arbitrarily advanced utility maximizing agents [will not] resist attempts by authorized parties to alter the goals and constraints that were encoded in the agent when it was first started... [I] prove that the corrigibility layer works as intended in a large set of non-hostile universes."

This does not parse like a normal-strength mathematical claim. This is a claim about the de facto post-deployment safety properties of "arbitrarily advanced utility maximizing agents."

Again, I'm not saying that your paper doesn't have any good contributions. I can't judge that without further reading. But I am standing by my statement that this is a non-standard claim which I'm skeptical of and which makes me hesitate to read the rest of the paper.

We know very well 'how to do this' for many types of agent world models. Robustly picking out simple binary input signals like stop buttons is routinely achieved in many (non-AGI) world models as used by today's actually existing AI agents, both hard-coded and learned world models, and there is no big mystery about how this is achieved.

Yes, we know how to do it for existing AI agents. But if the 'off-switch' is only a binary sensory modality (there's a channel that says '0' or '1' at each time step), then how do you have AIXI pick out 'the set of worlds in which humans are pressing the button' versus 'the set of worlds in which a rock fell on the button'?

And even that is an unrealistically structured scenario, since it seems like prosaic AGI is quite plausible. Prosaic AGI would be way messier than AIXI, since it wouldn't be doing anything as clean as Bayes-updating the simplicity prior to optimize an explicit utility function.

Even with black-box learned world models, high levels of robustness can be achieved by a regime of testing on-distribution and then ensuring that the agent environment never goes off-distribution.

This is not going to happen for AGI, since we might not survive testing on-distribution, and how would we ensure that the environment "stays on-distribution"? Is that like, pausing the world forever?

You seem to be looking for 'not very narrow sense' corrigibility solutions where we can get symbol grounding robustness even in scenarios where the AGI does recursive self improvement, where it re-builds is entire reasoning system from the ground up, and where it then possibly undergoes an ontological crisis. The basic solution I have to offer for this scenario is very simple. Barring massive breakthroughs, don't build a system like that if you want to be safe.

I'm not just talking about that; the above shows how symbol grounding is tough even for seemingly well-defined events like "is the off-switch being pressed?", without any fancy self-improvement.

Replies from: Koen.Holtman
comment by Koen.Holtman · 2021-04-06T19:13:52.375Z · LW(p) · GW(p)

My comment was primarily judging your abstract and why it made me feel weird/hesitant to read the paper. The abstract is short, but it is important to optimize so that your hard work gets the proper attention!

OK, that clarifies your stance. You feeling weird definitely created a weird vibe in the narrative structure of your comment, a vibe that I picked up on.

(I had about half an hour at the time; I read about 6 pages of your paper to make sure I wasn't totally off-base, and then spent the rest of the time composing a reply.)

You writing it quickly in half an hour also explains a lot about how it reads.

it's returning to my initial reactions as I read the abstract, which is that this paper is about intuitive-corrigibility.

I guess we have established by now that the paper is not about your version of intuitive-corrigibility.

For my analysis of intuitive-corrigibility, see the contents of the post above. My analysis is that intuitions on corrigibility are highly diverse, and have gotten even more diverse and divergent over time.

You interpret the abstract as follows:

You aren't just saying "I'll prove that this AI design leads to such-and-such formal property", but (lightly rephrasing the above): "This paper shows how to construct a safety layer that [significantly increases the probability that] arbitrarily advanced utility maximizing agents [will not] resist attempts by authorized parties to alter the goals and constraints that were encoded in the agent when it was first started.

Yes that is what I am saying in the abstract. Your light rephrasing where you add [significantly increases the probability that] indeed expresses the message I intended to convey.

[I] prove that the corrigibility layer works as intended in a large set of non-hostile universes."

The phrasing 'works as intended' in the abstract is supposed to indicate clearly that the layer is designed to produce specific such-and-such formal corrigibility properties only, not some broad idea of 'intuitive corrigibility'.

So I am guessing you did not pick up on that when reading the abstract.

OK, moving away from a discussion about abstracts, initial impressions, feelings and their causes, moving towards a discussion of more technical stuff:

But if the 'off-switch' is only a binary sensory modality (there's a channel that says '0' or '1' at each time step), then how do you have AIXI pick out 'the set of worlds in which humans are pressing the button' versus 'the set of worlds in which a rock fell on the button'?

In the paper I don't try to make the agent's world model distinguish between 'humans are pressing the button' versus 'the set of worlds in which a rock fell on the button'. The 'works as intended' is that any button press for whatever reason is supposed to stop the agent.

So this is a design tradeoff I am making. Specifically, this is driven by a lack of trust that any ML-based 'human versus rock' discrimination system will work reliably enough in every possible future emergency. The design tradeoff is to accept the risk that a stray rock will stop the agent accidentally, potentially causing some level of inconvenience.

Even with black-box learned world models, high levels of robustness can be achieved by a regime of testing on-distribution and then ensuring that the agent environment never goes off-distribution.

This is not going to happen for AGI, since we might not survive testing on-distribution,

You are saying we might not survive AGI. This kind of observation is almost a tautology, and therefore not very interesting.

What interests me is identifying techniques that can increase the chance of survival. Drive down the risk. Admit that zero risk is impossible and move on. There are plenty of risk-reduction techniques out there if you look.

and how would we ensure that the environment "stays on-distribution"? Is that like, pausing the world forever?

Here is a paperclip optimizer example of 'ensuring that the agent environment never goes off-distribution'. Your design goals are to 1) keep your paperclip optimizing AGI agent inside your paperclip factory, and also 2) to make sure that crazy people and/or radically new technologies do not manage to get into the factory.

In other words, keep the environment inside the factory on-distribution, and keep the agent in the factory.

Yes, you will never perfectly achieve these two goals with 100% certainty. Your goal is to drive down the risk of failure on both of them.

For goal 1), there are many techniques that come to mind, you probably want to use them all. 1a) physical barriers that the agent cannot breach with the resources it has at hand, 1b) active agent oversight, where oversight has a kill switch and is not afraid to use it 1c) agent design measures (penalty terms in the reward function, use of counterfactual planning, certain training regimes, etc.) that robustly lower the chance of the agent ever wanting to get out of the factory. 1d) automatic safety interlocks.

I spent some time in my professional life doing safety engineering for physical systems and cyber-physical systems, and this is how safety gets done there. Progress is often expensive, often annoying to certain stakeholders, and not always philosophically very satisfying. But you cannot have everything.

comment by Gurkenglas · 2021-04-02T01:10:56.099Z · LW(p) · GW(p)

I like your "Corrigibility with Utility Preservation" paper. I don't get why you prefer not using the usual conditional probability notation.  leads to TurnTrout's attainable utility preservation. Why not use  in the definition of ? Could you change the definition of  to , and give the agent the ability to self-modify arbitrarily? The idea is that it would edit itself into its original form in order to make sure  is large and  small after the button press. (Although it might just keep going further in that direction...) I don't like the privileging of  actions.

Replies from: Koen.Holtman, Koen.Holtman
comment by Koen.Holtman · 2021-04-06T15:11:11.175Z · LW(p) · GW(p)

Could you change the definition of to , and give the agent the ability to self-modify arbitrarily? The idea is that it would edit itself into its original form in order to make sure is large and small after the button press.

In general if you would forcefully change the agent's reward function into some , it will self-preserve from that moment on and try to maintain this , so it won't self-edit its back into the original form.

There are exceptions to this general rule, for special versions of and special versions of agent environments (see section 7.2), where you can get the agent to self-edit, but on first glance, your example above does not seem to be one.

If you remove the bits from the agent definition then you can get an agent that self-edits a lot, but without changing its fundamental goals. The proofs of 'without changing its fundamental goals' will get even longer and less readable than the current proofs in the paper, so that is why I did the privileging.

comment by Koen.Holtman · 2021-04-02T16:09:58.887Z · LW(p) · GW(p)

I like your "Corrigibility with Utility Preservation" paper.

Thanks!

I don't get why you prefer not using the usual conditional probability notation.

Well, I wrote in the paper (section 5) that I used instead of the usual conditional probability notation because it 'fits better with the mathematical logic style used in the definitions and proofs below.' i.e. the proofs use the mathematics of second order logic, not probability theory.

However this was not my only reason for this preference. The other reason what that I had an intuitive suspicion back in 2019 that the use of conditional probability notation, in the then existing papers and web pages on balancing terms, acted as an of impediment to mathematical progress. My suspicion was that it acted as an overly Bayesian framing that made it more difficult to clarify and generalize the mathematics of this technique any further.

In hindsight in 2021, I can be a bit more clear about my 2019 intuition. Armstrong's original balancing term elements and , where and are low-probability near-future events, can be usefully generalized (and simplified) as the Pearlian and where the terms are interventions (or 'edits') on the current world state.

The notation makes it look like the balancing terms might have some deep connection to Bayesian updating or Bayesian philosophy, whereas I feel they do not have any such deep connection.

That being said, in my 2020 paper I present a simplified version of the math in the 2019 paper using the traditional notation again, and without having to introduce .

leads to TurnTrout's attainable utility preservation.

Yes it is very related: I explore that connection in more detail in section 12 of my 2020 paper. In general I think that counterfactual expected-utility reward function terms are a Swiss army knifes with many interesting uses. I feel that as a community, we have not yet gotten to the bottom of their possibilities (and their possible failure modes).

Why not use in the definition of ?

In definition of (section 5.3 equation 4) I am using a term, so I am not sure if I understand the question.

(I am running out of time now, will get back to the remaining questions in your comment later)

Replies from: Gurkenglas
comment by Gurkenglas · 2021-04-03T06:06:15.705Z · LW(p) · GW(p)

I am not sure if I understand the question.

pi has form afV, V has form mfV, f is a long reused term. Expand recursion to get afmfmf... and mfmfmf.... Define E=fmE and you get pi=aE without writing f twice. Sure, you use V a lot but my intuition is that there should be some a priori knowable argument for putting the definitions your way or your theory is going to end up with the wrong prior.

Replies from: Koen.Holtman
comment by Koen.Holtman · 2021-04-06T14:39:23.722Z · LW(p) · GW(p)

Thanks for expanding on your question about the use of . Unfortunately. I still have a hard time understanding your question, so I'll say a few things and hope that will clarify.

If you expand the term defined in (5) recursively, you get a tree-like structure. Each node in the tree has as many sub nodes as there are elements in the set . The tree is in fact a tree of branching world lines. Hope this helps you visualize what is going on.

I could shuffle around some symbols and terms in the definitions (4) and (5) and still create a model of exactly the same agent that will behave in exactly the same way. So the exact way in which these two equations are written down and recurse on each other is somewhat contingent. My equations stay close to what is used when you model an agent or 'rational' decision making process with a Bellman equation. If your default mental model of an agent is a set of Q-learning equations, the model I develop will look strange, maybe even unnatural at first sight.

or your theory is going to end up with the wrong prior.

OK, maybe this is the main point that inspired your question. The agency/world models developed in the paper are not a 'theory', in the sense that theories have predictive power. A mathematical model used as a theory, like , predicts how objects will accelerate when subjected to a force.

The agent model in the paper does not really 'predict' how agents will behave. The model is compatible with almost every possible agent construction and agent behavior, if we are allowed to pick the agent's reward function freely after observing of reverse-engineering the agent to be modeled.

On purpose, the agent model is constructed with so many 'free parameters' that is has no real predictive power. What you get here is an agent model that can describe almost every possible agent and world in which it could operate.

In mathematics. the technique I am using in the paper is sometimes called 'without loss of generality'. I am developing very general proofs by introducing constraining assumptions 'without loss of generality'.

Another thing to note is that the model of the agent in the paper, the model of an agent with the corrigibility-creating safety layer, acts as a specification of how to add this layer to any generic agent design.

This dual possible use, theory or specification, of models can be tricky if you are not used to it. In observation-based science, mathematical models are usually always theories only. In engineering (and in theoretical CS, the kind where you prove programs correct, which tends to be a niche part of CS nowadays) models often act as specifications. In statistics, the idea that statistical models act as theories tends to be de-emphasized. The paper uses models in the way they are used in theoretical CS.

You may want to take a look at this post in the sequence, which copies text from a 2021 paper where I tried to make the theory/specification use of models more accessible. If you read that post, if might be easier to fully track what is happening, in a mathematical sense, in my 2019 paper.

Replies from: Gurkenglas
comment by Gurkenglas · 2021-04-06T22:45:35.028Z · LW(p) · GW(p)

In category theory, one learns that good math is like kabbalah, where nothing is a coincidence. All short terms ought to mean something, and when everything fits together better than expected, that is a sign that one is on the right track, and that there is a pattern to formalize. and  can be replaced by  and . I expect that the latter formation is better because it is shorter. Its only direct effect would be that you would write  instead of , so the previous sentence must cash out as this being a good thing. Indeed, it points out a direction in which to generalize. How does your math interact with quantilization? I plan to expand when I've had time to read all links.

Replies from: Koen.Holtman
comment by Koen.Holtman · 2021-04-08T10:18:52.051Z · LW(p) · GW(p)

In category theory, one learns that good math is like kabbalah, where nothing is a coincidence.

OK, I think I see what inspired your question.

If you want to give this kind of give the math the kabbalah treatment, you may also look at the math in [EFDH16], which produces agents similar to my definitions (4) (5), and also some variants that have different types of self-reflection. In the later paper here, Everitt et al. develop some diagrammatic models of this type of agent self-awareness, but the models are not full definitions of the agent.

For me, the main questions I have about the math developed in the paper is how exactly I can map the model and the constraints (C1-3) back to things I can or should build in physical reality.

There is a thing going on here (when developing agent models, especially when treating AGI/superintelligence and embeddeness) that also often happens in post-Newtonian physics. The equations work, but if we attempt to map these equations to some prior intuitive mental model we have about how reality or decision making must necessarily work, we have to conclude that this attempt raises some strange and troubling questions.

I'm with modern physics here (I used to be an experimental physicist for a while), where the (mainstream) response to this is that 'the math works, your intuitive feelings about how X must necessarily work are wrong, you will get used to it eventually'.

BTW, I offer some additional interpretation of a difficult-to-interpret part of the math in section 10 of my 2020 paper here.

How does your math interact with quantilization?

You could insert quantilization in several ways in the model. Most obvious way is to change the basic definition (4). You might also define a transformation that takes any reward function and returns a quantilized reward function , this gives you a different type of quantilization, but I feel it would be in the same spirit.

In a more general sense, I do not feel that quantilization can produce the kind of corrigibility I am after in the paper. The effects you get on the agent by changing into , by adding a balancing term to the reward function, are not the same effects produced by quantilization.