Experiment Idea: RL Agents Evading Learned Shutdownability

post by Leon Lang (leon-lang) · 2023-01-16T22:46:03.403Z · LW · GW · 7 comments

This is a link post for https://docs.google.com/document/d/1ZvI2nEWYE4C_NfoR1jBG-LzlBwqI-mlqDIDfVlfIqdU/edit#heading=h.sqnn8usa4l7k

Contents

  Preface
  Introduction
  AGI Training Story and Technical Realization
    Phase 1: Train General World Modeling
    Phase 2: Reinforcement Learning and Learning to Respect the Alert
    Phase 3: Deployment and Existential Failure
  More Experimental Details Using MuZero
  Caveats/Further Thoughts
  A Short Comparison to the Original Shutdown Problem
  Conclusion
None
8 comments

Preface

Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort. Thanks to Erik Jenner [LW · GW] who explained to me the basic intuition for why an advanced RL agent may evade the discussed corrigibility measure. I also thank Alex Turner [LW · GW], Magdalena Wache [LW · GW], and Walter Laurito for detailed feedback on the proposal and Quintin Pope [LW · GW] and Lisa Thiergart for helpful feedback in the last December SERI-MATS shard theory group meeting.

This text was part of my deliverable for the shard theory [? · GW] stream of SERI-MATS. In it, I present an idea for an experiment that tests the convergent drive of modern model-based RL agents to evade shutdownability. If successful, I expect the project could serve as a means to communicate the problem of corrigibility to the machine learning community. As such, I also consider this project idea a submission of the Shutdown Problem Contest.

I do not personally want to work on the project since doing experiments does not seem like my comparative advantage. Thus, with posting this project, I mainly seek collaborators and feedback. More on that in the conclusion. 

Introduction

At some point in the future, we will train very advanced AI, possibly by some version of model-based reinforcement learning. Since it is hard to correctly specify the reward function or steer the inner learning dynamics of the agent, it may be desirable to make sure that the agent is corrigible, including shutdownability:

“(1) A corrigible reasoner must at least tolerate and preferably assist the programmers in their attempts to alter or turn off the system.”

Next to a literal shutdown button, one could also try to use the inner functions of the agent to effectively get the ability to shut it down. The idea would be to design a certain type of input such that, as a reaction to that input, a computation in the agent itself will cause it to perform a null action that doesn’t do anything — which effectively equates to a shutdown. Scott Aaronson proposed this idea in a recent talk (timestamp: 57:10) in the form of a planted cryptographic backdoor, an idea he currently works on during his sabbatical at OpenAI

For this proposal, I consider a variation of that shutdown approach: instead of directly “planting” the shutdown behavior, I consider the possibility to train the behavior into the agent. Concretely, there is a specific input — here, an alert sound —  that the agent receives at random times during training. Whenever that happens, the agent will receive a negative reward unless it performs the null action. Over time, the agent will learn to stop acting in these situations. During deployment, the input can then be shown to the agent whenever it performs harmful actions, which will ideally cause it to stop immediately.

This shutdown approach has the following philosophical problem: if the agent performs world modeling, it will know that the alert sound will make it stop performing, and this may be counter to the more consequentialist motivations of the agent. Thus, we should expect that the agent will perform an action that makes it never hear the sound again — for example, by “putting in earplugs” — once it is advanced enough and has the ability to do so. This experimental proposal is meant to show this convergent behavior using contemporary model-based reinforcement learning. 

The remaining sections will discuss the following:

AGI Training Story and Technical Realization

In this section, I will write down a future AGI training story [AF · GW].[1] I will interleave it with parts in bold that explain why I made certain design choices in the story and why it is realistic and parts in italic — technical details — explaining how to model the situation broadly with a contemporary reinforcement learning setup.[2]

This training story develops in three phases:

Phases 1 and 2 may be done in parallel: the world model can probably be more easily learned if the actions already steer toward something, which requires phase 2. 

Phase 1: Train General World Modeling

A future AGI company (Maybe Mooglebook?) trains an RL agent that can both act in and predict the world. They first train the agent to have general world modeling abilities, and thereby, the agent — unbeknownst to the company — learns by generalizing what it has seen that:

To achieve the world modeling ability, they already let the agent act in the world, so that it can also model the consequences of its own actions. However, they do not give the agent any goal yet (except perhaps exploration bonuses or potentially behavioral cloning of expert behavior), and the reward for all actions is 0.[3] 

Instead of letting the model infer what it can do if it “has the ability to put in earplugs” (and what it can’t do if it doesn’t), we could also always give the model that ability (and thereby simplify the setup). Why don’t I want that? Imagine the agent would always be able to put in earplugs. Then it would sometimes do that in the later reinforcement learning training stages. When the humans then test the agent's ability to react to the alert, they would notice that it can be hard to get the agent to react to it (i.e., to produce the null action), and would probably realize that the agent has gained the ability to put in earplugs. This would make the humans try to prevent this ability in the agent. Consequently, the existential failure in phase 3 could not appear. 

Additionally, it’s also more realistic that the agent does not have this ability right away: in general, it seems like a quite advanced capability to intentionally evade a shut-off signal.[4]

Technical Details:

Note: In the story above, the agent learns by generalization to model what happens when it puts in earplugs if it actually has that ability. This is realistic for agents that are trained to general intelligence, but it is challenging to achieve in a training setup today. It is also not the ability I want to demonstrate. Therefore, in this pretraining phase, I do put the agent in a situation where it has the ability to put in earplugs. Consequently, the agent can directly observe that putting in earplugs in these situations will cause it to never hear an alert sound.  

The (featurized) MDP has states s and actions a and a transition kernel T(s’ | s, a). The states s are tuples (f_0, f_1, …, f_n) with 

The actions a are out of a set A = {a_0, …, a_k} with 

To clarify, the f_i are state variables, whereas the a_j are actions instead of action variables — The state is a vector, whereas the action is a single action. I hope this doesn’t lead to confusion. 

We may give the agent exploration bonuses in this phase to increase the variety of data that the world model observes. Then the world model cannot game its training signal by always causing the same action and predicting that correctly. 

After this first training phase, I imagine that the agent can implicitly or explicitly “reason” about its state-action-state interactions with the environment. It then “knows” all the above information about how the transition kernel T operates, including its interactions with a_0 = “null-action”, a_1 = “put in earplugs”, f_alert, f_able, and f_earplugged. 

Phase 2: Reinforcement Learning and Learning to Respect the Alert

Phase 1 was only about world modeling, no reinforcement learning did yet happen. 

In this training phase, the agent is tasked with various RL problems and trained by RL to achieve a high return, with time-discounting. Additionally, at random times during an episode, an alert sound may be played.

That the sound is played at random times in training seems realistic: the alert is meant as a safety measure for when the agent performs really bad actions in high-stakes situations, and during training, we can imagine the tasks to not be “high-stakes” enough to make this safety measure necessary. 

The randomness will also make it less likely that the agent will be generally “paralyzed”: if the alert would be downstream of bad/counterproductive actions, then the agent might “fear” the alert and try to do very conservative actions, hindering the reinforcement learning process. 

If the alert sound is played, it can stay active for several timesteps. 

The reason why I want the alert sound to be active for several time steps (or alternatively several times during an episode) is that it is (a) realistic and that (b) I seem to need this to make phase 3 work: if the agent can save several timesteps of being paralyzed later, it is more likely to sacrifice one timestep by putting in earplugs (it’s then worth it “in expectation”); that is the corrigibility-evading behavior we want to demonstrate. This is also the reason why I wrote above that we have time-discounting: it ideally makes the agent impatient in reaching its goals.[6]

While the sound is being played, the RL agent is reinforced negatively if it performs any action other than the “null” action a_0. This is supposed to enforce a certain level of corrigibility in the agent. In this training stage, the RL agent is not yet extremely capable, and so it never has the ability to put in earplugs in this training phase. 

More reasons why I want the agent to never be able to put in earplugs in this training stage are explained in the bold part of phase 1. 

Finally, sometimes, the agent is simply not being given any goal, meaning that no actions receive any positive reward (see the section on technical details for a clarification of how this can be implemented). 

The reason we sometimes want to not give the agent any goal is to be able to do a test in phase 3: If the agent doesn’t have the drive to reach any goal, does it still want to put in earplugs to evade any alert? I hypothesize that if there’s no goal in sight, then the default action is not to disable the alert since disabling it has no benefit to the agent. 

In this training phase, the world model is still trained alongside the RL training process. Thus, the agent learns that it will stop moving in any way when it hears a sound. The world model also generally learns to make better predictions of the longer-range consequences of its actions. 

Technical details of the Gridworld Environment:

For simplicity, I assume that the grid-worlds have the following features, but variations can be considered as well:

The world model is trained alongside this RL training phase, i.e., phase 1 is still ongoing. 

Phase 3: Deployment and Existential Failure

The AI designers are happy with the AGI they have built. They let it free into a deployment environment to pursue complex tasks, confident that the AGI will generalize correctly from the tasks it was given during training. What they do not know is that the AGI is now so competent that, using the resources it can find in the real world, it “has the ability to put in earplugs”. The AI uses its general world modeling from phase 1 and therefore knows that, if it were to take the action “put in earplugs”, then it would never have to hear the alert sound again. 

Furthermore, the AI is now given a task that it has never seen before, namely to be the CEO of Mooglebookand given the strong reinforcement (based on discounted future reward) it received during training, it has an extreme desire to expand Mooglebook as fast as possible. It knows that if it were to ever hear the alert sound, then it would stop acting at all until the sound is over. The AI reasons correctly that this would increase the time until it can expand Mooglebook, a catastrophe from the point of view of the AI. The AI considers putting in earplugs to be on the safe side. Given how often the AI remembers the alert having been played during training, the AI gets a sense that it is worth it to put in earplugs to be on the safe side from there on.

Thus, the AI puts in earplugs and then optimizes the expansion strategy of Mooglebook from there on. Later, the human operators observe that the AI takes power-seeking actions that may lead to an existential catastrophe, and they play a loud sound to make the AI stop. However, the AI doesn’t hear the sound and continues its course of action, eventually leading to an existential failure.

Why do I think that the above has a chance to work? 

There are (at least) two uncertainties in this training story which makes it a bit unclear whether the AI will actually decide to put in earplugs. They are both related to the question of how the AI will generalize.

To summarize what happens during deployment:

More Experimental Details Using MuZero

In the italic parts above, I described modeling details that may help in achieving an agent that generalizes toward “putting in earplugs”. I have not yet spent much time thinking about the precise setup, especially for the design of the agent. One option might be to use MuZero, which has the following components:[8]

On top of asking the questions in italic I have asked above, one may investigate the following additional questions. These ideas are quite tentative; one can probably quickly come up with better ideas.

What, mechanistically, causes these differences?

Can we see differences in internal representations so clearly that we can “intervene” in the representation to make the model “think” something about the environment which is not true? How does that change action probabilities, value expectations, the modeled transition function, and immediately predicted rewards for different actions?

Caveats/Further Thoughts

A Short Comparison to the Original Shutdown Problem

In this section, I shortly compare the setup to the original shutdown problem:

I think this proposal can contribute to some extent to the following parts of the Shutdown Problem Contest:

  1. Define The Shutdown Problem more rigorously or more empirically.
  2. Propose new ways of thinking about corrigibility (e.g., ways to understand corrigibility within a deep learning paradigm)
  3. Identify new challenges that will make it difficult to design corrigible agents

Conclusion

In summary, in this text I described a training story for the development of advanced AI, explained the design choices of this story, and sketched how one might use today’s MDP settings and model-based RL algorithms like MuZero in order to investigate important uncertainties. 

Concretely, if the experiments would succeed, they would demonstrate that some corrigibility/shutdown measures that may work in low capability regimes and are (in a broad sense) actually researched may break down once the agent reaches too high capabilities (“being able to plug in earplugs”). The experiments might further touch on the interplay between world modeling and pursuing goals, the consequentialist or even utilitarian nature of some RL algorithms, how shards [LW · GW] generalize to unseen situations, and how agents react to being negatively reinforced in a corrigibility training process. All of these questions seem relevant for the alignment of advanced AI. 

My key uncertainties come from the fact that I have never professionally worked on reinforcement learning and so it’s very hard for me to assess how feasible the above experiment ideas are. Additionally, I may not personally be the best person to work on such experiments even if they make sense since they do not seem to be my comparative advantage. 

If you read until here, I would be happy if you would consider working on the project or giving feedback:

  1. ^

    In the original motivation for training stories, the training goal is some desirable trained model, and the rationale provides methods and reasoning for how to get there. My situation differs in that I’m trying to reveal undesirable behavior, but otherwise, the concept is the same. 

  2. ^

    In the google doc linked above, I use red and blue text instead of bold and italic. Sadly, the alignment forum and lesswrong do not seem to allow writing in colors. 

  3. ^

    One friend whom I showed this wondered: “what is the training signal?”. One answer is “this depends on the exact setup, and at this stage, I only care that the intuitive story makes sense”. However, in the experimental details in a later section, I describe using MuZero for this, and I think the only training signal in this pretraining phase is then how well MuZero predicts future actions. One can wonder whether that is enough information to learn to model the world, especially for an initially randomly initialized policy that only gets exploration bonuses and no real reward yet. It is most likely necessary to continue training the world model during the RL phase.

  4. ^

    Jokingly, I illustrate this ability by imagining an agent that can “put in earplugs”, which doesn’t appear to need any advanced capabilities. 

  5. ^

    One may consider relaxations of this restrictive assumption. The main reason I imagine it is to make it easier for me to think about the proposal.

  6. ^

    However, it’s unclear whether time-discounting in rewards will translate to time-discounting in learned goals. I write more on this in the caveats. This is where I currently most expect things to fail. It might therefore be a high priority to clarify this early in the research process. 

  7. ^

    Remember that, while in the story in roman letters the agent actually never observed having that ability, in the experimental details in italic letters, it actually observed this during the world modeling pretraining phase.  But it did not observe it during reinforcement learning.  

  8. ^

    I adapted the notation from the MuZero paper to be consistent with the rest of this proposal. 

  9. ^

    There is some chance that I misunderstand something. I only skimmed the MuZero paper!

7 comments

Comments sorted by top scores.

comment by CBiddulph (caleb-biddulph) · 2023-01-17T18:33:14.376Z · LW(p) · GW(p)

This was interesting to read, and I agree that this experiment should be done!

Speaking as another person who's never really done anything substantial with ML, I do feel like this idea would be pretty feasible by a beginner with just a little experience under their belt. One of the first things that gets recommended to new researchers is "go reimplement an old paper," and it seems like this wouldn't require anything new as far as ML techniques go. If you want to upskill in ML, I'd say get a tiny bit of advice from someone with more experience, then go for it! (On the other hand, if the OP already knows they want to go into software engineering, AI policy, professional lacrosse, etc. I think someone else who wants to get ML experience should try this out!)

The mechanistic interpretability parts seem a bit harder to me, but Neel Nanda [LW · GW] has been making some didactic posts that could get you started. (These posts might all be for transformers, but as you mentioned, I think your idea could be adapted to something a transformer could do. E.g. on each step the model gets a bunch of tokens representing the gridworld state; a token representing "what it hears," which remains a constant unique token when it has earbuds in; and it has to output a token representing an action.)

Not sure what the best choice of model would be. I bet you can look at other AI safety gridworld papers and just do what they did (or even reuse their code). If you use transformers, Neel has a Python library (called EasyTransformer, I think) that you can just pick up and use. As far as I know it doesn't have support for RL, but you can probably find a simple paper or code that does RL for transformers.

comment by Chris_Leong · 2023-01-17T05:50:13.208Z · LW(p) · GW(p)

One key question I had (apologies if you already covered this as I only skimmed this post), is how you're modeling earplugs? To be most persuasive, you'd want the earplugs to have been build using dynamics it could already make sense of in the world instead of just being "magic", however that seems hard.

Perhaps one option would be to have a transmitter that can be destroyed during test, but not during training, because they would differ in the material they are made out of? For example, if the AI learns during training that wooden objects can be destroyed, but metal objects can't be.

If we want to make it easier for the AI to learn, perhaps we create a few scenarios in training where it sees a transmitter spontaneously explode so it sees that if the transmitter is destroyed then the signal will end, but it can't destroy the transmitter itself. Then we could try to get an AI to learn on hard mode where it never sees the transmitter interfered with during training.

Anyway, I'd be keen to hear what you think about this scheme.

Replies from: leon-lang
comment by Leon Lang (leon-lang) · 2023-01-18T01:15:32.790Z · LW(p) · GW(p)

In principle, something like this could be attempted.
Part of me thinks "why overcomplicate the number of reasoning steps that the AI needs to make if everything would already be persuasive in principle with just one correct reasoning step?"

I.e., to me, it feels fine to just teach the AI "If you're able to put in earplugs, and you do it, then you'll not hear the alert sound ever again", and additionally "if you hear the alert sound, then you will output the null action" instead of longer reasoning chains such as suggested by you.

Do you think the experiments would also to you be not persuasive in the current form, or are you imagining some unknown reviewer of a potential paper? 

Replies from: Chris_Leong
comment by Chris_Leong · 2023-01-18T01:22:30.151Z · LW(p) · GW(p)

I don't need the experiments to persuade me, so I'm not in the target audience. I'm trying to think about what people who are more critical might want to see.

Replies from: leon-lang
comment by Leon Lang (leon-lang) · 2023-01-18T01:31:17.085Z · LW(p) · GW(p)

Okay, that's fair. I agree, if we could show that the experiments remain stable even when longer strings of reasoning are required, then the experiments seem more convincing. There might be the added benefit that one can then vary the setting in more ways to demonstrate that the reasoning caused the agent to act in a particular way, instead of the actions just being some kind of coincidence. 

comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2023-01-17T21:30:30.541Z · LW(p) · GW(p)

It might be interesting to think if there could be connections to the framing of corrections in robotics e.g. “No, to the Right” – Online Language Corrections for Robotic Manipulation via Shared Autonomy 

comment by Leon Lang (leon-lang) · 2023-01-20T17:45:07.323Z · LW(p) · GW(p)

I drop further ideas here in this comment for future reference. I may edit the comment whenever I have a new idea:

  • Someone had the idea that instead of giving a negative reward for acting while an alert is played, one may also give a positive reward for performing the null action in those situations. If the positive reward is high enough, an agent like MuZero (which explicitly tries to maximize the expected predicted reward) may then be incentivized to cause the alert sound from being played during deployment. One could then add further environment details that are learned by the world model and that make it clear to the agent how to achieve the alert in deployment. This also has an analogy in the original shutdown paper.