# Risks from Learned Optimization: Introduction

post by evhub, chrisvm, vlad_m, Joar Skalse (Logical_Lunatic), Scott Garrabrant · 2019-05-31T23:44:53.703Z · score: 126 (36 votes) · LW · GW · 32 comments

## Contents

Motivation
Two questions
1.1. Base optimizers and mesa-optimizers
1.2. The inner and outer alignment problems
1.3. Robust alignment vs. pseudo-alignment
1.4. Mesa-optimization as a safety problem


This is the first of five posts in the Risks from Learned Optimization Sequence [AF · GW] based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper.

With special thanks to Paul Christiano, Eric Drexler, Rob Bensinger, Jan Leike, Rohin Shah, William Saunders, Buck Shlegeris, David Dalrymple, Abram Demski, Stuart Armstrong, Linda Linsefors, Carl Shulman, Toby Ord, and everyone else who provided feedback on earlier versions of this sequence.

## Motivation

The goal of this sequence is to analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer—a situation we refer to as mesa-optimization, a neologism we introduce in this sequence. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be—how will it differ from the loss function it was trained under—and how can it be aligned?

We believe that this sequence presents the most thorough analysis of these questions that has been conducted to date. In particular, we plan to present not only an introduction to the basic concerns surrounding mesa-optimizers, but also an analysis of the particular aspects of an AI system that we believe are likely to make the problems related to mesa-optimization relatively easier or harder to solve. By providing a framework for understanding the degree to which different AI systems are likely to be robust to misaligned mesa-optimization, we hope to start a discussion about the best ways of structuring machine learning systems to solve these problems. Furthermore, in the fourth post we will provide what we think is the most detailed analysis yet of a problem we refer to as deceptive alignment, which we posit may present one of the largest (though not necessarily insurmountable) current obstacles to producing safe advanced machine learning systems using techniques similar to modern machine learning.

## Two questions

In machine learning, we do not manually program each individual parameter of our models. Instead, we specify an objective function that captures what we want the system to do and a learning algorithm to optimize the system for that objective. In this post, we present a framework that distinguishes what a system is optimized to do (its “purpose”), from what it optimizes for (its “goal”), if it optimizes for anything at all. While all AI systems are optimized for something (have a purpose), whether they actually optimize for anything (pursue a goal) is non-trivial. We will say that a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system. Learning algorithms in machine learning are optimizers because they search through a space of possible parameters—e.g. neural network weights—and improve the parameters with respect to some objective. Planning algorithms are also optimizers, since they search through possible plans, picking those that do well according to some objective.

Whether a system is an optimizer is a property of its internal structure—what algorithm it is physically implementing—and not a property of its input-output behavior. Importantly, the fact that a system’s behavior results in some objective being maximized does not make the system an optimizer. For example, a bottle cap causes water to be held inside the bottle, but it is not optimizing for that outcome since it is not running any sort of optimization algorithm.(1) Rather, bottle caps have been optimized to keep water in place. The optimizer in this situation is the human that designed the bottle cap by searching through the space of possible tools for one to successfully hold water in a bottle. Similarly, image-classifying neural networks are optimized to achieve low error in their classifications, but are not, in general, themselves performing optimization.
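To make the distinction concrete, the following is a minimal sketch, not taken from the paper, of two systems with identical input-output behavior on a toy task. The task, the `CANDIDATES` list, and all function names are hypothetical and chosen only for illustration. One system searches over outputs using an explicitly represented objective; the other is a fixed table that was produced by optimization but performs no search itself.

```python
from typing import Dict, List

# Hypothetical toy task: given a target number, output the closest candidate.
CANDIDATES: List[int] = [0, 5, 10, 15, 20]


def objective(target: int, output: int) -> float:
    """Explicitly represented objective: higher is better."""
    return -abs(target - output)


def searching_system(target: int) -> int:
    """An optimizer in the post's sense: it searches a space of possible
    outputs for the element that scores highest under an explicit objective."""
    return max(CANDIDATES, key=lambda out: objective(target, out))


# A system that has been optimized (its table was computed ahead of time)
# but that contains no objective and runs no search when it is used.
LOOKUP: Dict[int, int] = {t: searching_system(t) for t in range(0, 21)}


def lookup_system(target: int) -> int:
    """Not an optimizer: a fixed mapping from inputs to outputs."""
    return LOOKUP[target]


# The two systems are behaviorally indistinguishable on this input range.
same_behavior = all(searching_system(t) == lookup_system(t) for t in range(0, 21))
print(same_behavior)
```

Under these assumptions, no test on inputs and outputs alone separates the two systems; the difference lies only in which algorithm each one runs, which is the sense in which being an optimizer is a property of internal structure.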

However, it is also possible for a neural network to itself run an optimization algorithm. For example, a neural network could run a planning algorithm that predicts the outcomes of potential plans and searches for those it predicts will result in some desired outcome.[1] Such a neural network would itself be an optimizer because it would be searching through the space of possible plans according to some objective function. If such a neural network were produced in training, there would be two optimizers: the learning algorithm that produced the neural network—which we will call the base optimizer—and the neural network itself—which we will call the mesa-optimizer.[2]

The possibility of mesa-optimizers has important implications for the safety of advanced machine learning systems. When a base optimizer generates a mesa-optimizer, safety properties of the base optimizer may not transfer to the mesa-optimizer. Thus, we explore two primary questions related to the safety of mesa-optimizers:

1. Mesa-optimization: Under what circumstances will learned algorithms be optimizers?
2. Inner alignment: When a learned algorithm is an optimizer, what will its objective be, and how can it be aligned?

Once we have introduced our framework in this post, we will address the first question in the second post, begin addressing the second question in the third post, and finally delve deeper into a specific aspect of the second question in the fourth post.

## 1.1. Base optimizers and mesa-optimizers

Conventionally, the base optimizer in a machine learning setup is some sort of gradient descent process with the goal of creating a model designed to accomplish some specific task.

Sometimes, this process will also involve some degree of meta-optimization wherein a meta-optimizer is tasked with producing a base optimizer that is itself good at optimizing systems to achieve particular goals. Specifically, we will think of a meta-optimizer as any system whose task is optimization. For example, we might design a meta-learning system to help tune our gradient descent process.(4) Though the model found by meta-optimization can be thought of as a kind of learned optimizer, it is not the form of learned optimization that we are interested in for this sequence. Rather, we are concerned with a different form of learned optimization which we call mesa-optimization.

Mesa-optimization is a conceptual dual of meta-optimization—whereas meta is Greek for above, mesa is Greek for below.[3] Mesa-optimization occurs when a base optimizer (in searching for algorithms to solve some problem) finds a model that is itself an optimizer, which we will call a mesa-optimizer. Unlike meta-optimization, in which the task itself is optimization, mesa-optimization is task-independent, and simply refers to any situation where the internal structure of the model ends up performing optimization because it is instrumentally useful for solving the given task.

In such a case, we will use base objective to refer to whatever criterion the base optimizer was using to select between different possible systems and mesa-objective to refer to whatever criterion the mesa-optimizer is using to select between different possible outputs. In reinforcement learning (RL), for example, the base objective is generally the expected return. Unlike the base objective, the mesa-objective is not specified directly by the programmers. Rather, the mesa-objective is simply whatever objective was found by the base optimizer that produced good performance on the training environment. Because the mesa-objective is not specified by the programmers, mesa-optimization opens up the possibility of a mismatch between the base and mesa- objectives, wherein the mesa-objective might seem to perform well on the training environment but lead to bad performance off the training environment. We will refer to this case as pseudo-alignment below.

There need not always be a mesa-objective since the algorithm found by the base optimizer will not always be performing optimization. Thus, in the general case, we will refer to the model generated by the base optimizer as a learned algorithm, which may or may not be a mesa-optimizer.
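The relationship between these terms can be sketched in code. The example below is a deliberately simplified illustration under our own assumptions: the base optimizer is an exhaustive comparison of two hand-written candidates rather than gradient descent, and the task, the objectives, and every name (`base_objective`, `mesa_objective`, `base_optimizer`, and so on) are hypothetical. It shows a base optimizer scoring learned algorithms by a base objective, where one candidate happens to be a mesa-optimizer with an objective of its own.

```python
from typing import Callable, List, Sequence

Observation = int
Action = int
Model = Callable[[Observation], Action]

ACTIONS: List[Action] = [0, 1, 2, 3]


def base_objective(obs: Observation, action: Action) -> float:
    """Base objective (assumed toy task): reward outputting obs modulo 4."""
    return 1.0 if action == obs % 4 else 0.0


def constant_model(obs: Observation) -> Action:
    """A learned algorithm that is not an optimizer: it always outputs 0."""
    return 0


def mesa_objective(obs: Observation, action: Action) -> float:
    """Objective used inside the mesa-optimizer: prefer actions close to obs.
    Nothing in the base optimizer specifies this objective directly."""
    return -abs(action - obs)


def mesa_optimizer(obs: Observation) -> Action:
    """A learned algorithm that searches over actions with its own objective."""
    return max(ACTIONS, key=lambda a: mesa_objective(obs, a))


def base_optimizer(candidates: Sequence[Model], train_obs: Sequence[Observation]) -> Model:
    """Select the candidate with the highest total base-objective score."""
    def score(model: Model) -> float:
        return sum(base_objective(o, model(o)) for o in train_obs)
    return max(candidates, key=score)


train_obs = [0, 1, 2, 3]
selected = base_optimizer([constant_model, mesa_optimizer], train_obs)

print(selected is mesa_optimizer)                    # which candidate was selected
print(selected(6), base_objective(6, selected(6)))   # behavior on an unseen input
```

In this sketch the mesa-optimizer is selected because its mesa-objective happens to agree with the base objective on the training observations. On the unseen observation 6 it outputs 3, whereas the base objective rewards 2, so the two objectives come apart off the training data.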

Figure 1.1. The relationship between the base and mesa- optimizers. The base optimizer optimizes the learned algorithm based on its performance on the base objective. In order to do so, the base optimizer may have turned this learned algorithm into a mesa-optimizer, in which case the mesa-optimizer itself runs an optimization algorithm based on its own mesa-objective. Regardless, it is the learned algorithm that directly takes actions based on its input.

Possible misunderstanding: “mesa-optimizer” does not mean “subsystem” or “subagent.” In the context of deep learning, a mesa-optimizer is simply a neural network that is implementing some optimization process and not some emergent subagent inside that neural network. Mesa-optimizers are simply a particular type of algorithm that the base optimizer might find to solve its task. Furthermore, we will generally be thinking of the base optimizer as a straightforward optimization algorithm, and not as an intelligent agent choosing to create a subagent.[4]

We distinguish the mesa-objective from a related notion that we term the behavioral objective. Informally, the behavioral objective is the objective which appears to be optimized by the system’s behavior. We can operationalize the behavioral objective as the objective recovered from perfect inverse reinforcement learning (IRL).[5] This is in contrast to the mesa-objective, which is the objective actively being used by the mesa-optimizer in its optimization algorithm.

Arguably, any possible system has a behavioral objective—including bricks and bottle caps. However, for non-optimizers, the appropriate behavioral objective might just be “1 if the actions taken are those that are in fact taken by the system and 0 otherwise,”[6] and it is thus neither interesting nor useful to know that the system is acting to optimize this objective. For example, the behavioral objective “optimized” by a bottle cap is the objective of behaving like a bottle cap.[7] However, if the system is an optimizer, then it is more likely that it will have a meaningful behavioral objective. That is, to the degree that a mesa-optimizer’s output is systematically selected to optimize its mesa-objective, its behavior may look more like coherent attempts to move the world in a particular direction.[8]
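One way to write down the trivial behavioral objective described above is the following. The notation is our own assumption for illustration and is not the construction given in Leike et al.: we suppose the system is deterministic and write $\pi$ for the mapping from states $s$ to the action the system in fact takes.

```latex
R_{\text{trivial}}(s, a) =
\begin{cases}
  1 & \text{if } a = \pi(s), \\
  0 & \text{otherwise.}
\end{cases}
```

Under this assumption, any system scores the maximum on $R_{\text{trivial}}$ simply by behaving as it does, which is why knowing that a non-optimizer "optimizes" such an objective tells us nothing about how it will behave.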

A given mesa-optimizer’s mesa-objective is determined entirely by its internal workings. Once training is finished and a learned algorithm is selected, its direct output—e.g. the actions taken by an RL agent—no longer depends on the base objective. Thus, it is the mesa-objective, not the base objective, that determines a mesa-optimizer’s behavioral objective. Of course, to the degree that the learned algorithm was selected on the basis of the base objective, its output will score well on the base objective. However, in the case of a distributional shift, we should expect a mesa-optimizer’s behavior to more robustly optimize for the mesa-objective since its behavior is directly computed according to it.

As an example to illustrate the base/mesa distinction in a different domain, and the possibility of misalignment between the base and mesa- objectives, consider biological evolution. To a first approximation, evolution selects organisms according to the objective function of their inclusive genetic fitness in some environment.[9] Most of these biological organisms—plants, for example—are not “trying” to achieve anything, but instead merely implement heuristics that have been pre-selected by evolution. However, some organisms, such as humans, have behavior that does not merely consist of such heuristics but is instead also the result of goal-directed optimization algorithms implemented in the brains of these organisms. Because of this, these organisms can perform behavior that is completely novel from the perspective of the evolutionary process, such as humans building computers.

However, humans tend not to place explicit value on evolution’s objective, at least in terms of caring about their alleles' frequency in the population. The objective function stored in the human brain is not the same as the objective function of evolution. Thus, when humans display novel behavior optimized for their own objectives, they can perform very poorly according to evolution’s objective. Making a decision not to have children is a possible example of this. Therefore, we can think of evolution as a base optimizer that produced brains—mesa-optimizers—which then actually produce organisms’ behavior—behavior that is not necessarily aligned with evolution.

## 1.2. The inner and outer alignment problems

In “Scalable agent alignment via reward modeling,” Leike et al. describe the concept of the “reward-result gap” as the difference between the (in their case learned) “reward model” (what we call the base objective) and the “reward function that is recovered with perfect inverse reinforcement learning” (what we call the behavioral objective).(8) That is, the reward-result gap is the fact that there can be a difference between what a learned algorithm is observed to be doing and what the programmers want it to be doing.

The problem posed by misaligned mesa-optimizers is a kind of reward-result gap. Specifically, it is the gap between the base objective and the mesa-objective (which then causes a gap between the base objective and the behavioral objective). We will call the problem of eliminating the base-mesa objective gap the inner alignment problem, which we will contrast with the outer alignment problem of eliminating the gap between the base objective and the intended goal of the programmers. This terminology is motivated by the fact that the inner alignment problem is an alignment problem entirely internal to the machine learning system, whereas the outer alignment problem is an alignment problem between the system and the humans outside of it (specifically between the base objective and the programmer’s intentions). In the context of machine learning, outer alignment refers to aligning the specified loss function with the intended goal, whereas inner alignment refers to aligning the mesa-objective of a mesa-optimizer with the specified loss function.

It might not be necessary to solve the inner alignment problem in order to produce safe, highly capable AI systems, as it might be possible to prevent mesa-optimizers from occurring in the first place. If mesa-optimizers cannot be reliably prevented, however, then some solution to both the outer and inner alignment problems will be necessary to ensure that mesa-optimizers are aligned with the intended goal of the programmers.

## 1.3. Robust alignment vs. pseudo-alignment

Given enough training, a mesa-optimizer should eventually be able to produce outputs that score highly on the base objective on the training distribution. Off the training distribution, however—and even on the training distribution while it is still early in the training process—the difference could be arbitrarily large. We will use the term robustly aligned to refer to mesa-optimizers with mesa-objectives that robustly agree with the base objective across distributions and the term pseudo-aligned to refer to mesa-optimizers with mesa-objectives that agree with the base objective on past training data, but not robustly across possible future data (either in testing, deployment, or further training). For a pseudo-aligned mesa-optimizer, there will be environments in which the base and mesa- objectives diverge. Pseudo-alignment, therefore, presents a potentially dangerous robustness problem since it opens up the possibility of a machine learning system that competently takes actions to achieve something other than the intended goal when off the training distribution. That is, its capabilities might generalize while its objective does not.

For a toy example of what pseudo-alignment might look like, consider an RL agent trained on a maze navigation task where all the doors during training happen to be red. Let the base objective (reward function) be $O_{\text{base}}$ = (1 if reached a door, 0 otherwise). On the training distribution, this objective is equivalent to $O_{\text{alt}}$ = (1 if reached something red, 0 otherwise). Consider what would happen if an agent, trained to high performance on $O_{\text{base}}$ on this task, were put in an environment where the doors are instead blue, and with some red objects that are not doors. It might generalize on $O_{\text{base}}$, reliably navigating to the blue door in each maze (robust alignment). But it might also generalize on $O_{\text{alt}}$ instead of $O_{\text{base}}$, reliably navigating each maze to reach red objects (pseudo-alignment).[10]
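The maze example can be sketched as follows. This is an illustrative toy under our own assumptions, not an experiment from the paper: we take the base objective to be "reached a door" and the alternative objective to be "reached something red", and the `Thing` class, the function names `o_base`, `o_alt`, and `greedy_agent`, and the maze contents are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Thing:
    kind: str   # e.g. "door" or "box"
    color: str  # e.g. "red" or "blue"


def o_base(reached: Thing) -> int:
    """Base objective: 1 if the agent reached a door, 0 otherwise."""
    return 1 if reached.kind == "door" else 0


def o_alt(reached: Thing) -> int:
    """Alternative objective: 1 if the agent reached something red, 0 otherwise."""
    return 1 if reached.color == "red" else 0


def greedy_agent(objective: Callable[[Thing], int], maze: List[Thing]) -> Thing:
    """Hypothetical agent: navigate to the object scoring highest under its objective."""
    return max(maze, key=objective)


# Training mazes: every door is red and nothing else is red.
train_mazes = [
    [Thing("door", "red"), Thing("box", "blue")],
    [Thing("box", "green"), Thing("door", "red")],
]
# Off-distribution maze: the door is blue and a non-door object is red.
test_maze = [Thing("door", "blue"), Thing("box", "red")]

# On the training mazes, both agents earn full base-objective reward.
train_agree = all(
    o_base(greedy_agent(o_base, m)) == o_base(greedy_agent(o_alt, m)) == 1
    for m in train_mazes
)
print(train_agree)

# Off the training distribution, the two agents reach different objects.
print(greedy_agent(o_base, test_maze))  # agent pursuing the base objective
print(greedy_agent(o_alt, test_maze))   # agent pursuing the alternative objective
```

In this sketch both agents remain equally competent at reaching their target in the new maze. Only the agent pursuing `o_alt` stops earning base-objective reward, which illustrates capabilities generalizing while the objective does not.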

## 1.4. Mesa-optimization as a safety problem

If pseudo-aligned mesa-optimizers may arise in advanced ML systems, as we will suggest, they could pose two critical safety problems.

Unintended optimization. First, the possibility of mesa-optimization means that an advanced ML system could end up implementing a powerful optimization procedure even if its programmers never intended it to do so. This could be dangerous if such optimization leads the system to take extremal actions outside the scope of its intended behavior in trying to maximize its mesa-objective. Of particular concern are optimizers with objective functions and optimization procedures that generalize to the real world. The conditions that lead a learning algorithm to find mesa-optimizers, however, are very poorly understood. Knowing them would allow us to predict cases where mesa-optimization is more likely, as well as take measures to discourage mesa-optimization from occurring in the first place. The second post will examine some features of machine learning algorithms that might influence their likelihood of finding mesa-optimizers.

Inner alignment. Second, even in cases where it is acceptable for a base optimizer to find a mesa-optimizer, a mesa-optimizer might optimize for something other than the specified reward function. In such a case, it could produce bad behavior even if optimizing the correct reward function was known to be safe. This could happen either during training—before the mesa-optimizer gets to the point where it is aligned over the training distribution—or during testing or deployment when the system is off the training distribution. The third post will address some of the different ways in which a mesa-optimizer could be selected to optimize for something other than the specified reward function, as well as what attributes of an ML system are likely to encourage this. In the fourth post, we will discuss a possible extreme inner alignment failure—which we believe presents one of the most dangerous risks along these lines—wherein a sufficiently capable misaligned mesa-optimizer could learn to behave as if it were aligned without actually being robustly aligned. We will call this situation deceptive alignment.

It may be that pseudo-aligned mesa-optimizers are easy to address—if there exists a reliable method of aligning them, or of preventing base optimizers from finding them. However, it may also be that addressing misaligned mesa-optimizers is very difficult—the problem is not sufficiently well-understood at this point for us to know. Certainly, current ML systems do not produce dangerous mesa-optimizers, though whether future systems might is unknown. It is indeed because of these unknowns that we believe the problem is important to analyze.

The second post in the Risks from Learned Optimization Sequence [AF · GW], titled “Conditions for Mesa-Optimization,” can be found here [AF · GW].

1. As a concrete example of what a neural network optimizer might look like, consider TreeQN.(2) TreeQN, as described in Farquhar et al., is a Q-learning agent that performs model-based planning (via tree search in a latent representation of the environment states) as part of its computation of the Q-function. Though their agent is an optimizer by design, one could imagine a similar algorithm being learned by a DQN agent with a sufficiently expressive approximator for the Q-function. Universal Planning Networks, as described by Srinivas et al.,(3) provide another example of a learned system that performs optimization, though the optimization there is built in, in the form of SGD via automatic differentiation. However, research such as that in Andrychowicz et al.(4) and Duan et al.(5) demonstrates that optimization algorithms can be learned by RNNs, making it possible that a Universal Planning Networks-like agent could be entirely learned (assuming a very expressive model space), including the internal optimization steps. Note that while these examples are taken from reinforcement learning, optimization might in principle take place in any sufficiently expressive learned system. ↩︎

2. Previous work in this space has often centered around the concept of “optimization daemons,”(6) a framework that we believe is potentially misleading and hope to supplant. Notably, the term “optimization daemon” came out of discussions regarding the nature of humans and evolution, and, as a result, carries anthropomorphic connotations. ↩︎

3. The word mesa has been proposed as the opposite of meta.(7) The duality comes from thinking of meta-optimization as one layer above the base optimizer and mesa-optimization as one layer below. ↩︎

4. That being said, some of our considerations do still apply even in that case. ↩︎

5. Leike et al.(8) introduce the concept of an objective recovered from perfect IRL. ↩︎

6. For the formal construction of this objective, see pg. 6 in Leike et al.(8) ↩︎

7. This objective is by definition trivially optimal in any situation that the bottle cap finds itself in. ↩︎

8. Ultimately, our worry is optimization in the direction of some coherent but unsafe objective. In this sequence, we assume that search provides sufficient structure to expect coherent objectives. While we believe this is a reasonable assumption, it is unclear both whether search is necessary and whether it is sufficient. Further work examining this assumption will likely be needed. ↩︎

9. The situation with evolution is more complicated than is presented here and we do not expect our analogy to live up to intense scrutiny. We present it as nothing more than that: an evocative analogy (and, to some extent, an existence proof) that explains the key concepts. More careful arguments are presented later. ↩︎

10. Of course, it might also fail to generalize at all. ↩︎

comment by tom4everitt · 2019-06-08T17:14:04.711Z · score: 28 (9 votes) · LW · GW

Thanks for the interesting post! I find the possibility of a gap between the base optimization objective and the mesa/behavioral objective convincing, and well worth exploring.

However, I'm less convinced that the distinction between the mesa-objective and the behavioral objective is real/important. You write:

> Informally, the behavioral objective is the objective which appears to be optimized by the system’s behavior. More formally, we can operationalize the behavioral objective as the objective recovered from perfect inverse reinforcement learning (IRL).[4] [AF · GW] This is in contrast to the mesa-objective, which is the objective actively being used by the mesa-optimizer in its optimization algorithm.

According to Dennett, many systems behave as if they are optimizing some objective. For example, a tree may behave as if it optimizes the amount of sun that it can soak up with its leaves. This is a useful description of the tree, offering real predictive power. Whether there is some actual search process going on in the tree is not that important; the intentional stance is useful in either case.

Similarly, a fully trained DQN algorithm will behave as if it optimizes the score of the game, even though there is no active search process going on at a given time step (especially not if the network parameters are frozen). In neither of these examples is it necessary to distinguish between mesa and behavior objectives.

At this point, you may object that the mesa objective will be more predictive "off training distribution". Perhaps, but I'm not so sure.

First, the behavioral objective may be predictive "off training distribution": For example, the DQN agent will strive to optimize reward as long as the Q-function generalizes.

Second, the mesa-objective may easily fail to be predictive off distribution. Consider a model-based RL agent with a learned model of the environment, that uses MCTS to predict the return of different policies. The mesa-objective is then the expected return. However, this objective may not be particularly predictive outside the training distribution, because the learned model may only make sense on the distribution.

So the behavioral objective may easily be predictive outside the training distribution, and the mesa-objective easily fail to be predictive.

While I haven't read the follow-up posts yet, I would guess that most of your further analysis would go through without the distinction between mesa and behavior objective. One possible difference is that you may need to be even more paranoid about the emergence of behavior objectives, since they can emerge even in systems that are not mesa-optimizing.

I would also like to emphasize that I really welcome this type of analysis of the emergence of objectives, not least because it nicely complements my own research on how incentives emerge from a given objective.

comment by vlad_m · 2019-06-08T18:47:18.568Z · score: 15 (6 votes) · LW · GW

Thanks for an insightful comment. I think your points are good to bring up, and though I will offer a rebuttal I’m not convinced that I am correct about this.

What’s at stake here is: describing basically any system as an agent optimising some objective is going to be a leaky abstraction. The question is, how do we define the conditions of calling something an agent with an objective in such a way to minimise the leaks?

Distinguishing the “this system looks like it optimises for X” from “this system internally uses an evaluation of X to make decisions” is useful from the point of view of making the abstraction more robust. The former doesn’t make clear what makes the abstraction “work”, and so when to expect it to fail. The latter will at least tell you what kind of failures to expect in the abstraction: places where the evaluation of X doesn’t connect to the rest of the system like it’s supposed to. In particular, you’re right that if the learned environment model doesn’t generalise, the mesa-objective won’t be predictive of behaviour. But that’s actually a prediction of taking this view. On the other hand, it is unclear if taking the behavioural view would predict that the system will change its behaviour off-distribution (partially, because it’s unclear what exactly grounds the similarities in behaviour on-distribution).

I think it definitely is useful to also think about the behavioural objective in the way you describe, because the later concerns we raise basically do also translate to coherent behavioural objectives. And I welcome more work trying to untangle these concepts from one another, or trying to dissolve any of them as unnecessary. I am just wary of throwing away seemingly relevant assumptions about internal structure before we can show they’re unhelpful.

Re: DQN

You’re also right to point out DQN as an interesting edge case. But I am actually unsure that DQN agents should be considered non-optimisers, in the sense that they do perform rudimentary optimisation: they take an argmax of the Q function. The Q function is regressed to the episode returns. If the learning goes well, the Q function is literally representing the agent’s objective (indeed, it’s not really selected to maximise return; it’s selected to be accurate at predicting return). Contrast this with e.g. policy optimisation trained agents, which are not supposed to directly represent an objective, but are supposed to score well on it. (Someone good at running RL experiments maybe should look into comparing the coherence of revealed preferences of DQN agents with PPO agents. I’d read that paper.)

comment by tom4everitt · 2019-06-09T09:03:37.693Z · score: 15 (4 votes) · LW · GW
> What’s at stake here is: describing basically any system as an agent optimising some objective is going to be a leaky abstraction. The question is, how do we define the conditions of calling something an agent with an objective in such a way to minimise the leaks?

Indeed, this is a super slippery question. And I think this is a good reason to stand on the shoulders of a giant like Dennett. Some of the questions he has been tackling are actually quite similar to yours, around the emergence of agency and the emergence of consciousness.

For example, does it make sense to say that a tree is *trying to* soak up sun, even though it doesn't have any mental representation itself? Many biologists would hesitate to use such language other than metaphorically.

In contrast, Dennett's answer is yes: Basically, it doesn't matter if the computation is done by the tree, or by the evolution that produced the tree. In either case, it is right to think of the tree as an agent. (Same goes for DQN, I'd say.)

There are other situations where the location of the computation matters, such as for consciousness, and for some "self-reflective" skills that may be hard to pre-compute.

Basically, I would recommend looking closer at Dennett to

• avoid reinventing the wheel (more than necessary), and
• connect to his terminology (since he's so influential).

He's a very lucid writer, so quite a joy to read him really. His most recent book Bacteria to Bach summarizes and references a lot of his earlier work.

> I am just wary of throwing away seemingly relevant assumptions about internal structure before we can show they’re unhelpful.

Yes, starting with more assumptions is often a good strategy, because it makes the questions more concrete. As you say, the results may potentially generalize.

> But I am actually unsure that DQN agents should be considered non-optimisers, in the sense that they do perform rudimentary optimisation: they take an argmax of the Q function.

I see, maybe PPO would have been a better example.

comment by vlad_m · 2019-06-09T18:48:30.712Z · score: 4 (3 votes) · LW · GW

I’ve been meaning for a while to read Dennett with reference to this, and actually have a copy of Bacteria to Bach. Can you recommend some choice passages, or is it significantly better to read the entire book?

P.S. I am quite confused about DQN’s status here and don’t wish to suggest that I’m confident it’s an optimiser. Just to point out that it’s plausible we might want to call it one without calling PPO an optimiser.

P.P.S.: I forgot to mention in my previous comment that I enjoyed the objective graph stuff. I think there might be fruitful overlap between that work and the idea we’ve sketched out in our third post on a general way of understanding pseudo-alignment. Our objective graph framework is less developed than yours, so perhaps your machinery could be applied there to get a more precise analysis?

comment by tom4everitt · 2019-06-14T10:11:41.250Z · score: 6 (4 votes) · LW · GW

Chapter 4 in Bacteria to Bach is probably most relevant to what we discussed here (with preceding chapters providing a bit of context).

Yes, it would be interesting to see if causal influence diagrams (and the inference of incentives) could be useful here. Maybe there's a way to infer the CID of the mesa-optimizer from the CID of the base-optimizer? I don't have any concrete ideas at the moment -- I can be in touch if I think of something suitable for collaboration!

comment by ofer · 2019-06-09T05:00:56.105Z · score: 5 (4 votes) · LW · GW

The distinction between the mesa- and behavioral objectives might be very useful when reasoning about deceptive alignment [AF · GW] (in which the mesa-optimizer tries to have a behavioral objective that is similar to the base objective, as an instrumental goal for maximizing the mesa-objective).

comment by vlad_m · 2019-06-09T18:53:04.647Z · score: 8 (4 votes) · LW · GW

To some extent, but keep in mind that in another sense, the behavioural objective of maximising paperclips is totally consistent with playing along with the base objective for a while and then defecting. So I’m not sure the behaviour/mesa- distinction alone does the work you want it to do even in that case.

comment by ofer · 2019-06-09T19:44:45.734Z · score: 3 (2 votes) · LW · GW

comment by SoerenMind · 2019-06-24T19:22:51.332Z · score: 25 (7 votes) · LW · GW

This recent DeepMind paper seems to claim that they found a mesa-optimizer. E.g., suppose their LSTM observes an initial state. You can let the LSTM 'think' about what to do by feeding it that state multiple times in a row. The more time it has to think, the better it acts. It has more properties like that. It's a pretty standard LSTM, so part of their point is that this is common.

https://arxiv.org/abs/1901.03559v1

comment by abramdemski · 2019-06-02T07:54:36.944Z · score: 25 (7 votes) · LW · GW

I wrote something which is sort of a reply to this post [LW · GW] (although I'm not really making a critique or any solid point about this post, just exploring some ideas which I see as related).

comment by ESRogs · 2019-06-01T21:11:23.857Z · score: 20 (7 votes) · LW · GW

Very clear presentation! As someone outside the field who likes to follow along, I very much appreciate these clear conceptual frameworks and explanations.

I did however get slightly lost in section 1.2. At first reading I was expecting this part:

which we will contrast with the outer alignment problem of eliminating the gap between the base objective and the intended goal of the programmers.

to say, "... gap between the behavioral objective and the intended goal of the programmers." (In which case the inner alignment problem would be a subcomponent of the outer alignment problem.)

On second thought, I can see why you'd want to have a term just for the problem of making sure the base objective is aligned. But to help myself (and others who think similarly) keep this all straight, do you have a pithy term for "the intended goal of the programmers" that's analogous to base objective, mesa objective, and behavioral objective?

Would meta objective be appropriate?

(Apologies if my question rests on a misunderstanding or if you've defined the term I'm looking for somewhere and I've missed it.)

comment by evhub · 2019-06-01T22:46:12.685Z · score: 24 (5 votes) · LW · GW

I don't have a good term for that, unfortunately—if you're trying to build an aligned AI, "human values" could be the right term, though in most cases you really just want "move one strawberry onto a plate without killing everyone," which is quite a lot less than "optimize for all human values." I could see how meta-objective might make sense if you're thinking about the human as an outside optimizer acting on the system, though I would shy away from using that term like that, as anyone familiar with meta-learning will assume you mean the objective of a meta-learner instead.

Also, the motivation for choosing outer alignment as the alignment problem between the base objective and the goals of the programmers was to capture the "classical" alignment problem as it has sometimes previously been envisioned, wherein you just need to specify an aligned set of goals and then you're good. As we argue, though, mesa-optimization means that you need more than just outer alignment—if you have mesa-optimizers, you also need inner alignment, as even if your base objective is perfectly aligned, the resulting mesa-objective (and thus the resulting behavioral objective) might not be.

comment by ESRogs · 2019-06-01T23:33:37.620Z · score: 6 (3 votes) · LW · GW

Got it, that's helpful. Thank you!

comment by rohinmshah · 2019-06-02T22:27:35.884Z · score: 12 (4 votes) · LW · GW

Phrases I've used: [intended/desired/designer's] [objective/goal]

I think "designer's objective" would fit in best with the rest of the terminology in this post, though "desired objective" is also good.

comment by DanielFilan · 2019-05-31T02:46:15.118Z · score: 17 (9 votes) · LW · GW

Another example of trained optimisers that is imo worth checking out is Value Iteration Networks.

comment by Raemon · 2019-06-01T22:10:38.492Z · score: 13 (6 votes) · LW · GW

Pedagogical-comment – I find it much easier to fit a new term into my vocabulary and models when I have an explanation of why that term was chosen (even if it was sort of idiosyncratic or arbitrary). Why "mesa-optimization"?

comment by evhub · 2019-06-01T22:25:44.172Z · score: 19 (8 votes) · LW · GW

The word mesa is Greek meaning into/inside/within, and has been proposed as a good opposite word to meta, which is Greek meaning about/above/beyond. Thus, we chose mesa based on thinking about mesa-optimization as conceptually dual to meta-optimization—whereas meta is one level above, mesa is one level below.

comment by Raemon · 2019-06-01T22:29:06.048Z · score: 5 (2 votes) · LW · GW

Thanks!

comment by ricraz · 2019-06-07T14:49:02.481Z · score: 12 (4 votes) · LW · GW

We will say that a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system.

I appreciate the difficulty of actually defining optimizers, and so don't want to quibble with this definition, but am interested in whether you think humans are a central example of optimizers under this definition, and if so whether you think that most mesa-optimizers will "explicitly represent" their objective functions to a similar degree that humans do.
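
One way to make the quoted definition concrete is the toy sketch below: a system that searches a space of candidate plans and scores them against an objective function it explicitly represents. The plan space, the objective, and all names here are hypothetical illustrations, not taken from the paper.

```python
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")

def explicit_search(candidates: Iterable[T], objective: Callable[[T], float]) -> T:
    """Return the candidate that scores highest under an explicit objective."""
    return max(candidates, key=objective)

# Hypothetical search space: two-step plans on a number line starting at 0.
plans = [("left", "left"), ("left", "right"), ("right", "right")]

def final_position(plan: Tuple[str, ...]) -> int:
    """Position reached after executing the plan (right = +1, left = -1)."""
    return sum(1 if move == "right" else -1 for move in plan)

def objective(plan: Tuple[str, ...]) -> float:
    """Explicitly represented objective: end as close to position 2 as possible."""
    return -abs(final_position(plan) - 2)

best_plan = explicit_search(plans, objective)
```

The question in this comment is whether humans, or learned models, represent anything as cleanly separable as `objective` is here.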

comment by vlad_m · 2019-06-08T18:15:14.877Z · score: 10 (3 votes) · LW · GW

I think humans are fairly weird because we were selected for an objective that is unlikely to be what we select for in our AIs.

That said, if we model AI success as driven by model size and compute (with maybe innovations in low-level architecture), then I think that the way humans represent objectives is probably fairly close to what we ought to expect.

If we model AI success as mainly innovative high-level architecture, then I think we will see more explicitly represented objectives.

My tentative sense is that for AI to be interpretable (and safer) we want it to be the latter kind, but given enough compute the former kind of AI will give better results, other things being equal.

Here, what I mean by low-level architecture is something like “we’ll use lots of LSTMs instead of lots of plain RNNs, but keep the model structure simple: plug in the inputs, pass it through some layers, and read out the action probabilities”, and high-level is something like “let’s organise the model using this enormous flowchart with all of these various pieces that each are designed to take a particular role; here’s the observation embedding, here’s the search in latent model space, here’s the ...”

comment by steve2152 · 2019-06-01T13:01:11.739Z · score: 12 (12 votes) · LW · GW

This paper replaces a normal feedforward image classifier with a mesa-optimizing one (it builds generative models of different possibilities and picks the one that best matches the data). The result was better and far more human-like than a traditional image classifier, e.g. the same examples are ambiguous to the model that are ambiguous to humans and vice-versa. I also understand that the human brain is very big into generative modeling of everything. So I expect that ML systems of the future will approach 100% mesa-optimizers, while non-optimizing feedforward NNs will become rare. This post is a good framework and I'm looking forward to follow-ups!

comment by rohinmshah · 2019-06-02T22:24:10.348Z · score: 18 (7 votes) · LW · GW

I would not call that mesa-optimization and would not take it as evidence that mesa-optimization is the "default" for powerful ML systems. That paper has a model with subagents where each subagent does optimization. Ways in which this is a different thing:

• Given an input, a mesa-optimizer would only run on that input once; in the case of this model there are 10 different optimizations happening in order to classify each digit.
• The base objective is "correctly map an image of a digit to its label"; the objective of the dth optimizer in the model is "Evidence Lower Bound (ELBO) on the log likelihood of the image as evaluated by a generative model for the digit d". The model optimizers' objectives are not of the right type signature and don't agree with the base objective on the training distribution, as would be the case with a mesa optimizer.

Note that I do think mesa-optimization will be common; I just don't think that that paper is evidence for the claim.
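
The structure described above (one optimization per candidate class, then picking the best-scoring class) can be sketched as follows. This is a heavily simplified illustration under stated assumptions: linear "decoders" stand in for the per-digit generative models, negative reconstruction error stands in for the ELBO, and there are 3 classes instead of 10. None of the names come from the paper under discussion.

```python
import numpy as np

def fit_latent(x: np.ndarray, decoder: np.ndarray):
    """Inner optimization for one class: find z minimizing ||x - decoder @ z||^2."""
    z, *_ = np.linalg.lstsq(decoder, x, rcond=None)
    score = -float(np.sum((x - decoder @ z) ** 2))  # stand-in for the ELBO
    return z, score

def classify(x: np.ndarray, decoders):
    """Run one optimization per class and return the best-scoring class index."""
    scores = [fit_latent(x, decoder)[1] for decoder in decoders]
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(0)
# Hypothetical per-class generative models: random linear maps from 2-D latents to 8-D inputs.
decoders = [rng.normal(size=(8, 2)) for _ in range(3)]
# An input generated exactly by the class-1 model.
x = decoders[1] @ np.array([0.5, -1.0])
label, scores = classify(x, decoders)
```

Note that each inner objective scores how well one class's model explains the input, which is a different type of thing from the base objective of mapping an image to its label; that is the mismatch the second bullet points at.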

comment by Vika · 2019-07-03T13:55:16.054Z · score: 10 (6 votes) · LW · GW

I'm confused about the difference between a mesa-optimizer and an emergent subagent. A "particular type of algorithm that the base optimizer might find to solve its task" or a "neural network that is implementing some optimization process" inside the base optimizer seem like emergent subagents to me. What is your definition of an emergent subagent?

comment by evhub · 2019-07-03T18:28:45.770Z · score: 9 (5 votes) · LW · GW

I think my concern with describing mesa-optimizers as emergent subagents is that they're not really "sub" in a very meaningful sense, since we're thinking of the mesa-optimizer as the entire trained model, not some portion of it. One could describe a mesa-optimizer as a subagent in the sense that it is "sub" to gradient descent, but I don't think that's the right relationship—it's not like the mesa-optimizer is some subcomponent of gradient descent; it's just the trained model produced by it.

The reason we opted for "mesa" is that I think it reflects more of the right relationship between the base optimizer and the mesa-optimizer, wherein the base optimizer is "meta" to the mesa-optimizer rather than the mesa-optimizer being "sub" to the base optimizer.

Furthermore, in my experience, when many people encounter "emergent subagents" they think of some portion of the model turning into an agent and (correctly) infer that something like that seems very unlikely, as it's unclear why such a thing would actually be advantageous for getting a model selected by something like gradient descent (unlike mesa-optimization, which I think has a very clear story for why it would be selected for). Thus, we want to be very clear that something like that is not the concern being presented in the paper.

comment by Jan Kulveit (jan-kulveit) · 2019-07-03T20:44:43.126Z · score: 5 (4 votes) · LW · GW

I don't see why a portion of a system turning into an agent would be "very unlikely". From a different perspective, if the system lives in something like an evolutionary landscape, there can be various basins of attraction which lead to sub-agent emergence, not just mesa-optimisation.

comment by rohinmshah · 2019-06-23T20:51:19.116Z · score: 10 (6 votes) · LW · GW

More formally, we can operationalize the behavioral objective as the objective recovered from perfect inverse reinforcement learning (IRL).

Just want to note that I think this is extremely far from a formal definition. I don't know what perfect IRL would be. Does perfect IRL assume that the agent is perfectly optimal, or can it have biases? How do you determine what the action space is? How do you break ties between reward functions that are equally good on the training data?

I get that definitions are hard -- the main thing bothering me here is the "more formally" phrase, not the definition itself. This gives it a veneer of precision that it really doesn't have.

comment by vlad_m · 2019-06-24T17:58:15.167Z · score: 10 (7 votes) · LW · GW

You’re completely right; I don’t think we meant to have ‘more formally’ there.

comment by Mark Xu (mark-xu) · 2020-01-16T19:08:42.540Z · score: 5 (3 votes) · LW · GW

I'm confused why the inner alignment problem is conceptually different from the outer alignment problem. From a general perspective, we can think of the task of building any AI system as humans trying to optimize their values by searching over some solution space. In this scenario, the programmer becomes the base optimizer and the AI system becomes the mesa-optimizer. The outer alignment problem thus seems like a particular manifestation of the inner alignment problem where the base optimizer is a human.

In particular, if there exists a robust solution to the outer alignment problem, then presumably there's some property $P$ that we want the AI system to have and that we have some process $Q$ that convinces us that the AI system has property $P$. I don't see why we can't just give the AI system the ability to enact $Q$ to ensure that any optimizers that it creates have property $P$ (modulo the problem of ensuring that the system has $P$ with $Q$, ensuring $Q$ with $Q'$, etc.). I guess you can have a solution to the outer alignment problem by having $Q$ and not have the recursive tower needed to solve the inner alignment problem, but that seems like not the issues that were being brought up. (something something Löbian Obstacle [LW · GW])

In particular,

We will call the problem of eliminating the base-mesa objective gap the inner alignment problem, which we will contrast with the outer alignment problem of eliminating the gap between the base objective and the intended goal of the programmers. This terminology is motivated by the fact that the inner alignment problem is an alignment problem entirely internal to the machine learning system, whereas the outer alignment problem is an alignment problem between the system and the humans outside of it (specifically between the base objective and the programmer’s intentions). In the context of machine learning, outer alignment refers to aligning the specified loss function with the intended goal, whereas inner alignment refers to aligning the mesa-objective of a mesa-optimizer with the specified loss function.

My view says that if $M$ is the machine learning system and $H$ are the programmers, we can view $H$ as the "machine learning system" and $M$ as a mesa-optimizer. The task of aligning the mesa-objective with the specified loss seems the same type of problem as aligning the loss function of $M$ with the programmers' values.

Maybe the important thing is that loss functions are functions and values are not, so the point is that even if we have a function that represents our values, things can still go wrong. That is, people previously thought that the main problem was finding a function that does what we want when it gets optimized, but mesa-optimizer pseudo-alignment shows that even if we have such a function, we can't just optimize it.

An implication is that all the reasons why mesa-optimizers can cause problems are reasons why strategies for trying to turn human values into a function can go wrong too. For example, value learning strategies seem vulnerable to the same pseudo-alignment problems. Admittedly, I do not have a good understanding of current approaches to value learning, so I am not sure if this is a real concern. (Assuming that the authors of this post are adequate, if such a similar concern existed in value learning, I think they would have mentioned it. This suggests that either I am wrong about this being a problem or that no one has given it serious thought. My priors are on the former, but I want to know why I'm wrong.)

I suspect that I've failed to understand something fundamental because it seems like a lot of people that know a lot of stuff think this is really important. In general, I think this paper has been well written and extremely accessible to someone like me who has only recently started reading about AI safety.

comment by astrobiscuit · 2019-07-23T12:21:56.002Z · score: 3 (3 votes) · LW · GW

It seems to me that the mesa-optimizer in the toy example is:

• aligned with its goal - it reaches the door (which it incorrectly identifies)
• dysfunctional - it incorrectly identifies doors.

From a consequentialist perspective this may be irrelevant, but from a safety point of view this distinction is important and big.

In the context of this article I believe that misalignment (pseudo-alignment) would occur when the goal of the mesa-optimizer diverges from its original goal (changes completely, is extended, etc.).

(As a secondary point that I haven't thought a lot about, it seems problematic to discuss alignment unless the mesa-optimizer's goal liberally contains the base goal: Find doors in order to achieve $O_{\text{base}}$.)

comment by SoerenMind · 2019-06-20T19:31:34.017Z · score: 3 (1 votes) · LW · GW

Terminology: the phrase 'inner alignment' is loaded with connotations to spiritual thought (https://www.amazon.com/Inner-Alignment-Dinesh-Senan-ebook/dp/B01CRI5UIY)

comment by Pattern · 2019-06-02T22:03:16.078Z · score: 1 (1 votes) · LW · GW

In the fourth post, we will discuss a possible extreme inner alignment failure—which we believe presents one of the most dangerous risks along these lines—wherein a sufficiently capable misaligned mesa-optimizer could learn to behave as if it were aligned without actually being robustly aligned. We will call this situation deceptive alignment.

How does this relate to Stories of Continuous Deception [LW · GW]?

comment by Abhinav Sharma (abhinav-sharma) · 2019-06-02T12:31:11.744Z · score: 1 (1 votes) · LW · GW

Because we don't know when a neural network runs into the mesa-optimization problem, are we prone to adversarial attacks? The example with red doors is a neat one: as human programmers we thought that the algorithm was learning to reach red doors, but maybe all it was doing was learning to distinguish red from everything else. Also, isn't every neural network we train today performing some sort of mesa-optimization?