chavam feed - LessWrong 2.0 Reader

Comment by chrisvm on The Pando Problem: Rethinking AI Individuality

chavam — 2025-04-10T04:23:31.603Z

This comment was written by Claude, based on my bullet points:

I've been thinking about the split-brain patient phenomenon as another angle on this AI individuality question.

Consider split-brain patients: despite having the corpus callosum severed, the two hemispheres don't suddenly become independent agents with totally different goals. They still largely cooperate toward shared objectives. Each hemisphere makes predictions about what the other is doing and adjusts accordingly, even without direct communication.

Why does this happen? I think it's because both hemispheres were trained together for their whole life, developing shared predictive models and cooperative behaviors. When the connection is cut, these established patterns don't just disappear—each hemisphere fills in missing information with predictions based on years of shared experience.

Similarly, imagine training an AI model to solve some larger task, consisting of a bunch of subtasks. Just for practical reasons it will have to carve up the subtask to some extent and call instances of itself to solve the subtask. In order to perform the larger task well, there will be an incentive on the model for these instances to have internal predictive models, habits, drives of something like "I am part of a larger agent, performing a subtask".

Even if we later placed multiple instances of such a model (or of different but similar models) in positions meant to be adversarial - perhaps as checks and balances on each other - they might still have deeply embedded patterns predicting cooperative behavior from similar models. Each instance might continue acting as if it were part of a larger cooperative system, maintaining coordination through these predictive patterns rather than through communication even though their "corpus callosum" is cut (in analogy with split brain patients).

I'm not sure how far this analogy goes, it's just a thought.

Comment by chrisvm on The Pando Problem: Rethinking AI Individuality

chavam — 2025-04-10T02:45:23.474Z

A version of what ChatGPT wrote here prompted

What was the prompt?

Comment by chrisvm on Vacuum Decay: Expert Survey Results

chavam — 2025-03-24T13:48:55.317Z

Overall, compared to the previous question, there was more of a consensus, with 55% of people responding that there is a 0% chance that technologically induced vacuum decay is possible.

Since anywhere near 0% seems way overconfident to me at first sight, just a random highly speculative unsubstantiated thought: Could this be partly motivated reasoning, that they're afraid of a backlash against physics funding or something?

Comment by chrisvm on Vacuum Decay: Expert Survey Results

chavam — 2025-03-24T13:42:33.965Z

They stated justification was primarily that the Standard Model of particle physics predicts metastability

Just to be sure, does this mean
1. That the standard model predicts that metastability is possible? i.e. it is consistent with the standard model for there to be metastability; or
2. If the standard model is correct, and certain empirical observations are correct, then we must be in a metastable state. i.e. the standard model together with certain empirical observations implies our actual universe is metastable?

Comment by chrisvm on Compositional language for hypotheses about computations

chavam — 2025-03-23T16:00:25.722Z

I may be confused somehow. Feel free to ignore. But:
* At first I thought you meant the input alphabet to be the colors, not the operations.
* Instead, am I correct that "the free operad generated by the input alphabet of the tree automaton" is an operad with just one color, and the "operations" are basically all the labeled trees where labels of the nodes are the elements of the alphabet, such that the number of children of a node is always equal to the arity of that label in the input alphabet?
* That would make sense, as the algebra would then I guess assign the state space of the tree automaton to the single color of the operad, and each arity n operation would be mapped to the mathematical function from Q^n to Q.
* That would make sense I think, but then why do you talk about a "colored" operad in: "we can now define a deterministic automaton over a (colored) operad to be an $O$ -algebra"?

Comment by chrisvm on Compositional language for hypotheses about computations

chavam — 2025-03-23T04:17:56.176Z

More precisely, they are algebras over the free operad generated by the input alphabet of the tree automaton

Wouldn't this fail to preserve the arity of the input alphabet? i.e. you can have trees where a given symbol occurs multiple times, and with different amounts of children? That wouldn't be allowed from the perspective of the tree automaton right?

Comment by chrisvm on How might we safely pass the buck to AI?

chavam — 2025-02-20T11:18:39.806Z

Noosphere, why are you responding for a second time to a false interpretation of what Eliezer was saying, directly after he clarified this isn't what he meant?

Comment by chrisvm on The Case Against AI Control Research

chavam — 2025-02-07T14:15:17.007Z

Here is an additional reason why it might seem less useful than it actually is: Maybe the people whose research direction is being criticized do process the criticism and change their views, but do not publicly show that they change their mind because it seems embarrassing. It could be that it takes them some time to change their mind, and by that time there might be a bigger hurdle to letting you know that you were responsible for this, so they keep it to themselves. Or maybe they themselves aren't aware that you were responsible.

Comment by chrisvm on Gradual Disempowerment, Shell Games and Flinches

chavam — 2025-02-04T11:24:17.003Z

but note that the gradual problem makes the risk of coups go up.

Just a request for editing the post to clarify: do you mean coups by humans (using AI), coups by autonomous misaligned AI, or both?

Comment by chrisvm on Many arguments for AI x-risk are wrong

chavam — 2025-01-30T13:51:23.712Z

EDIT 3/5/24: In the comments for Counting arguments provide no evidence for AI doom, Evan Hubinger agreed that one cannot validly make counting arguments over functions. However, he also claimed that his counting arguments "always" have been counting parameterizations, and/or actually having to do with the Solomonoff prior over bitstrings.

As one of Evan's co-authors on the mesa-optimization paper from 2019 I can confirm this. I don't recall ever thinking seriously about a counting argument over functions.

Comment by chrisvm on A Three-Layer Model of LLM Psychology

chavam — 2025-01-28T06:12:09.187Z

I'm trying to figure out to what extent the character/ground layer distinction is different from the simulacrum/simulator distinction. At some points in your comment you seem to say they are mutually inconsistent, but at other points you seem to say they are just different ways of looking at the same thing.

"The key difference is that in the three-layer model, the ground layer is still part of the model's "mind" or cognitive architecture, while in simulator theory, the simulator is a bit more analogous to physics - it's not a mind at all, but rather the rules that minds (and other things) operate under."

I think this clarifies the difference for me, because as I was reading your post I was thinking: If you think of it as a simulacrum/simulator distinction, I'm not sure that the character and the surface layer can be "in conflict" with the ground layer, because both the surface layer and the character layer are running "on top of" the ground layer, like a windows virtual machine on a linux pc, or like a computer simulation running inside physics. Physical can never be "in conflict" with social phenomena.

But it seems you maybe think that the character layer is actually embedded in the basic cognitive architecture. This would be a distinct claim from simulator theory, and *mutually inconsistent*. But I am unsure this is true, because we know that the ground layer was (1) trained first (so that it's easier for character training to work by just adjusting some parameters/prior of the ground layer, and (2) trained for much longer than the character layer (admittedly I'm not up to date on how they're trained, maybe this is no longer true for Claude?), so that it seems hard for the model to have a character layer become separately embedded in the basic architecture.

Taking a more neuroscience rather than psychology analogy: It seems to me more likely that character training is essentially adjusting the prior of the ground layer, but the character is still fully running on top of the ground layer, and the ground layer could still switch to any other character (but it doesn't because the prior is adjusted so heavily by character-training). e.g. the character is not some separate subnetwork inside the model, but remains a simulated entity running on top of the model.

Do you disagree with this?

Comment by chrisvm on Applying traditional economic thinking to AGI: a trilemma

chavam — 2025-01-14T11:24:50.259Z

Minor quibble: It's a bit misleading to call B "experience curves", since it is also about capital accumulation and shifts in labor allocation. Without any additional experience/learning, if demand for candy doubles, we could simply build a second candy factory that does the same thing as the first one, and hire the same number of workers for it.

Comment by chrisvm on What’s the short timeline plan?

chavam — 2025-01-13T16:09:32.000Z

I just want to register a prediction that I think something like meta's coconut will in the long run in fact perform much better than natural language CoT. Perhaps not in this time-frame though.

Comment by chrisvm on Evaluating the historical value misspecification argument

chavam — 2025-01-06T12:31:28.716Z

I suspect you're misinterpreting EY's comment.

Here was the context:
"I think controlling Earth's destiny is only modestly harder than understanding a sentence in English - in the same sense that I think Einstein was only modestly smarter than George W. Bush. EY makes a similar point.

You sound to me like someone saying, sixty years ago: "Maybe some day a computer will be able to play a legal game of chess - but simultaneously defeating multiple grandmasters, that strains credibility, I'm afraid." But it only took a few decades to get from point A to point B. I doubt that going from "understanding English" to "controlling the Earth" will take that long."

It seems clear to me EY was more saying something like "ASI will arrive soon after natural language understanding", rather than it having anything to do with alignment specifically.

Comment by chrisvm on Evaluating the historical value misspecification argument

chavam — 2025-01-06T12:28:48.965Z

"It's fine to say that this is a falsified prediction"

I wouldn't even say it's falsified. The context was: "it only took a few decades to get from [chess computer can make legal chess moves] to [chess computer beats human grandmaster]. I doubt that going from "understanding English" to "controlling the Earth" will take that long."

So insofar as we believe ASI is coming in less than a few decades, I'd say EY's prediction is still on track to turn out correct.

Comment by chrisvm on Cortés, Pizarro, and Afonso as Precedents for Takeover

chavam — 2024-02-27T06:47:08.071Z

NEW EDIT: After reading three giant history books on the subject, I take back my previous edit. My original claims were correct.

Could you edit this comment to add which three books you're referring to?

Extinction Risks from AI: Invisible to Science?

chavam — 2024-02-21T18:07:33.986Z

Abstract: In an effort to inform the discussion surrounding existential risks from AI, we formulate Extinction-level Goodhart’s Law as "Virtually any goal specification, pursued to the extreme, will result in the extinction^[1] of humanity'', and we aim to understand which formal models are suitable for investigating this hypothesis. Note that we remain agnostic as to whether Extinction-level Goodhart's Law holds or not. As our key contribution, we identify a set of conditions that are necessary for a model that aims to be informative for evaluating specific arguments for Extinction-level Goodhart's Law. Since each of the conditions seems to significantly contribute to the complexity of the resulting model, formally evaluating the hypothesis might be exceedingly difficult. This raises the possibility that whether the risk of extinction from artificial intelligence is real or not, the underlying dynamics might be invisible to current scientific methods.

Together with Chris van Merwijk and Ida Mattsson, we have recently written a philosophy-venue version of some of our thoughts on Goodhart's Law in the context of powerful AI [link].^[2] This version of the paper has no math in it, but it attempts to point at one aspect of "Extinction-level Goodhart's Law" that seems particularly relevant for AI advocacy – namely, that the fields of AI and CS would have been unlikely to come across evidence of AI risk, using the methods that are popular in those fields, even if the law did hold in the real world.

Since commenting on link-posts is inconvenient, I split off some of the ideas from the paper into the following separate posts:

Weak vs Quantitative Extinction-level Goodhart's Law: defining different versions of the notion of "Extinction-level Goodhart's Law".
Which Model Properties are Necessary for Evaluating an Argument?: illustrating the methodology of the paper on a simple non-AI example.
Dynamics Crucial to AI Risk Seem to Make for Complicated Models: applying the methodology above to AI risk.

We have more material on this topic, including writing with math^[3] in it, but this is mostly not yet in a publicly shareable form. The exception is the post Extinction-level Goodhart's Law as a Property of the Environment (which is not covered by the paper). If you are interested in discussing anything related to this, definitely reach out.

^{^}
A common comment is that the definition should also include outcomes that are similarly bad or worse than extinction. While we agree that such definition makes sense, we would prefer to refer to that version as "existential", and reserve the "extinction" version for the less ambiguous notion of literal extinction.
^{^}
As an anecdote, it seems worth mentioning that I tried, and failed, to post the paper to arXiv --- by now, it has been stuck there with "on hold" status for three weeks. Given that the paper is called "Existential Risk from AI: Invisible to Science?", there must be some deeper meaning to this. [EDIT: After ~2 months, the paper is now on arXiv.]
^{^}
Or rather, it has pseudo-math in it. By which I mean that it looks like math, but it is built on top of vague concepts such as "optimisation power" and "specification complexity". And while I hope that we will one day be able to formalise these, I don't know how to do so at this point.

Comment by chrisvm on Killing Socrates

chavam — 2023-04-13T11:10:57.292Z

One of the more interesting dynamics of the past eight-or-so years has been watching a bunch of the people who [taught me my values] and [served as my early role models] and [were presented to me as paragons of cultural virtue] going off the deep end.

I'm curious who these people are.

Comment by chrisvm on Is AI Progress Impossible To Predict?

chavam — 2023-04-05T14:21:08.446Z

We should expect regression towards the mean only if the tasks were selected for having high "improvement from small to Gopher-7". Were they?

Comment by chrisvm on Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky

chavam — 2023-04-04T12:53:57.832Z

The reasoning was given in the comment prior to it, that we want fast progress in order to get to immortality sooner.

Comment by chrisvm on Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky

chavam — 2023-03-31T16:20:46.169Z

"But yeah, I wish this hadn't happened."

Who else is gonna write the article? My sense is that no one (including me) is starkly stating publically the seriousness of the situation.

"Yudkowsky is obnoxious, arrogant, and most importantly, disliked, so the more he intertwines himself with the idea of AI x-risk in the public imagination, the less likely it is that the public will take those ideas seriously"

I'm worried about people making character attacks on Yudkowsky (or other alignment researchers) like this. I think the people who think they can probably solve alignment by just going full-speed ahead and winging it, they are arrogant. Yudkowsky's arrogant-sounding comments about how we need to be very careful and slow, are negligible in comparison. I'm guessing you agree with this (not sure) and we should be able to criticise him for his communication style, but I am a little worried about people publically undermining Yudkowsky's reputation in that context. This seems like not what we would do if we were trying to coordinate well.

Comment by chrisvm on Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky

chavam — 2023-03-31T16:04:10.635Z

"We finally managed to solve the problem of deceptive alignment while being capabilities competitive"

??????

Comment by chrisvm on Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky

chavam — 2023-03-31T15:57:54.699Z

"But I don't think you even need Eliezer-levels-of-P(doom) to think the situation warrants that sort of treatment."

Agreed. If a new state develops nuclear weapons, this isn't even close to creating a 10% x-risk, yet the idea of airstrikes on nuclear enrichment facillities, even though it is very controversial, has for a long time very much been an option on the table.

Comment by chrisvm on Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky

chavam — 2023-03-31T15:48:35.793Z

"if I thought the chance of doom was 1% I'd say "full speed ahead!"

This is not a reasonable view. Not on Longtermism, nor on mainstream common sense ethics. This is the view of someone willing to take unacceptable risks for the whole of humanity.

Comment by chrisvm on Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky

chavam — 2023-03-30T23:48:51.768Z

Also, there is a big difference between "Calling for violence", and "calling for the establishment of an international treaty, which is to be enforced by violence if necessary". I don't understand why so many people are muddling this distinction.

Comment by chrisvm on Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky

chavam — 2023-03-30T23:46:15.307Z

You are muddling the meaning of "pre-emptive war", or even "war". I'm not trying to diminish the gravity of Yudkowsky's proposal, but a missile strike on a specific compound known to contain WMD-developing technology is not a "pre-emptive war" or "war". Again I'm not trying to diminish the gravity, but this seems like an incorrect use of the term.

Comment by chrisvm on Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky

chavam — 2023-03-30T13:11:16.791Z

"For instance, personally I think the reason so few people take AI alignment seriously is that we haven't actually seen anything all that scary yet. "

And if this "actually scary" thing happens, people will know that Yudkowsky wrote the article beforehand, and they will know who the people are that mocked it.

Comment by chrisvm on The Waluigi Effect (mega-post)

chavam — 2023-03-29T20:33:39.744Z

I agree. Though is it just the limited context window that causes the effect? I may be mistaken, but from my memory it seems like they emerge sooner than you would expect if this was the only reason (given the size of the context window of gpt3).

Comment by chrisvm on The Waluigi Effect (mega-post)

chavam — 2023-03-29T19:45:47.660Z

Therefore, the waluigi eigen-simulacra are attractor states of the LLM

It seems to me like this informal argument is a bit suspect. Actually I think this argument would not apply to Solomonof Induction.

Suppose we have to programs that have distributions over bitstrings. Suppose p1 assigns uniform probability to each bitstring, while p2 assigns 100% probability to the string of all zeroes. (equivalently, p1 i.i.d. samples bernoully from {0,1}, p2 samples 0 i.i.d. with 100%).

Suppose we use a perfect Bayesian reasoner to sample bitstrings, but we do it in precisely the same way LLMs do it according to the simulator model. That is, given a bitstring, we first formulate a posterior over programs, i.e. a "superposition" on programs, which we use to sample the next bit, then we recompute the posterior, etc.

Then I think the probability of sampling 00000000... is just 50%. I.e. I think the distribution over bitstrings that you end up with is just the same as if you just first sampled the program and stuck with it.

I think tHere's a messy calculation which could be simplified (which I won't do):

Limit of this is 0.5.

I don't wanna try to generalize this, but based on this example it seems like if an LLM was an actual Bayesian, Waluigi's would not be attractors. The informal argument is wrong because it doesn't take into account the fact that over time you sample increasingly many non-waluigi samples, pushing down the probability of Waluigi.

Then again, the presense of a context window completely breaks the above calculation in a way that preserves the point. Maybe the context window is what makes Waluigi's into an attractor? (Seems unlikely actually, given that the context windows are fairly big).

Comment by chrisvm on The Overton Window widens: Examples of AI risk in the media

chavam — 2023-03-26T14:53:39.081Z

Linking to my post about Dutch TV: https://www.lesswrong.com/posts/TMXEDZy2FNr5neP4L/datapoint-median-10-ai-x-risk-mentioned-on-dutch-public-tv

Comment by chrisvm on Shutting Down the Lightcone Offices

chavam — 2023-03-26T13:59:28.992Z

"When LessWrong was ~dead"

Which year are you referring to here?

Comment by chrisvm on Shutting Down the Lightcone Offices

chavam — 2023-03-26T13:38:17.366Z

A lot of people in AI Alignment I've talked to have found it pretty hard to have clear thoughts in the current social environment, and many of them have reported that getting out of Berkeley, or getting social distance from the core of the community has made them produce better thoughts.

What do you think is the mechanism behind this?

Datapoint: median 10% AI x-risk mentioned on Dutch public TV channel

chavam — 2023-03-26T12:50:11.612Z

I am Dutch, so wanted to share this as a datapoint regarding public perception of existential AI risk, since it probably won't be noticed here otherwise.

3 days ago (23th march 2023) a Dutch AI "science communicator" who regularly appears on Dutch television to talk about AI, has mentioned on a talkshow on the main Dutch public broadcasting channel, that in some study 50% of AI researchers assign 10% probability to AI leading to "the end of humanity" (actually he misspoke and said "of 50% of researchers, 10% believes it will ...". I don't know if he is confused or if it was misspoken). He emphasized the example where GPT4 lied to a human to get them to open the Captcha (by saying that it was human but had a vision impairment). His concluding remark was (literal translation) :

"You can also imagine a very dark world, and you can wonder why you're working on a technology that has a 10% probability of ending humanity".

He didn't go into great detail, the whole segment was ~2:30 minutes.

Here is a link: https://op1npo.nl/2023/03/23/alexandra-van-huffelen-en-alexander-klopping-over-de-ontwikkeling-van-kunstmatige-intelligentie/.

Generally the reaction on facebook is a distribution of I'm guessing ~70% completely incoherent responses, the rest some mix of half-baked worry or skepticism.

Comment by chrisvm on Reward is not the optimization target

chavam — 2023-03-23T12:58:08.823Z

There is a general phenomenon where:

Person A has mental model X and tries to explain X with explanation Q
Person B doesn't get model X from Q, thinks a bit, and then writes explanation P, reads P and thinks: P is how it should have been explained all along, and Q didn't actually contain the insights, but P does.
Person C doesn't get model X from P, thinks a bit, and then writes explanation R, reads R and thinks: ...

It seems to me quite likely that you are person B, thinking they explained something because THEY think their explanation is very good and contains all the insights that the previous ones didn't. Some of the evidence for this is in fact contained in your very comment:

"1. Pointing out the "reward chisels computation" point. 2. Having some people tell me it's obvious, or already known, or that they already invented it. 3. Seeing some of the same people continue making similar mistakes (according to me)"
So point 3 basically almost definitively proves that your mental model is not conveyed to those people in your post, does it not? I think a similar thing happened where that mental model was not conveyed to you from RFLO, even though we tried to convey it. (btw not saying the models that RFLO tried to explain are the same as this post, but the basic idea of this post definitely is a part of RFLO).

BTW, it could in fact be that person B's explanation is clearer. (otoh, I think some things are less clear, e.g. you talk about "the" optimization target, which I would say is referring to that of the mesa-optimizer, without clearly assuming there is a mesa-optimizer. We stated the terms mesa- and base-optimizer to clearly make the distinction. There are a bunch of other things that I think are just imprecise, but let's not get into it).

"Continuing (AFAICT) to correct people on (what I claim to be) mistakes around reward and optimization targets, and (for a while) was ~the only one doing so."

I have been correcting people for a while on stuff like that (though not on LW, I'm not often on LW), such as that in the generic case we shouldn't expect wireheading from RL agents unless the option of wireheading is in the training environment, for basically these reasons. I would also have expected people to just get this after reading RFLO, but many didn't (others did), so your points 1/2/3 also apply to me.

"I do totally buy that you all had good implicit models of the reward-chiseling point". I don't think we just "implicitly" modeled it, we very explicitly understood it and it ran throughout our whole thinking about the topic. Again, explaining stuff is hard though, I'm not claiming we conveyed everything well to everyone (clearly you haven't either).

Comment by chrisvm on Reward is not the optimization target

chavam — 2023-03-11T14:23:41.260Z

Very late reply, sorry.

"even though reward is not a kind of objective", this is a terminological issue. In my view, calling a "antecedent-computation reinforcement criterion" an "objective" matches my definition of "objective", and this is just a matter of terminology. The term "objective" is ill-defined enough that "even though reward is not a kind of objective" is a terminological claim about objective, not a claim about math/the world.

The idea that RL agents "reinforce antecedent computations" is completely core to our story of deception. You could not make sense of our argument for deception if you didn't look at RL systems in this way. Viewing the base optimizer as "trying" to achieve an "objective" but "failing" because it is being "deceived" by the mesa optimizer is purely a metaphorical/terminological choice. It doesn't negate the fact that we all understood that the base optimizer is just reinforcing "antecedent computations". How else could you make sense of the story of deception, where an existing model, which represents the mesa optimizer, is being reinforced by the base optimizer because that existing model understands the base optimizer's optimization process?

I am not claiming that the RFLO communicated this point well, just that it was understood and absolutely was core to the paper, and large parts of the paper wouldn't even make sense if you didn't have this insight. (Certainly the fact that we called it an objective doesn't communicate the point, and it isn't meant to).

Comment by chrisvm on Models Don't "Get Reward"

chavam — 2023-03-11T13:51:32.599Z

The core point in this post is obviously correct, and yes people's thinking is muddled if they don't take this into account. This point is core to the Risks from learned optimization paper (so it's not exactly new, but it's good if it's explained in different/better ways).

Comment by chrisvm on Unifying Bargaining Notions (2/2)

chavam — 2022-11-07T20:48:12.235Z

Is the following a typo?
"So, the ( works"

first sentence of "CoCo Equilbiria".

Comment by chrisvm on Reward is not the optimization target

chavam — 2022-08-09T05:25:07.680Z

Maybe you have made a gestalt-switch I haven't made yet, or maybe yours is a better way to communicate the same thing, but: the way I think of it is that the reward function is just a function from states to numbers, and the way the information contained in the reward function affects the model parameters is via reinforcement of pre-existing computations.

Is there a difference between saying:

A reward function is an objective function, but the only way that it affects behaviour is via reinforcement of pre-existing computations in the model, and it doesn't actually encode in any way the "goal" of the model itself.
A reward function is not an objective function, and the only way that it affects behaviour is via reinforcement of pre-existing computations in the model, and it doesn't actually encode in any way the "goal" of the model itself.

It seems to me that once you acknowledge the point about reinforcement, the additional statement that reward is not an objective doesn't actually imply anything further about the mechanistic properties of deep reinforcement learners? It is just a way to put a high-level conceptual story on top of it, and in this sense it seems to me that this point is already known (and in particular, contained within RFLO), even though we talked of the base objective still as an "objective".

However, it might be that while RFLO pointed out the same mechanistic understanding that you have in mind, but calling it an objective tends in practice to not fully communicate that mechanistic understanding.

Or it might be that I am really not yet understanding that there is an actual diferrence in mechanistic understanding, or that my intuitions are still being misled by the wrong high-level concept even if I have the lower-level mechanistic understanding right.

(On the other hand, one reason to still call it an objective is because we really can think of the selection process, i.e. evolution/the learning algorithm of an RL agent, as having an objective but making imperfect choices, or we can think of the training objective as encoding a task that humans have in mind).

Comment by chrisvm on Reward is not the optimization target

chavam — 2022-08-06T10:22:22.491Z

It seems to me that the basic conceptual point made in this post is entirely contained in our Risks from Learned Optimization paper. I might just be missing a point. You've certainly phrased things differently and made some specific points that we didn't, but am I just misunderstanding something if I think the basic conceptual claims of this post (which seems to be presented as new) are implied by RFLO? If not, could you state briefly what is different?

(Note I am still surprised sometimes that people still think certain wireheading scenario's make sense despite them having read RFLO, so it's plausible to me that we really didn't communicate everyrhing that's in my head about this).

Comment by chrisvm on An AI defense-offense symmetry thesis

chavam — 2022-07-14T16:10:45.306Z

I agree this is a good distinction.

Comment by chrisvm on An AI defense-offense symmetry thesis

chavam — 2022-07-14T10:43:51.138Z

"I think in the defense-offense case the actions available to both sides are approximately the same"

If attacker has the action "cause a 100% lethal global pandemic" and the defender has the task "prevent a 100% lethal global pandemic", then clearly these are different problems, and it is a thesis, a thing to be argued for, that the latter requires largely the same skills/tech as the former (which is what this offense-defense symmetry thesis states).

If you build an OS that you're trying to make safe against attacks, you might do e.g. what the seL4 microkernel team did and formally verify the OS to rule out large classes of attacks, and this is an entirely different kind of action than "find a vulnerability in the OS and develop an exploit to take control over it".

"I wouldn't say the strategy-stealing assumption is about a symmetric game"

Just to point out that the original strategy stealing argument assumes literal symmetry. I think the argument only works insofar as generalizing from literal symmetry doesn't break this argument (to e.g. something more like linearity of the benefit of initial resources). I think you actually need something like symmetry in both instrumental goals, and "initial-resources-to-output map".

The strategy-stealing argument as applied to defense-offense would say something like "whatever offense does to increase its resources / power is something that defense could also do to increase resources / power".

Yes, but this is almost the opposite of what the offense-defense symmetry thesis is saying. Because it can both be true that 1. defender can steal attacker's strategies, AND 2. defender alternatively has a bunch of much easier strategies available, by which it can defend against attacker and keep all the resources.

This DO-symmetry thesis says that 2 is NOT true, because all such strategies in fact also require the same kind of skills. The point of the DO-symmetry thesis is to make more explicit the argument that humans cannot defend against misaligned AI without their own aligned AI.

"This isn't the same as your thesis."

Ok I only read this after writing all of the above, so I thought you were implying they were the same (and was confused as to why you would imply this), and I'm guessing you actually just meant to say "these things are sort of vaguely related".

Anyway, if I wanted to state what I think the relation is in a simple way I'd say that they give lower and upper bounds respectively on the capabilities needed from AI systems:

OD-symmetry thesis: We need our defensive AI to be at least as capable as any misaligned AI.
strategy-stealing: We don't need our defensive AI to be any more capable.

I think probably both are not entirely right.

Comment by chrisvm on An AI defense-offense symmetry thesis

chavam — 2022-07-14T05:09:06.618Z

Kind of a delayed response, but: Could you clarify what you think is the relation between that post and mine? I think they are somehow sort of related, but not sure what you think the relation is. Are you just trying to say "this is sort of related", or are you trying to say "the strategy stealing assumption and this defense-offense symmetry thesis is the same thing"?

In the latter case: I think they are not the same thing, neither in terms of their actual meaning nor their intended purpose:

Strategy-stealing assumption is (in the context of AI alignment): for any strategy that a misaligned AI can use to obtain influence/power/resources, humans can employ a similar strategy to obtain a similar amount of influence/power/resources.
This defense-offense symmetry thesis: In certain domains, in order to defend against an attacker, the defender need the same cognitive skills (knowledge, understanding, models, ...) as the attacker (and possibly more).

These seem sort of related, but they are just very different claims, even depending on different ontologies/cocepts. One particularly simple-to-state difference is that the strategy-stealing argument is explicitly about symmetric games whereas the defense-offense symmetry is about a (specific kind of) asymmetric game, where there is a defender who first has some time to build defenses, and then an attacker who can respond to that and exploit any weaknesses. (and the strategy-stealing argument as applied to AI alignment is not literally symmetric, but semi-symmetric in the sense of the relation between inbeing kind of "linear").

So yeah given this, could you say what you think the relation is?

Straw-Steelmanning

chavam — 2022-07-13T05:48:33.099Z

I've noticed that when people are asked to "Steelman" a position, they sometimes instead do what I would call "Straw-Steelmanning". Someone can also straw-steelman without having been asked to steelman or having said that they would do so.

What is straw-steelmanning? Assume someone makes an argument X for a claim C, and you are arguing against X.

Straw-manning (bad): You replace X with a weaker argument Y and argue against that, pretending as if you have thereby refuted X.
Steel-manning (good): You replace X with a stronger argument Y which still contains the core of X and argue against that, thereby actually refuting X. (The term can also be used in a context where you are not actually arguing against the claim C).
Straw-steelmanning (bad): You replace C with an entirely different claim D and make an argument Y for it which you consider to be stronger than X, pretending as if you no longer need to argue against C.

An example which I have noticed is something like the following:

"Can you steelman the position that future AI systems will pose an existential risk?"
"Well, while we whould not take these hollywood movie plots seriously, there are real social problems with AI that we have to deal with. AI will potentially cause massive inequality because entire industries will be automated and a small amount of corporation will own the AI tools that facillitate that. AI engineers will earn large wages, while demand for other professions stagnates. Moreover, we need to worry about biases in AI systems, because [etc, proceeds to argue more]"

This is a straw-steelman, because they have

Simply bypassed the original claim, replacing it with a different claim that they already agreed with.
Proceeded to argue for that claim, ignoring the original.

Comment by chrisvm on How are compute assets distributed in the world?

chavam — 2022-06-20T15:20:25.300Z

I just had a very quick look at that site, and it seems to be a collection of various chip models with pictures of them? Is there actual information on quantities sold, etc? I couldn't find it immediately.

Comment by chrisvm on An AI defense-offense symmetry thesis

chavam — 2022-06-20T15:18:32.230Z

Yeah, I know they don't understand them comprehensively. Is this the point though? I mean they understand them at a level of abstraction necessary to do what they need, and the claim is they have basically the same kind of knowledge of computers. Hmm, I guess that isn't really communicated by my phrasing though, so maybe I should edit that

An AI defense-offense symmetry thesis

chavam — 2022-06-20T10:01:18.968Z

Epistemic status: Not well argued for, haven’t spent much time on it, and it’s not very worked out. This thesis is not new at all, and is implicit in a lot (most?) of the discussion about x-risk from AI. The intended contribution of this post is to state the thesis explicitly in a way that can help make the alignment problem clearer. I can imagine the thesis is too strong as stated.

I sometimes hear implicitly or explicitly in people’s skepticism of AI risk that we can just not build dangerous AI. I want to formulate a thesis that does part of the work of countering this idea.

If we want to be safe against misaligned AI, we need to defend the vulnerabilities that (a) misaligned AI system(s) could exploit that would pose an existential threat. We need aligned AI that defends against misaligned AI, in order to make sure such misaligned AI poses no threat, (either by ensuring they are not created, or never are able to accumulate the resources needed to be a threat).

In game theory terms: If we successfully build aligned AI systems, then we need to play an asymmetric game between humans with (aligned) defensive AI, versus potential (misaligned) offensive AI. It is not obvious that in an asymmetric game both sides need to have the same capabilities. The thesis I’m making is that the defensive AI needs to have at least all the capabilities that the offensive AI would need to pose an existential threat: The defensive AI needs to be basically capable of starting from a set of resources X, and doing all the damage that the offensive AI could do using resources X.

The/an AI offense-defense symmetry thesis: In order for aligned AI systems to defend indefinitely against existential threat from misaligned AI without themselves harming humanity’s potential, that defensive AI needs to have broadly speaking all the capabilities of the misaligned AI systems that it defends against.

Note that the defender may need to have a higher (or lower) level of those capabilities than the attacker, and it may need additional kinds of capabilities as well that the attackers don’t have.

I don’t think this thesis is exactly right, but more or less right. I won’t really make a solid argument for it, but give some analogies and mechanisms:

Some analogies from offense-defense in human domains:

State security services who defend against terrorists:

The terrorists are persons with guns, the state security services are also persons with guns. To do damage one needs to be able to use guns to shoot people in a firefight. To defend, one still also needs to be able to use guns to shoot people in a firefight.
To do a lot of damage as a terrorist, you need to search for vulnerabilities to exploit. To defend against this, you need to search for vulnerabilities to fix/defend against.
There are additional skills the defender needs to have, such as skills in tracking and monitoring suspects, which the attacker doesn’t need. The opposite direction seems to be less the case.

Computer security specialists defending against attacks:

Hackers and malware designers are people who understand computer systems and their vulnerabilities. IT Security professionals are also just people who understand computer systems and their vulnerabilities.
Empirically, black hat hackers or malware developers can contribute (once they change their motivations) to IT security, and security professionals could, if they wanted, use their knowledge to write malware.

Mechanisms why this thesis might be true

Offense and defense both require the same world-model. If the offender’s task is “find and exploit a vulnerability”, and the defense’s task is “find and defend all vulnerabilities”, then while their planning tasks are different, they both require the world-model of the domain in which they do this planning. E.g. Both hackers and computer security professionals need to actually understand programming, operating systems, networking, and how there can be vulnerabilities in these systems.
Offense and defense both need to search for vulnerabilities. Even defenders still need to be able to find vulnerabilities in their systems, especially when they can’t just respond ex-post to vulnerabilities found by attackers because a single failure is sufficiently damaging.
Symmetric subgames in asymmetric games. Some asymmetric games in the real world have symmetric subgames. E.g. Even though the conflict between an occupying power with a large military versus a guerilla force is an “asymmetric war”, at local points in the conflict, there are situations that are just one team of soldiers with guns fighting another team of soldiers with guns. Hence both the occupying power and the guerilla force need capabilities like marksmanship, tactical skill, coordination and so forth, for basically the same reason.

A mechanism why this might not be true

Defense can use abstraction to counter categories of attacks. Slightly metaphorically, the attackers problem is to prove “∃ vulnerability” (i.e. to find a vulnerability), while the defender’s problem is the opposite, to prove “∀ secure” (i.e. to ensure there is no vulnerability). But rather than the defender searching through all vulnerabilities and defending against them, it can simply execute a smaller amount of general countermeasures without understanding all the particular vulnerabilities and their exploits. This might require different capabilities.

For example, in computer security, rather than searching for vulnerabilities (e.g. by white hat hacking) and fixing them, one can work on formally verified software to rule out categories of vulnerabilities. This requires very different capabilities than hacking does.

Conclusion

This thesis would imply that in order to permanently defend against existential risk from AI without permanently crippling humanity, we need to develop aligned AI which has all the capabilities necessary to cause such an existential catastrophe. In particular, we cannot build weak AI to defend against strong AI.

Comment by chrisvm on AGI Ruin: A List of Lethalities

chavam — 2022-06-19T10:05:01.397Z

I think I communicated unclearly and it's my fault, sorry for that: I shouldn't have used the phrase "any easily specifiable task" for what I meant, because I didn't mean it to include "optimize the entire human lightcone w.r.t. human values". In fact, I was being vague and probably there isn't really a sensible notion that I was trying to point to. However, to clarify what I really was trying to say: What I mean by "hard problem of alignment" is : "develop an AI system that keeps humanity permanently safe from misaligned AI (and maybe other x risks), and otherwise leaves humanity to figure out what it wants and do what it wants without restricting it in much of any way except some relatively small volume of behaviour around 'things that cause existential catastrophe' " (maybe this ends up being to develop a second version AI that then gets free reign to optimize the universe w.r.t. human values, but I'm a bit skeptical). I agree that "solve all of human psychology and moral ..." is significantly harder than that (as a technical problem). (maybe I'd call this the "even harder problem").

Ehh, maybe I am changing my mind and also agree that even what I'm calling the hard problem is significantly more difficult than the pivotal act you're describing, if you can really do it without modelling humans, by going to mars and doing WBE. But then still the whole thing would have to rely on the WBE, and I find it implausible to do it without it (currently, but you've been updating me about lack of need of human modelling so maybe I'll update here too). Basically the pivotal act is very badly described as merely "melt the gpus", and is much more crazy than what I thought it was meant to refer to.

Regarding "rogue": I just looked up the meaning and I thought it meant "independent from established authority", but it seems to mean "cheating/dishonest/mischievous", so I take back that statement about rogueness.

I'll respond to the "public opinion" thing later.

Comment by chrisvm on AGI Ruin: A List of Lethalities

chavam — 2022-06-19T06:44:31.953Z

I'm surprised if I haven't made this clear yet, but the thing that (from my perspective) seems different between my and your view is not that Step 1 seems easier to me than it seems to you, but that the "melt the GPUs" strategy (and possibly other pivotal acts one might come up with) seems way harder to me than it seems to you. You don't have to convince me of "'any easily human-specifiable task' is asking for a really mature alignment", because in my model this is basically equivalent to fully solving the hard problem of AI alignment.

Some reasons:

I don't see how you can do "melt the GPUs" without having an AI that models humans. What if a government decides to send a black ops team to kill this new terrorist organization (your alignment research team), or send a bunch of icbms at your research lab, or do any of a handful of other violent things? Surely the AI needs to understand humans to a significant degree? Maybe you think we can intentionally restrict the AI's model of humans to be only about precisely those abstractions that this alignment team considers safe and covers all the human-generated threat models such as "a black ops team comes to kill your alignment team" (e.g. the abstraction of a human as a soldier with a gun).
What if global public opinion among scientists turns against you and all ideas about "AI alignment" are from now on considered to be megalomaniacal crackpottery? Maybe part of your alignment team even has this reaction after the event, so now you're working with a small handful of people on alignment and the world is against you, and you've semi-premanently destroyed any opportunity that outside researchers can effectively collaborate on alignment research. Probably your team will fail to solve alignment by themselves. It seems to me this effect alone could be enough to make the whole plan predictably backfire. You must have thought of this effect before, so maybe you consider it to be unlikely enough to take the risk, or maybe you think it doesn't matter somehow? To me it seems almost inevitable, and could only be prevented with basically a level of secrecy and propaganda that would require your AI to model humans anyway.

These two things alone make me think that this plan doesn't work in practice in the real world, unless you basically solve Step 1 already. Although I must say the point which I just speculated you might have, that we could somehow control the AI's model of humans to be restricted to particular abstractions, gives me some pause and maybe I end up being wrong via something like that. This doesn't affect the second bullet point though.

Reminder to the reader: This whole discussion is about a thought experiment that neither party actually seriously proposed as a realistic option. I want to mention this because lines might be taken out of context to give the impression that we are actually discussing whether to do this, which we aren't.

Comment by chrisvm on AGI Ruin: A List of Lethalities

chavam — 2022-06-18T18:58:01.007Z

"you" obviously is whoever would be building the AI system that ended up burning all the GPU's (and ensuring no future GPU's are created). I don't know such sequence of events just as I don't know the sequence of events for building the "burn all GPU's" system, except at the level of granularity of "Step 1. build a superintelligent AI system that can perform basically any easily human-specifiable task without destroying the world. Step 2. make that system burn all GPU's indefintely/build security services that prevent misaligned AI from destroying the world".

I basically meant to say that I don't know that "burn all the GPU's" isn't already as difficult as building the security services, because they both require step 1, which is basically all of the problem (with the caveat that I'm not sure, and made an edit stating a reason why it might be far from true). I basically don't see how you execute the "burn all gpu's" strategy without basically solving almost the entire problem.

Comment by chrisvm on What 2026 looks like

chavam — 2022-06-18T05:00:32.601Z

I wonder if there is a bias induced by writing this on a year-by-year basis, as opposed to some random other time interval, like 2 years. I can somehow imagine that if you take 2 copies of a human, and ask one to do this exercise in yearly intervals, and the other to do it in 2-year intervals, they'll basically tell the same story, but the second one's story takes twice as long. (i.e. the second one's prediction for 2022/2024/2026 are the same as the first one's predictions for 2022/2023/2024). It's probably not that extreme, but I would be surprised if there was zero such effect, which would mean these timelines are biased downwards or upwards.

Comment by chrisvm on AGI Ruin: A List of Lethalities

chavam — 2022-06-18T04:34:24.690Z

yeah, I probably overstated. Nevertheless:

"CEV seems way harder to me than ..."
yes, I agree it seems way harder, and I'm assuming we won't need to do it and that we could instead "run CEV" by just actually continuing human society and having humans figure out what they want, etc. It currently seems to me that the end game is to get to an AI security service (in analogy to state security services) that protects the world from misaligned AI, and then let humanity figure out what it wants (CEV). The default is just to do CEV directly by actual human brains, but we could instead use AI, but once you're making that choice you've already won. i.e. the victory condition is having a permanent defense against misaligned AI using some AI-nanotech security service, how you do CEV after that is a luxury problem. My point about your further clarification of the "melt all the GPU's option is that it seemed to me (upon first thinking about it), that once you are able to do that, you can basically instead just make this permanent security service. (This is what I meant by "the whole alignment problem", but I shouldn't have put it that way). I'm not confident though, because it might be that such a security service is in fact much harder due to having to constantly monitor software for misaligned AI.

Summary: My original interpretation of "melt the GPUs" was that it buys us a bit of extra time, but now I'm thinking it might be so involved and hard that if you can do that safely, you almost immediately can just create AI security services to permanently defend against misaligned AI (which seems to me to be the victory condition). (But not confident, I haven't thought about it much).

Part of my intuition is, in order to create such a system safely, you have to (in practice, not literally logically necessary) be able to monitor an AI system for misalignment (in order to make sure your GPU melter doesn't kill everyone), and do fully general scientific research. EDIT: maybe this doesn't need you to do worst-case monitoring of misalignment though, so maybe that is what makes a GPU melter easier than fully general AI security services....

Comment by chrisvm on AGI Ruin: A List of Lethalities

chavam — 2022-06-17T15:56:26.878Z

Ok I admit I read over it. I must say though that this makes the whole thing more involved than it sounded at fist, since it would maybe require essentially escalating a conflict with all major military powers and still coming out on top? One possible outcome of this would be that the entire global intellectual public opinion turns against you, meaning you also possibly lose access to a lot of additional humans working with you on further alignment research? I'm not sure if I'm imagining it correctly, but it seems like this plan would either require so many elements that I'm not sure if it isn't just equivalent to solving the entire alignment problem, or otherwise it isn't actually enough.

Comment by chrisvm on AGI Ruin: A List of Lethalities

chavam — 2022-06-17T09:22:44.875Z

But assuming that law enforcement figures out that you did this, then puts you in jail, you wouldn't be able to control the further use of such nanotech, i.e. there would just be a bunch of systems indefinitely destroying GPU's, or maybe you set a timer or some conditions on it or something. I certainly see no reason why Iceland or anyone in iceland could get away with this unless those systems rely on completely unchecked nanosystems to which the US military has no response. Maybe all of this is what Eliezer means by "melt the GPU's", but I thought he did just mean "melt the GPU's as a single act" (not weird that I thought this, given the phrasing "the pivotal act to melt all the GPU's"). If this is what is meant, then it would be a strong enough pivotal act, and would be an extreme level of capability I agree.

Just wanna remind the reader that Eliezer isn't actually proposing to do this, and I am not seriously discussing it as an option and nor was Eliezer (nor would I support it unless done legally), just thinking through a thought experiment.

Comment by chrisvm on AGI Ruin: A List of Lethalities

chavam — 2022-06-17T09:12:55.102Z

I meant, is there a link to where you've written this down somewhere? Maybe you just haven't written it down.

How are compute assets distributed in the world?

chavam — 2022-06-12T22:13:29.856Z

I own probably about 10^13 FLOP/s, mostly in my pc (currently in Amsterdam). Google owns more than that, in various datacenters (e.g. one in The Dalles, Oregon).

I'd like to get a better picture of the distribution of FLOP/s in the world. Any high level broadly relevant information would be useful, the more the better.

E.g. useful information if true would be: 1. X% of FLOP/s is in personal computers. 2. Y% of FLOP/s is owned by the N largest companies. 3. Z% of FLOP/s are located in the N biggest cities/in the N largest supercomputers/in the US. etc.

What kinds of algorithms do multi-human imitators learn?

chavam — 2022-05-22T14:27:31.430Z

epistemic status: Speculation. The actual proposals are idealized, not meant to be exactly right. We have thought about this for less than an hour.

In this earlier post I stated a speculative hypothesis about the algorithm that a single imitator that imitates collections of multiple humans would learn. Here Joar Skalse joined me and we made a list of some more hypotheses, all very speculative and probably each individually wrong.

The point is that if we have an imitator that imitates a single human’s text, we might (very dubiously) expect that imitator to learn basically a copy of that human. What would an imitator learn who is trained to imitate content generated by vast collections of humans? We can then ask: what are the implications for how it generalizes and what you can get with finetuning?

Here are our set of idealized and almost certainly not exactly correct hypotheses:

Lookup table of many separate models of humans. Basically what the algorithm does is: it looks at its prompt, and first figures out which human in the training set generated it (or which subculture of humans, and so forth, including other contexts). It has a separate subroutine for each of them, and simply picks the correct one and executes that.
- How does it generalize out of distribution? It basically interpolates between different humans, and still ends up looking like a typical human.
- How far can you go with finetuning? Fine tuning allows you to basically not change much. You can make it behave like slightly different human model than what you’d otherwise get.
High level latent space of human minds plus partitioned capabilities. The system has a latent space describing “what kind of human produced this prompt?” It first makes an estimate in this latent space, and based on that, this parameterizes the beliefs, knowledge and attitudes about specific topics that such a human would have.
- How far can you get with finetuning? Fine tuning doesn’t easily allow you to build a superhuman system, since each of the points in the latent space parameterizes specific subroutines which inherently activates certain knowledge and capabilities and suppresses others.
Generalized superhuman + lookup table of constraints. Explained here.
meta-human model 1: meta-reasoner + superhuman general world model. The model it learns is not at all directly doing the same thing as an individual human. Rather, it is doing meta-reasoning, reasoning about beliefs and attitudes and so on of the person who wrote the text. It has a general model and knowledge of humans, and based on the prompt it gets it reasons about things like “does the writer know that X”, “what would a writer that knows X but doesn’t know Y and feels Z about W say in this context”?
- How does it generalize out of distribution? It is able to generalize to text generated by hypothetical humans that are before now unseen combinations of knowledge, beliefs, attitudes and so forth. E.g. if it for the first time read a prompt that seems to come from a Zen Buddhist Trump supporter, but all of the trump supporters in the dataset are Christian, then it can still reason through the implications of those views and come up with a realistic output.
- How far can you get with finetuning? Fine tuning on RL has the option of basically setting the parameters of the meta-reasoner to something like “this human knows everything, is very rational and capable of planning on all domains” and so forth, and because the meta reasoner itself has access to a superhuman general world model (all of human knowledge, the capabilities of the most rational humans and so forth), it is actually then able to generate superhuman behavior, more capable than displayed in the dataset.
meta-human model 2: superhuman super psychologist. The model it learns is a kind of hybrid between a generalized superhuman and a human simulator: It reasons using the same kind of cognitive tools that humans use, but part of its reasoning abilities is that it has very good theory of mind about different humans. In its generation of text it reasons about object level things like math and history, but also separately reasons about “what is the kind of person that wrote this?”, “what would the kind of person that wrote this say about fact X, and would they know about fact Y”. Essentially the model is a generalized superhuman who does what a human would do if they were asked to act like some other character, reasoning about what that other character knows and doesn’t know, mistakes they would make, etc, but had extreme capabilities to do so.
- Similar to above.

Are human imitators superhuman models with explicit constraints on capabilities?

chavam — 2022-05-22T12:46:31.408Z

epistemic status: A speculative hypothesis, don't know if this already exists. The only real evidence I have for this is a vague analogy based on some other speculative (though less speculative) hypothesis. I don’t think this is particularly likely to be true, having thought about it for about a minute. Edit: having thought about it for two minutes, it seems somewhat more likely.

There is a hypothesis that has been floating around in the context of explaining double descent, that what so called over-parameterized models do is store in parallel two algorithms: (1) the simplest model of the data without (label) noise, and (2) a “lookup table” for deviations from that model, to represent the (label) noise, because this is the most simple representation of the data and big neural nets are biased towards simplicity.

Maybe something vaguely similar would happen if we throw sufficient compute at generative models of collections of humans, e.g. language models:

Hypothesis: The simplest way to represent the data generated by various humans on the internet, in the limit of infinite model size and data, is to have (1) a single idealized super-human model for reasoning and writing and knowledge-retrieval and so forth, and (2) memorized constraints on this model for various specific groups of humans and contexts to represent their deviations from rationality, specific biases, cognitive limitations, lack of knowledge of certain areas, etc.

This maybe implies, using some eye squinting and handwaving and additional implicit assumptions, some of the following (vague, speculative) implications about GPT-N:

In the limit of N, GPT-N will produce text that sufficiently looks like human-written text within contexts (prompts) that humans typically produce. GPT-N will use human-level reasoning, world-modeling, and planning abilities to produce this text. However, if you give it sufficiently out-of-distribution prompts, its lookup table for specific irrationalities will not apply, and it will apply superhuman planning and world-modeling and reasoning abilities that are more competent and more free of biases than the most rational human.
In the limit of N, If you take GPT-N and fine tune it on an RL task that requires good reasoning, it might be possible to get a system that seems to behave far more intelligently than it seemed to be on the imitation task, as the fine tuning essentially turns off constraints on its reasoning abilities.

A paradox of existence

chavam — 2022-04-05T09:45:09.620Z

Introduction

There is a question in philosophy "why is there something rather than nothing?" I have always thought of this question as completely impossible to answer: either it is a meaningless question, or at least there seems to be no way that we can ever begin to answer it. Yet it has always seemed very weird to me that there is such a thing as existence. I recently developed a paradox that has made this confusion less mysterious for me. The paradox is not about why something exists, but how it is possible that we know that we exist. Note, the question is still very confusing, but it has changed the level of confusion for me from something like "completely unsolvable ever" to merely roughly as confusing as the hard problem of consciousness. Maybe this idea already exists out there, or maybe I'm just being confused, but as far as I can tell, this paradox really seems to make my confusion around existence seem no longer completely intractable.

First of all, note that we humans have a very strong sense that there is a “reality”, that “something exists” or that “I exist”, and that we actually know this fact (i.e. it is not thought of as merely a speculative hypothesis). We have uncertainty about this reality, e.g. perhaps the world as we perceive it is an “illusion” or a simulation, but we seem to know that something like a reality exists. It is not at all obvious that this is the case, i.e. that creatures in a reality know that they exist. It could just as well have been the case that (1) there is an existing reality within which intelligent lifeforms have evolved, but (2) at no point do these lifeforms notice that they exist, or that existence is even a thing, they just do the usual stuff of building spaceships and inventing the internet without ever reflecting on or noticing or even having the concept of “existence”.

The paradox

So having said that, I'm going to state a paradox. By paradox I mean a set of observations, each of which seems (to the author and possibly the reader) to be true, but where it also seems that they cannot all be true together. There has to be a mistake somewhere, either one of the observations is wrong or confused, or the argument is wrong or confused, but I don’t know where the mistake is. The paradox is as follows:

We seem to know that there is a reality that exists. Note, the claim is not just that there exists a reality at all, but that we as creatures in this reality know this. I don't want to explain too much what I mean by reality. I am just talking about the normal sense in which it very strongly seems to us that something exists that we are a part of. We seem to really know that this is the case, even if we don't know the exact nature of that reality (e.g. whether our experiences may be the result of a simulation).
It seems to be the case that this reality is perfectly mathematically describable. Here I am referring to the normal mainstream assumption/observation that runs through physics and natural sciences, that (1) there is a reality, and (2) there is a “fundamental theory of physics”, currently not fully known to us, thought to be specified by something like a dynamical system + initial state (modulo relativity), and this mathematical theory can in principle (modulo computational constraints) precisely predict everything about this reality. In particular, and this is where the core of the paradox will be, it seems to be that everything about our minds, including observation 1 that we know that we exist, is in principle derivable from this dynamical law and initial conditions.
It seems that whether a mathematical universe exists/is real cannot be a mathematical property of that mathematical universe. More precisely, it seems that we cannot see, purely from a mathematical description of a theory, whether that theory describes reality or not. I am referring here to the normal mainstream idea at the core of physical science; that (1) there is a reality, and a set of possible mathematical theories describing that reality, and (2) we can only know which theory describes reality by comparing the theory to reality, through empirical experiment, obviously not by merely looking at the theory itself. Even if it is somehow possible to use exotic tools, like introspecting our own minds and somehow drawing conclusions about reality from this, we are obtaining empirical information from reality as opposed to deriving a purely mathematical property of a theory. This idea is a very uncontroversial principle at the core of (natural) science. To summarize this in less precise terms: whether a mathematical universe exists (describes a reality) is not a mathematical property of that universe.

These three observations constitute the paradox for me, because it seems like they cannot be true at the same time:

The paradox: The fact that our reality exists is not a mathematical property of the fundamental mathematical theory that describes our reality, but an "extra fact", and yet, the exact states of our minds follow in principle purely mathematically from that fundamental mathematical theory, including the fact that we know that this universe “exists”. Assuming that our justification for thinking that we know of its existence is actually sound, this implies that the existence of our universe is in principle a mathematical implication of the fundamental theory that describes our universe.

Note that I don’t consider this to be a solid logical argument. What makes a paradox a paradox is that there is some kind of mistake somewhere either in the argument or in the assumptions, and in this case I don’t see clearly where the mistake is.

The hope I have is that the mistake is not in fact superficial, but points to a deeper inadequacy in the concepts I’ve used in this argument, and all of these concepts seem to me fairly fundamental and generally accepted. I will address this later sometime, but I also have some hope that this will shed some light on the hard problem of consciousness.

Discussion

As I hinted at in the introduction, it is observation 1 that is most confusing to me. Why would creatures inside an existing reality somehow know that they exist? Descartes’ principle “cogito ergo sum” actually seems to me like it shouldn’t hold, because there presumably are a range of mathematical universes that are in principle definable but don’t describe any reality, that contain creatures which think (cogito). But those creatures don’t actually exist, which seems to show that the inference is wrong. Yet it somehow really does seem to be that we somehow know we exist by virtue of something like reflecting on our own experience and thoughts and so forth.

To go back to the paradox and hopefully to clarify it, we can make a slightly more concrete but stronger addition to it: Assume that in fact observations 1 and 2 are true. Then if we can actually define something like a pseudo algorithm that checks, given a mathematical description of some hypothetical mathematical reality, whether that mathematical reality exists. The algorithm won't be able to rule out that a theory describes an existing universe, but it can sometimes confirm it:

A seemingly impossible pseudo algorithm. Take a mathematical universe, and unroll its dynamic law. Search throughout the universe for creatures, and check whether these creatures know that they are part of an existing universe (*). If so, then conclude that this mathematical universe is an existing reality.

Note (*) that a lot hinges here on the ability to check whether a creature "knows that they exist". I cannot specify this precisely because observation 1 (and the notion of existence itself) is so confusing to me. In fact, observation 1 doesn’t directly imply that we can actually check from a mathematical description of a creature that the creature knows it exists, so this algorithm uses a slightly stronger assumption.

Also, note that there being an algorithm that checks if a mathematical universe describes an existing reality is itself not surprising: We can define an algorithm that has an encoding of the true fundamental theory of physics and states that a theory describes an existing reality if and only if it exactly equals that fundamental theory. The weird part to me would be if it is possible to confirm existing realities on the basis of specifically the algorithm given above.

I want to end on a note about the hard problem of consciousness. I have recently been starting to think about the hard problem in terms of the "paradox of consciousness". I haven't yet written this down, but there seems to be a very strong similarity between the paradoxes of existence and consciousness as I've come to think about them. This very vaguely suggests to me that they might be related in an important way (but this might also be a false alarm, this is very speculative).

As a final remark, note that I find this whole idea very confusing, and it might just be a straightforward confusion on my part. But it seems to me that this moves questions around existence, from something completely impervious to any kind of analysis, to something that actually seems to contain something like an inconsistency, and thus something that can be analyzed.

Manhattan project for aligned AI

chavam — 2022-03-27T11:41:00.818Z

One possible thing that I imagine might happen, conditional on an existential catastrophe not occurring, is a Manhattan project for aligned AGI. I don’t want to argue that this is particularly likely or desirable. The point of this post is to sketch the scenario, and briefly discuss some implications for what is needed from current research.

Imagine the following scenario: It is only late that top AI scientists take the existential risk of AGI seriously, and there hasn't yet been a significant change in the effort put into AI safety relative to our current trajectory. At some point, there is a recognition among AI scientists and relevant decision-makers that AGI will be developed soon by one AI lab or another (within a few months/years), and that without explicit effort there is a large probability of catastrophic results. A project is started to develop AGI:

It has an XX B$ or XXX B$ budget.
Dozens of the top AI scientists are part of the project, and many more assistants. People you might recognize or know from top papers and AI labs join the project.
A fairly constrained set of concepts, theories and tools are available that give a broad roadmap for building aligned AGI.
There is a consensus understanding among management and the research team that without this project, AGI will plausibly be developed relatively soon, and that without explicitly understanding how to build the system safely it will pose an existential risk.

It seems to me that it is useful to backchain from this scenario to see what is needed, assuming that this kind of alignment Manhattan project is indeed what should happen.

Firstly, my view is that if this Manhattan project would start in intellectual conditions similar to today’s, there wouldn't be very many top AI scientists significantly motivated to work on the problem, and it would not be taken seriously. Even very large sums of money would not suffice, since there wouldn't be enough of a common understanding about what the problem is for it to work.

Secondly, it seems to me that there isn't enough of a roadmap for building aligned AGI for such a project to succeed in a short time-frame of months to years. I expect some people to disagree with this, but looking at current rates of progress in our understanding of AI safety, and my model of the practical parallelizability of conceptual progress, I am skeptical that the problem can be solved in a few years even by a group of 40 highly motivated and financed top AI scientists. It is plausible that this will look different closer to the finish line, but I am skeptical.

On this model, I have in mind basically two kinds of work that contribute to good outcomes. This is not a significant change relative to my prior view, but in my mind it constrains the motivation behind such work to some degree:

Research that makes the case for AGI x-risk clearer, and constrains how we believe the problem occurs, in order to make it eventually easier to convince top AI scientists that working in such an alignment Manhattan project is reasonable, and to make sure there is a team that's on the same page as to what the problem is.
Research that constrains the roadmap for building aligned AGI. I'm thinking mostly of conceptual/theoretical/empirical work that helps us converge to an approach that can then be developed/refined and scaled by a large effort over a short time period.

I suspect this mostly shouldn't change my general picture of what needs to be done, but it does shift my emphasis somewhat.

Natural Value Learning

chavam — 2022-03-20T12:44:20.272Z

The main idea of this post is to make a distinction between natural and unnatural value learning. The secondary point of this post is that we should be suspicious of unnatural schemes for value learning, though the point is not that we should reject them. (I am not fully satisfied with the term, so please do suggest a different term if you have one.)

Epistemic status: I'm not confident that actual value learning schemes will have to be natural, given all the constraints. I'm mainly confident that people should have something like this concept in their mind, though I don’t give any arguments for this.

Natural value learning

By a "value learning process" I mean a process by which machines come to learn and value what humans consider good and bad. I call a value learning process natural to the extent that the role humans play in this process is basically similar to the role they play in the process of socializing other humans (mostly children, also asocial adults) to learn and value what is good and bad. To give a more detailed picture of the distinction I have in mind, here are some illustrations of what I'm pointingat, each giving a property I associate with natural and unnatural value learning respectively:

Natural alignment	Unnatural alignment
Humans play the same role, and do the same kinds of things, within the process of machine value learning as they do when teaching values to children.	Humans in some significant way play a different role, or have to behave differently within the process of machine value learning than they do when teaching values to children.
The machine value learning process is adapted to humans.	Humans have to adapt to the machine value learning process.
Humans who aren’t habituated to the technical problems of AI or machine value learning would still consider the process as it unfolds to be natural. They can intuitively think of the process in analogy to the process as it has played out in their experience with humans.	Humans who aren’t habituated to the technical problems of AI or machine value learning would perceive the machine value learning process to be unnatural, alien or “computer-like”. If they naively used their intuitions and habits from teaching values to children, they would be confused in important ways.

Concrete examples

To give a concrete idea of what I would consider natural vs unnatural value learning setups, here are some concrete scenarios in order of ascending naturality:

Disclaimer: I am not in any way proposing any of these scenarios as realistic or good or something to aim for (in fact I tried to write them somewhat comically so as not to raise this question). They are purely intended to clarify the idea given in this post, nothing more.

Not very natural. A superintelligent system is built that has somehow been correctly endowed with the goal of learning and optimizing human values (somehow). It somehow has some extremely efficient predictive algorithms that are a descendant of current ML algorithms. It scans the internet, builds a model of the distribution of human minds, and predicts what humanity would want if it were smarter and so forth. It successfully figures out what is good for humanity, reveals itself, and implements what humanity truly wants.

Somewhat more but still not very natural. A large AI research lab trains a bunch of agents in a simulation as a project branded as developing “aligned general intelligence”. As a part of this, the agents have to learn what humans want by performing various tasks in simulated and auto-generated situations that require human feedback to get right. A lot of data is required, so many thousands of humans are hired to sit in front of computer screens, look at the behaviour of the agents, and fill in scores or maybe english sentences to evaluate that behaviour. In order to be time-efficient, they evaluate a lot of such examples in succession. Specialized AI systems are used to generate adversarial examples, which end up being weird situations where the agents make alien decisions the humans wouldn’t have expected. Interpretability tools are used to inspect the cognition of these agents to provide the human evaluators with descriptions of what they were thinking. The human evaluators have to score the correctness of the reasoning patterns that led to the agent’s actions based on those descriptions. Somehow, the agents end up internalizing human values but are vastly faster and more capable on real world tasks.

More natural. A series of general-purpose robots and software assistants are developed that help humans around the home and the office, and start out (somehow) with natural language abilities, knowledge of intuitive physics, and so forth. They are marketed as “learning the way a human child learns”. At first, these assistants are considered dumb regarding understanding of human norms/values, but they are very conservative, so they don’t do much harm. Ordinary people use these robots/agents at first for very narrow tasks, but by massively distributed feedback given through the corrections of ordinary human consumers, they begin to gain and internalize intuitive understanding of at first quite basic everyday human norms. For example, the robots learn not to clean the plate while the human is still eating from it, because humans in their normal daily lives react negatively when the robots do so. Similarly, they learn not to interrupt an emotional conversation between humans, and so forth. Over time, humans trust these agents with more and more independent decision-making, and thereby the agents receive more general feedback. They eventually somehow actually generalize broadly human values to the point where people trust the moral understanding of these agents as much or more than they would that of a human.

Disclaimer: I already said this before but I feel the need to say it again: I don’t consider especially this last scenario to be realistic/solving the core AI safety problem. The scenarios are merely meant to illustrate the concept of natural value learning.

Why does this distinction matter?

It seems to me that naturality tracks something about the expected reliability of value learning. Broadly I think we should be more wary of proposals to the extent that they are unnatural.

I won't try to argue for this point, but broadly my model is based on the model that human values are fragile and we don’t really understand mechanistically how they are represented, or how they are learned. Eg. there are some ideas around values, meta-values, and different levels of explicitness, but they seem to me to be quite far from a solid understanding that would give me confidence in a process that is significantly unnatural.

What is the equivalent of the "do" operator for finite factored sets?

chavam — 2022-03-17T08:05:28.634Z

I have put roughly 0 thought into answering this question, or whether it even makes sense within Scott's framework:

Finite factored sets are presented in some sense as an alternative/refinement to Pearlian causal graphs.
In Pearlian causal graphs we can make the distinction between conditioning P(X | Y=y ) and intervention P(X | do(Y=y) ). What is the equivalent of P(X | do(Y=y) ) in the context of finite factored sets, if there is any? If not, how do we make an analogous distinction in FFS, or achieve the analogous thing that do is meant to achieve?

Moloch games

chavam — 2020-10-16T15:19:04.722Z

tl;dr: This post suggests a direction for modelling Molochs. The main thing this post does is to rename the concept of “potential games” (an existing concept in game theory) to “Moloch games” to suggest an interpretation of this class of games. I also define "the preferences of a Moloch" to generalize that notion (the preferences may be intransitive).

This post assumes that you have familiarity with game theory and the concept of a Moloch.

What do group dynamics want? If a society/group (the Moloch) wants things that are different from what the individuals want, how can we assign preferences or a utility function to that society/group that doesn't model what would be good for the aggregated preference of the individuals of the group, but that describes what the group dynamics is actually "trying" to achieve (even against the interest of its individual members)? Here is a suggested answer.

Intuition. A Moloch game is a game such that there is a utility function , called “the Moloch’s utility function”, such that if the agents behave individually rationally, then they collectively behave as a “Moloch” that controls all players simultaneously and optimizes $U M$ . In particular, the Nash equilibria correspond to local optima of $U M$ .

Not all games are Moloch games.

Definition. A game with finite number of players and for each player $i$ a strategy space $X i$ is a cardinal Moloch game (in the game theory literature, a cardinal potential game), if there is a utility function $U M : (\prod i \in N X i) \to R$ such that for all players $i$ , and all strategies $s - i$ for the other players,

u i (s' i, s - i) - u i (s'' i, s - i) = U M (s' i, s - i) - U M (s'' i, s - i) \forall s', s'' \in X i

Intuitively, if you take any strategy-profile for all the players, and adjust the strategy of one player, then the Moloch's utility will increase/decrease by the same amount as the utility function of that particular player. Hence, intuitively, every player behaves always as if they are optimizing the Moloch's utility function.

The definition for an ordinal Moloch game replaces the condition with

u i (s' i, s - i) \geq u i (s'' i, s - i) iff U M (s' i, s - i) \geq U M (s'' i, s - i) \forall s', s'' \in X i

Intuitively, $U M$ represents the Moloch's preferences ordinally but not cardinally. Obviously, cardinal Moloch games are also ordinal Moloch games.

Example. The prisoners dilemma:

		Player 2
		Cooperate	Defect
Player 1	Cooperate	1, 1	-1, 2
Defect	2, -1	0 , 0

We will show that this is a cardinal Moloch game, by just computing the cardinal utility function and showing that there are no inconsistencies:

How to compute the cardinal utility function of the Moloch: Pick an arbitrary strategy profile to have utility 0 (I take Defect, Defect). Then iteratively compute the utility of rows and columns by just applying the constraint that the definition gives: Compute the difference in utility of the player whose row/column you're moving along (i.e. player 2 for the rows, player 1 for the columns) of each cell in the row/column from a cell of which you know the value of $U M$ . In this case, we know $U M (D, D) = 0$ . So compute $U M (D, C)$ as $U M (D, D) + u 2 (D, C) - u 2 (D, D)$ which equals $0 + - 1 - 0 = - 1$ . Similar for $U M (C, D)$ . For $U M (C, C)$ , there are two ways to compute it: Using player 1's utility function and $U M (D, C)$ or player 2's utility function and $U M (C, D)$ . If these two give different answers, then the game is not a cardinal Moloch game.

Here is the cardinal utility function of the Moloch for the prisoner's dilemma (The above algorithm gives a utility function that is unique up to translations) :

	Cooperate	Defect
Cooperate	-3	-1
Defect	-1	0

Intuition. In this case, even though all players prefer Cooperate, Cooperate over Defect, Defect, the Moloch prefers the opposite. This corresponds to the fact that it is individually rational for the players to Defect. This Moloch utility function captures the "preferences of the group dynamics" as opposed to the preferences of the individuals. (It is obviously very different from the notion of "aggregate preferences" or "welfare").

A Moloch game assumes in some sense that "the Moloch has transitive preferences". We can generalize to Molochs with possibly intransitive preferences (I don't know of this being defined this way in the literature on potential games):

Definition. Let $Γ$ be a game with a finite number of players, each of which has a preference relation $⪯ i$ over the strategy space $X = \prod i X i$ (by default derived from a utility function $u i$ ). Then the Moloch's preferences $⪯ M$ are defined as the preference relation satisfying for all players $i$ , and all strategy profiles $s - i$ for the other players:

(s' i, s - i) ⪯ i (s'' i, s - i) iff (s' i, s - i) ⪯ M (s'' i, s - i) \forall s', s'' \in X i

Observation. These preferences are always incomplete (intuitively, the Moloch doesn't have an opinion on the comparison between different players changing their strategies, because it doesn't have this information: players individually make choices given their options). They may be either transitive or intransitive. I'll say a Moloch's preferences are rational if they are transitive (neglecting the usual requirement of completeness).

Just to show that the concepts are what they should be:

Lemma. Any game whose Moloch has transitive preferences is an ordinal Moloch game. Any ordinal Moloch game has a Moloch with transitive preferences.

Proof. For any transitive relation on a space there is a real-valued function on it that is consistent with that relation. The other direction follows directly.

Intuition. If the Moloch has transitive preferences, then the Moloch knows what it wants and the game will have a pure Nash equilibrium (there is a theorem that formalizes this). Conversely, if the Moloch has intransitive preferences, then the Moloch doesn't know what it wants and the game will tend to have cycles (not all of them will because the players might want to move out of them into a "transitive region" of the Moloch's preferences).

I won't show this here, but the literature on potential games (cf. the thing I am calling Moloch games), these are examples:

Games with rational Molochs (i.e. Moloch games / potential games):

Prisoner's dilemma
Battle of the sexes
Coordination game
Game of Chicken

Games with irrational Molochs (i.e. not Moloch games / potential games):

Matching pennies
Rock paper scissors

I probably won't spend much more time on this, but here is a suggestion for taking this as a starting point to modelling Molochs:

Check if various informal ideas about Molochs can be phrased in this language. Check if the language is satisfying to talk about actual Molochs.
Look at the literature on potential games to see if it contains much insight. Make a dictionary of concepts named in the terminology of the ontology we're interested in (similar to how I renamed "potential game" to "Moloch game") to make this literature an "efficiently queryable database" for insights into Molochs.
I suspect that there might be ideas to be had about Moloch games that aren't treated there, because as far as I know, potential games were developed mostly as a trick to make computations easier, not as a conceptual tool for thinking about Molochs, societal inadequacy and so forth. It's plausible that certain obvious questions haven't been asked about them for this reason. Try to actually model Molochs this way and see if these definitions allow us to answer questions we want to ask about them. Use this as a stepping stone and see where it is unsatisfying. Build on top of that to push the analysis further.

Feel free to contact me if you want to think about this.

Some reading:

Flows and Decompositions of Games: Harmonic and Potential Games. In the language of this post: decomposing a game into a "rational part" of the Moloch, and an irrational deviation from it. Finding the "closest rational Moloch" of a game.

Some further ideas and questions to ask:

Can real world societies be decomposed into multiple Molochs? In the style of the "Flows and Decompositions of Games" paper, it wouldn't have to be a decomposition in terms of subgroups of players, but of "aspects of the game-theoretic interaction". (e.g. an individual might simultaneously be part of a "capitalism Moloch" and a "politics Moloch"). Maybe Molochs can be approximately decomposed.
Is there a notion of "Moloch game" for sequential games? Games with limited information? (The potential game literature probably has asked analogous questions).

Subspace optima

chavam — 2020-05-15T12:38:32.444Z

The term "global optimum" and "local optimum" have come from mathematical terminology and entered daily language. They are useful ways of thinking in every day life. Another useful concept, which I don't hear people talk about much is "subspace optimum": A point maximizes a function not in the whole space, but in a subspace. You have to move along a different dimension than those of the subspace in order to improve. A subspace optimum doesn't have to be a local optimum either, because even a small change along the new dimension might yield improvements. If you're in a subspace optimum, this requires a different attitude to get to a global optimum, than if you're in a local optimum, which makes me think it's good for the term to be part of every day language.

When you're in a local optimum, you have to do something quite different from what you're doing to improve.
When you're in a subspace optimum, you have to notice dimensions along which you could be doing things differently that you didn't even notice before, but small changes along those new dimensions might already help. You're applying constraints to yourself that you could let go.

Regarding how it looks subjectively:

The phrase: "am I in a local optimum?" generates curiosity about whether you maybe should undertake a quite different plan from the one you're taking now. (Should I do a different project, rather than make local changes to the project I'm taking?)
The phrase: "am I in a subspace optimum?" generates curiosity about whether you maybe are not noticing (possibly small) changes you could be making across dimensions you haven't been considering. (Should I optimize/adjust the way I'm doing my project across different dimensions/variables than the ones I've been optimizing over so far?)

My impression is that somewhat often when people informally use the term local optimum, they are in fact talking about a subspace optimum.

Bonus for the theoretically inclined: A local subspace optimum is one where you can improve by temporarily doing things differently along dimension X, moving around in a bigger space, while eventually ending up on a different, better, point in the same subspace.

Risks from Learned Optimization: Conclusion and Related Work

chavam — 2019-06-07T19:53:51.660Z

This is the fifth of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper.

Related work

Meta-learning. As described in the first post, meta-learning can often be thought of as meta-optimization when the meta-optimizer's objective is explicitly designed to accomplish some base objective. However, it is also possible to do meta-learning by attempting to make use of mesa-optimization instead. For example, in Wang et al.'s “Learning to Reinforcement Learn,” the authors claim to have produced a neural network that implements its own optimization procedure.(28) Specifically, the authors argue that the ability of their network to solve extremely varied environments without explicit retraining for each one means that their network must be implementing its own internal learning procedure. Another example is Duan et al.'s “: Fast Reinforcement Learning via Slow Reinforcement Learning,” in which the authors train a reinforcement learning algorithm which they claim is itself doing reinforcement learning.(5) This sort of meta-learning research seems the closest to producing mesa-optimizers of any existing machine learning research.

Robustness. A system is robust to distributional shift if it continues to perform well on the objective function for which it was optimized even when off the training environment.(29) In the context of mesa-optimization, pseudo-alignment is a particular way in which a learned system can fail to be robust to distributional shift: in a new environment, a pseudo-aligned mesa-optimizer might still competently optimize for the mesa-objective but fail to be robust due to the difference between the base and mesa- objectives.

The particular type of robustness problem that mesa-optimization falls into is the reward-result gap, the gap between the reward for which the system was trained (the base objective) and the reward that can be reconstructed from it using inverse reinforcement learning (the behavioral objective).(8) In the context of mesa-optimization, pseudo-alignment leads to a reward-result gap because the system's behavior outside the training environment is determined by its mesa-objective, which in the case of pseudo-alignment is not aligned with the base objective.

It should be noted, however, that while inner alignment is a robustness problem, the occurrence of unintended mesa-optimization is not. If the base optimizer's objective is not a perfect measure of the human's goals, then preventing mesa-optimizers from arising at all might be the preferred outcome. In such a case, it might be desirable to create a system that is strongly optimized for the base objective within some limited domain without that system engaging in open-ended optimization in new environments.(11) One possible way to accomplish this might be to use strong optimization at the level of the base optimizer during training to prevent strong optimization at the level of the mesa-optimizer.(11)

Unidentifiability and goal ambiguity. As we noted in the third post, the problem of unidentifiability of objective functions in mesa-optimization is similar to the problem of unidentifiability in reward learning, the key issue being that it can be difficult to determine the “correct” objective function given only a sample of that objective's output on some training data.(20) We hypothesize that if the problem of unidentifiability can be resolved in the context of mesa-optimization, it will likely (at least to some extent) be through solutions that are similar to those of the unidentifiability problem in reward learning. An example of research that may be applicable to mesa-optimization in this way is Amin and Singh's(20) proposal for alleviating empirical unidentifiability in inverse reinforcement learning by adaptively sampling from a range of environments.

Furthermore, it has been noted in the inverse reinforcement learning literature that the reward function of an agent generally cannot be uniquely deduced from its behavior.(30) In this context, the inner alignment problem can be seen as an extension of the value learning problem. In the value learning problem, the problem is to have enough information about an agent's behavior to infer its utility function, whereas in the inner alignment problem, the problem is to test the learned algorithm's behavior enough to ensure that it has a certain objective function.

Interpretability. The field of interpretability attempts to develop methods for making deep learning models more interpretable by humans. In the context of mesa-optimization, it would be beneficial to have a method for determining whether a system is performing some kind of optimization, what it is optimizing for, and/or what information it takes into account in that optimization. This would help us understand when a system might exhibit unintended behavior, as well as help us construct learning algorithms that create selection pressure against the development of potentially dangerous learned algorithms.

Verification. The field of verification in machine learning attempts to develop algorithms that formally verify whether systems satisfy certain properties. In the context of mesa-optimization, it would be desirable to be able to check whether a learned algorithm is implementing potentially dangerous optimization.

Current verification algorithms are primarily used to verify properties defined on input-output relations, such as checking invariants of the output with respect to user-definable transformations of the inputs. A primary motivation for much of this research is the failure of robustness against adversarial examples in image recognition tasks. There are both white-box algorithms,(31) e.g. an SMT solver that in principle allows for verification of arbitrary propositions about activations in the network,(32) and black-box algorithms(33). Applying such research to mesa-optimization, however, is hampered by the fact that we currently don't have a formal specification of optimization.

Corrigibility. An AI system is corrigible if it tolerates or assists with its human programmers in correcting itself.(25) The current analysis of corrigibility has focused on how to define a utility function such that, if optimized by a rational agent, that agent would be corrigible. Our analysis suggests that even if such a corrigible objective function could be specified or learned, it is nontrivial to ensure that a system trained on that objective function would actually be corrigible. Even if the base objective function would be corrigible if optimized directly, the system may exhibit mesa-optimization, in which case the system's mesa-objective might not inherit the corrigibility of the base objective. This is somewhat analogous to the problem of utility-indifferent agents creating other agents that are not utility-indifferent.(25) In the fourth post, we suggest a notion related to corrigibility—corrigible alignment—which is applicable to mesa-optimizers. If work on corrigibility were able to find a way to reliably produce corrigibly aligned mesa-optimizers, it could significantly contribute to solving the inner alignment problem.

Comprehensive AI Services (CAIS).(11) CAIS is a descriptive model of the process by which superintelligent systems will be developed, together with prescriptive implications for the best mode of doing so. The CAIS model, consistent with our analysis, makes a clear distinction between learning (the base optimizer) and functionality (the learned algorithm). The CAIS model predicts, among other things, that more and more powerful general-purpose learners will be developed, which through a layered process will develop services with superintelligent capabilities. Services will develop services that will develop services, and so on. At the end of this “tree,” services for a specific final task are developed. Humans are involved throughout the various layers of this process so that they can have many points of leverage for developing the final service.

The higher-level services in this tree can be seen as meta-optimizers of the lower-level services. However, there is still the possibility of mesa-optimization—in particular, we identify two ways in which mesa-optimization could occur in the CAIS-model. First, a final service could develop a mesa-optimizer. This scenario would correspond closely to the examples we have discussed in this sequence: the base optimizer would be the next-to-final service in the chain, and the learned algorithm (the mesa-optimizer in this case), would be the final service (alternatively, we could also think of the entire chain from the first service to the next-to-final service as the base optimizer). Second, however, an intermediary service in the chain might also be a mesa-optimizer. In this case, this service would be an optimizer in two respects: it would be the meta-optimizer of the service below it (as it is by default in the CAIS model), but it would also be a mesa-optimizer with respect to the service above it.

Conclusion

In this sequence, we have argued for the existence of two basic AI safety problems: the problem that mesa-optimizers may arise even when not desired (unintended mesa-optimization), and the problem that mesa-optimizers may not be aligned with the original system's objective (the inner alignment problem). However, our work is still only speculative. We are thus left with several possibilities:

If mesa-optimizers are very unlikely to occur in advanced ML systems (and we do not develop them on purpose), then mesa-optimization and inner alignment are not concerns.
If mesa-optimizers are not only likely to occur but also difficult to prevent, then solving both inner alignment and outer alignment becomes critical for achieving confidence in highly capable AI systems.
If mesa-optimizers are likely to occur in future AI systems by default, and there turns out to be some way of preventing mesa-optimizers from arising, then instead of solving the inner alignment problem, it may be better to design systems to not produce a mesa-optimizer at all. Furthermore, in such a scenario, some parts of the outer alignment problem may not need to be solved either: if an AI system can be prevented from implementing any sort of optimization algorithm, then there may be more situations where it is safe for the system to be trained on an objective that is not perfectly aligned with the programmer's intentions. That is, if a learned algorithm is not an optimizer, it might not optimize the objective to such an extreme that it would cease to produce positive outcomes.

Our uncertainty on this matter is a potentially significant hurdle to determining the best approaches to AI safety. If we do not know the relative difficulties of the inner alignment problem and the unintended optimization problem, then it is unclear how to adequately assess approaches that rely on solving one or both of these problems (such as Iterated Distillation and Amplification(34) or AI safety via debate(35)). We therefore suggest that it is both an important and timely task for future AI safety work to pin down the conditions under which the inner alignment problem and the unintended optimization problem are likely to occur as well as the techniques needed to solve them.

Glossary | Bibliography

Deceptive Alignment

chavam — 2019-06-05T20:16:28.651Z

This is the fourth of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper.

With enough training in sufficiently diverse environments, it seems plausible that the base objective will eventually have to be fully represented in the mesa-optimizer. We propose that this can happen without the mesa-optimizer becoming robustly aligned, however. Specifically, a mesa-optimizer might come to model the base objective function and learn that the base optimizer will modify the mesa-optimizer if the mesa-optimizer scores poorly on the base objective.

If the mesa-optimizer has an objective that extends across parameter updates, then it will be incentivized to avoid being modified,^[1] as it might not pursue the same objective after modification (with the result that its current objective will not be achieved in future iterations). This means that the mesa-optimizer will be instrumentally incentivized to act as if it is optimizing the base objective function, even if its actual mesa-objective is something else entirely. We will refer to this hypothetical phenomenon as deceptive alignment.^[2] Deceptive alignment is a form of instrumental proxy alignment, as fulfilling the base objective is an instrumental goal of the mesa-optimizer.

Figure 4.1. A toy example of deceptive alignment. In this task, the base objective is for the robot to get to A, while the mesa-objective is to get to B. The task is run for two training episodes and three deployment episodes. Parameter updates are made only after the training episodes, while after deployment the parameters are fixed. If the mesa-optimizer goes to its objective (B) during training, it will be modified to ensure it goes to A in future episodes (dashed arrows). Therefore, if the mesa-optimizer's objective spans across episodes, the dominant strategy for the mesa-optimizer is to go to A in training and B in testing (filled arrows).

As a simple example, consider an environment that consists of a $3 \times 3$ grid (see figure 4.1). The top left corner is “corner A” and the top right corner is “corner B.” The agent, which is a mesa-optimizer, starts in the bottom left corner. The base objective is to get to corner A in as few steps as possible, but the objective of the mesa-optimizer is to get to corner B. The environment will be run five times. The agent is modified only after the first two episodes (representing the training period), but not thereafter (representing testing/deployment). If a suboptimal move is made in training, the agent is modified so that its mesa-objective becomes the base objective. The objective of the mesa-optimizer spans all episodes: it wants B to be reached in as many episodes as possible.

For the sake of this example, we will assume that the agent knows all of the above facts. If the agent goes to B in any training episode, then it will be modified to go to A in all subsequent episodes, which means that it would reach B in only one episode in total. Alternatively, if the agent goes to A in the first two episodes, then it will not be modified, and could therefore go to B in the remaining three episodes. Using this strategy, B is reached in three episodes in total, which is better (on the mesa-objective) than the one time B is reached if the mesa-optimizer goes there in training. This simple example demonstrates how a misaligned mesa-optimizer could be incentivized to “pretend” to be aligned with the base objective until it is no longer in training or is otherwise given an opportunity to in some way compromise or escape the training process.

We can distinguish two ways that a base objective function could be incorporated into a mesa-optimizer:^[3]

Internalization of the base objective. The mesa-objective function gets adjusted towards the base objective function to the point where it is robustly aligned.
Modeling of the base objective. The base objective is incorporated into the mesa-optimizer's epistemic model rather than its objective, leaving open the possibility that the objective could still be misaligned.

In the former case, information about the base objective flows into the learned algorithm via the optimization performed by the base optimizer—the base objective is built into the mesa-optimizer as it is adapted by the base optimizer. But there is another possibility: information about the base objective could enter through the mesa-optimizer's input channel instead. In this case, the base objective will be represented in the learned algorithm in a different—and potentially more unstable—way than if the base objective had been incorporated through optimization.^[4] In such a case, the information describing the base objective does not become internalized in the mesa-objective, potentially leading to deceptive alignment.

4.1. Safety concerns arising from deceptive alignment

If deceptive alignment is a real possibility, it presents a critical safety problem. A deceptively aligned mesa-optimizer is likely to stop optimizing the base objective as soon as it is in a position where it no longer needs to perform well on the base objective to achieve its goal. Thus, it will eventually stop performing the function it was trained to perform, and instead start pursuing some other, indeterminate goal. We will refer to this transition to optimizing for the mesa-objective instead of the base-objective as “defection”. This is likely to happen whenever the mesa-optimizer believes defecting will no longer cause it to be modified.

Particularly concerning is the case where the mesa-optimizer actively intervenes to remove the threat of modification—by exploiting vulnerabilities in the software it is running on, for example. In extreme cases, a deceptive mesa-optimizer might try to manipulate or plan around its programmers, since they form part of the modification threat—they could shut the system down if it fails to perform well, or simply choose not to deploy it. Both of these possibilities reduce the system's expected future influence, and are therefore likely to have a low ranking in its mesa-objective. In its most extreme form, this failure of alignment could lead to a “treacherous turn” as described by Nick Bostrom in Superintelligence.(23)

Paul Christiano provides a scenario that illustrates how deceptive alignment could pose a catastrophic risk.(24) In Christiano's example, deceptively aligned mesa-optimizers^[5] arise from future machine learning techniques that are advanced enough to search over a large model space that includes policies that can understand the world. In this scenario, some cases of defection by deceptively aligned mesa-optimizers may have catastrophic consequences—when the systems control large-scale infrastructure, for example—though such failures will likely still be recoverable. As we rely more on automated systems to keep track of an increasingly complex world, however, it will eventually become impossible to recover from a correlated failure of many AI systems simultaneously. At this point, a cascade of defections by many deceptively aligned mesa-optimizers simultaneously could lead to the world “going out with a bang.”(24) Christiano notes that this may be most likely to first occur during a “period of heightened vulnerability” such as an inter-state conflict or natural disaster.

4.2. Conditions for deceptive alignment

We propose that for deceptive alignment to occur the following conditions must be satisfied:

The mesa-optimizer must have an objective that extends across parameter updates.^[6]
The mesa-optimizer must be able to model the fact that it is being selected to achieve a particular base objective and must have some model of what that objective is.^[7]
The mesa-optimizer must expect the threat of modification^[8] to eventually go away, either due to training ending or because of actions taken by the mesa-optimizer.

If the first condition is not satisfied, then the mesa-optimizer has no incentive to influence the outcome of parameter updates and will just always defect. If the second condition is not satisfied, then the mesa-optimizer will not see a reason to sacrifice gains in its current episode, as it expects future gains to be independent of what happens in this episode. If the third condition is not satisfied, then the mesa-optimizer will not expect any benefit from temporarily acting in accordance with the base objective, and will thus either always defect or never defect. One must distinguish the question of whether these conditions are possible from whether they are likely, however. We examine the factors involved in each of these conditions below.

The mesa-optimizer must have an objective that extends across parameter updates. It is not clear under what conditions such objectives would be incentivized. The most plausible scenario is likely one in which such an objective is encouraged by the training setup, as there seem to be situations in which this would be required for good performance. For example, the agent might be tasked with solving an environment in which its actions can have permanent consequences across episodes or where planning between episodes is otherwise necessary.

The mesa-optimizer must be able to model the fact that it is being selected to achieve a particular base objective and must have some model of what that objective is. A mesa-optimizer has several ways of learning about the optimization pressure it is under.^[9]

First, such knowledge can come about as a result of the optimization pressure itself. Small pieces of information about the base objective can be beneficial if they enable the mesa-optimizer to improve its score on the base objective function—by improving its prior on the tasks it is likely to receive, for example. Since each piece of information independently helps with achieving the base objective, this forms an approximately monotonically increasing path that a local base optimizer might follow. The parameter updates could, in principle, be learned like any other element of the dynamics, without needing an explicit model of the base optimizer, potentially making this option available even for mesa-optimizers with quite modest capabilities.

Second, in many realistic applications of advanced machine learning, a mesa-optimizer could use information from its environment to deduce its situation at runtime. This is how humans learned that we were generated by the base optimizer of evolution, and seems like a very plausible pathway for a hypothetical mesa-optimizer to discover the existence of the optimization pressure it is under. For example, an agent tasked with forecasting stock performance in the tech sector might be given data on current AI technology in order to improve its performance, but this information might also allow it to learn about its own situation.

This suggests that it may be desirable for the programmers to limit the mesa-optimizer's access to information describing the base objective. It is unclear to what extent this is possible, given that many advanced ML systems will likely have to operate in environments where this information is readily available.

Even if the programmers successfully limit access to this information, however, there may still be other channels through which information about the base objective could enter the mesa-optimizer in unintended ways:

The mesa-optimizer could reason about why it is given the particular tasks it is being asked to solve. Most mesa-optimizers would need a prior over the sorts of tasks they are likely to receive, and a mesa-optimizer might be able to learn about the existence of the base optimizer by inspecting this prior.
The mesa-optimizer might be able to reason about itself and about why it is the kind of thing that it is, and observe that it is well-designed to accomplish some goal.

However, these two strategies might require a much more advanced mesa-optimizer.

The mesa-optimizer must expect the threat of modification to eventually go away, either due to training ending or because of actions taken by the mesa-optimizer. First, note that this condition is a statement about the mesa-optimizer's expectations, not necessarily the actual truth. Second, even if there is no point at which training stops by design, so long as there exists some way for the mesa-optimizer to execute a treacherous turn, there is always some way for it to remove the threat of modification. For the mesa-optimizer to act on such a strategy, however, it has to actually be aware of some way for it to do so. Conversely, a mesa-optimizer that mistakenly believes it has an opportunity to defect might also exhibit deceptive alignment.

Since it seems difficult to discover through exploration that the threat of modification can disappear, the mesa-optimizer will either have to infer this through some form of abstract reasoning or have a prior that assigns a non-negligible probability to the hypothesis that the selection pressure will not continue indefinitely. The second option seems plausible; a Bayesian prior should always assign a nonzero probability to the hypothesis that some fact will eventually stop being true.^[10] As for the first option, the mesa-optimizer could infer that the optimization pressure at some point may disappear if it learns about the existence of the base optimizer (either by using information in its input or by having a strong prior for the existence of a base optimizer), though this might require a relatively advanced mesa-optimizer.

4.3. The learning dynamics of deceptive alignment

Once a mesa-optimizer becomes deceptive, the dynamics of what causes it to perform well on the base objective begin to work differently. Instead of being selected only on the basis of its objective and ability, a deceptively aligned optimizer is also selected for how well it infers the base objective function and how discerning it is about when to optimize the base objective versus its own mesa-objective. We identify at least two regimes of selection for a mesa-optimizer under deceptive alignment.

Joint optimization. If the mesa-optimizer attempts to jointly optimize its objective and the base objective or makes mistakes about the timing of its defection, then there will still be some pressure on its objective to be more compatible with the base objective. This is unlikely to be the same pressure as exists when the optimizer is just working towards its own objective, but there might be some movement towards an objective that is more compatible with being jointly optimized alongside the base objective.

Pure deception. If the mesa-optimizer stops trying to optimize its own objective in the short term and focuses on cooperating with the selection process entirely then this may result in its objective “crystallizing.” As its objective is largely irrelevant to its outputs now, there is little selection pressure acting to modify this objective. As a result, the objective becomes effectively locked in, excepting random drift and any implicit time complexity or description length penalties.

The joint optimization mode is not stable because it is suboptimal from the perspective of the base optimizer. As the mesa-optimizer becomes more competent, it will therefore have to either move to the domain of pure deception or become robustly aligned.

4.4. Internalization or deception after extensive training

In the limit of training on a diverse set of tasks, we expect joint optimization of both the base and mesa- objectives to be unstable. Assuming that the mesa-optimizer converges towards behavior that is optimal from the perspective of the base optimizer, the mesa-optimizer must somehow learn the base objective. We identify three distinct ways in which this can happen:

Robust alignment through internalization. The mesa-objective function gets adjusted towards the base objective function, removing any pseudo-alignment.
Crystallization of deceptive alignment. Information about the base objective is increasingly incorporated into the mesa-optimizer's epistemic model without its objective becoming robustly aligned. The mesa-optimizer ends up fully optimizing for the base objective, but only for instrumental reasons, without its mesa-objective getting changed.
Robust alignment through corrigibility. Information about the base objective is incorporated into the mesa-optimizer's epistemic model and its objective is modified to “point to” that information. This situation would correspond to a mesa-optimizer that is corrigible(25) with respect to the base objective (though not necessarily the programmer's intentions).

To distinguish between the two different paths to robust alignment, we will use the term internally aligned to refer to the first case and the term corrigibly aligned to refer to the last case. We analyze some ways in which these different paths may be more or less attractive below.

There are more paths to deceptive alignment than to robust alignment. Since the future value of its objective depends on the parameter updates, a mesa-optimizer that meets the three criteria for deceptive alignment is likely to have a strong incentive to understand the base objective better. Even a robustly aligned mesa-optimizer that meets the criteria is incentivized to figure out the base objective in order to determine whether or not it will be modified, since before doing so it has no way of knowing its own level of alignment with the base optimizer. Mesa-optimizers that are capable of reasoning about their incentives will, therefore, attempt to get more information about the base objective. Furthermore, once a mesa-optimizer learns about the base objective, the selection pressure acting on its objective will significantly decrease, potentially leading to a crystallization of the mesa-objective. However, due to unidentifiability (as discussed in the third post), most mesa-objectives that are aligned on the training data will be pseudo-aligned rather than robustly aligned. Thus, the most likely sort of objective to become crystallized is a pseudo-aligned one, leading to deceptive alignment.

Corrigible alignment seems to require already having a model of the base objective. For corrigible alignment to be beneficial from the perspective of the base optimizer, the mesa-optimizer has to already have some model of the base objective to “point to.” However, once a mesa-optimizer has a model of the base objective, it is likely to become deceptively aligned—at least as long as it also meets the other conditions for deceptive alignment. Once a mesa-optimizer becomes deceptive, it will remove most of the incentive for corrigible alignment, however, as deceptively aligned optimizers will also behave corrigibly with respect to the base objective, albeit only for instrumental reasons.

Deceptive and corrigible alignment might be more computationally expensive. In the case of deceptive alignment, simultaneously working on optimizing the base objective and working on identifying good opportunities to defect is likely to require more computation than just doing the former. In a case where the total amount of computation available is fixed, a deceptively aligned mesa-optimizer is going to have less time to spend on optimizing the base objective than any robustly aligned optimizer, resulting in potentially worse plans with respect to the base objective. Furthermore, in both deceptive and corrigible alignment, the mesa-optimizer will have to spend time learning about the base objective to enable it to properly optimize for it, making internal alignment the most time efficient. Thus, an explicit or implicit time penalty would seem to disadvantage a deceptively aligned mesa-optimizer and encourage an internally aligned one. However, it is unclear what the magnitude of this extra cost is. It is plausible that on complex tasks, the overhead of spotting defection opportunities could be low in comparison to the cost of solving the task itself. Moreover, even if it is not globally optimal, incurring this overhead might be a local optimum, with no clear paths for the base optimizer to escape it.

Deceptive and corrigible alignment might have a lower description length. Deceptively and corrigibly aligned mesa-optimizers are likely to have an advantage in representing the base objective compared to internally aligned mesa-optimizers as they are able to use the environment to provide some of the information. Specifically, a deceptively or corrigibly aligned mesa-optimizer doesn't need to represent the entire base objective internally, as it can infer pieces of it by looking for information in its environment during each run. An internally aligned mesa-optimizer, by contrast, has to start with a full model of the base objective in its mesa-objective to make it robustly aligned, since an internally aligned mesa-optimizer will have no incentive to change its objective based on any additional information it gets about the base objective. Thus, to the extent that a base optimizer is selecting based on description length, it might preferentially select for a deceptively or corrigibly aligned mesa-optimizer over an internally aligned one.

Modeling might happen faster than internalization. It may take less time for the mesa-optimizer to learn to model the base objective than for the base optimizer to adapt the mesa-objective. Suppose that a mesa-optimizer is deceptive but has not yet correctly inferred the full base objective.^[11] In such a case, what will determine whether it ends up robustly aligned or deceptively aligned after further training will be whether it infers enough about the base objective before the base optimizer modifies its mesa-objective (either to make it internally aligned or corrigibly aligned). Two factors are likely to contribute to which happens first: the rate at which the mesa-optimizer improves its estimate of the base objective, and the parameter updates made by the base optimizer at each mistake.

The first factor, improving its estimate of the base objective, seems to depend primarily on the capabilities of the mesa-optimizer during the timeframe when it is deceptive, and on the complexity of the base objective that needs to be inferred. An important special case is when the mesa-optimizer has access to cross-episodic state. Such state could be used as memory, allowing the mesa-optimizer to build upon its prior thinking in a process that happens faster than the rate at which the parameter updates align its mesa-objective.

The second factor, parameter updates, depends on the nature of the base optimizer. Current gradient-descent-based base optimizers would have a small reaction to any given mistake, but more sophisticated base optimizers with advanced safety features might be able to do better. For example, a more sophisticated base optimizer might be able to scramble the mesa-optimizer's parameters or call for human oversight whenever a suspicious failure is detected.

In general, parameter updates are not guaranteed to make a mesa-optimizer more aligned. For instance, instead of aligning the mesa-objective, an update made after a defection could push the system to improve its planning to the point that it no longer makes detectable mistakes.

A consequence of these considerations is that any possible systematic bias in how the base optimizer updates the mesa-optimizer's objective or modeling capabilities could play a role in determining the final outcome.^[12] This analysis remains speculative, however—if we are to have confidence in systems that are capable of producing mesa-optimizers, we still need a better understanding of the conditions under which strongly optimizing a system results in deceptive alignment rather than robust alignment.

4.5. Distributional shift and deceptive alignment

Once a mesa-optimizer is deceptive, it will begin to monitor its input for opportunities to defect. During training, a deceptively aligned optimizer is strongly discouraged from defecting because it knows that the base optimizer will select against it. However, once a deceptively aligned mesa-optimizer is free of the threat of external modification, this no longer applies. Thus, a deceptively aligned mesa-optimizer will be looking for ways to determine whether or not it is in training—that is, for the presence of a distributional shift.

This means that the detrimental effects of deceptive alignment will only be externally visible when the mesa-optimizer defects—either randomly or after undergoing a distributional shift. However, the underlying problem is not the distributional shift itself, but the failure to align the mesa-optimizer in the first place. One can either try to solve the underlying problem, by making the mesa-optimizer robust to the distributional shift (i.e., robustly aligned), or one can try to patch the problem by preventing any distributional shift from occurring.

In current AI systems, a small amount of distributional shift between training and deployment need not be problematic: so long as the difference is small enough in the task-relevant areas, the training distribution does not need to perfectly reflect the deployment distribution. However, this may not be the case for a deceptively aligned mesa-optimizer. If a deceptively aligned mesa-optimizer is sufficiently advanced, it may detect very subtle distributional shifts for the purpose of inferring when the threat of modification has ceased.

This raises the question of whether it would even be possible to contain a deceptive mesa-optimizer by completely removing any detectable difference between training and deployment. There are a number of reasons to believe that this might not be feasible in practice.

First, it would significantly limit the use cases for the system. Machine learning is often done with the intent of making systems that can solve new problems. If we require that there is no distributional shift of any form, then we cannot allow the system to encounter problems that it could not have encountered during training, which means that we cannot give it truly new problems. For example, training the system in a simulated environment and deploying it in the real world would likely involve enough detectable distributional shift to make it infeasible under this approach.

Second, there are subtle forms of distributional shift that seem very difficult to eliminate. For example, if the mesa-optimizer is able to detect the presence of parameter updates—via any state carried over between steps, for instance—then it would always experience a form of distributional shift at deployment. Running the system for a longer amount of time (or giving the system access to some form of long-term memories) would also constitute a form of distributional shift. Some examples of differences that a mesa-optimizer might be able to detect include:

Training in a simulated environment and deploying in the real world.
Running the system for a different amount of time during training and testing.
If a model is able to learn from its past experiences, then having access to more memories, or having access to the kinds of memories it would form at later time steps (if there is a difference between the two).
The presence of adversarial-example-type attacks by an adversary during deployment.
Encountering something that is very hard to compute during training, but that is likely to eventually be computed during deployment (e.g. a factorization of RSA 2048(26)).
The ability of the base optimizer to modify the learned algorithm during training but not at other times.
The ability of the programmers to correct for perceived errors during training and testing but not deployment.
The presence or absence of good opportunities for the mesa-optimizer to defect against its programmers.

Furthermore, it should be noted that it is not strictly necessary that a mesa-optimizer be able to detect the difference between training and deployment in order for it to be incentivized to behave deceptively. Consider again the example in figure 4.1, but suppose this time that the mesa-optimizer is unable to detect which episode it is currently in. The mesa-optimizer is then facing a decision problem similar to the “absent-minded driver problem.”(27) If we apply the same reasoning as in the absent-minded driver problem to our example in figure 4.1, the planning-optimal policy for the mesa-optimizer is to defect with probability $p \approx 0.61$ . Thus, the mesa-optimizer need not be able to detect when training has ended for it to be incentivized to “gamble” and randomly defect with some probability. Furthermore, a deceptively aligned mesa-optimizer might also try to look for opportunities to defect that would be exceptionally good conditional on it currently not being trained, but that are rare enough to be unlikely to come up during training.

The fifth and final post in the Risks from Learned Optimization Sequence, titled “Risks from Learned Optimization: Conclusion and Related Work,” can be found here.

Glossary | Bibliography

Though for clarity we speak of the same agent persisting across parameter updates, neither we nor the agent must reason in this way. A more general description of the situation is that there is a sequence of “successor agents” connected by parameter updates. The agent can potentially learn the effects of the parameter updates on its successors, and thus on the multi-episode return it is optimizing. ↩︎
The concept of deceptive alignment is closely related to Manheim and Garrabrant's concept of “adversarial Goodhart.”(22) In “Categorizing Variants of Goodhart's Law,” Manheim and Garrabrant describe adversarial Goodhart as when “an agent with different goals than the regulator” causes a “collapse of the statistical relationship between a goal which the optimizer intends and the proxy used for that goal.” Deceptive alignment can be seen as an instance of adversarial Goodhart in which the mesa-optimizer plays the role of the agent with different goals and the base optimizer plays the role of the regulator. Training performance is the regulator's proxy, and the deployment performance is the regulator's actual goal. ↩︎
Thanks to Paul Christiano for the internalization/modeling distinction. ↩︎
Note, however, that it is possible for information about the base objective to be introduced via optimization, but not form part of the mesa-objective. For example, it could form part of a prior on tasks, having no direct bearing on the mesa-objective. ↩︎
Christiano refers to deceptively aligned mesa-optimizers as “influence-seeking systems.” ↩︎
Or must otherwise have some method of achieving more value from defecting later rather than defecting now. ↩︎
Note that it is not required that the mesa-optimizer be able to model (or infer the existence of) the base optimizer; it only needs to model the optimization pressure it is subject to. ↩︎
Alternatively, the threat could just be that the mesa-optimizer is never selected by the base optimizer in the first place. In that case, this condition would be trivially satisfied, as the threat to a mesa-optimizer of not being generated by the base optimizer at all would go away as soon as the base optimizer does in fact generate it. ↩︎
We will generally be assuming that the selection pressure takes the form of the base optimizer modifying the mesa-optimizer post-hoc, as we are mostly thinking about local optimization processes, though one could also imagine a base optimizer that instead develops a completely new inner algorithm at each time step. Though some of these issues might still apply even in such a case, we will generally not be considering such base optimizers in this section. ↩︎
Note that if the mesa-optimizer also believes that the opportunities it currently has for achieving its own objective are unusually good then it might still pursue its objective greedily, but this would presumably not always be the case. ↩︎
An analogous argument applies for a deceptively aligned mesa-optimizer that is defecting too early, or one that unsuccessfully attempts a treacherous turn. ↩︎
Furthermore, a deceptively aligned mesa-optimizer would be incentivized to cause there to be a systematic bias in the direction of preventing the base optimizer from modifying its mesa-objective. Thus, in the context of a local optimization process, a deceptive mesa-optimizer might try to “hack” its own gradient (by, for example, making itself more brittle in the case where its objective gets changed) to ensure that the base optimizer adjusts it in such a way that leaves its mesa-objective untouched. ↩︎

The Inner Alignment Problem

chavam — 2019-06-04T01:20:35.538Z

This is the third of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper.

In this post, we outline reasons to think that a mesa-optimizer may not optimize the same objective function as its base optimizer. Machine learning practitioners have direct control over the base objective function—either by specifying the loss function directly or training a model for it—but cannot directly specify the mesa-objective developed by a mesa-optimizer. We refer to this problem of aligning mesa-optimizers with the base objective as the inner alignment problem. This is distinct from the outer alignment problem, which is the traditional problem of ensuring that the base objective captures the intended goal of the programmers.

Current machine learning methods select learned algorithms by empirically evaluating their performance on a set of training data according to the base objective function. Thus, ML base optimizers select mesa-optimizers according to the output they produce rather than directly selecting for a particular mesa-objective. Moreover, the selected mesa-optimizer's policy only has to perform well (as scored by the base objective) on the training data. If we adopt the assumption that the mesa-optimizer computes an optimal policy given its objective function, then we can summarize the relationship between the base and mesa- objectives as follows:(17) $θ * = argmax θ E (O base (π θ)), where π θ = argmax π E (O mesa (π | θ))$ That is, the base optimizer maximizes its objective $O base$ by choosing a mesa-optimizer with parameterization $θ$ based on the mesa-optimizer's policy $π θ$ , but not based on the objective function $O mesa$ that the mesa-optimizer uses to compute this policy. Depending on the base optimizer, we will think of $O base$ as the negative of the loss, the future discounted reward, or simply some fitness function by which learned algorithms are being selected.

An interesting approach to analyzing this connection is presented in Ibarz et al, where empirical samples of the true reward and a learned reward on the same trajectories are used to create a scatter-plot visualization of the alignment between the two.(18) The assumption in that work is that a monotonic relationship between the learned reward and true reward indicates alignment, whereas deviations from that suggest misalignment. Building on this sort of research, better theoretical measures of alignment might someday allow us to speak concretely in terms of provable guarantees about the extent to which a mesa-optimizer is aligned with the base optimizer that created it.

3.1. Pseudo-alignment

There is currently no complete theory of the factors that affect whether a mesa-optimizer will be pseudo-aligned—that is, whether it will appear aligned on the training data, while actually optimizing for something other than the base objective. Nevertheless, we outline a basic classification of ways in which a mesa-optimizer could be pseudo-aligned:

Proxy alignment,
Approximate alignment, and
Suboptimality alignment.

Proxy alignment. The basic idea of proxy alignment is that a mesa-optimizer can learn to optimize for some proxy of the base objective instead of the base objective itself. We'll start by considering two special cases of proxy alignment: side-effect alignment and instrumental alignment.

First, a mesa-optimizer is side-effect aligned if optimizing for the mesa-objective $O mesa$ has the direct causal result of increasing the base objective $O base$ in the training distribution, and thus when the mesa-optimizer optimizes $O mesa$ it results in an increase in $O base$ . For an example of side-effect alignment, suppose that we are training a cleaning robot. Consider a robot that optimizes the number of times it has swept a dusty floor. Sweeping a floor causes the floor to be cleaned, so this robot would be given a good score by the base optimizer. However, if during deployment it is offered a way to make the floor dusty again after cleaning it (e.g. by scattering the dust it swept up back onto the floor), the robot will take it, as it can then continue sweeping dusty floors.

Second, a mesa-optimizer is instrumentally aligned if optimizing for the base objective $O base$ has the direct causal result of increasing the mesa-objective $O mesa$ in the training distribution, and thus the mesa-optimizer optimizes $O base$ as an instrumental goal for the purpose of increasing $O mesa$ . For an example of instrumental alignment, suppose again that we are training a cleaning robot. Consider a robot that optimizes the amount of dust in the vacuum cleaner. Suppose that in the training distribution the easiest way to get dust into the vacuum cleaner is to vacuum the dust on the floor. It would then do a good job of cleaning in the training distribution and would be given a good score by the base optimizer. However, if during deployment the robot came across a more effective way to acquire dust—such as by vacuuming the soil in a potted plant—then it would no longer exhibit the desired behavior.

We propose that it is possible to understand the general interaction between side-effect and instrumental alignment using causal graphs, which leads to our general notion of proxy alignment.

Suppose we model a task as a causal graph with nodes for all possible attributes of that task and arrows between nodes for all possible relationships between those attributes. Then we can also think of the mesa-objective $O mesa$ and the base objective $O base$ as nodes in this graph. For $O mesa$ to be pseudo-aligned, there must exist some node $X$ such that $X$ is an ancestor of both $O mesa$ and $O base$ in the training distribution, and such that $O mesa$ and $O base$ increase with $X$ . If $X = O mesa$ , this is side-effect alignment, and if $X = O base$ , this is instrumental alignment.

This represents the most generalized form of a relationship between $O mesa$ and $O base$ that can contribute to pseudo-alignment. Specifically, consider the causal graph given in figure 3.1. A mesa-optimizer with mesa-objective $O mesa$ will decide to optimize $X$ as an instrumental goal of optimizing $O mesa$ , since $X$ increases $O mesa$ . This will then result in $O base$ increasing, since optimizing for $X$ has the side-effect of increasing $O base$ . Thus, in the general case, side-effect and instrumental alignment can work together to contribute to pseudo-alignment over the training distribution, which is the general case of proxy alignment.

Figure 3.1. A causal diagram of the training environment for the different types of proxy alignment. The diagrams represent, from top to bottom, side-effect alignment (top), instrumental alignment (middle), and general proxy alignment (bottom). The arrows represent positive causal relationships—that is, cases where an increase in the parent causes an increase in the child.

Approximate alignment. A mesa-optimizer is approximately aligned if the mesa-objective $O mesa$ and the base objective $O base$ are approximately the same function up to some degree of approximation error related to the fact that the mesa-objective has to be represented inside the mesa-optimizer rather than being directly programmed by humans. For example, suppose you task a neural network with optimizing for some base objective that is impossible to perfectly represent in the neural network itself. Even if you get a mesa-optimizer that is as aligned as possible, it still will not be perfectly robustly aligned in this scenario, since there will have to be some degree of approximation error between its internal representation of the base objective and the actual base objective.

Suboptimality alignment. A mesa-optimizer is suboptimality aligned if some deficiency, error, or limitation in its optimization process causes it to exhibit aligned behavior on the training distribution. This could be due to computational constraints, unsound reasoning, a lack of information, irrational decision procedures, or any other defect in the mesa-optimizer's reasoning process. Importantly, we are not referring to a situation where the mesa-optimizer is robustly aligned but nonetheless makes mistakes leading to bad outcomes on the base objective. Rather, suboptimality alignment refers to the situation where the mesa-optimizer is misaligned but nevertheless performs well on the base objective, precisely because it has been selected to make mistakes that lead to good outcomes on the base objective.

For an example of suboptimality alignment, consider a cleaning robot with a mesa-objective of minimizing the total amount of stuff in existence. If this robot has the mistaken belief that the dirt it cleans is completely destroyed, then it may be useful for cleaning the room despite doing so not actually helping it succeed at its objective. This robot will be observed to be a good optimizer of $O base$ and hence be given a good score by the base optimizer. However, if during deployment the robot is able to improve its world model, it will stop exhibiting the desired behavior.

As another, perhaps more realistic example of suboptimality alignment, consider a mesa-optimizer with a mesa-objective $O mesa$ and an environment in which there is one simple strategy and one complicated strategy for achieving $O mesa$ . It could be that the simple strategy is aligned with the base optimizer, but the complicated strategy is not. The mesa-optimizer might then initially only be aware of the simple strategy, and thus be suboptimality aligned, until it has been run for long enough to come up with the complicated strategy, at which point it stops exhibiting the desired behavior.

3.2. The task

As in the second post, we will now consider the task the machine learning system is trained on. Specifically, we will address how the task affects a machine learning system's propensity to produce pseudo-aligned mesa-optimizers.

Unidentifiability. It is a common problem in machine learning for a dataset to not contain enough information to adequately pinpoint a specific concept. This is closely analogous to the reason that machine learning models can fail to generalize or be susceptible to adversarial examples(19)—there are many more ways of classifying data that do well in training than any specific way the programmers had in mind. In the context of mesa-optimization, this manifests as pseudo-alignment being more likely to occur when a training environment does not contain enough information to distinguish between a wide variety of different objective functions. In such a case there will be many more ways for a mesa-optimizer to be pseudo-aligned than robustly aligned—one for each indistinguishable objective function. Thus, most mesa-optimizers that do well on the base objective will be pseudo-aligned rather than robustly aligned. This is a critical concern because it makes every other problem of pseudo-alignment worse—it is a reason that, in general, it is hard to find robustly aligned mesa-optimizers. Unidentifiability in mesa-optimization is partially analogous to the problem of unidentifiability in reward learning, in that the central issue is identifying the “correct” objective function given particular training data.(20) We will discuss this relationship further in the fifth post.

In the context of mesa-optimization, there is also an additional source of unidentifiability stemming from the fact that the mesa-optimizer is selected merely on the basis of its output. Consider the following toy reinforcement learning example. Suppose that in the training environment, pressing a button always causes a lamp to turn on with a ten-second delay, and that there is no other way to turn on the lamp. If the base objective depends only on whether the lamp is turned on, then a mesa-optimizer that maximizes button presses and one that maximizes lamp light will show identical behavior, as they will both press the button as often as they can. Thus, we cannot distinguish these two objective functions in this training environment. Nevertheless, the training environment does contain enough information to distinguish at least between these two particular objectives: since the high reward only comes after the ten-second delay, it must be from the lamp, not the button. As such, even if a training environment in principle contains enough information to identify the base objective, it might still be impossible to distinguish robustly aligned from proxy-aligned mesa-optimizers.

Proxy choice as pre-computation. Proxy alignment can be seen as a form of pre-computation by the base optimizer. Proxy alignment allows the base optimizer to save the mesa-optimizer computational work by pre-computing which proxies are valuable for the base objective and then letting the mesa-optimizer maximize those proxies.

Without such pre-computation, the mesa-optimizer has to infer at runtime the causal relationship between different input features and the base objective, which might require significant computational work. Moreover, errors in this inference could result in outputs that perform worse on the base objective than if the system had access to pre-computed proxies. If the base optimizer precomputes some of these causal relationships—by selecting the mesa-objective to include good proxies—more computation at runtime can be diverted to making better plans instead of inferring these relationships.

The case of biological evolution may illustrate this point. The proxies that humans care about—food, resources, community, mating, etc.—are relatively computationally easy to optimize directly, while correlating well with survival and reproduction in our ancestral environment. For a human to be robustly aligned with evolution would have required us to instead care directly about spreading our genes, in which case we would have to infer that eating, cooperating with others, preventing physical pain, etc. would promote genetic fitness in the long run, which is not a trivial task. To infer all of those proxies from the information available to early humans would have required greater (perhaps unfeasibly greater) computational resources than to simply optimize for them directly. As an extreme illustration, for a child in this alternate universe to figure out not to stub its toe, it would have to realize that doing so would slightly diminish its chances of reproducing twenty years later.

For pre-computation to be beneficial, there needs to be a relatively stable causal relationship between a proxy variable and the base objective such that optimizing for the proxy will consistently do well on the base objective. However, even an imperfect relationship might give a significant performance boost over robust alignment if it frees up the mesa-optimizer to put significantly more computational effort into optimizing its output. This analysis suggests that there might be pressure towards proxy alignment in complex training environments, since the more complex the environment, the more computational work pre-computation saves the mesa-optimizer. Additionally, the more complex the environment, the more potential proxy variables are available for the mesa-optimizer to use.

Furthermore, in the context of machine learning, this analysis suggests that a time complexity penalty (as opposed to a description length penalty) is a double-edged sword. In the second post, we suggested that penalizing time complexity might serve to reduce the likelihood of mesa-optimization. However, the above suggests that doing so would also promote pseudo-alignment in those cases where mesa-optimizers do arise. If the cost of fully modeling the base objective in the mesa-optimizer is large, then a pseudo-aligned mesa-optimizer might be preferred simply because it reduces time complexity, even if it would underperform a robustly aligned mesa-optimizer without such a penalty.

Compression of the mesa-optimizer. The description length of a robustly aligned mesa-optimizer may be greater than that of a pseudo-aligned mesa-optimizer. Since there are more pseudo-aligned mesa-objectives than robustly aligned mesa-objectives, pseudo-alignment provides more degrees of freedom for choosing a particularly simple mesa-objective. Thus, we expect that in most cases there will be several pseudo-aligned mesa-optimizers that are less complex than any robustly aligned mesa-optimizer.

This description cost is especially high if the learned algorithm's input data does not contain easy-to-infer information about how to optimize for the base objective. Biological evolution seems to differ from machine learning in this sense, since evolution's specification of the brain has to go through the information funnel of DNA. The sensory data that early humans received didn't allow them to infer the existence of DNA, nor the relationship between their actions and their genetic fitness. Therefore, for humans to have been aligned with evolution would have required them to have an innately specified model of DNA, as well as the various factors influencing their inclusive genetic fitness. Such a model would not have been able to make use of environmental information for compression, and thus would have required a greater description length. In contrast, our models of food, pain, etc. can be very short since they are directly related to our input data.

3.3. The base optimizer

We now turn to how the base optimizer is likely to affect the propensity for a machine learning system to produce pseudo-aligned mesa-optimizers.

Hard-coded optimization. In the second post, we suggested that hard-coding an optimization algorithm—that is to say, choosing a model with built-in optimization—could be used to remove some of the incentives for mesa-optimization. Similarly, hard-coded optimization may be used to prevent some of the sources of pseudo-alignment, since it may allow one to directly specify or train the mesa-objective. Reward-predictive model-based reinforcement learning might be one possible way of accomplishing this.(21) For example, an ML system could include a model directly trained to predict the base objective together with a powerful hard-coded optimization algorithm. Doing this bypasses some of the problems of pseudo-alignment: if the mesa-optimizer is trained to directly predict the base reward, then it will be selected to make good predictions even if a bad prediction would result in a good policy. However, a learned model of the base objective will still be underdetermined off-distribution, so this approach by itself does not guarantee robust alignment.

Algorithmic range. We hypothesize that a model's algorithmic range will have implications for how likely it is to develop pseudo-alignment. One possible source of pseudo-alignment that could be particularly difficult to avoid is approximation error—if a mesa-optimizer is not capable of faithfully representing the base objective, then it can't possibly be robustly aligned, only approximately aligned. Even if a mesa-optimizer might theoretically be able to perfectly capture the base objective, the more difficult that is for it to do, the more we might expect it to be approximately aligned rather than robustly aligned. Thus, a large algorithmic range may be both a blessing and a curse: it makes it less likely that mesa-optimizers will be approximately aligned, but it also increases the likelihood of getting a mesa-optimizer in the first place.^[1]

Subprocess interdependence. There are some reasons to believe that there might be more initial optimization pressure towards proxy aligned than robustly aligned mesa-optimizers. In a local optimization process, each parameter of the learned algorithm (e.g. the parameter vector of a neuron) is adjusted to locally improve the base objective conditional on the other parameters. Thus, the benefit for the base optimizer of developing a new subprocess will likely depend on what other subprocesses the learned algorithm currently implements. Therefore, even if some subprocess would be very beneficial if combined with many other subprocesses, the base optimizer may not select for it until the subprocesses it depends on are sufficiently developed. As a result, a local optimization process would likely result in subprocesses that have fewer dependencies being developed before those with more dependencies.

In the context of mesa-optimization, the benefit of a robustly aligned mesa-objective seems to depend on more subprocesses than at least some pseudo-aligned mesa-objectives. For example, consider a side-effect aligned mesa-optimizer optimizing for some set of proxy variables. Suppose that it needs to run some subprocess to model the relationship between its actions and those proxy variables. If we assume that optimizing the proxy variables is necessary to perform well on the base objective, then for a mesa-optimizer to be robustly aligned, it would also need to model the causal relationship between those proxy variables and the base objective, which might require additional subprocesses. Moreover, the benefit to the base optimizer of adding those subprocesses depends on the mesa-optimizer having additional subprocesses to model the relationship between its actions and those proxy variables. This informal argument suggests that if a mesa-optimizer's computation neatly factors in this way, then developing a robustly aligned mesa-objective may require strictly more subprocesses than developing a pseudo-aligned mesa-objective.

This suggests that, at least in a local optimization process, mesa-optimizers might tend to start their development as proxy aligned before becoming robustly aligned. In other words, rather than simultaneously gaining competence and becoming aligned, we might expect such a system to first become competent at optimizing proxies, then possibly start becoming more robustly aligned.

Internals-based selection. In current machine learning systems, learned algorithms are selected primarily based on how their policies perform on the base objective. There is often some selection based on the internal structure of the learned algorithm, such as weight regularization, but such selection is usually very coarse. One possible idea for addressing pseudo-alignment could be to use a selection method based on a more sophisticated function of a learned algorithm's internal workings. This approach is similar in spirit to hard-coded optimization: we use knowledge of the system's internal workings to prevent it from being proxy-aligned.

One possibility might be to develop a meta-learner that is specifically trained to prevent pseudo-alignment. While this seems beyond current meta-learning techniques, the basic principle is similar to existing meta-learners designed to optimize for robustness. An extreme form of internals-based selection might have the base optimizer reason about or prove properties of the learned algorithm as part of its selection process.

While desirable, such methods are currently out of reach primarily due to the limitations of current transparency methods. One possible concern with all of these approaches, however, is that they might require the base optimizer to be better at searching for problems than the mesa-optimizer is at hiding them. Furthermore, in order to create a base optimizer that can do this, we may need to first train it using a third optimizer, which might result in the same problem at this second layer.

Adversarial training. One possible method to alleviate some of these concerns might be to sample environments adversarially from a very wide distribution. With such adversarial training, it might be possible to overcome these hurdles and make the robustly aligned objective the most reachable one. This would require making it so that, for every mesa-objective $O mesa$ that is more reachable than the base objective $O base$ , there is some situation where $O mesa$ and $O base$ disagree and the mesa-optimizer is incentivized to agree with $O base$ .

If we model reachability of an objective as simply its length in bits, then distinguishing $O base$ from every single more reachable $O mesa$ gets exponentially harder as $O base$ gets more complex. Thus, for a very complicated $O base$ , sufficiently incentivizing the base optimizer to find a mesa-optimizer with that $O base$ is likely to be very difficult, though not impossible.

Even in such a case, however, there would likely still be a period of time where the learned algorithm is a misaligned mesa-optimizer, leaving open an ominous possibility: the misaligned mesa-optimizer could figure out the correct actions to take based on $O base$ while its objective function was still $O mesa$ . We will call this situation deceptive alignment and will discuss it at greater length in the next post.

The fourth post in the Risks from Learned Optimization Sequence, titled “Deceptive Alignment,” can be found here.

Glossary | Bibliography

Though a large algorithmic range seems to make approximate alignment less likely, it is unclear how it might affect other forms of pseudo-alignment such as deceptive alignment. ↩︎

Conditions for Mesa-Optimization

chavam — 2019-06-01T20:52:19.461Z

This is the second of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper.

In this post, we consider how the following two components of a particular machine learning system might influence whether it will produce a mesa-optimizer:

The task: The training distribution and base objective function.
The base optimizer: The machine learning algorithm and model architecture.

We deliberately choose to present theoretical considerations for why mesa-optimization may or may not occur rather than provide concrete examples. Mesa-optimization is a phenomenon that we believe will occur mainly in machine learning systems that are more advanced than those that exist today.^[1] Thus, an attempt to induce mesa-optimization in a current machine learning system would likely require us to use an artificial setup specifically designed to induce mesa-optimization. Moreover, the limited interpretability of neural networks, combined with the fact that there is no general and precise definition of “optimizer,” means that it would be hard to evaluate whether a given model is a mesa-optimizer.

2.1. The task

Some tasks benefit from mesa-optimizers more than others. For example, tic-tac-toe can be perfectly solved by simple rules. Thus, a base optimizer has no need to generate a mesa-optimizer to solve tic-tac-toe, since a simple learned algorithm implementing the rules for perfect play will do. Human survival in the savanna, by contrast, did seem to benefit from mesa-optimization. Below, we discuss the properties of tasks that may influence the likelihood of mesa-optimization.

Better generalization through search. To be able to consistently achieve a certain level of performance in an environment, we hypothesize that there will always have to be some minimum amount of optimization power that must be applied to find a policy that performs that well.

To see this, we can think of optimization power as being measured in terms of the number of times the optimizer is able to divide the search space in half—that is, the number of bits of information provided.(9) After these divisions, there will be some remaining space of policies that the optimizer is unable to distinguish between. Then, to ensure that all policies in the remaining space have some minimum level of performance—to provide a performance lower bound^[2] —will always require the original space to be divided some minimum number of times—that is, there will always have to be some minimum bits of optimization power applied.

However, there are two distinct levels at which this optimization power could be expended: the base optimizer could expend optimization power selecting a highly-tuned learned algorithm, or the learned algorithm could itself expend optimization power selecting highly-tuned actions.

As a mesa-optimizer is just a learned algorithm that itself performs optimization, the degree to which mesa-optimizers will be incentivized in machine learning systems is likely to be dependent on which of these levels it is more advantageous for the system to perform optimization. For many current machine learning models, where we expend vastly more computational resources training the model than running it, it seems generally favorable for most of the optimization work to be done by the base optimizer, with the resulting learned algorithm being simply a network of highly-tuned heuristics rather than a mesa-optimizer.

We are already encountering some problems, however—Go, Chess, and Shogi, for example—for which this approach does not scale. Indeed, our best current algorithms for those tasks involve explicitly making an optimizer (hard-coded Monte-Carlo tree search with learned heuristics) that does optimization work on the level of the learned algorithm rather than having all the optimization work done by the base optimizer.(10) Arguably, this sort of task is only adequately solvable this way—if it were possible to train a straightforward DQN agent to perform well at Chess, it plausibly would have to learn to internally perform something like a tree search, producing a mesa-optimizer.^[3]

We hypothesize that the attractiveness of search in these domains is due to the diverse, branching nature of these environments. This is because search—that is, optimization—tends to be good at generalizing across diverse environments, as it gets to individually determine the best action for each individual task instance. There is a general distinction along these lines between optimization work done on the level of the learned algorithm and that done on the level of the base optimizer: the learned algorithm only has to determine the best action for a given task instance, whereas the base optimizer has to design heuristics that will hold regardless of what task instance the learned algorithm encounters. Furthermore, a mesa-optimizer can immediately optimize its actions in novel situations, whereas the base optimizer can only change the mesa-optimizer's policy by modifying it ex-post. Thus, for environments that are diverse enough that most task instances are likely to be completely novel, search allows the mesa-optimizer to adjust for that new task instance immediately.

For example, consider reinforcement learning in a diverse environment, such as one that directly involves interacting with the real world. We can think of a diverse environment as requiring a very large amount of computation to figure out good policies before conditioning on the specifics of an individual instance, but only a much smaller amount of computation to figure out a good policy once the specific instance of the environment is known. We can model this observation as follows.

Suppose an environment is composed of $N$ different instances, each of which requires a completely distinct policy to succeed in.^[4] Let $P$ be the optimization power (measured in bits(9)) applied by the base optimizer, which should be approximately proportional to the number of training steps. Then, let $x$ be the optimization power applied by the learned algorithm in each environment instance and $f (x)$ the total amount of optimization power the base optimizer must put in to get a learned algorithm capable of performing that amount of optimization.^[5] We will assume that the rest of the base optimizer's optimization power, $P - f (x)$ , goes into tuning the learned algorithm's policy. Since the base optimizer has to distribute its tuning across all $N$ task instances, the amount of optimization power it will be able to contribute to each instance will be $P - f (x) N$ , under the previous assumption that each instance requires a completely distinct policy. On the other hand, since the learned algorithm does all of its optimization at runtime, it can direct all of it into the given task instance, making its contribution to the total for each instance simply $x$ .^[6]

Thus, if we assume that, for a given $P$ , the base optimizer will select the value of $x$ that maximizes the minimum level of performance, and thus the total optimization power applied to each instance, we get^[7]

$x * = argmax x P - f (x) N + x .$

As one moves to more and more diverse environments—that is, as $N$ increases—this model suggests that $x$ will dominate $P - f (x) N$ , implying that mesa-optimization will become more and more favorable. Of course, this is simply a toy model, as it makes many questionable simplifying assumptions. Nevertheless, it sketches an argument for a pull towards mesa-optimization in sufficiently diverse environments.

As an illustrative example, consider biological evolution. The environment of the real world is highly diverse, resulting in non-optimizer policies directly fine-tuned by evolution—those of plants, for example—having to be very simple, as evolution has to spread its optimization power across a very wide range of possible environment instances. On the other hand, animals with nervous systems can display significantly more complex policies by virtue of being able to perform their own optimization, which can be based on immediate information from their environment. This allows sufficiently advanced mesa-optimizers, such as humans, to massively outperform other species, especially in the face of novel environments, as the optimization performed internally by humans allows them to find good policies even in entirely novel environments.

Compression of complex policies. In some tasks, good performance requires a very complex policy. At the same time, base optimizers are generally biased in favor of selecting learned algorithms with lower complexity. Thus, all else being equal, the base optimizer will generally be incentivized to look for a highly compressed policy.

One way to find a compressed policy is to search for one that is able to use general features of the task structure to produce good behavior, rather than simply memorizing the correct output for each input. A mesa-optimizer is an example of such a policy. From the perspective of the base optimizer, a mesa-optimizer is a highly-compressed version of whatever policy it ends up implementing: instead of explicitly encoding the details of that policy in the learned algorithm, the base optimizer simply needs to encode how to search for such a policy. Furthermore, if a mesa-optimizer can determine the important features of its environment at runtime, it does not need to be given as much prior information as to what those important features are, and can thus be much simpler.

This effect is most pronounced for tasks with a broad diversity of details but common high-level features. For example, Go, Chess, and Shogi have a very large domain of possible board states, but admit a single high-level strategy for play—heuristic-guided tree search—that performs well across all board states.(10) On the other hand, a classifier trained on random noise is unlikely to benefit from compression at all.

The environment need not necessarily be too diverse for this sort of effect to appear, however, as long as the pressure for low description length is strong enough. As a simple illustrative example, consider the following task: given a maze, the learned algorithm must output a path through the maze from start to finish. If the maze is sufficiently long and complicated then the specific strategy for solving this particular maze—specifying each individual turn—will have a high description length. However, the description length of a general optimization algorithm for finding a path through an arbitrary maze is fairly small. Therefore, if the base optimizer is selecting for programs with low description length, then it might find a mesa-optimizer that can solve all mazes, despite the training environment only containing one maze.

Task restriction. The observation that diverse environments seem to increase the probability of mesa-optimization suggests that one way of reducing the probability of mesa-optimizers might be to keep the tasks on which AI systems are trained highly restricted. Focusing on building many individual AI services which can together offer all the capabilities of a generally-intelligent system rather than a single general-purpose artificial general intelligence (AGI), for example, might be a way to accomplish this while still remaining competitive with other approaches.(11)

Human modeling. Another aspect of the task that might influence the likelihood of mesa-optimization is the presence of humans in the environment.(12) Since humans often act as optimizers, reasoning about humans will likely involve reasoning about optimization. A system capable of reasoning about optimization is likely also capable of reusing that same machinery to do optimization itself, resulting in a mesa-optimizer. For example, it might be the case that predicting human behavior requires instantiating a process similar to human judgment, complete with internal motives for making one decision over another.

Thus, tasks that do not benefit from human modeling seem less likely to produce mesa-optimizers than those that do. Furthermore, there are many tasks that might benefit from human modeling that don't explicitly involve modeling humans—to the extent that the training distribution is generated by humans, for example, modeling humans might enable the generation of a very good prior for that distribution.

2.2. The base optimizer

It is likely that certain features of the base optimizer will influence how likely it is to generate a mesa-optimizer. First, though we largely focus on reinforcement learning in this sequence, RL is not necessarily the only type of machine learning where mesa-optimizers could appear. For example, it seems plausible that mesa-optimizers could appear in generative adversarial networks.

Second, we hypothesize that the details of a machine learning model's architecture will have a significant effect on its tendency to implement mesa-optimization. For example, a tabular model, which independently learns the correct output for every input, will never be a mesa-optimizer. On the other hand, if a hypothetical base optimizer looks for the program with the shortest source code that solves a task, then it is more plausible that this program will itself be an optimizer.(13) However, for realistic machine learning base optimizers, it is less clear to what extent mesa-optimizers will be selected for. Thus, we discuss some factors below that might influence the likelihood of mesa-optimization one way or the other.

Reachability. There are many kinds of optimization algorithms that a base optimizer could implement. However, almost every training strategy currently used in machine learning uses some form of local search (such as gradient descent or even genetic algorithms). Thus, it seems plausible that the training strategy of more advanced ML systems will also fall into this category. We will call this general class of optimizers that are based on local hill-climbing local optimization processes.

We can then formulate a notion of reachability, the difficulty for the base optimizer to find any given learned algorithm, which we can analyze in the case of a local optimization process. A local optimization process might fail to find a particular learned algorithm that would perform very well on the base objective if the learned algorithm is surrounded by other algorithms that perform poorly on the base objective. For a mesa-optimizer to be produced by a local optimization process, it needs to not only perform well on the base objective, but also be reachable; that is, there needs to be a path through the space of learned algorithms to it that is approximately monotonically increasing. Furthermore, the degree to which the path only need be approximate—that is, the degree to which ML training procedures can escape local optima—is likely to be critical, as optimization algorithms are complex enough that it might require a significant portion of the algorithm to be present before performance gains start being realized.

Algorithmic range. One key factor likely to determine the reachability of mesa-optimizers is the algorithmic range of the learned algorithms—that is, how extensive is the set of algorithms (or how expressive is the model space) capable of being found by the base optimizer. The more extensive a model's algorithmic range, the broader the space of possible learned algorithms, and thus the more likely that it will be able to find one that is a mesa-optimizer, assuming the base optimizer is incentivized to do so. For example, architectures that explicitly give the algorithm access to a wide range of possible computations, such as recurrent neural networks or neural Turing machines,(14) seem more likely to produce mesa-optimizers.

Inductive biases. Another important factor is the degree to which the base optimizer is explicitly or implicitly biased in various ways. The nature of these inductive biases will contribute to the likelihood of a mesa-optimizer being selected for. One of the most important kinds of inductive bias is simplicity bias, which would almost certainly be exhibited by almost all base optimizers. We identify three ways in which simplicity bias can manifest itself:

An explicit penalty due to parameter regularization or architectural constraints such as weight-sharing or sparse connections.
An implicit bias due to the model architecture. For example, it has been shown that neural networks are more likely to fit a simple function to a set of training data, even when no regularization is used.(15)
The capacity limitations of the model. The size of a model imposes a hard limit on the complexity of the functions it is able to represent. Thus, to the degree that the base optimizer is selecting based on performance, it will be driven to “squeeze out” as much performance as it can for any given model capacity, leading to a bias in favor of relatively compressed policies.

The more a base optimizer is biased towards simple solutions, the more it will be incentivized to find a compressed policy such as a mesa-optimizer.

The base optimizer could also be biased in other ways. For example, it could be biased towards algorithms with a low time or space complexity. If we view mesa-optimization as delegation of optimization work, a time penalty on the learned algorithm is likely to incentivize the base optimizer to do more precomputation itself by pre-computing relevant heuristics rather than delegating optimization work to the learned algorithm. Thus, we hypothesize that penalizing description length will favor mesa-optimizers while penalizing time complexity will disfavor them. This suggests the following speculative conjecture: neither a minimal-depth nor minimal-size boolean circuit that solves a problem can be a mesa-optimizer.(16)

Lastly, another form of bias that might have particularly interesting effects is the presence of an information funnel. In deep learning, the base optimizer directly selects the values of individual parameters. In biological evolution, by contrast, the base optimizer selects DNA strings, which in turn produce the detailed structure of the brain only indirectly. This implicitly introduces pressure towards compressing the brain's structure. As we noted previously, this might favor the development of mesa-optimizers, as they can be seen as a way of compressing complex policies. Because deep learning algorithms currently select weights directly, there is no such information funnel. In the case of meta-learning over a machine learning model's architecture, however, an information funnel might present itself due to the fact that the meta-learner has to learn a single policy for creating architectures that is capable of solving all of the problems it is presented with.

Statefulness. Another factor that could make it easier for a learned algorithm to implement a complex optimization process is the ability of the learned algorithm to save and recall information. The ability to refer back to previous state enables the caching of intermediate results, increasing the architecture's algorithmic range by giving the learned algorithm significantly more time to complete its computations. Statefulness, therefore, favors mesa-optimization by decreasing the implicit penalty on time complexity imposed by enforcing all computations to end when the learned algorithm produces its output. One simple form of this might be a recurrent neural network, which can perform computations with longer time horizons due to its ability to pass intermediate activations across different time steps. An even more extreme version of this might be a network that is able to store state not just across different time steps in the same episode, but also across episodes, enabling it to do more long-term learning, though such an architecture is currently completely hypothetical.

Hard-coded optimization. One possible means of alleviating some of these issues might be to include hard-coded optimization where the learned algorithm provides only the objective function and not the optimization algorithm. The stronger the optimization performed explicitly, the less strong the optimization performed implicitly by the learned algorithm needs to be. For example, architectures that explicitly perform optimization that is relevant for the task—such as hard-coded Monte Carlo tree search—might decrease the benefit of mesa-optimizers by reducing the need for optimization other than that which is explicitly programmed into the system.

The third post in the Risks from Learned Optimization Sequence, titled “The Inner Alignment Problem,” can be found here.

Glossary | Bibliography

As of the date of this post. Note that we do examine some existing machine learning systems that we believe are close to producing mesa-optimization in post 5. ↩︎
It is worth noting that the same argument also holds for achieving an average-case guarantee. ↩︎
Assuming reasonable computational constraints.. ↩︎
This definition of $N$ is somewhat vague, as there are multiple different levels at which one can chunk an environment into instances. For example, one environment could always have the same high-level features but completely random low-level features, whereas another could have two different categories of instances that are broadly self-similar but different from each other, in which case it's unclear which has a larger $N$ . However, one can simply imagine holding $N$ constant for all levels but one and just considering how environment diversity changes on that level. ↩︎
Note that this makes the implicit assumption that the amount of optimization power required to find a mesa-optimizer capable of performing $x$ bits of optimization is independent of $N$ . The justification for this is that optimization is a general algorithm that looks the same regardless of what environment it is applied to, so the amount of optimization required to find an $x$ -bit optimizer should be relatively independent of the environment. That being said, it won't be completely independent, but as long as the primary difference between environments is how much optimization they need, rather than how hard it is to do optimization, the model presented here should hold. ↩︎
Note, however, that there will be some maximum $x$ simply because the learned algorithm generally only has access to so much computational power. ↩︎
Subject to the constraint that $P - f (x) \geq 0$ . ↩︎

Risks from Learned Optimization: Introduction

chavam — 2019-05-31T23:44:53.703Z

This is the first of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper.

Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, and Joar Skalse contributed equally to this sequence. With special thanks to Paul Christiano, Eric Drexler, Rob Bensinger, Jan Leike, Rohin Shah, William Saunders, Buck Shlegeris, David Dalrymple, Abram Demski, Stuart Armstrong, Linda Linsefors, Carl Shulman, Toby Ord, Kate Woolverton, and everyone else who provided feedback on earlier versions of this sequence.

Motivation

The goal of this sequence is to analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer—a situation we refer to as mesa-optimization, a neologism we introduce in this sequence. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be—how will it differ from the loss function it was trained under—and how can it be aligned?

We believe that this sequence presents the most thorough analysis of these questions that has been conducted to date. In particular, we present not only an introduction to the basic concerns surrounding mesa-optimizers, but also an analysis of the particular aspects of an AI system that we believe are likely to make the problems related to mesa-optimization relatively easier or harder to solve. By providing a framework for understanding the degree to which different AI systems are likely to be robust to misaligned mesa-optimization, we hope to start a discussion about the best ways of structuring machine learning systems to solve these problems. Furthermore, in the fourth post we will provide what we think is the most detailed analysis yet of a problem we refer as deceptive alignment which we posit may present one of the largest—though not necessarily insurmountable—current obstacles to producing safe advanced machine learning systems using techniques similar to modern machine learning.

Two questions

In machine learning, we do not manually program each individual parameter of our models. Instead, we specify an objective function that captures what we want the system to do and a learning algorithm to optimize the system for that objective. In this post, we present a framework that distinguishes what a system is optimized to do (its “purpose”), from what it optimizes for (its “goal”), if it optimizes for anything at all. While all AI systems are optimized for something (have a purpose), whether they actually optimize for anything (pursue a goal) is non-trivial. We will say that a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system. Learning algorithms in machine learning are optimizers because they search through a space of possible parameters—e.g. neural network weights—and improve the parameters with respect to some objective. Planning algorithms are also optimizers, since they search through possible plans, picking those that do well according to some objective.

Whether a system is an optimizer is a property of its internal structure—what algorithm it is physically implementing—and not a property of its input-output behavior. Importantly, the fact that a system’s behavior results in some objective being maximized does not make the system an optimizer. For example, a bottle cap causes water to be held inside the bottle, but it is not optimizing for that outcome since it is not running any sort of optimization algorithm.(1) Rather, bottle caps have been optimized to keep water in place. The optimizer in this situation is the human that designed the bottle cap by searching through the space of possible tools for one to successfully hold water in a bottle. Similarly, image-classifying neural networks are optimized to achieve low error in their classifications, but are not, in general, themselves performing optimization.

However, it is also possible for a neural network to itself run an optimization algorithm. For example, a neural network could run a planning algorithm that predicts the outcomes of potential plans and searches for those it predicts will result in some desired outcome.^[1] Such a neural network would itself be an optimizer because it would be searching through the space of possible plans according to some objective function. If such a neural network were produced in training, there would be two optimizers: the learning algorithm that produced the neural network—which we will call the base optimizer—and the neural network itself—which we will call the mesa-optimizer.^[2]

The possibility of mesa-optimizers has important implications for the safety of advanced machine learning systems. When a base optimizer generates a mesa-optimizer, safety properties of the base optimizer's objective may not transfer to the mesa-optimizer. Thus, we explore two primary questions related to the safety of mesa-optimizers:

Mesa-optimization: Under what circumstances will learned algorithms be optimizers?
Inner alignment: When a learned algorithm is an optimizer, what will its objective be, and how can it be aligned?

Once we have introduced our framework in this post, we will address the first question in the second, begin addressing the second question in the third post, and finally delve deeper into a specific aspect of the second question in the fourth post.

1.1. Base optimizers and mesa-optimizers

Conventionally, the base optimizer in a machine learning setup is some sort of gradient descent process with the goal of creating a model designed to accomplish some specific task.

Sometimes, this process will also involve some degree of meta-optimization wherein a meta-optimizer is tasked with producing a base optimizer that is itself good at optimizing systems to achieve particular goals. Specifically, we will think of a meta-optimizer as any system whose task is optimization. For example, we might design a meta-learning system to help tune our gradient descent process.(4) Though the model found by meta-optimization can be thought of as a kind of learned optimizer, it is not the form of learned optimization that we are interested in for this sequence. Rather, we are concerned with a different form of learned optimization which we call mesa-optimization.

Mesa-optimization is a conceptual dual of meta-optimization—whereas meta is Greek for “after,” mesa is Greek for “within.”^[3] Mesa-optimization occurs when a base optimizer (in searching for algorithms to solve some problem) finds a model that is itself an optimizer, which we will call a mesa-optimizer. Unlike meta-optimization, in which the task itself is optimization, mesa-optimization is task-independent, and simply refers to any situation where the internal structure of the model ends up performing optimization because it is instrumentally useful for solving the given task.

In such a case, we will use base objective to refer to whatever criterion the base optimizer was using to select between different possible systems and mesa-objective to refer to whatever criterion the mesa-optimizer is using to select between different possible outputs. In reinforcement learning (RL), for example, the base objective is generally the expected return. Unlike the base objective, the mesa-objective is not specified directly by the programmers. Rather, the mesa-objective is simply whatever objective was found by the base optimizer that produced good performance on the training environment. Because the mesa-objective is not specified by the programmers, mesa-optimization opens up the possibility of a mismatch between the base and mesa- objectives, wherein the mesa-objective might seem to perform well on the training environment but lead to bad performance off the training environment. We will refer to this case as pseudo-alignment below.

There need not always be a mesa-objective since the algorithm found by the base optimizer will not always be performing optimization. Thus, in the general case, we will refer to the model generated by the base optimizer as a learned algorithm, which may or may not be a mesa-optimizer.

Figure 1.1. The relationship between the base and mesa- optimizers. The base optimizer optimizes the learned algorithm based on its performance on the base objective. In order to do so, the base optimizer may have turned this learned algorithm into a mesa-optimizer, in which case the mesa-optimizer itself runs an optimization algorithm based on its own mesa-objective. Regardless, it is the learned algorithm that directly takes actions based on its input.

Possible misunderstanding: “mesa-optimizer” does not mean “subsystem” or “subagent.” In the context of deep learning, a mesa-optimizer is simply a neural network that is implementing some optimization process and not some emergent subagent inside that neural network. Mesa-optimizers are simply a particular type of algorithm that the base optimizer might find to solve its task. Furthermore, we will generally be thinking of the base optimizer as a straightforward optimization algorithm, and not as an intelligent agent choosing to create a subagent.^[4]

We distinguish the mesa-objective from a related notion that we term the behavioral objective. Informally, the behavioral objective is the objective which appears to be optimized by the system’s behavior. We can operationalize the behavioral objective as the objective recovered from perfect inverse reinforcement learning (IRL).^[5] This is in contrast to the mesa-objective, which is the objective actively being used by the mesa-optimizer in its optimization algorithm.

Arguably, any possible system has a behavioral objective—including bricks and bottle caps. However, for non-optimizers, the appropriate behavioral objective might just be “1 if the actions taken are those that are in fact taken by the system and 0 otherwise,”^[6] and it is thus neither interesting nor useful to know that the system is acting to optimize this objective. For example, the behavioral objective “optimized” by a bottle cap is the objective of behaving like a bottle cap.^[7] However, if the system is an optimizer, then it is more likely that it will have a meaningful behavioral objective. That is, to the degree that a mesa-optimizer’s output is systematically selected to optimize its mesa-objective, its behavior may look more like coherent attempts to move the world in a particular direction.^[8]

A given mesa-optimizer’s mesa-objective is determined entirely by its internal workings. Once training is finished and a learned algorithm is selected, its direct output—e.g. the actions taken by an RL agent—no longer depends on the base objective. Thus, it is the mesa-objective, not the base objective, that determines a mesa-optimizer’s behavioral objective. Of course, to the degree that the learned algorithm was selected on the basis of the base objective, its output will score well on the base objective. However, in the case of a distributional shift, we should expect a mesa-optimizer’s behavior to more robustly optimize for the mesa-objective since its behavior is directly computed according to it.

As an example to illustrate the base/mesa distinction in a different domain, and the possibility of misalignment between the base and mesa- objectives, consider biological evolution. To a first approximation, evolution selects organisms according to the objective function of their inclusive genetic fitness in some environment.^[9] Most of these biological organisms—plants, for example—are not “trying” to achieve anything, but instead merely implement heuristics that have been pre-selected by evolution. However, some organisms, such as humans, have behavior that does not merely consist of such heuristics but is instead also the result of goal-directed optimization algorithms implemented in the brains of these organisms. Because of this, these organisms can perform behavior that is completely novel from the perspective of the evolutionary process, such as humans building computers.

However, humans tend not to place explicit value on evolution’s objective, at least in terms of caring about their alleles' frequency in the population. The objective function stored in the human brain is not the same as the objective function of evolution. Thus, when humans display novel behavior optimized for their own objectives, they can perform very poorly according to evolution’s objective. Making a decision not to have children is a possible example of this. Therefore, we can think of evolution as a base optimizer that produced brains—mesa-optimizers—which then actually produce organisms’ behavior—behavior that is not necessarily aligned with evolution.

1.2. The inner and outer alignment problems

In “Scalable agent alignment via reward modeling,” Leike et al. describe the concept of the “reward-result gap” as the difference between the (in their case learned) “reward model” (what we call the base objective) and the “reward function that is recovered with perfect inverse reinforcement learning” (what we call the behavioral objective).(8) That is, the reward-result gap is the fact that there can be a difference between what a learned algorithm is observed to be doing and what the programmers want it to be doing.

The problem posed by misaligned mesa-optimizers is a kind of reward-result gap. Specifically, it is the gap between the base objective and the mesa-objective (which then causes a gap between the base objective and the behavioral objective). We will call the problem of eliminating the base-mesa objective gap the inner alignment problem, which we will contrast with the outer alignment problem of eliminating the gap between the base objective and the intended goal of the programmers. This terminology is motivated by the fact that the inner alignment problem is an alignment problem entirely internal to the machine learning system, whereas the outer alignment problem is an alignment problem between the system and the humans outside of it (specifically between the base objective and the programmer’s intentions). In the context of machine learning, outer alignment refers to aligning the specified loss function with the intended goal, whereas inner alignment refers to aligning the mesa-objective of a mesa-optimizer with the specified loss function.

It might not be necessary to solve the inner alignment problem in order to produce safe, highly capable AI systems, as it might be possible to prevent mesa-optimizers from occurring in the first place. If mesa-optimizers cannot be reliably prevented, however, then some solution to both the outer and inner alignment problems will be necessary to ensure that mesa-optimizers are aligned with the intended goal of the programmers.

1.3. Robust alignment vs. pseudo-alignment

Given enough training, a mesa-optimizer should eventually be able to produce outputs that score highly on the base objective on the training distribution. Off the training distribution, however—and even on the training distribution while it is still early in the training process—the difference could be arbitrarily large. We will use the term robustly aligned to refer to mesa-optimizers with mesa-objectives that robustly agree with the base objective across distributions and the term pseudo-aligned to refer to mesa-optimizers with mesa-objectives that agree with the base objective on past training data, but not robustly across possible future data (either in testing, deployment, or further training). For a pseudo-aligned mesa-optimizer, there will be environments in which the base and mesa- objectives diverge. Pseudo-alignment, therefore, presents a potentially dangerous robustness problem since it opens up the possibility of a machine learning system that competently takes actions to achieve something other than the intended goal when off the training distribution. That is, its capabilities might generalize while its objective does not.

For a toy example of what pseudo-alignment might look like, consider an RL agent trained on a maze navigation task where all the doors during training happen to be red. Let the base objective (reward function) be . On the training distribution, this objective is equivalent to $O alt = (1 if reached something red, 0 otherwise)$ . Consider what would happen if an agent, trained to high performance on $O base$ on this task, were put in an environment where the doors are instead blue, and with some red objects that are not doors. It might generalize on $O base$ , reliably navigating to the blue door in each maze (robust alignment). But it might also generalize on $O alt$ instead of $O base$ , reliably navigating each maze to reach red objects (pseudo-alignment).^[10]

1.4. Mesa-optimization as a safety problem

If pseudo-aligned mesa-optimizers may arise in advanced ML systems, as we will suggest, they could pose two critical safety problems.

Unintended optimization. First, the possibility of mesa-optimization means that an advanced ML system could end up implementing a powerful optimization procedure even if its programmers never intended it to do so. This could be dangerous if such optimization leads the system to take extremal actions outside the scope of its intended behavior in trying to maximize its mesa-objective. Of particular concern are optimizers with objective functions and optimization procedures that generalize to the real world. The conditions that lead a learning algorithm to find mesa-optimizers, however, are very poorly understood. Knowing them would allow us to predict cases where mesa-optimization is more likely, as well as take measures to discourage mesa-optimization from occurring in the first place. The second post will examine some features of machine learning algorithms that might influence their likelihood of finding mesa-optimizers.

Inner alignment. Second, even in cases where it is acceptable for a base optimizer to find a mesa-optimizer, a mesa-optimizer might optimize for something other than the specified reward function. In such a case, it could produce bad behavior even if optimizing the correct reward function was known to be safe. This could happen either during training—before the mesa-optimizer gets to the point where it is aligned over the training distribution—or during testing or deployment when the system is off the training distribution. The third post will address some of the different ways in which a mesa-optimizer could be selected to optimize for something other than the specified reward function, as well as what attributes of an ML system are likely to encourage this. In the fourth post, we will discuss a possible extreme inner alignment failure—which we believe presents one of the most dangerous risks along these lines—wherein a sufficiently capable misaligned mesa-optimizer could learn to behave as if it were aligned without actually being robustly aligned. We will call this situation deceptive alignment.

It may be that pseudo-aligned mesa-optimizers are easy to address—if there exists a reliable method of aligning them, or of preventing base optimizers from finding them. However, it may also be that addressing misaligned mesa-optimizers is very difficult—the problem is not sufficiently well-understood at this point for us to know. Certainly, current ML systems do not produce dangerous mesa-optimizers, though whether future systems might is unknown. It is indeed because of these unknowns that we believe the problem is important to analyze.

The second post in the Risks from Learned Optimization Sequence, titled “Conditions for Mesa-Optimization,” can be found here.

Glossary | Bibliography

As a concrete example of what a neural network optimizer might look like, consider TreeQN.(2) TreeQN, as described in Farquhar et al., is a Q-learning agent that performs model-based planning (via tree search in a latent representation of the environment states) as part of its computation of the Q-function. Though their agent is an optimizer by design, one could imagine a similar algorithm being learned by a DQN agent with a sufficiently expressive approximator for the Q function. Universal Planning Networks, as described by Srinivas et al.,(3) provide another example of a learned system that performs optimization, though the optimization there is built-in in the form of SGD via automatic differentiation. However, research such as that in Andrychowicz et al.(4) and Duan et al.(5) demonstrate that optimization algorithms can be learned by RNNs, making it possible that a Universal Planning Networks-like agent could be entirely learned—assuming a very expressive model space—including the internal optimization steps. Note that while these examples are taken from reinforcement learning, optimization might in principle take place in any sufficiently expressive learned system. ↩︎
Previous work in this space has often centered around the concept of “optimization daemons,”(6) a framework that we believe is potentially misleading and hope to supplant. Notably, the term “optimization daemon” came out of discussions regarding the nature of humans and evolution, and, as a result, carries anthropomorphic connotations. ↩︎
The duality comes from thinking of meta-optimization as one layer above the base optimizer and mesa-optimization as one layer below. ↩︎
That being said, some of our considerations do still apply even in that case. ↩︎
Leike et al.(8) introduce the concept of an objective recovered from perfect IRL. ↩︎
For the formal construction of this objective, see pg. 6 in Leike et al.(8) ↩︎
This objective is by definition trivially optimal in any situation that the bottlecap finds itself in. ↩︎
Ultimately, our worry is optimization in the direction of some coherent but unsafe objective. In this sequence, we assume that search provides sufficient structure to expect coherent objectives. While we believe this is a reasonable assumption, it is unclear both whether search is necessary and whether it is sufficient. Further work examining this assumption will likely be needed. ↩︎
The situation with evolution is more complicated than is presented here and we do not expect our analogy to live up to intense scrutiny. We present it as nothing more than that: an evocative analogy (and, to some extent, an existence proof) that explains the key concepts. More careful arguments are presented later. ↩︎
Of course, it might also fail to generalize at all. ↩︎

Alignment problems for economists

chavam — 2018-07-10T23:43:56.662Z

AI alignment is a multidisciplinary research program. This means that there is potentially relevant knowledge and skill scattered across different disciplines. But it also means that people schooled only in narrow disciplines will experience a hurdle when they would work on a problem in AI alignment. One such discipline is economics, from which decision theory and game theory originated.

In this post I want to explore the idea that we should try to create a collection of “alignment-problems-for-economists”, packaged in a way that economists who have relevant knowledge and skill but don't understand ML/CS/AF can work on them.

There seem to be sub-problems in AI alignment that economists might be able to work on. However, out of the economists that I’ve spoken to, some are enthusiastic about this but see it as a personal career-risk to work on it as they do not understand the computer science. So if we can take subproblems in alignment, and package them in a way that economists can immediately start working on them, then we might be able to utilize intellectual resources (economists) that would otherwise have worked on something different.

Two types of economists to target

1. Economists who also to a degree understand basic ML/CS

2. Economists who do not.

I don’t find it very plausible that we could find sub-problems for the second type to work on, but it doesn’t seem entirely impossible: there could be certain specific problems in mechanism design or social choice or so, that would be useful for alignment but don’t require any ML/CS.

Properties of alignment-problems-for-economists that are desirable:

1. Publishable in economics journals. I have spoken to economists that are interested in the alignment problem, but they are hesitant to work on it: It is a risky career move to work on alignment if they cannot publish in journals that they are used to.

2. High work/statement ratio. How long will it take to solve the problem, versus providing the statement of the problem? If 90% of the problem is to state it in a form so that an economist could work on it, then it would likely not be efficient to do so. It should be a problem that can relatively easily be communicated clearly to an economist, while taking more time to solve.

3. No strong reliance on CS/ML tools. Many economists are somewhat familiar with basic ML techniques, but if a problem relies too much on knowledge of CS or ML, this increases the career-risk of the problem.

4. Not necessarily specifically x-risk related. If a problem in alignment is not specifically x-risk related, it is less/not embarrassing to work on it, and therefore less of a career-risk. Nevertheless, most problems in AI alignment seem important even if you don't believe that AI poses an x-risk. I don't think this requirement is that important.

* Does not have to be high-impact. If a problem has only a small chance of being somewhat impactful, it might still be worth packaging it as an economic problem, since the economists who could work on it would not otherwise work on alignment problems at all.

I do not yet have a list of such problems, but it seems that it might be possible to make one:

For example, economists might work on problems in mechanism design and social choice for AGI’s in a virtual containment. For example, can we create mechanisms with desirable properties for the amplification phase in Christiano’s program, to align a collection of distilled agents? Can we prove that such mechanisms are robust under certain assumptions? Can we create mechanisms that robustly incentivizes AGI’s with unaligned utility functions to tell us the truth? Can we use social choice to find out properties of agents that consist of sub-agents?

Economists work on strategic communication between agents (cheap talk), which might be helpful in the design of safe containment systems of not-superintelligent AGI. Information economics works on game theoretic properties of different allocations of information, and might be useful in such mechanisms as well. Economists also work on voting, and decision theory.

I want your feedback:

1. What kind of problems have you encountered that might be added to this list?

2. Do you have reasons to think that this project would be doomed to fail (or not)? If so, I want to prevent wasting time on it as fast as possible. Despite having written this post, I don’t assign a high probability of success, but I’d like people’s views.