Parsing Abram on Gradations of Inner Alignment Obstacles

post by alexflint · 2021-05-04T17:44:16.858Z · LW · GW · 4 comments

Contents

  Outline
  Mesa-optimizers
  Mesa-controllers and mesa-searchers
  Compression and explicit goals
  Deception and GPT-3
  The lottery ticket hypothesis
  Conclusion

This is independent research. To make further posts like this possible, please consider supporting me.

Epistemic status: This is my highly imperfect understanding of a lot of detailed thinking by Abram, Evan, John, and others.


Outline

Mesa-optimizers

First, a quick bit of background on the inner optimizer paper.

When we search over a large space of algorithms for one that performs well on a task, we may find one that, if deployed, would not just perform well on the task but would also exert unexpected influence over the world we live in. For example, while designing a robot vacuum we might search over policies that process observations from the robot’s sensors and control the motor outputs, evaluating each according to how effectively it vacuums the floor. But a sufficiently exhaustive search over a sufficiently expressive hypothesis space could turn up a policy that internally builds up an understanding of the environment being vacuumed and its human inhabitants, and chooses the actions that it expects to accomplish an objective.

The question is then: is this internal goal actually the goal of cleaning the floor? And more broadly: what kind of capacity does this robot vacuum have to influence the future? Will it, for example, build a sufficiently sophisticated model of houses and humans that it discovers it can achieve its objective by subtly dissuading humans from bringing messy food items into carpeted areas at all? If so, then by searching over policies to control the motor outputs of our robot vacuum we have accidentally turned up a policy that exerts a much more general level of influence over the future than we anticipated, and we might be concerned about the direction in which this method of designing robot vacuums is heading.

So we would like to avoid unwittingly deploying mesa-optimizers into the world. We could do that either by developing a detector for mesa-optimizers so that we can check the policies we find during search, or by developing a training process that does not produce mesa-optimizers in the first place.

Mesa-controllers and mesa-searchers

Next, a bit of background from a previous post by Abram [LW · GW]:

Abram is concerned not just about policies that influence the future via explicit internal search, such as a policy that evaluates possible plans, evaluates the likely consequences of each one, and picks the one whose consequences match an internal goal, but also about policies that exert goal-directed influence over the future in any other way. For example, a thermostat consistently brings a room to a particular temperature. It does not accomplish this by evaluating the possible actions it might take and picking the one that seems most likely to bring the room to a desired temperature, but it does in fact bring the room to a consistent temperature. In the example above, when we search over policies for our robot vacuum, we might turn up one that performs no explicit internal search over possible plans, yet pushes the state of the world towards some target state in the same way that a thermostat pushes the temperature of a room towards a target temperature. Such a system might exert a level of influence over the future that was not expected by the engineers who designed it, and thus might pose dangers.
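The controller/searcher distinction can be sketched in code. This is a toy illustration of my own, not anything from Abram's post; all names and dynamics are made up:

```python
# A mesa-controller vs. a mesa-searcher, in miniature.
# Both reliably push the world (here, a temperature) toward the same
# target state, but only one performs explicit search over actions.

def thermostat_step(temp, target=20.0, gain=0.1):
    """Mesa-controller: a fixed feedback rule. It never enumerates or
    evaluates plans; the goal-directedness lives in the dynamics."""
    return temp + gain * (target - temp)

def searcher_step(temp, actions=(-1.0, 0.0, 1.0), target=20.0):
    """Mesa-searcher: explicitly predicts the outcome of each candidate
    action, scores it against an internal objective, picks the best."""
    def predicted_outcome(action):
        return temp + action

    def objective(outcome):
        return -abs(outcome - target)  # internal goal: be near target

    best = max(actions, key=lambda a: objective(predicted_outcome(a)))
    return predicted_outcome(best)

# Both converge on the same target state from the same starting point.
temp_controller = temp_searcher = 25.0
for _ in range(100):
    temp_controller = thermostat_step(temp_controller)
    temp_searcher = searcher_step(temp_searcher)
```

From the outside, both policies look like they "want" the room at 20 degrees; only the second has anything resembling plans and an explicit evaluation of them.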

So we might want to broaden our goal from avoiding mesa-searchers to avoiding mesa-controllers, of which mesa-searchers are a special case.

Compression and explicit goals

Now we come to the content of Gradations of Inner Alignment Obstacles [LW · GW].

The problem with avoiding all mesa-controllers, as Richard Ngo points out [LW(p) · GW(p)], is that almost any competent policy could be seen as a mesa-controller since most tasks involve pushing the world towards some state or other, so avoiding all mesa-controllers would mean avoiding all competent policies. For example, when we search over policies for our robot vacuum, we hope to find one that is efficient in its floor-vacuuming pattern but that does not build sophisticated models of houses or humans, nor exert influence in surprising ways over the world. But even a benign robot vacuum policy should still push the state of the world towards one in which the floor is clean, so could still be considered a mesa-controller. So we need a more precise definition of what kind of policies we are trying to avoid.

Abram suggests that what we really want to avoid is policies that explicitly represent their objectives. This is because policies that steer the world in different directions under different conditions are not particularly worrisome, whereas policies that consistently push the world towards the same objective from anywhere within a wide region of configuration space are powerful and therefore dangerous. And in order to push the world towards a consistent objective, Abram hypothesizes that such policies will need explicit internal representations of their objective.
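One way to picture "explicitly represented objective" is as a retargetable internal datum, as opposed to behavior baked directly into the policy's rules. The sketch below is my own illustration, not Abram's formalism; all names are hypothetical:

```python
# Two toy policies. The first has no explicit goal: what it "wants" is
# implicit in its hard-coded rule, and can only be changed by rewriting
# the rule. The second stores its objective as data, and the same
# machinery steers toward whatever state that datum names -- the kind of
# consistent, retargetable push Abram flags as powerful and dangerous.

def hardcoded_vacuum_policy(observation):
    """No explicit objective; behavior is the rule itself."""
    return "move_forward" if observation == "dirty" else "turn"

class GoalDirectedPolicy:
    """Objective is an explicit, inspectable, retargetable value."""

    def __init__(self, goal_state):
        self.goal = goal_state  # the explicitly represented objective

    def act(self, observation):
        # Steer toward the stored goal from any starting condition.
        return "noop" if observation == self.goal else "steer_toward_goal"
```

On this picture, a detector for dangerous mesa-controllers might look for something playing the role of `self.goal`: a stable internal representation that the rest of the computation consistently optimizes toward.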

So the picture that emerges seems to be:

Deception and GPT-3

Abram then continues to apply this framework to the question of whether GPT-3 is currently deceiving us. My understanding of his reasoning is as follows.

The lottery ticket hypothesis

The lottery ticket hypothesis, as originally articulated by Frankle and Carbin, is as follows:

Dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that -- when trained in isolation -- reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective.

That is, we can think of a dense neural network as a weighted combination of a huge number of sparse subnetworks. When we randomly initialize the dense network at the beginning of training, we are also randomly initializing each of these sparse subnetworks. The lottery ticket hypothesis says that there is some as-yet-unknown property of a sparse subnetwork's initialization that makes it trainable, and that by independently randomly initializing this huge number of subnetworks we are buying a huge number of lottery tickets, hoping that at least one has this trainability property. If this hypothesis is true then we might view the entire training process as essentially being about identifying which subnetworks have this trainability property and training them, while down-weighting all the other subnetworks.
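A toy numpy sketch of this mask-based view (the actual Frankle & Carbin experiments use iterative magnitude pruning on real networks over many training rounds; everything below is illustrative, with made-up sizes and a stand-in for training):

```python
import numpy as np

# View a dense weight matrix as a superposition of sparse subnetworks,
# each picked out by a binary mask over the weights.
rng = np.random.default_rng(0)

init_weights = rng.normal(size=(8, 8))  # random initialization
# Stand-in for training: in reality this would be gradient descent.
trained_weights = init_weights + rng.normal(scale=0.5, size=(8, 8))

# Magnitude pruning: keep the k largest trained weights. The resulting
# mask defines a sparse subnetwork (a "ticket"), which the LTH procedure
# then rewinds to its *original* initialization and retrains in isolation.
k = 16
threshold = np.sort(np.abs(trained_weights).ravel())[-k]
mask = (np.abs(trained_weights) >= threshold).astype(float)

# The winning ticket: the sparse subnetwork at its lottery-winning init.
winning_ticket = mask * init_weights
```

The hypothesis, in these terms, is that `winning_ticket` trains about as well as the full dense network, because its initial weights happened to have the (unknown) trainability property.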

Abram points out that this is concerning from the perspective of deception and mesa-optimizers since there may be deceptive subnetworks within any randomly initialized dense network, as suggested by the feasibility of data poisoning backdoor attacks [LW(p) · GW(p)]. This is bad news for approaches to avoiding deception that work by not introducing deception into networks during training, since not introducing deception during training won’t be enough if there are already deceptive subnetworks present at the beginning of training.

Conclusion

I hope that this attempt at summarization sheds more light than confusion on the topics of inner optimization, agency, and deception. The basic situation here, as I see it, is that we are looking at machine learning systems that are conducting ever more powerful searches over ever more expressive hypothesis spaces and asking:

4 comments


comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-05-04T19:57:17.138Z · LW(p) · GW(p)

I think Abram's concern about the lottery ticket hypothesis wasn't about the "vanilla" LTH that you discuss, but rather the scarier "tangent space hypothesis." See this comment thread. [LW · GW]

comment by alexflint · 2021-05-04T23:20:56.567Z · LW(p) · GW(p)

Thank you for the pointer. Why is the tangent space hypothesis version of the LTH scarier?

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-05-05T04:27:01.324Z · LW(p) · GW(p)

Well, it seems to be saying that the training process basically just throws away all the tickets that score less than perfectly, and randomly selects one of the rest. This means that tickets which are deceptive agents and whatnot are in there from the beginning, and if they score well, then they have as much chance of being selected at the end as anything else that scores well. And since we should expect deceptive agents that score well to outnumber aligned agents that score well... we should expect deception.

I'm working on a much more fleshed out and expanded version of this argument right now.

comment by alexflint · 2021-05-05T21:00:29.089Z · LW(p) · GW(p)

Yeah right, that is scarier. Looking forward to reading your argument, esp re why we would expect deceptive agents that score well to outnumber aligned agents that score well.

Although in the same sense we could say that a rock “contains” many deceptive agents, since if we viewed the rock as a giant mixture of computations then we would surely find some that implement deceptive agents.