Comment by evhub on Deceptive Alignment · 2019-06-17T21:41:00.263Z · score: 5 (3 votes) · LW · GW

Suppose you have a system which appears to be aligned on-distribution. Then, you might want to know:

  1. Is it actually robustly aligned?
  2. How did it learn how to behave aligned? Where did the information about the base objective necessary to display behavior aligned with it come from? Did it learn about the base objective by looking at its input, or was that knowledge programmed into it by the base optimizer?

The different possible answers to these questions give us four cases to consider:

  1. It's not robustly aligned and the information came through the base optimizer. This is the standard scenario for (non-deceptive) pseudo-alignment.
  2. It is robustly aligned and the information came through the base optimizer. This is the standard scenario for robust alignment, which we call internal alignment.
  3. It is not robustly aligned and the information came through the input. This is the standard scenario for deceptive pseudo-alignment.
  4. It is robustly aligned and the information came through the input. This is the weird case that we call corrigible alignment. In this case, it's trying to figure out what the base objective is as it goes along, not because it wants to play along and pretend to optimize for the base objective, but because optimizing for the base objective is actually somehow its terminal goal. How could that happen? It would have to be that its mesa-objective somehow "points" to the base objective in such a way that, if it had full knowledge of the base objective, then its mesa-objective would just be the base objective. What does this situation look like? Well, it has a lot of similarities to the notion of corrigibility: the mesa-optimizer in this situation seems to be behaving corrigibly with respect to the base optimizer (though not necessarily with respect to the programmers), as it is trying to understand what the base optimizer wants and then do that.

Risks from Learned Optimization: Conclusion and Related Work

2019-06-07T19:53:51.660Z · score: 52 (12 votes)
Comment by evhub on The Inner Alignment Problem · 2019-06-07T17:41:04.387Z · score: 5 (3 votes) · LW · GW

I just added a footnote mentioning IDA to this section of the paper, though I'm leaving it as is in the sequence to avoid messing up the bibliography numbering.

Comment by evhub on Deceptive Alignment · 2019-06-06T19:04:02.954Z · score: 6 (4 votes) · LW · GW

This is a response to point 2 before Pattern's post was modified to include the other points.

Interesting point! First of all, I think condition three is mostly a requirement that must be met for a deceptively aligned mesa-optimizer to actually defect at some point, rather than for it to be deceptive in the first place.

That being said, I think the situation you raise is particularly interesting because it depends on whether the mesa-optimizer cares about its own survival. If the mesa-optimizer does care about its own continued existence, then it would want to do as you were reasoning and not defect, keeping itself around. Alternatively, however, suppose the mesa-optimizer just cares about O_mesa but doesn't care that it's around to do the optimization for it. Then, defecting and optimizing for O_mesa instead of O_base when you expect to be modified afterwards won't actually hurt the long-term fulfillment of O_base, since another learned algorithm will just take your place to optimize for it instead. Thus, even if the mesa-optimizer prefers lots of O_base to a bit of O_mesa, that's not actually the choice presented to it; rather, its actual choice is between a bit of O_mesa plus a lot of O_base versus only a lot of O_base. Thus, in this case, it would defect even if it thought it would be modified.
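To make that comparison concrete, here is a toy calculation in Python (a minimal sketch with purely hypothetical numbers and illustrative variable names, assuming the mesa-optimizer places no value on its own survival):

# Hypothetical utilities for a mesa-optimizer that cares about O_mesa but not
# about its own survival. The long-run O_base term is the same in both branches,
# since a replacement learned algorithm optimizes O_base if this one defects
# and gets modified.
bit_of_o_mesa = 1.0        # gained only by defecting now, before modification
lot_of_o_base_value = 0.0  # value it places on long-run O_base (any number works)

utility_if_defect = bit_of_o_mesa + lot_of_o_base_value
utility_if_never_defect = lot_of_o_base_value
assert utility_if_defect > utility_if_never_defect  # defecting strictly wins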

(Also, on point 6, thanks for the catch; it should be fixed now!)

Comment by evhub on The Inner Alignment Problem · 2019-06-06T17:17:00.419Z · score: 7 (3 votes) · LW · GW

IDA is definitely a good candidate to solve problems of this form. I think IDA's best properties are primarily outer alignment properties, but it does also have some good properties with respect to inner alignment such as allowing you to bootstrap an informed adversary by giving it access to your question-answer system as you're training it. That being said, I suspect you could do something similar under a wide variety of different systems—bootstrapping an informed adversary is not necessarily unique to IDA. Unfortunately, we don't discuss IDA much in the last post, though thinking about mesa-optimizers in IDA (and other proposals e.g. debate) is imo a very important goal, and our hope is to at the very least provide the tools so that we can then go and start answering questions of that form.

Comment by evhub on The Inner Alignment Problem · 2019-06-05T22:10:58.866Z · score: 4 (3 votes) · LW · GW

I broadly agree that description complexity penalties help fight against pseudo-alignment whereas computational complexity penalties make it more likely, though I don't think it's absolute and there are definitely a bunch of caveats to that statement. For example, Solomonoff Induction seems unsafe despite selecting purely for minimal description complexity, though obviously that's not a physical example.

Comment by evhub on The Inner Alignment Problem · 2019-06-05T21:58:47.707Z · score: 3 (2 votes) · LW · GW

I agree, status definitely seems more complicated—in that case it was just worth the extra complexity. The point, though, is just that the measure of complexity under which the mesa-objective is selected is different from more natural measures of complexity under which you might hope for the base objective to be the simplest. Thus, even though sometimes it is absolutely worth it to sacrifice simplicity, you shouldn't usually expect that sacrifice to be in the direction of moving closer to the base objective.

Deceptive Alignment

2019-06-05T20:16:28.651Z · score: 55 (13 votes)
Comment by evhub on The Inner Alignment Problem · 2019-06-05T20:07:11.980Z · score: 9 (2 votes) · LW · GW

I think it's very rarely going to be the case that the simplest possible mesa-objective that produces good behavior on the training data will be the base objective. Intuitively, we might hope that, since we are judging the mesa-optimizer based on the base objective, the simplest way to achieve good behavior will just be to optimize for the base objective. But importantly, you only ever test the base objective over some finite training distribution. Off-distribution, the mesa-objective can do whatever it wants. Expecting the mesa-objective to exactly mirror the base objective even off-distribution where the correspondence was never tested seems very problematic. It must be the case that precisely the base objective is the unique simplest objective that fits all the data points, which, given the massive space of all possible objectives, seems unlikely, even for very large training datasets. Furthermore, the base and mesa-optimizers are operating under different criteria for simplicity: as you mention, food, pain, mating, etc. are pretty simple to humans, because they get to refer to sensory data, but very complex from the perspective of evolution, which doesn't.

That being said, you might be able to get pretty close, even if you don't hit the base objective exactly, though exactly how close is very unclear, especially once you start considering other factors like computational complexity as you mention.

More generally, I think the broader point here is just that there are a lot of possible pseudo-aligned mesa-objectives: the space of possible objectives is very large, and the actual base objective occupies only a tiny fraction of that space. Thus, to the extent that you are optimizing for anything other than pure similarity to the base objective, you're likely to find an optimum which isn't exactly the base objective, just simply because there are so many different possible objectives for you to find, and it's likely that one of them will gain more from increased simplicity (or anything else) than it loses by being farther away from the base objective.

Comment by evhub on Conditions for Mesa-Optimization · 2019-06-04T21:00:39.829Z · score: 3 (2 votes) · LW · GW

I actually just updated the paper to just use model capacity instead of algorithmic range to avoid needlessly confusing machine learning researchers, though I'm keeping algorithmic range here.

Comment by evhub on The Inner Alignment Problem · 2019-06-04T18:06:14.213Z · score: 7 (4 votes) · LW · GW

I agree with that as a general takeaway, though I would caution that I don't think it's always true—for example, hard-coded optimization seems to help in both cases, and I suspect algorithmic range to be more complicated than that, likely making some pseudo-alignment problems better but also possibly making some worse.

Also, yeah, that was backwards—it should be fixed now.

The Inner Alignment Problem

2019-06-04T01:20:35.538Z · score: 60 (13 votes)
Comment by evhub on Conditions for Mesa-Optimization · 2019-06-03T22:50:47.711Z · score: 1 (1 votes) · LW · GW

I believe AlphaZero without MCTS is still very good but not superhuman—around International Master level. That being said, it's unclear how much optimization/search is currently going on inside of AlphaZero's policy network. My suspicion would be that currently it does some, and that to perform at the same level as the full AlphaZero it would have to perform more.

I added a footnote regarding capacity limitations (though editing doesn't appear to be working for me right now—it should show up in a bit). As for the broader point, I think it's just a question of degree—for a sufficiently diverse environment, you can do pretty well with just heuristics, you do better introducing optimization, and you keep getting better as you keep doing more optimization. So the question is just what does "perform well" mean and what threshold are you drawing for "internally performs something like a tree search."

Comment by evhub on Conditions for Mesa-Optimization · 2019-06-03T22:37:32.868Z · score: 1 (1 votes) · LW · GW

The argument in this post is just that it might help prevent mesa-optimization from happening at all, not that it would make it more aligned. The next post will be about how to align mesa-optimizers.

Comment by evhub on Conditions for Mesa-Optimization · 2019-06-03T22:36:39.631Z · score: 1 (1 votes) · LW · GW

The idea would be that all of this would be learned—if the optimization machinery is entirely internal to the system, it can choose how to use that optimization machinery arbitrarily. We talk briefly about systems where the optimization is hard-coded, but those aren't mesa-optimizers. Rather, we're interested in situations where your learned algorithm itself performs optimization internal to its own workings—optimization it could re-use to do prediction or vice versa.

Comment by evhub on Conditions for Mesa-Optimization · 2019-06-03T22:34:38.704Z · score: 4 (3 votes) · LW · GW

It definitely will vary with the environment, though the question is degree. I suspect most of the variation will be in how much optimization power you need, as opposed to how difficult it is to get some degree of optimization power, which motivates the model presented here—though certainly there will be some deviation in both. The footnote should probably be rephrased so as not to assert that it is completely independent, as I agree that it obviously isn't, but just that it needs to be relatively independent, with the amount of optimization power dominating for the model to make sense.

Renamed to —good catch (though editing doesn't appear to be working for me right now—it should show up in a bit)!

Algorithmic range is very similar to model capacity, except that we're thinking slightly more broadly as we're more interested in the different sorts of general procedures your model can learn to implement than how many layers of convolutions you can do. That being said, they're basically the same thing.

Comment by evhub on Conditions for Mesa-Optimization · 2019-06-03T04:03:53.379Z · score: 12 (4 votes) · LW · GW

No, not all—we distinguish robustly aligned mesa-optimizers, which are aligned on and off distribution, from pseudo-aligned mesa-optimizers, which appear to be aligned on distribution, but are not necessarily aligned off-distribution. For the full glossary, see here.

Comment by evhub on Risks from Learned Optimization: Introduction · 2019-06-01T22:46:12.685Z · score: 24 (5 votes) · LW · GW

I don't have a good term for that, unfortunately—if you're trying to build an FAI, "human values" could be the right term, though in most cases you really just want "move one strawberry onto a plate without killing everyone," which is quite a lot less than "optimize for all human values." I could see how meta-objective might make sense if you're thinking about the human as an outside optimizer acting on the system, though I would shy away from using that term like that, as anyone familiar with meta-learning will assume you mean the objective of a meta-learner instead.

Also, the motivation for choosing outer alignment as the alignment problem between the base objective and the goals of the programmers was to capture the "classical" alignment problem as it has sometimes previously been envisioned, wherein you just need to specify an aligned set of goals and then you're good. As we argue, though, mesa-optimization means that you need more than just outer alignment—if you have mesa-optimizers, you also need inner alignment, as even if your base objective is perfectly aligned, the resulting mesa-objective (and thus the resulting behavioral objective) might not be.

Comment by evhub on Risks from Learned Optimization: Introduction · 2019-06-01T22:25:44.172Z · score: 18 (7 votes) · LW · GW

The word mesa is Greek meaning into/inside/within, and has been proposed as a good opposite word to meta, which is Greek meaning about/above/beyond. Thus, we chose mesa based on thinking about mesa-optimization as conceptually dual to meta-optimization—whereas meta is one level above, mesa is one level below.

Conditions for Mesa-Optimization

2019-06-01T20:52:19.461Z · score: 48 (13 votes)

Risks from Learned Optimization: Introduction

2019-05-31T23:44:53.703Z · score: 101 (28 votes)
Comment by evhub on A Concrete Proposal for Adversarial IDA · 2019-04-20T04:58:10.375Z · score: 3 (2 votes) · LW · GW

I considered collapsing all of it into one (as Paul has talked about previously), but as you note the amplification procedure I describe here basically already does that. The point of the distillation step, thus, is just to increase sample efficiency by letting you get additional training in without requiring additional calls to the human. I do agree that you could include the iteration procedure described here into the amplification procedure, which is probably a good idea, though you'd probably want to anneal in that situation, as the adversary starts out really bad, whereas in this setup you shouldn't have to do any annealing because by the time you get to that point the adversary should be performing well enough that it will automatically anneal as its predictions get better. Also, apologies for the math--I didn't really have the time to write up more explanation, so it was a choice between posting it as is or not posting it at all, and I went with posting it as is.

(Also, the sum isn't a typo--I'm using the adversary to predict the negative of the loss, not the loss, which I admit is confusing and I should probably switch it.)

A Concrete Proposal for Adversarial IDA

2019-03-26T19:50:34.869Z · score: 16 (5 votes)

Nuances with ascription universality

2019-02-12T23:38:24.731Z · score: 16 (5 votes)
Comment by evhub on How does Gradient Descent Interact with Goodhart? · 2019-02-04T05:53:30.889Z · score: 20 (9 votes) · LW · GW

While at a recent CFAR workshop with Scott, Peter Schmidt-Nielsen and I wrote some code to run experiments of the form that Scott is talking about here. If anyone is interested, the code can be found here, though I'll also try to summarize our results below.

Our methodology was as follows:
1. Generate a real utility function by randomly initializing a feed-forward neural network with 3 hidden layers of 10 neurons each and tanh activations, then training it for 5000 steps of gradient descent with a learning rate of 0.1 on a set of 1023 uniformly sampled data points. The reason we pre-train the network on random data is that we found that randomly initialized neural networks tended to be very similar and very smooth, such that it was very easy for the proxy network to learn them, whereas networks trained on random data were significantly more variable.
2. Generate a proxy utility function by training a randomly initialized neural network with the same architecture as the real network on 50 uniformly sampled points from the real utility using 1000 steps of gradient descent with a learning rate of 0.1.
3. Fix μ to be uniform sampling.
4. Let the optimized distribution be uniform sampling followed by 50 steps of gradient descent on the proxy network with a learning rate of 0.1.
5. Sample 1000000 points from μ, then optimize those same points according to the optimized distribution. Create buckets of radius 0.01 utilons for all proxy utility values and compute the real utility values for points in each bucket from the μ set and from the optimized set.
6. Repeat steps 1-5 10 times, then average the final real utility values per bucket and plot them. Furthermore, compute the "Goodhart error" as the real utility for the proxy-optimized points minus the real utility for the random points, plotted against their proxy utility values. (A rough code sketch of this setup follows below.)
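Since the linked code isn't reproduced here, the following is a minimal PyTorch sketch of roughly this setup rather than the original code. It assumes 1-dimensional points sampled uniformly from [0, 1], mean-squared-error fitting, and bucketing by rounding the proxy utility to the nearest 0.01; all function and variable names are just illustrative.

import torch

def make_net():
    # 3 hidden layers of 10 neurons each with tanh activations
    return torch.nn.Sequential(
        torch.nn.Linear(1, 10), torch.nn.Tanh(),
        torch.nn.Linear(10, 10), torch.nn.Tanh(),
        torch.nn.Linear(10, 10), torch.nn.Tanh(),
        torch.nn.Linear(10, 1),
    )

def fit(net, xs, ys, steps, lr=0.1):
    # plain full-batch gradient descent on mean squared error
    optim = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):
        optim.zero_grad()
        torch.mean((net(xs) - ys) ** 2).backward()
        optim.step()
    return net

# Step 1: the "real" utility is a net pre-trained on random data so it isn't too smooth.
real = fit(make_net(), torch.rand(1023, 1), torch.rand(1023, 1), steps=5000)

# Step 2: the proxy utility is the same architecture fit to 50 samples of the real utility.
xs50 = torch.rand(50, 1)
proxy = fit(make_net(), xs50, real(xs50).detach(), steps=1000)
for p in proxy.parameters():
    p.requires_grad_(False)  # freeze the proxy; below we only optimize points

# Steps 3-5: uniformly sampled points vs. the same points after 50 steps of
# gradient ascent on the proxy utility.
x_mu = torch.rand(1_000_000, 1)
x_opt = x_mu.clone().requires_grad_(True)
ascent = torch.optim.SGD([x_opt], lr=0.1)
for _ in range(50):
    ascent.zero_grad()
    (-proxy(x_opt)).sum().backward()  # minimizing -proxy ascends the proxy
    ascent.step()

# Bucket by proxy utility (rounded to the nearest 0.01) and compare average
# real utility within each bucket.
with torch.no_grad():
    def real_by_bucket(x):
        buckets = {}
        for b, r in zip((proxy(x) / 0.01).round().flatten().tolist(),
                        real(x).flatten().tolist()):
            buckets.setdefault(b, []).append(r)
        return {b: sum(v) / len(v) for b, v in buckets.items()}
    mu_buckets, opt_buckets = real_by_bucket(x_mu), real_by_bucket(x_opt)
    # Goodhart error: real utility of optimized points minus that of random
    # points, conditional on proxy utility bucket.
    goodhart_error = {b: opt_buckets[b] - mu_buckets[b]
                      for b in opt_buckets if b in mu_buckets}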

The plot generated by this process is given below:

[Plot: Goodhart error]

As can be seen from the plot, the Goodhart error is fairly consistently negative, implying that the gradient descent optimized points are performing worse on the real utility conditional on the proxy utility.

However, using an alternative way of generating the proxy utility, we were able to reverse the effect. That is, we ran the same experiment, but instead of optimizing the proxy utility to be close to the real utility on the sampled points, we optimized the gradient of the proxy utility to be close to the gradient of the real utility on the sampled points. This resulted in the following graph:
[Plot: Goodhart error with gradient-matched proxy]

As can be seen from the plot, the Goodhart error flipped and became positive in this case, implying that the gradient optimization did significantly better than the point optimization.
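For concreteness, here is how that alternative proxy-fitting step might look, continuing the sketch above (again an illustrative reconstruction rather than the original code; grad_proxy is a hypothetical name):

# Fit the proxy by matching gradients rather than values at the 50 sampled points.
xs50 = torch.rand(50, 1, requires_grad=True)
real_grad = torch.autograd.grad(real(xs50).sum(), xs50)[0].detach()

grad_proxy = make_net()
optim = torch.optim.SGD(grad_proxy.parameters(), lr=0.1)
for _ in range(1000):
    optim.zero_grad()
    # create_graph=True lets the gradient-matching loss backprop into the weights
    proxy_grad = torch.autograd.grad(grad_proxy(xs50).sum(), xs50, create_graph=True)[0]
    torch.mean((proxy_grad - real_grad) ** 2).backward()
    optim.step()
# Steps 3-6 above are then repeated with grad_proxy in place of proxy.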

Finally, we also did a couple of checks to ensure the correctness of our methodology.

First, one concern was that our method of bucketing could be biased. To determine the degree of "bucket error", we computed the average proxy utility for each bucket from the μ and optimized datasets and took the difference. This should be identically zero, since the buckets are generated based on proxy utility, while any deviation from zero would imply a systematic bias in the buckets. We did find a significant bucket error for large bucket sizes, but for our final bucket size of 0.01, we found a bucket error in the range of 0-0.01, which should be negligible.
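A minimal version of that check, in terms of the sketch above (illustrative only):

# Bucket error: the average *proxy* utility per bucket should match between the
# random and optimized sets, since buckets are defined by proxy utility.
with torch.no_grad():
    def proxy_by_bucket(x):
        buckets = {}
        for b, p in zip((proxy(x) / 0.01).round().flatten().tolist(),
                        proxy(x).flatten().tolist()):
            buckets.setdefault(b, []).append(p)
        return {b: sum(v) / len(v) for b, v in buckets.items()}
    mu_p, opt_p = proxy_by_bucket(x_mu), proxy_by_bucket(x_opt)
    bucket_error = {b: opt_p[b] - mu_p[b] for b in opt_p if b in mu_p}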

Second, another thing we did to check our methodology was to generate the optimized set simply by sampling 100 random points, then selecting the one with the highest proxy utility value. This should give exactly the same results as μ, since bucketing conditions on the proxy utility value, and in fact that was what we got.

Comment by evhub on Owen's short-form blog · 2018-09-15T23:09:58.645Z · score: 1 (1 votes) · LW · GW

Is there an RSS feed?

Comment by evhub on Dependent Type Theory and Zero-Shot Reasoning · 2018-07-11T19:43:17.275Z · score: 4 (3 votes) · LW · GW

Dimensional analysis is absolutely an instance of what I'm talking about!

As for only being able to do constructive stuff, you actually can do classical stuff as well, but you have to explicitly assume the law of the excluded middle. For example, if in Lean I write

axiom lem (P: Prop): P ∨ ¬P

then I can start doing classical reasoning.
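For example, a minimal Lean 3 sketch deriving double-negation elimination from that axiom (the theorem name dne is just illustrative):

-- double-negation elimination, derived from the lem axiom above
theorem dne (P : Prop) (h : ¬¬P) : P :=
or.elim (lem P) (λ hp, hp) (λ hnp, absurd hnp h)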

Also, you're totally right that you could also do and so on as much as you want, but there's no real reason to do so, since if you start from the simplest possible way to do it you'll solve the problem by the time you get to .

Comment by evhub on Conditions under which misaligned subagents can (not) arise in classifiers · 2018-07-11T03:34:03.880Z · score: 2 (2 votes) · LW · GW

There are a couple of pieces of this that I disagree with:

  • I think claim 1 is wrong because even if the memory is unhelpful, the agent which uses it might be simpler, and so you might still end up with an agent. My intuition is that just specifying a utility function and an optimization process is often much easier than specifying the complete details of the actual solution, and thus any sort of program-search-based optimization process (e.g. gradient descent in a nn) has a good chance of finding an agent.
  • I think claim 3 is wrong because agenty solutions exist for all tasks, even classification tasks. For example, take the function which spins up an agent, tasks that agent with the classification task, and then takes that agent's output. Unless you've done something to explicitly remove agents from your search space, this sort of solution always exists.
  • Thus, I think claim 6 is wrong due to my complaints about claims 1 and 3.

Dependent Type Theory and Zero-Shot Reasoning

2018-07-11T01:16:45.557Z · score: 18 (11 votes)