## Posts

Risks from Learned Optimization: Conclusion and Related Work 2019-06-07T19:53:51.660Z · score: 59 (14 votes)
Deceptive Alignment 2019-06-05T20:16:28.651Z · score: 59 (14 votes)
The Inner Alignment Problem 2019-06-04T01:20:35.538Z · score: 65 (15 votes)
Conditions for Mesa-Optimization 2019-06-01T20:52:19.461Z · score: 55 (17 votes)
Risks from Learned Optimization: Introduction 2019-05-31T23:44:53.703Z · score: 112 (33 votes)
A Concrete Proposal for Adversarial IDA 2019-03-26T19:50:34.869Z · score: 16 (5 votes)
Nuances with ascription universality 2019-02-12T23:38:24.731Z · score: 16 (5 votes)
Dependent Type Theory and Zero-Shot Reasoning 2018-07-11T01:16:45.557Z · score: 18 (11 votes)

Comment by evhub on The Inner Alignment Problem · 2019-08-13T20:27:03.210Z · score: 9 (2 votes) · LW · GW

The simple reason why that doesn't work is that computing the base objective function during deployment will likely require something like actually running the model, letting it act in the real world, and then checking how good that action was. But in the case of actions that are catastrophically bad, the fact that you can check after the fact that the action was in fact bad isn't much consolation: just having the agent take the action in the first place could be exceptionally dangerous during deployment.

Comment by evhub on New paper: Corrigibility with Utility Preservation · 2019-08-09T18:06:27.152Z · score: 1 (1 votes) · LW · GW

In button-corrigibility, there is a quite different main safety concern: we want to prevent the agent from taking actions that manipulate human values or human actions in some bad way, with this manipulation creating conditions that make it easier for the agent to get or preserve a high utility score.

I generally think of solving this problem via ensuring your agent is myopic rather than via having a specific ritual which the agent is indifferent to, as you describe. It seems to me like even if you could make the agent indifferent to the shutdown button, there would still be other ways for it to influence you into giving it higher reward. Thus, it seems like what you really want here is some guarantee that the agent is only considering its per-episode reward and doesn't care at all about its cross-episode reward, which is the condition I call myopia. That being said, how to reliably produce myopic agents is still a very open problem.
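The per-episode vs. cross-episode distinction can be put in a toy sketch (hypothetical names and numbers, nothing from the paper under discussion): a myopic agent scores only the current episode's reward, while a non-myopic one also values reward from later episodes.

```python
# Illustrative sketch (hypothetical names): a myopic objective counts only
# the current episode; a non-myopic one also values future episodes.

def per_episode_return(episode_rewards):
    """Myopic objective: only the current episode's reward counts."""
    return sum(episode_rewards)

def cross_episode_return(all_episodes, discount=0.9):
    """Non-myopic objective: future episodes are also valued."""
    return sum(
        (discount ** i) * sum(rewards)
        for i, rewards in enumerate(all_episodes)
    )

# An agent optimizing cross_episode_return has an incentive to sacrifice
# reward now (e.g. by influencing its overseer) for more reward later;
# an agent optimizing per_episode_return does not.
episodes = [[1.0, 2.0], [10.0], [10.0]]
print(per_episode_return(episodes[0]))  # 3.0
print(cross_episode_return(episodes))   # 3.0 + 0.9*10 + 0.81*10 = 20.1
```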

Comment by evhub on New paper: Corrigibility with Utility Preservation · 2019-08-08T19:25:13.038Z · score: 1 (1 votes) · LW · GW

I think I mostly agree with what you're saying here, except that I do have some concerns about applying a corrigibility layer to a learned model after the fact. Here are some ways in which I could see that going wrong:

1. The learned model might be so opaque that you can't figure out how to apply corrigibility unless you have some way of purposefully training it to be transparent in the right way.
2. If you aren't applying corrigibility during training, then your model could act dangerously during the training process.
3. If corrigibility is applied as a separate layer, the model could figure out how to disable it.

I think if you don't have some way of training your model to be transparent, you're not going to be able to make it corrigible after the fact, and if you do have some way of training your model to be transparent, then you might as well apply corrigibility during training rather than doing it at the end. Also, I think making sure your model is always corrigible during training can also help with the task of training it to be transparent, as you can use your corrigible model to work on the task of making itself more transparent. In particular, I mostly think about informed oversight via an amplified overseer as how you would do something like that.

Comment by evhub on New paper: Corrigibility with Utility Preservation · 2019-08-08T19:10:41.740Z · score: 4 (3 votes) · LW · GW

I think your agent is basically exactly what I mean by a lonely engineer: the key lonely engineer trick is to give the agent a utility function which is a mathematical entity that is entirely independent of its actions in the real world. On my first reading of your paper it seemed like you were trying to get the agent to approximate the AU agent, but reading it more closely now it seems like what you're doing is using the correction term to make it indifferent to the button being pressed; is that correct?

I think in that case I tend towards thinking that such corrections are unlikely to scale towards what I eventually want out of corrigibility, though I do think there are other ways in which such approaches can still be quite useful. I mainly see approaches of that form as broadly similar to impact measure approaches such as Attainable Utility Preservation or Relative Reachability where the idea is to try to disincentivize the agent from acting non-corrigibly. The biggest hurdles there seem to be a) making sure your penalty doesn't miss anything, b) figuring out how to trade off between the penalty and performance, and c) making sure that if you train on such a penalty the resulting model will actually learn to always behave according to it (I see this as the hardest part).
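As a toy illustration of hurdle (b), not anything from the papers mentioned: the generic shape of an impact-measure objective is a task reward minus a scaled penalty, and which actions win depends entirely on the (made-up) tradeoff coefficient `lam`.

```python
# Hypothetical sketch of hurdle (b): trading off an impact penalty against
# task performance. The numbers and the coefficient `lam` are made up.

def penalized_reward(task_reward, impact_penalty, lam):
    """Generic impact-measure objective: reward minus scaled penalty."""
    return task_reward - lam * impact_penalty

# A high-impact action (reward 5, penalty 3) vs. a low-impact one (reward 2,
# penalty 0): which wins depends entirely on lam.
print(penalized_reward(5.0, 3.0, 0.1) > penalized_reward(2.0, 0.0, 0.1))  # True
print(penalized_reward(5.0, 3.0, 2.0) > penalized_reward(2.0, 0.0, 2.0))  # False
```

Too low a `lam` and the penalty is ignored; too high and the agent never takes the useful action, which is why the tradeoff is hard to set in advance.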

On that note, I totally agree that it becomes quite a lot more complicated when you start imagining creating these things via learning, which is the picture I think about primarily. In fact, I think your intuitive picture here is quite spot on: it's just very difficult to control exactly what sort of model you'll get when you actually try to train an agent to maximize something.

Comment by evhub on New paper: Corrigibility with Utility Preservation · 2019-08-06T19:54:39.161Z · score: 15 (6 votes) · LW · GW

Even though it is super-intelligent, the AU agent has no emergent incentive to spend any resources to protect its utility function. This is because of how it was constructed: it occupies a universe in which no physics process could possibly corrupt its utility function. With the utility function being safe no matter what, the optimal strategy is to devote no resources at all to the matter of utility function protection.

I call this style of approach the "lonely engineer" (a term which comes from MIRI, and I think specifically from Scott Garrabrant, though I could be mistaken on that point). I think lonely-engineer-style approaches (a category which I think yours basically falls into) are super interesting, and definitely on to something real and important.

That being said, my biggest concern with approaches of that form is that they make an extraordinarily strong assumption about our ability to select what utility function our agent has. In particular, if you imagine training a machine learning model on a lonely engineer utility function, it won't work at all, as gradient descent is not going to be able to distinguish between preferences that are entirely about abstract mathematical objects and those that are not. For a full exploration of this problem, see "Risks from Learned Optimization."

My other concern is that while the lonely engineer itself might have no incentive to behave non-corrigibly, if it runs any sort of search process (doing its own machine learning, for example), it will have no incentive to do so in an aligned way. Thus, a policy resulting from its search might have preferences over things other than abstract mathematical objects, even if the lonely engineer itself does not.

Comment by evhub on Risks from Learned Optimization: Conclusion and Related Work · 2019-07-24T21:34:11.447Z · score: 3 (2 votes) · LW · GW

I think it's still an open question to what extent not having any mesa-optimization would hurt capabilities, but my sense is indeed that mesa-optimization is likely inevitable if you want to build safe AGI which is competitive with a baseline unaligned approach. Thus, I tend towards thinking that the right strategy is to understand that you're definitely going to produce a mesa-optimizer, and just have a really strong story for why it will be aligned.

Comment by evhub on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-03T18:33:47.829Z · score: 1 (1 votes) · LW · GW

If the oracle cares about its own performance in a broader sense, rather than just performance on the current question, then don't we have a problem anyway? E.g. if you ask it question 1, it will be incentivized to make it get an easier question 2? For example, if you are concerned about coordination amongst different instances of the oracle, this seems like it's a problem regardless.

Yeah, that's a good point. In my most recent response to Wei Dai I was trying to develop a loss which would prevent that sort of coordination, but it does seem like if that's happening then it's a problem in any counterfactual oracle setup, not just this one. Though it is thus still a problem you'd have to solve if you ever actually wanted to implement a counterfactual oracle.

Comment by evhub on Risks from Learned Optimization: Introduction · 2019-07-03T18:28:45.770Z · score: 7 (4 votes) · LW · GW

I think my concern with describing mesa-optimizers as emergent subagents is that they're not really "sub" in a very meaningful sense, since we're thinking of the mesa-optimizer as the entire trained model, not some portion of it. One could describe a mesa-optimizer as a subagent in the sense that it is "sub" to gradient descent, but I don't think that's the right relationship--it's not like the mesa-optimizer is some subcomponent of gradient descent; it's just the trained model produced by it. The reason we opted for "mesa" is that I think it reflects more of the right relationship between the base optimizer and the mesa-optimizer, wherein the base optimizer is "meta" to the mesa-optimizer rather than the mesa-optimizer being "sub" to the base optimizer.

Furthermore, in my experience, when many people encounter "emergent subagents" they think of some portion of the model turning into an agent and (correctly) infer that something like that seems very unlikely and not a real concern, as it's unclear why such a thing would actually be advantageous for getting a model selected by something like gradient descent (unlike mesa-optimization, which I think has a very clear story for why it would be selected for). Thus, we want to be very clear that something like that is not the concern being presented in the paper.

Comment by evhub on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-02T20:55:53.463Z · score: 5 (3 votes) · LW · GW

You can use a different oracle for every subquestion, but it's unclear what exactly that does if you don't know what the oracle's actual objective is. For example, you could imagine one system that cares about the reward given to copies equally as much as reward given to itself, and another system that only cares about the reward given to itself, and these two systems would be near-indistinguishable if you were just doing empirical analysis on some training distribution.

The key here, I think, is the degree to which you're willing to make an assumption of the form you mention--that is, how much leeway are you willing to allow in assuming that the oracle is in fact only going to be optimizing for the specified objective function. On one level, it makes sense to separate out outer alignment and inner alignment concerns, but the problem in this case is that the sorts of objectives you're allowed to specify here are in fact going to be very dependent on whatever inner alignment solution you're assuming you have access to. For example, does your inner alignment solution require access to training data? If so, then that's a big constraint on the sorts of objectives you can specify. Or, going back to your previous question, whether or not I get to pick between an objective that doesn't care about other copies or one that does is also going to depend on exactly what sort of inner alignment solution you're assuming.

Comment by evhub on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-02T18:57:30.595Z · score: 2 (2 votes) · LW · GW

Yeah, that's a good point. Okay, here's another thought: if you can get the counterfactual oracle to only care about the reward for that invocation and not for any sub-invocations, then you can just go back to using the original algorithm. This is a bit slippery, because really whether you get a mesa-optimizer that cares only about a single step's worth of reward, all the reward it will ever get, or something else somewhere in-between, is mostly out of your control. But if you can guarantee that you get a mesa-optimizer that is only optimizing for one step of reward, you might be able to eliminate the manipulation incentive. The biggest issue in this case, though, is that the different invocations might still be able to acausally cooperate with each other to raise all of their rewards together by manipulating you into being easier to predict.

Maybe we can turn this into a zero-sum game, though? Here's a proposal: let $M'$ be a copy of the oracle $M$ and $Q_s$ be the set of all questions in the current tree that also get erasures. Then, define $M$'s loss by subtracting from its own loss the losses (as computed using $M'$) on the other nodes in $Q_s$, such that the net sum of losses across each tree for all the nodes for which losses are computed is zero. It's still somewhat unclear whether this actually works, though, since you only include the losses for those nodes that get erasures. It is also worth noting, however, that even if this works it is a very artificial fix, since the term you're subtracting is a constant with no dependence on $M$, so if you're trying to do gradient descent to optimize this loss, it won't change anything at all (which sort of goes to show how gradient descent doesn't distinguish between mesa-optimizers with single-episode and cross-episode objectives).
Thus, we're still back at the problem of none of this working unless you're willing to make strong claims about your ability to directly produce a mesa-optimizer which is actually optimizing for the specified loss function.

Comment by evhub on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-02T00:03:05.342Z · score: 3 (3 votes) · LW · GW

First, if you're willing to make the (very) strong assumption that you can directly specify what objective you want your model to optimize for without requiring a bunch of training data for that objective, then you can only provide a reward in the situation where all subquestions also have erasures. In this situation, you're guarded against any possible manipulation incentive like that, but it also means your oracle will very rarely actually be given a reward in practice, which means if you're relying on getting enough training data to produce an agent which will optimize for this objective, you're screwed. I would argue, however, that if you expect to train an agent to behave as a counterfactual oracle in the first place, you're already screwed, because most mesa-optimizers will care about things other than just the counterfactual case. Thus, the only situation in which this whole thing works in the first place is the situation where you're already willing to make this (very strong) assumption, so it's fine.

Second, I don't think you're entirely screwed even if you need training data, since you can do some relaxations that attempt to approximate the situation where you only provide rewards in the event of a complete erasure. For example, you could increase the probability of an erasure with each subquestion, or scale the reward exponentially with the depth at which the erasure occurs, so that the majority of the expected reward is always concentrated in the world where there is a complete erasure.
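These two relaxations can be sketched numerically (a toy illustration of my own; `p0`, `growth`, and `scale` are made-up parameters, not anything from the thread):

```python
# Toy illustration of the two relaxations described above; all parameter
# values are made up.

def erasure_prob(depth, p0=0.01, growth=2.0):
    """Relaxation 1: the probability of an erasure increases with each
    additional level of subquestion depth."""
    return min(1.0, p0 * growth ** depth)

def depth_scaled_reward(base_reward, depth, scale=10.0):
    """Relaxation 2: reward scales exponentially with the depth at which
    the erasure occurs, so the expected reward concentrates in deep
    (near-complete) erasures."""
    return base_reward * scale ** depth

print(erasure_prob(0))              # 0.01
print(depth_scaled_reward(1.0, 3))  # 1000.0
```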

Comment by evhub on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-01T18:40:48.618Z · score: 6 (4 votes) · LW · GW

My submission: use a counterfactual oracle to do IDA. To do this, judge the oracle's answer as follows: in the event of an erasure, ask a human to answer the question with the ability to consult the oracle on other subquestions (with erasures during those calls handled exactly the same way) and then reward the oracle by using an automated system (e.g. a language model) to compare the human's answer to that of the oracle.

More precisely, let

• $M$ be the counterfactual oracle,
• $H(Q)$ be the human's answer to question $Q$ when given the ability to call $M$ on any question other than $Q$, and
• $d$ be some distance metric on answers in natural language (it's not that hard to make something like this, even with current ML tools).

Then, reward $M$ as per usual for a counterfactual oracle, only giving it a reward in the event of an erasure, in which case let $L = d(H(Q), M(Q))$, where $H(Q)$ is hidden from $M$ and judged only by $d$, as in the standard counterfactual oracle setup.

(Of course, this doesn't actually work because it has no guarantees wrt inner alignment, but I think it has a pretty good shot of being outer aligned.)
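As a sketch of what such a distance metric might look like, writing M(Q) for the oracle's answer and H(Q) for the human's (my notation): in practice you'd likely compare learned sentence embeddings, but a toy bag-of-words cosine distance keeps the example self-contained.

```python
# A minimal stand-in for a distance metric d on natural-language answers.
# Real systems would use learned embeddings; this toy bag-of-words cosine
# distance just shows the shape of the interface.
import math
from collections import Counter

def d(answer_a, answer_b):
    """Distance in [0, 1]: 0.0 for identical bags of words, 1.0 for disjoint."""
    a = Counter(answer_a.lower().split())
    b = Counter(answer_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - (dot / norm if norm else 0.0)

# In the event of an erasure, the oracle's loss would be the distance between
# the (hidden) human answer H(Q) and the oracle's answer M(Q):
print(d("the sky is blue", "the sky is blue"))            # 0.0
print(round(d("the sky is blue", "grass is green"), 2))   # 0.71
```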

Comment by evhub on Deceptive Alignment · 2019-06-17T21:41:00.263Z · score: 9 (5 votes) · LW · GW

Suppose you have a system which appears to be aligned on-distribution. Then, you might want to know:

1. Is it actually robustly aligned?
2. How did it learn how to behave aligned? Where did the information about the base objective necessary to display behavior aligned with it come from? Did it learn about the base objective by looking at its input, or was that knowledge programmed into it by the base optimizer?

The different possible answers to these questions give us four cases to consider:

1. It's not robustly aligned and the information came through the base optimizer. This is the standard scenario for (non-deceptive) pseudo-alignment.
2. It is robustly aligned and the information came through the base optimizer. This is the standard scenario for robust alignment, which we call internal alignment.
3. It is not robustly aligned and the information came through the input. This is the standard scenario for deceptive pseudo-alignment.
4. It is robustly aligned and the information came through the input. This is the weird case that we call corrigible alignment. In this case, it's trying to figure out what the base objective is as it goes along, not because it wants to play along and pretend to optimize for the base objective, but because optimizing for the base objective is actually somehow its terminal goal. How could that happen? It would have to be that its mesa-objective somehow "points" to the base objective in such a way that, if it had full knowledge of the base objective, then its mesa-objective would just be the base objective. What does this situation look like? Well, it has a lot of similarities to the notion of corrigibility: the mesa-optimizer in this situation seems to be behaving corrigibly wrt the base optimizer (though not necessarily wrt the programmers), as it is trying to understand what the base optimizer wants and then do that.

Comment by evhub on The Inner Alignment Problem · 2019-06-07T17:41:04.387Z · score: 5 (3 votes) · LW · GW

I just added a footnote mentioning IDA to this section of the paper, though I'm leaving it as is in the sequence to avoid messing up the bibliography numbering.

Comment by evhub on Deceptive Alignment · 2019-06-06T19:04:02.954Z · score: 7 (5 votes) · LW · GW

This is a response to point 2 before Pattern's post was modified to include the other points.

Interesting point! First of all, I think condition three is mostly a requirement that must be met for a deceptively aligned mesa-optimizer to actually defect at some point, rather than for it to be deceptive in the first place.

That being said, I think the situation you raise is particularly interesting because it depends on whether the mesa-optimizer cares about its own survival. If the mesa-optimizer does care about its own continued existence, then it would want to do as you were reasoning and not defect, keeping itself around. Alternatively, however, suppose the mesa-optimizer just cares about its mesa-objective $O_{mesa}$ but doesn't care that it's the one around to do the optimization for it. Then, defecting and optimizing for $O_{mesa}$ instead of $O_{base}$ when you expect to be modified afterwards won't actually hurt the long-term fulfillment of $O_{base}$, since another learned algorithm will just take your place to optimize for it instead. Thus, even if the mesa-optimizer prefers lots of $O_{mesa}$ to a bit of $O_{mesa}$, that's not actually the choice presented to it; rather, its actual choice is between a bit of $O_{mesa}$ and a lot of $O_{base}$ versus only a lot of $O_{base}$. Thus, in this case, it would defect even if it thought it would be modified.
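The choice described above can be put in toy numbers (entirely hypothetical, writing O_mesa for the mesa-objective and O_base for the base objective):

```python
# Entirely hypothetical numbers for the defection choice described above.
# The mesa-optimizer's utility only counts O_mesa, so O_base terms drop out.

bit_of_O_mesa = 1.0    # gained by defecting now, before being modified
lot_of_O_base = 100.0  # produced afterwards by whatever replaces the agent

mesa_weight_on_O_base = 0.0  # assumption: it doesn't value O_base at all

u_defect = bit_of_O_mesa + mesa_weight_on_O_base * lot_of_O_base
u_wait = mesa_weight_on_O_base * lot_of_O_base

print(u_defect > u_wait)  # True: it defects even expecting modification
```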

(Also, on point 6, thanks for the catch; it should be fixed now!)

Comment by evhub on The Inner Alignment Problem · 2019-06-06T17:17:00.419Z · score: 7 (3 votes) · LW · GW

IDA is definitely a good candidate to solve problems of this form. I think IDA's best properties are primarily outer alignment properties, but it does also have some good properties with respect to inner alignment such as allowing you to bootstrap an informed adversary by giving it access to your question-answer system as you're training it. That being said, I suspect you could do something similar under a wide variety of different systems—bootstrapping an informed adversary is not necessarily unique to IDA. Unfortunately, we don't discuss IDA much in the last post, though thinking about mesa-optimizers in IDA (and other proposals e.g. debate) is imo a very important goal, and our hope is to at the very least provide the tools so that we can then go and start answering questions of that form.

Comment by evhub on The Inner Alignment Problem · 2019-06-05T22:10:58.866Z · score: 4 (3 votes) · LW · GW

I broadly agree that description complexity penalties help fight against pseudo-alignment whereas computational complexity penalties make it more likely, though I don't think it's absolute and there are definitely a bunch of caveats to that statement. For example, Solomonoff Induction seems unsafe despite maximally penalizing description complexity, though obviously that's not a physical example.

Comment by evhub on The Inner Alignment Problem · 2019-06-05T21:58:47.707Z · score: 3 (2 votes) · LW · GW

I agree, status definitely seems more complicated—in that case it was just worth the extra complexity. The point, though, is just that the measure of complexity under which the mesa-objective is selected is different from more natural measures of complexity under which you might hope for the base objective to be the simplest. Thus, even though sometimes it is absolutely worth it to sacrifice simplicity, you shouldn't usually expect that sacrifice to be in the direction of moving closer to the base objective.

Comment by evhub on The Inner Alignment Problem · 2019-06-05T20:07:11.980Z · score: 9 (2 votes) · LW · GW

I think it's very rarely going to be the case that the simplest possible mesa-objective that produces good behavior on the training data will be the base objective. Intuitively, we might hope that, since we are judging the mesa-optimizer based on the base objective, the simplest way to achieve good behavior will just be to optimize for the base objective. But importantly, you only ever test the base objective over some finite training distribution. Off-distribution, the mesa-objective can do whatever it wants. Expecting the mesa-objective to exactly mirror the base objective even off-distribution where the correspondence was never tested seems very problematic. It must be the case that precisely the base objective is the unique simplest objective that fits all the data points, which, given the massive space of all possible objectives, seems unlikely, even for very large training datasets. Furthermore, the base and mesa-optimizers are operating under different criteria for simplicity: as you mention, food, pain, mating, etc. are pretty simple to humans, because they get to refer to sensory data, but very complex from the perspective of evolution, which doesn't.

That being said, you might be able to get pretty close, even if you don't hit the base objective exactly, though exactly how close is very unclear, especially once you start considering other factors like computational complexity as you mention.

More generally, I think the broader point here is just that there are a lot of possible pseudo-aligned mesa-objectives: the space of possible objectives is very large, and the actual base objective occupies only a tiny fraction of that space. Thus, to the extent that you are optimizing for anything other than pure similarity to the base objective, you're likely to find an optimum which isn't exactly the base objective, just simply because there are so many different possible objectives for you to find, and it's likely that one of them will gain more from increased simplicity (or anything else) than it loses by being farther away from the base objective.

Comment by evhub on Conditions for Mesa-Optimization · 2019-06-04T21:00:39.829Z · score: 3 (2 votes) · LW · GW

I actually just updated the paper to use model capacity instead of algorithmic range to avoid needlessly confusing machine learning researchers, though I'm keeping algorithmic range here.

Comment by evhub on The Inner Alignment Problem · 2019-06-04T18:06:14.213Z · score: 7 (4 votes) · LW · GW

I agree with that as a general takeaway, though I would caution that I don't think it's always true—for example, hard-coded optimization seems to help in both cases, and I suspect algorithmic range to be more complicated than that, likely making some pseudo-alignment problems better but also possibly making some worse.

Also, yeah, that was backwards—it should be fixed now.

Comment by evhub on Conditions for Mesa-Optimization · 2019-06-03T22:50:47.711Z · score: 1 (1 votes) · LW · GW

I believe AlphaZero without MCTS is still very good but not superhuman—International Master level, I believe. That being said, it's unclear how much optimization/search is currently going on inside of AlphaZero's policy network. My suspicion would be that currently it does some, and that to perform at the same level as the full AlphaZero it would have to perform more.

I added a footnote regarding capacity limitations (though editing doesn't appear to be working for me right now—it should show up in a bit). As for the broader point, I think it's just a question of degree—for a sufficiently diverse environment, you can do pretty well with just heuristics, you do better introducing optimization, and you keep getting better as you keep doing more optimization. So the question is just what does "perform well" mean and what threshold are you drawing for "internally performs something like a tree search."

Comment by evhub on Conditions for Mesa-Optimization · 2019-06-03T22:37:32.868Z · score: 1 (1 votes) · LW · GW

The argument in this post is just that it might help prevent mesa-optimization from happening at all, not that it would make it more aligned. The next post will be about how to align mesa-optimizers.

Comment by evhub on Conditions for Mesa-Optimization · 2019-06-03T22:36:39.631Z · score: 1 (1 votes) · LW · GW

The idea would be that all of this would be learned—if the optimization machinery is entirely internal to the system, it can choose how to use that optimization machinery arbitrarily. We talk briefly about systems where the optimization is hard-coded, but those aren't mesa-optimizers. Rather, we're interested in situations where your learned algorithm itself performs optimization internal to its own workings—optimization it could re-use to do prediction or vice versa.

Comment by evhub on Conditions for Mesa-Optimization · 2019-06-03T22:34:38.704Z · score: 4 (3 votes) · LW · GW

It definitely will vary with the environment, though the question is degree. I suspect most of the variation will be in how much optimization power you need, as opposed to how difficult it is to get some degree of optimization power, which motivates the model presented here—though certainly there will be some deviation in both. The footnote should probably be rephrased so as not to assert that it is completely independent, as I agree that it obviously isn't, but just that it needs to be relatively independent, with the amount of optimization power dominating for the model to make sense.

Renamed to —good catch (though editing doesn't appear to be working for me right now—it should show up in a bit)!

Algorithmic range is very similar to model capacity, except that we're thinking slightly more broadly as we're more interested in the different sorts of general procedures your model can learn to implement than how many layers of convolutions you can do. That being said, they're basically the same thing.

Comment by evhub on Conditions for Mesa-Optimization · 2019-06-03T04:03:53.379Z · score: 12 (4 votes) · LW · GW

No, not all—we distinguish robustly aligned mesa-optimizers, which are aligned on and off distribution, from pseudo-aligned mesa-optimizers, which appear to be aligned on distribution, but are not necessarily aligned off-distribution. For the full glossary, see here.

Comment by evhub on Risks from Learned Optimization: Introduction · 2019-06-01T22:46:12.685Z · score: 24 (5 votes) · LW · GW

I don't have a good term for that, unfortunately—if you're trying to build an FAI, "human values" could be the right term, though in most cases you really just want "move one strawberry onto a plate without killing everyone," which is quite a lot less than "optimize for all human values." I could see how meta-objective might make sense if you're thinking about the human as an outside optimizer acting on the system, though I would shy away from using that term like that, as anyone familiar with meta-learning will assume you mean the objective of a meta-learner instead.

Also, the motivation for choosing outer alignment as the alignment problem between the base objective and the goals of the programmers was to capture the "classical" alignment problem as it has sometimes previously been envisioned, wherein you just need to specify an aligned set of goals and then you're good. As we argue, though, mesa-optimization means that you need more than just outer alignment—if you have mesa-optimizers, you also need inner alignment, as even if your base objective is perfectly aligned, the resulting mesa-objective (and thus the resulting behavioral objective) might not be.

Comment by evhub on Risks from Learned Optimization: Introduction · 2019-06-01T22:25:44.172Z · score: 19 (8 votes) · LW · GW

The word mesa is Greek meaning into/inside/within, and has been proposed as a good opposite word to meta, which is Greek meaning about/above/beyond. Thus, we chose mesa based on thinking about mesa-optimization as conceptually dual to meta-optimization—whereas meta is one level above, mesa is one level below.

Comment by evhub on A Concrete Proposal for Adversarial IDA · 2019-04-20T04:58:10.375Z · score: 3 (2 votes) · LW · GW

I considered collapsing all of it into one (as Paul has talked about previously), but as you note the amplification procedure I describe here basically already does that. The point of the distillation step, thus, is just to increase sample efficiency by letting you get additional training in without requiring additional calls to $H$. I do agree that you could include the iteration procedure described here in the amplification procedure, which is probably a good idea, though you'd probably want to anneal the adversary's influence in that situation, as the adversary starts out really bad, whereas in this setup you shouldn't have to do any annealing because by the time you get to that point the adversary should be performing well enough that it will automatically anneal as its predictions get better. Also, apologies for the math--I didn't really have the time to write up more explanation, so it was a choice between posting it as is or not posting it at all, and I went with posting it as is.

(Also, the sum isn't a typo--I'm using the adversary to predict the negative of the loss, not the loss, which I admit is confusing and I should probably switch it.)

Comment by evhub on How does Gradient Descent Interact with Goodhart? · 2019-02-04T05:53:30.889Z · score: 20 (9 votes) · LW · GW

While at a recent CFAR workshop with Scott, Peter Schmidt-Nielsen and I wrote some code to run experiments of the form that Scott is talking about here. If anyone is interested, the code can be found here, though I'll also try to summarize our results below.

Our methodology was as follows:
1. Generate a real utility function by randomly initializing a feed-forward neural network with 3 hidden layers with 10 neurons each and tanh activations, then train it using 5000 steps of gradient descent with a learning rate of 0.1 on a set of 1023 uniformly sampled data points. The reason we pre-train the network on random data is that we found that randomly initialized neural networks tended to be very similar and very smooth such that it was very easy for the proxy network to learn them, whereas networks trained on random data were significantly more variable.
2. Generate a proxy utility function by training a randomly initialized neural network with the same architecture as the real network on 50 uniformly sampled points from the real utility using 1000 steps of gradient descent with a learning rate of 0.1.
3. Fix μ to be uniform sampling.
4. Let be uniform sampling followed by 50 steps of gradient descent on the proxy network with a learning rate of 0.1.
5. Sample 1000000 points from μ, then optimize those same points according to . Create buckets of radius 0.01 utilons for all proxy utility values and compute the real utility values for points in that bucket from the μ set and the set.
6. Repeat steps 1-5 ten times, then average the final real utility values per bucket and plot them. Furthermore, compute the "Goodhart error" as the average real utility of the proxy-optimized points minus the average real utility of the random points, plotted against their proxy utility values.
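The pipeline above can be sketched in miniature. This is a hedged, simplified stand-in: the trained tanh networks are replaced by hypothetical smooth functions, autodiff gradients by finite differences, and the sample count is scaled down, but the bucketing and Goodhart-error computation follow the steps described.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the trained networks: a smooth "real"
# utility and a slightly perturbed "proxy" utility.
real = lambda x: np.sin(3 * x) + 0.5 * x
proxy = lambda x: np.sin(3 * x + 0.1) + 0.5 * x

def grad_ascend(x, f, steps=50, lr=0.1, eps=1e-4):
    # Step 4: ascend f from each starting point using
    # finite-difference gradient steps.
    for _ in range(steps):
        g = (f(x + eps) - f(x - eps)) / (2 * eps)
        x = x + lr * g
    return x

def bucket_means(xs, width=0.01):
    # Step 5: bucket points by proxy utility value and average the
    # *real* utility within each bucket.
    keys = np.round(proxy(xs) / width).astype(int)
    return {k: real(xs[keys == k]).mean() for k in np.unique(keys)}

n = 100_000                    # scaled down from 1,000,000
base = rng.uniform(-1, 1, n)   # samples from mu
opt = grad_ascend(base, proxy) # the same points, optimized on the proxy

# Goodhart error: real utility of optimized points minus real utility
# of random points, conditional on landing in the same proxy bucket.
mu_means, opt_means = bucket_means(base), bucket_means(opt)
err = {k: opt_means[k] - mu_means[k]
       for k in opt_means.keys() & mu_means.keys()}
```

Here `err` plays the role of the per-bucket Goodhart error that gets averaged and plotted in step 6.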

The plot generated by this process is given below:

As can be seen from the plot, the Goodhart error is fairly consistently negative, implying that the gradient-descent-optimized points perform worse on the real utility, conditional on their proxy utility value.

However, using an alternative , we were able to reverse the effect. That is, we ran the same experiment, but instead of training the proxy network to match the real utility's values on the sampled points, we trained it to match the real utility's gradients on those points. This resulted in the following graph:

As can be seen from the plot, the Goodhart error flipped and became positive in this case, implying that the gradient optimization did significantly better than the point optimization.
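The difference between the two training objectives can be illustrated with a small sketch (hypothetical; it uses finite-difference gradients rather than the autodiff in the original code):

```python
import numpy as np

def fd_grad(f, x, eps=1e-4):
    # Finite-difference derivative of a scalar function f at points x.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def value_loss(proxy, real, xs):
    # Original objective: match the real utility's *values* at the samples.
    return np.mean((proxy(xs) - real(xs)) ** 2)

def grad_loss(proxy, real, xs):
    # Alternative objective: match the real utility's *gradients* at the
    # samples -- the change that flipped the sign of the Goodhart error.
    return np.mean((fd_grad(proxy, xs) - fd_grad(real, xs)) ** 2)

# A proxy that is offset in value but has exactly the right gradients:
real = lambda x: np.sin(x)
offset_proxy = lambda x: np.sin(x) + 1.0
xs = np.linspace(-1, 1, 101)
# grad_loss treats this proxy as perfect, while value_loss penalizes it,
# so the two objectives select for quite different proxies.
```

The intuition is that matching gradients directly constrains where the proxy sends gradient-descent-optimized points, whereas matching values only constrains the proxy's level sets.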

Finally, we also did a couple of checks to ensure the correctness of our methodology.

First, one concern was that our method of bucketing could be biased. To determine the degree of "bucket error," we computed the average proxy utility for each bucket from the μ and datasets and took the difference. This should be approximately zero, since the buckets are generated based on proxy utility, and any substantial deviation from zero would imply a systematic bias in the buckets. We did find a significant bucket error for large bucket sizes, but for our final bucket size of 0.01 we found a bucket error in the range 0-0.01, which should be negligible.
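The bucket-error check can be sketched as follows (a hypothetical stand-in proxy function replaces the trained network). Since buckets are keyed on proxy utility with width 0.01, per-bucket mean proxy utility can differ between two independent datasets by at most the bucket width:

```python
import numpy as np

rng = np.random.default_rng(0)
proxy = lambda x: np.sin(3 * x)   # hypothetical stand-in proxy utility

def bucket_mean_proxy(xs, width=0.01):
    # Average *proxy* utility per proxy-utility bucket.
    keys = np.round(proxy(xs) / width).astype(int)
    return {k: proxy(xs[keys == k]).mean() for k in np.unique(keys)}

# Bucket two independent datasets the same way and compare: the
# per-bucket difference in mean proxy utility is the "bucket error."
a = bucket_mean_proxy(rng.uniform(-1, 1, 50_000))
b = bucket_mean_proxy(rng.uniform(-1, 1, 50_000))
bucket_err = {k: abs(a[k] - b[k]) for k in a.keys() & b.keys()}
```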

Second, another thing we did to check our methodology was to generate simply by sampling 100 random points, then selecting the one with the highest proxy utility value. This should give exactly the same results as μ, since bucketing conditions on the proxy utility value, and in fact that was what we got.
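This second sanity check can be sketched like so (again with a hypothetical stand-in proxy):

```python
import numpy as np

rng = np.random.default_rng(0)
proxy = lambda x: -(x - 0.3) ** 2   # hypothetical proxy utility

def best_of_k(k=100):
    # The check described above: sample k random points and return
    # the one with the highest proxy utility.
    xs = rng.uniform(-1, 1, k)
    return xs[np.argmax(proxy(xs))]

# Conditional on landing in a given proxy-utility bucket, these points
# are distributed just like uniform samples, so the bucketed comparison
# against mu should show no Goodhart error.
samples = np.array([best_of_k() for _ in range(1000)])
```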

Comment by evhub on Owen's short-form blog · 2018-09-15T23:09:58.645Z · score: 1 (1 votes) · LW · GW

Comment by evhub on Dependent Type Theory and Zero-Shot Reasoning · 2018-07-11T19:43:17.275Z · score: 4 (3 votes) · LW · GW

Dimensional analysis is absolutely an instance of what I'm talking about!

As for only being able to do constructive stuff, you actually can do classical stuff as well, but you have to explicitly assume the law of the excluded middle. For example, if in Lean I write

axiom lem (P: Prop): P ∨ ¬P

then I can start doing classical reasoning.
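For instance, with that axiom in scope you can derive double-negation elimination, which is not provable constructively (a small Lean 3 sketch, assuming only the axiom above):

```lean
axiom lem (P : Prop) : P ∨ ¬P

-- Double-negation elimination via case analysis on lem:
theorem dne (P : Prop) (h : ¬¬P) : P :=
or.elim (lem P) id (λ hnp, absurd hnp h)
```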

Also, you're totally right that you could also do and so on as much as you want, but there's no real reason to do so, since if you start from the simplest possible way to do it you'll solve the problem by the time you get to .

Comment by evhub on Conditions under which misaligned subagents can (not) arise in classifiers · 2018-07-11T03:34:03.880Z · score: 2 (2 votes) · LW · GW

There are a couple of pieces of this that I disagree with:

• I think claim 1 is wrong because even if the memory is unhelpful, the agent that uses it might be simpler, and so you might still end up with an agent. My intuition is that just specifying a utility function and an optimization process is often much simpler than specifying the complete details of the actual solution, and thus any sort of program-search-based optimization process (e.g., gradient descent on a neural network) has a good chance of finding an agent.
• I think claim 3 is wrong because agenty solutions exist for all tasks, even classification tasks. For example, take the function which spins up an agent, tasks that agent with the classification task, and then takes that agent's output. Unless you've done something to explicitly remove agents from your search space, this sort of solution always exists.
• Thus, I think claim 6 is wrong due to my complaints about claims 1 and 3.
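The construction in the second bullet can be made concrete with a toy sketch (hypothetical; the inner objective here is just an illustrative stand-in):

```python
def inner_agent(x, labels, objective):
    # An "agent": search over candidate outputs for the one its
    # objective scores highest.
    return max(labels, key=lambda y: objective(x, y))

def classify(x):
    # A classification function that spins up the agent, tasks it with
    # the classification problem, and returns the agent's output.
    # The objective rewards matching the sign of x (illustrative only).
    match_sign = lambda x, y: (1 if x >= 0 else -1) * (1 if y == 1 else -1)
    return inner_agent(x, labels=[0, 1], objective=match_sign)
```

Nothing about the input-output type of a classifier rules out this kind of solution; only the search space or inductive biases can.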