Utility ≠ Reward 2019-09-05T17:28:13.222Z · score: 71 (24 votes)
2-D Robustness 2019-08-30T20:27:34.432Z · score: 36 (20 votes)
Risks from Learned Optimization: Conclusion and Related Work 2019-06-07T19:53:51.660Z · score: 59 (14 votes)
Deceptive Alignment 2019-06-05T20:16:28.651Z · score: 59 (14 votes)
The Inner Alignment Problem 2019-06-04T01:20:35.538Z · score: 65 (15 votes)
Conditions for Mesa-Optimization 2019-06-01T20:52:19.461Z · score: 55 (17 votes)
Risks from Learned Optimization: Introduction 2019-05-31T23:44:53.703Z · score: 112 (33 votes)
Clarifying Consequentialists in the Solomonoff Prior 2018-07-11T02:35:56.533Z · score: 18 (12 votes)


Comment by vlad_m on Utility ≠ Reward · 2019-09-13T09:55:02.451Z · score: 3 (2 votes) · LW · GW

Ah; this does seem to be an unfortunate confusion.

I didn’t intend to make ‘utility’ and ‘reward’ terminology – that’s what ‘mesa-‘ and ‘base’ objectives are for. I wasn’t aware of the terms being used in the technical sense as in your comment, so I wanted to use utility and reward as friendlier and familiar words for this intuition-building post. I am not currently inclined to rewrite the whole thing using different words because of this clash, but could add a footnote to clear this up. If the utility/reward distinction in your sense becomes accepted terminology, I’ll think about rewriting this.

That said, the distinctions we’re drawing appear to be similar. In your terminology, a utility-maximising agent is an agent which has an internal representation of a goal which it pursues. Whereas a reward-maximising agent does not have a rich internal goal representation but instead a kind of pointer to the external reward signal. To me this suggests your utility/reward tracks a very similar, if not the same, distinction between internal/external that I want to track, but with a difference in emphasis. When either of us says ‘utility ≠ reward’, I think we mean the same distinction, but what we want to draw from that distinction is different. Would you disagree?

Comment by vlad_m on Utility ≠ Reward · 2019-09-05T23:14:58.263Z · score: 6 (4 votes) · LW · GW

Thanks for raising this. While I basically agree with evhub on this, I think it is unfortunate that the linguistic justification is messed up as it is. I'll try to amend the post to show a bit more sensitivity to the Greek not really working like intended.

Though I also think that "the opposite of meta"-optimiser is basically the right concept, I feel quite dissatisfied with the current terminology, with respect to both the "mesa" and the "optimiser" parts. This is despite us having spent a substantial amount of time and effort on trying to get the terminology right! My takeaway is that it's just hard to pick terms that are both non-confusing and evocative, especially when naming abstract concepts. (And I don't think we did that badly, all things considered.)

If you have ideas on how to improve the terms, I would like to hear them!

Comment by vlad_m on Risks from Learned Optimization: Introduction · 2019-06-24T17:58:15.167Z · score: 8 (6 votes) · LW · GW

You’re completely right; I don’t think we meant to have ‘more formally’ there.

Comment by vlad_m on Risks from Learned Optimization: Introduction · 2019-06-09T18:53:04.647Z · score: 6 (3 votes) · LW · GW

To some extent, but keep in mind that in another sense, the behavioural objective of maximising paperclips is totally consistent with playing along with the base objective for a while and then defecting. So I’m not sure the behaviour/mesa- distinction alone does the work you want it to do even in that case.

Comment by vlad_m on Risks from Learned Optimization: Introduction · 2019-06-09T18:48:30.712Z · score: 2 (2 votes) · LW · GW

I’ve been meaning for a while to read Dennett with reference to this, and actually have a copy of Bacteria to Bach. Can you recommend some choice passages, or is it significantly better to read the entire book?

P.S. I am quite confused about DQN’s status here and don’t wish to suggest that I’m confident it’s an optimiser. Just to point out that it’s plausible we might want to call it one without calling PPO an optimiser.

P.P.S.: I forgot to mention in my previous comment that I enjoyed the objective graph stuff. I think there might be fruitful overlap between that work and the idea we’ve sketched out in our third post on a general way of understanding pseudo-alignment. Our objective graph framework is less developed than yours, so perhaps your machinery could be applied there to get a more precise analysis?

Comment by vlad_m on Risks from Learned Optimization: Introduction · 2019-06-08T18:47:18.568Z · score: 15 (6 votes) · LW · GW

Thanks for an insightful comment. I think your points are good to bring up, and though I will offer a rebuttal I’m not convinced that I am correct about this.

What’s at stake here is: describing basically any system as an agent optimising some objective is going to be a leaky abstraction. The question is, how do we define the conditions of calling something an agent with an objective in such a way to minimise the leaks?

Distinguishing the “this system looks like it optimises for X” from “this system internally uses an evaluation of X to make decisions” is useful from the point of view of making the abstraction more robust. The former doesn’t make clear what makes the abstraction “work”, and so when to expect it to fail. The latter will at least tell you what kind of failures to expect in the abstraction: places where the evaluation of X doesn’t connect to the rest of the system like it’s supposed to. In particular, you’re right that if the learned environment model doesn’t generalise, the mesa-objective won’t be predictive of behaviour. But that’s actually a prediction of taking this view. On the other hand, it is unclear if taking the behavioural view would predict that the system will change its behaviour off-distribution (partially, because it’s unclear what exactly grounds the similarities in behaviour on-distribution).

I think it definitely is useful to also think about the behavioural objective in the way you describe, because the later concerns we raise basically do also translate to coherent behavioural objectives. And I welcome more work trying to untangle these concepts from one another, or trying to dissolve any of them as unnecessary. I am just wary of throwing away seemingly relevant assumptions about internal structure before we can show they’re unhelpful.


You’re also right to point out DQN as an interesting edge case. But I am actually unsure that DQN agents should be considered non-optimisers, in the sense that they do perform rudimentary optimisation: they take an argmax of the Q function. The Q function is regressed to the episode returns. If the learning goes well, the Q function is literally representing the agent’s objective (indeed, it’s not really selected to maximise return; its selected to be accurate at predicting return). Contrast this with e.g. policy optimisation trained agents, which are not supposed to directly represent an objective, but are supposed to score well on it. (Someone good at running RL experiments maybe should look into comparing the coherence of revealed preferences of DQN agents with PPO agents. I’d read that paper.)

Comment by vlad_m on Risks from Learned Optimization: Introduction · 2019-06-08T18:15:14.877Z · score: 10 (3 votes) · LW · GW

I think humans are fairly weird because we were selected for an objective that is unlikely to be what we select for in our AIs.

That said, if we model AI success as driven by model size and compute (with maybe innovations in low-level architecture), then I think that the way humans represent objectives is probably fairly close to what we ought to expect.

If we model AI success as mainly innovative high-level architecture, then I think we will see more explicitly represented objectives.

My tentative sense is that for AI to be interpretable (and safer) we want it to be the latter kind, but given enough compute the former kind of AI will give better results, other things being equal.

Here, what I mean by low-level architecture is something like “we’ll use lots of LSTMs instead of lots of plain RNNs, but keep the model structure simple: plug in the inputs, pass it through some layers, and read out the action probabilities”, and high-level is something like “let’s organise the model using this enormous flowchart with all of these various pieces that each are designed to take a particular role; here’s the observation embedding, here’s the search in latent model space, here’s the ...”

Comment by vlad_m on Conditions for Mesa-Optimization · 2019-06-06T07:10:19.252Z · score: 4 (3 votes) · LW · GW

Yes, it probably doesn’t apply to most objectives. Though it seems to me that the closer the task is to something distinctly human, the more probable it is that this kind of consideration can apply. E.g., making judgements in criminal court cases and writing fiction are domains where it’s not implausible to me that this could apply.

I do think this is a pretty speculative argument, even for this sequence.

Comment by vlad_m on Conditions for Mesa-Optimization · 2019-06-04T10:51:29.869Z · score: 4 (3 votes) · LW · GW

The main benefit I see of hardcoding optimisation is that, assuming the system's pieces learn as intended (without any mesa-optimisation happening in addition to the hardcoded optimisation) you get more access and control as a programmer over what the learned objective actually is. You could attempt to regress the learned objective directly to a goal you want, or attempt to enforce a certain form on it, etc. When the optimisation itself is learned*, the optimiser is more opaque, and you have fewer ways to affect what goal is learned: which weights of your enormous LSTM-based mesa-optimiser represent the objective?

This doesn't solve the problem completely (you might still learn an objective that is very incorrect off-distribution, etc.), but could offer more control and insight into the system to the programmer.

*Of course, you can have learned optimisation where you keep track of the objective which is being optimised (like in Learning to Learn by Gradient Descent), but I'd class that more under hard-coded optimisation for the purposes of this discussion. Here I mean the kind of learned optimisation that happens where you're not building the architecture explicitly around optimising or learning to optimise.

Comment by vlad_m on Conditions for Mesa-Optimization · 2019-06-04T10:41:23.373Z · score: 4 (3 votes) · LW · GW

The section on human modelling annoyingly conflates two senses of human modelling. One is the sense you talk about, the other is seen in the example:

For example, it might be the case that predicting human behavior requires instantiating a process similar to human judgment, complete with internal motives for making one decision over another.

The idea there isn't that the algorithm simulates human judgement as an external source of information for itself, but that the actual algorithm learns to be a human-like reasoner, with human-like goals (because that's a good way of approximating the output of human-like reasoning). In that case, the agent really is a mesa-optimiser, to the degree that a goal-directed human-like reasoner is an optimiser.

(I'm not sure to what degree it's actually likely that a good way to approximate the behaviour of human-like reasoning is to instantiate human-like reasoning)

Comment by vlad_m on Selection vs Control · 2019-06-02T20:03:50.586Z · score: 11 (7 votes) · LW · GW

to what extent are mesa-controllers with simple behavioural objectives going to be simple?

I’m not sure what “simple behavioural objective” really means. But I’d expect that for tasks requiring very simple policies, controllers would do, whereas the more complicated the policy required to solve a task, the more one would need to do some kind of search. Is this what we observe? I’m not sure. AlphaStar and OpenAI Five seem to do well enough in relatively complex domains without any explicit search built into the architecture. Are they using their recurrence to search internally? Who knows. I doubt it, but it’s not implausible.

certain kinds of mesa-controllers can be simple: the mesa-controllers which are more like my rocket example (explicit world-model; explicit representation of objective within that world model; but, optimal policy does not use any search).

The rocket example is interesting. I guess the question for me there is, what sorts of tasks admit an optimal policy that can be represented in this way? Here it also seems to me like the more complex an environment, the more implausible it seems that a powerful policy can be successfully represented with straightforward functions. E.g., let’s say we want a rocket not just to get to the target, but to self-identify a good target in an area and pick a trajectory that evades countermeasures. I would be somewhat surprised if we can still represent the best policy as a set of non-searchy functions. So I have this intuition that for complex state spaces, it’s hard to find pure controllers that do the job well.

Comment by vlad_m on Selection vs Control · 2019-06-02T15:38:24.555Z · score: 18 (11 votes) · LW · GW

(I am unfortunately currently bogged down with external academic pressures, and so cannot engage with this at the depth I’d like to, but here’s some initial thoughts.)

I endorse this post. The distinction explained here seems interesting and fruitful.

I agree with the idea to treat selection and control as two kinds of analysis, rather than as two kinds of object – I think this loosely maps onto the distinction we make between the mesa-objective and the behavioural objective. The former takes the selection view of the learned algorithm; the latter takes the control view.

At least speaking for myself (the other authors might have different thoughts on this), the decision to talk explicitly in terms of the selection view in the mesa-optimiser post is based on an intuition that selectors, in general, have more coherently defined counterfactual behaviour. That is, given a very different input, a selector will still select an output that scores well on its mesa-objective, because that’s how selectors work. Whereas a controller, to the degree it optimises for an objective, seems more likely to just completely stop working on a different input. I have fairly low confidence in this argument, however: it seems to me that one can plausibly have pretty coherent counterfactual behaviour in a very broad distribution even without doing selection. And since it is ultimately the behaviour that does the damage, it would be good to have a working distinction that is based purely on that. We (the mesa-optimisation authors) haven’t been able to come up with one.

Another reason to be interested in selectors is that in RL, the learned algorithm is supposed to fill a controller role. So, restricting attention to selectors allows to talk at least somewhat meaningfully about non-optimiser agents, which is otherwise difficult, as any learned agent is in a controller-shaped context.

In any case, I hope that more work happens on this problem, either dissolving the need to talk about optimisation, or at least making all these distinctions more precise. The vagueness of everything is currently my biggest worry about the legitimacy of mesa-optimiser concerns.

Comment by vlad_m on What failure looks like · 2019-03-27T00:22:49.513Z · score: 3 (2 votes) · LW · GW

The goal that the agent is selected to score well on is not necessarily the goal that the agent is itself pursuing. So, unless the agent’s internal goal matches the goal for which it’s selected, the agent might still seek influence because its internal goal permits that. I think this is in part what Paul means by “Avoiding end-to-end optimization may help prevent the emergence of influence-seeking behaviors (by improving human understanding of and hence control over the kind of reasoning that emerges)”

Comment by vlad_m on Probabilistic decision-making as an anxiety-reduction technique · 2018-07-17T05:27:57.651Z · score: 2 (2 votes) · LW · GW

I also use a simple version of this, with a key extra step at the end:

1) have a decision you are unsure about. 2) perform randomisation (I usually just use a coin). 3) notice how the outcome makes you feel. If you find that you wish the coin landed the other way, override the decision and do what you secretly wanted to do all along.

You might think the third step defeats the purpose of the exercise, but so long as you actually commit to following the randomisation most of the time, it gives you direct access to very useful information. It also sets up the right incentive, wherein you never really need to work your willpower against your desires (except, I guess, the desire to deliberate more).

I mostly use this for a slightly different use case – inconsequential decisions like where to eat or small purchases, where taking a lot of time to optimise isn’t worth it. Your mileage may vary with more important decisions, but I see no reason in principle this couldn’t work.

Comment by vlad_m on Clarifying Consequentialists in the Solomonoff Prior · 2018-07-11T22:16:27.264Z · score: 1 (1 votes) · LW · GW

I agree. That’s what I meant when I wrote there will be TMs that artificially promote S itself. However, this would still mean that most of S’s mass in the prior would be due to these TMs, and not due to the natural generator of the string.

Furthermore, it’s unclear how many TMs would promote S vs S’ or other alternatives. Because of this, I don’t now whether the prior would be higher for S or S’ from this reasoning alone. Whichever is the case, the prior no longer reflects meaningful information about the universe that generates S and whose inhabitants are using the prefix to choose what to do; it’s dominated by these TMs that search for prefixes they can attempt to influence.

Comment by vlad_m on Clarifying Consequentialists in the Solomonoff Prior · 2018-07-11T19:55:10.443Z · score: 1 (1 votes) · LW · GW

I agree that this probably happens when you set out to mess with an arbitrary particular S, I.e. try to make some S’ that shares a prefix with S as likely as S.

However, some S are special, in the sense that their prefixes are being used to make very important decisions. If you, as a malicious TM in the prior, perform an exhaustive search of universes, you can narrow down your options to only a few prefixes used to make pivotal decisions, selecting one of those to mess with is then very cheap to specify. I use S to refer to those strings that are the ‘natural’ continuation of those cheap-to-specify prefixes.

There are, it seems to me, a bunch of other equally-complex TMs that want to make other strings that share that prefix more likely, including some that promote S itself. What the resulting balance looks like is unclear to me, but what’s clear is that the prior is malign with respect to that prefix - conditioning on that prefix gives you a distribution almost entirely controlled by these malign TMs. The ‘natural’ complexity of S, or of other strings that share the prefix, play almost no role in their priors.

The above is of course conditional on this exhaustive search being possible, which also relies on there being anyone in any universe that actually uses the prior to make decisions. Otherwise, we can’t select the prefixes that can be messed with.

Comment by vlad_m on Clarifying Consequentialists in the Solomonoff Prior · 2018-07-11T04:47:38.194Z · score: 2 (2 votes) · LW · GW

The trigger sequence is a cool idea.

I want to add that the intended generator TM also needs to specify a start-to-read time, so there is symmetry there. Whatever method a TM needs to use to select the camera start time in the intended generator for the real world samples, it can also use in the simulated world with alien life, since for the scheme to work only the difference in complexity between the two matters.

There is additional flex in that unlike the intended generator, the reasoner TM can sample its universe simulation at any cheaply computable interval, giving the civilisation the option of choosing any amount of thinking they can perform between outputs, if they so choose.