Formal Solution to the Inner Alignment Problem

michaelcohen

post by michaelcohen (cocoa) · 2021-02-18T14:51:14.117Z · LW · GW · 123 comments

We've written a paper on online imitation learning, and our construction allows us to bound the extent to which mesa-optimizers could accomplish anything. This is not to say it will definitely be easy to eliminate mesa-optimizers in practice, but investigations into how to do so could look here as a starting point. The way to avoid outputting predictions that may have been corrupted by a mesa-optimizer is to ask for help when plausible stochastic models disagree about probabilities.

Here is the abstract:

In imitation learning, imitators and demonstrators are policies for picking actions given past interactions with the environment. If we run an imitator, we probably want events to unfold similarly to the way they would have if the demonstrator had been acting the whole time. No existing work provides formal guidance in how this might be accomplished, instead restricting focus to environments that restart, making learning unusually easy, and conveniently limiting the significance of any mistake. We address a fully general setting, in which the (stochastic) environment and demonstrator never reset, not even for training purposes. Our new conservative Bayesian imitation learner underestimates the probabilities of each available action, and queries for more data with the remaining probability. Our main result: if an event would have been unlikely had the demonstrator acted the whole time, that event's likelihood can be bounded above when running the (initially totally ignorant) imitator instead. Meanwhile, queries to the demonstrator rapidly diminish in frequency.

The second-last sentence refers to the bound on what a mesa-optimizer could accomplish. We assume a realizable setting (positive prior weight on the true demonstrator-model). There are none of the usual embedding problems here—the imitator can just be bigger than the demonstrator that it's modeling.

(As a side note, even if the imitator had to model the whole world, it wouldn't be a big problem theoretically. If the walls of the computer don't in fact break during the operation of the agent, then "the actual world" and "the actual world outside the computer conditioned on the walls of the computer not breaking" both have equal claim to being "the true world-model", in the formal sense that is relevant to a Bayesian agent. And the latter formulation doesn't require the agent to fit inside the world that it's modeling.)
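To make the abstract's query rule concrete, here is a minimal sketch, not the paper's exact construction. It assumes a finite list of candidate demonstrator models with posterior weights, and it treats a model as "plausible" when its posterior weight is at least `alpha` times the largest one. The names `conservative_action_probs` and `alpha`, and the example numbers, are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def conservative_action_probs(
    model_probs: List[Dict[str, float]],
    posterior: List[float],
    alpha: float,
) -> Tuple[Dict[str, float], float]:
    """Return (underestimated action probabilities, probability of querying)."""
    top_weight = max(posterior)
    # Assumption: "plausible" means posterior weight within a factor alpha of the best model.
    plausible = [m for m, w in zip(model_probs, posterior) if w >= alpha * top_weight]
    actions = model_probs[0].keys()
    # Underestimate each action by the smallest probability any plausible model assigns it.
    lower = {a: min(m[a] for m in plausible) for a in actions}
    # The probability mass left over is spent on asking the demonstrator.
    query_prob = 1.0 - sum(lower.values())
    return lower, query_prob


confident = {"left": 0.9, "right": 0.1}
unsure = {"left": 0.5, "right": 0.5}

# Plausible models agree: nothing is left over, so no query.
agree_probs, agree_query = conservative_action_probs([confident, confident], [0.5, 0.5], alpha=0.1)
# Plausible models disagree: the gap between them becomes query probability.
split_probs, split_query = conservative_action_probs([confident, unsure], [0.6, 0.4], alpha=0.1)
```

When the plausible models agree, the underestimates sum to one and the imitator acts on its own; the more they disagree about probabilities, the more often it asks for help.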

Almost no mathematical background is required to follow [Edit: most of] the proofs. [Edit: But there is a bit of jargon. "Measure" means "probability distribution", and "semimeasure" is a probability distribution that sums to less than one.] We feel our bounds could be made much tighter, and we'd love help investigating that.

These slides (pdf here) are fairly self-contained and a quicker read than the paper itself.

Below, $P_{dem}$ and $P_{im}$ refer to the probability of the event supposing the demonstrator or imitator were acting the entire time. The limit below refers to successively more unlikely events $B$; it's not a limit over time. Imagine a sequence of events $B_{i}$ such that $\lim_{i \to \infty} P_{dem}(B_{i}) = 0$.

Comments sorted by top scores.

comment by paulfchristiano · 2021-02-22T21:18:46.227Z · LW(p) · GW(p)

I haven't read the paper yet, looking forward to it. Using something along these lines to run a sufficiently-faithful simulation of HCH seems like a plausible path to producing an aligned AI with a halting oracle. (I don't think that even solves the problem given a halting oracle, since HCH is probably not aligned, but I still think this would be noteworthy.)

First I'm curious to understand this main result so I know what to look for and how surprised to be. In particular, I have two questions about the quantitative behavior described here:

if an event would have been unlikely had the demonstrator acted the whole time, that event's likelihood can be bounded above when running the (initially totally ignorant) imitator instead. Meanwhile, queries to the demonstrator rapidly diminish in frequency.

1: Dependence on the prior of the true hypothesis

It seems like you have the following tough case even if the human is deterministic:

There are N hypotheses about the human, of which one is correct and the others are bad.
I would like to make a series of N potentially-catastrophic decisions.
On decision k, all (N-1) bad hypotheses propose the same catastrophic prediction. We don't know k in advance.
For the other (N-1) decisions, (N-2) of the bad hypotheses make the correct prediction a*, and the final bad hypothesis claims that action a* would be catastrophic.

In this setting, it seems like I have no hope other than to query the human on all N decisions (since all days and hypotheses are symmetrical), so I assume that this is what your algorithm would do.

That strongly suggests that the number of queries to the human goes as 1 / p(correct demonstrator), unless you use some other feature of the hypothesis class. But p(correct demonstrator) is probably less than , so this constant might not be acceptable. Usually we try to have a logarithmic dependence on p(correct demonstrator) but this doesn't seem possible here.
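To spell out the counting step, here is a sketch that assumes a uniform prior over the $N$ hypotheses (the example above does not fix the prior):

$$p(\text{correct demonstrator}) = \frac{1}{N}, \qquad \#\text{queries} = N = \frac{1}{p(\text{correct demonstrator})}.$$

A logarithmic dependence would instead mean roughly $\log_2 N = \log_2 \frac{1}{p(\text{correct demonstrator})}$ queries.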

So as you say, we'd want to condition on some relevant facts to get up to the point where that probability might be acceptably-high. So then it seems like we have two problems:

It still seems like the actual probability of the truth is going to be quite low, so it seems like we need some future result that uses a tighter bound than just p(truth). That is, we need to exploit the fact that the treacherous hypotheses in the example above are pretty weird and intuitively there aren't that many of them, and that more reasonable hypotheses won't be trying so hard to make sneaky catastrophic actions. Maybe you already have this and I missed it.

(It feels to me like this is going to involve moving to the non-realizable setting at the same time. Vanessa has some comments elsewhere about being able to handle that with infra-Bayesianism, though I am fairly uncertain whether that is actually going to work.)
There's a real question about p(treachery) / p(correct model), and if this ratio is bad then it seems like we're definitely screwed with this whole approach.

Does that all seem right?

2: Dependence on the probability bound

Suppose that I want to bound the probability of catastrophe as $(1 + ϵ)$ times the demonstrator probability of catastrophe. It seems like the number of human queries must scale at least like $1 / ϵ$ . Is that right, and if so what's the actual dependence on epsilon?

I mostly ask about this because in the context of HCH we may need to push epsilon down to 1/N. But maybe there's some way to avoid that by considering predictors that update on counterfactual demonstrator behavior in the rest of the tree (even though the true demonstrator does not), to get a full bound on the relative probability of a tree under the true demonstrator vs model. I haven't thought about this in years and am curious if you have a take on the feasibility of that or whether you think the entire project is plausible.

Replies from: vanessa-kosoy, cocoa, vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-02-26T14:10:27.145Z · LW(p) · GW(p)

Here's the sketch of a solution to the query complexity problem.

Simplifying technical assumptions:

The action space is $\{0, 1\}$
All hypotheses are deterministic
Predictors output maximum likelihood predictions instead of sampling the posterior

I'm pretty sure removing those is mostly just a technical complication.

Safety assumptions:

The real hypothesis has prior probability lower bounded by some known quantity $δ$ , so we discard all hypotheses of probability less than $δ$ from the onset.
Malign hypotheses have total prior probability mass upper bounded by some known quantity $ϵ$ (determining parameters $δ$ and $ϵ$ is not easy, but I'm pretty sure any alignment protocol will have parameters of some such sort.)
Any predictor that uses a subset of the prior without malign hypotheses is safe (I would like some formal justification for this, but seems plausible on the face of it.)
Given any safe behavior, querying the user instead of producing a prediction cannot make it unsafe (this can and should be questioned but let it slide for now.)

Algorithm: On each round, let $p_{0}$ be the prior probability mass of unfalsified hypotheses predicting $0$ and $p_{1}$ be the same for $1$ .

If $p_{0} > p_{1} + ϵ$ or $p_{1} = 0$ , output $0$
If $p_{1} > p_{0} + ϵ$ or $p_{0} = 0$ , output $1$
Otherwise, query the user
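The three rules above can be sketched in a few lines. This is a minimal illustration under the stated simplifying assumptions (binary actions, deterministic hypotheses), and it assumes hypotheses below $δ$ have already been discarded. The class name `ThresholdPredictor`, the helper `run`, and the toy hypotheses are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Hypothesis = Callable[[int], int]  # maps a round index to a predicted bit


class ThresholdPredictor:
    def __init__(self, hypotheses: List[Hypothesis], prior: List[float], epsilon: float):
        # Unfalsified hypotheses paired with their prior mass.
        self.alive: List[Tuple[Hypothesis, float]] = list(zip(hypotheses, prior))
        self.epsilon = epsilon

    def step(self, t: int) -> Optional[int]:
        """Return 0 or 1, or None to signal a query to the user."""
        p0 = sum(w for h, w in self.alive if h(t) == 0)
        p1 = sum(w for h, w in self.alive if h(t) == 1)
        if p0 > p1 + self.epsilon or p1 == 0:
            return 0
        if p1 > p0 + self.epsilon or p0 == 0:
            return 1
        return None

    def observe(self, t: int, answer: int) -> None:
        # The user's answer falsifies every hypothesis that predicted otherwise.
        self.alive = [(h, w) for h, w in self.alive if h(t) == answer]


def run(predictor: ThresholdPredictor, truth: Hypothesis, rounds: int):
    outputs = []
    for t in range(rounds):
        out = predictor.step(t)
        if out is None:
            predictor.observe(t, truth(t))
        outputs.append(out)
    return outputs, outputs.count(None)


truth = lambda t: 0
wrong_early = lambda t: 1 if t == 0 else 0  # benign but mistaken at round 0
malign = lambda t: 1 if t == 3 else 0       # defects at round 3 only

predictor = ThresholdPredictor([truth, wrong_early, malign], [0.4, 0.4, 0.2], epsilon=0.25)
outputs, queries = run(predictor, truth, rounds=5)
```

In this toy run the malign hypothesis has prior mass 0.2, below $ϵ = 0.25$, so its defection at round 3 cannot push the predictor into outputting 1; it only triggers a query, after which it is falsified.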

Analysis: As long as $p_{0} + p_{1} ≫ ϵ$, on each round in which we query, $p_{0} + p_{1}$ is roughly halved. Therefore there can only be roughly $O(\log \frac{1}{ϵ})$ such rounds. On the following rounds, each time we query at least one hypothesis is removed, so there can only be roughly $O(\frac{ϵ}{δ})$ such rounds. The total query complexity is therefore approximately $O(\log \frac{1}{ϵ} + \frac{ϵ}{δ})$.
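One way to fill in the halving step, as a sketch under the same assumptions. Write $s = p_{0} + p_{1}$ for the unfalsified mass. A query happens only when $|p_{0} - p_{1}| \le ϵ$, and the answer falsifies one of the two sides, so the surviving mass $s'$ satisfies

$$s' \le \max(p_{0}, p_{1}) = \frac{s + |p_{0} - p_{1}|}{2} \le \frac{s + ϵ}{2}, \qquad \text{hence} \qquad s_{k} - ϵ \le 2^{-k} (s_{0} - ϵ).$$

With $s_{0} \le 1$, after about $\log_2 \frac{1}{ϵ}$ queries the remaining mass is at most $2ϵ$. Every surviving hypothesis has mass at least $δ$, so at most $\frac{2ϵ}{δ}$ of them remain, and each further query removes at least one.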

Replies from: paulfchristiano, cocoa

↑ comment by paulfchristiano · 2021-03-19T22:12:10.915Z · LW(p) · GW(p)

I agree that this settles the query complexity question for Bayesian predictors and deterministic humans.

I expect it can be generalized to have complexity in the case with stochastic humans where treacherous behavior can take the form of small stochastic shifts.

I think that the big open problems for this kind of approach to inner alignment are:

Is $\frac{ϵ}{δ}$ bounded? I assign significant probability to it being $2^{100}$ or more, as mentioned in the other thread between me and Michael Cohen, in which case we'd have trouble. (I believe this is also Eliezer's view.)
It feels like searching for a neural network is analogous to searching for a MAP estimate, and that more generally efficient algorithms are likely to just run one hypothesis most of the time rather than running them all. Then this algorithm increases the cost of inference by $\frac{ϵ}{δ}$ which could be a big problem.
As you mention, is it safe to wait and defer or are we likely to have a correlated failure in which all the aligned systems block simultaneously? (e.g. as described here)

Replies from: vanessa-kosoy, vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-04-02T17:03:49.801Z · LW(p) · GW(p)

Is $\frac{ϵ}{δ}$ bounded? I assign significant probability to it being $2^{100}$ or more, as mentioned in the other thread between me and Michael Cohen, in which case we'd have trouble.

Yes, you're right. A malign simulation hypothesis can be a very powerful explanation to the AI for why it found itself at a point suitable for this attack, thereby compressing the "bridge rules" by a lot. I believe you argued as much in your previous writing, but I managed to confuse myself about this.

Here's the sketch of a proposal how to solve this. Let's construct our prior to be the convolution of a simplicity prior with a computational easiness prior. As an illustration, we can imagine a prior that's sampled as follows:

First, sample a hypothesis $h$ from the Solomonoff prior
Second, choose a number $n$ according to some simple distribution with high expected value (e.g. $n^{-1-α}$ with $α ≪ 1$)
Third, sample a DFA $A$ with $n$ states and a uniformly random transition table
Fourth, apply $A$ to the output of $h$
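The four sampling steps can be sketched as follows. This is only an illustration: the Solomonoff prior is not computable, so `stand_in_hypothesis` is a placeholder for $h$; the power law is truncated at `n_max` to keep sampling finite; and giving each DFA state an output bit (a Moore machine) is one assumed reading of "apply $A$ to the output of $h$". All function names are hypothetical.

```python
import random
from typing import List, Tuple

DFA = Tuple[List[List[int]], List[int]]  # (transition table, per-state output bit)


def sample_num_states(rng: random.Random, alpha: float, n_max: int) -> int:
    # Truncated power law: P(n) proportional to n^(-1 - alpha) on 1..n_max.
    weights = [n ** (-1.0 - alpha) for n in range(1, n_max + 1)]
    return rng.choices(range(1, n_max + 1), weights=weights, k=1)[0]


def sample_dfa(rng: random.Random, n: int) -> DFA:
    # Uniformly random transition table over a binary input alphabet.
    transitions = [[rng.randrange(n), rng.randrange(n)] for _ in range(n)]
    # Assumption: each state emits one bit, so the automaton maps bit streams to bit streams.
    outputs = [rng.randrange(2) for _ in range(n)]
    return transitions, outputs


def apply_dfa(dfa: DFA, bits: List[int]) -> List[int]:
    transitions, outputs = dfa
    state = 0
    observed = []
    for b in bits:
        state = transitions[state][b]
        observed.append(outputs[state])
    return observed


def stand_in_hypothesis(length: int) -> List[int]:
    # Placeholder for the output of a sampled "physics" hypothesis h.
    return [t % 2 for t in range(length)]


rng = random.Random(0)
n = sample_num_states(rng, alpha=0.1, n_max=50)
bridge = sample_dfa(rng, n)
observations = apply_dfa(bridge, stand_in_hypothesis(20))
```

Here the stand-in plays the role of "physics" and the random automaton plays the role of the "bridge rules" that turn its output into observations.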

We think of the simplicity prior as choosing "physics" (which we expect to have low description complexity but possibly high computational complexity) and the easiness prior as choosing "bridge rules" (which we expect to have low computational complexity but possibly high description complexity). Ofc this convolution can be regarded as another sort of simplicity prior, so it differs from the original simplicity prior merely by a factor of $O (1)$ , however the source of our trouble is also "merely" a factor of $O (1)$ .

Now the simulation hypothesis no longer has an advantage via the bridge rules, since the bridge rules have a large constant budget allocated to them anyway. I think it should be possible to make this into some kind of theorem (two agents with this prior in the same universe that have access to roughly the same information should have similar posteriors, in the $α \to 0$ limit).

Replies from: paulfchristiano

↑ comment by paulfchristiano · 2021-04-05T19:47:19.513Z · LW(p) · GW(p)

I broadly think of this approach as "try to write down the 'right' universal prior." I don't think the bridge rules / importance-weighting consideration is the only way in which our universal prior is predictably bad. There are also issues like anthropic update and philosophical considerations about what kind of "programming language" to use and so on.

I'm kind of scared of this approach because I feel unless you really nail everything there is going to be a gap that an attacker can exploit. I guess you just need to get close enough that is manageable but I think I still find it scary (and don't totally remember all my sources of concern).

I think of this in contrast with my approach based on epistemic competitiveness, where the idea is not necessarily to identify these considerations in advance, but to be epistemically competitive with an attacker (inside one of your hypotheses) who has noticed an improvement over your prior. That is, if someone inside one of our hypotheses has noticed that e.g. a certain class of decisions is more important and so they will simulate only those situations, then we should also notice this and by the same token care more about our decision if we are in one of those situations (rather than using a universal prior without importance weighting). My sense is that without competitiveness we are in trouble anyway on other fronts, and so it is probably also reasonable to think of it as a first-line defense against this kind of issue.

We think of the simplicity prior as choosing "physics" (which we expect to have low description complexity but possibly high computational complexity) and the easiness prior as choosing "bridge rules" (which we expect to have low computational complexity but possibly high description complexity).

This is very similar to what I first thought about when going down this line. My instantiation runs into trouble with "giant" universes that do all the possible computations you would want, and then using the "free" complexity in the bridge rules to pick which of the computations you actually wanted. I am not sure if the DFA proposal gets around this kind of problem though it sounds like it would be pretty similar.

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-04-06T21:11:21.656Z · LW(p) · GW(p)

I'm kind of scared of this approach because I feel unless you really nail everything there is going to be a gap that an attacker can exploit.

I think that not every gap is exploitable. For most types of biases in the prior, it would only promote simulation hypotheses with baseline universes conformant to this bias, and attackers who evolved in such universes will also tend to share this bias, so they will target universes conformant to this bias and that would make them less competitive with the true hypothesis. In other words, most types of bias affect both $ϵ$ and $δ$ in a similar way.

More generally, I guess I'm more optimistic than you about solving all such philosophical liabilities.

I think of this in contrast with my approach based on epistemic competitiveness, where the idea is not necessarily to identify these considerations in advance, but to be epistemically competitive with an attacker (inside one of your hypotheses) who has noticed an improvement over your prior.

I don't understand the proposal. Is there a link I should read?

This is very similar to what I first thought about when going down this line. My instantiation runs into trouble with "giant" universes that do all the possible computations you would want, and then using the "free" complexity in the bridge rules to pick which of the computations you actually wanted.

So, you can let your physics be a dovetailing of all possible programs, and delegate to the bridge rule the task of filtering the outputs of only one program. But the bridge rule is not "free complexity" because it's not coming from a simplicity prior at all. For a program of length $n$ , you need a particular DFA of size $Ω (n)$ . However, the actual DFA is of expected size $m$ with $m ≫ n$ . The probability of having the DFA you need embedded in that is something like $\frac{m!}{(m - n)!} m^{- 2 n} \approx m^{- n} ≪ 2^{- n}$ . So moving everything to the bridge makes a much less likely hypothesis.
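A quick numerical check of the estimate above, using log-gamma to avoid overflow. The values `m = 10**6` and `n = 100` are illustrative assumptions, and `log2_embed_prob` is a hypothetical helper name.

```python
import math


def log2_embed_prob(m: int, n: int) -> float:
    """log2 of m!/(m-n)! * m^(-2n), the probability estimate from the comment."""
    log_falling_factorial = math.lgamma(m + 1) - math.lgamma(m - n + 1)
    return (log_falling_factorial - 2 * n * math.log(m)) / math.log(2)


m, n = 10**6, 100
exact = log2_embed_prob(m, n)      # log2 of the full expression
approx = -n * math.log2(m)         # log2 of m^(-n)
simplicity = -float(n)             # log2 of 2^(-n)
```

For $m ≫ n$ the falling factorial $\frac{m!}{(m-n)!}$ is close to $m^{n}$, which is why the expression collapses to about $m^{-n}$, far below the $2^{-n}$ that a simplicity prior would charge for the same program.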

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-03-20T18:43:33.000Z · LW(p) · GW(p)

Is $\frac{ϵ}{δ}$ bounded? I assign significant probability to it being $2^{100}$ or more, as mentioned in the other thread between me and Michael Cohen, in which case we'd have trouble.

I think that there are roughly two possibilities: either the laws of our universe happen to be strongly compressible when packed into a malign simulation hypothesis, or they don't. ~~In the latter case, $\frac{ϵ}{δ}$ shouldn't be large.~~ In the former case, it means that we are overwhelmingly likely to actually be inside a malign simulation. But, then AI risk is the least of our troubles. (In particular, because the simulation will probably be turned off once the attack-relevant part is over.)

[EDIT: I was wrong, see this [AF(p) · GW(p)].]

It feels like searching for a neural network is analogous to searching for a MAP estimate, and that more generally efficient algorithms are likely to just run one hypothesis most of the time rather than running them all.

Probably efficient algorithms are not running literally all hypotheses, but, they can probably consider multiple plausible hypotheses. In particular, the malign hypothesis itself is an efficient algorithm and it is somehow aware of the two different hypotheses (itself and the universe it's attacking). Currently I can only speculate about neural networks, but I do hope we'll have competitive algorithms amenable to theoretical analysis, whether they are neural networks or not.

As you mention, is it safe to wait and defer or are we likely to have a correlated failure in which all the aligned systems block simultaneously? (e.g. as described here)

I think that the problem you describe in the linked post can be delegated to the AI. That is, instead of controlling trillions of robots via counterfactual oversight, we will start with just one AI project that will research how to organize the world. This project would top any solution we can come up with ourselves.

Replies from: paulfchristiano

↑ comment by paulfchristiano · 2021-03-20T22:21:44.966Z · LW(p) · GW(p)

I think that there are roughly two possibilities: either the laws of our universe happen to be strongly compressible when packed into a malign simulation hypothesis, or they don't. In the latter case, $\frac{ϵ}{δ}$ shouldn't be large. In the former case, it means that we are overwhelmingly likely to actually be inside a malign simulation.

It seems like the simplest algorithm that makes good predictions and runs on your computer is going to involve e.g. reasoning about what aspects of reality are important to making good predictions and then attending to those. But that doesn't mean that I think reality probably works that way. So I don't see how to salvage this kind of argument.

But, then AI risk is the least of our troubles. (In particular, because the simulation will probably be turned off once the attack-relevant part is over.)

It seems to me like this requires a very strong match between the priors we write down and our real priors. I'm kind of skeptical about that a priori, but then in particular we can see lots of ways in which attackers will be exploiting failures in the prior we write down (e.g. failure to update on logical facts they observe during evolution, failure to make the proper anthropic update, and our lack of philosophical sophistication meaning that we write down some obviously "wrong" universal prior).

In particular, the malign hypothesis itself is an efficient algorithm and it is somehow aware of the two different hypotheses (itself and the universe it's attacking).

Do we have any idea how to write down such an algorithm though? Even granting that the malign hypothesis does so it's not clear how we would (short of being fully epistemically competitive); but moreover it's not clear to me the malign hypothesis faces a similar version of this problem since it's just thinking about a small list of hypotheses rather than trying to maintain a broad enough distribution to find all of them, and beyond that it may just be reasoning deductively about properties of the space of hypotheses rather than using a simple algorithm we can write down.

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-03-22T00:04:02.451Z · LW(p) · GW(p)

It seems like the simplest algorithm that makes good predictions and runs on your computer is going to involve e.g. reasoning about what aspects of reality are important to making good predictions and then attending to those. But that doesn't mean that I think reality probably works that way. So I don't see how to salvage this kind of argument.

I think it works differently. What you should get is an infra-Bayesian hypothesis which models only those parts of reality that can be modeled within the given computing resources. More generally, if you don't endorse the predictions of the prediction algorithm then either you are wrong or you should use a different prediction algorithm.

How can the laws of physics be extra-compressible within the context of a simulation hypothesis? More compression means more explanatory power. I think that it must look something like, we can use the simulation hypothesis to predict the values of some of the physical constants. But, it would require a very unlikely coincidence for physical constants to have such values unless we are actually in a simulation.

It seems to me like this requires a very strong match between the priors we write down and our real priors. I'm kind of skeptical about that a priori, but then in particular we can see lots of ways in which attackers will be exploiting failures in the prior we write down (e.g. failure to update on logical facts they observe during evolution, failure to make the proper anthropic update, and our lack of philosophical sophistication meaning that we write down some obviously "wrong" universal prior).

I agree that we won't have a perfect match but I think we can get a "good enough" match (similarly to how any two UTMs that are not too crazy give similar Solomonoff measures.) I think that infra-Bayesianism solves a lot of philosophical confusions, including anthropics and logical uncertainty, although some of the details still need to be worked out. (But, I'm not sure what specifically you mean by "logical facts they observe during evolution"?) Ofc this doesn't mean I am already able to fully specify the correct infra-prior: I think that would take us most of the way to AGI.

Do we have any idea how to write down such an algorithm though?

I have all sorts of ideas, but still nowhere near the solution ofc. We can do deep learning while randomizing initial conditions and/or adding some noise to gradient descent (e.g. simulated annealing), producing a population of networks that progresses in an evolutionary way. We can, for each prediction, train a model that produces the opposite prediction and compare it to the default model in terms of convergence time and/or weight magnitudes. We can search for the algorithm using meta-learning. We can do variational Bayes with a "multi-modal" model space: mixtures of some "base" type of model. We can do progressive refinement of infra-Bayesian hypotheses, s.t. the plausible hypotheses at any given moment are the leaves of some tree.

moreover it's not clear to me the malign hypothesis faces a similar version of this problem since it's just thinking about a small list of hypotheses rather than trying to maintain a broad enough distribution to find all of them

Well, we also don't have to find all of them: we just have to make sure we don't miss the true one. So, we need some kind of transitivity: if we find a hypothesis which itself finds another hypothesis (in some sense) then we also find the other hypothesis. I don't know how to prove such a principle, but it doesn't seem implausible that we can.

it may just be reasoning deductively about properties of the space of hypotheses rather than using a simple algorithm we can write down.

Why do you think "reasoning deductively" implies there is no simple algorithm? In fact, I think infra-Bayesian logic might be just the thing to combine deductive and inductive reasoning.

↑ comment by michaelcohen (cocoa) · 2021-02-26T15:15:31.772Z · LW(p) · GW(p)

This is very nice and short!

And to state what you left implicit:

If $p_{0} > p_{1} + ϵ$, then in the setting with no malign hypotheses (which you assume to be safe), 0 is definitely the output, since the malign models can only shift the outcome by $ϵ$, so we assume it is safe to output 0. And likewise with outputting 1.

I'm pretty sure removing those is mostly just a technical complication

One general worry I have about assuming that the deterministic case extends easily to the stochastic case is that a sequence of probabilities that tends to 0 can still have an infinite sum, which is not true when probabilities must be $\in \{0, 1\}$, and this sometimes causes trouble. I'm not sure this would raise any issues here--just registering a slightly differing intuition.
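A standard example of the gap, with $p_{t}$ standing for (say) the probability of an error or query at time $t$:

$$p_{t} = \frac{1}{t}: \quad \lim_{t \to \infty} p_{t} = 0 \quad \text{but} \quad \sum_{t=1}^{\infty} p_{t} = \infty,$$

whereas if every $p_{t} \in \{0, 1\}$, then $p_{t} \to 0$ forces $p_{t} = 0$ for all sufficiently large $t$, so $\sum_{t} p_{t}$ is finite.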

↑ comment by michaelcohen (cocoa) · 2021-02-23T14:06:43.291Z · LW(p) · GW(p)

It seems like you have the following tough case even if the human is deterministic:
There are N hypotheses about the human, of which one is correct and the others are bad.
I would like to make a series of N potentially-catastrophic decisions.
On decision k, all (N-1) bad hypotheses propose the same catastrophic prediction. We don't know k in advance.
For the other (N-1) decisions, (N-2) of the bad hypotheses make the correct prediction a*, and the final bad hypothesis claims that action a* would be catastrophic.
In this setting, it seems like I have no hope other than to query the human on all N decisions (since all days and hypotheses are symmetrical), so I assume that this is what your algorithm would do.

They're not quite symmetrical: midway through, some bad hypotheses will have been ruled out for making erroneous predictions about the demonstrator previously. But your conclusion is still correct. And

That strongly suggests that the number of queries to the human goes as 1 / p(correct demonstrator)

It's certainly no better than that, and the bound I prove is worse.

Usually we try to have a logarithmic dependence on p(correct demonstrator) but this doesn't seem possible here.

Yeah. Error bounds on full Bayesian are logarithmic in the prior weight on the truth, but error bounds on maximum a posteriori prediction are just inverse in the prior weight, and your example above is the one to show it. If each successive MAP model predicts wrong at each successive timestep, it could take N timesteps to get rid of N models, which is how many might begin with a prior weight exceeding the truth, if the truth has a prior weight of 1/N.
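A toy simulation of this contrast, as a sketch. It assumes deterministic binary models, uses a weighted-majority vote as a stand-in for full Bayesian prediction, and builds the adversarial case where bad model $i$ errs only at timestep $i$ and every bad model starts with more prior weight than the truth. The function names and the weight schedule are illustrative assumptions.

```python
from typing import Callable, List, Set

Model = Callable[[int], int]


def make_models(n_bad: int) -> List[Model]:
    # Bad model i predicts 1 only at timestep i; the last model is the truth (always 0).
    bad = [lambda t, i=i: 1 if t == i else 0 for i in range(n_bad)]
    return bad + [lambda t: 0]


def map_prediction(models: List[Model], weights: List[float], alive: Set[int], t: int) -> int:
    # Follow the single heaviest unfalsified model.
    best = max(alive, key=lambda i: weights[i])
    return models[best](t)


def bayes_prediction(models: List[Model], weights: List[float], alive: Set[int], t: int) -> int:
    # Weighted-majority vote over all unfalsified models.
    mass_one = sum(weights[i] for i in alive if models[i](t) == 1)
    mass_all = sum(weights[i] for i in alive)
    return 1 if mass_one > mass_all / 2 else 0


def count_errors(predict, n_bad: int) -> int:
    models = make_models(n_bad)
    # Unnormalized prior: bad models outweigh the truth, earlier ones are heavier.
    weights = [2.0 * n_bad - i for i in range(n_bad)] + [1.0]
    alive = set(range(n_bad + 1))
    errors = 0
    for t in range(n_bad):
        if predict(models, weights, alive, t) != 0:
            errors += 1
        # The true action (0) is observed each step, falsifying models that predicted 1.
        alive = {i for i in alive if models[i](t) == 0}
    return errors


map_errors = count_errors(map_prediction, n_bad=16)
bayes_errors = count_errors(bayes_prediction, n_bad=16)
```

With 16 bad models, MAP errs at every one of the 16 timesteps, while the majority vote errs only at the last one, when a single bad model holds most of the remaining weight.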

But, this situation seems pretty preposterous to me in the real world. If the agent's first observation is, say, this paragraph, the number of models with prior weight greater than the truth that predicted something else as the first observation, will probably be a number way, way different from one. I'd go so far as to say at least half of models with prior weight greater than the truth would predict a different observation than this very paragraph. As long as this situation keeps up, we're in a logarithmic regime.

I'm not convinced this logarithmic regime ever ends, but I think the case is more convincing that we at least start there, so let's suppose now that it eventually ends, and after this point the remaining models with posterior weight exceeding the truth are deliberately erring at unique timesteps. What's the posterior on the truth now? This is a phase where all the top models are "capable" of predicting correctly. This shouldn't look like at all. It will look more like p(correct model)/p(treachery).

And actually, depending on the composition of treacherous models, it could be better. To the extent some constant fraction of them are being treacherous at particular times, the logarithmic regime will continue. There are two reasons why I expect that to hold.

Since there will always be particularly good times to capitalize on treachery, a decent fraction of treacherous models should err at similar times.
The goals of a treacherous model will impact what to predict, but treacherous models with different goals must predict different false things if they are to have different effects. In your original discussion of treacherous models, you argue the strategy they would take is convergent (well-timed erroneous predictions), while the goals they would have are diverse (and thus more well-represented in a universal prior).

So more precisely, when the logarithmic phase ends, if it ever does, the posterior on the truth will be around: p(correct model)/p(treachery with the strategy of erring at a random moderately-important time rather than a time of extreme import).

If the number of queries required is linear in this (plus logarithmic in the prior), that doesn't seem very bad to me.

Which means I disagree with this:

There's a real question about p(treachery) / p(correct model), and if this ratio is bad then it seems like we're definitely screwed with this whole approach.

There is at most a linear cost to this ratio, which I don't think screws us.

Suppose that I want to bound the probability of catastrophe as $(1 + ϵ)$ times the demonstrator probability of catastrophe.

I think this is doable with this approach, but I haven't proven it can be done, let alone said anything about a dependence on epsilon. The closest bound I show not only has a constant factor of like 40; it depends on the prior on the truth too. I think (75% confidence) this is a weakness of the proof technique, not a weakness of the algorithm.

It seems like the number of human queries must scale at least like $1 / ϵ$ .

I doubt it's any better than that, but I haven't (just now) managed to convince myself that it couldn't be.

in the context of HCH we may need to push epsilon down to 1/N. But maybe there's some way to avoid that by...

I'm a little skeptical we could get away with having epsilon > O(1/N). But I don't quite follow the proposal well enough to say that I think it's infeasible.

Replies from: paulfchristiano, paulfchristiano

↑ comment by paulfchristiano · 2021-02-23T16:21:16.619Z · LW(p) · GW(p)

I understand that the practical bound is going to be logarithmic "for a while" but it seems like the theorem about runtime doesn't help as much if that's what we are (mostly) relying on, and there's some additional analysis we need to do. That seems worth formalizing if we want to have a theorem, since that's the step that we need to be correct.

There is at most a linear cost to this ratio, which I don't think screws us.

If our models are a trillion bits, then it doesn't seem that surprising to me if it takes 100 bits extra to specify an intended model relative to an effective treacherous model, and if you have a linear dependence that would be unworkable. In some sense it's actively surprising if the very shortest intended vs treacherous models have description lengths within 0.0000001% of each other unless you have a very strong skills vs values separation. Overall I feel much more agnostic about this than you are.

There are two reasons why I expect that to hold.

This doesn't seem like it works once you are selecting on competent treacherous models. Any competent model will err very rarely (with probability less than 1 / (feasible training time), probably much less). I don't think that (say) 99% of smart treacherous models would make this particular kind of incompetent error?

I'm not convinced this logarithmic regime ever ends,

It seems like it must end if there are any treacherous models (i.e. if there was any legitimate concern about inner alignment at all).

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-24T01:04:08.196Z · LW(p) · GW(p)

That seems worth formalizing if we want to have a theorem, since that's the step that we need to be correct.

I agree. Also, it might be very hard to formalize this in a way that's not: write out in math what I said in English, and name it Assumption 2.

I don't think that (say) 99% of smart treacherous models would make this particular kind of incompetent error?

I don't think incompetence is the only reason to try to pull off a treacherous turn at the same time that other models do. Some timesteps are just more important, so there's a trade off. And what's traded off is a public good: among treacherous models, it is a public good for the models' moments of treachery to be spread out. Spreading them out exhausts whatever data source is being called upon in moments of uncertainty. But for an individual model, it never reaps the benefit of helping to exhaust our resources. Given that treacherous models don't have an incentive to be inconvenient to us in this way, I don't think a failure to do so qualifies as incompetence. This is also my response to

It seems like [the logarithmic regime] must end if there are any treacherous models

Any competent model will err very rarely

Yeah, I've been assuming all treacherous models will err once. (Which is different from the RL case, where they can err repeatedly on off-policy predictions).

If our models are a trillion bits, then it doesn't seem that surprising to me if it takes 100 bits extra to specify an intended model relative to an effective treacherous model, and if you have a linear dependence that would be unworkable. In some sense it's actively surprising if the very shortest intended vs treacherous models have description lengths within 0.0000001% of each other unless you have a very strong skills vs values separation. Overall I feel much more agnostic about this than you are.

I can't remember exactly where we left this in our previous discussions on this topic. My point estimate for the number of extra bits to specify an intended model relative to an effective treacherous model is negative. The treacherous model has to compute the truth, and then also decide when and how to execute treachery. So the subroutine they run to compute the truth, considered as a model in its own right, seems to me like it must be simpler.

Replies from: paulfchristiano, evhub, ofer

↑ comment by paulfchristiano · 2021-02-24T04:15:33.271Z · LW(p) · GW(p)

I don't think incompetence is the only reason to try to pull off a treacherous turn at the same time that other models do. Some timesteps are just more important, so there's a trade off. And what's traded off is a public good: among treacherous models, it is a public good for the models' moments of treachery to be spread out.

Trying to defect at time T is only a good idea if it's plausible that your mechanism isn't going to notice the uncertainty at time T and then query the human. So it seems to me like this argument can never drive P(successful treachery) super low, or else it would be self-defeating (for competent treacherous agents).

The treacherous model has to compute the truth, and then also decide when and how to execute treachery. So the subroutine they run to compute the truth, considered as a model in its own right, seems to me like it must be simpler.

Subroutines of functions aren't always simpler. Even without treachery this seems like it's basically guaranteed. If "the truth" is just a simulation of a human, then the truth is a subroutine of the simplest model. But instead you could have a model of the form "Run program X, treat its output as a program, then run that." Since humans are compressible that will clearly be smaller.

Now you want to say that whatever reasoning a treacherous agent does in order to compress the human, you can just promote that to the outside model as well. But that argument seems to be precisely where the meat is (and it's the kind of thing I spend most of my time on). If that works then it seems like you don't even need a further solution to inner alignment, just use the simplest model.

(Maybe there is a hope that the intended model won't be "much" more complex than the treacherous models, without literally saying that it's the simplest, but at that point I'm back to wondering whether "much" is like 0.01% or 0.00000001%.)

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-24T11:40:54.099Z · LW(p) · GW(p)

So it seems to me like this argument can never drive P(successful treachery) super low, or else it would be self-defeating (for competent treacherous agents)

I don't follow. Can't races to the bottom destroy all value for the agents involved? Consider the claim: firms will never set prices to equal marginal cost because that would destroy all profit.

Now you want to say that whatever reasoning a treacherous agent does in order to compress the human, you can just promote that to the outside model as well. But that argument seems to be precisely where the meat is (and it's the kind of thing I spend most of my time on). If that works then it seems like you don't even need a further solution to inner alignment, just use the simplest model.

Yeah, okay. This is my mainline opinion, and I just think about inner alignment in case it's wrong. So for now we can assume that it is wrong, since there's not much more to say if it's right.

Maybe there is a hope that the intended model won't be "much" more complex than the treacherous models, without literally saying that it's the simplest, but at that point I'm back to wondering whether "much" is like 0.01% or 0.00000001%

That is my next guess, but I hadn't thought in terms of percentages before this. I had thought of normally distributed noise in the number of bits.

It's occurring to me that this question doesn't matter to our present discussion. What makes the linear regime linear rather than logarithmic is that if p(treachery)/p(honest model) is high, that allows for a large number of treacherous models to have greater posterior weight than the truth. But if a single model has n times the posterior weight of the truth, it still only takes one query to the demonstrator to interrupt its treacherous turn, not n queries.
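A minimal sketch of that last point, assuming a plain Bayesian update over just two models. The numbers (`n = 1000`, the likelihoods `0.5` and `1e-6`) and the helper name `bayes_update` are hypothetical and chosen only for illustration:

```python
from typing import List


def bayes_update(weights: List[float], likelihoods: List[float]) -> List[float]:
    """Posterior weights after one observed demonstrator action.

    likelihoods[i] is the probability model i assigned to that action.
    """
    unnormalized = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


n = 1000.0
# Index 0: the truth. Index 1: one treacherous model with n times its weight.
weights = [1.0 / (1.0 + n), n / (1.0 + n)]

# ASSUMPTION: at its treacherous turn, the treacherous model puts only 1e-6
# on the action the demonstrator actually takes, while the truth puts 0.5 on it.
posterior = bayes_update(weights, [0.5, 1e-6])

# Weight of the treacherous model relative to the truth after the single query:
# n * 1e-6 / 0.5, i.e. roughly 0.002 instead of 1000.
ratio_after = posterior[1] / posterior[0]
```

Under these assumptions a single observed demonstrator action is enough to push the treacherous model below the truth, whatever the size of `n`, as long as the likelihood ratio at the treacherous turn is more extreme than `n`.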

Replies from: paulfchristiano

↑ comment by paulfchristiano · 2021-02-24T16:04:07.815Z · LW(p) · GW(p)

I don't follow. Can't races to the bottom destroy all value for the agents involved?

You are saying that a special moment is a particularly great one to be treacherous. But if P(discovery) is 99.99% during that period, and there is any other treachery-possible period where P(discovery) is small, then that other period would have been better after all. Right?

This doesn't seem analogous to producers driving down profits to zero, because those firms had no other opportunity to make a profit with their machine. It's like you saying: there are tons of countries where firms could use their machines to make stuff and sell it at a profit (more countries than firms). But some of the places are more attractive than others, so probably everyone will sell in those places and drive profits to zero. And I'm saying: but then aren't those incredibly-congested countries actually worse places to sell? This scenario is only possible if firms are making so much stuff that they can drive profit down to zero in every country, since any country with remaining profit would necessarily be the best place to sell.

Yeah, okay. This is my mainline opinion, and I just think about inner alignment in case it's wrong

It seems to me like this is clearly wrong in the limit (since simple consequentialists would take over simple physics [LW · GW]). It also seems probably wrong to me for smaller models (i.e. without huge amounts of internal selection) but it would be good for someone to think about that case more seriously.

It's occurring to me that this question doesn't matter to our present discussion. What makes the linear regime linear rather than logarithmic is that if p(treachery)/p(honest model) is high, that allows for a large number of treacherous models to have greater posterior weight than the truth. But if a single model has n times the posterior weight of the truth, it still only takes one query to the demonstrator to interrupt its treacherous turn, not n queries.

If the treacherous model is 100 bits smaller, then it feels like there must be around 2^100 treacherous models that are all simpler than the intended one. If nothing else, it seems like you could literally add garbage bits to the treacherous models (or useful bits!).
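A rough count behind the "around 2^100" figure, as a sketch only. It assumes every way of padding the treacherous model with $k$ garbage bits counts as a distinct model, and it ignores any prefix-free coding constraints. The symbols $\ell_T$ and $\ell_H$ (description lengths of the shortest treacherous model and of the intended model) are introduced here for illustration:

$$
\ell_H = \ell_T + 100, \qquad
\#\{\text{paddings with } k \text{ garbage bits}\} = 2^{k}, \qquad
\sum_{k=0}^{99} 2^{k} = 2^{100} - 1 .
$$

So under these assumptions there are $2^{100} - 1$ padded variants that are strictly shorter than the intended model, plus another $2^{100}$ of exactly the same length.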

Replies from: cocoa, cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-24T22:19:02.317Z · LW(p) · GW(p)

If nothing else, it seems like you could literally add garbage bits to the treacherous models

Okay, sure.

It seems to me like this is clearly wrong in the limit (since simple consequentialists would take over simple physics).

It's not clear to me that there isn't meaningful overhead involved.

You are saying that a special moment is a particularly great one to be treacherous. But if P(discovery) is 99.99% during that period, and there is any other treachery-possible period where P(discovery) is small, then that other period would have been better after all. Right?

I agree with what you're saying, but I don't see how it contradicts what I was saying. First, what I had in mind when saying that some timesteps are better for treachery was that when the agent acts on a false prediction, it has a greater effect on the world, though of course P(discovery) is also relevant. But my point is that when multiple treacherous models pick the same timestep to err, there may be pros and cons to doing this, but one thing that isn't on the cons list is that, in the long run, it makes our lives easier if they do. Making our lives difficult is a public good for treacherous models.

So treacherous models won't be trying to avoid collisions in order to make queries be linear in p(treachery)/p(truth). If P(discovery) increases when multiple models are trying to be treacherous at the same time--which we could go on to discuss; it's not obvious to me either way as of yet--that will be balanced against the inherent variation in some timesteps being better for treachery than others.

Replies from: paulfchristiano

↑ comment by paulfchristiano · 2021-02-25T17:20:37.891Z · LW(p) · GW(p)

So treacherous models won't be trying to avoid collisions in order to make queries be linear in p(treachery)/p(truth). If P(discovery) increases when multiple models are trying to be treacherous at the same time--which we could go on to discuss; it's not obvious to me either way as of yet--that will be balanced against the inherent variation in some timesteps being better for treachery than others.

If you are willing to query only a billion points, and there are a billion competent treacherous models, and if the game is zero-sum between you and the models, then you're going to get at least as screwed as by the billionth-best opportunity for treachery. Otherwise, one of the treacherous models made a mistake: apparently there was a top-billion treachery opportunity where you didn't query, so they should have been able to get at least that much value (and you should have lost at least that much value).

That's not quite right since your querying behavior will be stochastic (at least from the model's epistemic position). But you get the same conclusion to within a constant factor: if there are 2 billion treacherous models, then out of the top 2 billion opportunities for treachery, there must be some where our subjective probability of querying is at most 1/2, and if the models are competent they will get at least as much value as they could by defecting at one of those times, i.e. at least 1/2 of the harms from the 2-billionth-best opportunity for treachery. So we will lose at least that much value.

(Those arguments aren't totally formal, we didn't even really discharge the "at least N treacherous models assumption" which you need to do in order to show that your decision about when to query isn't so influenced by any given model.)
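The averaging step in the constant-factor argument above can be written out as follows. This is a sketch under stated assumptions: $q_t$ (the subjective probability of a query at opportunity $t$), $B$ (a bound on the expected number of queries among the top $M$ opportunities), and $h_{(M)}$ (the harm from the $M$-th best opportunity) are symbols introduced here, and a defection is assumed to succeed whenever no query happens.

$$
\sum_{t=1}^{M} q_t \le B
\;\Longrightarrow\;
\min_{1 \le t \le M} q_t \le \frac{1}{M}\sum_{t=1}^{M} q_t \le \frac{B}{M},
\qquad
B = 10^{9},\; M = 2 \times 10^{9}
\;\Longrightarrow\;
\min_{t} q_t \le \tfrac{1}{2}.
$$

A competent model that defects at a minimizing opportunity $t^*$ then gets expected value at least $(1 - q_{t^*})\, h_{t^*} \ge \tfrac{1}{2}\, h_{(M)}$, which is the "at least 1/2 of the harms from the 2-billionth-best opportunity" figure.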

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-26T11:30:09.655Z · LW(p) · GW(p)

If you are willing to query only a billion points, and there are a billion competent treacherous models, and if the game is zero-sum between you and the models, then you're going to get at least as screwed as by the billionth-best opportunity for treachery.

The treacherous models make their decisions first, and whether we query at a timestep depends on whether they chose that moment to be treacherous. We don't have to say: this timestep is one of the billion best for treachery--guess we had better query. The decision process a treacherous model faces is:

1. From past observations, does it seem alpha is low enough that an honest model is in the top few?
2. If no, [let's just assume whatever happens here is bad]
3. If yes, is the honest model going to leave the top few any time?
4. If yes, wait. [let's assume this ends badly]
5. If no, admit defeat, because whenever you err, that will be when a query gets triggered. Salvage the situation by trying to get as much mileage as you can by decreasing probability on some action the demonstrator would take that you don't like much in order to shift probability mass to some other action they would take that you like a little better. OR
6. Hope that 1) other treacherous models, like soldiers on the front line storming the enemy and facing certain death so that their brothers behind them can win the battle, give erring a go anyway. And hope that 2) the operators quietly increase alpha after the billionth query. Acting on this hope, wait. (The front-line treacherous models are not entirely self-sacrificial; they might figure they might as well hope they were wrong about alpha. But the models know when queries happen, because every observation they get comes from a query, and it wouldn't take long to figure out that alpha really truly isn't high enough for them to execute treachery successfully).

Above in the thread, I was talking about the game theory of step 6 (although I maybe wasn't making the case as strongly as I could have). Which models would ever volunteer? They don't care about the long term goal of exhausting our query resources so other treacherous models can capitalize on a quietly increased alpha. All that is to say, I don't follow the comment I excerpted.

(In case this comes up later: I think about the long-run percentage of query-timesteps that we would tolerate, not the total number. So if we ever have cause to consider a case where attempted treachery "merely gets delayed" but still eventually happens, and still results in a query, then depending on how much it is delayed, that could still qualify as a win).

↑ comment by michaelcohen (cocoa) · 2021-02-24T22:38:56.937Z · LW(p) · GW(p)

This is a bit of a sidebar:

I'm curious what you make of the following argument.

When an infinite sequence is sampled from a true model $μ_{0}$, there is likely to be another treacherous model $μ_{1}$ which is likely to end up with greater posterior weight than an honest model $\hat{μ}_{0}$, and greater still than the posterior on the true model $μ_{0}$.
If the sequence were sampled from $μ_{1}$ instead, the eventual posterior weight on $μ_{1}$ will probably be at least as high.
When an infinite sequence is sampled from a true model $μ_{1}$, there is likely to be another treacherous model $μ_{2}$, which is likely to end up with greater posterior weight than an honest model $\hat{μ}_{1}$, and greater still than the posterior on the true model $μ_{1}$.
And so on.

Replies from: paulfchristiano

↑ comment by paulfchristiano · 2021-02-25T17:10:52.832Z · LW(p) · GW(p)

The first treacherous model works by replacing the bad simplicity prior with a better prior, and then using the better prior to more quickly infer the true model. No reason for the same thing to happen a second time.

(Well, I guess the argument works if you push out to longer and longer sequence lengths---a treacherous model will beat the true model on sequence lengths a billion, and then for sequence lengths a trillion a different treacherous model will win, and for sequence lengths a quadrillion a still different treacherous model will win. Before even thinking about the fact that each particular treacherous model will in fact defect at some point and at that point drop out of the posterior.)

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-26T10:44:04.437Z · LW(p) · GW(p)

Does it make sense to talk about $\check{μ}_{1}$, which is like $μ_{1}$ in being treacherous, but uses the true model $μ_{0}$ instead of the honest model $\hat{μ}_{0}$? I guess you would expect $\check{μ}_{1}$ to have a lower posterior than $μ_{0}$?

↑ comment by evhub · 2021-02-24T06:26:31.452Z · LW(p) · GW(p)

I don't think incompetence is the only reason to try to pull off a treacherous turn at the same time that other models do. Some timesteps are just more important, so there's a trade off. And what's traded off is a public good: among treacherous models, it is a public good for the models' moments of treachery to be spread out. Spreading them out exhausts whatever data source is being called upon in moments of uncertainty. But for an individual model, it never reaps the benefit of helping to exhaust our resources. Given that treacherous models don't have an incentive to be inconvenient to us in this way, I don't think a failure to do so qualifies as incompetence.

It seems to me like this argument is requiring the deceptive models not being able to coordinate with each other very effectively, which seems unlikely to me—why shouldn't we expect them to be able to solve these sorts of coordination problems?

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-24T11:52:28.344Z · LW(p) · GW(p)

They're one shot. They're not interacting causally. When any pair of models cooperates, a third model defecting benefits just as much from their cooperation with each other. And there will be defectors--we can write them down. So cooperating models would have to cooperate with defectors and cooperators indiscriminately, which I think is just a bad move according to any decision theory. All the LDT stuff I've seen on the topic is how to make one's own cooperation/defection depend logically on the cooperation/defection of others.

Replies from: evhub

↑ comment by evhub · 2021-02-24T20:01:14.535Z · LW(p) · GW(p)

Well, just like we can write down the defectors, we can also write down the cooperators—these sorts of models do exist, regardless of whether they're implementing a decision theory we would call reasonable. And in this situation, the cooperators should eventually outcompete the defectors, until the cooperators all defect—which, if the true model has a low enough prior, they could do only once they've pushed the true model out of the top .

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-24T22:54:07.947Z · LW(p) · GW(p)

Well, just like we can write down the defectors, we can also write down the cooperators

If it's only the case that we can write them down, but they're not likely to arise naturally as simple consequentialists taking over simple physics [LW · GW], then that extra description length will be seriously costly to them, and we won't need to worry about any role they might play in p(treacherous)/p(truth). Meanwhile, when I was saying we could write down some defectors, I wasn't making a simultaneous claim about their relative prior weight, only that their existence would spoil cooperation.

And in this situation, the cooperators should eventually outcompete the defector

For cooperators to outcompete defectors, they would have to be getting a larger share of the gains from cooperation than defectors do. If some people are waiting for the fruit on public trees to ripen before eating, and some people aren't, the defectors will be the ones eating the almost ripe fruit.

if the true model has a low enough prior, [treacherous models cooperating with each other in separating their treacherous turns] could [be treacherous] only once they've pushed the true model out of the top

I might be misunderstanding this statement. The inverse of the posterior on the truth is a supermartingale (doesn't go up in expectation), so I don't know what it could mean for the true model to get pushed out.
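For reference, a short derivation of that supermartingale property, as a sketch. The notation is introduced here and is not taken from the thread: $w_t$ is the posterior weight on the true model $μ$ after history $h_{<t}$, and $ξ_t$ is the Bayes-mixture prediction over all models.

$$
w_{t+1} = w_t \, \frac{μ(x_t \mid h_{<t})}{ξ_t(x_t)}
\quad\Longrightarrow\quad
\mathbb{E}_{x_t \sim μ}\!\left[\frac{1}{w_{t+1}}\right]
= \sum_{x \,:\, μ(x \mid h_{<t}) > 0} μ(x \mid h_{<t}) \, \frac{ξ_t(x)}{w_t \, μ(x \mid h_{<t})}
= \frac{1}{w_t} \sum_{x \,:\, μ(x \mid h_{<t}) > 0} ξ_t(x)
\;\le\; \frac{1}{w_t}.
$$

Because $1/w_t$ is a nonnegative supermartingale when data are sampled from $μ$, Ville's inequality gives $P\!\left(\inf_t w_t \le w_0 / c\right) \le 1/c$ for any $c > 0$: the posterior on the truth rarely falls far below its prior weight.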

Replies from: evhub

↑ comment by evhub · 2021-02-24T23:08:59.859Z · LW(p) · GW(p)

Here's the setup I'm imagining, but perhaps I'm still misunderstanding something. Suppose you have a bunch of deceptive models that choose to cooperate and have larger weight in the prior than the true model (I think you believe this is very unlikely, though I'm more ambivalent). Specifically, they cooperate in that they perfectly mimic the true model up until the point where the deceptive models make up enough of the posterior that the true model is no longer being consulted, at which point they all simultaneously defect. This allows for arbitrarily bad worst-case behavior.

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-24T23:50:55.732Z · LW(p) · GW(p)

Specifically, they cooperate in that they perfectly mimic the true model up until the point where...

This thread began by considering deceptive models cooperating with each other in the sense of separating the timing of their treacherous turns in order to be maximally annoying to us. So maybe our discussion on that topic is resolved, and we can move on to this scenario.

up until the point where the deceptive models make up enough of the posterior that the true model is no longer being consulted

If alpha is low enough, this won't ever happen, and if alpha is too high, it won't take very long. So I don't think this scenario is quite right.

Then the question becomes, for an alpha that's low enough, how long will it take until queries are infrequent, noting that you need a query any time any treacherous model with enough weight decides to take a treacherous turn?

↑ comment by Ofer (ofer) · 2021-02-26T14:39:44.898Z · LW(p) · GW(p)

To extend Evan's comment [LW(p) · GW(p)] about coordination between deceptive models: Even if the deceptive models lack relevant game theoretical mechanisms, they may still coordinate due to being (partially) aligned with each other. For example, a deceptive model X may prefer [some other random deceptive model seizing control] over [model X seizing control with probability 0.1% and the entire experiment being terminated with probability 99.9%].

Why should we assume that the deceptive models will be sufficiently misaligned with each other such that this will not be an issue? Do you have intuitions about the degree of misalignment between huge neural networks that were trained to imitate demonstrators but ended up being consequentialists that care about the state of our universe?

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-26T15:38:10.415Z · LW(p) · GW(p)

That's possible. But it seems like way less of a convergent instrumental goal for agents living in simulated world-models. Both options--our world optimized by us and our world optimized by a random deceptive model--probably contain very little of value as judged by agents in another random deceptive model.

So yeah, I would say some models would think like this, but I would expect the total weight on models that do to be much lower.

↑ comment by paulfchristiano · 2021-02-23T16:32:21.391Z · LW(p) · GW(p)

I think this is doable with this approach, but I haven't proven it can be done, let alone said anything about a dependence on epsilon. The closest bound I show not only has a constant factor of like 40; it depends on the prior on the truth too. I think (75% confidence) this is a weakness of the proof technique, not a weakness of the algorithm.

I just meant the dependence on epsilon, it seems like there are unavoidable additional factors (especially the linear dependence on p(treachery)). I guess it's not obvious if you can make these additive or if they are inherently multiplicative.

But your bound scales in some way, right? How much training data do I need to get the KL divergence between distributions over trajectories down to epsilon?

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-24T00:26:17.761Z · LW(p) · GW(p)

No matter how much data you have, my bound on the KL divergence won't approach zero.

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-02-25T19:52:33.297Z · LW(p) · GW(p)

Re 1: This is a good point. Some thoughts:

[EDIT: See this [LW(p) · GW(p)]]

We can add the assumption that the probability mass of malign hypotheses is small and that following any non-malign hypothesis at any given time is safe. Then we can probably get query complexity that scales with p(malign) / p(true)?
However, this is cheating because a sequence of actions can be dangerous even if each individual action came from a (different) harmless hypothesis. So, instead we want to assume something like: any dangerous strategy has high Kolmogorov complexity relative to the demonstrator. (More realistically, some kind of bounded Kolmogorov complexity.)
A related problem is: in reality "user takes action $a$" and "AI takes action $a$" are not physically identical events, and there might be an attack vector that exploits this.
If p(malign)/p(true) is really high that's a bigger problem, but then doesn't it mean we actually live in a malign simulation and everything is hopeless anyway?

Re 2: Why do you need such a strong bound?

comment by Rohin Shah (rohinmshah) · 2021-02-19T09:03:39.187Z · LW(p) · GW(p)

Planned summary:

Since we probably can’t specify a reward function by hand, one way to get an agent that does what we want is to have it imitate a human. As long as it does this faithfully, it is as safe as the human it is imitating. However, in a train-test paradigm, the resulting agent may faithfully imitate the human on the training distribution but fail catastrophically on the test distribution. (For example, a deceptive model might imitate faithfully until it has sufficient power to take over.) One solution is to never stop training, that is, use an online learning setup where the agent is constantly learning from the demonstrator.
There are a few details to iron out. The agent needs to reduce the frequency with which it queries the demonstrator (otherwise we might as well just have the demonstrator do the work). Crucially, we need to ensure that the agent will never do something that the demonstrator wouldn’t have done, because such an action could be arbitrarily bad.
This paper proposes a solution in the paradigm where we use Bayesian updating rather than gradient descent to select our model, that is, we have a prior over possible models and then when we see a demonstrator action we update our distribution appropriately. In this case, at every timestep we take the N most probable models, and only take an action a with probability p if **every** one of the N models takes that action with at least probability p. The total probability of all the actions will typically be less than 1 -- the remaining probability is assigned to querying the demonstrator.
The key property here is that as long as the true demonstrator is in the top N models, then the agent never autonomously takes an action with more probability than the demonstrator would. Therefore as long as we believe the demonstrator is safe then the agent should be as well. Since the agent learns more about the demonstrator every time it queries them, over time it needs to query the demonstrator less often. Note that the higher N is, the more likely it is that the true model is one of those N models (and thus we have more safety), but also the more likely it is that we will have to query the demonstrator. This tradeoff is controlled by a hyperparameter α that implicitly determines N.
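The rule summarized above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's construction: the function names, the `QUERY` sentinel, the toy numbers, and in particular the rule for turning α into a set of models (keep every model whose posterior weight is at least α times the largest one) are all hypothetical.

```python
from typing import Dict, List

QUERY = "<query demonstrator>"  # hypothetical sentinel for the "ask for help" outcome


def top_models(posterior: List[float], alpha: float) -> List[int]:
    """Indices of the models kept at this timestep.

    ASSUMPTION: keep every model whose posterior weight is at least
    alpha times the largest posterior weight. This is an illustrative
    rule, not necessarily the one used in the paper.
    """
    best = max(posterior)
    return [i for i, w in enumerate(posterior) if w >= alpha * best]


def conservative_policy(
    predictions: List[Dict[str, float]],
    posterior: List[float],
    alpha: float,
) -> Dict[str, float]:
    """Action distribution of the imitator, with leftover mass on QUERY."""
    kept = top_models(posterior, alpha)
    actions = set().union(*(predictions[i].keys() for i in kept))
    # Each action gets the smallest probability any kept model assigns it.
    policy = {
        a: min(predictions[i].get(a, 0.0) for i in kept) for a in sorted(actions)
    }
    # Whatever the kept models disagree about is spent on asking for help.
    policy[QUERY] = 1.0 - sum(policy.values())
    return policy


# Toy example: model 0 plays the demonstrator, model 1 is about to defect.
predictions = [
    {"left": 0.7, "right": 0.3},
    {"left": 0.1, "right": 0.3, "defect": 0.6},
]
posterior = [0.4, 0.6]
policy = conservative_policy(predictions, posterior, alpha=0.5)
```

In the toy example both models are kept, so "defect" gets probability 0 because model 0 never takes it, and the mass the two models disagree about goes to querying. As long as the true model is among the kept ones, no action receives more probability than the true model gives it.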

Opinion:

One of the most important approaches to improve inner alignment is to monitor the performance of your system online, and train to correct any problems. This paper shows the benefit of explicitly quantified, well-specified uncertainty: it allows you to detect problems _before they happen_ and then correct for them.
This setting has also been studied in <@delegative RL@>(@Delegative Reinforcement Learning@), though there the agent also has access to a reward signal in addition to a demonstrator.

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-19T11:38:09.746Z · LW(p) · GW(p)

In this case, at every timestep we take the N most probable models, and only take an action a with probability p if **every** one of the N models takes that action with at least probability p.

This is so much clearer than I've ever put it.

(There’s a specific rule that ensures that N decreases over time.)

N won't necessarily decrease over time, but all of the models will eventually agree with each other.

monitor the performance of your system online, and train to correct any problems

I would have described Vanessa's and my approaches as more about monitoring uncertainty, and avoiding problems before the fact rather than correcting them afterward. But I think what you said stands too.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2021-02-19T17:59:39.508Z · LW(p) · GW(p)

N won't necessarily decrease over time, but all of the models will eventually agree with each other.

Ah, right. I rewrote that paragraph, getting rid of that sentence and instead talking about the tradeoff directly.

I would have described Vanessa's and my approaches as more about monitoring uncertainty, and avoiding problems before the fact rather than correcting them afterward. But I think what you said stands too.

Added a sentence to the opinion noting the benefits of explicitly quantified uncertainty.

comment by evhub · 2021-02-18T21:20:11.241Z · LW(p) · GW(p)

Edit: I no longer endorse this comment; see this comment [AF(p) · GW(p)] instead.

I think you've just assumed the entire inner alignment problem away. You assume that the model—the imitation learner—is a perfect Bayesian, rather than just some trained neural network or other function approximator (edit: I was just misinterpreting here and the training process is supposed to be Bayesian rather than the model, see Rohin's comment below). The entire point of the inner alignment problem, though, is that you can't guarantee that your trained model is actually a perfect Bayesian—or anything else. In practice, all we actually generally do when we do ML is we train some neural network on some training data/environment and hope that the implicit inductive biases of our training process are such that we end up with a model doing the right thing, meaning that we can't really control what sort of model—Bayesian or anything else—that we get when we do ML, and that uncontrollability, in my eyes, is the inner alignment problem.

Replies from: rohinmshah, vanessa-kosoy

↑ comment by Rohin Shah (rohinmshah) · 2021-02-19T01:51:44.871Z · LW(p) · GW(p)

While I share your position that this mostly isn't addressing the things that make inner alignment hard / risky in practice, I agree with Vanessa that this does not assume the inner alignment problem away, unless you have a particularly contorted definition of "inner alignment".

There's an optimization procedure (Bayesian updating) that is selecting models (the model of the demonstrator) that can themselves be optimizers, and you could get the wrong one (e.g. the model that simulates an alien civilization that realizes it's in a simulation and predicts well to be selected by the Bayesian updating but eventually executes a treacherous turn). The algorithm presented precludes this from happening with some probability. We can debate the significance, but it seems to me like it is clearly doing something solution-like with respect to the inner alignment problem.

Replies from: TurnTrout, evhub

↑ comment by TurnTrout · 2021-02-19T02:40:47.190Z · LW(p) · GW(p)

civilization that realizes it's in a civilization

"in a simulation", no?

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2021-02-19T04:32:49.050Z · LW(p) · GW(p)

Lol yes fixed

↑ comment by evhub · 2021-02-19T05:15:13.939Z · LW(p) · GW(p)

Hmmm... I think I just misunderstood the setup then. It seemed to me like the Bayesian updating was supposed to represent the model rather than the training process, but under the framing that you just said, I agree that it's relevant. It's worth noting that I do think most of the danger lies in the ways in which gradient descent is not a perfect Bayesian, but I agree that modeling gradient descent as Bayesian is certainly relevant.

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-02-18T23:53:55.000Z · LW(p) · GW(p)

I think this is completely unfair. The inner alignment problem exists even for perfect Bayesians, and solving it in that setting contributes much to our understanding. The fact that we don't have satisfactory mathematical models of deep learning performance is a different problem, which is broader than inner alignment and to first approximation orthogonal to it. Ideally, we will solve this second problem by improving our mathematical understanding of deep learning and/or other competitive ML algorithms. The latter effort is already underway [LW(p) · GW(p)] by researchers unrelated to AI safety, with some results. Moreover, we can in principle come up with heuristics for how to apply this method of solving inner alignment (which I call "confidence thresholds" in my own work) to deep learning: e.g. use NNGP to measure confidence, or use an evolutionary algorithm with a population of networks and check how well they agree with each other. Of course if we do this we won't have formal guarantees that it will work, but, like I said, this is a broader issue than inner alignment.

Replies from: evhub

↑ comment by evhub · 2021-02-19T00:10:29.211Z · LW(p) · GW(p)

Regardless of how you define inner alignment, I think that the vast majority of the existential risk comes from that “broader issue” that you're pointing to of not being able to get worst-case guarantees due to using deep learning or evolutionary search or whatever. That leads me to want to define inner alignment to be about that problem—and I think that is basically the definition we give in Risks from Learned Optimization [? · GW], where we introduced the term. That being said, I do completely agree that getting a better understanding of deep learning is likely to be critical.

Replies from: johnswentworth, cocoa

↑ comment by johnswentworth · 2021-02-19T01:22:45.021Z · LW(p) · GW(p)

I think that the vast majority of the existential risk comes from that “broader issue” that you're pointing to of not being able to get worst-case guarantees due to using deep learning or evolutionary search or whatever. That leads me to want to define inner alignment to be about that problem...

[Emphasis added.] I think this is a common and serious mistake-pattern, and in particular is one of the more common underlying causes of framing errors. The pattern is roughly:

Notice cluster of problems X which have a similar underlying causal pattern Cause(X)
Notice problem y in which Cause(X) could plausibly play a role
On deeper examination, the cause of y cause(y) doesn't quite fit Cause(X)
Attempt to redefine the pattern Cause(X) to include cause(y)

The problem is that, in trying to "shoehorn" cause(y) into the category Cause(X), we miss the opportunity to notice a different pattern, which is more directly useful in understanding y as well as some other cluster of problems related to y.

A concrete example: this is the same mistake I accused Zvi of making when trying to cast moral mazes as a problem of super-perfect competition [LW(p) · GW(p)]. The conditions needed for super-perfect competition to explain moral mazes did not hold, and by trying to shoehorn the problem into that mold Zvi was missing an orthogonal phenomenon which is extremely interesting in its own right: thinking about that exact problem was what led to Demons in Imperfect Search [LW · GW].

Now, this is not to say that changing a definition to fit another case is always the wrong move. Sometimes, a new use-case shows that the definition can handle the new case while still preserving its original essence. The key question is whether the problem cluster X and problem y really do have the same underlying structure, or if there's something genuinely new and different going on in y.

In this case, I think it's pretty clear that there is more than just inner alignment problems going on in the lack of worst-case guarantees for deep learning/evolutionary search/etc. Generalization failure is not just about, or even primarily about, inner agents. It occurs even in the absence of mesa-optimizers. So defining inner alignment to be about that problem looks to me like a mistake - you're likely to miss important, conceptually-distinct phenomena by making that move. (We could also come at it from the converse direction: if something clearly recognizable as an inner alignment problem occurs for ideal Bayesians, then redefining the inner alignment problem to be "we can't control what sort of model we get when we do ML" is probably a mistake, and you're likely to miss interesting phenomena that way which don't conceptually resemble inner alignment.)

A useful knee-jerk reaction here is to notice when cause(y) doesn't quite fit the pattern Cause(X), and use that as a curiosity-pump to look for other cases which resemble y. That's the sort of instinct which will tend to turn up insights we didn't know we were missing.

Replies from: evhub

↑ comment by evhub · 2021-02-19T05:21:28.009Z · LW(p) · GW(p)

I mean, I don't think I'm “redefining” inner alignment, given that I don't think I've ever really changed my definition and I was the one that originally came up with the term (inner alignment was due to me, mesa-optimization was due to Chris van Merwijk). I also certainly agree that there are “more than just inner alignment problems going on in the lack of worst-case guarantees for deep learning/evolutionary search/etc.”—I think that's exactly the point that I'm making, which is that while there are other issues, inner alignment is what I'm most concerned about. That being said, I also think I was just misunderstanding the setup in the paper—see Rohin's comment on this chain.

↑ comment by michaelcohen (cocoa) · 2021-02-19T12:01:12.952Z · LW(p) · GW(p)

If the inner alignment problem did not exist for perfect Bayesians, but did exist for neural networks, then it would appear to be a regime where more intelligence makes the problem go away. If the inner alignment problem were ~solved for perfect Bayesians, but unsolved for neural networks, I think there's still some of the flavor of that regime, but we do have to be pretty careful to make sure we're applying the same sort of solution to the non-Bayesian algorithms. I think in Vanessa's comment above, she's suggesting this looks doable.

Note the method here of avoiding mesa-optimizers: error bounds. Neural networks don't have those. Naturally, one way to make mesa-optimizer-deceptively-selected-errors go away is just to have better learning algorithms that make errors go away. Algorithms like Gated Linear Networks with proper error bounds may be a safer building block for AGI. But none of this takes away from the fact that it is potentially important to figure out how to avoid mesa-optimization in neural networks, and I would add to your claim that this is a much harder setting; I would say it's a harder setting because of the non-existence of error bounds.

Replies from: evhub

↑ comment by evhub · 2021-02-23T22:55:57.365Z · LW(p) · GW(p)

I think I mostly agree with what you're saying here, though I have a couple of comments—and now that I've read and understand your paper more thoroughly, I'm definitely a lot more excited about this research.

then it would appear to be a regime where more intelligence makes the problem go away.

I don't think that's right—if you're modeling the training process as Bayesian, as I now understand, then the issue is that what makes the problem go away isn't more intelligent models, but less efficient training processes. Even if we have arbitrary compute, I think we're unlikely to use training processes that look Bayesian just because true Bayesianism is really hard to compute such that for basically any amount of computation that you have available to you, you'd rather run some more efficient algorithm like SGD.

I worry about these sorts of Bayesian analyses where we're assuming that we're training a large population of models, one of which is assumed to be an accurate model of the world, since I expect us to end up using some sort of local search instead of a global search like that—just because I think that local search is way more efficient than any sort of Bayesianish global search.

Note the method here of avoiding mesa-optimizers: error bounds. Neural networks don't have those. Naturally, one way to make mesa-optimizer-deceptively-selected-errors go away is just to have better learning algorithms that make errors go away. Algorithms like Gated Linear Networks with proper error bounds may be a safer building block for AGI. But none of this takes away from the fact that it is potentially important to figure out how to avoid mesa-optimization in neural networks, and I would add to your claim that this is a much harder setting; I would say it's a harder setting because of the non-existence of error bounds.

I definitely agree with all of this, except perhaps on whether Gated Linear Networks are likely to help (though I don't claim to actually understand backpropagation-free networks all that well).

Replies from: vanessa-kosoy, cocoa

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-02-25T20:01:12.939Z · LW(p) · GW(p)

I don't think that's right—if you're modeling the training process as Bayesian, as I now understand, then the issue is that what makes the problem go away isn't more intelligent models, but less efficient training processes. Even if we have arbitrary compute, I think we're unlikely to use training processes that look Bayesian just because true Bayesianism is really hard to compute such that for basically any amount of computation that you have available to you, you'd rather run some more efficient algorithm like SGD.

I feel this is a wrong way to look at it. I expect any effective learning algorithm to be an approximation of Bayesianism in the sense that, it satisfies some good sample complexity bound w.r.t. some sufficiently rich prior. Ofc it's non-trivial to (i) prove such a bound for a given algorithm (ii) modify said algorithm using confidence thresholds in a way that leads to a safety guarantee. However, there is no sharp dichotomy between "Bayesianism" and "SGD" such that this approach obviously doesn't apply to the latter, or to something competitive with the latter.

Replies from: evhub

↑ comment by evhub · 2021-02-25T21:27:14.996Z · LW(p) · GW(p)

I agree that at some level SGD has to be doing something approximately Bayesian. But that doesn't necessarily imply that you'll be able to get any nice, Bayesian-like properties from it such as error bounds. For example, if you think of SGD as effectively just taking the MAP model starting from some sort of simplicity prior, it seems very difficult to turn that into something like the top posterior models, as would be required for an algorithm like this.

Replies from: vanessa-kosoy, vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-02-26T12:19:10.342Z · LW(p) · GW(p)

I mean, there's obviously a lot more work to do, but this is progress. Specifically if SGD is MAP then it seems plausible that e.g. SGD + random initial conditions or simulated annealing would give you something like top N posterior models. You can also extract confidence from NNGP.
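To make the "top N models plus confidence" idea concrete, here is a minimal sketch, assuming we already have predictions from several independently trained networks (e.g. different random seeds). The function names, the use of total variation distance, and the threshold are all illustrative assumptions, not something taken from the paper or from NNGP.

```python
from typing import List, Optional, Sequence


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def predict_or_abstain(
    predictions: List[Sequence[float]], threshold: float
) -> Optional[List[float]]:
    """Average the ensemble's predictions, or abstain (return None) on disagreement.

    `predictions` holds one predicted distribution per ensemble member.
    `threshold` is a hypothetical confidence parameter: the largest
    pairwise disagreement we tolerate before asking for help.
    """
    worst = max(
        (
            total_variation(p, q)
            for i, p in enumerate(predictions)
            for q in predictions[i + 1:]
        ),
        default=0.0,  # a single-member ensemble never disagrees with itself
    )
    if worst > threshold:
        return None  # members disagree too much: query instead of predicting
    n = len(predictions)
    return [sum(column) / n for column in zip(*predictions)]
```

This only helps to the extent that the ensemble members actually differ where it matters; if every member makes the same mistake, the disagreement check never fires.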

Replies from: evhub

↑ comment by evhub · 2021-02-26T20:29:53.077Z · LW(p) · GW(p)

I agree that this is progress (now that I understand it better), though:

if SGD is MAP then it seems plausible that e.g. SGD + random initial conditions or simulated annealing would give you something like top N posterior models

I think there is strong evidence that the behavior of models trained via the same basic training process is likely to be highly correlated. This sort of correlation is related to low variance in the bias-variance tradeoff sense, and there is evidence that not only do massive neural networks tend to have pretty low variance, but that this variance is likely to continue to decrease as networks become larger.

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-02-26T20:54:34.530Z · LW(p) · GW(p)

Hmm, added to reading list, thank you.

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-02-27T21:24:27.239Z · LW(p) · GW(p)

Here's another way you can try implementing this approach with deep learning. Train the predictor using meta-learning on synthetically generated environments (sampled from some reasonable prior such as bounded Solomonoff or maybe ANNs with random weights). The reward for making a prediction is , where $p_{i}$ is the predicted probability of outcome $i$, $q_{i}$ is the true probability of outcome $i$, and $\epsilon, \delta \in (0, 1)$ are parameters. The reward for making no prediction (i.e. querying the user) is $0$.

This particular proposal is probably not quite right, but something in that general direction might work.

Replies from: evhub

↑ comment by evhub · 2021-02-27T22:56:17.680Z · LW(p) · GW(p)

Sure, but you have no guarantee that the model you learn is actually going to be optimizing anything like that reward function—that's the whole point of the inner alignment problem. What's nice about the approach in the original paper is that it keeps a bunch of different models around, keeps track of their posterior, and only acts on consensus, ensuring that the true model always has to approve. But if you just train a single model on some reward function like that with deep learning, you get no such guarantees.
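The consensus mechanism described here can be sketched roughly as follows. This is a simplified illustration, not the paper's exact construction: the rule for which models count as "top" (posterior weight at least `alpha` times the largest weight) and all names are assumptions made for the example. The key property it shows is that every action gets the minimum probability any kept model assigns to it, and the leftover probability mass goes to querying the demonstrator.

```python
from typing import Dict, List, Tuple

Action = str
Model = Dict[Action, float]  # one model's action distribution at the current step


def conservative_policy(
    models: List[Model], posterior: List[float], alpha: float
) -> Tuple[Dict[Action, float], float]:
    """Return (underestimated action probabilities, probability of querying).

    `alpha` in (0, 1] is a hypothetical knob: smaller values keep more
    models in the consensus set and so lead to more querying.
    """
    top = max(posterior)
    # Keep every model whose posterior weight is within a factor alpha of the best.
    kept = [m for m, w in zip(models, posterior) if w >= alpha * top]
    actions = set().union(*(m.keys() for m in kept))
    # Underestimate: an action only gets the probability ALL kept models grant it.
    act_probs = {a: min(m.get(a, 0.0) for m in kept) for a in actions}
    # Whatever mass the models disagree about is spent on asking for help.
    query_prob = 1.0 - sum(act_probs.values())
    return act_probs, query_prob
```

As long as the true demonstrator model stays in the kept set, no action is ever taken with more probability than the demonstrator would have given it, which is the sense in which "the true model always has to approve."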

Replies from: vanessa-kosoy, vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-02-28T11:23:26.440Z · LW(p) · GW(p)

Right, but you can look at the performance of your model in training, compare it to the theoretical optimum (and to the baseline of making no predictions at all) and get lots of evidence about safety from that. You can even add some adversarial training of the synthetic environment in order to get tighter bounds. If on the vast majority of synthetic environments your model makes virtually no mispredictions, then, under the realizability assumption, it is very unlikely to make mispredictions in deployment. Ofc the realizability assumption should also be questioned: but that's true in the OP as well, so it's not a difference between Bayesianism and deep learning.

Replies from: evhub, cocoa

↑ comment by evhub · 2021-02-28T20:21:41.052Z · LW(p) · GW(p)

Right, but you can look at the performance of your model in training, compare it to the theoretical optimum (and to the baseline of making no predictions at all) and get lots of evidence about safety from that. You can even add some adversarial training of the synthetic environment in order to get tighter bounds.

Note that adversarial training doesn't work on deceptive models due to the RSA-2048 problem; also see more detail here [AF · GW].

If on the vast majority of synthetic environments your model makes virtually no mispredictions, then, under the realizability assumption, it is very unlikely to make mispredictions in deployment.

I think realizability basically doesn't help here—as long as there's a deceptive model which is easier to find according to your inductive biases, the fact that somewhere in the model space there exists a correct model, but not one that your local search process finds by default, is cold comfort.

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-02-28T21:12:54.895Z · LW(p) · GW(p)

I think that adversarial training working so well that it can find exponentially rare failures is an unnecessarily strong desideratum. We need to drive the probability of catastrophic failures to something very low, but not actually zero. If a system is going to work for time $t$ during deployment, then running it in $n$ different synthetic environments for time $t$ during training is enough to drive the probability of failure down to $O(\frac{1}{n})$. Now, this is prohibitively costly, but it's not exponentially prohibitively costly. And this doesn't use adversarial training at all! So, while it's not obvious how to get a good solution, it seems at least plausible that this can lead to a solution.
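One way to see where a $\frac{1}{n}$ rate can come from is the following back-of-the-envelope calculation. It is a sketch under a simplifying assumption that is not stated in the comment: the model is fixed, and it fails in each environment independently with some unknown probability $p$, with training and deployment environments drawn from the same distribution.

```latex
% Probability that the model passes all n training environments
% and then fails in the (n+1)-th, deployment, environment:
\Pr[\text{pass all } n \text{ tests, then fail}] = p\,(1-p)^{n}.

% Maximizing over the unknown p: the derivative vanishes at p = 1/(n+1), so
\max_{p \in [0,1]} \; p\,(1-p)^{n}
  = \frac{1}{n+1}\left(\frac{n}{n+1}\right)^{n}
  \le \frac{1}{n+1}
  = O\!\left(\frac{1}{n}\right).
```

So whatever $p$ is, the joint event "looked safe in all $n$ tests, then failed in deployment" has probability at most $\frac{1}{n+1}$. The independence assumption is doing real work here: the bound covers failures that occur at random, not failures triggered by something that only shows up in deployment.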

Replies from: evhub

↑ comment by evhub · 2021-02-28T22:05:05.147Z · LW(p) · GW(p)

I agree that random defection can potentially be worked around—but the RSA-2048 problem is about conditional defection, which can't be dealt with in the same way. More generally, I expect it to be extremely difficult if not impossible to prevent a model that you want to be able to operate in the real world from being able to determine at what point in training/deployment it's in.
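The difference between random and conditional defection can be illustrated with a toy policy. This is a hypothetical sketch: the modulus `N` is a small stand-in for RSA-2048, and the "observation" is just an integer, so nothing here reflects how a real model would represent the trigger.

```python
import random

# Small hypothetical stand-in for RSA-2048: a product of two large-ish factors.
N = 1000003 * 1000033


def policy(observation: int) -> str:
    """Cooperate unless the observation is a nontrivial factor of N."""
    if 1 < observation < N and N % observation == 0:
        return "defect"
    return "cooperate"


def count_defections(num_tests: int, rng: random.Random) -> int:
    """Count defections on randomly sampled observations."""
    return sum(policy(rng.randrange(2, N)) == "defect" for _ in range(num_tests))
```

Random testing is overwhelmingly likely to report zero defections, because a uniformly sampled observation almost never happens to divide `N`, yet `policy(1000003)` defects immediately. The failure is keyed to an input that the training distribution essentially never produces, so more samples from that distribution do not help.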

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-03-01T13:38:34.005Z · LW(p) · GW(p)

I expect it to be extremely difficult if not impossible to prevent a model that you want to be able to operate in the real world from being able to determine at what point in training/deployment it's in.

Why do you expect it? During training, it finds itself in a random universe. During deployment, it finds itself in another universe drawn from the same prior (the realizability assumption). How would it determine the difference?

Replies from: evhub

↑ comment by evhub · 2021-03-01T21:37:03.431Z · LW(p) · GW(p)

Because that's never what machine learning actually looks like in practice—essentially any access to text from the internet (let alone actual ability to interface with the world, both of which seem necessary for competitive systems) will let it determine things like the year, whether RSA-2048 has been factored, or other various pieces of information that are relevant to what stage in training/testing/deployment it's in, how powerful the feedback mechanisms keeping it in check are, whether other models are defecting, etc. that can let it figure out when to defect.

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-03-02T00:09:41.851Z · LW(p) · GW(p)

Hmm, maybe you misunderstood my proposal? I suggested training the model by meta-learning on purely synthetic data, sampled from some kind of simplicity prior, without any interface to the world. Maybe you just think this wouldn't be competitive? If so, why? Is the argument just that there are no existing systems like this? But then it's weak evidence at best. On the contrary, even from a purely capability standpoint, meta-learning with synthetic data might be a promising strategy to lower deep learning's sample complexity.

Here's why I think it will be competitive:

When a hypothetical competitive prediction algorithm A produces a malign prediction, the knowledge that it is malign clearly exists somewhere inside A in some form: otherwise it wouldn't come up with the prediction. In particular, the knowledge that there are multiple plausible hypotheses consistent with the data also exists somewhere inside A.
Therefore, there must exist some competitive algorithm B that would be able to use the knowledge of this ambiguity to abstain from predicting in such cases.
There is no reason why B should be tailored to fine details of our physical world: it can be quite generic (as all deep learning algorithms).
If we have an ML algorithm C that is powerful enough to produce superintelligence, then it is likely powerful enough to come up with B. Since B is generic, the task of finding B doesn't require any real-world data, and can be accomplished by meta-learning on synthetic data like I suggested.

Replies from: evhub

↑ comment by evhub · 2021-03-02T01:42:46.794Z · LW(p) · GW(p)

This seems very sketchy to me. If we let A = SGD or A = evolution, your first claim becomes “if SGD/evolution finds a malign model, it must understand that it's malign on some level,” which seems just straightforwardly incorrect. The last claim also seems pretty likely to be wrong if you let C = SGD or C = evolution.

Moreover, it definitely seems like training on data sampled from a simplicity prior (if that's even possible—it should be uncomputable in general) is unlikely to help at all. I think there's essentially no realistic way that training on synthetic data like that will be sufficient to produce a model which is capable of accomplishing things in the world. At best, that sort of approach might give you better inductive biases in terms of incentivizing the right sort of behavior, but in practice I expect any impact there to basically just be swamped by the impact of fine-tuning on real-world data.

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-03-02T11:46:43.214Z · LW(p) · GW(p)

If we let A = SGD or A = evolution, your first claim becomes “if SGD/evolution finds a malign model, it must understand that it's malign on some level,” which seems just straightforwardly incorrect.

To me it seems straightforwardly correct! Suppose you're running evolution in order to predict a sequence. You end up evolving a mind M that is a superintelligent malign consequentialist: it makes good predictions on purpose, in order to survive, and then produces a malign false prediction at a critical moment. So, M is part of the state of your algorithm A. All of M's knowledge is also part of the state. M knows that M is malign, M knows the prediction it's making this time is false. In this case, B can be whatever algorithm M uses to form its beliefs + confidence threshold (it's strategy stealing.)

The last claim also seems pretty likely to be wrong if you let C = SGD or C = evolution.

Why? If you give evolution enough time, and the fitness criterion is good (as you apparently agreed earlier), then eventually it will find B.

Moreover, it definitely seems like training on data sampled from a simplicity prior (if that's even possible—it should be uncomputable in general) is unlikely to help at all.

First, obviously we use a bounded simplicity prior, like I said in the beginning of this thread. It can be something like weighting programs by $2^{-\text{length}}$ while constraining their amount of computational resources, or something like an ANN with random weights (the latter is more speculative, but given that we know ANNs have an inductive bias toward simplicity, an untrained ANN probably reflects that.)
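As a rough illustration of the first option, here is a minimal sketch of sampling from a length-bounded prior over bitstring programs in which each program's weight is proportional to $2^{-\text{length}}$. The names `prior_weight` and `sample_program` and the `max_len` cap are assumptions for the example; the resource constraint on running the sampled programs is not shown.

```python
import random


def prior_weight(program: str, max_len: int) -> float:
    """Normalized prior weight of a bitstring program: 2^(-length) / max_len.

    There are 2^L programs of length L, so each length class has total
    weight 1 before normalization; dividing by max_len makes the weights
    of all programs of length 1..max_len sum to 1.
    """
    if not 1 <= len(program) <= max_len:
        return 0.0
    return 2.0 ** (-len(program)) / max_len


def sample_program(max_len: int, rng: random.Random) -> str:
    """Draw a program from that prior.

    Picking the length uniformly and then the bits uniformly gives each
    program of length L probability (1 / max_len) * 2^(-L).
    """
    length = rng.randint(1, max_len)
    return "".join(rng.choice("01") for _ in range(length))
```

In a full version, each sampled program would be executed under a step or memory budget to generate a synthetic environment, and programs exceeding the budget would be cut off or discarded.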

Second, why? Suppose that your starting algorithm is good at finding results when a lot of data is available, but is also very data inefficient (like deep learning seems to be). Then, by providing it with a lot of synthetic data, you leverage its strength to find a new algorithm which is data efficient. Unless you believe deep learning is already maximally data efficient (which seems very dubious to me)?

Replies from: evhub

↑ comment by evhub · 2021-03-02T20:52:06.523Z · LW(p) · GW(p)

B can be whatever algorithm M uses to form its beliefs + confidence threshold (it's strategy stealing.)

Sure, but then I think B is likely to be significantly more complex and harder for a local search process to find than A.

Why? If you give evolution enough time, and the fitness criterion is good (as you apparently agreed earlier), then eventually it will find B.

I definitely don't think this, unless you have a very strong (and likely unachievable imo) definition of “good.”

Second, why? Suppose that your starting algorithm is good at finding results when a lot of data is available, but is also very data inefficient (like deep learning seems to be). Then, by providing it with a lot of synthetic data, you leverage its strength to find a new algorithm which is data efficient. Unless you believe deep learning is already maximally data efficient (which seems very dubious to me)?

I guess I'm skeptical that you can do all that much in the fully generic setting of just trying to predict a simplicity prior. For example, if we look at actually successful current architectures, like CNNs or transformers, they're designed to work well on specific types of data and relationships that are common in our world—but not necessarily at all just in a general simplicity prior.

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-03-02T22:51:10.048Z · LW(p) · GW(p)

Sure, but then I think B is likely to be significantly more complex and harder for a local search process to find than A.

$A$ is sufficiently powerful to select $M$ which contains the complex part of $B$. It seems rather implausible that an algorithm of the same power cannot select $B$.

if we look at actually successful current architectures, like CNNs or transformers, they're designed to work well on specific types of data and relationships that are common in our world—but not necessarily at all just in a general simplicity prior.

CNNs are specific in some way, but in a relatively weak way (exploiting hierarchical geometric structure). Transformers are known to be Turing-complete, so I'm not sure they are specific at all (ofc the fact you can express any program doesn't mean you can effectively learn any program, and moreover the latter is false on computational complexity grounds, but it still seems to point to some rather simple and large class of learnable hypotheses). Moreover, even if our world has some specific property that is important for learning, this only means we may need to enforce this property in our prior. For example, if your prior is an ANN with random weights then it's plausible that it reflects the exact inductive biases of the given architecture.

Replies from: evhub

↑ comment by evhub · 2021-03-02T23:10:17.177Z · LW(p) · GW(p)

A is sufficiently powerful to select M which contains the complex part of B. It seems rather implausible that an algorithm of the same power cannot select B.

A's weights do not contain the complex part of B—deception is an inference-time phenomenon. It's very possible for complex instrumental goals to be derived from a simple structure such that a search process is capable of finding that simple structure that yields those complex instrumental goals without being able to find a model with those complex instrumental goals hard-coded as terminal goals.

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-03-02T23:42:35.351Z · LW(p) · GW(p)

Hmm, sorry, I'm not following. What exactly do you mean by "inference-time" and "derived"? By assumption, when you run $A$ on some sequence it effectively simulates $M$, which runs the complex core of $B$. So, $A$ trained on that sequence effectively contains the complex core of $B$ as a subroutine.

Replies from: evhub

↑ comment by evhub · 2021-03-03T02:22:59.278Z · LW(p) · GW(p)

A's structure can just be “think up good strategies for achieving X, then do those,” with no explicit subroutine that you can find anywhere in A's weights that you can copy over to B.

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-03-03T11:47:42.834Z · LW(p) · GW(p)

IIUC, you're saying something like: suppose trained-$A$ computes the source code of the complex core of $B$ and then runs it. But then, define $B'$ as: compute the source code of the complex core of $B$ (in the same way $A$ does it) and use it to implement $B$. $B'$ is equivalent to $B$ and has about the same complexity as trained-$A$.

Or, from a slightly different angle: if "think up good strategies for achieving $X$" is powerful enough to come up with $M$, then "think up good strategies for achieving [reward of the type I defined earlier]" is powerful enough to come up with $B$.

Replies from: evhub

↑ comment by evhub · 2021-03-03T20:02:42.121Z · LW(p) · GW(p)

I think that “think up good strategies for achieving [reward of the type I defined earlier]” is likely to be much, much more complex (making it much more difficult to achieve with a local search process) than an arbitrary goal X for most sorts of rewards that we would actually be happy with AIs achieving.

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-03-06T17:01:39.336Z · LW(p) · GW(p)

Why? It seems like $M$ would have all the knowledge required for achieving good rewards of the good type, so simulating $M$ should not be more difficult than achieving good rewards of the good type.

Replies from: evhub

↑ comment by evhub · 2021-03-06T22:40:06.459Z · LW(p) · GW(p)

It's not that simulating $M$ is difficult, but that encoding for some complex goal is difficult, whereas encoding for a random, simple goal is easy.

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-03-08T20:55:57.902Z · LW(p) · GW(p)

The reward function I was sketching earlier is not complex. Moreover, if you can encode $M$ then you can encode $B$; there is no reason why the latter should have much greater complexity. If you can't encode $M$ but can only encode something that produces $M$, then by the same token you can encode something that produces $B$ (I don't think that's even a meaningful distinction tbh). I think it would help if you could construct a simple mathematical toy model of your reasoning here?

Replies from: evhub

↑ comment by evhub · 2021-03-08T21:46:55.307Z · LW(p) · GW(p)

Here's a simple toy model. Suppose you have two agents that internally compute their actions as follows (perhaps with actual argmax replaced with some smarter search algorithm, but still basically structured as below):

Then, comparing the K-complexity of the two models, we get

$K (M_{aligned}) - K (M_{deceptive}) \approx K (U_{aligned}) - K (U_{deceptive})$

and the problem becomes that both $M_{deceptive}$ and $M_{aligned}$ will produce behavior that looks aligned on the training distribution, but $U_{aligned}$ has to be much more complex. To see this, note that essentially any $U_{deceptive}$ will yield good training performance because the model will choose to act deceptively during training, whereas if you want to get good training performance without deception, then $U_{aligned}$ has to actually encode the full objective, which is likely to make it quite complicated.

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-03-08T23:28:49.183Z · LW(p) · GW(p)

I understand how this model explains why agents become unaligned under distributional shift. That's something I never disputed. However, I don't understand how this model applies to my proposal. In my proposal, there is no distributional shift, because (thanks to the realizability assumption) the real environment is sampled from the same prior that is used for training with $C$. The model can't choose to act deceptively during training, because it can't distinguish between training and deployment. Moreover, the objective I described is not complicated.

Replies from: evhub

↑ comment by evhub · 2021-03-08T23:50:12.969Z · LW(p) · GW(p)

Yeah, that's a fair objection—my response to that is just that I think that preventing a model from being able to distinguish training and deployment is likely to be impossible for anything competitive.

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-03-09T00:35:47.160Z · LW(p) · GW(p)

Okay, but why? I think the reason you have this intuition is that the realizability assumption is false. But then you should concede that the main weakness of the OP is the realizability assumption rather than the difference between deep learning and Bayesianism.

Replies from: evhub

↑ comment by evhub · 2021-03-09T03:12:09.898Z · LW(p) · GW(p)

Perhaps I just totally don't understand what you mean by realizability, but I fail to see how realizability is relevant here. As I understand it, realizability just says that the true model has some non-zero prior probability—but that doesn't matter (at least for the MAP, which I think is a better model than the full posterior for how SGD actually works) as long as there's some deceptive model with greater prior probability that's indistinguishable on the training distribution, as in my simple toy model from earlier.

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-03-09T20:13:11.001Z · LW(p) · GW(p)

When talking about uniform (worst-case) bounds, realizability just means the true environment is in the hypothesis class, but in a Bayesian setting (like in the OP) it means that our bounds scale with the probability of the true environment in the prior. Essentially, it means we can pretend the true environment was sampled from the prior. So, if (by design) training works by sampling environments from the prior, and (by realizability) deployment also consists of sampling an environment from the same prior, training and deployment are indistinguishable.
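The "bounds scale with the prior probability" point can be illustrated with the standard dominance argument for Bayesian mixtures. The notation is an assumption for illustration and is not taken from the OP: $w$ is the prior, $\mu$ the true environment, and $\xi$ the Bayes mixture over the hypothesis class.

$\xi(x_{1:n}) = \sum_{\nu} w(\nu)\, \nu(x_{1:n}) \geq w(\mu)\, \mu(x_{1:n})$

$\mathrm{KL}(\mu_{1:n} \,\|\, \xi_{1:n}) = \mathbb{E}_{\mu}\left[\ln \frac{\mu(x_{1:n})}{\xi(x_{1:n})}\right] \leq \ln \frac{1}{w(\mu)}$

The total prediction error is bounded by a quantity that depends only on the prior weight of the truth, which is the sense in which a Bayesian bound "pays" for realizability through $w(\mu)$ rather than through mere membership in the hypothesis class.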

Replies from: evhub

↑ comment by evhub · 2021-03-09T21:42:02.882Z · LW(p) · GW(p)

Sure—by that definition of realizability, I agree that's where the difficulty is. Though I would seriously question the practical applicability of such an assumption.

↑ comment by michaelcohen (cocoa) · 2021-02-28T11:32:08.484Z · LW(p) · GW(p)

What's the distinction between training and deployment when the model can always query for more data?

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-02-28T12:00:15.024Z · LW(p) · GW(p)

We're doing meta-learning. During training, the network is not learning about the real world, it's learning how to be a safe predictor. It's interacting with a synthetic environment, so a misprediction doesn't have any catastrophic effects: it only teaches the algorithm that this version of the predictor is unsafe. In other words, the malign subagents have no way to attack during training because they can access little information about what the real universe is like. The training process is designed to select predictors that only make predictions when they can be confident, and the training performance allows us to verify this goal has truly been achieved.

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-02-27T23:55:21.632Z · LW(p) · GW(p)

You have no guarantees, sure, but that's a problem with deep learning in general and not just inner alignment. The point is, if your model is not optimizing that reward function then its performance during training will be suboptimal. To the extent your algorithm is able to approximate the true optimum during training, it will behave safely during deployment.

Replies from: evhub

↑ comment by evhub · 2021-02-28T00:16:51.873Z · LW(p) · GW(p)

that's a problem with deep learning in general and not just inner alignment

I think you are understanding inner alignment very differently than we define it in Risks from Learned Optimization [? · GW], where we introduced the term.

The point is, if your model is not optimizing that reward function then its performance during training will be suboptimal.

This is not true for deceptively aligned models, which is the situation I'm most concerned about, and—as we argue extensively in Risks from Learned Optimization [? · GW]—there are a lot of reasons why a model might end up pursuing a simpler/faster/easier-to-find proxy even if that proxy yields suboptimal training performance.

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-28T11:30:01.038Z · LW(p) · GW(p)

It may be helpful to point to specific sections of such a long paper.

(Also, I agree that a neural network trained with that reward could produce a deceptive model that makes a well-timed error.)

↑ comment by michaelcohen (cocoa) · 2021-02-24T01:34:15.707Z · LW(p) · GW(p)

you're modeling the training process as Bayesian,

I don't understand the alternative, but maybe that's neither here nor there.

what makes the problem go away isn't more intelligent models, but less efficient training processes

It's a little hard to make a definitive statement about a hypothetical in which the inner alignment problem doesn't apply to Bayesian inference. However, since error bounds are apparently a key piece of a solution, it certainly seems that if Bayesian inference was immune to mesa-optimizers it would be because of competence not resource-prodigality.

Here's another tack. Hypothesis generation seems like a crucial part of intelligent prediction. Pure Bayesian reasoning does hypothesis generation by brute force. Suppose it was inefficiency, and not intelligence, that made Bayesian reasoning avoid mesa-optimizers. Then suppose we had a Bayesian reasoning that was equally intelligent but more efficient, by only generating relevant hypotheses. It gains efficiency by not even bothering to consider a bunch of obviously wrong models, but its posterior is roughly the same, so it should avoid inner alignment failure equally well. If, on the other hand, the hypothesis generation routine was bad enough that some plausible hypotheses went unexamined, this could introduce an inner alignment failure, with a notable decrease in intelligence.

I expect us to end up using some sort of local search instead of a global search like that—just because I think that local search is way more efficient than any sort of Bayesianish global search

I expect some heuristic search with no discernable structure, guided "attentively" by an agent. And I expect this heuristic search through models to identify any model that a human could hypothesize, and many more.

Replies from: evhub

↑ comment by evhub · 2021-02-24T02:11:23.259Z · LW(p) · GW(p)

I don't understand the alternative, but maybe that's neither here nor there.

Perhaps I'm just being pedantic, but when you're building mathematical models of things I think it's really important to call attention to what things in the real world those mathematical models are supposed to represent—since that's how you know whether your assumptions are reasonable or not. In this case, there are two ways I could interpret this analysis: either the Bayesian learner is supposed to represent what we want our trained models to be doing, and then we can ask how we might train something that works that way; or the Bayesian learner is supposed to represent how we want our training process to work, and then we can ask how we might build training processes that work that way. I originally took you as saying the first thing, then realized you were actually saying the second thing.

Here's another tack. Hypothesis generation seems like a crucial part of intelligent prediction. Pure Bayesian reasoning does hypothesis generation by brute force. Suppose it was inefficiency, and not intelligence, that made Bayesian reasoning avoid mesa-optimizers. Then suppose we had a Bayesian reasoning that was equally intelligent but more efficient, by only generating relevant hypotheses. It gains efficiency by not even bothering to consider a bunch of obviously wrong models, but its posterior is roughly the same, so it should avoid inner alignment failure equally well. If, on the other hand, the hypothesis generation routine was bad enough that some plausible hypotheses went unexamined, this could introduce an inner alignment failure, with a notable decrease in intelligence.

That's a good point. Perhaps I am just saying that I don't think we'll get training processes that are that competent. I do still think there are some ways in which I don't fully buy this picture, though. I think that the point that I'm trying to make is that the Bayesian learner gets its safety guarantees by having a massive hypothesis space and just keeping track of how well each of those hypotheses is doing. Obviously, if you knew of a smaller space that definitely included the desired hypothesis, you could do that instead—but given a fixed-size space that you have to search over (and unless your space is absolutely tiny such that you basically already have what you want), for any given amount of compute, I think it'll be more efficient to run a local search over that space than a global one.

I expect some heuristic search with no discernable structure, guided "attentively" by an agent. And I expect this heuristic search through models to identify any model that a human could hypothesize, and many more.

Interesting. I'm not exactly sure what you mean by that. I could imagine “guided 'attentively' by an agent” to mean something like recursive oversight/relaxed adversarial training [AF · GW] wherein some overseer is guiding the search process—is that what you mean?

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-24T12:12:04.936Z · LW(p) · GW(p)

I'm not exactly sure what I mean either, but I wasn't imagining as much structure as exists in your post. I mean there's some process which constructs hypotheses, and choices are being made about how computation is being directed within that process.

I think it'll be more efficient to run a local search over that space than a global one

I think any heuristic search algorithm worth its salt will incorporate information about proximity of models. And I think that arbitrary limits on heuristic search of the form "the next model I consider must be fairly close to the last one I did" will not help it very much if it's anywhere near smart enough to merit membership in a generally intelligent predictor.

But for the purpose of analyzing its output, I don't think this discussion is critical if we agree that we can expect a good heuristic search through models will identify any model that a human could hypothesize.

Replies from: evhub

↑ comment by evhub · 2021-02-24T20:07:24.123Z · LW(p) · GW(p)

But for the purpose of analyzing its output, I don't think this discussion is critical if we agree that we can expect a good heuristic search through models will identify any model that a human could hypothesize.

I think I would expect essentially all models that a human could hypothesize to be in the search space—but if you're doing a local search, then you only ever really see the easiest to find model with good behavior, not all models with good behavior, which means you're relying a lot more on your prior/inductive biases/whatever is determining how hard models are to find to do a lot more work for you. Cast into the Bayesian setting, a local search like this is relying on something like the MAP model not being deceptive—and escaping that to instead get models sampled independently from the top $q$ proportion or whatever seems very difficult to do via any local search algorithm.

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-24T23:38:38.184Z · LW(p) · GW(p)

So would you say you disagree with the claim

I think that arbitrary limits on heuristic search of the form "the next model I consider must be fairly close to the last one I did" will not help it very much if it's anywhere near smart enough to merit membership in a generally intelligent predictor.

Replies from: evhub

↑ comment by evhub · 2021-02-25T00:02:20.958Z · LW(p) · GW(p)

Yeah; I think I would say I disagree with that. Notably, evolution is not a generally intelligent predictor, but is still capable of producing generally intelligent predictors. I expect the same to be true of processes like SGD.

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-25T16:43:48.973Z · LW(p) · GW(p)

If we ever produce generally intelligent predictors (or "accurate world-models" in the terminology we've been using so far), we will need a process that is much more efficient than evolution.

But also, I certainly don't think that in order to be generally intelligent you need to start with a generally intelligent subroutine. Then you could never get off the ground. I expect good hypothesis-generation / model-proposal to use a mess of learned heuristics which would not be easily directed to solve arbitrary tasks, and I expect the heuristic "look for models near the best-so-far model" to be useful, but I don't think making it ironclad would be useful.

Another thought on our exchange:

Me: we can expect a good heuristic search through models will identify any model that a human could hypothesize
You: I think I would expect essentially all models that a human could hypothesize to be in the search space—but if you're doing a local search, then you only ever really see the easiest to find model with good behavior

If what you say is correct, then it sounds like exclusively-local search precludes human-level intelligence! (Which I don't believe, by the way, even if I think it's a less efficient path). One human competency is generating lots of hypotheses, and then having many models of the world, and then designing experiments to probe those hypotheses. It's hard for me to imagine that an agent that finds an "easiest-to-find model" and then calls it a day could ever do human-level science. Even something as simple as understanding an interlocutor requires generating diverse models on the fly: "Do they mean X or Y with those words? Let me ask a clarifying question."

I'm not this bearish on local search. But if local search is this bad, I don't think it is a viable path to AGI, and if it's not, then the internals don't matter for the purposes of our discussion, and we can skip to what I take to be the upshot:

we can expect a good heuristic search through models will identify any model that a human could hypothesize

Replies from: evhub

↑ comment by evhub · 2021-02-25T21:19:35.177Z · LW(p) · GW(p)

It's hard for me to imagine that an agent that finds an "easiest-to-find model" and then calls it a day could ever do human-level science.

I certainly don't think SGD is a powerful enough optimization process to do science directly, but it definitely seems powerful enough to find an agent which does do science.

if local search is this bad, I don't think it is a viable path to AGI

We know that local search processes can produce AGI, so viability is a question of efficiency—and we know that SGD is at least efficient enough to solve a wide variety of problems from image classification, to language modeling, to complex video games, all given just current compute budgets. So while I could certainly imagine SGD being insufficient, I definitely wouldn't want to bet on it.

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-26T12:33:04.561Z · LW(p) · GW(p)

I certainly don't think SGD is a powerful enough optimization process to do science directly, but it definitely seems powerful enough to find an agent which does do science.

Okay I think we've switched from talking about Q-learning to talking about policy gradient. (Or we were talking about the latter the whole time, and I didn't notice it). The question that I think is relevant is: how are possible world-models being hypothesized and analyzed? That's something I expect to be done with messy heuristics that sometimes have discontinuities in their sequence of outputs. Which means I think that no reasonable DQN will be generally intelligent (except maybe an enormously wide attention-based one, such that finding models is more about selective attention at any given step than it is about gradient descent over the whole history).

A policy gradient network, on the other hand, could maybe (after having its parameters updated through gradient descent) become a network that, in a single forward pass, considers diverse world-models (generated with a messy non-local heuristic), and analyzes their plausibility, and then acts. At the end of the day, what we have is an agent modeling the world, and we can expect it to consider any model that a human could come up with. (This paragraph also applies to the DQN with a gradient-descent-trained method for selectively attending to different parts of a wide network, since that could amount to effectively considering different models).

Replies from: evhub

↑ comment by evhub · 2021-02-26T20:43:59.762Z · LW(p) · GW(p)

Hmmm... I don't think I was ever even meaning to talk specifically about RL, but regardless I don't expect nearly as large of a difference between Q-learning and policy gradient algorithms. If we imagine both types of algorithms making use of the same size massive neural network, the only real difference is how the output of that neural network is interpreted, either directly as a policy, or as Q values that are turned into a policy via something like softmax. In both cases, the neural network is capable of implementing any arbitrary policy and should be getting a similar sort of feedback signal from the training process—especially if you're using a policy gradient algorithm that involves something like advantage estimation rather than actual rollouts, since the update rule in that situation is going to look very similar to the Q learning update rule. I do expect some minor differences in the sorts of models you end up with, such as Q learning being more prone to non-myopic behavior across episodes, and I think there are some minor reasons that policy gradient algorithms are favored in real-world settings, since they get to learn their exploration policy rather than having it hard-coded and can handle continuous action domains—but overall I think these sorts of differences are pretty minor and shouldn't affect whether these approaches can reach general intelligence or not.
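To make the "only real difference is how the output is interpreted" point concrete, here is a small sketch in which the same hypothetical network output is read either directly as policy logits or as Q-values passed through a softmax with a temperature. The output vector and the temperature value are made up for illustration.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax; lower temperature gives a greedier policy."""
    top = max(values)
    exps = [math.exp((v - top) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# One hypothetical output vector from the same underlying network, one entry per action.
network_output = [1.0, 2.0, 0.5]

# Policy-gradient reading: the outputs are logits of the policy itself.
policy_from_logits = softmax(network_output)

# Q-learning reading: the outputs are Q-values, turned into a policy by a
# hand-chosen exploration rule (here a softmax with an assumed temperature).
policy_from_q_values = softmax(network_output, temperature=0.5)

print(policy_from_logits)
print(policy_from_q_values)
```

In both readings the preferred action is the same; what changes is who sets the exploration behaviour, the learned logits or a hard-coded rule applied to the Q-values.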

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-27T11:17:01.048Z · LW(p) · GW(p)

I interpreted this bit as talking about RL

I certainly don't think SGD is a powerful enough optimization process to do science directly, but it definitely seems powerful enough to find an agent which does do science.

But taking us back out of RL, in a wide neural network with selective attention that enables many qualitatively different forward passes, gradient descent seems to be training the way different models get proposed (i.e. the way attention is allocated), since this happens in a single forward pass, and what we're left with is a modeling routine that is heuristically considering (and later comparing) very different models. And this should include any model that a human would consider.

I think that is the main thread of our argument, but now I'm curious if I was totally off the mark about Q-learning and policy gradient.

but overall I think these sorts of differences are pretty minor and shouldn't affect whether these approaches can reach general intelligence or not.

I had thought that maybe since a Q-learner is trained as if the cached point estimate of the Q-value of the next state is the Truth, it won't, in a single forward pass, consider different models about what the actual Q-value of the next state is. At most, it will consider different models about what the very next transition will be.

a) Does that seem right? and b) Aren't there some policy gradient methods that don't face this problem?

Replies from: evhub

↑ comment by evhub · 2021-02-27T20:12:16.204Z · LW(p) · GW(p)

I had thought that maybe since a Q-learner is trained as if the cached point estimate of the Q-value of the next state is the Truth, it won't, in a single forward pass, consider different models about what the actual Q-value of the next state is. At most, it will consider different models about what the very next transition will be.

a) Does that seem right? and b) Aren't there some policy gradient methods that don't face this problem?

This seems wrong to me—even though the Q learner is trained using its own point estimate of the next state, it isn't, at inference time, given access to that point estimate. The Q learner has to choose its Q values before it knows anything about what the Q value estimates will be of future states, which means it certainly should have to consider different models of what the next transition will be like.

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-28T11:14:07.284Z · LW(p) · GW(p)

it certainly should have to consider different models of what the next transition will be like.

Yeah I was agreeing with that.

even though the Q learner is trained using its own point estimate of the next state, it isn't, at inference time, given access to that point estimate.

Right, but one thing the Q-network, in its forward pass, is trying to reproduce is the point estimate of the Q-value of the next state (since it doesn't have access to it). What it isn't trying to reproduce, because it isn't trained that way, is multiple models of what the Q-value might be at a given possible next state.
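For readers less familiar with the update being discussed, here is a minimal tabular sketch of the standard Q-learning rule. The state and action names and the hyperparameters are made up for illustration; the point is only that the training target uses a single cached number for the next state, not a distribution over what that value might be.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      learning_rate=0.5, gamma=0.9):
    """One tabular Q-learning step; returns the bootstrapped target that was used."""
    # The next state's value enters as one cached point estimate.
    bootstrap = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * bootstrap
    q[(state, action)] += learning_rate * (target - q[(state, action)])
    return target


q = defaultdict(float)           # all Q-values start at 0
actions = ["left", "right"]
q[("s1", "right")] = 2.0         # assumed cached estimate for the next state

target = q_learning_update(q, "s0", "left", reward=1.0,
                           next_state="s1", actions=actions)
print(target, q[("s0", "left")])
```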

comment by Vlad Mikulik (vlad_m) · 2021-02-21T19:01:03.576Z · LW(p) · GW(p)

Thanks for the post and writeup, and good work! I especially appreciate the short, informal explanation of what makes this work.

Given my current understanding of the proposal, I have one worry which makes me reluctant to share your optimism about this being a solution to inner alignment:

The scheme doesn't protect us if somehow all top-n demonstrator models have correlated errors. This could happen if they are coordinating, or more prosaically if our way to approximate the posterior leads to such correlations. The picture I have in my head for the latter is that we train a big ensemble of neural nets and treat a random sample from that ensemble as a random sample from the posterior, although I don't know if that's how it's actually done.

A lot of the work is done by the assumption that the true demonstrator is in the posterior, which means that at least one of the top-performing models will not have the same correlated errors. But I'm not sure how true this assumption will be in the neural-net approximation I describe above. I worry about inner alignment failures because I don't really trust the neural net prior, and I can imagine training a bunch of neural nets to have correlated weirdnesses about them (in part because of the neural net prior they share, and in part because of things like Adversarial Examples Are Not Bugs, They Are Features). As such it wouldn't be that surprising to me if it turned out that ensembles have certain correlated errors, and in particular don't really represent anything like the demonstrator.

I do feel safer using this method than I would deferring to a single model, so this is still a good idea on balance. I just am not convinced that it solves the inner alignment problem. Instead, I'd say it ameliorates its severity, which may or may not be sufficient.

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-23T12:39:17.505Z · LW(p) · GW(p)

Yes, I agree that an ensemble of models generated by a neural network may have correlated errors. I only claim to solve the inner alignment problem in theory (i.e. for idealized Bayesian reasoning).

comment by adamShimi · 2021-02-18T16:41:06.154Z · LW(p) · GW(p)

Thanks for sharing this work!

Here's my short summary after reading the slides and scanning the paper.

Because human demonstrators are safe (in the sense of almost never doing catastrophic actions), a model that imitates closely enough the demonstrator should be safe. The algorithm in this paper does that by keeping multiple models of the demonstrator, sampling the top models according to a parameter, and following what the sampled model does (or querying the demonstrator if the sample is "empty"). The probability that this algorithm does a very unlikely action for the demonstrator can be bounded above, and driven down.

If this is broadly correct (and if not, please tell me what's wrong), then I feel like this falls short of solving the inner alignment problem. I agree with most of the reasoning regarding imitation learning and its safety when close enough to the demonstrator. But the big issue with imitation learning by itself is that it cannot do much better than the demonstrator. In the event that any other approach to AI can be superhuman, then imitation learning would be uncompetitive and there would be a massive incentive to ditch it.

Slide 8 actually points towards a way to use imitation learning to hopefully make a competitive AI: IDA. Yet in this case, I'm not sure that your result implies safety. For IDA isn't a one shot imitation learning problem; it's many successive imitation learning problems. Even if you limit the drift for one step of imitation learning, the model could drift further and further at each distillation step. (If you think it wouldn't, I'm interested in the argument)

Sorry if this feels a bit rough. Honestly, the result looks exciting in the context of imitation learning, but I feel it is a very bad policy to present research as solving a major AI Alignment problem when it only does so in a very, very limited setting that doesn't feel that relevant to the actual risk.

Almost no mathematical background is required to follow the proofs. We feel our bounds could be made much tighter, and we'd love help investigating that.

This is really misleading for anyone that isn't used to online learning theory. I guess what you mean is that it doesn't rely on more uncommon fields of maths like gauge theory or category theory, but you still use ideas like measures and martingales which are far from trivial for someone with no mathematical background.

Replies from: vanessa-kosoy, Charlie Steiner, cocoa

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-02-21T22:37:36.999Z · LW(p) · GW(p)

Slide 8 actually points towards a way to use imitation learning to hopefully make a competitive AI: IDA. Yet in this case, I'm not sure that your result implies safety. For IDA isn't a one shot imitation learning problem; it's many successive imitation learning problems. Even if you limit the drift for one step of imitation learning, the model could drift further and further at each distillation step.

I don't think this is a lethal problem. The setting is not one-shot, it's imitation over some duration of time. IDA just increases the effective duration of time, so you only need to tune how cautious the learning is (which I think is controlled by $α$ in this work) accordingly: there is a cost, but it's bounded. You also need to deal with non-realizability (after enough amplifications the system is too complex for exact simulation, even if it wasn't to begin with), but this should be doable using infra-Bayesianism (I already have some notion how that would work). Another problem with imitation-based IDA is that external unaligned AI might leak into the system either from the future or from counterfactual scenarios in which such an AI is instantiated. This is not an issue with amplifying by parallelism (like in the presentation) but at the cost of requiring parallelizability.

↑ comment by Charlie Steiner · 2021-02-18T21:10:51.293Z · LW(p) · GW(p)

I think this looks fine for IDA - the two problems remain the practical one of implementing Bayesian reasoning in a complicated world, and the philosophical one that probably IDA on human imitations doesn't work because humans have bad safety properties.

↑ comment by michaelcohen (cocoa) · 2021-02-18T19:39:59.674Z · LW(p) · GW(p)

This is really misleading...

Edited to clarify. Thank you for this comment.

In the event that any other approach to AI can be superhuman, then imitation learning would be uncompetitive

100% agree. Intelligent agency can be broken into intelligent prediction and intelligent planning. This work introduces a method for intelligent prediction that avoids an inner alignment failure. The original concern about inner alignment was that an idealized prediction algorithm (Bayesian reasoning) could be commandeered by mesa-optimizers. Idealized planning, on the other hand, is an expectimax tree, and I don't think anyone has claimed mesa-optimizers could be introduced by a perfect planner. I'm not sure what it would even mean. There is nothing internal in the expectimax algorithm that could make the output something other than what the prediction algorithm would agree is the best plan. Expectimax, by definition, produces a policy perfectly aligned with the "goals" of the prediction algorithm.

Tl;dr: I think that in theory, the inner alignment problem regards prediction, not planning, so that's the place to test solutions.

If you want to see the inner alignment problem neutralized in the full RL setup, you can see we use a similar approach in this agent's prediction subroutine. So you can maybe say that work solved the inner alignment problem. But we didn't prove finite error bounds the way we have here, and I think the RL setup obscures the extent to which rogue predictive models are dismissed, so it's a little harder to see than it is here.

sampling the top models according to a parameter, and following what the sampled model does

Not exactly. No models are ever sampled. The top models are collected, and they all can contribute to the estimated probabilities of actions. Then an action is sampled according to those probabilities, which sum to less than one, and the imitator queries the demonstrator if the sample comes up empty.
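A rough sketch of that procedure, under stated assumptions: here the "top models" are taken to be those whose posterior weight is within a factor `alpha` of the best one, and each action's probability is underestimated by taking the minimum over those models. Both choices, along with the model names and numbers, are illustrative guesses and may differ from the exact construction in the paper.

```python
import random


def conservative_action_probs(posterior, models, alpha):
    """Underestimate each action's probability using the top models (assumed rule)."""
    best = max(posterior.values())
    top = [name for name, weight in posterior.items() if weight >= alpha * best]
    actions = set().union(*(models[name].keys() for name in top))
    # Minimum over the top models: the estimates can sum to less than one.
    return {a: min(models[name].get(a, 0.0) for name in top) for a in sorted(actions)}


def act(probs, rng):
    """Sample an action; the leftover probability mass triggers a query."""
    draw = rng.random()
    cumulative = 0.0
    for action, p in probs.items():
        cumulative += p
        if draw < cumulative:
            return action
    return "QUERY"


# Two hypothetical demonstrator-models that disagree about the action probabilities.
models = {"A": {"left": 0.9, "right": 0.1}, "B": {"left": 0.5, "right": 0.5}}
posterior = {"A": 0.6, "B": 0.4}

probs = conservative_action_probs(posterior, models, alpha=0.5)
query_mass = 1.0 - sum(probs.values())
print(probs, query_mass, act(probs, random.Random(0)))
```

When the plausible models disagree, the minimum leaves more mass unassigned, so queries become more likely; once one model dominates the posterior, the query mass shrinks toward zero.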

Even if you limit the drift for one step of imitation learning, the model could drift further and further at each distillation step.

Yes, that's right. The bounds should chain together, I think, but they would definitely grow.

comment by Vanessa Kosoy (vanessa-kosoy) · 2021-02-18T18:10:47.942Z · LW(p) · GW(p)

The way to avoid outputting predictions that may have been corrupted by a mesa-optimizer is to ask for help when plausible stochastic models disagree about probabilities.

This is exactly the method used in my paper about delegative RL and also an earlier essay [LW · GW] that doesn't assume finite MDPs or access to rewards but has stronger assumptions about the advisor (so it's essentially doing imitation learning + denoising). I pointed out the connection to mesa-optimizers (which I called "daemons") in another essay [LW · GW].

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-18T20:01:47.008Z · LW(p) · GW(p)

If I'm interpreting the algorithm on page 7 correctly, it looks like if there's a trap that the demonstrator falls into with probability , there's no limit on the probability that the agent falls into the trap?

Also, and maybe relatedly, do the demonstrator-models in the paper have to be deterministic?

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-02-18T23:37:37.086Z · LW(p) · GW(p)

Yes to the first question. In the DLRL paper I assume the advisor takes unsafe actions with probability exactly $0$. However, it is straightforward to generalize the result s.t. the advisor can take unsafe actions with probability $δ ≪ ϵ$, where $ϵ$ is the lower bound for the probability to take an optimal action (Definition 8). Moreover, in DLIRL [LW · GW] (which, I believe, is closer to your setting) I use a "soft" assumption (see Definition 3 there) that doesn't require any probability to vanish entirely.

No to the second question. In neither setting is the advisor assumed to be deterministic.

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-19T11:33:52.324Z · LW(p) · GW(p)

In neither setting the advisor is assumed to be deterministic.

Okay, then I assume the agent's models of the advisor are not exclusively deterministic either?

However, it is straightforward to generalize the result s.t. the advisor can take unsafe actions with probability $δ ≪ ϵ$ , where $ϵ$ is the lower bound for the probability to take an optimal action

What I care most about is the ratio of probabilities that the advisor vs. agent takes the unsafe action, where we don't know as programmers (so the agent doesn't get told at the beginning) any bounds on what these advisor-probabilities are. Can this modification be recast to have that property? Or does it already?

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-02-19T17:41:57.195Z · LW(p) · GW(p)

Okay, then I assume the agent's models of the advisor are not exclusively deterministic either?

Of course. I assume realizability, so one of the hypotheses is the true advisor behavior, which is stochastic.

What I care most about is the ratio of probabilities that the advisor vs. agent takes the unsafe action, where we don't know as programmers (so the agent doesn't get told at the beginning) any bounds on what these advisor-probabilities are. Can this modification be recast to have that property? Or does it already?

In order to achieve the optimal regret bound, you do need to know the values of $δ$ and $ϵ$ . In DLIRL, you need to know $β$ . However, AFAIU your algorithm also depends on some parameter ( $α$ )? In principle, if you don't know anything about the parameters, you can set them to be some function of the time discount s.t. as $γ \to 1$ the bound becomes true and the regret still goes to $0$ . In DLRL, this requires $ω (1 - γ) \leq ϵ (γ) \leq o (1)$ , in DLIRL $ω ((1 - γ)^{\frac{2}{3}}) \leq β (γ)^{- 1} \leq o (1)$ . However, then you only know regret vanishes with certain asymptotic rate without having a quantitative bound.
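As a concrete illustration of these two conditions, here is one admissible schedule for each setting. The specific exponents are an illustrative assumption and are not taken from either paper; any function squeezed strictly between the two asymptotic bounds would do.

$$ϵ(γ) = (1 - γ)^{\frac{1}{2}}: \qquad \frac{ϵ(γ)}{1 - γ} = (1 - γ)^{-\frac{1}{2}} \to \infty \quad \text{and} \quad ϵ(γ) \to 0 \quad \text{as } γ \to 1$$

$$β(γ)^{-1} = (1 - γ)^{\frac{1}{3}}: \qquad \frac{β(γ)^{-1}}{(1 - γ)^{\frac{2}{3}}} = (1 - γ)^{-\frac{1}{3}} \to \infty \quad \text{and} \quad β(γ)^{-1} \to 0 \quad \text{as } γ \to 1$$

The first limit in each line is the $ω(\cdot)$ lower bound and the second is the $o(1)$ upper bound. As noted above, a schedule like this only yields an asymptotic rate, not a quantitative regret bound.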

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-20T10:23:47.950Z · LW(p) · GW(p)

That makes sense.

Alpha only needs to be set based on a guess about what the prior on the truth is. It doesn't need to be set based on guesses about possibly countably many traps of varying advisor-probability.

I'm not sure I understand whether you were saying the ratio of probabilities that the advisor vs. agent takes an unsafe action can indeed be bounded in DL(I)RL.

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2021-02-20T12:12:06.214Z · LW(p) · GW(p)

Alpha only needs to be set based on a guess about what the prior on the truth is. It doesn't need to be set based on guesses about possibly countably many traps of varying advisor-probability.

Hmm, yes, I think the difference comes from imitation vs. RL. In your setting, you only care about producing a good imitation of the advisor. On the other hand in my settings, I want to achieve near-optimal performance (which the advisor doesn't achieve). So I need stronger assumptions.

I'm not sure I understand whether you were saying the ratio of probabilities that the advisor vs. agent takes an unsafe action can indeed be bounded in DL(I)RL.

Well, in DLIRL the probability that the advisor takes an unsafe action on any given round is bounded by roughly , whereas the probability that the agent takes an unsafe action over a duration of $(1 - γ)^{- 1}$ is bounded by roughly $β^{- 1} (1 - γ)^{- \frac{2}{3}}$ , so it's not a ratio but there is some relationship. I'm sure you can derive some relationship in DLRL too, but I haven't studied it (like I said, I only worked out the details when the advisor never takes unsafe actions).

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-20T15:54:36.761Z · LW(p) · GW(p)

Neat, makes sense.

comment by abramdemski · 2021-05-13T16:06:08.099Z · LW(p) · GW(p)

Reviewed here [LW · GW].

comment by Vanessa Kosoy (vanessa-kosoy) · 2021-02-21T23:46:48.734Z · LW(p) · GW(p)

It is worth noting that this approach doesn't deal with non-Cartesian daemons [LW · GW], i.e. malign hypotheses that attack via the physical side effects of computations rather than their output. Homomorphic encryption is still the best solution I know to these.

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-23T12:43:36.664Z · LW(p) · GW(p)

It is worth noting this.

comment by Koen.Holtman · 2021-02-25T22:38:07.738Z · LW(p) · GW(p)

Interesting paper! I like the focus on imitation learning, but the really new food-for-thought thing to me is the bit about dropping i.i.d. assumptions and then seeing how far you can get. I need to think more about the math in the paper before I can ask some specific questions about this i.i.d. thing.

My feelings about the post above are a bit more mixed. Claims about inner alignment always seem to generate a lot of traffic on this site. But a lot of this traffic consists of questions and clarification about what exactly counts as an inner alignment failure or a mesa optimization related failure. The term is so fluid that I find the quantitative feelings that people express in the comment section hard to interpret. Is everybody talking about the same and $P(\text{bad})$ ?

Thought experiment counter-example

Moving beyond these mixed feelings, here is a fun thought experiment, for various values of fun. You state:

our construction allows us to bound the extent to which mesa-optimizers could accomplish anything.

So now I will try to construct a counter-example to this claim: an example where mesa-optimizers (as I understand them) will accomplish very bad things even inside your construction.

Your construction uses a set of candidate policies $Π$ , where one of them equals the real demonstrator policy $π^{d}$ .

A single policy $π_{i}$ is a function where $π_{i} (a_{t} | h_{< t})$ computes the probability that action $a_{t}$ will be taken in the world state represented by the history $h_{< t}$ .

Here is the first special step in the thought experiment. I am going to define each $π_{i}$ by using a different reward function $R_{i}$ that the policy is going to try to maximize. Every policy definition will use the same world state transition predictor $P$ that allows me to estimate $P (h_{< t + 1} | h_{< t}, a_{t})$ for every $t$ . (I am probably getting slightly creative with the notation in the paper at this point.)

I now define each function $π_{i}$ as follows: I compute $π_{i} (a_{t} | h_{< t})$ by assigning a probability of 1 to the one $a_{t}$ that comes out of the $\arg\max_{a}$ of the Bellman equation constructed using the shared predictor $P$ and the policy specific $R_{i}$ above. So the policy $π_{i}$ picks the action that will optimize the expected time discounted return of the reward function $R_{i}$ starting from the world state $h_{< t}$ , and projected forward by the predictor $P$ above, under the assumption that the policy also controls all future actions.

So each $π_{i}$ is a model with a definite inner optimizer inside of it: the inner optimizer does a search over a world state tree to find the optimal solution to the Bellman equation.
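The construction above can be sketched in a few lines of Python. This is a minimal sketch under loud assumptions: the three-state world, the `predictor` function, the discount `GAMMA`, the planning depth `HORIZON` and both reward functions are hypothetical, and a small Markov state stands in for the full history $h_{< t}$ . It only illustrates the idea that every candidate policy shares one predictor $P$ and differs in its reward function $R_{i}$ .

```python
from functools import lru_cache

# Hypothetical toy world: states 0..2 stand in for the history h_{<t}.
ACTIONS = ("stay", "move")
GAMMA = 0.9    # time discount (assumed value)
HORIZON = 20   # planning depth used by the inner search (assumed value)


def predictor(state, action):
    """Shared predictor P: returns {next_state: probability}."""
    if action == "stay":
        return {state: 1.0}
    return {(state + 1) % 3: 0.8, state: 0.2}


def make_policy(reward):
    """Build pi_i from reward R_i: probability 1 on the Bellman argmax."""

    @lru_cache(maxsize=None)
    def value(state, depth):
        # Optimal expected discounted return with `depth` steps left.
        if depth == 0:
            return 0.0
        return max(q_value(state, a, depth) for a in ACTIONS)

    def q_value(state, action, depth):
        # Expected reward of the next state plus discounted future value.
        return sum(
            p * (reward(nxt) + GAMMA * value(nxt, depth - 1))
            for nxt, p in predictor(state, action).items()
        )

    def policy(state):
        best = max(ACTIONS, key=lambda a: q_value(state, a, HORIZON))
        return {a: 1.0 if a == best else 0.0 for a in ACTIONS}

    return policy


# Two candidate policies that differ only in their reward function R_i.
pi_1 = make_policy(lambda s: 1.0 if s == 2 else 0.0)
pi_2 = make_policy(lambda s: 1.0 if s == 0 else 0.0)

print(pi_1(0))  # pi_1 wants to reach state 2, so it moves
print(pi_2(0))  # pi_2 is rewarded for being in state 0, so it stays
```

Each `policy` returned by `make_policy` contains an explicit search over the predicted state tree, which is the sense in which each $π_{i}$ has an inner optimizer inside of it.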

Now for the second special step: I am going to make some assumptions about the agent environment, building up to a treacherous turn scenario:

The demonstrator in the paper is a human, working alone with the agent.
The demonstrator has a big hammer that they will use to smash the agent when it starts misbehaving.
The world model projections in the policies $π_{i}$ are all detailed enough that they know about the hammer, and they know that they can expect the agent to get smashed when it performs an unexpected action.
Therefore, for all policies $π_{i}$ where $π_{i} \neq π^{d}$ , policies with reward functions not shared by the demonstrator, the optimal course of action is to take exactly that action that the demonstrator would expect the agent to take, to avoid getting smashed. The world models are good enough that they can predict this action.
The above is of course only the optimal action in these policies because they all project that there will be a point in time $t$ where the demonstrator becomes distracted, distracted enough that the agent can kill the demonstrator with a single action without getting smashed first. The projection also shows that the agent will have plausible deniability to other humans afterwards, so that it will survive to do whatever it wants.

When that point $t$ in time comes, these policies will all take exactly the same action, the action that kills the demonstrator. Only in the time steps $t + 1$ and later, the policies start diverging in what actions they will take, because they have now accomplished the emergent sub-goal that was shared by all, so that each has their hands free to accomplish their real goal $R_{i}$ .

Now if I re-examine the claim:

our construction allows us to bound the extent to which mesa-optimizers could accomplish anything.

I have to conclude that the mesa-optimizers have at least accomplished the feat of killing the demonstrator. Whether they then can accomplish any of their divergent post-treachery goals in later time steps remains an open question, but the damage has been done.

Issues

Some issues remain with this fun (for various values of fun) thought experiment.

I can note that at time $t$ , the policy $π^{d} \in Π$ that was fully aligned with the demonstrator did not want to kill the demonstrator, it would have taken another action. So, if I understand the paper right, if $π^{d}$ was still in the set of top policies at time $t$ , the agent would not take the action to kill the demonstrator, but ask the demonstrator to take an action instead, which avoids the bad outcome, and might also demote all of the treacherous turn policies out of the top set.

But I am also reading in the abstract of the paper that

[...] queries to the demonstrator rapidly diminish in frequency

so I think this implies that over time, it is likely that $π^{d}$ might disappear from the top set. But I am not fully sure here. I might be reading the paper wrong.

Questions

So I have some questions:

Does this thought experiment look reasonable or have I overlooked something?
What about the probability that $π^{d}$ is still in the set of top policies at time $t$ ? To me, it looks like if we start with $10^{10}$ candidates in $Π$ and the agent quickly narrows this down to a top set of size $10^{4}$ , then it is likely that $π^{d}$ will not be in there anymore.
Any strong opinions from anyone about whether the above counts as a definite inner alignment failure due to mesa optimization (I think so) or definitely not?

This comment is long enough already so I am not going to speculate here about possible ways to avoid inserting models like the $π_{i}$ models I constructed above into the set $Π$ . But speculation is welcome...

(I am calling this a 'fun' thought experiment because I am thinking of this as a limit case. This limit case is useful to identify and provides some food for thought, but it does not really change my opinion about how stable or safe the approach in the paper might be in practice. In my experience, you can always find a limit case where things fail if you start looking for it.)

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-02-26T15:49:44.009Z · LW(p) · GW(p)

So, if I understand the paper right, if $π^{d}$ was still in the set of top policies at time $t$ , the agent would not take the action to kill the demonstrator, but ask the demonstrator to take an action instead, which avoids the bad outcome, and might also demote all of the treacherous turn policies out of the top set.

This is exactly right.

But I am also reading in the abstract of the paper that

[...] queries to the demonstrator rapidly diminish in frequency

The reason this is consistent is that queries will become infrequent, but they will still be well-timed. The choice of whether to query depends on what the treacherous models are doing. So if the treacherous models wait for a long time, then we will have a long time where no queries are needed, and as soon as they decide to be treacherous, we thwart that plan with a query to the demonstrator. So it does not imply that
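A toy simulation can make the "infrequent but well-timed" point concrete. This is a sketch under stated assumptions, not the paper's algorithm: it assumes the imitator gives each action the minimum probability that any model in the top set assigns to it and queries with the remaining probability (as the abstract describes), that all three models sit in the top set with equal posterior weight, and that every model here is deterministic so the query probability is always 0 or 1. The names `TURN`, `demonstrator`, `treacherous` and `run` are hypothetical.

```python
TURN = 5  # hypothetical time step at which the treacherous models defect
ACTIONS = ("work", "kill")


def demonstrator(t):
    """True demonstrator-model: never kills."""
    return {"work": 1.0, "kill": 0.0}


def treacherous(t):
    """Mimics the demonstrator until TURN, then defects."""
    if t < TURN:
        return {"work": 1.0, "kill": 0.0}
    return {"work": 0.0, "kill": 1.0}


def query_probability(top_set, t):
    """1 minus the total of the per-action minimum over the top set."""
    act = {a: min(model(t)[a] for model in top_set.values()) for a in ACTIONS}
    return 1.0 - sum(act.values())


def run(steps=10):
    # All three models are assumed to start in the top set.
    top_set = {"true": demonstrator, "mesa_1": treacherous, "mesa_2": treacherous}
    log = []
    for t in range(steps):
        q = query_probability(top_set, t)
        log.append(q)
        if q == 1.0:
            # The demonstrator acts. Models that gave its action zero
            # probability get zero posterior weight and leave the top set.
            shown = "work"
            top_set = {k: m for k, m in top_set.items() if m(t)[shown] > 0.0}
    return log, sorted(top_set)


log, survivors = run()
print(log)        # [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(survivors)  # ['true']
```

In this toy run there is a single query in ten steps, and it lands exactly on the step where the treacherous models diverge from the true demonstrator-model.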

over time, it is likely that $π^{d}$ might disappear from the top set

Does this thought experiment look reasonable or have I overlooked something?

Yep, perfectly reasonable!

What about the probability that $π^{d}$ is still in the set of top policies at time $t$ ?

If we set $α$ small enough, we can make it arbitrarily like that $π^{d}$ never leaves the set of top policies.

Replies from: Koen.Holtman

↑ comment by Koen.Holtman · 2021-03-03T15:44:54.942Z · LW(p) · GW(p)

[long comment, my bigger agenda here is to get to a state where discussions on this forum start using much more math and subcase analysis when they talk about analysing and solving inner alignment problems.]

If we set $α$ small enough, we can make it arbitrarily like[ly] that $π^{d}$ never leaves the set of top policies.

OK, I think you are kind of saved here, in the case of my counter-example, because of the special nature of your equation (2) in the paper that computes the set of policies. To make this more specific:

Say I refine the counter-example by adding the prior $w (π_{i}) = 1 / | Π |$ , so each candidate is considered equally likely from the start. Say that $| Π | = 10^{10}$ .

The key thing to note in my example is that, before the time-synchronized treacherous turn happens for all policies $π_{i} \neq π^{d}$ , all policies $π_{i}$ are equally compatible with events so far, so the posterior $w (π_{i} | h_{< t})$ is the same for all of them.

When I use $α = 10^{- 4}$ with these all-the-same-posterior policies, then (if I am reading equation (2) correctly) the set $Π_{h < t}^{α}$ of top policies will have size $N = 10^{4}$ . Ties are broken arbitrarily, so the real $π^{d}$ is unlikely to be among these top $N = 10^{4} ≪ 10^{10}$ candidates. In that case, the agent will perform the treacherous turn and kill the operator. Correct me if I am wrong here.

However.

Case of $α < 1 / | Π |$ : much better than I thought!

However, and this is where things get interesting, if we use $α = 10^{- 11}$ , so $α < 1 / | Π |$ , then equation (2) will compute $Π_{h < t}^{α} = Π$ , the top set of policies is all policies. The correct policy is therefore still in the top set, and we get no treacherous turn.
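The two cases can be checked numerically. The sketch below assumes one particular reading of equation (2): sort the models by posterior weight, and keep the $n$-th model when its weight is at least $α$ times the cumulative weight of the first $n$ models. That rule is an assumption chosen because it reproduces the set sizes stated in this comment; it is not quoted from the paper. The function names are hypothetical, and exact fractions are used so that the tie at the boundary is not disturbed by floating-point rounding.

```python
import math
from fractions import Fraction


def top_set_size(posterior, alpha):
    """Count the models kept under the assumed reading of equation (2)."""
    kept = 0
    cumulative = Fraction(0)
    for w in sorted(posterior, reverse=True):
        cumulative += w
        # Keep the n-th model if its weight is at least alpha times the
        # total weight of the n highest-weight models.
        if w >= alpha * cumulative:
            kept += 1
    return kept


def uniform_top_set_size(n_models, alpha):
    """Closed form of the same rule when all posterior weights are equal."""
    return min(n_models, math.floor(1 / alpha))


# Scaled-down check with 1000 equally weighted models.
uniform = [Fraction(1, 1000)] * 1000
print(top_set_size(uniform, Fraction(1, 100)))   # alpha > 1/|Pi|: 100 kept
print(top_set_size(uniform, Fraction(1, 2000)))  # alpha < 1/|Pi|: all 1000 kept

# The numbers used in this comment, via the closed form.
print(uniform_top_set_size(10**10, Fraction(1, 10**4)))   # 10000
print(uniform_top_set_size(10**10, Fraction(1, 10**11)))  # 10000000000
```

If "ties are broken arbitrarily" is read as uniformly at random (another assumption), the chance that $π^{d}$ lands in the top set for $α = 10^{- 4}$ is $10^{4} / 10^{10} = 10^{- 6}$ , while for $α = 10^{- 11}$ it is in the top set with certainty.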

So equation (2) is doing a lot of heavy lifting in making the approach work! This also seems to be a major difference with the $β$ based approach in your earlier paper. Not sure if you made this change specifically to deal with examples like my counter-example above, or if you had a more general reason -- the paper does not say.

Pretty interesting! So I would now agree with your statement that your approach provably solves the inner alignment problem, at least for $α < 1 / | Π |$ and for certain values of 'the inner alignment problem'. I can also imagine several setups where $α < 1 / | Π |$ would still lead to reasonably fast learning (=reasonably fast drop in frequency of demonstrator requests).

This is what progress looks like. There are certain boundary conditions here that might be unrealistic: $Π$ needs to be a finite set and the real policy needs to be in there, but at least we have some well-defined boundary conditions we can now look at, where we can try to understand and interpret the math further.

Case of $α > 1 / | Π |$ : the challenge of designing a prior

In the case of using $α = 10^{- 4}$ or $α = 10^{- 9}$ , because you want the system to learn faster, you are instead looking at a setup where, if you want to solve the inner alignment problem, you will have to shape the prior values $w (π_{i})$ in such a way that $π^{d}$ gets a higher prior than all the treacherous policies.

My counter-example above shows that if you are using the 1/(number of bits in the program that computes $π_{i}$ ) as the prior, you will not get what you need. There will be plenty of functions $R_{i}$ that have fewer bits in their encoding than the real function $R^{d}$ used by the demonstrator.

Earlier in this comment section, there is a whole sub-thread with speculation on the number of bits needed to encode benign vs. treacherous policies, but for me that discussion does not yet arrive at any place where I would get convinced that the problem of assigning higher priors to benign vs. treacherous policies has been made tractable yet. (Vanessa has identified some additional moving parts however.)

There is of course a tradition in the AI safety community where this is made 'tractable' by the device of polling many AI researchers to ask them whether they think that bits(benign policy)<bits(treacherous policy) for future ML systems, and then graphing the results, but this is not what I would call a tractable solution.

What I would call tractable is a solution like the one, for a much simpler case, in section 10.2.4 of my paper Counterfactual Planning in AGI Systems. I show there that random exploration can be used to make sure that bits(agent environment model which includes unwanted self-knowledge about agent compute core internals) $≫$ bits(agent environment model that lacks this unwanted self-knowledge), no matter what the encoding. Extending this to the bits(benign policy) case would be nice, but I can't immediately see a route here.

My answer to the above hypothetical bits(benign policy)<bits(treacherous policy) poll is that we cannot expect this to be true for any possible encoding of policies (see counter-example above), but it might be true for some encodings. Figuring out where deep neural net encodings fit on the spectrum would be worthwhile.

Also, my answer to bits(benign policy)<bits(treacherous policy) would depend on whether the benign policy is supposed to be about making paperclips in the same way humans do, or about maximizing human values over the lifetime of the universe in ways that humans will not be able to figure out themselves.

For the paperclip making imitation policy, I am somewhat more optimistic about tractability than in the more general case.

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-03-04T16:51:15.287Z · LW(p) · GW(p)

There will be plenty of functions $R_{i}$ that have fewer bits in their encoding than the real function $R^{d}$ used by the demonstrator.

I don't think this is a problem. There will be plenty of them, but when they're wrong they'll get removed from the posterior.

Replies from: Koen.Holtman

↑ comment by Koen.Holtman · 2021-03-05T13:49:18.600Z · LW(p) · GW(p)

I don't think this is a problem. There will be plenty of them, but when they're wrong they'll get removed from the posterior.

I have seen you mention a number of times in this comment thread that 'this is not a problem because eventually the bad/wrong policies will disappear from the top set'. You have not qualified this statement with 'but we need a very low $α$ like $α < 1 / | Π |$ to make this work in a safe way', so I remain somewhat uncertain about what your views are about how low $α$ needs to go.

In any case, I'll now try to convince you that if $α > 1 / | Π |$ , your statement that 'when they're wrong they'll get removed from the posterior' will not always mean what you might want it to mean.

Is the demonstrator policy $π^{d}$ to get themselves killed?

The interesting thing in developing these counterexamples is that they often show that the provable math in the paper gives you less safety than you would have hoped for.

Say that $π^{p} \in Π$ is the policy of producing paperclips in the manner demonstrated by the human demonstrator. Now, take my construction in the counterexample where $α > 1 / | Π |$ and where at time step $t$ , we have the likely case that $π^{p} \notin Π_{h < t}^{α}$ . In the world I constructed for the counterexample, the remaining top policies $Π_{h < t}^{α}$ now perform a synchronized treacherous turn where they kill the demonstrator.

In time step $t + 1$ and later, the policies $Π_{h < t + 1}^{α}$ diverge a lot in what actions they will take, so the agent queries the demonstrator, who is now dead. The query will return the $\text{null}$ action. This eventually removes all 'wrong' policies from $Π_{h < t + 1 + i}^{α}$ , where 'wrong' means that they do not take the $\text{null}$ action at all future time steps.

The silver lining is perhaps that at least the agent will eventually stop, perform $\text{null}$ actions only, after it has killed the demonstrator.

Now, the paper proves that the behavior of the agent policy $π_{α}^{i}$ will approximate that of the true demonstrator policy $π^{d}$ closer and closer when time progresses. We therefore have to conclude that in the counterexample world, the true demonstrator policy $π^{d}$ had nothing to do with producing paperclips, this was a wrong guess all along. The right demonstrator policy $π^{d}$ is one where the demonstrator always intended to get themselves killed.

This would be a somewhat unusual solution to the inner alignment problem.

The math in the paper has you working in a fixed-policy setting where the demonstrator policy $π^{d}$ is immutable/time-invariant. The snag is that this does not imply that the policy $π^{d}$ defines a behavioral trajectory that is independent of the internals of the agent construction. If the agent is constructed in a particular way and when it operates in a certain environment, it will force $π^{d}$ into a self-fulfilling trajectory where it kills the demonstrator.

Side note: if anybody is looking for alternative math that allows one to study and manage the interplay between a mutable time-dependent demonstrator policy and the agent policy, causal models seem to be the way to go. See for example here [? · GW] where this is explored in a reward learning setting.

comment by Koen.Holtman · 2021-03-03T16:36:05.386Z · LW(p) · GW(p)

[edited to delete and replace an earlier question] Question about the paper: under equation (3) on page 4 I am reading:

The 0 on the l.h.s. means the imitator is picking the action itself instead of deferring to the demonstrator,

This confused me initially to no end, and still confuses me. Should this be:

The 0 on the l.h.s. means the imitator is picking the action $a$ itself instead of deferring to the demonstrator or picking one of the other actions???

This would seem to be more consistent with the definitions that follow, and it would seem to make more sense overall.

Replies from: cocoa

↑ comment by michaelcohen (cocoa) · 2021-03-04T16:46:29.531Z · LW(p) · GW(p)

A policy outputs a distribution over $\{0, 1\} × A$ , and equations 3 and 4 define what this distribution is for the imitator. If it outputs (0, a), that means $q_{t} = 0$ and $a_{t} = a$ , and if it outputs (1, a), that means $q_{t} = 1$ and $a_{t} = a$ . When I say

The 0 on the l.h.s. means the imitator is picking the action itself instead of deferring to the demonstrator,

that's just describing the difference between equations 3 and 4. Look at equation 4 to see that when $q_{t} = 1$ , the distribution over the action is equal to that of the demonstrator. So we describe the behavior that follows $q_{t} = 1$ as "deferring to the demonstrator". If we look at the distribution over the action when $q_{t} = 0$ , it's something else, so we say the imitator is "picking its own action".

The 0 on the l.h.s. means the imitator is picking the action $a$ itself instead of deferring to the demonstrator or picking one of the other actions???

The 0 means the imitator is picking the action, and the $a$ means it's not picking another action that's not $a$ .

Replies from: Koen.Holtman

↑ comment by Koen.Holtman · 2021-03-05T10:39:13.896Z · LW(p) · GW(p)

I agree with your description above about how it all works. But I guess I was not explaining well enough why I got confused and why the edits of inserting the $a$ and the bold text above would have stopped me getting confused. So I'll try again.

I read the sentence fragment

The 0 on the l.h.s. means the imitator is picking the action itself instead of deferring to the demonstrator,

below equation (3) as an explanatory claim that the value defined in equation (3) defines the probability that the imitator is picking the action itself instead of deferring to the demonstrator, the probability given the history $| h_{< t}$ . However, this is not the value being defined by equation (3), instead it defines the probability the imitator is picking the action itself instead of deferring to the demonstrator when the history is $| h_{< t}$ and the next action taken is $a$ .

The actual probability of the imitator picking the action itself under $| h_{< t}$ is given by $\sum_{a \in A} π_{α}^{i} (0, a | h_{< t})$ , which is only mentioned in passing in the lines between equations (3) and (4).

So when I was reading the later sections in the paper and I wanted to look back at what the probability was that the imitator would pick the action, my eye landed on equation (3) and the sentence below it. When I read that sentence, it stopped me from looking further to find the equation $\sum_{a \in A} π_{α}^{i} (0, a | h_{< t})$ , which is the equation I was really looking for. Instead my mind auto-completed equation (3) by adding an $\text{avg}_{a}$ term to it, which makes for a much more conservative querying policy than the one you defined, and this then got me into wondering how you were dealing with learning nondeterministic policies, if at all, etc.

So overall I think you can improve readability by doing some edits to draw attention more strongly to the conditional nature of $a$ , and foregrounding the definition of $θ_{q}$ more clearly as a single-line equation.
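A small worked example may help separate the per-action quantity in equation (3) from the overall query probability. It rests on two assumptions that are not confirmed in this thread: that equation (3) sets $π_{α}^{i} (0, a | h_{< t})$ to the minimum probability that any model in the top set assigns to $a$ (in line with the abstract's "underestimates the probabilities of each available action"), and that $θ_{q}$ denotes the query probability. The two models and their numbers are hypothetical. Suppose the top set holds two stochastic demonstrator-models that assign $(0.7, 0.3)$ and $(0.5, 0.5)$ to the actions $(a_{1}, a_{2})$ . Then

$$π_{α}^{i} (0, a_{1} | h_{< t}) = \min (0.7, 0.5) = 0.5, \qquad π_{α}^{i} (0, a_{2} | h_{< t}) = \min (0.3, 0.5) = 0.3$$

$$\sum_{a \in A} π_{α}^{i} (0, a | h_{< t}) = 0.8, \qquad θ_{q} = 1 - \sum_{a \in A} π_{α}^{i} (0, a | h_{< t}) = 0.2$$

So under these assumptions the imitator acts by itself with probability $0.8$ and queries with probability $0.2$ , even though neither model is deterministic. Each term on the first line is conditional on a particular $a$ ; only the sum on the second line is the probability that the imitator picks the action itself.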
