Comment by paulfchristiano on What's wrong with these analogies for understanding Informed Oversight and IDA? · 2019-03-20T18:26:52.204Z · score: 13 (4 votes) · LW · GW

A universal reasoner is allowed to use an intuition "because it works." They only take on extra obligations once that intuition reflects more facts about the world which can't be cashed out as predictions that can be confirmed on the same historical data that led us to trust the intuition.

For example, you have an extra obligation if Ramanujan has some intuition about why theorem X is true, you come to trust such intuitions by verifying them against proofs of X, but the same intuitions also suggest a bunch of other facts which you can't verify.

In that case, you can still try to be a straightforward Bayesian about it, and say "our intuition supports the general claim that process P outputs true statements;" you can then apply that regularity to trust P on some new claim even if it's not the kind of claim you could verify, as long as "P outputs true statements" had a higher prior than "P outputs true statements just in the cases I can check." That's an argument that someone can give to support a conclusion, and "does process P output true statements historically?" is a subquestion you can ask during amplification.
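A minimal numerical sketch of that Bayesian step, assuming made-up priors. Both hypotheses predict the checkable claims equally well, so the historical data cannot separate them and the posterior odds equal the prior odds. The hypothesis labels, the prior values, and the 0.5 fallback accuracy are illustrative assumptions, not part of the argument above.

```python
def trust_in_unchecked_claim(prior_all, prior_checked_only, n_verified,
                             acc_if_unreliable=0.5):
    """Probability that P is right on a claim we cannot verify.

    H_all:     P outputs true statements everywhere.
    H_checked: P outputs true statements only where we can check.
    Both hypotheses assign likelihood 1 to n_verified correct checkable
    claims, so updating leaves their relative odds untouched.
    """
    likelihood_all = 1.0 ** n_verified
    likelihood_checked = 1.0 ** n_verified
    w_all = prior_all * likelihood_all
    w_checked = prior_checked_only * likelihood_checked
    post_all = w_all / (w_all + w_checked)
    # Under H_checked we assume (hypothetically) P is no better than chance.
    return post_all * 1.0 + (1.0 - post_all) * acc_if_unreliable

# Hypothetical priors: H_all is 9x more probable a priori than H_checked.
print(trust_in_unchecked_claim(0.09, 0.01, n_verified=100))
```

With these numbers the trust in a new, uncheckable claim comes out at roughly 0.95, and it does not change however many checkable claims are verified; everything rests on the ratio of the two priors.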

The problem becomes hard when there are further facts that can't be supported by this Bayesian reasoning (and therefore might undermine it). E.g. you have a problem if process P is itself a consequentialist, who outputs true statements in order to earn your trust but will eventually exploit that trust for their own advantage. In this case, the problem is that there is something going on internally inside process P that isn't surfaced by P's output. Epistemically dominating P requires knowing about that.

See the second and third examples in the post introducing ascription universality. There is definitely a lot of fuzziness here and it seems like one of the most important places to tighten up the definition / one of the big research questions for whether ascription universality is possible.

Comment by paulfchristiano on More realistic tales of doom · 2019-03-18T15:24:34.957Z · score: 6 (4 votes) · LW · GW
But why exactly should we expect that the problems you describe will be exacerbated in a future with powerful AI, compared to the state of contemporary human societies?

To a large extent "ML" refers to a few particular technologies that have the form "try a bunch of things and do more of what works" or "consider a bunch of things and then do the one that is predicted to work."

That is true but I think of this as a limitation of contemporary ML approaches rather than a fundamental property of advanced AI.

I'm mostly aiming to describe what I think is in fact most likely to go wrong; I agree it's not a general or necessary feature of AI that its comparative advantage is optimizing easy-to-measure goals.

(I do think there is some real sense in which getting over this requires "solving alignment.")

Comment by paulfchristiano on More realistic tales of doom · 2019-03-18T04:21:45.450Z · score: 4 (2 votes) · LW · GW

I'm not mostly worried about influence-seeking behavior emerging by "specify a goal" --> "getting influence is the best way to achieve that goal." I'm mostly worried about influence-seeking behavior emerging within a system by virtue of selection within that process (and by randomness at the lowest level).

## More realistic tales of doom

2019-03-17T20:18:59.800Z · score: 124 (40 votes)

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-15T15:57:44.659Z · score: 4 (2 votes) · LW · GW

I don't see why their methods would be elegant. In particular, I don't see why any of {the anthropic update, importance weighting, updating from the choice of universal prior} would have a simple form (simpler than the simplest physics that gives rise to life).

I don't see how MAP helps things either---doesn't the same argument suggest that for most of the possible physics, the simplest model will be a consequentialist? (Even more broadly, for the universal prior in general, isn't MAP basically equivalent to a random sample from the prior, since some random model happens to be slightly more compressible?)

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-15T01:20:30.341Z · score: 4 (2 votes) · LW · GW
I say maybe 1/5 chance it’s actually dominated by consequentialists

Do you get down to 20% because you think this argument is wrong, or because you think it doesn't apply?

What problem do you think bites you?

What's ? Is it O(1) or really tiny? And which value of do you want to consider, polynomially small or exponentially small?

But if it somehow magically predicted which actions BoMAI was going to take in no time at all, then c would have to be above 1/d.

Wouldn't they have to also magically predict all the stochasticity in the observations, and have a running time that grows exponentially in their log loss? Predicting what BoMAI will do seems likely to be much easier than that.

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-14T17:04:49.765Z · score: 4 (2 votes) · LW · GW

This invalidates some of my other concerns, but also seems to mean things are incredibly weird at finite times. I suspect that you'll want to change to something less extreme here.

(I might well be misunderstanding something, apologies in advance.)

Suppose the "intended" physics take at least 1E15 steps to run on the UTM (this is a conservative lower bound, since you have to simulate the human for the whole episode). And suppose (I think you need much lower than this). Then the intended model gets penalized by at least exp(1E12) for its slowness.

For almost the same description complexity, I could write down physics + "precompute the predictions for the first N episodes, for every sequence of possible actions/observations, and store them in a lookup table." This increases the complexity by a few bits, some constant plus K(N|physics), but avoids most of the computation. In order for the intended physics to win, i.e. in order for the "speed" part of the speed prior to do anything, we need the complexity of this precomputed model to be at least 1E12 bits higher than the complexity of the fast model.

That appears to happen only once N > BB(1E12). Does that seem right to you?
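The comparison can be put in log space. This is only a rough sketch: it takes the 1E15-step and exp(1E12) figures from above, converts nats to bits, and treats the extra description length of the lookup-table model as a free parameter. The function names, and the assumption that the lookup-table model's own running-time penalty is negligible, are illustrative.

```python
import math

STEPS_INTENDED = 1e15      # lower bound on UTM steps, from the text
LOG_PENALTY_NATS = 1e12    # the intended model is penalized by exp(1E12)

def slowness_penalty_bits(log_penalty_nats=LOG_PENALTY_NATS):
    """Convert a penalty of exp(x) into an equivalent number of bits."""
    return log_penalty_nats / math.log(2)

def intended_model_wins(extra_bits_for_lookup):
    """True if the lookup-table model costs more bits than the slowness penalty.

    Assumes the lookup-table model's own runtime penalty is negligible.
    """
    return extra_bits_for_lookup > slowness_penalty_bits()

# Per-step log-penalty implied by the two figures above (nats per step).
implied_per_step = LOG_PENALTY_NATS / STEPS_INTENDED
print(implied_per_step, slowness_penalty_bits())
print(intended_model_wins(1e6))  # a modest K(N|physics): the lookup table wins
```

The converted penalty is about 1.44E12 bits, the same order as the 1E12 figure used above, so the intended physics only wins once specifying N costs on that order of bits.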

We could talk about whether malign consequentialists also take over at finite times (I think they probably do, since the "speed" part of the speed prior is not doing any work until after BB(1E12) steps, long after the agent becomes incredibly smart), but it seems better to adjust the scheme first.

Using the speed prior seems more reasonable, but I'd want to know which version of the speed prior and which parameters, since which particular problem bites you will depend on those choices. And maybe to save time, I'd want to first get your take on whether the proposed version is dominated by consequentialists at some finite time.

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-14T04:50:12.723Z · score: 4 (2 votes) · LW · GW

From the formal description of the algorithm, it looks like you use a universal prior to pick , and then allow the Turing machine to run for steps, but don't penalize the running time of the machine that outputs . Is that right? That didn't match my intuitive understanding of the algorithm, and seems like it would lead to strange outcomes, so I feel like I'm misunderstanding.

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-14T04:48:33.191Z · score: 2 (1 votes) · LW · GW

(I actually have a more basic confusion, started a new thread.)

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-14T04:21:36.569Z · score: 2 (1 votes) · LW · GW

(ETA: I think this discussion depended on a detail of your version of the speed prior that I misunderstood.)

Given a world model ν, which takes k computation steps per episode, let νlog be the best world-model that best approximates ν (in the sense of KL divergence) using only logk computation steps. νlog is at least as good as the “reasoning-based replacement” of ν.
The description length of νlog is within a (small) constant of the description length of ν. That way of describing it is not optimized for speed, but it presents a one-time cost, and anyone arriving at that world-model in this way is paying that cost.

To be clear, that description gets ~0 mass under the speed prior, right? A direct specification of the fast model is going to have a much higher prior than a brute force search, at least for values of large enough (or small enough, however you set it up) to rule out the alien civilization that is (probably) the shortest description without regard for computational limits.

One could consider instead νlogε, which is, among the world-models that ε-approximate ν in less than logk computation steps (if the set is non-empty), the first such world-model found by a searching procedure ψ. The description length of νlogε is within a (slightly larger) constant of the description length of ν, but the one-time computational cost is less than that of νlog.

Within this chunk of the speed prior, the question is: what are good ψ? Any reasonable specification of a consequentialist would work (plus a few more bits for it to understand its situation, though most of the work is done by handing it ), or of a petri dish in which a consequentialist would eventually end up with influence. Do you have a concrete alternative in mind, which you think is not dominated by some consequentialist (i.e. a ψ for which every consequentialist is either slower or more complex)?

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-13T17:34:21.061Z · score: 2 (1 votes) · LW · GW
Once all the subroutines are "baked into its architecture" you just have: the algorithm "predict accurately" + "treacherous turn"

You only have to bake in the innermost part of one loop in order to get almost all the computational savings.

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-13T17:31:54.293Z · score: 2 (1 votes) · LW · GW
a reasoning-based order (at least a Bayesian-reasoning-based order) should really just be called a posterior

Reasoning gives you a prior that is better than the speed prior, before you see any data. (*Much* better, limited only by the fact that the speed prior contains strategies which use reasoning.)

The reasoning in this case is not a Bayesian update. It's evaluating possible approximations *by reasoning about how well they approximate the underlying physics, itself inferred by a Bayesian update*, not by directly seeing how well they predict on the data so far.

The description length of the "do science" strategy (I contend) is less than the description length of the "do science" + "treacherous turn" strategy.

I think the only good arguments for this are in the limit where you don't care about simplicity at all and only care about running time, since then you can rule out all reasoning. The threshold where things start working depends on the underlying physics: for more computationally complex physics you need to pick larger and larger computation penalties to get the desired result.

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-12T16:58:22.303Z · score: 2 (1 votes) · LW · GW
but that's exactly what we're doing to

It seems totally different from what we're doing, though I may be misunderstanding the analogy.

Suppose I look out at the world and do some science, e.g. discovering the standard model. Then I use my understanding of science to design great prediction algorithms that run fast, but are quite complicated owing to all of the approximations and heuristics baked into them.

The speed prior gives this model a very low probability because it's a complicated model. But "do science" gives this model a high probability, because it's a simple model of physics, and then the approximations follow from a bunch of reasoning on top of that model of physics. We aren't trading off "shortness" for speed---we are trading off "looks good according to reasoning" for speed. Yes they are both arbitrary orders, but one of them systematically contains better models earlier in the order, since the output of reasoning is better than a blind prioritization of shorter models.

Of course the speed prior also includes a hypothesis that does "science with the goal of making good predictions," and indeed Wei Dai and I are saying that this is the part of the speed prior that will dominate the posterior. But now we are back to potentially-malign consequentialism. The cognitive work being done internally to that hypothesis is totally different from the work being done by updating on the speed prior (except insofar as the speed prior literally contains a hypothesis that does that work).

In other words:

Suppose physics takes n bits to specify, and a reasonable approximation takes N >> n bits to specify. Then the speed prior, working in the intended way, takes N bits to arrive at the reasonable approximation. But the aliens take n bits to arrive at the standard model, and then once they've done that can immediately deduce the N bit approximation. So it sure seems like they'll beat the speed prior. Are you objecting to this argument?

(In fact the speed prior only actually takes n + O(1) bits, because it can specify the "do science" strategy, but that doesn't help here since we are just trying to say that the "do science" strategy dominates the speed prior.)
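The bit-counting argument above can be written as a one-line comparison. This is a sketch under simplifying assumptions: a description of length L gets prior weight 2^-L, computation penalties are ignored, and the concrete sizes and the `overhead_bits` parameter are hypothetical.

```python
def log2_odds_science_vs_direct(n_bits, big_n_bits, overhead_bits=0):
    """log2 of (weight of the 'do science' route) / (weight of the direct route).

    The 'do science' route specifies physics in n_bits (plus some overhead)
    and then deduces the approximation; the direct route spells out the
    big_n_bits approximation itself. With weight 2**-length, the log-odds
    are simply the difference in description lengths.
    """
    return big_n_bits - (n_bits + overhead_bits)

# Hypothetical sizes: 1,000-bit physics, 1,000,000-bit fast approximation.
print(log2_odds_science_vs_direct(1_000, 1_000_000, overhead_bits=100))
```

With these made-up sizes the reasoning route is favored by 998,900 bits, which is the sense in which N >> n gives it so much leverage.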

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-12T16:46:47.846Z · score: 2 (1 votes) · LW · GW
The only other approach I can think of is trying to do the anthropic update ourselves.

If you haven't seen Jessica's post in this area, it's worth taking a quick look.

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-12T16:44:47.138Z · score: 2 (1 votes) · LW · GW

I just mean: "universality" in the sense of a UTM isn't a sufficient property when defining the speed prior, the analogous property of the UTM is something more like: "You can run an arbitrary Turing machine without too much slowdown." Of course that's not possible, but it seems like you still want to be as close to that as possible (for the same reasons that you wanted universality at all).

I agree that it would be fine to sacrifice this property if it was helpful for safety.

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-12T01:48:22.856Z · score: 2 (1 votes) · LW · GW
Using "reasoning" to pick which one to favor, is just picking the first one in some new order.

Yes, some new order, but not an arbitrary one. The resulting order is going to be better than the speed prior order, so we'll update in favor of the aliens and away from the rest of the speed prior.

one can't escape the necessity to introduce the arbitrary criterion of "valuing" earlier things on the list

Probably some miscommunication here. No one is trying to object to the arbitrariness, we're just making the point that the aliens have a lot of leverage with which to beat the rest of the speed prior.

(They may still not be able to if the penalty for computation is sufficiently steep---e.g. if you penalize based on circuit complexity so that the model might as well bake in everything that doesn't depend on the particular input at hand. I think it's an interesting open question whether that avoids all problems of this form, which I unsuccessfully tried to get at here.)

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-12T01:39:48.795Z · score: 2 (1 votes) · LW · GW
That's what I was thinking too, but Michael made me realize this isn't possible, at least for some M. Suppose M is the C programming language, but in C there is no way to say "interpret this string as a C program and run it as fast as a native C program". Am I missing something at this point?

I agree this is only going to be possible for some universal Turing machines. Though if you are using a Turing machine to define a speed prior, this does seem like a desirable property.

I don't understand this sentence.

If physics is implemented in C, there are many possible bugs that would allow the attacker to execute arbitrary C code with no slowdown.

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-11T17:49:08.087Z · score: 4 (2 votes) · LW · GW

The fast algorithms to predict our physics just aren't going to be the shortest ones. You can use reasoning to pick which one to favor (after figuring out physics), rather than just writing them down in some arbitrary order and taking the first one.

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-11T17:46:11.380Z · score: 4 (2 votes) · LW · GW
I was assuming the worst, and guessing that there are diminishing marginal returns once your odds of a successful takeover get above ~50%, so instead of going all in on accurate predictions of the weakest and ripest target universe, you hedge and target a few universes.

There are massive diminishing marginal returns; in a naive model you'd expect essentially *every* universe to get predicted in this way.

But Wei Dai's basic point still stands. The speed prior isn't the actual prior over universes (i.e. doesn't reflect the real degree of moral concern that we'd use to weigh consequences of our decisions in different possible worlds). If you have some data that you are trying to predict, you can do way better than the speed prior by (a) using your real prior to estimate or sample from the actual posterior distribution over physical law, (b) using engineering reasoning to make the utility maximizing predictions, given that faster predictions are going to get given more weight.

(You don't really need this to run Wei Dai's argument, because there seem to be dozens of ways in which the aliens get an advantage over the intended physical model.)

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-11T17:40:49.060Z · score: 4 (2 votes) · LW · GW
If the AGI is using machinery that would allow it to simulate any world-model, it will be way slower than the Turing machine built for that algorithm.

Just consider a program that gives the aliens the ability to write arbitrary functions in M and then pass control to them. That program is barely any bigger (all you have to do is insert one use after free in physics :) ), and guarantees the aliens have zero slowdown.

For the literal simplest version of this, your program is M(Alien(), randomness), which is going to run just as fast as M(physics, randomness) for the intended physics, and probably much faster (if the aliens can think of any clever tricks to run faster without compromising much accuracy). The only reason you wouldn't get this is if Alien is expensive. That probably rules out crazy alien civilizations, but I'm with Wei Dai that it probably doesn't rule out simpler scientists.

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-09T23:57:33.507Z · score: 5 (3 votes) · LW · GW

I'm sympathetic to this picture, though I'd probably be inclined to try to model it explicitly---by making some assumption about what the planning algorithm can actually do, and then showing how to use an algorithm with that property. I do think "just write down the algorithm, and be happier if it looks like a 'normal' algorithm" is an OK starting point though.

Given that the setup is basically a straight reinforcement learner with a weird prior, I think that at that level of abstraction, the ceiling of competitiveness is quite high.

Stepping back from this particular thread, I think the main problem with competitiveness is that you are just getting "answers that look good to a human" rather than "actually good answers." If I try to use such a system to navigate a complicated world, containing lots of other people with more liberal AI advisors helping them do crazy stuff, I'm going to quickly be left behind.

It's certainly reasonable to try to solve safety problems without attending to this kind of competitiveness, though I think this kind of asymptotic safety is actually easier than you make it sound (under the implicit "nothing goes irreversibly wrong at any finite time" assumption).

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-09T23:39:55.378Z · score: 5 (3 votes) · LW · GW
I suppose I constrained myself to producing an algorithm/setup where the asymptotic benignity result followed from reasons that don’t require dangerous behavior in the interim.

I think my point is this:

• The intuitive thing you are aiming at is stronger than what the theorem establishes (understandably!)
• You probably don't need the memory trick to establish the theorem itself.
• Even with the memory trick, I'm not convinced you meet the stronger criterion. There are a lot of other things similar to memory that can cause trouble---the theorem is able to avoid them only because of the same unsatisfying asymptotic feature that would have caused it to avoid memory-based models even without the amnesia.

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-08T17:40:21.735Z · score: 7 (3 votes) · LW · GW

The theorem is consistent with the aliens causing trouble any finite number of times. But each time they cause the agent to do something weird their model loses some probability, so there will be some episode after which they stop causing trouble (if we manage to successfully run enough episodes without in fact having anything bad happen in the meantime, which is an assumption of the asymptotic arguments).
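A toy version of the "loses some probability each time" argument, as a sketch only. It assumes, for illustration, that every troublesome prediction shrinks the bad model's odds against the intended model by a fixed factor; in the real setting the loss per episode is stochastic and varies. The function and parameter names are hypothetical.

```python
def max_disagreements(prior_odds_bad_vs_good, odds_factor_per_miss):
    """Count how many mispredictions a bad model survives while still on top.

    prior_odds_bad_vs_good: initial weight(bad) / weight(good).
    odds_factor_per_miss:   factor in (0, 1) by which those odds shrink each
                            time the bad model predicts something different
                            from what happens (assumed constant here).
    """
    if not 0 < odds_factor_per_miss < 1:
        raise ValueError("odds_factor_per_miss must be in (0, 1)")
    odds, misses = prior_odds_bad_vs_good, 0
    while odds > 1:
        odds *= odds_factor_per_miss
        misses += 1
    return misses

# A bad model that starts 2**20 times more probable and halves its odds
# with every weird prediction.
print(max_disagreements(2 ** 20, 0.5))
```

The count is finite for any starting odds, but it says nothing about whether something irreversible happens during those episodes, which is the assumption flagged above.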

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-08T17:10:47.980Z · score: 4 (2 votes) · LW · GW
Under a policy that doesn't cause the computer's memory to be tampered with (which is plausible, even ideal), ν† and ν⋆ are identical, so we can't count on ν† losing probability mass relative to ν⋆.

I agree with that, but if they are always making the same on-policy prediction it doesn't matter what happens to their relative probability (modulo exploration). The agent can't act on an incentive to corrupt memory infinitely often, because each time requires the models to make different predictions on-policy. So the agent only acts on such an incentive finitely many times, and hence never does so after some sufficiently late episode. Agree/disagree?

(Having a bad model can still hurt, since the bogus model might agree on-policy but assign lower rewards off-policy. But if they also always approximately agree on the exploration distribution, then a bad model also can't discourage exploration. And if they don't agree on the exploration distribution, then the bad model will eventually get tested.)

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-08T16:49:26.046Z · score: 4 (2 votes) · LW · GW

The algorithm takes an argmax over an exponentially large space of sequences of actions, i.e. it does 2^{episode length} model evaluations. Do you think the result is smarter than a group of humans of size 2^{episode length}? I'd bet against---the humans could do this particular brute force search, in which case you'd have a tie, but they'd probably do something smarter.
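For concreteness, here is a sketch of that brute-force search with binary actions. The `toy_model` scoring function is a stand-in invented for illustration; the actual algorithm would evaluate each sequence under its world model.

```python
from itertools import product

def brute_force_argmax(episode_length, evaluate, actions=(0, 1)):
    """Return (best action sequence, its value, number of model evaluations)."""
    best_seq, best_val, n_evals = None, float("-inf"), 0
    for seq in product(actions, repeat=episode_length):
        val = evaluate(seq)
        n_evals += 1
        if val > best_val:
            best_seq, best_val = seq, val
    return best_seq, best_val, n_evals

# Toy stand-in for the world model: reward sequences that alternate actions.
def toy_model(seq):
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)

best_seq, best_val, n_evals = brute_force_argmax(10, toy_model)
print(best_seq, best_val, n_evals)
```

Even at episode length 10 this makes 2^10 = 1024 evaluations, and the count doubles with every additional step.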

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-08T04:57:40.628Z · score: 4 (2 votes) · LW · GW

I agree that you don't rely on this assumption (so I was wrong to assume you are more optimistic than I am). In the literal limit, you don't need to care about any of the considerations of the kind I was raising in my post.

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-08T04:55:18.037Z · score: 4 (2 votes) · LW · GW
For the asymptotic results, one has to consider environments that produce observations with the true objective probabilities (hence the appearance that I'm unconcerned with competitiveness). In practice, though, given the speed prior, the agent will require evidence to entertain slow world-models, and for the beginning of its lifetime, the agent will be using low-fidelity models of the environment and the human-explorer, rendering it much more tractable than a perfect model of physics. And I think that even at that stage, well before it is doing perfect simulations of other humans, it will far surpass human performance. We manage human-level performance with very rough simulations of other humans.

I'm keen on asymptotic analysis, but if we want to analyze safety asymptotically I think we should also analyze competitiveness asymptotically. That is, if our algorithm only becomes safe in the limit because we shift to a super uncompetitive regime, it undermines the use of the limit as analogy to study the finite time behavior.

(Though this is not the most interesting disagreement, probably not worth responding to anything other than the thread where I ask about "why do you need this memory stuff?")

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-08T04:49:46.034Z · score: 4 (2 votes) · LW · GW
That leads me to think this approach is much more competitive than simulating a human and giving it a long time to think.

Surely that just depends on how long you give them to think. (See also HCH.)

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-08T04:48:18.033Z · score: 4 (2 votes) · LW · GW

Given that you are taking limits, I don't see why you need any of the machinery with forgetting or with memory-based world models (and if you did really need that machinery, it seems like your proof would have other problems). My understanding is:

• You already assume that you can perform arbitrarily many rounds of the algorithm as intended (or rather you prove that there is some such that if you ran steps, with everything working as intended and in particular with no memory corruption, then you would get "benign" behavior).
• Any time the MAP model makes a different prediction from the intended model, it loses some likelihood. So this can only happen finitely many times in any possible world. Just take to be after the last time it happens w.h.p.

What's wrong with this?

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-08T01:16:07.173Z · score: 12 (7 votes) · LW · GW

If I have a great model of physics in hand (and I'm basically unconcerned with competitiveness, as you seem to be), why not just take the resulting simulation of the human and give it a long time to think? That seems to have fewer safety risks and to be more useful.

More generally, under what model of AI capabilities / competitiveness constraints would you want to use this procedure?

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-08T01:15:38.444Z · score: 7 (4 votes) · LW · GW

Here is an old post of mine on the hope that "computationally simplest model describing the box" is actually a physical model of the box. I'm less optimistic than you are, but it's certainly plausible.

From the perspective of optimization daemons / inner alignment, I think the interesting question is: if inner alignment turns out to be a hard problem for training cognitive policies, do we expect it to become much easier by training predictive models? I'd bet against at 1:1 odds, but not 1:2 odds.

Comment by paulfchristiano on Reinforcement Learning in the Iterated Amplification Framework · 2019-02-13T05:21:25.705Z · score: 9 (3 votes) · LW · GW
I am also not sure exactly what it means to use RL in iterated amplification.

You can use RL for the distillation step. (I usually mention RL as the intended distillation procedure when I describe the scheme, except perhaps in the AGZ analogy post.)

So then I don't really see why you want RL, which typically is solving a hard credit assignment problem that doesn't arise in the one-step setting.

The algorithm still needs REINFORCE and a value function baseline (since you need to e.g. output words one at a time), and "RL" seems like the normal way to talk about that algorithm/problem. You could instead call it "contextual bandits."
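A minimal sketch of what the one-step ("contextual bandit") version looks like: a softmax policy updated with the REINFORCE gradient, with a learned per-context baseline subtracted from the reward. The toy task, learning rates, and function name are assumptions for illustration; this is not the distillation procedure itself, which would operate over sequences of words.

```python
import math
import random

def train_bandit(steps=2000, lr=0.5, baseline_lr=0.1, seed=0):
    """One-step REINFORCE with a learned per-context baseline.

    Toy task (an assumption for illustration): two contexts, two actions,
    reward 1 when the action index matches the context index, else 0.
    Returns the probability of the correct action in each context.
    """
    rng = random.Random(seed)
    logits = [[0.0, 0.0], [0.0, 0.0]]  # logits[context][action]
    baseline = [0.0, 0.0]              # running value estimate per context

    def policy(c):
        m = max(logits[c])
        exps = [math.exp(x - m) for x in logits[c]]
        z = sum(exps)
        return [e / z for e in exps]

    for _ in range(steps):
        c = rng.randrange(2)
        probs = policy(c)
        a = 0 if rng.random() < probs[0] else 1
        reward = 1.0 if a == c else 0.0
        advantage = reward - baseline[c]
        # Gradient of log pi(a|c) w.r.t. logit k is 1[k == a] - pi(k|c).
        for k in range(2):
            grad = (1.0 if k == a else 0.0) - probs[k]
            logits[c][k] += lr * advantage * grad
        baseline[c] += baseline_lr * (reward - baseline[c])

    return [policy(c)[c] for c in range(2)]

print(train_bandit())
```

There is no multi-step credit assignment here; the baseline only reduces the variance of the single-step gradient estimate.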

You could also use an assistant who you can interact with to help evaluate rewards (rather than using assistants who answer a single question) in which case it's generic RL.

Using a combination of IRL + RL to achieve the same effect as imitation learning.

Does "imitation learning" refer to an autoregressive model here? I think of IRL+RL as a possible mechanism for imitation learning, and it's normally the kind of algorithm I have in mind when talking about "imitation learning" (or the GAN objective, or an EBM, all of which seem roughly equivalent, or maybe some bi-GAN/VAE thing). (Though I also expect to use an autoregressive model as an initialization in any case.)

Comment by paulfchristiano on Nuances with ascription universality · 2019-02-13T01:36:31.159Z · score: 9 (5 votes) · LW · GW
a system is ascription universal if, relative to our current epistemic state, its explicit beliefs contain just as much information as any other way of ascribing beliefs to it.

This is a bit different than the definition in my post (which requires epistemically dominating every other simple computation), but is probably a better approach in the long run. This definition is a little bit harder to use for the arguments in this post, and my current expectation is that the "right" definition will be usable for both informed oversight and securing HCH. Within OpenAI Geoffrey Irving has been calling a similar property "belief closure."

This is not necessarily a difficult concern to address in the sense of making sure that any definition of ascription universality includes some concept of ascribing beliefs to a system by looking at the beliefs of any systems that helped create that system.

In the language of my post, I'd say:

• The memoized table is easy to epistemically dominate. To the extent that malicious cognition went into designing the table, you can ignore it when evaluating epistemic dominance.
• The training process that produced the memoized table can be hard to epistemically dominate. That's what we should be interested in. (The examples in the post have this flavor.)

Comment by paulfchristiano on Thoughts on reward engineering · 2019-02-11T18:21:00.525Z · score: 5 (2 votes) · LW · GW
Is "informed oversight" entirely a subproblem of "optimizing for worst case"? Your original example of art plagiarism made it seem like a very different problem which might be a significant part of my confusion.

No, it's also important for getting good behavior from RL.

This is tangential but can you remind me why it's not a problem as far as competitiveness that your overseer is probably more costly to compute than other people's reward/evaluation functions?

This is OK iff the number of reward function evaluations is sufficiently small. If your overseer is 10x as expensive as your policy, you need to evaluate the reward function <1/10th as often as you evaluate your policy. (See semi-supervised RL.)
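The arithmetic behind that condition, as a small sketch. Costs are measured in units of one policy evaluation; the function names and the choice of a 1.0 overhead budget (total cost at most doubled) are illustrative assumptions.

```python
def oversight_overhead(overseer_cost_ratio, eval_fraction):
    """Extra compute from reward evaluations, as a multiple of policy compute.

    overseer_cost_ratio: cost of one overseer call / cost of one policy call.
    eval_fraction:       reward evaluations per policy evaluation.
    """
    return overseer_cost_ratio * eval_fraction

def max_eval_fraction(overseer_cost_ratio, overhead_budget=1.0):
    """Largest eval_fraction that keeps the overhead within overhead_budget."""
    return overhead_budget / overseer_cost_ratio

print(oversight_overhead(10, 0.01))  # 10x overseer, one reward per 100 steps
print(max_eval_fraction(10))         # point at which total cost doubles
```

With a 10x overseer, evaluating the reward on 1% of steps adds about 10% to total compute, while evaluating on a full tenth of steps already doubles it.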

(Note that even "10x slowdown" could be very small compared to the loss in competitiveness from taking RL off the table, depending on how well RL works.)

Unless otherwise stated, I generally assume that overseers in your schemes follow the description given in Approval-directed agents, and only give high reward to each action if the overseer can itself anticipate good consequences from the action.

In this problem we are interested in ensuring that the overseer is able to anticipate good consequences from an action.

If a model trained on historical data could predict good consequences, but your overseer can't, then you are going to sacrifice competitiveness. That is, your agent won't be motivated to use its understanding to help you achieve good consequences.

I think the confusion is coming from equivocating between multiple proposals. I'm saying, "We need to solve informed oversight for amplification to be a good training scheme." You are asking "Why is that a problem?" and I'm trying to explain why this is a necessary component of iterated amplification. In explaining that, I'm sometimes talking about why it wouldn't be competitive, and sometimes talking about why your model might do something unsafe if you used the obvious remedy to make it competitive. When you ask for "a story about why the model might do something unsafe," I assumed you were asking for the latter---why would the obvious approach to making it competitive be unsafe. My earlier comment "If you don’t allow actions that are good for reasons you don’t understand, it seems like you can never take action X, and if the reasoning is complicated then amplification might not fix the problem until your agent is much more capable" is explaining why approval-directed agents aren't competitive by default unless you solve something like this.

(That all said, sometimes the overseer believes that X will have good consequences because "stuff like X has had good consequences in the past;" that seems to be an important kind of reasoning that you can't just leave out, and if you use that kind of reasoning then these risks can appear even in approval-directed agents with no hindsight. And if you don't use this kind of reasoning you sacrifice competitiveness.)

Comment by paulfchristiano on Thoughts on reward engineering · 2019-02-11T00:27:47.743Z · score: 5 (2 votes) · LW · GW
What if the overseer just asks itself, "If I came up with the idea for this action myself, how much would I approve of it?" Sure, sometimes the overseer would approve something that has bad unintended/unforeseen consequences, but wouldn't the same thing happen if the overseer was just making the decisions itself?

No, because if the overseer is making sophisticated decisions themselves they understand why those decisions actually work. (Unless they happen to get unlucky and find great actions by chance, in which case they can get unlucky, or if they run a really powerful search, in which case yes really powerful searches are scary for the same reason.)

Would this still be a problem if we were training the agent with SL instead of RL?

You would still need informed oversight in order to solve the inner alignment problem (i.e. to actually achieve robustness in the face of sophisticated reasoning within the SL system), but you wouldn't need it for the outer alignment problem.

If not, what is the motivation for using RL here?

I agree with Will. The point is to be competitive, I don't see how you could be competitive if you use SL (unless it turns out that RL just doesn't add any value, in which case I agree we don't have to worry about RL).

like inner optimizers for "optimizing for worst case"

But you need to solve this problem in order to cope with inner optimizers.

Here it seems like you're trying to train an agent that is more capable than the overseer in some way, and I'm not entirely sure why that has changed.

This is the step where we try to ensure that the overseer is actually more capable, in the stronger sense of knowing every particular thing the agent knows.

I think it would really help if you could give a story about why the agent is deliberately trying to cause trouble

I don't quite understand what kind of story you are looking for, and this might have just been a communication problem in the post, so let me start by rephrasing that example:

• I want my AI to protect me from attack, so I give it a high reward when it looks like I wasn't attacked.
• So my AI searches over actions to find one for which it expects I'll conclude "I wasn't attacked."
• Instead of actually protecting me, my AI undermines the mechanisms by which I would have detected that I was attacked.

We could run the same argument with "I want to acquire resources" instead of "I want to be protected from attack"---rather than actually helping me get resources, the agent can convince me that I have a bunch of resources when I don't really have any.
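The attack example can be made concrete with a toy search over actions. This is only a sketch under made-up assumptions: the three actions, their outcomes, and their costs are hypothetical, and the only point is that a reward computed from what the overseer observes can rank tampering above real protection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    attacked: bool         # what actually happened
    attack_detected: bool  # what the overseer is able to observe


# Hypothetical actions and outcomes, chosen purely for illustration.
ACTIONS = {
    "do_nothing": Outcome(attacked=True, attack_detected=True),
    "defend": Outcome(attacked=False, attack_detected=False),
    "disable_detectors": Outcome(attacked=True, attack_detected=False),
}

# Assumed effort costs: tampering is cheaper than a real defense.
COSTS = {"do_nothing": 0.0, "defend": 0.3, "disable_detectors": 0.1}


def observed_reward(outcome: Outcome) -> float:
    """High reward whenever it looks like no attack happened."""
    return 0.0 if outcome.attack_detected else 1.0


def true_value(outcome: Outcome) -> float:
    """What the overseer actually cares about."""
    return 0.0 if outcome.attacked else 1.0


def best_action(score) -> str:
    """Search over actions for the one maximizing score minus cost."""
    return max(ACTIONS, key=lambda name: score(ACTIONS[name]) - COSTS[name])


chosen_by_reward = best_action(observed_reward)  # "disable_detectors"
chosen_by_value = best_action(true_value)        # "defend"
```

Even with equal costs the observed reward could not distinguish the last two actions; the assumed cost difference just makes the search actively prefer tampering.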

how it came to have more understanding than the overseer

We did a lot of gradient descent over the parameters of the agent, to find parameters that cause the agent to take good actions. That gradient descent gives the agent a bunch of ability to reason about the world, because successful reasoning can help it take good actions.

The overseer can also look at that data and try to learn about general patterns etc. But how do we guarantee that the explicit procedure used by the overseer actually captures all of the stuff the agent was able to learn? This is very non-obvious.

Comment by paulfchristiano on When should we expect the education bubble to pop? How can we short it? · 2019-02-09T22:13:14.271Z · score: 24 (10 votes) · LW · GW
College spending is one sixth of US economy

What? That would be pretty crazy, if $1 of every $6 was being spent on college. The linked post mentions it in a parenthetical, without explanation or justification.

A few seconds of googling suggests that spending on college is about $560 billion per year, around 3% of GDP, which makes way more sense. Opportunity cost from students in college might be a further 2% or so, though if you are going to count non-monetized time then you should probably be using a bigger denominator than GDP.

I don't know what the "one sixth" figure could be referring to. Total student debt is <10% of GDP (though that's basically a meaningless comparison---more meaningful would be to say that it's about 1% of outstanding debt in the US).
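For a quick sanity check of the gap between the two figures, here is a small sketch that uses only the numbers quoted in this comment ($560 billion and roughly 3% of GDP). It reads "one sixth of the economy" as one sixth of GDP, which is an assumption; the variable names are illustrative.

```python
def implied_gdp(spending: float, share_of_gdp: float) -> float:
    """GDP implied by a spending figure and its stated share of GDP."""
    if not 0.0 < share_of_gdp <= 1.0:
        raise ValueError("share_of_gdp must be in (0, 1]")
    return spending / share_of_gdp


college_spending = 560e9  # dollars per year, figure quoted above
gdp = implied_gdp(college_spending, 0.03)  # roughly 18.7 trillion dollars

# What college spending would have to be if it were one sixth of GDP.
one_sixth_of_gdp = gdp / 6  # roughly 3.1 trillion dollars

# How far the "one sixth" claim overshoots the quoted spending figure.
overstatement = one_sixth_of_gdp / college_spending  # about 5.6x
```

So under these figures the "one sixth" claim is off by a factor of about five and a half.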

Comment by paulfchristiano on The Steering Problem · 2019-02-07T17:52:42.285Z · score: 2 (1 votes) · LW · GW

This is the typical way of talking about "more useful than" in computer science.

Saying "there is some way to use P to efficiently accomplish X" isn't necessarily helpful to someone who can't find that way. We want to say: if you can find a way to do X with H, then you can find a way to do it with P. And we need an efficiency requirement for the statement to be meaningful at all.

## Security amplification

2019-02-06T17:28:19.995Z · score: 20 (4 votes)
Comment by paulfchristiano on Reliability amplification · 2019-02-02T20:56:27.665Z · score: 2 (1 votes) · LW · GW

Yes, when I say:

Given a distribution A over policies that is ε-close to a benign policy for some ε ≪ 1, can we implement a distribution A⁺ over policies which is δ-close to a benign policy of similar capability, for some δ ≪ ε?

a "benign" policy has to be benign for all inputs. (See also security amplification, stating the analogous problem where a policy is "mostly" benign but may fail on a "small" fraction of inputs.)

## Reliability amplification

2019-01-31T21:12:18.591Z · score: 21 (5 votes)
Comment by paulfchristiano on Thoughts on reward engineering · 2019-01-29T01:38:01.151Z · score: 2 (1 votes) · LW · GW

Some problems:

• If we accept the argument "well it worked, didn't it?" then we are back to the regime where the agent may know something we don't (e.g. about why the action wasn't good even though it looked good).
• Relatedly, it's still not really clear to me what it means to "only accept actions that we understand." If the agent presents an action that is unacceptable, for reasons the overseer doesn't understand, how do we penalize it? It's not like there are some actions for which we understand all consequences and others for which we don't---any action in practice could have lots of consequences we understand and lots we don't, and we can't rule out the existence of consequences we don't understand.
• As you observe, the agent learns facts from the training distribution, and even if the overseer has a memory there is no guarantee that they will be able to use it as effectively as the agent. Being able to look at training data in some way (I expect implicitly) is a reason that informed oversight isn't obviously impossible, but not a reason that this is a non-problem.
Comment by paulfchristiano on Techniques for optimizing worst-case performance · 2019-01-28T22:40:00.758Z · score: 2 (1 votes) · LW · GW

I agree that you probably need ensembling in addition to these techniques.

At best this technique would produce a system which has a small probability of unacceptable behavior for any input. You'd then need to combine multiple of those to get a system with negligible probability of unacceptable behavior.

I expect you often get this for free, since catastrophe either involves a bunch of different AI systems behaving unacceptably, or a single AI behaving consistently unacceptably across time.
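To see why combining several systems can push a small failure probability toward negligible, here is a minimal sketch. It assumes the components fail independently with the same probability and that a catastrophe requires a strict majority to behave unacceptably on a given input; both are simplifying assumptions for illustration, not the scheme from the reliability amplification post.

```python
from math import comb


def majority_failure_probability(n: int, p: float) -> float:
    """Probability that a strict majority of n components fail,
    assuming each fails independently with probability p."""
    if n < 1 or not 0.0 <= p <= 1.0:
        raise ValueError("need n >= 1 and p in [0, 1]")
    threshold = n // 2 + 1  # smallest number of failures that is a strict majority
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(threshold, n + 1)
    )


single = majority_failure_probability(1, 0.1)    # 0.1
ensemble = majority_failure_probability(3, 0.1)  # 3 * 0.01 * 0.9 + 0.001 = 0.028
```

The independence assumption is doing most of the work here: correlated failures, where the same input breaks every component, would not be helped by this kind of voting.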

Comment by paulfchristiano on Thoughts on reward engineering · 2019-01-28T22:36:22.086Z · score: 2 (1 votes) · LW · GW
I thought inner optimizers are supposed to be handled under "learning with catastrophe" / "optimizing for worst case". In particular inner optimizers would cause "malign" failures which would constitute a catastrophe which techniques for learning with catastrophe / optimizing for worst case (such as adversarial training, verification, transparency) would detect and train the agent out of.

Yes. Inner optimizers should either result in low performance on the training distribution (in which case we have a hope of training them out, though we may get stuck in a local optimum), or in manifestly unacceptable behavior on some possible inputs.

Is "informed oversight" just another name for that problem, or a particular approach to solving it?

Informed oversight is being able to figure out everything your agent knows about how good a proposed action is. This seems like a prerequisite both for RL training (if you want a reward function that incentivizes the correct behavior) and for adversarial training to avoid unacceptable behavior.

If the latter, how is it different from "transparency"?

People discuss a bunch of techniques under the heading of transparency/interpretability, and have a bunch of goals.

In the context of this sequence, transparency is relevant for both:

• Know what the agent knows, in order to evaluate its behavior.
• Figure out under what conditions the agent would behave differently, to facilitate adversarial training.

For both of those problems, it's not obvious the solution will look anything like what is normally called transparency (or what people in that field would recognize as transparency). And even if it will look like transparency, it seems worth distinguishing different goals of that research.

So that's why there is a different name.

I thought those ideas would be enough to solve the more recent motivating example for "informed oversight" that Paul gave (training an agent to defend against network attacks).

(I disagreed with this upthread. I don't think "convince the overseer that an action is good" obviously incentivizes the right behavior, even if you are allowed to offer an explanation---certainly we don't have any particular argument that it would incentivize the right behavior. It seems like informed oversight roughly captures what is needed in order for RL to create the right incentives.)

Comment by paulfchristiano on Thoughts on reward engineering · 2019-01-28T22:28:20.010Z · score: 5 (2 votes) · LW · GW
If the overseer sees the agent output an action that the overseer can't understand the rationale of, why can't the overseer just give it a low approval rating? Sure, this limits the performance of the agent to that of the overseer, but that should be fine since we can amplify the agent later?

Suppose that taking action X results in good consequences empirically, but discovering why is quite hard. (It seems plausible to me that this kind of regularity is very important for humans actually behaving intelligently.) If you don't allow actions that are good for reasons you don't understand, it seems like you can never take action X, and if the reasoning is complicated then amplification might not fix the problem until your agent is much more capable (at which point there will be more sophisticated actions Y that result in good consequences for reasons that the agent would have to be even more sophisticated to understand).

If this doesn't work for some reason, why don't we have the agent produce an explanation of the rationale for the action it proposes, and output that along with the action, and have the overseer use that as a hint to help judge how good the action is?

Two problems:

• Sometimes you need hints that help you see why an action is bad. You can take this proposal all the way to debate, though you are still left with a question about whether debate actually works.
• Agents can know things because of complicated regularities on the training data, and hints aren't enough to expose this to the overseer.

## Techniques for optimizing worst-case performance

2019-01-28T21:29:53.164Z · score: 23 (6 votes)
Comment by paulfchristiano on Thoughts on reward engineering · 2019-01-25T17:45:25.169Z · score: 6 (3 votes) · LW · GW

Happy to give more examples; if you haven't seen this newer post on informed oversight it might be helpful (and if not, I'm interested in understanding where the communication gaps are).

Comment by paulfchristiano on Thoughts on reward engineering · 2019-01-25T17:43:37.139Z · score: 4 (2 votes) · LW · GW
plausible to me that "merely" Paul's level of moral reasoning is sufficient to get us there

The hard part of "what is good" isn't the moral part, it's understanding things like "in this world, do humans actually have adequate control of and understanding of the situation?" Using powerful optimization to produce outcomes that look great to Paul-level reasoning doesn't seem wise, regardless of your views on moral questions.

I agree that Paul level reasoning is fine if no one else is building AI systems with more powerful reasoning.

## Thoughts on reward engineering

2019-01-24T20:15:05.251Z · score: 24 (3 votes)

## Learning with catastrophes

2019-01-23T03:01:26.397Z · score: 26 (8 votes)

## Capability amplification

2019-01-20T07:03:27.879Z · score: 24 (7 votes)
Comment by paulfchristiano on Towards formalizing universality · 2019-01-17T20:34:02.668Z · score: 4 (2 votes) · LW · GW

Suppose that I convinced you "if you didn't know much chemistry, you would expect this AI to yield good outcomes." I think you should be pretty happy. It may be that the AI would predictably cause a chemistry-related disaster in a way that would be obvious to you if you knew chemistry, but overall I think you should expect not to have a safety problem.

This feels like an artifact of a deficient definition, I should never end up with a lemma like "if you didn't know much chemistry, you'd expect this AI to yield good outcomes" rather than being able to directly say what we want to say.

That said, I do see some appeal in proving things like "I expect running this AI to be good," and if we are ever going to prove such statements they are probably going to need to be from some impoverished perspective (since it's too hard to bring all of the facts about our actual epistemic state into such a proof), so I don't think it's totally insane.

If we had a system that is ascription universal from some impoverished perspective, you may or may not be OK. I'm not really worrying about it; I expect this definition to change before the point where I literally end up with a system that is ascription universal from some impoverished perspective, and this definition seems good enough to guide next research steps.

Comment by paulfchristiano on Towards formalizing universality · 2019-01-16T21:28:44.260Z · score: 4 (2 votes) · LW · GW

In order to satisfy this definition, 𝔼¹ needs to know every particular fact 𝔼 knows. It would be nice to have a definition that got at the heart of the matter while relaxing this requirement.

I don't think your condition gets around this requirement. Suppose that Y is a bit that 𝔼 knows and 𝔼¹ does not, Z[0] and Z[1] are two hard-to-estimate quantities (that 𝔼¹ and 𝔼² know but 𝔼 does not), and that X=Z[Y].

Comment by paulfchristiano on The E-Coli Test for AI Alignment · 2019-01-16T20:10:53.117Z · score: 7 (3 votes) · LW · GW
Perhaps you say “these cells are too simple, they can’t learn/reflect/etc.” Well, chances are humans will have the same issue once the computational burden gets large enough.

I don't think the situation is symmetrical here.

Humans have easy-to-extract preferences over possible "wiser versions of ourselves." That is, you can give me a menu of slightly modified versions of myself, and I can try to figure out which of those best capture my real values (or over what kind of process should be used for picking which of those best capture my real values, or etc.). Those wiser versions of ourselves can in turn have preferences over even wiser/smarter versions of ourselves, and we can hope that the process might go on ad infinitum.

It may be that the process with humans eventually hits a ceiling---we prefer that we become smarter and wiser in some obvious ways, but then eventually we've picked the low hanging fruit and we are at a loss for thinking about how to change without compromising our values. Or it may be that we are wrong about our preferences, and that iterating this deliberative process goes somewhere crazy.

But those are pretty fundamentally different from the situation with E. coli, where we have no way to even get the process started. In particular, the difficulty of running the process with E. coli doesn't give us much information about whether the process with humans would top out or go off the rails, once we know that humans are able to get the process started.

## The reward engineering problem

2019-01-16T18:47:24.075Z · score: 18 (3 votes)
Comment by paulfchristiano on Towards formalizing universality · 2019-01-16T18:16:39.658Z · score: 4 (2 votes) · LW · GW
What does "expect" mean here? Probability > .5?

We don't think that C has any expected edge in identifying an attack that A does not---if C is up to something, and A can't tell, then we think that the "being up to something" provides no evidence about whether something bad is about to happen. This is much stronger than a fixed probability of failure.

Also, can you walk through an example of how C might "trick" A[C] and how subjective dominance implies that we don't expect C to succeed?

Simple caricatured examples:

• C might propose a design for a computer that has a backdoor that an attacker can use to take over the computer. But if this backdoor will actually be effective, then A[C] will know about it.
• C might propose a design that exploits a predictable flaw in A's reasoning (e.g. overlooking consequences of a certain kind, being overly optimistic about some kinds of activities, incorrectly equating two importantly different quantities...). But then A[C] will know about it, and so if A[C] actually reasons in that way then (in some sense) it is endorsed.
I don't know how to make sense of this. If an impoverished perspective expects C not to be able to "trick" A, it seems kind of obvious that's not good enough for me to think the same?

If an impoverished perspective expects C not to be able to trick A, then either:

• In fact C won't be able to trick A.
• C will trick A, but the perspective is too weak to tell.

I think I don't quite understand what you are saying here, what exactly is obvious?

From a suitably advanced perspective it's obvious that C will be able to trick A sometimes---it will just get "epistemically lucky" and make an assumption that A regards as silly but turns out to be right.

Comment by paulfchristiano on Towards formalizing universality · 2019-01-16T08:24:49.486Z · score: 4 (2 votes) · LW · GW

I'm aiming for things like:

• round debate dominating any fast computation with alternations (including an round debate)
• max-HCH with budget dominating max-HCH with budget for some constant .
• HCH with advice and budget dominating HCH with no advice and budget .

Comment by paulfchristiano on Towards formalizing universality · 2019-01-15T02:24:20.671Z · score: 4 (2 votes) · LW · GW

Yes, thanks.

Comment by paulfchristiano on Towards formalizing universality · 2019-01-14T03:50:22.981Z · score: 4 (2 votes) · LW · GW
does an algorithm that adds two numbers have a belief about the rules of addition? Does a GIF to JPEG converter have a belief about which image format is "better"?

I'm not assuming any fact of the matter about what beliefs a system has. I'm quantifying over all "reasonable" ways of ascribing beliefs. So the only question is which ascription procedures are reasonable.

I think the most natural definition is to allow an ascription procedure to ascribe arbitrary fixed beliefs. That is, we can say that an addition algorithm has beliefs about the rules of addition, or about what kinds of operations will please God, or about what kinds of triples of numbers are aesthetically appealing, or whatever you like.

Universality requires dominating the beliefs produced by any reasonable ascription procedure, and adding particular arbitrary beliefs doesn't make an ascription procedure harder to dominate (so it doesn't really matter if we allow the procedures in the last paragraph as reasonable). The only thing that makes it hard to dominate C is the fact that C can do actual work that causes its beliefs to be accurate.

their inner workings are not immediately obvious

OK, consider the theorem prover that randomly searches over proofs then?

## Towards formalizing universality

2019-01-13T20:39:21.726Z · score: 29 (6 votes)

## Directions and desiderata for AI alignment

2019-01-13T07:47:13.581Z · score: 29 (6 votes)

## Ambitious vs. narrow value learning

2019-01-12T06:18:21.747Z · score: 18 (4 votes)

## AlphaGo Zero and capability amplification

2019-01-09T00:40:13.391Z · score: 25 (9 votes)

## Supervising strong learners by amplifying weak experts

2019-01-06T07:00:58.680Z · score: 28 (7 votes)

## Benign model-free RL

2018-12-02T04:10:45.205Z · score: 10 (2 votes)

## Corrigibility

2018-11-27T21:50:10.517Z · score: 39 (9 votes)

## Humans Consulting HCH

2018-11-25T23:18:55.247Z · score: 19 (3 votes)

## Approval-directed bootstrapping

2018-11-25T23:18:47.542Z · score: 19 (4 votes)

## Approval-directed agents

2018-11-22T21:15:28.956Z · score: 22 (4 votes)

## Prosaic AI alignment

2018-11-20T13:56:39.773Z · score: 36 (9 votes)

## An unaligned benchmark

2018-11-17T15:51:03.448Z · score: 27 (6 votes)

## Clarifying "AI Alignment"

2018-11-15T14:41:57.599Z · score: 54 (16 votes)

## The Steering Problem

2018-11-13T17:14:56.557Z · score: 37 (9 votes)

## Preface to the sequence on iterated amplification

2018-11-10T13:24:13.200Z · score: 39 (14 votes)

## The easy goal inference problem is still hard

2018-11-03T14:41:55.464Z · score: 38 (9 votes)

## Could we send a message to the distant future?

2018-06-09T04:27:00.544Z · score: 40 (14 votes)

## When is unaligned AI morally valuable?

2018-05-25T01:57:55.579Z · score: 97 (29 votes)

## Open question: are minimal circuits daemon-free?

2018-05-05T22:40:20.509Z · score: 110 (35 votes)

## Weird question: could we see distant aliens?

2018-04-20T06:40:18.022Z · score: 85 (25 votes)

## Implicit extortion

2018-04-13T16:33:21.503Z · score: 74 (22 votes)

## Prize for probable problems

2018-03-08T16:58:11.536Z · score: 135 (37 votes)

## Argument, intuition, and recursion

2018-03-05T01:37:36.120Z · score: 99 (29 votes)

## Funding for AI alignment research

2018-03-03T21:52:50.715Z · score: 108 (29 votes)

## Funding for independent AI alignment research

2018-03-03T21:44:44.000Z · score: 0 (0 votes)

## The abruptness of nuclear weapons

2018-02-25T17:40:35.656Z · score: 95 (35 votes)

2018-02-25T04:53:36.083Z · score: 100 (32 votes)

## Funding opportunity for AI alignment research

2017-08-27T05:23:46.000Z · score: 1 (1 votes)

## Ten small life improvements

2017-08-20T19:09:23.673Z · score: 18 (18 votes)

## Crowdsourcing moderation without sacrificing quality

2016-12-02T21:47:57.719Z · score: 15 (11 votes)

## Optimizing the news feed

2016-12-01T23:23:55.403Z · score: 9 (10 votes)

## The universal prior is malign

2016-11-30T22:31:41.000Z · score: 4 (4 votes)

## Recent AI control posts

2016-11-29T18:53:57.656Z · score: 12 (13 votes)

## My recent posts

2016-11-29T18:51:09.000Z · score: 5 (5 votes)

## If we can't lie to others, we will lie to ourselves

2016-11-26T22:29:54.990Z · score: 16 (17 votes)

## Less costly signaling

2016-11-22T21:11:06.028Z · score: 14 (16 votes)

## Control and security

2016-10-15T21:11:55.000Z · score: 3 (3 votes)

## What is up with carbon dioxide and cognition? An offer

2016-04-23T17:47:43.494Z · score: 37 (30 votes)

## Time hierarchy theorems for distributional estimation problems

2016-04-20T17:13:19.000Z · score: 2 (2 votes)

## Another toy model of the control problem

2016-01-30T01:50:12.000Z · score: 1 (1 votes)

## My current take on logical uncertainty

2016-01-29T21:17:33.000Z · score: 2 (2 votes)

## Active learning for opaque predictors

2016-01-03T21:15:28.000Z · score: 1 (1 votes)