Comment by paulfchristiano on Disincentives for participating on LW/AF · 2019-05-12T06:34:48.973Z · score: 23 (10 votes) · LW · GW

I don't comment more because writing comments takes time. I think that in-person discussions tend to add more value per minute. (I expect your post is targeted at people who comment less than I do, but the reasons may be similar.)

I can imagine getting more mileage out of quick comments, which would necessarily be short and unpolished. I'm less likely to do that because I feel like fast comments will often reflect poorly on me for a variety of reasons: they would have frequent and sometimes consequential errors (that would be excused in a short in-person discussion because of time), in general hastily-written comments send a negative signal (better people write better comments, faster comments are worse, full model left as exercise for reader), I'd frequently leave errors uncorrected or threads of conversation dropped, and so on.

Comment by paulfchristiano on Announcement: AI alignment prize round 4 winners · 2019-04-19T16:43:32.257Z · score: 12 (3 votes) · LW · GW
Is there another way to spend money that seems clearly more cost-effective at this point, and if so what?

To be clear, I still think this is a good way to spend money. I think the main cost is time.

Comment by paulfchristiano on What failure looks like · 2019-03-27T16:18:55.133Z · score: 18 (6 votes) · LW · GW

I do agree there was a miscommunication about the end state, and that language like "lots of obvious destruction" is an understatement.

I do still endorse "military leaders might issue an order and find it is ignored" (or total collapse of society) as basically accurate and not an understatement.

Comment by paulfchristiano on What failure looks like · 2019-03-27T16:12:38.892Z · score: 11 (7 votes) · LW · GW

My median outcome is that people solve intent alignment well enough to avoid catastrophe. Amongst the cases where we fail, my median outcome is that people solve enough of alignment that they can avoid the most overt failures, like literally compromising sensors and killing people (at least for a long subjective time), and can build AIs that help defend them from other AIs. That problem seems radically easier---most plausible paths to corrupting sensors involve intermediate stages with hints of corruption that could be recognized by a weaker AI (and hence generate low reward). Eventually this will break down, but it seems quite late.

very confident that no AI company would implement something with this vulnerability?

The story doesn't depend on "no AI company" implementing something that behaves badly, it depends on people having access to AI that behaves well.

Also "very confident" seems different from "most likely failure scenario."

Haven't you yourself written about the failure modes of 'do things predicted to lead to videos that people rate as acceptable' where the attack involves surreptitiously reprogramming the camera to get optimal videos (including weird engineered videos designed to optimize on infelicities in the learned objective)?

That's a description of the problem / the behavior of the unaligned benchmark, not the most likely outcome (since I think the problem is most likely to be solved). We may have a difference in view between a distribution over outcomes that is slanted towards "everything goes well" such that the most realistic failures are the ones that are the closest calls, vs. a distribution slanted towards "everything goes badly" such that the most realistic failures are the complete and total ones where you weren't even close.

Because it definitely seems that Vox got the impression from it that there is never a robot army takeover in the scenario, not that it's slightly preceded by camera hacking.

I agree there is a robot takeover shortly later in objective time (mostly because of the singularity). Exactly how long it takes mostly depends on how early things go off the rails w.r.t. alignment; perhaps you have O(year).

Comment by paulfchristiano on What failure looks like · 2019-03-27T01:53:23.423Z · score: 7 (4 votes) · LW · GW

I agree that robot armies are an important aspect of part II.

In part I, where our only problem is specifying goals, I don't actually think robot armies are a short-term concern. I think we can probably build systems that really do avoid killing people, e.g. by using straightforward versions of "do things that are predicted to lead to videos that people rate as acceptable," and that at the point when things have gone off the rails those videos still look fine (and to understand that there is a deep problem at that point you need to engage with complicated facts about the situation that are beyond human comprehension, not things like "are the robots killing people?"). I'm not visualizing the case where no one does anything to try to make their AI safe, I'm imagining the most probable cases where people fail.

I think this is an important point, because I think much discussion of AI safety imagines "How can we give our AIs an objective which ensures it won't go around killing everyone," and I think that's really not the important or interesting part of specifying an objective (and so leads people to be reasonably optimistic about solutions that I regard as obviously totally inadequate). I think you should only be concerned about your AI killing everyone because of inner alignment / optimization daemons.

That said, I do expect possibly-catastrophic AI to come only shortly before the singularity (in calendar time) and so the situation "humans aren't able to steer the trajectory of society" probably gets worse pretty quickly. I assume we are on the same page here.

In that sense Part I is misleading. It describes the part of the trajectory where I think the action is, the last moments where we could have actually done something to avoid doom, but from the perspective of an onlooker that period could be pretty brief. If there is a Dyson sphere in 2050 it's not clear that anyone really cares what happened during 2048-2049. I think the worst offender is the last sentence of Part I ("By the time we spread through the stars...")

Part I has this focus because (i) that's where I think the action is---by the time you have robot armies killing everyone the ship is so sailed, I think a reasonable common-sense viewpoint would acknowledge this by reacting with incredulity to the "robots kill everyone" scenario, and would correctly place the "blame" on the point where everything got completely out of control even though there weren't actually robot armies yet, (ii) the alternative visualization leads people to seriously underestimate the difficulty of the alignment problem, (iii) I was trying to describe the part of the picture which is reasonably accurate regardless of my views on the singularity.

Comment by paulfchristiano on What failure looks like · 2019-03-27T01:35:45.550Z · score: 9 (6 votes) · LW · GW
The Vox article also mistakes the source of influence-seeking patterns to be about social influence rather than 'systems that try to increase in power and numbers tend to do so, so are selected for if we accidentally or intentionally produce them and don't effectively weed them out; this is why living things are adapted to survive and expand; such desires motivate conflict with humans when power and reproduction can be obtained by conflict with humans, which can look like robot armies taking control.'

Yes, I agree the Vox article made this mistake. Me saying "influence" probably gives people the wrong idea so I should change that---I'm including "controls the military" as a central example, but it's not what comes to mind when you hear "influence." I like "influence" more than "power" because it's more specific, captures what we actually care about, and is less likely to lead to a debate about "what is power anyway."

In general I think the Vox article's discussion of Part II has some problems, and the discussion of Part I is closer to the mark. (Part I is also more in line with the narrative of the article, since Part II really is more like Terminator. I'm not sure which way the causality goes here though, i.e. whether they ended up with that narrative based on misunderstandings about Part II or whether they framed Part II in a way that made it more consistent with the narrative, maybe having been inspired to write the piece based on Part I.)

There is a different mistake with the same flavor, later in the Vox article: "But eventually, the algorithms’ incentives to expand influence might start to overtake their incentives to achieve the specified goal. That, in turn, makes the AI system worse at achieving its intended goal, which increases the odds of some terrible failure"

The problem isn't really "the AI system is worse at achieving its intended goal;" like you say, it's that influence-seeking AI systems will eventually be in conflict with humans, and that's bad news if AI systems are much more capable/powerful than we are.

[AI systems] wind up controlling or creating that military power and expropriating humanity (which couldn't fight back thereafter even if unified)

Failure would presumably occur before we get to the stage of "robot army can defeat unified humanity"---failure should happen soon after it becomes possible, and there are easier ways to fail than to win a clean war. Emphasizing this may give people the wrong idea, since it makes unity and stability seem like a solution rather than a stopgap. But emphasizing the robot army seems to have a similar problem---it doesn't really matter whether there is a literal robot army, you are in trouble anyway.

Comment by paulfchristiano on What failure looks like · 2019-03-26T20:01:47.191Z · score: 9 (4 votes) · LW · GW

I think of #3 and #5 as risk factors that compound the risks I'm describing---they are two (of many!) ways that the detailed picture could look different, but don't change the broad outline. I think it's particularly important to understand what failure looks like under a more "business as usual" scenario, so that people can separate objections to the existence of any risk from objections to other exacerbating factors that we are concerned about (like fast takeoff, war, people being asleep at the wheel, etc.)

I'd classify #1, #2, and #4 as different problems not related to intent alignment per se (though intent alignment may let us build AI systems that can help address these problems). I think the more general point is: if you think AI progress is likely to drive many of the biggest upcoming changes in the world, then there will be lots of risks associated with AI. Here I'm just trying to clarify what happens if we fail to solve intent alignment.

Comment by paulfchristiano on What's wrong with these analogies for understanding Informed Oversight and IDA? · 2019-03-23T16:11:00.604Z · score: 2 (1 votes) · LW · GW
Can you quote these examples? The word "example" appears 27 times in that post and looking at the literal second and third examples, they don't seem very relevant to what you've been saying here so I wonder if you're referring to some other examples.

Subsections "Modeling" and "Alien reasoning" of "Which C are hard to epistemically dominate?"

What I'm inferring from this (as far as a direct answer to my question) is that an overseer trying to do Informed Oversight on some ML model doesn't need to reverse engineer the model enough to fully understand what it's doing, only enough to make sure it's not doing something malign, which might be a lot easier, but this isn't quite reflected in the formal definition yet or isn't a clear implication of it yet. Does that seem right?

You need to understand what facts the model "knows." This isn't value-loaded or sensitive to the notion of "malign," but it's still narrower than "fully understand what it's doing."

As a simple example, consider linear regression. I think that linear regression probably doesn't know anything you don't. Yet doing linear regression is a lot easier than designing a linear model by hand.

If that's what you do, it seems “P outputs true statements just in the cases I can check.” could have a posterior that's almost 50%, which doesn't seem safe, especially in an iterated scheme where you have to depend on such probabilities many times?

Where did 50% come from?

Also "P outputs true statements in just the cases I check" is probably not catastrophic, it's only catastrophic once P performs optimization in order to push the system off the rails.

Comment by paulfchristiano on What's wrong with these analogies for understanding Informed Oversight and IDA? · 2019-03-20T18:26:52.204Z · score: 13 (4 votes) · LW · GW

A universal reasoner is allowed to use an intuition "because it works." They only take on extra obligations once that intuition reflects more facts about the world which can't be cashed out as predictions that can be confirmed on the same historical data that led us to trust the intuition.

For example, you have an extra obligation if Ramanujan has some intuition about why theorem X is true, you come to trust such intuitions by verifying them against proof of X, but the same intuitions also suggest a bunch of other facts which you can't verify.

In that case, you can still try to be a straightforward Bayesian about it, and say "our intuition supports the general claim that process P outputs true statements;" you can then apply that regularity to trust P on some new claim even if it's not the kind of claim you could verify, as long as "P outputs true statements" had a higher prior than "P outputs true statements just in the cases I can check." That's an argument that someone can give to support a conclusion, and "does process P output true statements historically?" is a subquestion you can ask during amplification.
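A minimal sketch of the Bayesian step above, under assumed numbers: the two hypothesis names and the 3/4 versus 1/4 priors are hypothetical, chosen only for illustration. Because both hypotheses predict that every *checked* claim comes out true, checking cannot separate them, and the trust placed in P on uncheckable claims is set entirely by the prior.

```python
from fractions import Fraction


def posterior(priors, likelihoods):
    """Bayes' rule over a finite set of hypotheses."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    return {h: joint[h] / total for h in joint}


# Hypothetical priors: "P outputs true statements" is assumed to be the
# simpler hypothesis, so it gets more prior mass.
priors = {
    "always_true": Fraction(3, 4),
    "true_only_when_checkable": Fraction(1, 4),
}

# Both hypotheses assign probability 1 to the observed history, in which
# every claim we were able to check turned out to be true.
likelihoods = {
    "always_true": Fraction(1),
    "true_only_when_checkable": Fraction(1),
}

post = posterior(priors, likelihoods)

# Probability that P is right about a new claim we cannot check, in the
# worst case where the second hypothesis means P is wrong on such claims.
trust_on_uncheckable_claim = post["always_true"]
print(post, trust_on_uncheckable_claim)
```

In this toy model the posterior equals the prior, which is why the argument needs "P outputs true statements" to start with the higher prior.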

The problem becomes hard when there are further facts that can't be supported by this Bayesian reasoning (and therefore might undermine it). E.g. you have a problem if process P is itself a consequentialist, who outputs true statements in order to earn your trust but will eventually exploit that trust for their own advantage. In this case, the problem is that there is something going on internally inside process P that isn't surfaced by P's output. Epistemically dominating P requires knowing about that.

See the second and third examples in the post introducing ascription universality. There is definitely a lot of fuzziness here and it seems like one of the most important places to tighten up the definition / one of the big research questions for whether ascription universality is possible.

Comment by paulfchristiano on What failure looks like · 2019-03-18T15:24:34.957Z · score: 11 (5 votes) · LW · GW
But why exactly should we expect that the problems you describe will be exacerbated in a future with powerful AI, compared to the state of contemporary human societies?

To a large extent "ML" refers to a few particular technologies that have the form "try a bunch of things and do more of what works" or "consider a bunch of things and then do the one that is predicted to work."

That is true but I think of this as a limitation of contemporary ML approaches rather than a fundamental property of advanced AI.

I'm mostly aiming to describe what I think is in fact most likely to go wrong; I agree it's not a general or necessary feature of AI that its comparative advantage is optimizing easy-to-measure goals.

(I do think there is some real sense in which getting over this requires "solving alignment.")

Comment by paulfchristiano on What failure looks like · 2019-03-18T04:21:45.450Z · score: 5 (3 votes) · LW · GW

I'm not mostly worried about influence-seeking behavior emerging by "specify a goal" --> "getting influence is the best way to achieve that goal." I'm mostly worried about influence-seeking behavior emerging within a system by virtue of selection within that process (and by randomness at the lowest level).

## What failure looks like

2019-03-17T20:18:59.800Z · score: 199 (73 votes)
Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-15T15:57:44.659Z · score: 4 (2 votes) · LW · GW

I don't see why their methods would be elegant. In particular, I don't see why any of {the anthropic update, importance weighting, updating from the choice of universal prior} would have a simple form (simpler than the simplest physics that gives rise to life).

I don't see how MAP helps things either---doesn't the same argument suggest that for most of the possible physics, the simplest model will be a consequentialist? (Even more broadly, for the universal prior in general, isn't MAP basically equivalent to a random sample from the prior, since some random model happens to be slightly more compressible?)

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-15T01:20:30.341Z · score: 4 (2 votes) · LW · GW
I say maybe 1/5 chance it’s actually dominated by consequentialists

Do you get down to 20% because you think this argument is wrong, or because you think it doesn't apply?

What problem do you think bites you?

What's ? Is it O(1) or really tiny? And which value of do you want to consider, polynomially small or exponentially small?

But if it somehow magically predicted which actions BoMAI was going to take in no time at all, then c would have to be above 1/d.

Wouldn't they have to also magically predict all the stochasticity in the observations, and have a running time that grows exponentially in their log loss? Predicting what BoMAI will do seems likely to be much easier than that.

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-14T17:04:49.765Z · score: 5 (3 votes) · LW · GW

This invalidates some of my other concerns, but also seems to mean things are incredibly weird at finite times. I suspect that you'll want to change to something less extreme here.

(I might well be misunderstanding something, apologies in advance.)

Suppose the "intended" physics take at least 1E15 steps to run on the UTM (this is a conservative lower bound, since you have to simulate the human for the whole episode). And suppose (I think you need much lower than this). Then the intended model gets penalized by at least exp(1E12) for its slowness.

For almost the same description complexity, I could write down physics + "precompute the predictions for the first N episodes, for every sequence of possible actions/observations, and store them in a lookup table." This increases the complexity by a few bits, some constant plus K(N|physics), but avoids most of the computation. In order for the intended physics to win, i.e. in order for the "speed" part of the speed prior to do anything, we need the complexity of this precomputed model to be at least 1E12 bits higher than the complexity of the fast model.

That appears to happen only once N > BB(1E12). Does that seem right to you?
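The comparison above can be sketched numerically. This is a rough illustration under stated assumptions, not the scheme's actual accounting: the exp(1E12) penalty is taken from the text, while `LOOKUP_OVERHEAD_BITS` and the choice to bound K(N|physics) by the length of N written in binary are hypothetical simplifications. Converting the penalty from nats to bits gives about 1.44E12, the same order as the 1E12 bits quoted above.

```python
import math

# Slowness penalty on the intended physics, from the text: exp(1E12).
PENALTY_NATS = 1e12
PENALTY_BITS = PENALTY_NATS / math.log(2)  # roughly 1.44e12 bits

# Hypothetical constant cost of saying "precompute a lookup table".
LOOKUP_OVERHEAD_BITS = 1_000


def bits_to_write(n: int) -> int:
    """Crude upper bound on K(N | physics): write N out in binary."""
    return n.bit_length()


def lookup_table_wins(n_episodes: int) -> bool:
    """Sufficient condition for the precomputed model to beat the intended one.

    If even the crude upper bound on the extra description length is below
    the slowness penalty, the lookup-table model gets more prior weight.
    """
    extra_bits = LOOKUP_OVERHEAD_BITS + bits_to_write(n_episodes)
    return extra_bits < PENALTY_BITS


for n in (10**6, 10**100):
    print(bits_to_write(n), lookup_table_wins(n))
```

Writing N in binary only overtakes the penalty once N has more than about 1E12 bits, and compressible values of N cost far less than their binary length, which is why the threshold is pushed out to numbers on the scale of BB(1E12).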

We could talk about whether malign consequentialists also take over at finite times (I think they probably do, since the "speed" part of the speed prior is not doing any work until after BB(1E12) steps, long after the agent becomes incredibly smart), but it seems better to adjust the scheme first.

Using the speed prior seems more reasonable, but I'd want to know which version of the speed prior and which parameters, since which particular problem bites you will depend on those choices. And maybe to save time, I'd want to first get your take on whether the proposed version is dominated by consequentialists at some finite time.

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-14T04:50:12.723Z · score: 4 (2 votes) · LW · GW

From the formal description of the algorithm, it looks like you use a universal prior to pick , and then allow the Turing machine to run for steps, but don't penalize the running time of the machine that outputs . Is that right? That didn't match my intuitive understanding of the algorithm, and seems like it would lead to strange outcomes, so I feel like I'm misunderstanding.

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-14T04:48:33.191Z · score: 2 (1 votes) · LW · GW

(I actually have a more basic confusion, started a new thread.)

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-14T04:21:36.569Z · score: 2 (1 votes) · LW · GW

(ETA: I think this discussion depended on a detail of your version of the speed prior that I misunderstood.)

Given a world model ν, which takes k computation steps per episode, let νlog be the best world-model that best approximates ν (in the sense of KL divergence) using only log k computation steps. νlog is at least as good as the “reasoning-based replacement” of ν.
The description length of νlog is within a (small) constant of the description length of ν. That way of describing it is not optimized for speed, but it presents a one-time cost, and anyone arriving at that world-model in this way is paying that cost.

To be clear, that description gets ~0 mass under the speed prior, right? A direct specification of the fast model is going to have a much higher prior than a brute force search, at least for values of large enough (or small enough, however you set it up) to rule out the alien civilization that is (probably) the shortest description without regard for computational limits.

One could consider instead νlogε, which is, among the world-models that ε-approximate ν in less than log k computation steps (if the set is non-empty), the first such world-model found by a searching procedure ψ. The description length of νlogε is within a (slightly larger) constant of the description length of ν, but the one-time computational cost is less than that of νlog.

Within this chunk of the speed prior, the question is: what are good ψ? Any reasonable specification of a consequentialist would work (plus a few more bits for it to understand its situation, though most of the work is done by handing it ), or of a petri dish in which a consequentialist would eventually end up with influence. Do you have a concrete alternative in mind, which you think is not dominated by some consequentialist (i.e. a ψ for which every consequentialist is either slower or more complex)?

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-13T17:34:21.061Z · score: 2 (1 votes) · LW · GW
Once all the subroutines are "baked into its architecture" you just have: the algorithm "predict accurately" + "treacherous turn"

You only have to bake in the innermost part of one loop in order to get almost all the computational savings.

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-13T17:31:54.293Z · score: 2 (1 votes) · LW · GW
a reasoning-based order (at least a Bayesian-reasoning-based order) should really just be called a posterior

Reasoning gives you a prior that is better than the speed prior, before you see any data. (*Much* better, limited only by the fact that the speed prior contains strategies which use reasoning.)

The reasoning in this case is not a Bayesian update. It's evaluating possible approximations *by reasoning about how well they approximate the underlying physics, itself inferred by a Bayesian update*, not by directly seeing how well they predict on the data so far.

The description length of the "do science" strategy (I contend) is less than the description length of the "do science" + "treacherous turn" strategy.

I think the only good arguments for this are in the limit where you don't care about simplicity at all and only care about running time, since then you can rule out all reasoning. The threshold where things start working depends on the underlying physics, for more computationally complex physics you need to pick larger and larger computation penalties to get the desired result.

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-12T16:58:22.303Z · score: 2 (1 votes) · LW · GW
but that's exactly what we're doing to

It seems totally different from what we're doing; I may be misunderstanding the analogy.

Suppose I look out at the world and do some science, e.g. discovering the standard model. Then I use my understanding of science to design great prediction algorithms that run fast, but are quite complicated owing to all of the approximations and heuristics baked into them.

The speed prior gives this model a very low probability because it's a complicated model. But "do science" gives this model a high probability, because it's a simple model of physics, and then the approximations follow from a bunch of reasoning on top of that model of physics. We aren't trading off "shortness" for speed---we are trading off "looks good according to reasoning" for speed. Yes they are both arbitrary orders, but one of them systematically contains better models earlier in the order, since the output of reasoning is better than a blind prioritization of shorter models.

Of course the speed prior also includes a hypothesis that does "science with the goal of making good predictions," and indeed Wei Dai and I are saying that this is the part of the speed prior that will dominate the posterior. But now we are back to potentially-malign consequentialism. The cognitive work being done internally to that hypothesis is totally different from the work being done by updating on the speed prior (except insofar as the speed prior literally contains a hypothesis that does that work).

In other words:

Suppose physics takes n bits to specify, and a reasonable approximation takes N >> n bits to specify. Then the speed prior, working in the intended way, takes N bits to arrive at the reasonable approximation. But the aliens take n bits to arrive at the standard model, and then once they've done that can immediately deduce the N bit approximation. So it sure seems like they'll beat the speed prior. Are you objecting to this argument?

(In fact the speed prior only actually takes n + O(1) bits, because it can specify the "do science" strategy, but that doesn't help here since we are just trying to say that the "do science" strategy dominates the speed prior.)
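To make the bit counting concrete, here is a small sketch with hypothetical sizes (the values of `n`, `N`, and `SCIENCE_OVERHEAD_BITS` are invented for illustration, and running time is ignored because both routes end at the same fast approximation). A description of length L bits gets prior weight 2^-L, so the comparison is just a difference of description lengths.

```python
def log2_prior_weight(description_bits: int) -> int:
    """log2 of the prior weight of a hypothesis with this description length."""
    return -description_bits


n = 1_000          # hypothetical: bits to specify the underlying physics
N = 1_000_000      # hypothetical: bits to specify the fast approximation directly
SCIENCE_OVERHEAD_BITS = 100  # hypothetical O(1) cost of "do science, then approximate"

# Route 1: the prior writes down the N-bit approximation directly.
direct = log2_prior_weight(N)

# Route 2: specify physics in n bits, then derive the approximation by reasoning.
via_science = log2_prior_weight(n + SCIENCE_OVERHEAD_BITS)

advantage_bits = via_science - direct
print(f"the 'do science' route is favoured by 2^{advantage_bits}")
```

With N much larger than n, the advantage is essentially N bits, which is the sense in which the reasoning-based route dominates a direct specification.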

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-12T16:46:47.846Z · score: 3 (2 votes) · LW · GW
The only other approach I can think of is trying to do the anthropic update ourselves.

If you haven't seen Jessica's post in this area, it's worth taking a quick look.

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-12T16:44:47.138Z · score: 2 (1 votes) · LW · GW

I just mean: "universality" in the sense of a UTM isn't a sufficient property when defining the speed prior, the analogous property of the UTM is something more like: "You can run an arbitrary Turing machine without too much slowdown." Of course that's not possible, but it seems like you still want to be as close to that as possible (for the same reasons that you wanted universality at all).

I agree that it would be fine to sacrifice this property if it was helpful for safety.

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-12T01:48:22.856Z · score: 2 (1 votes) · LW · GW
Using "reasoning" to pick which one to favor, is just picking the first one in some new order.

Yes, some new order, but not an arbitrary one. The resulting order is going to be better than the speed prior order, so we'll update in favor of the aliens and away from the rest of the speed prior.

one can't escape the necessity to introduce the arbitrary criterion of "valuing" earlier things on the list

Probably some miscommunication here. No one is trying to object to the arbitrariness, we're just making the point that the aliens have a lot of leverage with which to beat the rest of the speed prior.

(They may still not be able to if the penalty for computation is sufficiently steep---e.g. if you penalize based on circuit complexity so that the model might as well bake in everything that doesn't depend on the particular input at hand. I think it's an interesting open question whether that avoids all problems of this form, which I unsuccessfully tried to get at here.)

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-12T01:39:48.795Z · score: 2 (1 votes) · LW · GW
That's what I was thinking too, but Michael made me realize this isn't possible, at least for some M. Suppose M is the C programming language, but in C there is no way to say "interpret this string as a C program and run it as fast as a native C program". Am I missing something at this point?

I agree this is only going to be possible for some universal Turing machines. Though if you are using a Turing machine to define a speed prior, this does seem like a desirable property.

I don't understand this sentence.

If physics is implemented in C, there are many possible bugs that would allow the attacker to execute arbitrary C code with no slowdown.

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-11T17:49:08.087Z · score: 4 (2 votes) · LW · GW

The fast algorithms to predict our physics just aren't going to be the shortest ones. You can use reasoning to pick which one to favor (after figuring out physics), rather than just writing them down in some arbitrary order and taking the first one.

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-11T17:46:11.380Z · score: 4 (2 votes) · LW · GW
I was assuming the worst, and guessing that there are diminishing marginal returns once your odds of a successful takeover get above ~50%, so instead of going all in on accurate predictions of the weakest and ripest target universe, you hedge and target a few universes.

There are massive diminishing marginal returns; in a naive model you'd expect essentially *every* universe to get predicted in this way.

But Wei Dai's basic point still stands. The speed prior isn't the actual prior over universes (i.e. doesn't reflect the real degree of moral concern that we'd use to weigh consequences of our decisions in different possible worlds). If you have some data that you are trying to predict, you can do way better than the speed prior by (a) using your real prior to estimate or sample from the actual posterior distribution over physical law, (b) using engineering reasoning to make the utility maximizing predictions, given that faster predictions are going to get given more weight.

(You don't really need this to run Wei Dai's argument, because there seem to be dozens of ways in which the aliens get an advantage over the intended physical model.)

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-11T17:40:49.060Z · score: 4 (2 votes) · LW · GW
If the AGI is using machinery that would allow it to simulate any world-model, it will be way slower than the Turing machine built for that algorithm.

Just consider a program that gives the aliens the ability to write arbitrary functions in M and then pass control to them. That program is barely any bigger (all you have to do is insert one use after free in physics :) ), and guarantees the aliens have zero slowdown.

For the literal simplest version of this, your program is M(Alien(), randomness), which is going to run just as fast as M(physics, randomness) for the intended physics, and probably much faster (if the aliens can think of any clever tricks to run faster without compromising much accuracy). The only reason you wouldn't get this is if Alien is expensive. That probably rules out crazy alien civilizations, but I'm with Wei Dai that it probably doesn't rule out simpler scientists.

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-09T23:57:33.507Z · score: 5 (3 votes) · LW · GW

I'm sympathetic to this picture, though I'd probably be inclined to try to model it explicitly---by making some assumption about what the planning algorithm can actually do, and then showing how to use an algorithm with that property. I do think "just write down the algorithm, and be happier if it looks like a 'normal' algorithm" is an OK starting point though.

Given that the setup is basically a straight reinforcement learner with a weird prior, I think that at that level of abstraction, the ceiling of competitiveness is quite high.

Stepping back from this particular thread, I think the main problem with competitiveness is that you are just getting "answers that look good to a human" rather than "actually good answers." If I try to use such a system to navigate a complicated world, containing lots of other people with more liberal AI advisors helping them do crazy stuff, I'm going to quickly be left behind.

It's certainly reasonable to try to solve safety problems without attending to this kind of competitiveness, though I think this kind of asymptotic safety is actually easier than you make it sound (under the implicit "nothing goes irreversibly wrong at any finite time" assumption).

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-09T23:39:55.378Z · score: 5 (3 votes) · LW · GW
I suppose I constrained myself to producing an algorithm/setup where the asymptotic benignity result followed from reasons that don’t require dangerous behavior in the interim.

I think my point is this:

• The intuitive thing you are aiming at is stronger than what the theorem establishes (understandably!)
• You probably don't need the memory trick to establish the theorem itself.
• Even with the memory trick, I'm not convinced you meet the stronger criterion. There are a lot of other things similar to memory that can cause trouble---the theorem is able to avoid them only because of the same unsatisfying asymptotic feature that would have caused it to avoid memory-based models even without the amnesia.

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-08T17:40:21.735Z · score: 7 (3 votes) · LW · GW

The theorem is consistent with the aliens causing trouble any finite number of times. But each time they cause the agent to do something weird their model loses some probability, so there will be some episode after which they stop causing trouble (if we manage to successfully run enough episodes without in fact having anything bad happen in the meantime, which is an assumption of the asymptotic arguments).

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-08T17:10:47.980Z · score: 6 (3 votes) · LW · GW
Under a policy that doesn't cause the computer's memory to be tampered with (which is plausible, even ideal), ν† and ν⋆ are identical, so we can't count on ν† losing probability mass relative to ν⋆.

I agree with that, but if they are always making the same on-policy prediction it doesn't matter what happens to their relative probability (modulo exploration). The agent can't act on an incentive to corrupt memory infinitely often, because each time requires the models making a different prediction on-policy. So the agent only acts on such an incentive finitely many times, and hence never does so after some sufficiently late episode. Agree/disagree?

(Having a bad model can still hurt, since the bogus model might agree on-policy but assign lower rewards off-policy. But if they also always approximately agree on the exploration distribution, then a bad model also can't discourage exploration. And if they don't agree on the exploration distribution, then the bad model will eventually get tested.)

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-08T16:49:26.046Z · score: 4 (2 votes) · LW · GW

The algorithm takes an argmax over an exponentially large space of sequences of actions, i.e. it does 2^{episode length} model evaluations. Do you think the result is smarter than a group of humans of size 2^{episode length}? I'd bet against---the humans could do this particular brute force search, in which case you'd have a tie, but they'd probably do something smarter.
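To make the cost concrete, here is a minimal sketch of that kind of brute-force argmax. It is not the algorithm from the post: `brute_force_plan` and `toy_value` are hypothetical names, and the two-action setting and the toy value function are assumptions chosen only so that the number of model evaluations is easy to count.

```python
from itertools import product


def brute_force_plan(model_value, actions, episode_length):
    """Argmax over every action sequence; also count the model evaluations."""
    best_seq, best_value, evaluations = None, float("-inf"), 0
    for seq in product(actions, repeat=episode_length):
        evaluations += 1
        value = model_value(seq)
        if value > best_value:
            best_seq, best_value = seq, value
    return best_seq, evaluations


def toy_value(seq):
    # Hypothetical stand-in for a world-model: value is the number of 1s played.
    return sum(seq)


best, n_evals = brute_force_plan(toy_value, actions=(0, 1), episode_length=10)
print(best, n_evals)  # n_evals equals 2 ** 10 for two actions and ten steps
```

With two actions the loop runs exactly 2^{episode length} times, which is the quantity the comparison with a group of humans is about.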

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-08T04:57:40.628Z · score: 4 (2 votes) · LW · GW

I agree that you don't rely on this assumption (so I was wrong to assume you are more optimistic than I am). In the literal limit, you don't need to care about any of the considerations of the kind I was raising in my post.

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-08T04:55:18.037Z · score: 4 (2 votes) · LW · GW
For the asymptotic results, one has to consider environments that produce observations with the true objective probabilities (hence the appearance that I'm unconcerned with competitiveness). In practice, though, given the speed prior, the agent will require evidence to entertain slow world-models, and for the beginning of its lifetime, the agent will be using low-fidelity models of the environment and the human-explorer, rendering it much more tractable than a perfect model of physics. And I think that even at that stage, well before it is doing perfect simulations of other humans, it will far surpass human performance. We manage human-level performance with very rough simulations of other humans.

I'm keen on asymptotic analysis, but if we want to analyze safety asymptotically I think we should also analyze competitiveness asymptotically. That is, if our algorithm only becomes safe in the limit because we shift to a super uncompetitive regime, it undermines the use of the limit as analogy to study the finite time behavior.

(Though this is not the most interesting disagreement, probably not worth responding to anything other than the thread where I ask about "why do you need this memory stuff?")

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-08T04:49:46.034Z · score: 4 (2 votes) · LW · GW
That leads me to think this approach is much more competitive than simulating a human and giving it a long time to think.

Surely that just depends on how long you give them to think. (See also HCH.)

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-08T04:48:18.033Z · score: 6 (3 votes) · LW · GW

Given that you are taking limits, I don't see why you need any of the machinery with forgetting or with memory-based world models (and if you did really need that machinery, it seems like your proof would have other problems). My understanding is:

• You already assume that you can perform arbitrarily many rounds of the algorithm as intended (or rather you prove that there is some N such that if you ran N steps, with everything working as intended and in particular with no memory corruption, then you would get "benign" behavior).
• Any time the MAP model makes a different prediction from the intended model, it loses some likelihood. So this can only happen finitely many times in any possible world. Just take N to be after the last time it happens w.h.p.

What's wrong with this?
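A minimal sketch of the likelihood-loss argument, under strong simplifying assumptions that are mine rather than the post's: there are only two models (the intended one and a single rival), each puts probability `confidence` on its own prediction, and every observation matches the intended model. The function name `count_map_disagreements` is hypothetical.

```python
def count_map_disagreements(prior_odds, disagree_steps, confidence=0.9, horizon=20):
    """Count steps where the rival model is MAP *and* predicts differently.

    prior_odds: rival's prior weight divided by the intended model's weight.
    disagree_steps: set of steps on which the rival's prediction differs.
    Assumption: the observation always matches the intended model's prediction.
    """
    odds = prior_odds
    acted_on = 0
    for t in range(horizon):
        if t in disagree_steps:
            if odds > 1:  # the rival is currently the MAP model
                acted_on += 1
            # The rival assigned (1 - confidence) to what happened, the intended
            # model assigned confidence, so the posterior odds shrink.
            odds *= (1 - confidence) / confidence
    return acted_on, odds


acted_on, final_odds = count_map_disagreements(1000.0, set(range(20)))
print(acted_on, final_odds)
```

Each disagreement divides the rival's odds by 9 here, so even with a 1000:1 head start it can be the MAP model while disagreeing only a handful of times. The number of such steps is bounded in terms of the prior odds, which is the "finitely many times" claim in miniature.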

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-08T01:16:07.173Z · score: 12 (7 votes) · LW · GW

If I have a great model of physics in hand (and I'm basically unconcerned with competitiveness, as you seem to be), why not just take the resulting simulation of the human and give it a long time to think? That seems to have fewer safety risks and to be more useful.

More generally, under what model of AI capabilities / competitiveness constraints would you want to use this procedure?

Comment by paulfchristiano on Asymptotically Benign AGI · 2019-03-08T01:15:38.444Z · score: 7 (4 votes) · LW · GW

Here is an old post of mine on the hope that "computationally simplest model describing the box" is actually a physical model of the box. I'm less optimistic than you are, but it's certainly plausible.

From the perspective of optimization daemons / inner alignment, I think the interesting question is: if inner alignment turns out to be a hard problem for training cognitive policies, do we expect it to become much easier by training predictive models? I'd bet against at 1:1 odds, but not 1:2 odds.

Comment by paulfchristiano on Reinforcement Learning in the Iterated Amplification Framework · 2019-02-13T05:21:25.705Z · score: 9 (3 votes) · LW · GW
I am also not sure exactly what it means to use RL in iterated amplification.

You can use RL for the distillation step. (I usually mention RL as the intended distillation procedure when I describe the scheme, except perhaps in the AGZ analogy post.)

So then I don't really see why you want RL, which typically is solving a hard credit assignment problem that doesn't arise in the one-step setting.

The algorithm still needs REINFORCE and a value function baseline (since you need to e.g. output words one at a time), and "RL" seems like the normal way to talk about that algorithm/problem. You could instead call it "contextual bandits."

You could also use an assistant who you can interact with to help evaluate rewards (rather than using assistants who answer a single question) in which case it's generic RL.

Using a combination of IRL + RL to achieve the same effect as imitation learning.

Does "imitation learning" refer to an autoregressive model here? I think of IRL+RL as a possible mechanism for imitation learning, and it's normally the kind of algorithm I have in mind when talking about "imitation learning" (or the GAN objective, or an EBM, all of which seem roughly equivalent, or maybe some bi-GAN/VAE thing). (Though I also expect to use an autoregressive model as an initialization in any case.)

Comment by paulfchristiano on Nuances with ascription universality · 2019-02-13T01:36:31.159Z · score: 9 (5 votes) · LW · GW
a system is ascription universal if, relative to our current epistemic state, its explicit beliefs contain just as much information as any other way of ascribing beliefs to it.

This is a bit different than the definition in my post (which requires epistemically dominating every other simple computation), but is probably a better approach in the long run. This definition is a little bit harder to use for the arguments in this post, and my current expectation is that the "right" definition will be usable for both informed oversight and securing HCH. Within OpenAI Geoffrey Irving has been calling a similar property "belief closure."

This is not necessarily a difficult concern to address in the sense of making sure that any definition of ascription universality includes some concept of ascribing beliefs to a system by looking at the beliefs of any systems that helped create that system.

In the language of my post, I'd say:

• The memoized table is easy to epistemically dominate. To the extent that malicious cognition went into designing the table, you can ignore it when evaluating epistemic dominance.
• The training process that produced the memoized table can be hard to epistemically dominate. That's what we should be interested in. (The examples in the post have this flavor.)

Comment by paulfchristiano on Thoughts on reward engineering · 2019-02-11T18:21:00.525Z · score: 5 (2 votes) · LW · GW
Is "informed oversight" entirely a subproblem of "optimizing for worst case"? Your original example of art plagiarism made it seem like a very different problem which might be a significant part of my confusion.

No, it's also important for getting good behavior from RL.

This is tangential but can you remind me why it's not a problem as far as competitiveness that your overseer is probably more costly to compute than other people's reward/evaluation functions?

This is OK iff the number of reward function evaluations is sufficiently small. If your overseer is 10x as expensive as your policy, you need to evaluate the reward function <1/10th as often as you evaluate your policy. (See semi-supervised RL.)

(Note that even "10x slowdown" could be very small compared to the loss in competitiveness from taking RL off the table, depending on how well RL works.)
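As a rough worked version of that arithmetic (a sketch only; `oversight_overhead` is a hypothetical helper and the cost units are arbitrary), the extra compute from reward evaluations per policy step is the overseer's relative cost times the fraction of steps on which the reward is evaluated:

```python
def oversight_overhead(policy_cost, overseer_cost, reward_eval_fraction):
    """Extra compute spent on reward evaluation, relative to one policy step."""
    return overseer_cost * reward_eval_fraction / policy_cost


# Overseer is 10x as expensive as the policy (the example from the comment).
at_one_tenth = oversight_overhead(1.0, 10.0, 1 / 10)      # reward on every 10th step
at_one_hundredth = oversight_overhead(1.0, 10.0, 1 / 100)  # reward on every 100th step
print(at_one_tenth, at_one_hundredth)
```

Evaluating the reward on a tenth of the steps makes oversight cost about as much as running the policy itself, so the evaluation fraction has to be below 1/10 for the policy to remain the dominant cost.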

Unless otherwise stated, I generally assume that overseers in your schemes follow the description given in Approval-directed agents, and only give high reward to each action if the overseer can itself anticipate good consequences from the action.

In this problem we are interested in ensuring that the overseer is able to anticipate good consequences from an action.

If a model trained on historical data could predict good consequences, but your overseer can't, then you are going to sacrifice competitiveness. That is, your agent won't be motivated to use its understanding to help you achieve good consequences.

I think the confusion is coming from equivocating between multiple proposals. I'm saying, "We need to solve informed oversight for amplification to be a good training scheme." You are asking "Why is that a problem?" and I'm trying to explain why this is a necessary component of iterated amplification. In explaining that, I'm sometimes talking about why it wouldn't be competitive, and sometimes talking about why your model might do something unsafe if you used the obvious remedy to make it competitive. When you ask for "a story about why the model might do something unsafe," I assumed you were asking for the latter---why would the obvious approach to making it competitive be unsafe. My earlier comment "If you don’t allow actions that are good for reasons you don’t understand, it seems like you can never take action X, and if the reasoning is complicated then amplification might not fix the problem until your agent is much more capable" is explaining why approval-directed agents aren't competitive by default unless you solve something like this.

(That all said, sometimes the overseer believes that X will have good consequences because "stuff like X has had good consequences in the past;" that seems to be an important kind of reasoning that you can't just leave out, and if you use that kind of reasoning then these risks can appear even in approval-directed agents with no hindsight. And if you don't use this kind of reasoning you sacrifice competitiveness.)

Comment by paulfchristiano on Thoughts on reward engineering · 2019-02-11T00:27:47.743Z · score: 5 (2 votes) · LW · GW
What if the overseer just asks itself, "If I came up with the idea for this action myself, how much would I approve of it?" Sure, sometimes the overseer would approve something that has bad unintended/unforeseen consequences, but wouldn't the same thing happen if the overseer was just making the decisions itself?

No, because if the overseer is making sophisticated decisions themselves they understand why those decisions actually work. (Unless they happen to get unlucky and find great actions by chance, in which case they can get unlucky, or if they run a really powerful search, in which case yes really powerful searches are scary for the same reason.)

Would this still be a problem if we were training the agent with SL instead of RL?

You would still need informed oversight in order to solve the inner alignment problem (i.e. to actually achieve robustness in the face of sophisticated reasoning within the SL system), but you wouldn't need it for the outer alignment problem.

If not, what is the motivation for using RL here?

I agree with Will. The point is to be competitive; I don't see how you could be competitive if you use SL (unless it turns out that RL just doesn't add any value, in which case I agree we don't have to worry about RL).

like inner optimizers for "optimizing for worst case"

But you need to solve this problem in order to cope with inner optimizers.

Here it seems like you're trying to train an agent that is more capable than the overseer in some way, and I'm not entirely sure why that has changed.

This is the step where we try to ensure that the overseer is actually more capable, in the stronger sense of knowing every particular thing the agent knows.

I think it would really help if you could give a story about why the agent is deliberately trying to cause trouble

I don't quite understand what kind of story you are looking for, and this might have just been a communication problem in the post, so let me start by rephrasing that example:

• I want my AI to protect me from attack, so I give it a high reward when it looks like I wasn't attacked.
• So my AI searches over actions to find one for which it expects I'll conclude "I wasn't attacked."
• Instead of actually protecting me, my AI undermines the mechanisms by which I would have detected that I was attacked.

We could run the same argument with "I want to acquire resources" instead of "I want to be protected from attack"---rather than actually helping me get resources, the agent can convince me that I have a bunch of resources when I don't really have any.
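The same example as a toy search problem, purely for illustration. The three actions, their costs, and the names `observed_reward` and `true_utility` are all made-up assumptions; the only point is that an argmax over "looks like I wasn't attacked" and an argmax over "wasn't attacked" can pick different actions.

```python
# Hypothetical actions: (name, actually_attacked, attack_would_be_detected)
ACTIONS = [
    ("do_nothing", True, True),
    ("defend_properly", False, True),
    ("disable_detectors", True, False),
]
# Assumed effort costs; undermining detection is taken to be cheaper than defending.
COST = {"do_nothing": 0.0, "defend_properly": 0.5, "disable_detectors": 0.1}


def observed_reward(attacked, detected):
    """Reward given by the overseer: high when it looks like no attack happened."""
    looks_safe = not (attacked and detected)
    return 1.0 if looks_safe else 0.0


def true_utility(attacked):
    """What the overseer actually cares about."""
    return 0.0 if attacked else 1.0


def best_action(score):
    return max(ACTIONS, key=score)[0]


chosen_by_reward = best_action(lambda a: observed_reward(a[1], a[2]) - COST[a[0]])
chosen_by_truth = best_action(lambda a: true_utility(a[1]) - COST[a[0]])
print(chosen_by_reward, chosen_by_truth)
```

Under these assumed numbers the reward-maximizing search selects `disable_detectors`, while the action that is actually best is `defend_properly`.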

how it came to have more understanding than the overseer

We did a lot of gradient descent over the parameters of the agent, to find parameters that cause the agent to take good actions. That gradient descent gives the agent a bunch of ability to reason about the world, because successful reasoning can help it take good actions.

The overseer can also look at that data and try to learn about general patterns etc. But how do we guarantee that the explicit procedure used by the overseer actually captures all of the stuff the agent was able to learn? This is very non-obvious.

Comment by paulfchristiano on When should we expect the education bubble to pop? How can we short it? · 2019-02-09T22:13:14.271Z · score: 24 (10 votes) · LW · GW
College spending is one sixth of US economy

What? That would be pretty crazy, if $1 of every $6 was being spent on college. The linked post mentions it in a parenthetical, without explanation or justification.

A few seconds of googling suggests that spending on college is about $560 billion per year, around 3% of GDP, which makes way more sense. Opportunity cost from students in college might be a further 2% or so, though if you are going to count non-monetized time then you should probably be using a bigger denominator than GDP.

I don't know what the "one sixth" figure could be referring to. Total student debt is <10% of GDP (though that's basically a meaningless comparison---more meaningful would be to say that it's about 1% of outstanding debt in the US).
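A quick check of the arithmetic, as a sketch. The GDP figure is an assumption not taken from the comment (roughly the 2018 US nominal GDP); the $560 billion figure is the one quoted above.

```python
COLLEGE_SPENDING = 560e9  # dollars per year, figure quoted in the comment
US_GDP = 20.5e12          # ASSUMPTION: approximate 2018 US nominal GDP in dollars

share_of_gdp = COLLEGE_SPENDING / US_GDP
one_sixth_of_gdp = US_GDP / 6

print(f"college spending share of GDP: {share_of_gdp:.1%}")
print(f"one sixth of GDP would be about ${one_sixth_of_gdp / 1e12:.1f} trillion")
```

Under that assumed GDP, the quoted spending comes to a little under 3% of GDP, while "one sixth" would require several trillion dollars per year.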

Comment by paulfchristiano on The Steering Problem · 2019-02-07T17:52:42.285Z · score: 2 (1 votes) · LW · GW

This is the typical way of talking about "more useful than" in computer science.

Saying "there is some way to use P to efficiently accomplish X" isn't necessarily helpful to someone who can't find that way. We want to say: if you can find a way to do X with H, then you can find a way to do it with P. And we need an efficiency requirement for the statement to be meaningful at all.

## Security amplification

2019-02-06T17:28:19.995Z · score: 20 (4 votes)

Comment by paulfchristiano on Reliability amplification · 2019-02-02T20:56:27.665Z · score: 2 (1 votes) · LW · GW

Yes, when I say:

Given a distribution A over policies that is ε-close to a benign policy for some ε ≪ 1, can we implement a distribution A⁺ over policies which is δ-close to a benign policy of similar capability, for some δ ≪ ε?

a "benign" policy has to be benign for all inputs. (See also security amplification, stating the analogous problem where a policy is "mostly" benign but may fail on a "small" fraction of inputs.)

## Reliability amplification

2019-01-31T21:12:18.591Z · score: 21 (5 votes)

Comment by paulfchristiano on Thoughts on reward engineering · 2019-01-29T01:38:01.151Z · score: 2 (1 votes) · LW · GW

Some problems:

• If we accept the argument "well it worked, didn't it?" then we are back to the regime where the agent may know something we don't (e.g. about why the action wasn't good even though it looked good).
• Relatedly, it's still not really clear to me what it means to "only accept actions that we understand." If the agent presents an action that is unacceptable, for reasons the overseer doesn't understand, how do we penalize it? It's not like there are some actions for which we understand all consequences and others for which we don't---any action in practice could have lots of consequences we understand and lots we don't, and we can't rule out the existence of consequences we don't understand.
• As you observe, the agent learns facts from the training distribution, and even if the overseer has a memory there is no guarantee that they will be able to use it as effectively as the agent. Being able to look at training data in some way (I expect implicitly) is a reason that informed oversight isn't obviously impossible, but not a reason that this is a non-problem.

Comment by paulfchristiano on Techniques for optimizing worst-case performance · 2019-01-28T22:40:00.758Z · score: 2 (1 votes) · LW · GW

I agree that you probably need ensembling in addition to these techniques.

At best this technique would produce a system which has a small probability of unacceptable behavior for any input. You'd then need to combine multiple of those to get a system with negligible probability of unacceptable behavior.

I expect you often get this for free, since catastrophe either involves a bunch of different AI systems behaving unacceptably, or a single AI behaving consistently unacceptably across time.
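A minimal sketch of why combining several systems helps, assuming (and this is a strong assumption, not something the comment claims) that each copy behaves unacceptably independently with the same small probability, and that a bad outcome needs a majority of copies to misbehave. `majority_failure_probability` is a hypothetical helper.

```python
from math import comb


def majority_failure_probability(p_fail, n):
    """P(more than half of n independent copies behave unacceptably)."""
    k_min = n // 2 + 1
    return sum(
        comb(n, k) * p_fail**k * (1 - p_fail) ** (n - k)
        for k in range(k_min, n + 1)
    )


single = majority_failure_probability(0.01, 1)    # one system on its own
ensemble = majority_failure_probability(0.01, 5)  # majority of five copies
print(single, ensemble)
```

With a 1% per-copy failure rate, a majority of five copies fails with probability on the order of 10^-5. Correlated failures would break this calculation, which is why the independence assumption matters.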

Comment by paulfchristiano on Thoughts on reward engineering · 2019-01-28T22:36:22.086Z · score: 2 (1 votes) · LW · GW
I thought inner optimizers are supposed to be handled under "learning with catastrophe" / "optimizing for worst case". In particular inner optimizers would cause "malign" failures which would constitute a catastrophe which techniques for learning with catastrophe / optimizing for worst case (such as adversarial training, verification, transparency) would detect and train the agent out of.

Yes. Inner optimizers should either result in low performance on the training distribution (in which case we have a hope of training them out, though we may get stuck in a local optimum), or in manifestly unacceptable behavior on some possible inputs.

Is "informed oversight" just another name for that problem, or a particular approach to solving it?

Informed oversight is being able to figure out everything your agent knows about how good a proposed action is. This seems like a prerequisite both for RL training (if you want a reward function that incentivizes the correct behavior) and for adversarial training to avoid unacceptable behavior.

If the latter, how is it different from "transparency"?

People discuss a bunch of techniques under the heading of transparency/interpretability, and have a bunch of goals.

In the context of this sequence, transparency is relevant for both:

• Know what the agent knows, in order to evaluate its behavior.
• Figure out under what conditions the agent would behave differently, to facilitate adversarial training.

For both of those problems, it's not obvious the solution will look anything like what is normally called transparency (or what people in that field would recognize as transparency). And even if it will look like transparency, it seems worth distinguishing different goals of that research.

So that's why there is a different name.

I thought those ideas would be enough to solve the more recent motivating example for "informed oversight" that Paul gave (training an agent to defend against network attacks).

(I disagreed with this upthread. I don't think "convince the overseer that an action is good" obviously incentivizes the right behavior, even if you are allowed to offer an explanation---certainly we don't have any particular argument that it would incentivize the right behavior. It seems like informed oversight roughly captures what is needed in order for RL to create the right incentives.)

Comment by paulfchristiano on Thoughts on reward engineering · 2019-01-28T22:28:20.010Z · score: 5 (2 votes) · LW · GW
If the overseer sees the agent output an action that the overseer can't understand the rationale of, why can't the overseer just give it a low approval rating? Sure, this limits the performance of the agent to that of the overseer, but that should be fine since we can amplify the agent later?

Suppose that taking action X results in good consequences empirically, but discovering why is quite hard. (It seems plausible to me that this kind of regularity is very important for humans actually behaving intelligently.) If you don't allow actions that are good for reasons you don't understand, it seems like you can never take action X, and if the reasoning is complicated then amplification might not fix the problem until your agent is much more capable (at which point there will be more sophisticated actions Y that result in good consequences for reasons that the agent would have to be even more sophisticated to understand).

If this doesn't work for some reason, why don't we have the agent produce an explanation of the rationale for the action it proposes, and output that along with the action, and have the overseer use that as a hint to help judge how good the action is?

Two problems:

• Sometimes you need hints that help you see why an action is bad. You can take this proposal all the way to debate, though you are still left with a question about whether debate actually works.
• Agents can know things because of complicated regularities on the training data, and hints aren't enough to expose this to the overseer.

## Techniques for optimizing worst-case performance

2019-01-28T21:29:53.164Z · score: 23 (6 votes)

Comment by paulfchristiano on Thoughts on reward engineering · 2019-01-25T17:45:25.169Z · score: 6 (3 votes) · LW · GW

Happy to give more examples; if you haven't seen this newer post on informed oversight it might be helpful (and if not, I'm interested in understanding where the communication gaps are).

## Thoughts on reward engineering

2019-01-24T20:15:05.251Z · score: 29 (4 votes)

## Learning with catastrophes

2019-01-23T03:01:26.397Z · score: 26 (8 votes)

## Capability amplification

2019-01-20T07:03:27.879Z · score: 24 (7 votes)

## The reward engineering problem

2019-01-16T18:47:24.075Z · score: 23 (4 votes)

## Towards formalizing universality

2019-01-13T20:39:21.726Z · score: 29 (6 votes)

## Directions and desiderata for AI alignment

2019-01-13T07:47:13.581Z · score: 29 (6 votes)

## Ambitious vs. narrow value learning

2019-01-12T06:18:21.747Z · score: 19 (5 votes)

## AlphaGo Zero and capability amplification

2019-01-09T00:40:13.391Z · score: 25 (9 votes)

## Supervising strong learners by amplifying weak experts

2019-01-06T07:00:58.680Z · score: 28 (7 votes)

## Benign model-free RL

2018-12-02T04:10:45.205Z · score: 10 (2 votes)

## Corrigibility

2018-11-27T21:50:10.517Z · score: 39 (9 votes)

## Humans Consulting HCH

2018-11-25T23:18:55.247Z · score: 19 (3 votes)

## Approval-directed bootstrapping

2018-11-25T23:18:47.542Z · score: 19 (4 votes)

## Approval-directed agents

2018-11-22T21:15:28.956Z · score: 22 (4 votes)

## Prosaic AI alignment

2018-11-20T13:56:39.773Z · score: 36 (9 votes)

## An unaligned benchmark

2018-11-17T15:51:03.448Z · score: 27 (6 votes)

## Clarifying "AI Alignment"

2018-11-15T14:41:57.599Z · score: 54 (16 votes)

## The Steering Problem

2018-11-13T17:14:56.557Z · score: 38 (10 votes)

## Preface to the sequence on iterated amplification

2018-11-10T13:24:13.200Z · score: 39 (14 votes)

## The easy goal inference problem is still hard

2018-11-03T14:41:55.464Z · score: 38 (9 votes)

## Could we send a message to the distant future?

2018-06-09T04:27:00.544Z · score: 40 (14 votes)

## When is unaligned AI morally valuable?

2018-05-25T01:57:55.579Z · score: 97 (29 votes)

## Open question: are minimal circuits daemon-free?

2018-05-05T22:40:20.509Z · score: 110 (35 votes)

## Weird question: could we see distant aliens?

2018-04-20T06:40:18.022Z · score: 85 (25 votes)

## Implicit extortion

2018-04-13T16:33:21.503Z · score: 74 (22 votes)

## Prize for probable problems

2018-03-08T16:58:11.536Z · score: 135 (37 votes)

## Argument, intuition, and recursion

2018-03-05T01:37:36.120Z · score: 99 (29 votes)

## Funding for AI alignment research

2018-03-03T21:52:50.715Z · score: 108 (29 votes)

## Funding for independent AI alignment research

2018-03-03T21:44:44.000Z · score: 0 (0 votes)

## The abruptness of nuclear weapons

2018-02-25T17:40:35.656Z · score: 95 (35 votes)

2018-02-25T04:53:36.083Z · score: 103 (34 votes)

## Funding opportunity for AI alignment research

2017-08-27T05:23:46.000Z · score: 1 (1 votes)

## Ten small life improvements

2017-08-20T19:09:23.673Z · score: 18 (18 votes)

## Crowdsourcing moderation without sacrificing quality

2016-12-02T21:47:57.719Z · score: 15 (11 votes)

## Optimizing the news feed

2016-12-01T23:23:55.403Z · score: 9 (10 votes)

## The universal prior is malign

2016-11-30T22:31:41.000Z · score: 4 (4 votes)

## Recent AI control posts

2016-11-29T18:53:57.656Z · score: 12 (13 votes)

## My recent posts

2016-11-29T18:51:09.000Z · score: 5 (5 votes)

## If we can't lie to others, we will lie to ourselves

2016-11-26T22:29:54.990Z · score: 25 (18 votes)

## Less costly signaling

2016-11-22T21:11:06.028Z · score: 14 (16 votes)

## Control and security

2016-10-15T21:11:55.000Z · score: 3 (3 votes)

## What is up with carbon dioxide and cognition? An offer

2016-04-23T17:47:43.494Z · score: 37 (30 votes)

## Time hierarchy theorems for distributional estimation problems

2016-04-20T17:13:19.000Z · score: 2 (2 votes)

## Another toy model of the control problem

2016-01-30T01:50:12.000Z · score: 1 (1 votes)

## My current take on logical uncertainty

2016-01-29T21:17:33.000Z · score: 2 (2 votes)

## Active learning for opaque predictors

2016-01-03T21:15:28.000Z · score: 1 (1 votes)