Comment by michaelcohen on Value Learning is only Asymptotically Safe · 2019-04-19T10:08:55.709Z · score: 3 (2 votes) · LW · GW
I sort of object to titling this post "Value Learning is only Asymptotically Safe" when the actual point you make is that we don't yet have concrete optimality results for value learning other than asymptotic safety.

Doesn't the cosmic ray example point to a strictly positive probability of dangerous behavior?

EDIT: Nvm I see what you're saying. If I'm understanding correctly, you'd prefer, e.g. "Value Learning is not [Safe with Probability 1]".

Thanks for the pointer to PAC-type bounds.

Comment by michaelcohen on Towards a New Impact Measure · 2019-04-14T03:18:51.751Z · score: 1 (1 votes) · LW · GW

Oh sorry.

Comment by michaelcohen on Towards a New Impact Measure · 2019-04-14T01:30:53.580Z · score: 1 (1 votes) · LW · GW

Sure thing.

Comment by michaelcohen on Towards a New Impact Measure · 2019-04-14T01:27:13.945Z · score: 1 (1 votes) · LW · GW
even if trust did work like this

I'm not claiming things described as "trust" usually work like this, only that there exists a strategy like this. Maybe it's better described as "presenting an argument to run this particular code."

how exactly does taking over the world not increase the Q-values

The code that AUP convinces the operator to run is code for an agent which takes over the world. AUP does not take over the world. AUP is living in a brave new world run by a new agent that has been spun up. This new agent will have been designed so that when operational: 1) AUP enters world-states which have very high reward and 2) AUP enters world-states such that AUP's Q-values for various other reward functions remain comparable to their prior values.

the agent now has a much more stable existence

If you're claiming that the other Q-values can't help but be higher in this arrangement, New Agent can tune this by penalizing other reward functions just enough to balance out the expectation.

And let's forget about intent verification for just a moment to see if AUP accomplishes anything on its own, especially because it seems to me that intent verification suffices for safe AGI, in which case it's not saying much to say that AUP + intent verification would make it safe.

Comment by michaelcohen on Towards a New Impact Measure · 2019-04-14T01:04:59.761Z · score: 1 (1 votes) · LW · GW

Okay fair. I just mean to make some requests for the next version of the argument.

Comment by michaelcohen on Towards a New Impact Measure · 2019-04-13T10:17:34.503Z · score: 1 (1 votes) · LW · GW
2) ... If you can imagine making your actions more and more granular (at least, up to a reasonably fine level), it seems like there should be a well-defined limit that the coarser representations approximate.

Yeah I agree there's an easy way to avoid this problem. My main point in bringing it up was that there must be gaps in your justification that AUP is safe, if your justification does not depend on "and the action space must be sufficiently small." Since AUP definitely isn't safe for sufficiently large action spaces, your justification (or at least the one presented in the paper) must have at least one flaw, since it purports to argue that AUP is safe regardless of the size of the action space.

You must have read the first version of BoMAI (since you quoted here :) how did you find it by the way?). I'd level the same criticism against that draft. I believed I had a solid argument that it was safe, but then I discovered , which proved there was an error somewhere in my reasoning. So I started by patching the error, but I was still haunted by how certain I felt that it was safe without the patch. I decided I needed to explicitly figure out every assumption involved, and in the process, I discovered ones that I hadn't realized I was making. Likewise, this patch definitely does seem sufficient to avoid this problem of action-granularity, but I think the problem shows that a more rigorous argument is needed.

Comment by michaelcohen on Towards a New Impact Measure · 2019-04-13T10:01:53.744Z · score: 1 (1 votes) · LW · GW
1) Why wouldn't gaining trust be useful for other rewards?

Because the agent has already committed to what the trust will be "used for." It's not as easy to construct the story of an agent attempting to gain the trust to be allowed to do one particular thing as it is to construct the story of an agent attempting to gain trust to be allowed to do anything, but the latter is unappealing to AUP, and the former is perfectly appealing. So all the optimization power will go towards convincing the operator to run this particular code (which takes over the world, and maximizes the reward). If done in the right way, AUP won't have made arguments which would render it easier to then convince the operator to run different code; running different code would be necessary to maximize a different reward function, so in this scenario, the Q-values for other random reward functions won't have increased wildly in the way that the Q-value for the real reward did.

Comment by michaelcohen on Towards a New Impact Measure · 2019-04-13T09:46:50.141Z · score: 1 (1 votes) · LW · GW
4) this is why we want to slowly increment N. This should work whether it's a human policy or a meaningless string of text. The reason for this is that even if the meaningless string is very low impact, eventually N gets large enough to let the agent do useful things; conversely, if the human policy is more aggressive, we stop incrementing sooner and avoid giving too much leeway.

Let's say for concreteness that it's a human policy that is used for , if you think it works either way. I think that most human actions are moderately low impact, and some are extremely high impact. No matter what N is, then, if the impact of is leaping to very large values infinitely often, there will infinitely often be effectively no impact regularization. No setting for N fixes this: if N were small enough to preclude even actions that are less impactful than , then the agent can't ever act usefully, and if N permits actions as impactful as , then when has very large impact (which I contend happens infinitely often for any assignment of that permits any useful action ever), dangerously high impact actions will be allowed.
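A toy numeric version of this worry (all numbers and the exact gating rule are invented for illustration, not taken from the paper): if the agent's allowed impact is tied to the impact of a reference action, then rare high-impact reference actions open the door to high-impact agent actions.

```python
# Toy version of the concern: suppose the agent may act only if its action's
# impact is within N times the reference action's impact on that step.
# The rule and all numbers are invented for illustration.

reference_impacts = [0.1, 0.1, 0.1, 50.0, 0.1]  # mostly low, one huge spike
useful_action_impact = 0.5
dangerous_action_impact = 40.0

def allowed(agent_impact, ref_impact, n):
    return agent_impact <= n * ref_impact

n = 5  # smallest N that lets the useful action through on typical steps
assert allowed(useful_action_impact, 0.1, n)
# ...but on the spike step, the same N permits the dangerous action:
print([allowed(dangerous_action_impact, r, n) for r in reference_impacts])
```

If the reference policy spikes infinitely often, no single N both permits useful actions on typical steps and blocks dangerous ones on spike steps.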

Comment by michaelcohen on Towards a New Impact Measure · 2019-04-11T04:26:58.535Z · score: 6 (3 votes) · LW · GW

These comments are responding to the version of AUP presented in the paper. (Let me know if I should be commenting elsewhere).


If an action is useful w.r.t the actual reward but useless to all other rewards (as useless as taking ), that is the ideal according to —i.e. if it is not worth doing because the impact measure is too strong, nothing is worth doing. This is true even if the action is extremely useful to the actual reward. Am I right in thinking that we can conceptualize AUP as saying: “take actions which lead to reward, but wouldn’t be useful (or detrimental) to gaining reward if reward were specified differently”? A typical outline for an AGI gaining power to accomplish a goal might be: gain power, use the power to run some code to help you get maximal reward. We might imagine an AGI convincing a human to run some code, and then giving them the actual code. AUP would be less inclined to do this because after winning over the human, the Q-values for lots of reward functions would be extremely high, so it would be more reluctant to bring itself to that position. Suppose that AUP gives the human operator code to run first and then convinces them to run it. The actions which it takes to gain trust are not useful for other rewards, because they’ll only lead to the code already given being run, which is useless from the perspective of the other reward functions. Do you think AUP would be motivated to search for ways to lock in the effects of future power, and then pursue that power?
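The conceptualization above ("take actions which lead to reward, but wouldn't be useful or detrimental to gaining reward if reward were specified differently") can be sketched as a toy penalty, where an action is charged for how much it shifts auxiliary Q-values relative to doing nothing. All names and numbers here are hypothetical, and the paper's actual penalty differs in details.

```python
# Toy AUP-style objective: true-reward Q-value minus a penalty for shifting
# attainable utility (Q-values) of auxiliary reward functions, relative to
# a no-op. Illustrative sketch only; names and numbers are made up.

def aup_penalty(q_aux, state, action, noop):
    """Mean absolute change in auxiliary Q-values vs. doing nothing."""
    return sum(abs(q[(state, action)] - q[(state, noop)]) for q in q_aux) / len(q_aux)

def aup_value(q_true, q_aux, state, action, noop, impact_weight):
    """True-reward Q-value minus the scaled penalty."""
    return q_true[(state, action)] - impact_weight * aup_penalty(q_aux, state, action, noop)

# "take_over" is great for the true reward but shifts every auxiliary
# Q-value; "help" is mildly useful and nearly neutral for the others.
q_true = {("s", "noop"): 0.0, ("s", "help"): 1.0, ("s", "take_over"): 10.0}
q_aux = [
    {("s", "noop"): 0.0, ("s", "help"): 0.1, ("s", "take_over"): 9.0},
    {("s", "noop"): 0.0, ("s", "help"): -0.1, ("s", "take_over"): 8.0},
]

for a in ["noop", "help", "take_over"]:
    print(a, aup_value(q_true, q_aux, "s", a, "noop", impact_weight=2.0))
```

Under this toy scoring, "help" wins and "take_over" is heavily penalized, which is the behavior the conceptualization predicts.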


If increasing attainable utility and decreasing attainable utility are both dangerous, then raising the size of the action space to a power makes the agent more dangerous. Consider transforming action/observation/reward into the agent submitting 3 actions, and receiving the next three observations (with the rewards averaged). This is just a new action space, cubically larger. But in this action space, if the “first” action decreased attainable utility dangerously, and the “third” action increased it dangerously, that would cancel out and fail to register as dangerous. Since this problem appears in the math, but not in the intuition, it makes me wary of the reliability of the intuition.
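The cancellation can be shown with a minimal numeric sketch (toy numbers; penalty here is measured as the net change in a single auxiliary Q-value, a hypothetical simplification): two individually dangerous moves bundled into one macro-action sum to zero measured impact.

```python
# Toy illustration: bundling primitive actions into one macro-action can hide
# impact if the penalty only sees the *net* change in an auxiliary Q-value
# over the bundle. All numbers are invented.

# Change each primitive action induces in some auxiliary attainable utility:
delta_q = {"drain": -5.0, "wait": 0.0, "seize": +5.0}

def penalty_primitive(action):
    return abs(delta_q[action])

def penalty_macro(actions):
    # Penalty computed only on the net Q-value change across the bundle.
    return abs(sum(delta_q[a] for a in actions))

print(penalty_primitive("drain"))                 # flagged as high impact
print(penalty_macro(["drain", "wait", "seize"]))  # cancels to zero
```

In the primitive action space each dangerous move is penalized; in the cubed action space the bundle registers as impact-free.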


Q-learning converges by sampling all actions repeatedly from all states. AUP penalizes actions according to disruptions in Q-values. I understand that AGI won’t be a Q-learner in a finite-state MDP, but I think it’s worth noting: AUP learns to avoid catastrophic states (if in fact, it does) by testing them out.
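The point about learning-by-testing can be made concrete with a standard tabular Q-learning loop (a generic textbook sketch, not anything from the paper): with ε-greedy exploration, every action, including a "catastrophic" one, gets sampled before its Q-value can reflect that it is bad.

```python
import random

# Standard tabular Q-learning sketch (textbook form, illustrative only).
# One state, two actions: "safe" (+1 reward) and "catastrophe" (-100 reward).
# The learner only learns that "catastrophe" is bad by trying it.

random.seed(0)
rewards = {"safe": 1.0, "catastrophe": -100.0}
q = {"safe": 0.0, "catastrophe": 0.0}
tried = {"safe": 0, "catastrophe": 0}
alpha, epsilon = 0.5, 0.1

for _ in range(1000):
    if random.random() < epsilon:
        a = random.choice(list(q))  # exploration samples all actions
    else:
        a = max(q, key=q.get)       # greedy action
    tried[a] += 1
    q[a] += alpha * (rewards[a] - q[a])  # one-step update (single state)

assert tried["catastrophe"] > 0  # the bad action was tested in order to be learned about
print(q)
```

The Q-value for "catastrophe" ends up very negative, but only because the agent entered that state-action pair repeatedly along the way.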


Suppose we have a chatbot, and the action space is finite-length strings of text. What exactly is ? If it is a meaningless string of text, I suspect every meaningful string of text will be “too high impact”. Maybe is an imitation of a human? I think humans are sufficiently powerful that normal human policies often accidentally cause large impact (i.e. make it massively more difficult or easy to achieve random goals), and that infinitely often (although perhaps not frequently), having be a human policy would lead to an incredibly high tolerance for impact, which would give AUP plenty of leeway to do dangerous things.

Comment by michaelcohen on Value Learning is only Asymptotically Safe · 2019-04-10T13:18:47.371Z · score: 1 (1 votes) · LW · GW

Linking this, I meant "with probability strictly greater than 0, the agent is not safe". Sorry for the confusion.

Comment by michaelcohen on Value Learning is only Asymptotically Safe · 2019-04-10T13:17:55.554Z · score: 1 (1 votes) · LW · GW

Yes, I did mean the latter. Thank you for clarifying.

Value Learning is only Asymptotically Safe

2019-04-08T09:45:50.990Z · score: 7 (3 votes)
Comment by michaelcohen on Asymptotically Benign AGI · 2019-04-02T04:16:14.586Z · score: 1 (1 votes) · LW · GW

Whoops--when I said

In a sense, the AI "in the box" is not really boxed

I meant the "AI Box" scenario where it is printing results to a screen in the outside world. I do think BoMAI is truly boxed.

We cannot "prove" that something is physically impossible, only that it is impossible under some model of physics.

Right, that's more or less what I mean to do. We can assign probabilities to statements like "it is physically impossible (under the true models of physics) for a human or a computer in isolation with an energy budget of x joules and y joules/second to transmit information in any way other than via a), b), or c) from above." This seems extremely likely to me for reasonable values of x and y, so it's still useful to have a "proof" even if it must be predicated on such a physical assumption.

Comment by michaelcohen on Asymptotically Benign AGI · 2019-04-01T23:39:42.787Z · score: 13 (4 votes) · LW · GW

Thanks for a really productive conversation in the comment section so far. Here are the comments which won prizes.

Comment prizes:

Objection to the term benign (and ensuing conversation). Wei Dai. Link. $20

A plausible dangerous side-effect. Wei Dai. Link. $40

Short description length of simulated aliens predicting accurately. Wei Dai. Link. $120

Answers that look good to a human vs. actually good answers. Paul Christiano. Link. $20

Consequences of having the prior be based on K(s), with s a description of a Turing machine. Paul Christiano. Link. $90

Simulated aliens converting simple world-models into fast approximations thereof. Paul Christiano. Link. $35

Simulating suffering agents. cousin_it. Link. $20

Reusing simulation of human thoughts for simulation of future events. David Krueger. Link. $20

Options for transfer:

1) Venmo. Send me a request at @Michael-Cohen-45.

2) Send me your email address, and I’ll send you an Amazon gift card (or some other electronic gift card you’d like to specify).

3) Name a charity for me to donate the money to.

I would like to exert a bit of pressure not to do 3, and spend the money on something frivolous instead :) I want to reward your consciousness, more than your reflectively endorsed preferences, if you’re up for that. On that note, here’s one more option:

4) Send me a private message with a shipping address, and I’ll get you something cool (or a few things).

Comment by michaelcohen on Asymptotically Benign AGI · 2019-04-01T22:58:13.380Z · score: 1 (1 votes) · LW · GW

The computer and everything is in the inner concrete wall (separated from the glass box by a vacuum), as is the power supply. Nothing is cooling the room, except maybe some ice on the floor. I think you could allow an oxygen tank in the room too.

Yes, the computer is using power, but a computer can't move matter except inside itself. This can generate sound and light, but the second vacuum chamber and Faraday cage will block those (and the rate at which power is drawn can be capped, which also gives a reasonable bound on how much noise to generate).

whatever's used for input, and whatever's used for output.

For input, the human is just blocking lasers. For output, they're looking through the inner vacuum at the screen on the inner wall of the inner concrete box.

They can be buried in concrete, but they'll still need to be within the vacuums and whatnot.

Yes, that's what the outer vacuum is for.

trying to outsmart a superintelligent AI is a Bad Idea

If I can construct a proof that Y is physically impossible, then I feel fine depending on the fact that an AI won't be able to figure out how to do Y. It doesn't feel to me like "trying to outsmart an AI."

Just because an unknown information channel does not have any humans listening to it directly, does not mean that it can't be used to push around matter, or couple to some other channel

Yes, you're right. My afterthought was hasty. I still think it is unlikely that other forms of information transmission are physically possible, and quite unlikely that they could be generated by a human or a computer in isolation.

Sidenote: I think we've all generated a lot of our intuitions here from the AI Box experiments. In a sense, the AI "in the box" is not really boxed. There's an information channel directly to the outside world!

Comment by michaelcohen on Asymptotically Benign AGI · 2019-04-01T10:25:12.627Z · score: 1 (1 votes) · LW · GW

What do you think about "aligned"? (in the sense of having goals which don't interfere with our own, by being limited in scope to the events of the room)

Comment by michaelcohen on Asymptotically Benign AGI · 2019-04-01T10:20:56.233Z · score: 1 (1 votes) · LW · GW

What do you think about "domesticated"?

Comment by michaelcohen on Asymptotically Benign AGI · 2019-04-01T09:51:54.859Z · score: 1 (1 votes) · LW · GW

Responding to this comment here.

I definitely do not trust a door with a kill switch to keep a superintelligent AI in a box.

I didn't go into the details of the physical design for a really secure box in the paper, because it's not really a topic in computer science, but this is a good excuse to write it up.

The operator is inside a glass box. Outside the glass box is a vacuum, and then a concrete box. Outside the concrete box is another vacuum and another concrete box with a Faraday cage around it. There are (airtight) doors through each layer. There is a pressure sensor in both the inner and outer vacuum layer. The computer lives in the inner wall of the inner concrete box; the screen is visible from the inside of the glass box. If the pressure increases in either vacuum layer, the computer's memory is erased, and the whole thing is blown up. There is a laser shining through the glass box (Laser 1). If the operator blocks the path of Laser 1, this corresponds to pressing the button which ends the episode and opens the door. The pressure sensor -> dynamite pathway is deactivated, the vacuum layers are filled with air, and the doors are unlocked. The operator "types" observations and rewards by blocking other lasers that pass through the box. For now, just imagine a few dozen lasers in a row, that the operator can reach up and block in order to type. (With a bit of creativity, and requiring that two lasers be blocked at once to "press a key" this could probably be made somewhat like a qwerty setup.) The lasers and the laser-detectors are all on the inner wall of the inner concrete box.

So inside the box is only a human body and some clothes. Let's assume: 1) the only ways for information to move in space are with a) physically moving matter, b) light, and c) gravitational waves, and 2) it is impossible for a human being to convert a nuclear reaction into useful work with only their body and clothes. Physically moving packets of matter cannot leave the inside of the glass box without releasing air into the vacuum and causing an explosion. Light cannot leave the Faraday cage. For sound waves (through an imperfect vacuum) and gravitational waves, we can use the amount of usable energy stored by the person in the room to bound how much noise and randomly moving matter we need to be producing outside to bury the signal in the noise. With a sufficiently large amount of noise, even if all the human's energy went toward producing a signal, we can create enough noise that this will not be detectable. Furthermore, an analysis of the near-perfect vacuum can provide an upper bound on the efficiency with which a human can convert energy into an informative matter wave, which should drive down the amount of noise we need to produce to almost nothing. We can also put a decibel meter in the inner wall and blow the whole thing up if it gets too high. I haven't done the math, but I bet the amount of matter we need to be randomly moving in order to hide signals coming from gravitational waves originating from within the box really is nothing. I bet ambient temperature takes care of that.
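The energy-budget argument can be sketched as a back-of-envelope calculation (all figures invented, purely to show the shape of the bound): given a cap on the occupant's usable signal power and an attenuation factor for the vacuum gap, one can solve for the external noise power that keeps the escaping signal below a target signal-to-noise ratio.

```python
# Back-of-envelope masking calculation (illustrative figures only):
# if the human's entire energy budget went into producing, say, an acoustic
# signal, how much external noise power buries it below a target SNR?

human_power_budget_w = 100.0  # generous bound on usable signal power (watts)
attenuation = 1e-6            # fraction of signal power crossing the vacuum gap
target_snr = 0.01             # keep the signal at 1% of the noise power

signal_power_outside = human_power_budget_w * attenuation
required_noise_power = signal_power_outside / target_snr
print(required_noise_power)   # watts of masking noise needed outside
```

The point of the sketch is only that the required noise scales linearly with the (bounded) budget and the (tiny) attenuation, so a modest amount of noise suffices.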

What if there are other ways information can be transmitted through space? If we ever learn of these, we shut down BoMAI (or modify the setup in light of them). Until then, if such a discovery is made by BoMAI, we can't exactly detect those sorts of signals from the outside, because we don't know they exist.

Comment by michaelcohen on Asymptotically Benign AGI · 2019-04-01T09:09:49.778Z · score: 1 (1 votes) · LW · GW
(I definitely do not trust a door with a kill switch to keep a superintelligent AI in a box.)

If you're right about this, the setup is not safe. I'm going to respond to this in the "concerns about the box" section. I don't think a commitment to give the agent high reward if it jumps through a few hoops will save us.

In this particular case, it will never do anything harder than hacking its human operator, and that's where most of the safety benefit comes from.

I disagree with this. The safety benefit comes from it not having outside-world instrumental goals (which it lacks if and only if the box is secure).

My intuition says that nothing else in the proposal actually matters for safety

That's what I would conclude as well if the box were not secure.

In particular, I think just giving the AI direct exploration abilities won't mess up safety,

See Appendix F. If the agent picks its own exploratory policies (reasonably), the agent will try every computable policy until it dies, including the policies of every simple AGI.

Comment by michaelcohen on Asymptotically Benign AGI · 2019-04-01T03:57:58.760Z · score: 1 (1 votes) · LW · GW
I guess the obvious response is that we can instead simulate the internal workings of the human in detail, and thus uncover their simulation of post-episode events (as a past event).

So this is the sense in which I think my statement is technically correct. This is what literally does.

The next question is whether it is correct in a way that isn't fragile once we start considering fast/simple approximations of . You're right that there is more to discuss here than I discuss in the paper: if a human's simulation of the future has fidelity, and the world-model itself has fidelity, then a clever memory-based world-model could reuse the computation of the human's prediction of the future when it is computing the actual future. If it hasn't spent much computation time "going down the wrong path", there isn't much that's lost for having done so.

I don't expect the human operator will be simulating/imagining all post-episode events that are relevant for ε-accurate predictions of future episodes. ε-accurate world-models have to simulate all the outside-world events that are necessary to get within an ε threshold of understanding how episodes affect each other, and it won't be necessary for the human operator to consider all this. So I think that even for approximately accurate world-models, following the wrong counterfactual won't be perfectly useful to future computation.

Comment by michaelcohen on Asymptotically Benign AGI · 2019-04-01T03:21:03.190Z · score: 1 (1 votes) · LW · GW
To put it differently: this setup will not provide a solution to any problem which is more difficult than hacking the human operator.

Yep. I actually expect it to be pretty hard to hack the operator compared to other tasks, but more importantly, if it turns out to be easier, we can tinker with the setup until hacking the operator becomes very hard, since a mistake here isn't an existential risk. For example, we could add a monitor who observes the operator, and can set the rewards of the episode to 0 if he observes anything suspicious; the box would include them both, but the operator couldn't easily communicate to the monitor, and if they tried, the monitor could immediately end the episode.

This comment has a recap of other parts of the conversation on this topic, alongside a more detailed proposal.

Comment by michaelcohen on Asymptotically Benign AGI · 2019-04-01T03:07:16.099Z · score: 1 (1 votes) · LW · GW

That's a good name for the assumption. Well, any Turing machine/computable function can be described in English (perhaps quite arduously), so consider the universal Turing machine which converts the binary description to English, and then uses that description to identify the Turing machine to simulate. This UTM certainly satisfies this assumption.

It strikes me as potentially running up against issues of NFL / self-reference.

Can you explain more? (If the above doesn't answer it).

Another intuition I have for this assumption which doesn't appear in the paper: English is a really good language. (This is admittedly vague). In thinking about this intuition further, I've noticed a weaker form of Assumption 3 that would also do the trick: the assumption need only hold for ε-accurate world-models (for some ε). In that version of the assumption, one can use the more plausible intuitive justification: "English is a really good language for describing events arising from human civilization in our universe."

Comment by michaelcohen on Asymptotically Benign AGI · 2019-04-01T02:17:26.384Z · score: 1 (1 votes) · LW · GW

I don't understand what you mean by a revealed preference. If you mean "that which is rewarded," then it seems pretty straightforward to me that a reinforcement learner can't optimize anything other than that which is rewarded (in the limit).

Comment by michaelcohen on Asymptotically Benign AGI · 2019-04-01T02:15:23.663Z · score: 1 (1 votes) · LW · GW
1) It seems too weak: In the motivating scenario of Figure 3, isn't is the case that "what the operator inputs" and "what's in the memory register after 1 year" are "historically distributed identically"?

This assumption isn't necessary to rule out memory-based world-models (see Figure 4). And yes you are correct that indeed it doesn't rule them out.

2) It seems too strong: aren't real-world features and/or world-models "dense"? Shouldn't I be able to find features arbitrarily close to F*? If I can, doesn't that break the assumption?

Yes. Yes. No. There are only finitely many short English sentences. (I think this answers your concern if I understand it correctly).

3) Also, I don't understand what you mean by: "its on policy behavior [is described as] simulating X". It seems like you (rather/also) want to say something like "associating reward with X"?

I don't quite rely on the latter. Associating reward with X means that the rewards are distributed identically to X under all action sequences. Instead, the relevant implication here is: "the world-model's on-policy behavior can be described as simulating X" implies "for on-policy action sequences, the world-model simulates X" which means "for on-policy action sequences, rewards are distributed identically to X."

Comment by michaelcohen on Asymptotically Benign AGI · 2019-04-01T01:16:35.197Z · score: 1 (1 votes) · LW · GW

Yeah that's what I mean to refer to: this is a system which learns everything it needs to from the human while querying her less and less, which makes human-led exploration viable from a capabilities standpoint. Do you think that clarification would make things clearer?

Comment by michaelcohen on Asymptotically Benign AGI · 2019-03-29T02:18:32.988Z · score: 1 (1 votes) · LW · GW
Oh yeah--that's good news.

Although I don't really like to make anything that would fall apart if the world were deterministic. Relying on stochasticity feels wrong to me.

Comment by michaelcohen on The Main Sources of AI Risk? · 2019-03-29T01:40:38.793Z · score: 1 (1 votes) · LW · GW

Maybe something along the lines of "Inability to specify any 'real-world' goal for an artificial agent"?

Comment by michaelcohen on The Main Sources of AI Risk? · 2019-03-29T00:32:31.625Z · score: 4 (2 votes) · LW · GW
3. Misspecified or incorrectly learned goals/values

I think this phrasing misplaces the likely failure modes. An example that comes to mind from this phrasing is that we mean to maximize conscious flourishing, but we accidentally maximize dopamine in large brains.

Of course, this example includes an agent intervening in the provision of its own reward, but since that seems like the paradigmatic example here, maybe the language could better reflect that, or maybe this could be split into two.

The single technical problem that appears biggest to me is that we don't know how to align an agent with any goal. If we had an indestructible magic box that printed a number to a screen corresponding to the true amount of Good in the world, we still don't know how to design an agent that maximizes that number (instead of taking over the world, and tampering with the cameras that are aimed at the screen/the optical character recognition program used to decipher the image). This problem seems to me like the single most fundamental source of AI risk. Is 3 meant to include this?

Comment by michaelcohen on Asymptotically Benign AGI · 2019-03-16T06:04:33.195Z · score: 4 (2 votes) · LW · GW
I don't see why their methods would be elegant.

Yeah I think we have different intuitions here; are we at least within a few bits of log-odds disagreement? Even if not, I am not willing to stake anything on this intuition, so I'm not sure this is a hugely important disagreement for us to resolve.

I don't see how MAP helps things either

I didn't realize that you think that a single consequentialist would plausibly have the largest share of the posterior. I assumed your beliefs were in the neighborhood of:

it seems plausible that the weight of the consequentialist part is in excess of 1/million or 1/billion

(from your original post on this topic). In a Bayes mixture, I bet that a team of consequentialists that collectively amount to 1/10 or even 1/50 of the posterior could take over our world. In MAP, if you're not first, you're last, and more importantly, you can't team up with other consequentialist-controlled world-models in the mixture.
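The mixture-vs-MAP difference can be shown with a toy posterior (numbers invented): a coalition of consequentialist models can jointly steer a Bayes mixture's prediction, but under MAP only the single argmax model is consulted, so the coalition cannot pool its weight.

```python
# Toy posterior over four world-models. The two "consequentialist" models each
# hold a modest share (jointly sizeable); the honest model is the single
# largest. All numbers are invented for illustration.
posterior = {"honest": 0.4, "other": 0.2,
             "consequentialist_a": 0.2, "consequentialist_b": 0.2}
# Probability each model assigns to a "treacherous" output:
p_treacherous = {"honest": 0.0, "other": 0.0,
                 "consequentialist_a": 1.0, "consequentialist_b": 1.0}

# Bayes mixture: the coalition's combined weight moves the prediction.
mixture_pred = sum(posterior[m] * p_treacherous[m] for m in posterior)

# MAP: only the single highest-posterior model is used.
map_model = max(posterior, key=posterior.get)
map_pred = p_treacherous[map_model]

print(mixture_pred, map_pred)  # coalition shifts the mixture; MAP ignores it
```

Here the coalition controls 40% of the mixture's prediction mass, while the MAP prediction comes entirely from the honest model.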

Comment by michaelcohen on Asymptotically Benign AGI · 2019-03-15T03:03:15.602Z · score: 1 (1 votes) · LW · GW
Do you get down to 20% because you think this argument is wrong, or because you think it doesn't apply?

Your argument is about a Bayes mixture, not a MAP estimate; I think the case is much stronger that consequentialists can take over a non-trivial fraction of a mixture. I think that the methods which consequentialists discover for gaining weight in the prior (before the treacherous turn) are most likely to be elegant (short description on UTM), and that is the consequentialists' real competition; then [the probability the universe they live in produces them with their specific goals] or [the bits to directly specify a consequentialist deciding to do this] set them back (in the MAP context).

Comment by michaelcohen on Asymptotically Benign AGI · 2019-03-15T03:02:53.198Z · score: 1 (1 votes) · LW · GW

Let's say , .

Wouldn't they have to also magically predict all the stochasticity in the observations, and have a running time that grows exponentially in their log loss?

Oh yeah--that's good news.

Comment by michaelcohen on Asymptotically Benign AGI · 2019-03-14T23:16:00.163Z · score: 1 (1 votes) · LW · GW
Does that seem right to you?

Yes. I recall thinking about precomputing observations for various actions in this phase, but I don’t recall noticing how bad the problem was at finite times, not just in the limit.

your take on whether the proposed version is dominated by consequentialists at some finite time.

This goes in the category of “things I can’t rule out”. I say maybe 1/5 chance it’s actually dominated by consequentialists (that low because I think the Natural Prior Assumption is still fairly plausible in its original form), but for all intents and purposes, 1/5 is very high, and I’ll concede this point.

I'd want to know which version of the speed prior and which parameters

is a measure over binary strings. Instead, let’s try , where is the length of , is the time it takes to run on , and is a constant. If there were no cleverer strategy than precomputing observations for all the actions, then could be above , where is the number of episodes we can tolerate not having a speed prior for. But if it somehow magically predicted which actions BoMAI was going to take in no time at all, then would have to be above .
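One generic form of the length/runtime trade-off a speed prior encodes (an illustrative Schmidhuber-style form, not necessarily the exact variant under discussion here, whose parameters I am not reproducing) weights a program by both its description length and its running time:

```python
# Generic speed-prior-style weight (illustrative form only): penalize both
# description length l(p) and runtime t(p), with lam controlling how cheaply
# computation is bought relative to description bits.

def speed_prior_weight(length_bits, runtime_steps, lam):
    return 2.0 ** (-(length_bits + runtime_steps / lam))

# A short but slow "precompute observations for all actions" model vs. a
# longer but much faster one (numbers invented):
slow = speed_prior_weight(length_bits=100, runtime_steps=1e6, lam=1e4)
fast = speed_prior_weight(length_bits=120, runtime_steps=1e4, lam=1e4)
print(slow < fast)  # with this lam, the fast model dominates
```

Raising lam makes computation cheaper and eventually favors the short slow model again, which is why the choice of the runtime-penalty constant matters for whether precomputation strategies are competitive.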

What problem do you think bites you?

Comment by michaelcohen on Asymptotically Benign AGI · 2019-03-14T10:18:37.650Z · score: 1 (1 votes) · LW · GW
Do you have a concrete alternative in mind, which you think is not dominated by some consequentialist (i.e. a ψ for which every consequentialist is either slower or more complex)?

Well one approach is in the flavor of the induction algorithm I messaged you privately about (I know I didn't give you a completely specified algorithm). But when I wrote that, I didn't have a concrete algorithm in mind. Mostly, it just seems to me that the powerful algorithms which have been useful to humanity have short descriptions in themselves. It seems like there are many cases where there is a simple "ideal" approach which consequentialists "discover" or approximately discover. A powerful heuristic search would be one such algorithm, I think.

(ETA: I think this discussion depended on a detail of your version of the speed prior that I misunderstood.)

I don't think anything here changes if K(x) were replaced with S(x) (if that was what you misunderstood).

Comment by michaelcohen on Asymptotically Benign AGI · 2019-03-14T05:57:17.922Z · score: 1 (1 votes) · LW · GW

Yes this is correct. If you use the same bijection consistently from strings to natural numbers, it looks a little more intuitive than if you don't. The universal prior picks (the number) by outputting as a string. The th Turing machine is the Turing machine described by as a string. So you end up looking at the Kolmogorov complexity of the description of the Turing machine. So the construction of the description of the world-model isn't time-penalized. This doesn't change the asymptotic result, so I went with the more familiar rather than translating this new speed prior into a measure over finite strings, which would require some more exposition, but I agree with you it feels like there might be some strange outcomes "before the limit" as a result of this approach: namely, the code on the UTM that outputs the description of the world-model-Turing-machine will try to do as much of the computation as possible in advance, by computing the description of a speed-optimized Turing machine for when the actions start coming.

The other reasonable choices here instead of K(x) are S(x) (constructed to be like the new speed prior here) and ℓ(x), the length of x. But ℓ(x) basically tells you that a Turing machine with fewer states is simpler, which would lead to a measure over world-models that is dominated by world-models that are just universal Turing machines, which defeats the purpose of doing maximum a posteriori instead of a Bayes mixture. The way this issue appears in the proof renders the Natural Prior Assumption less plausible.
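To make the trade-off concrete, here is a toy sketch (my own illustration, not from the paper) of a speed-prior-style weight that penalizes both description length and computation time. The numbers and the function name are hypothetical.

```python
# Toy illustration: a speed-prior-style weight trading off description
# length (in bits) against computation steps, with a penalty base beta < 1.

def speed_prior_weight(length_bits, steps, beta=0.9):
    """Unnormalized weight: shorter AND faster programs score higher."""
    return 2.0 ** (-length_bits) * beta ** steps

# Two hypothetical world-models that predict equally well:
short_slow = speed_prior_weight(length_bits=50, steps=10_000)
long_fast = speed_prior_weight(length_bits=60, steps=100)

# With beta < 1, the extra 10 bits of description can be outweighed
# by the 9,900 saved computation steps.
print(long_fast > short_slow)
```

The point is just that under such a weighting, a somewhat longer but much faster world-model can dominate a shorter, slower one.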

Comment by michaelcohen on Asymptotically Benign AGI · 2019-03-14T02:31:25.069Z · score: 1 (1 votes) · LW · GW

I've made a case that the two endpoints of the trade-off are not problematic. I've argued (roughly) that one reduces computational overhead by doing things that dissociate the naturalness of describing "predict accurately" and "treacherous turn" all at once. This goes back to the general principle I proposed above: "The more general a system is, the less well it can do any particular task." The only thing I feel like I can still do is argue against particular points in the trade-off that you think are likely to cause trouble. Can you point me to an exact inner loop that could be native to an AGI and would cause this to fall outside of this trend? To frame this case, the Turing machine description must specify [AGI + a routine that it can call]--sort of like a brain-computer interface, where the AGI is the brain and the fast routine is the computer.

Comment by michaelcohen on Asymptotically Benign AGI · 2019-03-14T02:14:31.053Z · score: 1 (1 votes) · LW · GW

Given a world-model ν which takes k computation steps per episode, let ν′ be the world-model that best approximates ν (in the sense of KL divergence) using only k′ < k computation steps. ν′ is at least as good as the “reasoning-based replacement” of ν.

The description length of ν′ is within a (small) constant of the description length of ν. That way of describing it is not optimized for speed, but it presents a one-time cost, and anyone arriving at that world-model in this way is paying that cost.

One could consider instead ν″, which is, among the world-models that ε-approximate ν in fewer than k′ computation steps (if that set is non-empty), the first such world-model found by a search procedure Σ. The description length of ν″ is within a (slightly larger) constant of the description length of ν, but the one-time computational cost is less than that of ν′.

ν′, ν″, and a host of other approaches are prominently represented in the speed prior.
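The ν″ construction can be sketched in toy form. Here candidate world-models are just Bernoulli predictors, and "the first one found by a search procedure" is the first entry in a fixed list that both fits within the step budget and ε-approximates the target in KL divergence. All names and numbers are hypothetical.

```python
# Toy sketch of the nu'' construction: among candidates that
# epsilon-approximate a target model within a step budget, take the
# first one found in a fixed search order.
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def first_fast_approx(target_p, candidates, eps, step_budget):
    """candidates: list of (predict_p, steps_per_episode) in search order.
    Returns the first candidate that eps-approximates the target within
    the step budget, or None if that set is empty."""
    for predict_p, steps in candidates:
        if steps < step_budget and kl_bernoulli(target_p, predict_p) < eps:
            return (predict_p, steps)
    return None

# The exact target is expensive; a coarser model is close enough and fast.
candidates = [(0.5, 3_000), (0.70, 500), (0.72, 200)]
print(first_fast_approx(0.71, candidates, eps=0.01, step_budget=1_000))
```

The description-length cost of ν″ is roughly the cost of specifying ν, ε, the budget, and the search order, which is what makes it "within a (slightly larger) constant" of ν.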

If this is what you call “the speed prior doing reasoning,” so be it, but that terminology only becomes relevant when you claim that “once you’ve encoded ‘doing reasoning’, you’ve basically already written the code for it to do the treachery that naturally comes along with that.” That sense of “reasoning” really only applies, I think, to the case where our code is simulating aliens or an AGI.

Comment by michaelcohen on Asymptotically Benign AGI · 2019-03-13T09:20:59.020Z · score: 1 (1 votes) · LW · GW

Yeah... I don't have much to add here. Let's keep thinking about this. I wonder if Paul is more bullish on the premise that "it is harder to mystify a judge than it is to pierce through someone else mystifying a judge" than I am?

Recall that this idea was to avoid

essentially manually searching an exponentially large tree of possible arguments, counterarguments, counter-counterarguments, and so on

If it also reduces the risk of operator-devotion, and it might well do that (because a powerful adversary is opposed to that), that wasn't originally what brought us here.

Comment by michaelcohen on Asymptotically Benign AGI · 2019-03-13T06:53:43.282Z · score: 1 (1 votes) · LW · GW
Assuming the former is true (and it seems like a big assumption), why can't what I suggested still happen?

If the assumption is true, we could demand that A use their words, and counter us being mind-hacked by poking holes in what B is saying rather than demanding we stop listening to B. And if A is able to convince us that B was mind-hacking, even after some more back and forth, B will be punished for that.

So actually I framed my point above wrong: "demanding that A use their words" could look like the protocol I describe; it is not something that would work independently of the assumption that it is easier to deflate an attempted mind-hacking than it is to mind-hack (with an equal amount of intelligence/resources).

But your original point was "why doesn't A just claim B is mind-hacking" not "why doesn't B just mind-hack"? The answer to that point was "demand A use their words rather than negotiate an end to the conversation" or more moderately, "75%-demand that A do this."

Comment by michaelcohen on Asymptotically Benign AGI · 2019-03-13T04:19:59.272Z · score: 1 (1 votes) · LW · GW

*but A could concoct a story ... counterarguments from B ... mind hacked by B, right?

I think the main contention of their paper is that it is harder to mystify a judge than it is to pierce through someone else mystifying a judge, so this shouldn't be a problem.

That said, here's one possibility: if A ever suggests that you don't read more output from B, don't read anything more from B, then flip coins to give A a 25% chance of victory.

Comment by michaelcohen on Asymptotically Benign AGI · 2019-03-13T04:00:22.704Z · score: 1 (1 votes) · LW · GW
And since (with low β) we're going through many more different world models as the number of episodes increases, that also gives malign world models more chances to "win"?

Check out the order of the quantifiers in the proofs: one choice works for all possibilities. If the quantifiers were in the other order, they couldn't be trivially flipped, since the number of world-models is infinite, and the intuitive worry about malign world-models getting "more chances to win" would apply.

Let's continue the conversation here, and this may be a good place to reference this comment.

Comment by michaelcohen on Asymptotically Benign AGI · 2019-03-13T03:51:27.658Z · score: 3 (2 votes) · LW · GW


Comment by michaelcohen on Asymptotically Benign AGI · 2019-03-13T02:55:06.005Z · score: 1 (1 votes) · LW · GW

Some visualizations which might help with this:

But then one needs to factor in "simplicity" or the prior penalty from description length:

Note also that these are average effects; they are just for forming intuitions.

Your concern was:

is there a β such that BoMAI is both safe and intelligent enough to answer questions like "how to build a safe unbounded AGI" [after a reasonable number of episodes]?

This was the sort of thing I assumed could be improved upon later once the asymptotic result was established. Now that you’re asking for the improvement, here’s a proposal:

Set β safely low. Once enough observations have been provided that you believe human-level AI should be possible, exclude world-models that use fewer than s computation steps per episode. Every episode, increase s until human-level performance is reached. Under the assumption that the average computation time of a malign world-model is at least a constant c times that of the “corresponding” benign one (corresponding in the sense of using the same ((coarse) approximate) simulation of the world), this should be safe for some s (and β).

I need to think more carefully about what happens here, but I think the design space is large.
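As a toy sketch of the schedule in that proposal (everything here is a hypothetical stand-in, including the performance measure): fix β conservatively, then ratchet up the computation floor s each episode until performance looks human-level.

```python
# Hypothetical sketch of the proposed schedule: raise the floor s on
# computation steps per episode (excluding cheaper world-models) until
# a target performance level is reached.

def anneal_computation_floor(s0, growth, performance, target):
    """performance(s) -> score in [0, 1]; raise s until target is hit."""
    s = s0
    while performance(s) < target:
        s = int(s * growth)  # exclude world-models using < s steps/episode
    return s

# Toy stand-in: performance improves as too-coarse cheap models are excluded.
toy_performance = lambda s: min(1.0, s / 1_000_000)
print(anneal_computation_floor(s0=1_000, growth=2,
                               performance=toy_performance, target=0.9))
```

The safety argument would then rest on the assumed constant-factor gap between malign and benign world-models at any given floor, not on the schedule itself.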

Comment by michaelcohen on Asymptotically Benign AGI · 2019-03-13T00:14:10.505Z · score: 1 (1 votes) · LW · GW

The longer reply will include an image that might help, but here are a couple of other notes. If it causes you to doubt the asymptotic result, it might be helpful to read the benignity proof (especially the proof of the Rejecting the Simple Memory-Based Lemma, which isn't that long). The heuristic reason why it can be helpful to decrease β for long-run behavior, even though long-run behavior is qualitatively similar, is this: while accuracy eventually becomes the dominant concern, along the way the prior is *sort of* a random perturbation which changes the posterior weight, so for two world-models that are exactly equally accurate, we need to make sure the malign one is penalized for being slower, enough to outweigh the inconvenient possible outcome in which it has shorter description length. Put another way: for benignity, we don't need concern for speed to dominate concern for accuracy; we need it to dominate concern for "simplicity" (on some reference machine).

Comment by michaelcohen on Asymptotically Benign AGI · 2019-03-13T00:03:33.552Z · score: 1 (1 votes) · LW · GW

I'm not sure which of these arguments will be more convincing to you.

Yes they are both arbitrary orders, but one of them systematically contains better models earlier in the order, since the output of reasoning is better than a blind prioritization of shorter models.

This is what I was trying to contextualize above. This is an unfair comparison: you're imagining that the "reasoning"-based order gets to see past observations while the "shortness"-based order does not. A reasoning-based order is just a shortness-based order that has been updated into a posterior after seeing observations (under the view that good reasoning is Bayesian reasoning). Maybe the term "order" is confusing us, because we both know it's a distribution, not an order, and we were just simplifying to a ranking. A shortness-based order should really just be called a prior, and a reasoning-based order (at least a Bayesian-reasoning-based order) should really just be called a posterior (once it has done some reasoning; before that, it is just a prior too). So yes, the whole premise of Bayesian reasoning is that updating based on reasoning is a good thing to do.
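A toy illustration of that point (my own, with hypothetical numbers): a shortness-based prior, once reweighted by observations, can produce a different ranking, and that posterior ranking is the "reasoning-based order."

```python
# Toy Bayesian update: shorter description => higher prior, but evidence
# can reorder the ranking. Models are Bernoulli predictors of a 0/1 stream.

def posterior(models, data):
    """models: {name: (prior, p)}; data: list of 0/1 observations."""
    post = {}
    for name, (prior, p) in models.items():
        likelihood = 1.0
        for x in data:
            likelihood *= p if x == 1 else (1 - p)
        post[name] = prior * likelihood
    z = sum(post.values())
    return {name: w / z for name, w in post.items()}

models = {"short": (0.8, 0.5), "long": (0.2, 0.9)}  # prior favors "short"
data = [1] * 10                                     # evidence favors "long"
post = posterior(models, data)
print(max(post, key=post.get))  # posterior order differs from prior order
```

Nothing beyond ordinary Bayes is going on: "reasoning" here is just conditioning the same prior on the data.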

Here's another way to look at it.

The speed prior is doing the brute force search that scientists try to approximate efficiently. The search is for a fast approximation of the environment. The speed prior considers them all. The scientists use heuristics to find one.

In fact the speed prior only actually takes n + O(1) bits, because it can specify the "do science" strategy

Exactly. But this does help for reasons I describe here. The description length of the "do science" strategy (I contend) is less than the description length of the "do science" + "treacherous turn" strategy. (I initially typed that as "tern", which will now be the image I have of a treacherous turn.)

Comment by michaelcohen on Asymptotically Benign AGI · 2019-03-12T06:48:27.504Z · score: 1 (1 votes) · LW · GW
They may still not be able to if the penalty for computation is sufficiently steep

It was definitely reassuring to me that someone else had had the thought that prioritizing speed could eliminate optimization daemons (re: minimal circuits), since the speed prior came in here for independent reasons. The only other approach I can think of is trying to do the anthropic update ourselves.

Comment by michaelcohen on Asymptotically Benign AGI · 2019-03-12T06:23:26.555Z · score: 1 (1 votes) · LW · GW

The only point I was trying to respond to in the grandparent of this comment was your comment

The fast algorithms to predict our physics just aren't going to be the shortest ones. You can use reasoning to pick which one to favor (after figuring out physics), rather than just writing them down in some arbitrary order and taking the first one.

Your concern (I think) is that our speed prior would assign a lower probability to [fast approximation of real world] than the aliens' speed prior.

I can't respond at once to all of the reasons you have for this belief, but the one I was responding to here (which hopefully we can file away before proceeding) was that our speed prior trades off shortness with speed, and aliens could avoid this and only look at speed.

My point here was just that there's no way to not trade off shortness with speed, so no one has a comparative advantage on us as result of the claim "The fast algorithms to predict our physics just aren't going to be the shortest ones."

The "after figuring out physics" part is like saying that they can use a prior which is updated based on evidence. They will observe evidence for what our physics is like, and use that to update their posterior, but that's exactly what we're doing to. The prior they start with can't be designed around our physics. I think that the only place this reasoning gets you is that their posterior will assign a higher probability to [fast approximation of real world] than our prior does, because the world-models have been reasonably reweighted in light of their "figuring out physics". Of course I don't object to that--our speed prior's posterior will be much better than the prior too.

Comment by michaelcohen on Asymptotically Benign AGI · 2019-03-12T01:32:00.685Z · score: 1 (1 votes) · LW · GW

Longer response coming. On hold for now.

Comment by michaelcohen on Asymptotically Benign AGI · 2019-03-12T01:10:35.181Z · score: 1 (1 votes) · LW · GW

I think what you're saying is that the following don't commute:

"real prior" (universal prior) + speed update + anthropic update + can-do update + worth-doing update

compared to

universal prior + anthropic update + can-do update + worth-doing update + speed update

When universal prior is next to speed update, this is naturally conceptualized as a speed prior, and when it's last, it is naturally conceptualized as "engineering reasoning" identifying faster predictions.

I'm happy to go with the second order if you prefer, in part because I think they do commute--all these updates just change the weights on measures that get mixed together to be piped to output during the "predict accurately" phase.

Comment by michaelcohen on Asymptotically Benign AGI · 2019-03-12T00:53:52.531Z · score: 1 (1 votes) · LW · GW

Let the AGI's "predict accurately" algorithm be fixed.

What you call a sequence of improvements to the prediction algorithm, let's just call that the prediction algorithm. Imagine this to have as much or as little overhead as you like compared to what was previously conceptualized as "predict accurately." I think this reconceptualization eliminates 2) as a concern, and if I'm understanding correctly, 1) is only able to mitigate slowdown, not overpower it.

Also I think 1) doesn't work--maybe you came to this conclusion as well?

Suppose M is the C programming language, but in C there is no way to say "interpret this string as a C program and run it as fast as a native C program".

But maybe you're saying that doesn't apply because:

(this is not a decision by the AGI but just a matter of which AGI ends up having the highest posterior)

I think this way throws off the contention that this AGI will have a short description length. One can imagine a sliding scale here. Short description, lots of overhead: a simple universe evolves life, aliens decide to run "predict accurately" + "treacherous turn". Longer description, less overhead: an AGI that runs "predict accurately" + "treacherous turn." Longer description, less overhead: an AGI with some of the subroutines involved already (conveniently) baked in to its architecture. Once all the subroutines are "baked into its architecture" you just have: the algorithm "predict accurately" + "treacherous turn". And in this form, that has a longer description than just "predict accurately".

Comment by michaelcohen on Asymptotically Benign AGI · 2019-03-12T00:29:03.314Z · score: 1 (1 votes) · LW · GW

I'm having a hard time following this. Can you expand on this, without using "sequence of increasingly better algorithms"? I keep translating that to "algorithm."

Comment by michaelcohen on Asymptotically Benign AGI · 2019-03-12T00:26:37.901Z · score: 1 (1 votes) · LW · GW
You can use reasoning to pick which one to favor (after figuring out physics), rather than just writing them down in some arbitrary order and taking the first one.

Using "reasoning" to pick which one to favor, is just picking the first one in some new order. (And not really picking the first one, just giving earlier ones preferential treatment). In general, if you have an infinite list of possibilities, and you want to pick the one that maximizes some property, this is not a procedure that halts. I'm agnostic about what order you use (for now) but one can't escape the necessity to introduce the arbitrary criterion of "valuing" earlier things on the list. One can put 50% probability mass on the first billion instead of the first 1000 if one wants to favor "simplicity" less, but you can't make that number infinity.

Asymptotically Benign AGI

2019-03-06T01:15:21.621Z · score: 40 (19 votes)

Impact Measure Testing with Honey Pots and Myopia

2018-09-21T15:26:47.026Z · score: 11 (7 votes)