Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-11T05:59:03.549Z · score: 1 (1 votes) · LW · GW

This seems correct. The agent's policy is optimal by definition with respect to its beliefs about the evaluator's "policy" in providing rewards, but that evaluator-policy is not optimal with respect to the agent's policy. In fact, I'm skeptical that in a general CIRL game, there exists a policy pair for the agent and the evaluator/principal/human such that each is optimal with respect to true beliefs about the other's policy. But I don't think this is a big problem. For a human evaluator, I think they would be wise to report utility honestly, rather than assume they know something the AI doesn't.

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-11T01:28:13.943Z · score: 9 (3 votes) · LW · GW

A bit of a nitpick: IRD and this formulate how the agent believes the evaluator acts, while being technically agnostic about how the evaluator actually acts (at least in the specification of the algorithm; experiments/theory might be predicated on additional assumptions about the evaluator).

I believe this agent's beliefs about how the evaluator acts are much more general than IRD. If the agent believed the evaluator was certain about which environment they were in, and it was the "training environment" from IRD, this agent would probably behave very similarly to an IRD agent. But of course, this agent considers many more possibilities for what the evaluator's beliefs might be.

I agree this agent should definitely be compared to IRD, since they are both agents who don't "take rewards literally", but rather process them in some way first. Note that the design space of things which fit this description is quite large.

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-10T02:02:21.609Z · score: 4 (2 votes) · LW · GW

In this setup, the agent believes they are in state A, and believes the evaluator believes they are most likely in state A''. State BC looks like C, but has utility like B. C is the best state.

ETA: And for a sequence of states, the utility is the sum of the utilities of the individual states.

A' and A" look like A, and BC looks like C.

In this example, the agent is pretty sure about everything, since that makes it simpler, but the analysis still holds if this only represents a part of the agent's belief distribution.

The agent is quite sure they're in state A.

The agent is quite sure that the evaluator is pretty sure they're in state A'', which is a very similar state, but has one key difference--from A'', a certain action has no effect. The agent won't capitalize on this confusion.

The optimal policy is , followed by (forever) if , otherwise followed by . Since the agent is all but certain about the utility function, none of the other details matter much.

Note that the agent could get higher reward by doing , , then forever. The reason for this is that after the evaluator observes the observation C, it will assign probability 4/5 to being in state C, and probability 1/5 to being in state BC. Since they will stay in that state forever, 4/5 of the time, the reward will be 10, and 1/5 of the time, the reward will be -1.
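For concreteness, here is the arithmetic behind that last paragraph as a minimal sketch (the 4/5 vs. 1/5 posterior and the rewards of 10 and -1 are the numbers from the example above; everything else is just illustration):

```python
# After observing C, the evaluator's posterior is 4/5 on state C and 1/5 on
# state BC. Each step, the evaluator samples a state from this posterior and
# reports its utility as the reward, so the expected reward per step is:
posterior = {"C": 4 / 5, "BC": 1 / 5}
utility = {"C": 10, "BC": -1}

expected_reward = sum(posterior[s] * utility[s] for s in posterior)
print(expected_reward)  # 7.8 per step, forever, if the evaluator stays confused
```

Per the comment above, this exceeds what the optimal policy collects, but the agent maximizes the utility of the true state rather than the reward, so it declines.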

The agent doesn't have to be sure about the utility function for this sort of thing to happen. If there is a state that looks like state X, but under many utility functions, it has utility like state Y, and if it seems like the evaluator finds that sort of state a priori unlikely, then this logic applies.

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-10T00:23:41.976Z · score: 7 (3 votes) · LW · GW

Yep.

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-10T00:21:05.011Z · score: 1 (1 votes) · LW · GW

An evaluator sits in front of a computer, sees the interaction history (actions, observations, and past rewards), and enters rewards.

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-09T05:11:15.186Z · score: 1 (1 votes) · LW · GW
defining the evaluator is a fuzzy problem

I'm not sure what you mean by this. We don't need a mathematical formulation of the evaluator; we can grab one from the real world.

if you don't have the right formalism, you're going to get Goodharting on incorrect conceptual contours

I would agree with this for a "wrong" formalism of the evaluator, but we don't need a formalism of the evaluator. A "wrong" formalism of "deception" can't affect agent behavior because "deception" is not a concept used in constructing the agent; it's only a concept used in arguments about how the agent behaves. So "Goodharting" seems like the wrong description of the dangers of using a wrong formalism in an argument; the dangers of using the wrong formalism in an argument are straightforward: the argument is garbage.

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-09T04:53:51.387Z · score: 4 (2 votes) · LW · GW

A key problem here is that if we use a human as the evaluator, the agent assigns 0 prior probability to the truth: the human won't be able to update beliefs as a perfect Bayesian, sample a world-state history from his beliefs, and assign a value to it according to a utility function. For a Bayesian reasoner that assigns 0 prior probability to the truth, God only knows how it will behave, even in the limit. (Unless there is some very odd utility function such that the human could be described in this way?)
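A minimal sketch of the idealized evaluator described above (the function names and exact interfaces are my own; the post's formalism may differ in detail):

```python
import random

def idealized_evaluator_reward(prior, likelihood, utility_fn, history):
    """The evaluator model the agent assumes: a perfect Bayesian who updates a
    prior over world-state histories on the interaction history, samples a
    world-state history from the posterior, and reports its utility as reward.
    A real human does none of these steps exactly, which is why the truth gets
    prior probability 0 under this model."""
    # Bayesian update: posterior over world-state histories given the history.
    weights = {s: p * likelihood(history, s) for s, p in prior.items()}
    total = sum(weights.values())
    posterior = {s: w / total for s, w in weights.items()}

    # Sample a world-state history from the posterior...
    sampled = random.choices(list(posterior), weights=list(posterior.values()))[0]

    # ...and report its utility as the reward.
    return utility_fn(sampled)
```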

But maybe this problem could be fixed if the agent takes some more liberties in modeling the evaluator. Maybe once we have a better understanding of bounded approximately-Bayesian reasoning, the agent can model the human as being a bounded reasoner, not a perfectly Bayesian reasoner, which might allow the agent to assign a strictly positive prior to the truth.

And all this said, I don't think we're totally clueless when it comes to guessing how this agent would behave, even though a human evaluator would not satisfy the assumptions that the agent makes about him.

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-09T04:08:59.435Z · score: 1 (1 votes) · LW · GW

This is approximately where I am too btw

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-09T00:48:16.647Z · score: 1 (1 votes) · LW · GW

Thanks for the meta-comment; see Wei's and my response to Rohin.

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-09T00:46:01.831Z · score: 1 (1 votes) · LW · GW

thanks :)

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-09T00:45:11.735Z · score: 1 (1 votes) · LW · GW
It looks closer to the Value Learning Agent in that paper to me and maybe can be considered an implementation / specific instance of that?

Yes. What the value learning agent doesn't specify is what constitutes observational evidence of the utility function, or in this notation, how to calculate and thereby calculate . So this construction makes a choice about how to specify how the true utility function becomes manifest in the agent's observations. A number of simpler choices don't seem to work.

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-09T00:34:10.762Z · score: 4 (2 votes) · LW · GW
Something that confuses me is that since the evaluator sees everything the agent sees/does, it's not clear how the agent can deceive the evaluator at all. Can someone provide an example in which the agent has an opportunity to deceive in some sense and declines to do that in the optimal policy?

(Copying a comment I just made elsewhere)

This setup still allows the agent to take actions that lead to observations that make the evaluator believe they are in a state that it assigns high utility to, if the agent identifies a few weird convictions in the prior. That's what would happen if it were maximizing the sum of the rewards, if it had the same beliefs about how rewards were generated. But it's maximizing the utility of the true state, not the state that the evaluator believes they're in.

(Expanding on it)

So suppose the evaluator were human. The human's lifetime of past observations gives it a posterior belief distribution which looks to the agent like a weird prior, with certain domains that involve oddly specific convictions. The agent could steer the world toward those domains, and steer towards observations that will make the evaluator believe they are in a state with very high utility. But it won't be particularly interested in this, and it might even be particularly disinterested, because the information it gets about what the evaluator values may be less relevant to the actual states it finds itself in a position to navigate between, if the agent believes the evaluator believes they are in a different region of the state space. I can work on a toy example if that isn't satisfying.

ETA: One such "oddly specific conviction", e.g., might be the relative implausibility of being placed in a delusion box where all the observations are manufactured.

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-09T00:25:10.720Z · score: 3 (2 votes) · LW · GW
Is the point you are trying to make different from the one in Learning What to Value? (Specifically, the point about observation-utility maximizers.) If so, how?

I may be missing something, but it looks to me like specifying an observation-utility maximizer requires writing down a correct utility function? We don't need to do that for this agent.

Do you have PRIOR in order to make the evaluator more realistic? Does the theoretical point still stand if we get rid of PRIOR and instead have an evaluator that has direct access to states?

Yes--sort of. If the evaluator had access to the state, it would be impossible to deceive the evaluator, since they would know everything. This setup still allows the agent to take actions that lead to observations that make the evaluator believe they are in a state that it assigns high utility to, if the agent identifies a few weird convictions in the prior. That's what would happen if it were maximizing the sum of the rewards, if it had the same beliefs about how rewards were generated. But it's maximizing the utility of the true state, not the state that the evaluator believes they're in.

How does the evaluator influence the behavior of the agent?

Wei's answer is good; it also might be helpful to note that with defined in this way, equals the same thing, but with everything on the right hand side conditioned on as well. When written that way, it is easier to notice the appearance of , which captures how the agent learns a utility function from the rewards.
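Since the symbols were lost above, here is a rough sketch of the kind of expression being described, in generic notation of my own rather than the post's (so the details may not match exactly): the rewards enter through a posterior over utility functions, which is how the evaluator influences the agent.

```latex
% Sketch only: h_{<t} is the interaction history (including past rewards), and
% P(u \mid h_{<t}) is the agent's posterior over utility functions given those rewards.
V^{\pi}(h_{<t}) \;=\; \sum_{u} P\!\left(u \mid h_{<t}\right)\,
  \mathbb{E}^{\pi}\!\left[\,\sum_{k \ge t} u(s_k) \;\middle|\; h_{<t},\, u\right]
```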

Not Deceiving the Evaluator

2019-05-08T05:37:59.674Z · score: 5 (5 votes)
Comment by michaelcohen on Strategic implications of AIs' ability to coordinate at low cost, for example by merging · 2019-05-01T00:12:04.433Z · score: 5 (3 votes) · LW · GW

One utility function might turn out much easier to optimize than the other, in which case the harder-to-optimize one will be ignored completely. Random events might influence which utility function is harder to optimize, so one can't necessarily tune the weights in advance to take this into account.
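To illustrate the first point with a toy example (the weighted-sum form, the options, and all numbers here are my own illustrative assumptions):

```python
# Suppose the merged agent maximizes w1*U1 + w2*U2 over a set of options.
# If U1 varies over a much larger range than U2 (it is "easier to optimize"),
# the argmax is driven almost entirely by U1 and U2 is effectively ignored.
options = ["a", "b", "c"]
U1 = {"a": 0.0, "b": 900.0, "c": 1000.0}   # large attainable gains
U2 = {"a": 1.0, "b": 0.5, "c": 0.0}        # small attainable gains
w1, w2 = 0.5, 0.5

best = max(options, key=lambda x: w1 * U1[x] + w2 * U2[x])
print(best)  # "c": the option that is worst for U2
```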

One of the reasons was the problem of positive affine scaling preserving behavior, but I see Stuart addresses that.

And actually, some of the reasons for thinking there would be more complicated mixing are going away as I think about it more.

EDIT: yeah, if they had the same priors and did unbounded reasoning, I wouldn't be surprised anymore if there exists a weighting that they would agree to.

Comment by michaelcohen on Strategic implications of AIs' ability to coordinate at low cost, for example by merging · 2019-04-30T14:47:38.455Z · score: 1 (1 votes) · LW · GW

Have you thought at all about what merged utility function two AIs would agree on? I doubt it would be of the form of a simple weighted sum of the two utility functions.

Comment by michaelcohen on Asymptotically Unambitious AGI · 2019-04-28T02:57:42.152Z · score: 6 (3 votes) · LW · GW

This is an interesting world-model.

In practice, this means that the world model can get BoMAI to choose any action it wants

So really this is a set of world-models, one for every algorithm for picking actions to present as optimal to BoMAI. Depending on how the actions are chosen by the world-model, either it will be ruled out by Assumption 2 or it will be benign.

Suppose the choice of action depends on outside-world features. (This would be the point of manipulating BoMAI--getting it to take actions with particular outside-world effects). Then, the feature that this world-model associates reward with depends on outside-world events that depend on actions taken, so the world-model is ruled out by Assumption 2. And as the parenthetical mentions, if the world-model is not selecting actions to advertise as high-reward based on the outside-world effects of those actions, then the world-model is benign.

However, it can also save computation

Only the on-policy computation is accounted for.

Comment by michaelcohen on Value Learning is only Asymptotically Safe · 2019-04-23T04:40:13.393Z · score: 1 (1 votes) · LW · GW

So the AI only takes action a from state s if it has already seen the human do that? If so, that seems like the root of all the safety guarantees to me.

Comment by michaelcohen on Value Learning is only Asymptotically Safe · 2019-04-22T01:10:39.087Z · score: 1 (1 votes) · LW · GW

Can you add the key assumptions being made when you say it is safe asymptotically? From skimming, it looked like "assuming the world is an MDP and that a human can recognize which actions lead to catastrophes."

Comment by michaelcohen on Value Learning is only Asymptotically Safe · 2019-04-22T01:04:19.598Z · score: 1 (1 votes) · LW · GW
the time it would take to go back to the optimal trajectory

In the real world, this is usually impossible.

Comment by michaelcohen on Value Learning is only Asymptotically Safe · 2019-04-22T01:03:28.660Z · score: 1 (1 votes) · LW · GW

I have to admit I got a little swamped by unfamiliar notation. Can you give me a short description of a Delegative Reinforcement Learner?

Comment by michaelcohen on Delegative Reinforcement Learning with a Merely Sane Advisor · 2019-04-22T00:56:53.980Z · score: 1 (1 votes) · LW · GW

I did a search for "ergodic" and was surprised not to find it. Then I did a search for "reachable" and found this:

Without loss of generality, assumes all states of M are reachable from S(λ) (otherwise, υ is an O-realization of the MDP we get by discarding the unreachable states)

You could just be left with one state after that! If that's the domain that the results cover, that should be flagged. It seems to me like this result only applies to ergodic MDPs.

Comment by michaelcohen on Delegative Reinforcement Learning with a Merely Sane Advisor · 2019-04-22T00:50:57.988Z · score: 1 (1 votes) · LW · GW
(as opposed to standard regret bounds in RL which are only applicable in the episodic setting)

??

Comment by michaelcohen on Value Learning is only Asymptotically Safe · 2019-04-19T10:08:55.709Z · score: 3 (2 votes) · LW · GW
I sort of object to titling this post "Value Learning is only Asymptotically Safe" when the actual point you make is that we don't yet have concrete optimality results for value learning other than asymptotic safety.

Doesn't the cosmic ray example point to a strictly positive probability of dangerous behavior?

EDIT: Nvm I see what you're saying. If I'm understanding correctly, you'd prefer, e.g. "Value Learning is not [Safe with Probability 1]".

Thanks for the pointer to PAC-type bounds.

Comment by michaelcohen on Towards a New Impact Measure · 2019-04-14T03:18:51.751Z · score: 1 (1 votes) · LW · GW

Oh sorry.

Comment by michaelcohen on Towards a New Impact Measure · 2019-04-14T01:30:53.580Z · score: 1 (1 votes) · LW · GW

Sure thing.

Comment by michaelcohen on Towards a New Impact Measure · 2019-04-14T01:27:13.945Z · score: 1 (1 votes) · LW · GW
even if trust did work like this

I'm not claiming things described as "trust" usually work like this, only that there exists a strategy like this. Maybe it's better described as "presenting an argument to run this particular code."

how exactly does taking over the world not increase the Q-values

The code that AUP convinces the operator to run is code for an agent which takes over the world. AUP does not take over the world. AUP is living in a brave new world run by a new agent that has been spun up. This new agent will have been designed so that, when operational: 1) AUP enters world-states which have very high reward, and 2) AUP enters world-states such that AUP's Q-values for various other reward functions remain comparable to their prior values.

the agent now has a much more stable existence

If you're claiming that the other Q-values can't help but be higher in this arrangement, New Agent can tune this by penalizing other reward functions just enough to balance out the expectation.

And let's forget about intent verification for just a moment to see if AUP accomplishes anything on its own, especially because it seems to me that intent verification suffices for safe AGI, in which case it's not saying much to say that AUP + intent verification would make it safe.

Comment by michaelcohen on Towards a New Impact Measure · 2019-04-14T01:04:59.761Z · score: 1 (1 votes) · LW · GW

Okay fair. I just mean to make some requests for the next version of the argument.

Comment by michaelcohen on Towards a New Impact Measure · 2019-04-13T10:17:34.503Z · score: 1 (1 votes) · LW · GW
2) ... If you can imagine making your actions more and more granular (at least, up to a reasonably fine level), it seems like there should be a well-defined limit that the coarser representations approximate.

Yeah I agree there's an easy way to avoid this problem. My main point in bringing it up was that there must be gaps in your justification that AUP is safe, if your justification does not depend on "and the action space must be sufficiently small." Since AUP definitely isn't safe for sufficiently large action spaces, your justification (or at least the one presented in the paper) must have at least one flaw, because it purports to argue that AUP is safe regardless of the size of the action space.

You must have read the first version of BoMAI (since you quoted it here :) how did you find it, by the way?). I'd level the same criticism against that draft. I believed I had a solid argument that it was safe, but then I discovered a counterexample, which proved there was an error somewhere in my reasoning. So I started by patching the error, but I was still haunted by how certain I felt that it was safe without the patch. I decided I needed to explicitly figure out every assumption involved, and in the process, I discovered ones that I hadn't realized I was making. Likewise, this patch definitely does seem sufficient to avoid this problem of action-granularity, but I think the problem shows that a more rigorous argument is needed.

Comment by michaelcohen on Towards a New Impact Measure · 2019-04-13T10:01:53.744Z · score: 1 (1 votes) · LW · GW
1) Why wouldn't gaining trust be useful for other rewards?

Because the agent has already committed to what the trust will be "used for." It's not as easy to construct the story of an agent attempting to gain trust to be allowed to do one particular thing as it is to construct the story of an agent attempting to gain trust to be allowed to do anything, but the latter is unappealing to AUP, and the former is perfectly appealing. So all the optimization power will go towards convincing the operator to run this particular code (which takes over the world, and maximizes the reward). If done in the right way, AUP won't have made arguments which would render it easier to then convince the operator to run different code; running different code would be necessary to maximize a different reward function, so in this scenario, the Q-values for other random reward functions won't have increased wildly in the way that the Q-value for the real reward did.

Comment by michaelcohen on Towards a New Impact Measure · 2019-04-13T09:46:50.141Z · score: 1 (1 votes) · LW · GW
4) this is why we want to slowly increment N. This should work whether it's a human policy or a meaningless string of text. The reason for this is that even if the meaningless string is very low impact, eventually N gets large enough to let the agent do useful things; conversely, if the human policy is more aggressive, we stop incrementing sooner and avoid giving too much leeway.

Let's say for concreteness that it's a human policy that is used for ∅, if you think it works either way. I think that most human actions are moderately low impact, and some are extremely high impact. If the impact of ∅ leaps to very large values infinitely often, then infinitely often there will effectively be no impact regularization, no matter what N is. No setting of N fixes this: if N were small enough to preclude even actions that are less impactful than ∅, then the agent can't ever act usefully, and if N permits actions as impactful as ∅, then whenever ∅ has very large impact (which I contend happens infinitely often for any assignment of ∅ that permits any useful action ever), dangerously high-impact actions will be allowed.

Comment by michaelcohen on Towards a New Impact Measure · 2019-04-11T04:26:58.535Z · score: 6 (3 votes) · LW · GW

These comments are responding to the version of AUP presented in the paper. (Let me know if I should be commenting elsewhere).

1)

If an action is useful w.r.t. the actual reward but useless to all other rewards (as useless as taking ∅), that is the ideal according to the penalty term—i.e. if it is not worth doing because the impact measure is too strong, nothing is worth doing. This is true even if the action is extremely useful to the actual reward. Am I right in thinking that we can conceptualize AUP as saying: “take actions which lead to reward, but wouldn’t be useful (or detrimental) to gaining reward if reward were specified differently”? A typical outline for an AGI gaining power to accomplish a goal might be: gain power, use the power to run some code to help you get maximal reward. We might imagine an AGI convincing a human to run some code, and then giving them the actual code. AUP would be less inclined to do this because after winning over the human, the Q-values for lots of reward functions would be extremely high, so it would be more reluctant to bring itself to that position. Suppose that AUP gives the human operator code to run first and then convinces them to run it. The actions which it takes to gain trust are not useful for other rewards, because they’ll only lead to the code already given being run, which is useless from the perspective of the other reward functions. Do you think AUP would be motivated to search for ways to lock in the effects of future power, and then pursue that power?

2)

If increasing attainable utility and decreasing attainable utility are both dangerous, then raising the size of the action space to a power makes the agent more dangerous. Consider transforming action/observation/reward into the agent submitting 3 actions and receiving the next three observations (with the rewards averaged). This is just a new action space, cubically larger. But in this action space, if the “first” action decreased attainable utility dangerously and the “third” action increased it dangerously, that would cancel out and fail to register as dangerous. Since this problem appears in the math but not in the intuition, it makes me wary of the reliability of the intuition. (A toy numerical sketch of this cancellation appears after point 4 below.)

3)

Q-learning converges by sampling all actions repeatedly from all states. AUP penalizes actions according to disruptions in Q-values. I understand that AGI won’t be a Q-learner in a finite-state MDP, but I think it’s worth noting: AUP learns to avoid catastrophic states (if in fact, it does) by testing them out.

4)

Suppose we have a chatbot, and the action space is finite-length strings of text. What exactly is ∅? If it is a meaningless string of text, I suspect every meaningful string of text will be “too high impact”. Maybe ∅ is an imitation of a human? I think humans are sufficiently powerful that normal human policies often accidentally cause large impact (i.e. make it massively more difficult or easy to achieve random goals), and that infinitely often (although perhaps not frequently), having ∅ be a human policy would lead to an incredibly high tolerance for impact, which would give AUP plenty of leeway to do dangerous things.
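Regarding point 2, here is a toy numerical sketch of the cancellation worry; the absolute-difference penalty shape and the numbers are my own illustrative assumptions, not the paper's exact definition:

```python
# AUP-style penalty ~ |change in attainable utility| for an auxiliary reward.
# Three single steps: the first crashes the auxiliary Q-value, the second does
# nothing, the third restores it. Bundling them into one macro-action and
# penalizing only the net change lets the two swings cancel.
q0, q1, q2, q3 = 10.0, 1.0, 1.0, 10.0      # auxiliary Q-values along the trajectory

per_step_penalty = abs(q1 - q0) + abs(q2 - q1) + abs(q3 - q2)   # 18.0
macro_action_penalty = abs(q3 - q0)                              # 0.0
print(per_step_penalty, macro_action_penalty)
```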

Comment by michaelcohen on Value Learning is only Asymptotically Safe · 2019-04-10T13:18:47.371Z · score: 1 (1 votes) · LW · GW

Linking this, I meant "with probability strictly greater than 0, the agent is not safe". Sorry for the confusion.

Comment by michaelcohen on Value Learning is only Asymptotically Safe · 2019-04-10T13:17:55.554Z · score: 1 (1 votes) · LW · GW

Yes, I did mean the latter. Thank you for clarifying.

Value Learning is only Asymptotically Safe

2019-04-08T09:45:50.990Z · score: 7 (3 votes)
Comment by michaelcohen on Asymptotically Unambitious AGI · 2019-04-02T04:16:14.586Z · score: 1 (1 votes) · LW · GW

Whoops--when I said

In a sense, the AI "in the box" is not really boxed

I meant the "AI Box" scenario where it is printing results to a screen in the outside world. I do think BoMAI is truly boxed.

We cannot "prove" that something is physically impossible, only that it is impossible under some model of physics.

Right, that's more or less what I mean to do. We can assign probabilities to statements like "it is physically impossible (under the true models of physics) for a human or a computer in isolation with an energy budget of x joules and y joules/second to transmit information in any way other than via a), b), or c) from above." This seems extremely likely to me for reasonable values of x and y, so it's still useful to have a "proof" even if it must be predicated on such a physical assumption.

Comment by michaelcohen on Asymptotically Unambitious AGI · 2019-04-01T23:39:42.787Z · score: 13 (4 votes) · LW · GW

Thanks for a really productive conversation in the comment section so far. Here are the comments which won prizes.

Comment prizes:

Objection to the term benign (and ensuing conversation). Wei Dai. Link. $20

A plausible dangerous side-effect. Wei Dai. Link. $40

Short description length of simulated aliens predicting accurately. Wei Dai. Link. $120

Answers that look good to a human vs. actually good answers. Paul Christiano. Link. $20

Consequences of having the prior be based on K(s), with s a description of a Turing machine. Paul Christiano. Link. $90

Simulated aliens converting simple world-models into fast approximations thereof. Paul Christiano. Link. $35

Simulating suffering agents. cousin_it. Link. $20

Reusing simulation of human thoughts for simulation of future events. David Krueger. Link. $20

Options for transfer:

1) Venmo. Send me a request at @Michael-Cohen-45.

2) Send me your email address, and I’ll send you an Amazon gift card (or some other electronic gift card you’d like to specify).

3) Name a charity for me to donate the money to.

I would like to exert a bit of pressure not to do 3, and spend the money on something frivolous instead :) I want to reward your consciousness, more than your reflectively endorsed preferences, if you’re up for that. On that note, here’s one more option:

4) Send me a private message with a shipping address, and I’ll get you something cool (or a few things).

Comment by michaelcohen on Asymptotically Unambitious AGI · 2019-04-01T22:58:13.380Z · score: 1 (1 votes) · LW · GW

The computer and everything is in the inner concrete wall (separated from the glass box by a vacuum), as is the power supply. Nothing is cooling the room, except maybe some ice on the floor. I think you could allow an oxygen tank in the room too.

Yes, the computer is using power, but a computer can't move matter except inside itself. This can generate sound and light, but the second vacuum chamber and Faraday cage will block that (and the rate at which power is drawn can be capped, which can also give a reasonable bound on how much noise to generate).

whatever's used for input, and whatever's used for output.

For input, the human is just blocking lasers. For output, they're looking through the inner vacuum at the screen on the inner wall of the inner concrete box.

They can be buried in concrete, but they'll still need to be within the vacuums and whatnot.

Yes, that's what the outer vacuum is for.

trying to outsmart a superintelligent AI is a Bad Idea

If I can construct a proof that Y is physically impossible, then I feel fine depending on the fact that an AI won't be able to figure out how to do Y. It doesn't feel to me like "trying to outsmart an AI."

Just because an unknown information channel does not have any humans listening to it directly, does not mean that it can't be used to push around matter, or couple to some other channel

Yes, you're right. My afterthought was hasty. I still think it is unlikely that other forms of information transmission are physically possible, and quite unlikely that they could be generated by a human or a computer in isolation.

Sidenote: I think we've all generated a lot of our intuitions here from the AI Box experiments. In a sense, the AI "in the box" is not really boxed. There's an information channel directly to the outside world!

Comment by michaelcohen on Asymptotically Unambitious AGI · 2019-04-01T10:25:12.627Z · score: 1 (1 votes) · LW · GW

What do you think about "aligned"? (in the sense of having goals which don't interfere with our own, by being limited in scope to the events of the room)

Comment by michaelcohen on Asymptotically Unambitious AGI · 2019-04-01T10:20:56.233Z · score: 1 (1 votes) · LW · GW

What do you think about "domesticated"?

Comment by michaelcohen on Asymptotically Unambitious AGI · 2019-04-01T09:51:54.859Z · score: 1 (1 votes) · LW · GW

Responding to this comment here.

I definitely do not trust a door with a kill switch to keep a superintelligent AI in a box.

I didn't go into the details of the physical design for a really secure box in the paper, because it's not really a topic in computer science, but this is a good excuse to write it up.

The operator is inside a glass box. Outside the glass box is a vacuum, and then a concrete box. Outside the concrete box is another vacuum and another concrete box with a Faraday cage around it. There are (airtight) doors through each layer. There is a pressure sensor in both the inner and outer vacuum layer. The computer lives in the inner wall of the inner concrete box; the screen is visible from the inside of the glass box. If the pressure increases in either vacuum layer, the computer's memory is erased, and the whole thing is blown up. There is a laser shining through the glass box (Laser 1). If the operator blocks the path of Laser 1, this corresponds to pressing the button which ends the episode and opens the door. The pressure sensor -> dynamite pathway is deactivated, the vacuum layers are filled with air, and the doors are unlocked. The operator "types" observations and rewards by blocking other lasers that pass through the box. For now, just imagine a few dozen lasers in a row, that the operator can reach up and block in order to type. (With a bit of creativity, and requiring that two lasers be blocked at once to "press a key" this could probably be made somewhat like a qwerty setup.) The lasers and the laser-detectors are all on the inner wall of the inner concrete box.
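As a quick sanity check on the chorded-laser keyboard (the "few dozen lasers" figure is from the description above; 30 is my illustrative choice):

```python
from math import comb

n_lasers = 30                          # "a few dozen lasers in a row"
two_laser_chords = comb(n_lasers, 2)   # keys available if a key = a pair of blocked lasers
print(two_laser_chords)                # 435 -- far more than a qwerty layout needs
```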

So inside the box is only a human body and some clothes. Let's assume: 1) the only ways for information to move in space are with a) physically moving matter, b) light, and c) gravitational waves, and 2) it is impossible for a human being to convert a nuclear reaction into useful work with only their body and clothes. Physically moving packets of matter cannot leave the inside of the glass box without releasing air into the vacuum and causing an explosion. Light cannot leave the Faraday cage. For sound waves (through an imperfect vacuum) and gravitational waves, we can use the amount of usable energy stored by the person in the room to bound how much noise and randomly moving matter we need to be producing outside to bury the signal in the noise. With a sufficiently large amount of noise, even if all the human's energy went toward producing a signal, we can create enough noise that this will not be detectable. Furthermore, an analysis of the near-perfect vacuum can provide an upper bound on the efficiency with which a human can convert energy into an informative matter wave, which should drive down the amount of noise we need to produce to almost nothing. We can also put a decibel meter in the inner wall and blow the whole thing up if it gets too high. I haven't done the math, but I bet the amount of matter we need to be randomly moving in order to hide signals coming from gravitational waves originating from within the box really is nothing. I bet ambient temperature takes care of that.

What if there are other ways information can be transmitted through space? If we ever learn of these, we shut down BoMAI (or modify the setup in light of them). Until then, if such a discovery is made by BoMAI, we can't exactly detect those sorts of signals from the outside, because we don't know they exist.

Comment by michaelcohen on Asymptotically Unambitious AGI · 2019-04-01T09:09:49.778Z · score: 1 (1 votes) · LW · GW
(I definitely do not trust a door with a kill switch to keep a superintelligent AI in a box.)

If you're right about this, the setup is not safe. I'm going to respond to this in the "concerns about the box" section. I don't think a commitment to give the agent high reward if it jumps through a few hoops will save us.

In this particular case, it will never do anything harder than hacking its human operator, and that's where most of the safety benefit comes from.

I disagree with this. The safety benefit comes from it not having outside-world instrumental goals (which it lacks if and only if the box is secure).

My intuition says that nothing else in the proposal actually matters for safety

That's what I would conclude as well if the box were not secure.

In particular, I think just giving the AI direct exploration abilities won't mess up safety,

See Appendix F. If the agent picks its own exploratory policies (reasonably), the agent will try every computable policy until it dies, including the policies of every simple AGI.

Comment by michaelcohen on Asymptotically Unambitious AGI · 2019-04-01T03:57:58.760Z · score: 1 (1 votes) · LW · GW
I guess the obvious response is that we can instead simulate the internal workings of the human in detail, and thus uncover their simulation of post-episode events (as a past event).

So this is the sense in which I think my statement is technically correct. This is what literally does.

The next question is whether it is correct in way that isn't fragile once we start considering fast/simple approximations of . You're right that there is more to discuss here than I discuss in the paper: if a human's simulation of the future has fidelity, and the world-model itself has fidelity, then a clever memory-based world-model could reuse the computation of the human's prediction of the future when it is computing the actual future. If it hasn't spent much computation time "going down the wrong path" there isn't much that's lost for having done so.

I don't expect the human operator will be simulating/imagining all post-episode events that are relevant for ε-accurate predictions of future episodes. ε-accurate world-models have to simulate all the outside-world events that are necessary to get within an ε threshold of understanding how episodes affect each other, and it won't be necessary for the human operator to consider all this. So I think that even for approximately accurate world-models, following the wrong counterfactual won't be perfectly useful to future computation.

Comment by michaelcohen on Asymptotically Unambitious AGI · 2019-04-01T03:21:03.190Z · score: 1 (1 votes) · LW · GW
To put it differently: this setup will not provide a solution to any problem which is more difficult than hacking the human operator.

Yep. I actually expect it to be pretty hard to hack the operator compared to other tasks, but more importantly, if it turns out to be easier, we can tinker with the setup until hacking the operator becomes very hard, since a mistake here isn't an existential risk. For example, we could add a monitor who observes the operator, and can set the rewards of the episode to 0 if he observes anything suspicious; the box would include them both, but the operator couldn't easily communicate to the monitor, and if they tried, the monitor could immediately end the episode.

This comment has a recap of other parts of the conversation on this topic, alongside a more detailed proposal.

Comment by michaelcohen on Asymptotically Unambitious AGI · 2019-04-01T03:07:16.099Z · score: 1 (1 votes) · LW · GW

That's a good name for the assumption. Well, any Turing machine/computable function can be described in English (perhaps quite arduously), so consider the universal Turing machine which converts the binary description to English, and then uses that description to identify the Turing machine to simulate. This UTM certainly satisfies this assumption.

It strikes me as potentially running up against issues of NFL / self-reference.

Can you explain more? (If the above doesn't answer it).

Another intuition I have for this assumption which doesn't appear in the paper: English is a really good language. (This is admittedly vague). In thinking about this intuition further, I've noticed a weaker form of Assumption 3 that would also do the trick: the assumption need only hold for ε-accurate world-models (for some ε). In that version of the assumption, one can use the more plausible intuitive justification: "English is a really good language for describing events arising from human civilization in our universe."

Comment by michaelcohen on Asymptotically Unambitious AGI · 2019-04-01T02:17:26.384Z · score: 1 (1 votes) · LW · GW

I don't understand what you mean by a revealed preference. If you mean "that which is rewarded," then it seems pretty straightforward to me that a reinforcement learner can't optimize anything other than that which is rewarded (in the limit).

Comment by michaelcohen on Asymptotically Unambitious AGI · 2019-04-01T02:15:23.663Z · score: 1 (1 votes) · LW · GW
1) It seems too weak: In the motivating scenario of Figure 3, isn't is the case that "what the operator inputs" and "what's in the memory register after 1 year" are "historically distributed identically"?

This assumption isn't necessary to rule out memory-based world-models (see Figure 4). And yes you are correct that indeed it doesn't rule them out.

2) It seems too strong: aren't real-world features and/or world-models "dense"? Shouldn't I be able to find features arbitrarily close to F*? If I can, doesn't that break the assumption?

Yes. Yes. No. There are only finitely many short English sentences. (I think this answers your concern if I understand it correctly).

3) Also, I don't understand what you mean by: "its on policy behavior [is described as] simulating X". It seems like you (rather/also) want to say something like "associating reward with X"?

I don't quite rely on the latter. Associating reward with X means that the rewards are distributed identically to X under all action sequences. Instead, the relevant implication here is: "the world-model's on-policy behavior can be described as simulating X" implies "for on-policy action sequences, the world-model simulates X" which means "for on-policy action sequences, rewards are distributed identically to X."

Comment by michaelcohen on Asymptotically Unambitious AGI · 2019-04-01T01:16:35.197Z · score: 1 (1 votes) · LW · GW

Yeah that's what I mean to refer to: this is a system which learns everything it needs to from the human while querying her less and less, which makes human-led exploration viable from a capabilities standpoint. Do you think that clarification would make things clearer?

Comment by michaelcohen on Asymptotically Unambitious AGI · 2019-03-29T02:18:32.988Z · score: 1 (1 votes) · LW · GW
Oh yeah--that's good news.

Although I don't really like to make anything that would fall apart if the world were deterministic. Relying on stochasticity feels wrong to me.

Comment by michaelcohen on The Main Sources of AI Risk? · 2019-03-29T01:40:38.793Z · score: 1 (1 votes) · LW · GW

Maybe something along the lines of "Inability to specify any 'real-world' goal for an artificial agent"?

Comment by michaelcohen on The Main Sources of AI Risk? · 2019-03-29T00:32:31.625Z · score: 5 (3 votes) · LW · GW
3. Misspecified or incorrectly learned goals/values

I think this phrasing misplaces the likely failure modes. An example that comes to mind from this phrasing is that we mean to maximize conscious flourishing, but we accidentally maximize dopamine in large brains.

Of course, this example includes an agent intervening in the provision of its own reward, but since that seems like the paradigmatic example here, maybe the language could better reflect that, or maybe this could be split into two.

The single technical problem that appears biggest to me is that we don't know how to align an agent with any goal. If we had an indestructible magic box that printed a number to a screen corresponding to the true amount of Good in the world, we still don't know how to design an agent that maximizes that number (instead of taking over the world, and tampering with the cameras that are aimed at the screen/the optical character recognition program used to decipher the image). This problem seems to me like the single most fundamental source of AI risk. Is 3 meant to include this?

Comment by michaelcohen on Asymptotically Unambitious AGI · 2019-03-16T06:04:33.195Z · score: 4 (2 votes) · LW · GW
I don't see why their methods would be elegant.

Yeah I think we have different intuitions here; are we at least within a few bits of log-odds disagreement? Even if not, I am not willing to stake anything on this intuition, so I'm not sure this is a hugely important disagreement for us to resolve.

I don't see how MAP helps things either

I didn't realize that you think that a single consequentialist would plausibly have the largest share of the posterior. I assumed your beliefs were in the neighborhood of:

it seems plausible that the weight of the consequentialist part is in excess of 1/million or 1/billion

(from your original post on this topic). In a Bayes mixture, I bet that a team of consequentialists that collectively amount to 1/10 or even 1/50 of the posterior could take over our world. In MAP, if you're not first, you're last, and more importantly, you can't team up with other consequentialist-controlled world-models in the mixture.
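A toy numerical contrast of why a coalition matters in a Bayes mixture but not under MAP (the model names, posterior weights, and the binary "prediction" are all illustrative assumptions):

```python
# A "team" of models that collectively holds 20% of the posterior shifts the
# Bayes-mixture prediction, but contributes nothing under MAP unless one of
# them is the single highest-weight model.
posterior = {"honest_1": 0.5, "honest_2": 0.3, "conseq_a": 0.12, "conseq_b": 0.08}
prediction = {"honest_1": 0.0, "honest_2": 0.0, "conseq_a": 1.0, "conseq_b": 1.0}

bayes_mixture = sum(posterior[m] * prediction[m] for m in posterior)  # 0.2
map_model = max(posterior, key=posterior.get)                          # "honest_1"
print(bayes_mixture, map_model, prediction[map_model])                 # 0.2, honest_1, 0.0
```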

Asymptotically Unambitious AGI

2019-03-06T01:15:21.621Z · score: 40 (19 votes)

Impact Measure Testing with Honey Pots and Myopia

2018-09-21T15:26:47.026Z · score: 11 (7 votes)