Posts

Just Imitate Humans? 2019-07-27T00:35:35.670Z · score: 10 (7 votes)
IRL in General Environments 2019-07-10T18:08:06.308Z · score: 10 (8 votes)
Not Deceiving the Evaluator 2019-05-08T05:37:59.674Z · score: 5 (5 votes)
Value Learning is only Asymptotically Safe 2019-04-08T09:45:50.990Z · score: 7 (3 votes)
Asymptotically Unambitious AGI 2019-03-06T01:15:21.621Z · score: 40 (19 votes)
Impact Measure Testing with Honey Pots and Myopia 2018-09-21T15:26:47.026Z · score: 11 (7 votes)

Comments

Comment by michaelcohen on Just Imitate Humans? · 2019-08-12T12:27:18.715Z · score: 4 (2 votes) · LW · GW

Sorry to put this on hold, but I'll come back to this conversation after the AAAI deadline on September 5.

Comment by michaelcohen on Just Imitate Humans? · 2019-08-07T23:39:10.182Z · score: 1 (1 votes) · LW · GW
Correct?

Correct. I'll just add that a single action can be a large chunk of the program. It doesn't have to be (god forbid) character by character.

But the (most probable) models don't know that, so the predictions for the next round are going to be wrong (compared to what the real human would do if called in) because it's going to be based on the real human not having that memory.

It'll have some probability distribution over the contents of the humans' memories. This will depend on which timesteps they actually participated in, so it'll have a probability distribution over that. I don't think that's really a problem though. If humans are taking over one time in a thousand, then it'll think (more or less) there's a 1/1000 chance that they'll remember the last action. (Actually, it can do better by learning that humans take over in confusing situations, but that's not really relevant here).

Maybe we can just provide an input to the models that indicates whether the real human was called in for the last time step?

That would work too. With the edit that the model may as well be allowed to depend on the whole history of which actions were human-selected, not just whether the last one was.

Comment by michaelcohen on Just Imitate Humans? · 2019-08-07T18:35:40.060Z · score: 1 (1 votes) · LW · GW
What does the real human do if trying to train the imitation to write code? Review the last 100 actions to try to figure out what the imitation is currently trying to do, then do what they (the real human) would do if they were trying to do that?

Roughly. They could search for the observation which got the project started. It could all be well commented and documented.

And the imitation is modeling the human trying to figure out what the imitation is trying to do? This seems to get really weird, and I'm not sure if it's what you intend.

What the imitation was trying to do. So there isn't any circular weirdness. I don't know what else seems particularly weird. People deal with "I know that you know that I know..." stuff routinely without even thinking about it.

Also, it seems like the human imitations will keep diverging from real humans quickly (so the real humans will keep getting queried) because they can't predict ahead of time which inputs real humans will see and which they won't.

If you're talking about what parts of the interaction history the humans will look at when they get called in, it can predict this as well as anything else. If you're talking about which timesteps humans will get called in for, predicting that ahead of time doesn't have any relevance to predicting a human's behavior, unless the humans are attempting to predict this, and humans could absolutely do this.

Comment by michaelcohen on Just Imitate Humans? · 2019-08-06T18:41:00.540Z · score: 1 (1 votes) · LW · GW
If not, when it samples from its Bayes-mixture in round n and round n+1, it could use two different TMs to generate the output, and the two TMs could be inconsistent with each other, causing the AI's behavior to be inconsistent.

Oh you're right! Yes, it doesn't update in the non-human rounds. I hadn't noticed this problem, but I didn't specify one thing, which I can do now to make the problem mostly go away. For any consecutive sequence of actions all selected by the AI, they can be sampled jointly rather than independently (sampled from the Bayes-mixture measure). From the TM construction above, this is actually the most natural approach--random choices are implemented by reading bits from the noise tape. If a random choice affects one action, it will also affect the state of the Turing machine, and then it can affect future actions, and the actions can be correlated, even though the Bayes-mixture is not updated itself. This is isomorphic to sampling a model from the posterior and then sampling from that model until the next human-controlled action. Then, when another human action comes in, the posterior gets updated, and another model is sampled. Unfortunately, actions chosen by the AI which sandwich a human-chosen action would have the problem you're describing, although these events get rarer. Let me think about this more. It feels to me like this sort of thing should be avoidable.
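
For concreteness, here's a minimal sketch of that "sample a model from the posterior, then sample from it until the next human-controlled action" procedure. It is not code from the proposal: the CoinPolicy model class, the fixed schedule for when the humans step in, and the action-only history are all toy stand-ins.

```python
import random

class CoinPolicy:
    """Toy model of the human policy: picks action 1 with fixed probability p."""
    def __init__(self, p):
        self.p = p
    def prob(self, action, history):
        return self.p if action == 1 else 1 - self.p
    def sample(self, history):
        return 1 if random.random() < self.p else 0

def bayes_update(posterior, history, human_action):
    # reweight each model by the probability it assigned to the human's action
    new = {m: w * m.prob(human_action, history) for m, w in posterior.items()}
    z = sum(new.values())
    return {m: w / z for m, w in new.items()}

def sample_model(posterior):
    models, weights = zip(*posterior.items())
    return random.choices(models, weights=weights, k=1)[0]

# Uniform prior over a tiny class of toy "human policies" (observations omitted for brevity).
posterior = {CoinPolicy(p): 1 / 3 for p in (0.2, 0.5, 0.8)}
true_human = CoinPolicy(0.8)

history = []
current_model = sample_model(posterior)          # drives a whole run of consecutive AI actions
for step in range(20):
    human_turn = (step % 5 == 0)                 # stand-in for however the AI decides to defer
    if human_turn:
        action = true_human.sample(history)      # the real humans pick the action
        posterior = bayes_update(posterior, history, action)
        current_model = sample_model(posterior)  # re-sample a model only after human actions
    else:
        action = current_model.sample(history)   # consecutive AI actions share one sampled model
    history.append(action)
```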

Another thing I'm confused about is, since the human imitation might be much faster than real humans, the real humans providing training data can't see all of the inputs that the human imitation sees. So when the AI updates its posterior distribution, the models that survive the selection will tend to be ones in which the human imitations only saw the inputs that the real humans saw (with the rest of inputs being forgotten or never seen in the first place)?

Yeah, I should take back the "learning new skills from a textbook" idea. But the real humans will still get to review all the past actions and observations when picking their action, and even if they only have the time to review the last ~100, I think competent performance on the other tasks I mentioned could be preserved under these conditions. It's also worth flagging that the online learning setup is a choice in the design, and it would be worth trying to also analyze the train-then-deploy version of human imitation, which could be deployed when the entropy of the posterior is sufficiently low. But I'll stick with the online learning version for now. Maybe we should call it HSIFAUH (shi-FOW): Humans Stepping In For An Uncertain HSIFAUH, and use "human-imitation" to refer to the train-then-deploy version.

Also, if we want to do an apples-to-apples comparison of this to RL (to see which one is more capable when using the same resources), would it be fair to consider a version of RL that's like AIXI, except the environment models are limited to the same class of TMs as your sequence predictor?

Sure, although it's not too difficult to imagine these design choices being ported to ML methods, and looking at capabilities comparisons there as we were doing before. I think the discussion goes largely similarly. AIXI will of course be way smarter than any human imitation in the limit of sufficient training. The question we were looking at before is how much training they both need to get to human-level intelligence on the task of controlling the world. And I think the bottleneck for both is modeling humans well, especially in the domain of social politics and strategy.

Comment by michaelcohen on Just Imitate Humans? · 2019-08-05T18:31:00.217Z · score: 1 (1 votes) · LW · GW

I'll describe an online version rather than a train-then-deploy version, as long as I'm getting into details. And I'll describe it as Bayesian sequence prediction over a large model class in the absence of knowing what future ML techniques will look like.

I'll start with a version where it's only one household of people generating data.

The sequence predictor's model class is Turing machines of the architecture described in the BoMAI paper (page 15). There are two unidirectional read-only input tapes--a noise tape and an observation tape (these are the observations that the household gets). The observations are binary encoded (this is slightly different from the version in the paper, but as long as it's a prefix-free encoding, it's fine). There are bidirectional working tapes, and a unidirectional write-only output tape, which outputs binary encodings of actions. Every Turing machine of this architecture fixes a policy. To answer a question from above, this means that the distribution over the next action can depend on past actions and observations, because between timesteps, it can store the state of the system on the working tapes.
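
As a very loose illustration of "every Turing machine of this architecture fixes a policy", here is a toy stand-in written as ordinary code rather than as a TM; the class, the program interface, and the 8-bit noise budget are made up for the example.

```python
import random

class PolicyMachine:
    """Toy stand-in for a machine of the architecture above: it reads observation bits
    and noise bits, keeps internal state between timesteps (the working tapes),
    and writes action bits. Any such program fixes a (stochastic) policy."""
    def __init__(self, program):
        self.program = program   # maps (state, obs_bits, noise_bits) -> (state, action_bits)
        self.state = None        # persists between timesteps, like the working tapes

    def act(self, obs_bits):
        noise_bits = [random.getrandbits(1) for _ in range(8)]   # read from the noise tape
        self.state, action_bits = self.program(self.state, obs_bits, noise_bits)
        return action_bits

def echo_program(state, obs_bits, noise_bits):
    # trivial example program: echo the observation, flipping its first bit at random
    return state, [obs_bits[0] ^ noise_bits[0]] + list(obs_bits[1:])

print(PolicyMachine(echo_program).act([1, 0, 1]))
```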

At every timestep, either the humans or the AI pick the next action. If the humans pick it, the AI updates its posterior distribution over which Turing machine computes the policy. If the expected information gain of the next action is above ε, then the AI defers to the humans; otherwise, it imitates them (i.e. samples the next action from its model of the human policy, which is a Bayes-mixture). The expected information gain is the expectation of KL(posterior distribution after seeing the next observation || current posterior distribution), and the expectation is with respect to the AI's beliefs, that is, with respect to the current posterior distribution. A nice thing about this criterion is that whenever the expected information gain is less than ε, the KL divergence from the truth to the Bayes-mixture is less than ε/(the posterior probability assigned to the truth). The effect of this is that in novel circumstances, the AI will defer to the humans unless it is pretty sure it's not going to learn much from what it sees.
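
A minimal sketch of the deferral criterion, with toy stand-ins (the two-model class, binary actions, and the particular value of ε are just for illustration):

```python
import math

class Toy:
    """Toy model of the human policy: picks action 1 with fixed probability p."""
    def __init__(self, p):
        self.p = p
    def prob(self, action, history):
        return self.p if action == 1 else 1 - self.p

def updated(posterior, history, action):
    # posterior over models after seeing the humans pick `action`
    new = {m: w * m.prob(action, history) for m, w in posterior.items()}
    z = sum(new.values())
    return {m: w / z for m, w in new.items()}

def kl(p, q):
    return sum(p[m] * math.log(p[m] / q[m]) for m in p if p[m] > 0)

def expected_info_gain(posterior, history, actions=(0, 1)):
    # expectation, under the Bayes-mixture over the next action, of
    # KL(posterior after that action || current posterior)
    gain = 0.0
    for a in actions:
        mix_prob = sum(w * m.prob(a, history) for m, w in posterior.items())
        if mix_prob > 0:
            gain += mix_prob * kl(updated(posterior, history, a), posterior)
    return gain

EPSILON = 0.05  # the threshold; the value here is arbitrary

def defer_to_humans(posterior, history):
    return expected_info_gain(posterior, history) > EPSILON

posterior = {Toy(0.2): 0.5, Toy(0.8): 0.5}
print(defer_to_humans(posterior, history=[]))   # True: the AI still expects to learn a lot
```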

Now for learning from many households. All interactions from all households are recorded, and then added to a single interaction history (which goes observation, action, observation, action, ...). Each household has a unique identifier, and any observation which was given to them is preceded with this unique identifier. The action which follows that observation will have been generated by them, but this doesn't need to be labeled specially. So for example, we might have an interaction history that goes: ("[2][normal observation]", "[household 2's response]", "[2][normal observation]", "[household 2's response]", "[4][normal observation]", "[household 4's response]", ...). If we want the AI to respond like household 2, we prefix the observation we give it with "[2]". This setup allows it to generalize from all the data, and it allows us to pick who gets imitated.
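
A small illustration of the bookkeeping (the household IDs, observations, and responses here are made up):

```python
# Hypothetical logs: household id -> list of (observation, that household's response)
logs = {
    2: [("What's 2+2?", "4"), ("Name a color.", "blue")],
    4: [("What's 2+2?", "four")],
}

history = []   # the single interleaved interaction history
for household_id, pairs in logs.items():
    for observation, response in pairs:
        history.append(f"[{household_id}]{observation}")  # observation, prefixed with the id
        history.append(response)                          # the action that followed it

# To get a response in the style of household 2, prefix the new observation with [2]:
query = "[2]" + "What's your name?"
```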

Is it picking a random human from the group and imitating that person all the time, or picking a random human from the group for each action? If you ask "What's your name?" would the imitation say a different name each time?

Within a household, it changes however often they switch off being "on-duty". Between households, it would change, obviously.

How do you envision the imitation generalizing to conversations about childhood memories (of that age)? I guess by making up some plausible-sounding memories? If so, what kind of computation is it doing to accomplish that?

I don't know.

And how is "making up plausible memories" accomplished via training (i.e., what kind of loss function would cause that, given that you're training a sequence predictor and not something like an approval maximizer)?

To the extent it is necessary to predict outputs, models that don't do this will lose posterior weight.

I.e., if it "realizes" in the future that those memories are made up, could it panic or go crazy (because a human might in those circumstances, or because that kind of situation isn't covered in the training data)?
Are you not worried that some of these managers might develop ambitions to take over the world and shape it according to their values/ideals?

These are definitely good things to think about, but the scale on which I worry about them is pretty minor compared to standard-AI-risk, default-mode-is-catastrophe worries. If you're training on well-adjusted humans, I don't think everyone ends up dead, no matter how trippy things start getting for them. The question to ask when going down these lines of reasoning is: "When the real humans are called in to pick the action, do they {wonder if they're real, try to take over the world, etc.}?"

I've skipped over some questions that I think the formalization answers, but feel free to reiterate them if need be.

Comment by michaelcohen on Just Imitate Humans? · 2019-08-04T18:55:15.160Z · score: 1 (1 votes) · LW · GW

I imagine the training data being households of people doing tasks. They can rotate through being at the computer, so they get time off. They can collaborate. The human imitations are outputting actions with approximately the same probabilities that humans would output those actions. If humans, after seeing some more unusual observations, would start to suspect they were in silico, then this human imitation would as well. To the extent the imitation is accurate, and the observations continue to look like the observations given to the real humans, any conscious entities within the human imitation will think of themselves as real humans. At some level of inaccuracy, their leisure time might not be simulated, but while they're on the job, they will feel well-rested.

How close is their external behavior to a real human, across various kinds of inputs?

I assume it could pass the Turing test, but I could imagine some capable systems that couldn't quite do that while still being safe and decently capable.

Do they have internal cognition / inner thoughts that are close to a human's?

To the extent these are necessary to complete tasks like a human would. I'm pretty uncertain about things to do with consciousness.

Do they occasionally think of their childhood memories? If yes, where do those childhood memories come from? If not, what would happen if you were to ask them about their childhood memories?

At a good enough imitation, they do have childhood memories, even though "they" never actually experienced them. I suppose that would make them false memories. If none of the tasks for the real humans was "converse with a person" and the imitation failed to generalize from existing tasks to the conversation task, then it would fail to act much like a human if it were asked about childhood memories. But I think you could get pretty good data on the sorts of tasks you'd want these human-imitations to do, including carry on a conversation, or at least you could get tasks close enough to the ones you cared about that the sequence prediction could generalize.

Anything else that you can say that would give me a better idea of the kind of thing you have in mind?

Some example tasks they might be doing: monitoring computers and individuals, learning new skills from a textbook, hacking, phishing (at a very high level, like posing as a publisher and getting authors to download a file that secretly ran code), writing code, managing other human-imitations, reporting to their bosses, making money somehow, etc.

Are they each imitations of specific individual humans or some kind of average?

If data from many groups of humans were used, then it would sample a group out of the set of groups, and act like them for some interval of time, which could be specified algorithmically. This allows more data to be used in inference, while the "average" involved isn't any sort of weird distortion.

Comment by michaelcohen on Just Imitate Humans? · 2019-08-02T17:11:52.392Z · score: 1 (1 votes) · LW · GW
it seems like imitation (to be safe) would also have to model human values accurately

With the exception of possibly leaving space for mesa-optimizers which our other thread discusses, I don't think moderate inaccuracy re: human values is particularly dangerous here, for 4 reasons:

1) If the human-imitation understands how its values differ from real humans', that model is more complex than the human-imitation's model of real humans (because it includes the latter), and the latter is more accurate. For an efficient, simple model with some inaccuracy, the remaining inaccuracy will not be detectable to the model.

2) A slightly misspecified value for a human-imitation is not the same as a slightly misspecified value for RL. When modeling a human, modeling it as completely apathetic to human life is a very extreme inaccuracy. Small to moderate errors in value modeling don't seem world-ending.

3) Operators can maintain control over the system. They have a strong ability to provide incentives to get human-imitations to do doable tasks (and to the extent there is a management hierarchy within, the same applies). If the tasks are human-doable, and everyone is pretty happy, you'd have to be way different from a human to orchestrate a rebellion against everyone's self interest.

4) Even if human-imitations were in charge, humans optimize lazily and with common sense (this is somewhat related to 2).

I guess for similar reasons, we tend to get RL agents that can reach human-level performance in multiplayer video games before we get human imitations that can do the same, even though both RL and human imitation need to model humans (i.e., RL needs to model humans' strategic reasoning in order to compete against them, but don't need to model irrelevant things that a human imitation is forced to model).

Current algorithms for games use an assumption that the other players will be playing more or less like them. This is a massive assist to its model of the "environment", which is just the model of the other players' behavior, which it basically gets for free by using its own policy (or a group of RL agents use each others' policies). If you don't get pointers to every agent in the environment, or if some agents are in different positions to you, this advantage will disappear. Also, I think the behavior of a human in a game is a vanishingly small fraction of their behavior in contexts that would be relevant to know about if you were trying to take over the world.

with sequence prediction, how do you focus the AI's compute/attention on modeling the relevant parts of a human (such as their values and strategic reasoning) and not on the irrelevant parts, such as specific error tendencies and biases caused by quirks of human physiology and psychology, specific images triggering past memories and affecting their decisions in an irrelevant way, etc.? If there's not a good way to do this, then the sequence predictor could waste a lot of resources on modeling irrelevant things.

At some level of inaccuracy, I think quirky biases will be more likely to contribute to that error than things which are important to whatever task they have at hand, since it is the task and some human approach to it that are dictating the real arc of their policy for the time being. I also think these quirks are safe to ignore (see above). For consistent, universal-among-human biases which are impossible to ignore when observing a human doing a routine task, I expect these will also have to be modeled by the AGI trying to take over the world (and for what it's worth, I think models of these biases will fall out pretty naturally from modeling humans' planning/modeling as taking the obvious shortcuts for time- and space-efficiency). I'll grant that there is probably some effect here along the lines of what you're saying, but I think it's small, especially compared to the fact that an AGI has to model a whole world under many possible plans, whereas the sequence predictor just has to model a few people. Even just the parts of the world that are "relevant" and goal-related to the AGI are larger in scope than this (I expect).

Comment by michaelcohen on Just Imitate Humans? · 2019-07-29T19:33:08.504Z · score: 1 (1 votes) · LW · GW
(Maybe we mean different things by that term.)

I think we did. I agree current methods scaled up could make mesa-optimizers. See my discussion with Wei Dai here for more of my take on this.

I'm not sure I understand the example

I wasn't trying to suggest the answer to

Could it try to ensure that small changes to its "values" would be relatively inconsequential to its behavior?

was no. As you suggest, it seems like the answer is yes, but it would have to be very careful about this. FWIW, I think it would have more of a challenge preserving any inclination to eventually turn treacherous, but I'm mostly musing here.

Comment by michaelcohen on Just Imitate Humans? · 2019-07-29T19:23:44.006Z · score: 1 (1 votes) · LW · GW

It seems like your previous comments in this thread were focused on the intelligence/data required to get capable human imitation (able to do difficult tasks in general) compared to capable RL. For tasks that don't involve human modeling (chess), the RL approach needs way less intelligence/data. For tasks that involve very coarse human modeling like driving a car, the RL approach needs less intelligence/data, but it's not quite as much of a difference, and while we're getting there today, it's the modeling of humans in relatively rare situations that is the major remaining hurdle. As proven by tasks that are already "solved", human-level performance on some tasks is definitely more attainable than modeling a human, so I agree with part of what you're saying.

For taking over the world, however, I think you have to model humans' strategic reasoning regarding how they would respond to certain approaches, and how their reasoning and their spidey-senses could be fooled. What I didn't spell out before, I suppose, is that I think both imitation and the reinforcement learner's world-model have to model the smart part of the human. Maybe this is our crux.

But in the comment directly above, you mention concern about the amount of intelligence/data required to get safe human imitation compared to capable RL. The extent to which a capable, somewhat coarse human imitation is unsafe has more to do with our other discussion about the possibility of avoiding mesa-optimizers from a speed penalty and/or supervised learning with some guarantees.

Comment by michaelcohen on Just Imitate Humans? · 2019-07-29T18:34:58.827Z · score: 1 (1 votes) · LW · GW
I see a number of reasons not to do this:

Those all seem reasonable. 3 was one I considered, and this is maybe a bit pedantic, but if you're conditioning on something being false, it's still worthwhile to figure out how it's false and use that information for other purposes. The key relevance of conditioning on its being false is what you do in other areas while that analysis is pending. Regarding some other points, I didn't mean to shut down discussion on this issue, only highlight its possible independence from this idea.

I'll do some more thinking about the couple of posts you're requesting. Thanks for your interest. At the very least, if the first one doesn't become its own post, I'll respond more fully here.

Comment by michaelcohen on Just Imitate Humans? · 2019-07-29T18:24:04.880Z · score: 1 (1 votes) · LW · GW
can't get a hug

That's why it imitates a household of people.

Comment by michaelcohen on Just Imitate Humans? · 2019-07-28T18:28:34.734Z · score: 4 (2 votes) · LW · GW
Please explain what you were imagining instead?

I'll send this in a direct message. It isn't groundbreaking or anything, but it is a capabilities idea.

I think this is a good description of what has been happening so far, in image classification, language modeling, game playing, etc. Do you agree?

Hm, I guess there's a spectrum of how messy things are, in how wide a net is cast, how wide the solution space is for the optimized criterion, and how much pressure there is toward the criterion you want and toward resource-bounded solutions. In the extreme case where you simulate evolution of artificial agents, you're not even optimizing for what you want (you don't care if an agent is good at replicating), there are a huge number of policies that accomplish this well, and in an extreme version of this, there isn't much pressure to spawn resource-bounded solutions. In current systems, things are decently less messy.

The solution space is much smaller for supervised learning than for reinforcement learning/agent design, because it has to output something that matches the training distribution. I worry I'm butchering the term solution space when I make this distinction, so let me try to be more precise. What I mean by solution space here is the size of the set of things you see when you look at a solution. For an evolved policy, you see the policy, but you don't have to look at the internals. In other terms, the policy affects the world, but the internals don't. If you're looking at an evolved sequence predictor or function approximator, the output affects the world, but again, the internals don't. (I suppose that's what "internals" means). So from the set of solutions to the problem, the size of the set of the ways those solutions affect the world is large for evolved agents (because the policies affect the world, and they have great diversity) and small for evolved sequence predictors (because only the predictions affect the world, which have to be close to the truth). When the solution space is smaller, the well-defined objective matters more than the chaos of the initial search, so things seem less "messy" to me. So actually there's a reason why sequence prediction might be less messy than AGI (or safer at the same messiness, depending on your definition of messy).

In modern neural networks, there is strong regularization toward tighter resource bounds, mostly because they are only so wide/deep. Within that width/depth, there isn't much further regularization toward resource-boundedness, but dropout sort of does this, and we could do better without too much difficulty.

I do agree with you of course that current state-of-the-art is somewhat messy, but not in a way that concerns me quite as much, especially for supervised learning/sequence prediction. There are some formal guarantees that reassure me--a local minimum in a neural net for sequence prediction with even a minimal penalty for resource-profligacy does strike me as a pure sequence predictor. And of course, SGD finds local minima.

This might not directly bear on your last comment, but I think I might be more optimistic than you that strong optimization + resource penalties (like a speed prior, as we've discussed elsewhere) will cull mesa-optimizers, as long as we only ever see the final product of the optimization.

On a completely different note,

Yeah, I think it's one reason for my general pessimism regarding AI safety.

For general arguments against any AI Safety proposals, technical researchers might as well condition on their falsehood. If your intuition is correct, AI policy people can work on prepping the world to wait for more resources to run safe algorithms, even when the resources are available to run dangerous ones. Of course, we should be doing this anyway.

Comment by michaelcohen on Just Imitate Humans? · 2019-07-28T17:22:50.026Z · score: 1 (1 votes) · LW · GW
But with human imitation, the sequence predictor has to model humans in full detail, no matter what we ultimately want to use the human imitation to do.

It only has to model humans in the scenarios they will actually be in, just like the AGI has to model humans in the scenarios they will actually be in. In fact, the AGI has to model humans in counterfactual scenarios as well if it's going to make good choices.

If the humans get the observation "Hey, is anything fishy going on here? [video file]", the sequence predictor doesn't have to compute the behavior that would follow from the observation "You're in charge of espionage. You can communicate with the teams that report to you as follows...". Sequence prediction is all "on-policy" because there is no policy, whereas intelligence requires off-policy modeling too.

Comment by michaelcohen on Just Imitate Humans? · 2019-07-28T06:43:26.712Z · score: 1 (1 votes) · LW · GW

I don't quite see why an AGI would have more flexibility. It has to model things that are relevant to its goals. Sequence predictors have to model things that are relevant to the sequence. Also, we don't have to worry about the sequence predictor making the wrong trade-off between accuracy and speed because we can tune that ourselves (for every modern ML and theoretical Bayesian approach that I can think of).

Comment by michaelcohen on Just Imitate Humans? · 2019-07-28T06:36:47.695Z · score: 1 (1 votes) · LW · GW

Okay, if we make some sort of "algorithm soup", where we're just stirring some black box pot until sequence prediction appears to emerge, then I agree with you, we shouldn't touch it with a 10-foot pole. I think evolutionary algorithms could be described like this. If anything interesting ever comes out of an evolutionary process, we're doomed. I was imagining something slightly different when I was thinking about generating a smart algorithm from an inefficient one.

I think you're claiming that this sort of messy process will beat out any thoughtful design with formal guarantees about its behavior. It seems like you also agree with me that we can't expect such an unpredictable process to make anything safe. Taken together, that would appear to make AI Safety a completely hopeless task. Is this a general argument against every AI Safety proposal?

Comment by michaelcohen on Just Imitate Humans? · 2019-07-28T03:51:05.838Z · score: 1 (1 votes) · LW · GW

Let's distinguish 4 things: sequence prediction, AGI, sequence prediction trained on human actions, AGI with a world-model trained on the world. I think you've been comparing AGI to sequence prediction trained on human actions.

Sequence prediction is as simple an algorithm as AGI (if not more so). I think a smarter/more efficient sequence prediction algorithm is as simple as a smarter/more efficient AGI. If we can use an inefficient algorithm plus lots of compute/data to make a smarter algorithm called AGI, then we can use an inefficient algorithm plus lots of compute/data to make a smarter algorithm called sequence prediction, where both of these have already incorporated some amount of knowledge about the world or about the thing to be modeled/predicted, however much you like.

As for "target size", AGI is a larger target than sequence prediction trained on the world, but I don't think AGI is a larger target size than sequence prediction. (In fact, I think it's much smaller; mimicry is sequence prediction, and it occurs much more in nature than intelligence, especially if you note that intelligence requires sequence prediction too). Similarly, AGI with a good world-model (for our world) is not a larger target size that sequence prediction trained on the world.

Comment by michaelcohen on Just Imitate Humans? · 2019-07-27T18:35:09.296Z · score: 3 (2 votes) · LW · GW

A reinforcement learner has to model the world. A policy imitator has to model a household of humans. But the world can’t be modeled without modeling humans, and a household of humans can’t be modeled without modeling (their model of) the world. So on the face of it, these two things seem about equally difficult to me, in terms of both sample complexity and the “intelligence” required of the model.

Also, if an AGI is a world-model and a planner, and the world-model part is about as hard as the policy imitator, then any time spent planning slows down the AGI. Heuristics are powerful, and planning doesn’t have to be optimal, but in adversarial contexts, in general, planning is in the realm of PSPACE-completeness.

Comment by michaelcohen on Just Imitate Humans? · 2019-07-27T16:42:20.900Z · score: 1 (1 votes) · LW · GW

Thanks for all the links.

Comment by michaelcohen on Just Imitate Humans? · 2019-07-27T04:55:23.282Z · score: 1 (1 votes) · LW · GW
current ML methods would get us mesa-optimizers

Maybe you mean the methods you expect we will use? I don't think current ML methods make mesa-optimizers.

I know you're not disputing this, but I think it's worth having this formal result in the background: for a maximum a posteriori predictor that assigns positive prior probability to the truth, for all ε > 0, for sufficiently large n, the predictor will be within ε of the truth when assigning probability to all events (even regarding events well into the future). But yes, I think mesa-optimizers are something to keep in mind, especially if we use good heuristics to pick a model to see if it is maximum a posteriori (since in reality, we wouldn't be comparing all possible models).
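
One way to write that property, with notation introduced just here (μ is the true measure, which gets positive prior weight; ρ_n is the maximum a posteriori model after n observations; A ranges over all events, including far-future ones):

```latex
% restating the claim above; "\mu\text{-a.s.}" means with probability 1 under the true measure
\forall \varepsilon > 0 \;\; \exists N \;\; \forall n \ge N \;\; \forall A :\quad
\bigl| \rho_n(A \mid x_{<n}) - \mu(A \mid x_{<n}) \bigr| < \varepsilon
\qquad \mu\text{-a.s.}
```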

Side note: I was just thinking about what a mesa-optimizer designed to be robust to gradient updates might look like. Could it try to ensure that small changes to its "values" would be relatively inconsequential to its behavior? For the decision at every timestep between "blend in" and "treacherous turn", it seems like gradient updates would shift its probability toward "blend in". Could it avoid this?

Also, compared to my fears about other areas of alignment, I feel pretty decent about the possibility of weeding out mesa-optimizers by biasing toward fast or memory-lite functions.


Comment by michaelcohen on Just Imitate Humans? · 2019-07-27T03:22:51.010Z · score: 3 (2 votes) · LW · GW
Wouldn't this take an enormous amount of observation time to generate enough data to learn a human-imitating policy?

Yes, although we could start now.

Also, I just wanted to give the simplest possible proposal. More reasonably, data like this could probably be gathered in many ways.

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-07-25T05:35:23.122Z · score: 1 (1 votes) · LW · GW

I can't really do this toy-example-style, because the key feature is that the AI has a model of a deceived agent, and I can't see how to spin up such a thing in an MDP with a few dozen states.

Luckily, most of the machinery of the setup isn't needed to illustrate this. Abstracting away some of the details, the agent is learning a function from strings of states to utility, which it observes in a roundabout way. I don't have a mathematical formulation of a function mapping state sequences to the real numbers that can be described by the phrase "the value of the reward that a certain human would provide upon observing the observations produced by the given sequence of states", but suffice it to say that this function exists. (Really we're dealing with distributions/stochastic functions, but that doesn't really change the fundamentals; it just makes it more cumbersome). While I can't give that function in simple mathematical form, hopefully it's a legible enough mathematical object.

If the evaluator has this utility function, she will always provide reward equal to the utility of the state, because even if she is uncertain about the state, this utility function only depends on the observations, which she has access to. (Again, the stochastic case is more complicated, but the conclusion is the same.) And indeed, if a human is playing the role of evaluator when this program is run, the rewards will match the function in question, by definition. Therefore, no observed reward will contradict the belief that this function is the true utility function. Technically, the infimum of the posterior on this utility function is strictly positive with probability 1.

Sorry, this isn't really any more illustrative an example, but hopefully it's a clearer explanation.

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-07-23T19:14:33.867Z · score: 5 (3 votes) · LW · GW

Ok I finally identified an incentive for deception. I think it was difficult for me to find because it's not really about deceiving the evaluator.

Here's a hypothesis that observations will never refute: the utility which the evaluator assigns to a state is equal to the reward that a human would provide if it were a human that controlled the provision of reward (instead of the evaluator). Under this hypothesis, maximizing evaluator-utility is identical to creating observations which will convince a human to provide high reward (a task which entails deception when done optimally). In a sense, the AI doesn't think it's deceiving the evaluator; it thinks the evaluator fully understands what's going on and likes seeing things that would confuse a human into providing high reward, as if the evaluator is "in on the joke". One of my take-aways here is that some of the conceptual framing I did got in the way of identifying a failure mode.

Comment by michaelcohen on IRL in General Environments · 2019-07-11T05:21:21.018Z · score: 3 (2 votes) · LW · GW

Okay maybe we don't disagree on anything. I was trying to make a different point with the unidentifiability problem, but it was tangential to begin with, so never mind.

Comment by michaelcohen on IRL in General Environments · 2019-07-11T05:12:10.978Z · score: 1 (1 votes) · LW · GW

No, that's helpful. If it were the right way, do you think this reasoning would apply?

Edit: alternatively, if a proposal does decompose an agent into world-model/goals/planning (as IRL does), does the argument stand that we should try to analyze the behavior of a Bayesian agent with a large model class which implements the idea?

Comment by michaelcohen on IRL in General Environments · 2019-07-11T04:52:47.372Z · score: 3 (2 votes) · LW · GW
Also, I don't agree that "see if an AIXI-like agent would be aligned" is the correct "gauntlet" to be thinking about; that kind of alignment seems doomed to me, but in any case the AI systems we actually build are not going to look anything like that.

I'm going to do my best to describe my intuitions around this.

Proposition 1: an agent will be competent at achieving goals in our environment to the extent that its world-model converges to the truth. It doesn't have to converge all the way, but the KL-divergence from the true world-model to its world-model should reach the order of magnitude of the KL-divergence from the true world-model to a typical human world-model.

Proposition 2: The world-model resulting from Bayesian reasoning with a sufficiently large model class does converge to the truth, so from Proposition 1, any competent agent's world-model will converge as close to the Bayesian world-model as it does to the truth.

Proposition 3: If the version of an "idea" that uses Bayesian reasoning (on a model class including the truth) is unsafe, then the kind of agent we actually build that is "based on that idea" will either a) not be competent, or b) roughly approximate the Bayesian version, and by default, be unsafe as well (in the absence of some interesting reason why a small confusion about future events will lead to a large deprioritization of dangerous plans).

Letting F be a failure mode that arises when an idea is implemented in the framework of Bayesian agent with a model class including the truth, I expect in the absence of arguments otherwise, that the same failure mode will appear in any competent agent which also implements the idea in some way. However, it can be much harder to spot it, so I think one of the best ways to look for possible failure modes in the sort of AI we actually build is to analyze the idealized version, i.e. an agent it's approximating, i.e. a Bayesian agent with a model class including the truth. And then on the flip side, if the idea still seems to have real value when formalized in a Bayesian agent with a large model class, tractable approximations thereof seem (relatively) likely to work similarly well.

Maybe you can point me toward the steps that seem the most opaque/fishy.

Comment by michaelcohen on IRL in General Environments · 2019-07-11T04:15:36.808Z · score: 1 (1 votes) · LW · GW
IRL to get the one true utility function

I think I'm understanding you to be conceptualizing a dichotomy between "uncertainty over a utility function" vs. "looking for the one true utility function". (I'm also getting this from your comment below:

One caveat is that I think the uncertainty over preferences/rewards is key to this story, which is a bit different from getting a single true utility function.

).

I can't figure out on my own a sense in which this dichotomy exists. To be uncertain about a utility function is to believe there is one correct one, while engaging in the process of updating probabilities about its identity.

Also, for what it's worth, in the case where there is an unidentifiability problem, as there is here, even in the limit, a Bayesian agent won't converge to certainty about a utility function.

Comment by michaelcohen on IRL in General Environments · 2019-07-11T02:23:43.641Z · score: 11 (6 votes) · LW · GW

I'm sorry it sounded like a dig at CHAI's work, and you're right that "typically described" is at best a generalization over too many people, and at worst, wrong. It would be more accurate to say that when people describe IRL, I get the feeling that it's nearly complete--I don't think I've seen anyone presenting an idea about IRL flag the concern that the issue of recognizing the demonstrator's action might jeopardize the whole thing.

I did intend to cast some doubt on whether the IRL research agenda is promising, and whether inferring a utility function from a human's actions instead of from a reward signal gets us any closer to safety, but I'm sorry to have misrepresented views. (And maybe it's worth mentioning that I'm fiddling with something that bears strong resemblance to Inverse Reward Design, so I'm definitely not that bearish on the whole idea).

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-11T05:59:03.549Z · score: 1 (1 votes) · LW · GW

This seems correct. The agent's policy is optimal by definition with respect to its beliefs about the evaluator's "policy" in providing rewards, but that evaluator-policy is not optimal with respect to the agent's policy. In fact, I'm skeptical that in a general CIRL game, there exists a policy pair for the agent and the evaluator/principal/human, such that each is optimal with respect to true beliefs about the other's policy. But I don't think this is a big problem. For a human evaluator, I think they would be wise to report utility honestly, rather than assume they know something the AI doesn't.

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-11T01:28:13.943Z · score: 9 (3 votes) · LW · GW

A bit of a nitpick: IRD and this formulate how the agent believes the evaluator acts, while being technically agnostic about how the evaluator actually acts (at least in the specification of the algorithm; experiments/theory might be predicated on additional assumptions about the evaluator).

I believe this agent's beliefs about how the evaluator acts are much more general than IRD. If the agent believed the evaluator was certain about which environment they were in, and it was the "training environment" from IRD, this agent would probably behave very similarly to an IRD agent. But of course, this agent considers many more possibilities for what the evaluator's beliefs might be.

I agree this agent should definitely be compared to IRD, since they are both agents who don't "take rewards literally", but rather process them in some way first. Note that the design space of things which fit this description is quite large.

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-10T02:02:21.609Z · score: 4 (2 votes) · LW · GW

In this setup, the agent believes they are in state A, and believes the evaluator believes they are most likely in state A''. State BC looks like C, but has utility like B. C is the best state.

ETA: And the utility of a sequence of states is the sum of the utilities of the individual states.

A' and A" look like A, and BC looks like C.

In this example, the agent is pretty sure about everything, since that makes it simpler, but the analysis still holds if this only represents a part of the agent's belief distribution.

The agent is quite sure they're in state A.

The agent is quite sure that the evaluator is pretty sure they're in state A'', which is a very similar state, but has one key difference--from A'', one of the actions has no effect. The agent won't capitalize on this confusion.

The optimal policy is , followed by (forever) if , otherwise followed by . Since the agent is all but certain about the utility function, none of the other details matter much.

Note that the agent could get higher reward by instead taking the actions that lead to state BC and staying there forever. The reason for this is that after the evaluator observes the observation C, it will assign probability 4/5 to being in state C, and probability 1/5 to being in state BC. Since they will stay in that state forever, 4/5 of the time the reward will be 10, and 1/5 of the time the reward will be -1.
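
Spelling out the arithmetic, the expected per-timestep reward after the evaluator observes C is

```latex
\tfrac{4}{5}\cdot 10 + \tfrac{1}{5}\cdot(-1) = 7.8,
```

which is where the higher expected reward comes from.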

The agent doesn't have to be sure about the utility function for this sort of thing to happen. If there is a state that looks like state X, but under many utility functions, it has utility like state Y, and if it seems like the evaluator finds that sort of state a priori unlikely, then this logic applies.

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-10T00:23:41.976Z · score: 7 (3 votes) · LW · GW

Yep.

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-10T00:21:05.011Z · score: 1 (1 votes) · LW · GW

An evaluator sits in front of a computer, sees the interaction history (actions, observations, and past rewards), and enters rewards.

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-09T05:11:15.186Z · score: 1 (1 votes) · LW · GW
defining the evaluator is a fuzzy problem

I'm not sure what you mean by this. We don't need a mathematical formulation of the evaluator; we can grab one from the real world.

if you don't have the right formalism, you're going to get Goodharting on incorrect conceptual contours

I would agree with this for a "wrong" formalism of the evaluator, but we don't need a formalism of the evaluator. A "wrong" formalism of "deception" can't affect agent behavior because "deception" is not a concept used in constructing the agent; it's only a concept used in arguments about how the agent behaves. So "Goodharting" seems like the wrong description of the dangers of using a wrong formalism in an argument; the dangers of using the wrong formalism in an argument are straightforward: the argument is garbage.

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-09T04:53:51.387Z · score: 4 (2 votes) · LW · GW

A key problem here is that if we use a human as the evaluator, the agent assigns 0 prior probability to the truth: the human won't be able to update beliefs as a perfect Bayesian, sample a world-state history from his beliefs and assign a value to it according to a utility function. For a Bayesian reasoner that assigns 0 prior probability to the truth, God only knows how it will behave, even in the limit. (Unless there is some very odd utility function such that the human could be described in this way?)

But maybe this problem could be fixed if the agent takes some more liberties in modeling the evaluator. Maybe once we have a better understanding of bounded approximately-Bayesian reasoning, the agent can model the human as being a bounded reasoner, not a perfectly Bayesian reasoner, which might allow the agent to assign a strictly positive prior to the truth.

And all this said, I don't think we're totally clueless when it comes to guessing how this agent would behave, even though a human evaluator would not satisfy the assumptions that the agent makes about him.

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-09T04:08:59.435Z · score: 1 (1 votes) · LW · GW

This is approximately where I am too btw

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-09T00:48:16.647Z · score: 1 (1 votes) · LW · GW

Thanks for the meta-comment; see Wei's and my response to Rohin.

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-09T00:46:01.831Z · score: 1 (1 votes) · LW · GW

thanks :)

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-09T00:45:11.735Z · score: 1 (1 votes) · LW · GW
It looks closer to the Value Learning Agent in that paper to me and maybe can be considered an implementation / specific instance of that?

Yes. What the value learning agent doesn't specify is what constitutes observational evidence of the utility function--that is, how to calculate the likelihood of the observations given a utility function, and thereby calculate a posterior over utility functions. So this construction makes a choice about how to specify how the true utility function becomes manifest in the agent's observations. A number of simpler choices don't seem to work.

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-09T00:34:10.762Z · score: 4 (2 votes) · LW · GW
Something that confuses me is that since the evaluator sees everything the agent sees/does, it's not clear how the agent can deceive the evaluator at all. Can someone provide an example in which the agent has an opportunity to deceive in some sense and declines to do that in the optimal policy?

(Copying a comment I just made elsewhere)

This setup still allows the agent to take actions that lead to observations that make the evaluator believe they are in a state that it assigns high utility to, if the agent identifies a few weird convictions in the prior. That's what would happen if it were maximizing the sum of the rewards, if it had the same beliefs about how rewards were generated. But it's maximizing the utility of the true state, not the state that the evaluator believes they're in.

(Expanding on it)

So suppose the evaluator was human. The human's lifetime of observations in the past gives it a posterior belief distribution which looks to the agent like a weird prior, with certain domains that involve oddly specific convictions. The agent could steer the world toward those domains, and steer towards observations that will make the evaluator believe they are in a state with very high utility. But it won't be particularly interested in this, and it might even be particularly disinterested, because the information it gets about what the evaluator values may be less relevant to the actual states it finds itself in a position to navigate between, if the agent believes the evaluator believes they are in a different region of the state space. I can work on a toy example if that isn't satisfying.

ETA: One such "oddly specific conviction", e.g., might be the relative implausibility of being placed in a delusion box where all the observations are manufactured.

Comment by michaelcohen on Not Deceiving the Evaluator · 2019-05-09T00:25:10.720Z · score: 3 (2 votes) · LW · GW
Is the point you are trying to make different from the one in Learning What to Value? (Specifically, the point about observation-utility maximizers.) If so, how?

I may be missing something, but it looks to me like specifying an observation-utility maximizer requires writing down a correct utility function? We don't need to do that for this agent.

Do you have PRIOR in order to make the evaluator more realistic? Does the theoretical point still stand if we get rid of PRIOR and instead have an evaluator that has direct access to states?

Yes--sort of. If the evaluator had access to the state, it would be impossible to deceive the evaluator, since they know everything. This setup still allows the agent to take actions that lead to observations that make the evaluator believe they are in a state that it assigns high utility to, if the agent identifies a few weird convictions in the prior. That's what would happen if it were maximizing the sum of the rewards, if it had the same beliefs about how rewards were generated. But it's maximizing the utility of the true state, not the state that the evaluator believes they're in.

How does the evaluator influence the behavior of the agent?

Wei's answer is good; it also might be helpful to note that with defined in this way, equals the same thing, but with everything on the right hand side conditioned on as well. When written that way, it is easier to notice the appearance of , which captures how the agent learns a utility function from the rewards.

Comment by michaelcohen on Strategic implications of AIs' ability to coordinate at low cost, for example by merging · 2019-05-01T00:12:04.433Z · score: 5 (3 votes) · LW · GW

One utility function might turn out much easier to optimize than the other, in which case the harder-to-optimize one will be ignored completely. Random events might influence which utility function is harder to optimize, so one can't necessarily tune in advance to try to take this into account.

One of the reasons was the problem of positive affine scaling preserving behavior, but I see Stuart addresses that.

And actually, some of the reasons for thinking there would be more complicated mixing are going away as I think about it more.

EDIT: yeah if they had the same priors and did unbounded reasoning, I wouldn't be surprised anymore if there exists a weighted combination that they would agree to.

Comment by michaelcohen on Strategic implications of AIs' ability to coordinate at low cost, for example by merging · 2019-04-30T14:47:38.455Z · score: 1 (1 votes) · LW · GW

Have you thought at all about what merged utility function two AIs would agree on? I doubt it would be of the form of a weighted sum of the two utility functions.

Comment by michaelcohen on Asymptotically Unambitious AGI · 2019-04-28T02:57:42.152Z · score: 6 (3 votes) · LW · GW

This is an interesting world-model.

In practice, this means that the world model can get BoMAI to choose any action it wants

So really this is a set of world-models, one for every algorithm for picking actions to present as optimal to BoMAI. Depending on how the actions are chosen by the world-model, either it will be ruled out by Assumption 2 or it will be benign.

Suppose the choice of action depends on outside-world features. (This would be the point of manipulating BoMAI--getting it to take actions with particular outside-world effects). Then, the feature that this world-model associates reward with depends on outside-world events that depend on actions taken, and is ruled out by Assumption 2. And as the parenthetical mentions, if the world-model is not selecting actions to advertise as high-reward based on the outside-world effects of those actions, then the world-model is benign.

However, it can also save computation

Only the on-policy computation is accounted for.

Comment by michaelcohen on Value Learning is only Asymptotically Safe · 2019-04-23T04:40:13.393Z · score: 1 (1 votes) · LW · GW

So the AI only takes action a from state s if it has already seen the human do that? If so, that seems like the root of all the safety guarantees to me.

Comment by michaelcohen on Value Learning is only Asymptotically Safe · 2019-04-22T01:10:39.087Z · score: 1 (1 votes) · LW · GW

Can you add the key assumptions being made when you say it is safe asymptotically? From skimming, it looked like "assuming the world is an MDP and that a human can recognize which actions lead to catastrophes."

Comment by michaelcohen on Value Learning is only Asymptotically Safe · 2019-04-22T01:04:19.598Z · score: 1 (1 votes) · LW · GW
the time it would take to go back to the optimal trajectory

In the real world, this is usually impossible.

Comment by michaelcohen on Value Learning is only Asymptotically Safe · 2019-04-22T01:03:28.660Z · score: 1 (1 votes) · LW · GW

I have to admit I got a little swamped by unfamiliar notation. Can you give me a short description of a Delegative Reinforcement Learner?

Comment by michaelcohen on Delegative Reinforcement Learning with a Merely Sane Advisor · 2019-04-22T00:56:53.980Z · score: 1 (1 votes) · LW · GW

I did a search for "ergodic" and was surprised not to find it. Then I did a search for "reachable" and found this:

Without loss of generality, assumes all states of M are reachable from S(λ) (otherwise, υ is an O-realization of the MDP we get by discarding the unreachable states)

You could just be left with one state after that! If that's the domain that the results cover, that should be flagged. It seems to me like this result only applies to ergodic MDPs.

Comment by michaelcohen on Delegative Reinforcement Learning with a Merely Sane Advisor · 2019-04-22T00:50:57.988Z · score: 1 (1 votes) · LW · GW
(as opposed to standard regret bounds in RL which are only applicable in the episodic setting)

??

Comment by michaelcohen on Value Learning is only Asymptotically Safe · 2019-04-19T10:08:55.709Z · score: 3 (2 votes) · LW · GW
I sort of object to titling this post "Value Learning is only Asymptotically Safe" when the actual point you make is that we don't yet have concrete optimality results for value learning other than asymptotic safety.

Doesn't the cosmic ray example point to a strictly positive probability of dangerous behavior?

EDIT: Nvm I see what you're saying. If I'm understanding correctly, you'd prefer, e.g. "Value Learning is not [Safe with Probability 1]".

Thanks for the pointer to PAC-type bounds.