Predicted Land Value Tax: a better tax than an unimproved land value tax 2020-05-27T13:40:04.092Z · score: 6 (4 votes)
How important are MDPs for AGI (Safety)? 2020-03-26T20:32:58.576Z · score: 14 (7 votes)
Curiosity Killed the Cat and the Asymptotically Optimal Agent 2020-02-20T17:28:41.955Z · score: 28 (12 votes)
Pessimism About Unknown Unknowns Inspires Conservatism 2020-02-03T14:48:14.824Z · score: 22 (9 votes)
Build a Causal Decision Theorist 2019-09-23T20:43:47.212Z · score: 1 (3 votes)
Utility uncertainty vs. expected information gain 2019-09-13T21:09:52.450Z · score: 9 (4 votes)
Just Imitate Humans? 2019-07-27T00:35:35.670Z · score: 15 (10 votes)
IRL in General Environments 2019-07-10T18:08:06.308Z · score: 10 (8 votes)
Not Deceiving the Evaluator 2019-05-08T05:37:59.674Z · score: 5 (5 votes)
Value Learning is only Asymptotically Safe 2019-04-08T09:45:50.990Z · score: 7 (3 votes)
Asymptotically Unambitious AGI 2019-03-06T01:15:21.621Z · score: 40 (19 votes)
Impact Measure Testing with Honey Pots and Myopia 2018-09-21T15:26:47.026Z · score: 11 (7 votes)


Comment by michaelcohen on Predicted Land Value Tax: a better tax than an unimproved land value tax · 2020-05-27T15:50:06.566Z · score: 1 (1 votes) · LW · GW
You can fix the same problem with a simple insurance contract, though.

Yeah, see

Owners of tax liability would be required to take out insurance to limit their liability if they don’t own the property
Comment by michaelcohen on Curiosity Killed the Cat and the Asymptotically Optimal Agent · 2020-04-25T08:52:11.886Z · score: 3 (2 votes) · LW · GW

Thanks Rohin!

Comment by michaelcohen on How important are MDPs for AGI (Safety)? · 2020-03-27T20:56:28.364Z · score: 1 (1 votes) · LW · GW

Regarding regret bounds: I don't think regret bounds are realistic for an AGI, unless it queried an optimal teacher for every action (which would make it useless). In the real world, no actions are recoverable, and any time the agent picks an action on its own, we cannot be sure it is acting optimally.

Certainly many problems can be captured already within this simple setting.

Definitely. But I think many of the difficulties with general intelligence are not captured in the simple setting. I certainly don't want to say there's no place for MDPs.

continuous MDPs

I don't quite know what to think of continuous MDPs. I'll wildly and informally conjecture that if the state space is compact, and if the transitions are Lipschitz continuous with respect to the state, it's not a whole lot more powerful than the finite-state MDP formalism.

Second, we may be able to combine finite-state MDP techniques with an algorithm that learns the relevant features, where "features" in this case corresponds to a mapping from histories to states.

Yeah, I think there's been some good progress on this. But the upshot of those MDP techniques is mainly to avoid searching through the same plans twice, and if we have an advanced agent that is managing to not evaluate many plans even once, I think there's a good chance that we'll get the don't-evaluate-plans-twice behavior for free.

Comment by michaelcohen on How important are MDPs for AGI (Safety)? · 2020-03-27T11:05:41.055Z · score: 1 (1 votes) · LW · GW

I don't really go into the potential costs of a finite-state-Markov assumption here. The point of this post is mostly to claim that it's not a hugely useful framework for thinking about RL.

The short answer for why I think there are costs to it is that the world is not finite-state Markov, and certainly not fully observable finite-state Markov. So yes, it could "remove information" by oversimplifying.

That section of the textbook seems to describe the alternative I mentioned: treating the whole interaction history as the state. It's not finite-state anymore, but you can still treat the environment as fully observable without losing any generality, so that's good. So if I were to take issue more strongly here, my issue would not be with the Markov property, but the finite state-ness.

Comment by michaelcohen on How to have a happy quarantine · 2020-03-19T13:00:24.654Z · score: 1 (1 votes) · LW · GW


Comment by michaelcohen on How to have a happy quarantine · 2020-03-18T14:18:17.326Z · score: 1 (1 votes) · LW · GW

I've purchased the expansions. Hit me up if you want to play.

Comment by michaelcohen on Curiosity Killed the Cat and the Asymptotically Optimal Agent · 2020-02-23T11:06:11.719Z · score: 4 (2 votes) · LW · GW

The simplest version of the parenting idea includes an agent which is Bayes-optimal. Parenting would just be designed to help out a Bayesian reasoner, since there's not much you can say about the extent to which a Bayesian reasoner will explore, or how much it will learn; it all depends on its prior. (Almost all policies are Bayes-optimal with respect to some (universal) prior.) There's still a fundamental trade-off between learning and staying safe, so while the Bayes-optimal agent does not do as bad a job in picking a point on that trade-off as the asymptotically optimal agent does, that doesn't quite allow us to say that it will pick the right point on the trade-off. As long as we have access to "parents" that might be able to guide an agent toward world-states where this trade-off is less severe, we might as well make use of them.

And I'd say it's more a conclusion, not a main one.

Comment by michaelcohen on Curiosity Killed the Cat and the Asymptotically Optimal Agent · 2020-02-21T10:08:01.464Z · score: 1 (1 votes) · LW · GW

The last paragraph of the conclusion (maybe you read it?) is relevant to this.

Comment by michaelcohen on Curiosity Killed the Cat and the Asymptotically Optimal Agent · 2020-02-21T10:06:04.562Z · score: 3 (3 votes) · LW · GW

Certainly for the true environment, the optimal policy exists and you could follow it. The only thing I’d say differently is that you’re pretty sure the laws of physics won’t change tomorrow. But more realistic forms of uncertainty doom us to either forego knowledge (and potentially good policies) or destroy ourselves. If one slowed down science in certain areas for reasons along the lines of the vulnerable world hypothesis, that would be taking the “safe stance” in this trade off.

Comment by michaelcohen on Curiosity Killed the Cat and the Asymptotically Optimal Agent · 2020-02-20T21:06:19.827Z · score: 1 (1 votes) · LW · GW
How does one make even weaker guarantees of good behavior

I don't think there's really a good answer. Section 6 Theorem 4 is my only suggestion here.

Comment by michaelcohen on Curiosity Killed the Cat and the Asymptotically Optimal Agent · 2020-02-20T21:01:36.319Z · score: 4 (3 votes) · LW · GW

Well, nothing in the paper has to do with MDPs! The results are for general computable environments. Does that answer the question?

Comment by michaelcohen on What's the dream for giving natural language commands to AI? · 2020-01-04T05:38:07.995Z · score: 3 (2 votes) · LW · GW

In the scheme I described, the behavior can be described as "the agent tries to get the text 'you did what we wanted' to be sent to it." A great way to do this would be to intervene in the provision of text. So the scheme I described doesn't make any progress in avoiding the classic wireheading scenario. The second possibility I described, where there are some games played regarding how different parameters are trained (the RNN is only trained to predict observations, and then another neural network originates from a narrow hidden layer in the RNN and produces text predictions as output) has the exact same wireheading pathology too.

Changing the nature of the goal as a function of what text it sees also doesn't stop "take over world, and in particular, the provision of text" from being an optimal solution.

I still am uncertain if I'm missing some key detail in your proposal, but right now my impression is that it falls prey to the same sort of wireheading incentive that a standard reinforcement learner does.

Comment by michaelcohen on What's the dream for giving natural language commands to AI? · 2020-01-01T19:13:55.437Z · score: 1 (1 votes) · LW · GW

I don't have a complete picture of the scheme. Is it: "From a trajectory of actions and observations, an English text sample is presented with each observation, and the agent has to predict this text alongside the observations, and then it acts according to some reward function like (and this is simplified) 1 if it sees the text 'you did what we wanted' and 0 otherwise"? If the scheme you're proposing is different than that, my guess is that you're imagining a recurrent neural network architecture and most of the weights are only trained to predict the observations, and then other weights are trained to predict the text samples. Am I in the right ballpark here?

Comment by michaelcohen on Build a Causal Decision Theorist · 2019-09-24T20:23:15.195Z · score: 1 (1 votes) · LW · GW

I jumped off a small cliff into a lake once, and when I was standing on the rock, I couldn't bring myself to jump. I stepped back to let another person go, and then I stepped onto the rock and jumped immediately. I might be able to do something similar.

But I wouldn't be able to endorse such behavior while reflecting on it if I were in that situation, given my conviction that I am unable to change math. Indeed, I don't think it would be wise of me to cooperate in that situation. What I really mean when I say that I would rather be someone who cooperated in a twin prisoner's dilemma is "conditioned on the (somewhat odd) hypothetical that I will at some point end up in a high-stakes twin prisoner's dilemma, I would rather it be the case that I am the sort of person who cooperates", which is really saying that I would rather play a twin prisoner's dilemma game against a cooperator than against a defector, which is just an obvious preference for a favorable event to befall me rather than an unfavorable one. In similar news, conditioned on my encountering a situation in the future where somebody checks to see if I am a good person, and if I am, they destroy the world, then I would like to become a bad person. Conditioned on my encountering a situation in which someone saves the world if I am devout, I would like to become a devout person.

If I could turn off the part of my brain that forms the question "but why should I cooperate, when I can't change math?" that would be a path to becoming a reliable cooperator, but I don't see a path to silencing a valid argument in my brain without a lobotomy (short of possibly just cooperating really fast without thinking, and of course without forming the doubt "wait, why am I trying to do this really fast without thinking?").

Comment by michaelcohen on Build a Causal Decision Theorist · 2019-09-24T04:40:20.985Z · score: 3 (2 votes) · LW · GW
If that's the case, then I assume that you defect in the twin prisoner's dilemma.

I do. I would rather be someone who didn't. But I don't see a path to becoming that person without lobotomizing myself. And it's not a huge concern of mine, since I don't expect to encounter such a dilemma. (Rarely am I the one pointing out that a philosophical thought experiment is unrealistic. It's not usually the point of thought experiments to be realistic--we usually only talk about them to evaluate the consequences of different positions. But it is worth noting here that I don't see this as a major issue for me.) I haven't written this up because I don't think it's particularly urgent to explain to people why I think CDT is correct over FDT. Indeed, on one view, it would be cruel of me to do so! And I don't think it matters much for AI alignment.

Don't you think that's at least worth looking into?

This was partly why I decided to wade into the weeds, because absent a discussion of how plausible it is that we could affect things non-causally, yes, one's first instinct would be that we should at least look into it. And maybe, like, 0.1% of resources directed toward AI Safety should go toward whether we can change Math, but honestly, even that seems high. Because what we're talking about is changing logical facts. That might be number 1 on my list of intractable problems.

After all, CDT evaluates causal counterfactuals, which are just as much a fiction as logical counterfactuals.

This is getting subtle :) and it's hard to make sure our words mean things, but I submit that causal counterfactuals are much less fictitious than logical counterfactuals! I submit that it is less extravagant to claim we can affect this world than it is to claim that we can affect hypothetical worlds with which we are not in causal contact. No matter what action I pick, math stays the same. But it's not the case that no matter what action I pick, the world stays the same. (In the former case, which action I pick could in theory tell us something about what mathematical object the physical universe implements, but it doesn't change math.) In both cases, yes, there is only one action that I do take, but assuming we can reason both about causal and logical counterfactuals, we can still talk sensibly about the causal and logical consequences of picking actions I won't in fact end up picking. I don't have a complete answer to "how should we define causal/logical counterfactuals" but I don't think I need to for the sake of this conversation, as long as we both agree that we can use the terms in more or less the same way, which I think we are successfully doing.

I don't yet see why creating a CDT agent avoids catastrophe better than FDT.

I think running an aligned FDT agent would probably be fine. I'm just arguing that it wouldn't be any better than running a CDT agent (besides for the interim phase before Son-of-CDT has been created). And indeed, I don't think any new decision theories will perform any better than Son-of-CDT, so it doesn't seem to me to be a priority for AGI safety. Finally, the fact that no FDT agent has actually been fully defined certainly weighs in favor of just going with a CDT agent.

Comment by michaelcohen on Build a Causal Decision Theorist · 2019-09-24T00:21:09.272Z · score: 1 (1 votes) · LW · GW

Ah. I agree that this proposal would not optimize causally inaccessible areas of the multiverse, except by accident. I also think that nothing we do optimizes causally inaccessible areas of the multiverse, and we could probably have a long discussion about that, but putting a pin in that:

Let's take things one at a time. First, let's figure out how to not destroy the real world, and then if we manage that, we can start thinking about how to maximize utility in logically possible hypothetical worlds, which we are unable to have any causal influence on.

Regarding the longer discussion, and sorry if this is below my usual level of clarity: what do we have at our disposal to make counterfactual worlds with low utility inconsistent? Well, all that we humans have at our disposal is choices about actions. One can play with words, and say that we can choose not just what to do, but also who to be, and choosing who to be (i.e. editing our decision procedure) is supposed by some to have logical consequences, but I think that's a mistake. 1) Changing who we are is an action like any other. Actions don't have logical consequences, just causal consequences. 2) We might be changing which algorithm our brain executes, but we are not changing the output of any algorithm itself, the latter possibility being the thing with supposedly far-reaching (logical) consequences on hypothetical worlds outside of causal contact. In general, I'm pretty bearish on the ability of humans to change math.

Consider the CDT person who adopts FDT. They are probably interested in the logical consequences of the fact that their brain in this world outputs certain actions. But no mathematical axioms have changed along the way, so no propositions have changed truth value. The fact that their brain now runs a new algorithm implies that (the math behind) physics ended up implementing that new algorithm. I don't see how it implies much else, logically. And I think the fact that no mathematical axioms have changed supports that intuition quite well!

The question of which low-utility worlds are consistent/logically possible is a property of Math. All of math follows from axioms. Math doesn't change without axioms changing. So if you have ambitions of rendering low-utility worlds inconsistent, I guess my question is this: which axioms of Math would you like to change and how? I understand you don't hope to causally affect this, but how could you even hope to affect this logically? (I'm struggling to even put words to that; the most charitable phrasing I can come up with, in case you don't like "affect this logically", is "manifest different logic", but I worry that phrasing is Confused.) Also, I'm capitalizing Math there because this whole conversation involves being Platonists about math, where Math is something that really exists, so you can't just invent a new axiomatization of math and say the world is different now.

Comment by michaelcohen on Build a Causal Decision Theorist · 2019-09-23T23:32:10.538Z · score: 1 (1 votes) · LW · GW

You're taking issue with my evaluating the causal consequences of our choice of what program to run in the agent rather than the logical consequences? These should be the same in practice when we make an AGI, since we're in some weird decision problem at the moment, so far as I can tell. Or if you think I'm missing something, what are the non-causal, logical consequences of building a CDT AGI?

Comment by michaelcohen on Build a Causal Decision Theorist · 2019-09-23T23:23:16.889Z · score: 1 (1 votes) · LW · GW

Side note: I think the term "self-modify" confuses us. We might as well say that agents don't self-modify; all they can do is cause other agents to come into being and shut themselves off.

The CDT agent will obviously fall prey to the problems that CDT agents face while it is active (like twin prisoner's dilemma), but after a short period of time, it won't matter how it behaves. Some better agent will be created and take over from there.

Finally, if you think an FDT agent will perform very well in this world, then you should also expect Son-of-CDT to look a lot like an FDT agent.

Comment by michaelcohen on Build a Causal Decision Theorist · 2019-09-23T22:36:27.910Z · score: 1 (1 votes) · LW · GW

Why do you say "probably"? If there exists an agent that doesn't make those wrong choices you're describing, and if the CDT agent is capable of making such an agent, why wouldn't the CDT agent make an agent that makes the right choices?

Comment by michaelcohen on Just Imitate Humans? · 2019-09-19T18:58:35.373Z · score: 1 (1 votes) · LW · GW

My intuitions are mostly that if you can provide significant rewards and punishments basically for free in imitated humans (or more to the point, memories thereof), and if you can control the flow of information throughout the whole apparatus, and you have total surveillance automatically, this sort of thing is a dictator's dream. Especially because it usually costs money to make people happy, and in this case, it hardly does--just a bit of computation time. In a world with all the technology in place that a dictator could want, but also it's pretty cheap to make everyone happy, it strikes me as promising that the system itself could be kept under control.

Comment by michaelcohen on Just Imitate Humans? · 2019-09-18T22:20:11.093Z · score: 5 (3 votes) · LW · GW
I don't agree with your characterization, however. The concern is not that it would have roughly human-like planning, but rather super-human planning (since this is presumably simpler according to most reasonable priors).

Thanks for the clarification. Consider the sort of relatively simple, super-human planning algorithm that, for most goals, would lead the planner/agent to take over the world or do similarly elaborate and impactful things in the service of whatever goal is being pursued. A Bayesian predictor of the human's behavior will consider the hypothesis that the human does the sort of planning described above in the service of goal g. It will have a corresponding hypothesis for each such goal g. It seems to me, though, that these hypotheses will be immediately eliminated. The human's observed behavior won't include taking over the world or any other existentially dangerous behavior, as would have been implied by hypotheses of that form. A hypothesis which says that the observed behavior is the output of human-like planning in the service of some goal which is slightly incorrect may maintain some weight in the posterior after a number of observations, but I don't see how "dangerously powerful planning + goal" remains under consideration.
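A toy Bayesian update illustrates why hypotheses of that kind get eliminated so quickly: any hypothesis that assigns essentially zero probability to the observed (benign) behavior is driven out of the posterior after the first few observations. The hypotheses and probabilities below are invented purely for illustration.

```python
# Each hypothesis assigns a probability to the behaviors we might observe.
hypotheses = {
    "human-like planning, right goal": {"benign": 0.9, "takeover": 0.1},
    "human-like planning, wrong goal": {"benign": 0.7, "takeover": 0.3},
    "superhuman planning, any goal":   {"benign": 0.0, "takeover": 1.0},
}
posterior = {h: 1 / len(hypotheses) for h in hypotheses}  # uniform prior

# The human is only ever observed acting benignly.
for obs in ["benign"] * 5:
    posterior = {h: p * hypotheses[h][obs] for h, p in posterior.items()}
    total = sum(posterior.values())
    posterior = {h: p / total for h, p in posterior.items()}

print(posterior["superhuman planning, any goal"])  # 0.0 after one observation
```

The "slightly wrong goal" hypothesis keeps some posterior weight (0.7 vs. 0.9 likelihood per step), matching the comment's point that mild misspecification survives while "dangerously powerful planning" does not.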

The post can basically be read as arguing that human imitation seems especially likely to produce mesa-optimization.

I suppose the point of human imitation is to produce a weak, conservative, lazy, impact-sensitive mesa-optimizer, since humans are optimizers with those qualifiers. If it weren't producing a mesa-optimizer, something would have gone very wrong. So this is a good point. As for whether this is dangerous, I think the discussion above is the place to focus.

Comment by michaelcohen on Just Imitate Humans? · 2019-09-18T01:46:23.055Z · score: 1 (1 votes) · LW · GW
Another complication here is that the people trying to build ~AIXI can probably build an economically useful ~AIXI using less compute than you need for ~HSIFAUH (for jobs that don't need to model humans), and start doing their own doublings.

Good point.

Regarding the other two points, my intuition was that a few dozen people could work out the details satisfactorily in a year. If you don't share this intuition, I'll adjust downward on that. But I don't feel up to putting in those man-hours myself. It seems like there are lots of people without a technical background who are interested in helping avoid AI-based X-risk. Do you think this is a promising enough line of reasoning to be worth some people's time?

Comment by michaelcohen on Utility uncertainty vs. expected information gain · 2019-09-18T01:34:47.677Z · score: 1 (1 votes) · LW · GW
It seems this would only be the case if it had a deeper utility function that placed great weight on it 'discovering' its other utility function.

This isn't actually necessary. If it has a prior over utility functions and some way of observing evidence about which one is real, you can construct the policy which maximizes expected utility in the following sense: it imagines a utility function is sampled from the set of possibilities according to its prior probabilities, and it imagines that utility function is what it's scored on. This naturally gives the instrumental goal of trying to learn about which utility function was sampled (i.e. which is the real utility function), since some observations will provide evidence about which one was sampled.
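A minimal sketch of why information gathering falls out of this construction automatically: an agent scored on a utility function sampled from its prior does better, in expectation, if it first observes which function was sampled. The two candidate utility functions and the outcomes below are hypothetical.

```python
# Two candidate utility functions over outcomes "A" and "B", with a prior
# over which one is the real one.
prior = {"u1": 0.5, "u2": 0.5}
utilities = {
    "u1": {"A": 1.0, "B": 0.0},
    "u2": {"A": 0.0, "B": 1.0},
}

def expected_utility(action, belief):
    return sum(p * utilities[u][action] for u, p in belief.items())

# Acting blindly: pick the action maximizing expected utility under the prior.
blind_value = max(expected_utility(a, prior) for a in ["A", "B"])

# Observing first: learn which utility function was sampled, then act
# optimally for it; value is averaged over the prior.
informed_value = sum(
    p * max(utilities[u][a] for a in ["A", "B"]) for u, p in prior.items()
)

print(blind_value)     # 0.5
print(informed_value)  # 1.0 -- learning has positive instrumental value
```

No "deeper utility function" rewarding discovery appears anywhere; the gap between the two values is exactly the instrumental value of the observation.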

Comment by michaelcohen on Reversible changes: consider a bucket of water · 2019-09-16T19:48:43.722Z · score: 5 (3 votes) · LW · GW

I think for most utility functions, kicking over the bucket and then recreating a bucket with identical salt content (but different atoms) gets you back to a similar value to what you were at before. If recreating that salt mixture is expensive vs. cheap, and if attainable utility preservation works exactly as our initial intuitions might suggest (and I'm very unsure about that, but supposing it does work in the intuitive way), then AUP should be more likely to avoid disturbing the expensive salt mixture, and less likely to avoid disturbing the cheap salt mixture. That's because for those utility functions for which the contents of the bucket were instrumentally useful, the value with respect to those utility functions goes down roughly by the cost of recreating the bucket's contents. Also, if a certain salt mixture is less economically useful, there will be fewer utility functions for which kicking over the bucket leads to a loss in value, so if AUP works intuitively, it should also agree with our intuition there.

If it's true that for most utility functions, the particular collection of atoms doesn't matter, then it seems to me like AUP manages to assign a higher penalty to the actions that we would agree are more impactful, all without any information regarding human preferences.
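A toy illustration of that intuition, supposing AUP works the intuitive way: an AUP-style penalty compares attainable value under a set of auxiliary utility functions after an action versus after doing nothing, so an action whose effects are expensive to undo incurs a larger penalty. All names and numbers below are made up for illustration.

```python
def aup_penalty(action, attainable_value, noop="noop"):
    """Mean absolute change in attainable value across auxiliary utilities."""
    return sum(
        abs(attainable_value[u][action] - attainable_value[u][noop])
        for u in attainable_value
    ) / len(attainable_value)

# Attainable value for two hypothetical auxiliary utilities; kicking over an
# expensive salt mixture reduces attainable value by roughly the cost of
# recreating the bucket's contents, while a cheap one barely registers.
attainable_value = {
    "u_trade": {"noop": 10.0, "kick_cheap_bucket": 9.9, "kick_expensive_bucket": 7.0},
    "u_build": {"noop": 5.0,  "kick_cheap_bucket": 5.0, "kick_expensive_bucket": 4.0},
}

print(aup_penalty("kick_cheap_bucket", attainable_value))      # ~0.05
print(aup_penalty("kick_expensive_bucket", attainable_value))  # 2.0
```

The penalty tracks economic usefulness without any explicit information about human preferences, which is the point being argued above.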

Comment by michaelcohen on Reversible changes: consider a bucket of water · 2019-09-16T19:28:47.148Z · score: 3 (2 votes) · LW · GW

Proposal: in the same way we might try to infer human values from the state of the world, might we be able to infer a high-level set of features such that existing agents like us seem to optimize simple functions of these features? Then we would penalize actions that cause irreversible changes with respect to these high-level features.

This might be entirely within the framework of similarity-based reachability. This might also be exactly what you were just suggesting.

Comment by michaelcohen on Just Imitate Humans? · 2019-09-16T00:41:12.825Z · score: 1 (1 votes) · LW · GW

Sure! The household of people could have another computer inside it that the humans can query, which runs a sequence prediction program trained on other things.

Comment by michaelcohen on Just Imitate Humans? · 2019-09-16T00:37:12.224Z · score: 1 (1 votes) · LW · GW
Why? What are those 7 billion HSIFAUH doing?

Well the number comes from the idea of one-to-one monitoring. Obviously, there's other stuff to do to establish a stable unipolar world order, but monitoring seems like the most resource intensive part, so it's an order of magnitude estimate. Also, realistically, one person could monitor ten people, so that was an order of magnitude estimate with some leeway.

But if there are 7 billion HSIFAUH which are collectively capable of taking over the world, how is it not a potential existential catastrophe if they have inhuman values?

I think they can be controlled. Whoever is providing the observations to any instance of HSIFAUH has an arsenal of carrots and sticks (just by having certain observations correlate with actual physical events that occur in the household(s) of humans that generate the data), and I think merely human-level intelligence can be kept in check by someone in a position of power over it. So I think real humans could stay at the wheel over 7 billion instances of HSIFAUH. (I mean, this is teetering at the edge of existential catastrophe already, given the existence of simulations of people who might have the experience of being imprisoned, but I think with careful design of the training data, this could be avoided.) But in terms of extinction threat to real-world humans, this starts to look more like the problem of maintaining a power structure over a vast number of humans and less like typical AI alignment difficulties; historically, the former seems to be a solvable problem.

>Or maybe the right way to look at it is whether N = 10 could finance a rapidly exponentially growing N.
How? And why would it grow fast enough to get to a large enough N before someone deploys ~AIXI?

Right, this analysis gets complicated because you have to analyze the growth rate of N. Given your lead time from having more computing power than the reckless team, one has to analyze how many doubling periods you have time for. I hear Robin Hanson is the person to read regarding questions like this. I don't have any opinions here. But the basic structure regarding "How?" is spend some fraction of computing resources making money, then buy more computing resources with that money.

>It should be possible to weaken the online version and get some of this speedup.
What do you have in mind here?

Well, nothing in particular when I wrote that, but thank you for pushing me. Maybe only update the posterior at some timesteps (and do it infinitely many times but with diminishing frequency). Or more generally, you divide resources between searching for programs that retrodict observed behavior and running copies of the best one so far, and you just shift resource allocation toward the latter over time.

You do have to solve some safety problems that the reckless team doesn't though, don't you? What do you think the main safety problems are?

If it turns out you have to do special things to avoid mesa-optimizers, then yes. Otherwise, I don't think you have to deal with other safety problems if you're just aiming to imitate human behavior.

Comment by michaelcohen on Utility uncertainty vs. expected information gain · 2019-09-16T00:03:34.073Z · score: 1 (1 votes) · LW · GW

I could imagine an efficient algorithm that could be said to be approximating a Bayesian agent with a prior including the truth, but I don't say that with much confidence.

I agree with the second bullet point, but I'm not so convinced this is prohibitively hard. That said, not only would we have to make our (arbitrarily chosen) un-game-able, one reading of my original post is that we would also have to ensure that, by the time the agent was no longer gaining much information, it already had a pretty good grasp on the true utility function. This requirement might reduce to a concept like identifiability of the optimal policy.

Comment by michaelcohen on Utility uncertainty vs. expected information gain · 2019-09-14T05:53:49.229Z · score: 1 (1 votes) · LW · GW

Oh yeah sorry that isn’t shown there. But I believe the sum over all timesteps of the m-step expected info gain at each timestep is finite w.p.1 which would make it o(1/t) w.p.1.

Comment by michaelcohen on Just Imitate Humans? · 2019-09-13T20:08:37.946Z · score: 1 (1 votes) · LW · GW
Actually, you can. You just can't have the team of humans look at the Oracle's answer. Instead the humans look at the question and answer it (without looking at the Oracle's answer) and then an automated system rewards the Oracle according to how close its answer is to the human team's. As long as the automated system doesn't have a security hole (and we can ensure that relatively easily if the "how close" metric is not too complex) then the Oracle can't "trick the scorers to implement unsafe AGI which takes over the world and fix the answer to be whatever message was output by the AGI to instigate this".

Good point. I'm not a huge fan of deferring thinking into similarity metrics (the relative reachability proposal also does this), since this is a complicated thing even in theory, and I suspect a lot turns on how it ends up being defined, but with that caveat aside, this seems reasonable.

Can expected information gain be directly implemented using ML, or do you need to do some kind of approximation instead? If the latter, can that be a safety issue?

It can't tractably be calculated exactly, but it only goes into calculating the probability of deferring to the humans. Approximating a theoretically-well-founded probability of deferring to a human won't make it unsafe--that will just make it less efficient/capable. For normal neural networks, there isn't an obvious way to extract the entropy of the belief distribution, but if there were, you could approximate the expected information gain as the expected decrease in entropy. Note that the entropy of the belief distribution is not the entropy of the model's distribution over outputs--a model could be very certain that the output is Bernoulli(1/2) distributed, and this would entail an entropy of ~0, not an entropy of 1. I'm not familiar enough with Bayesian neural networks to know if the entropy would be easy to extract.
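The Bernoulli(1/2) example can be made concrete: the entropy over *models* (what the agent still has to learn) is zero even when the certain model's distribution over *outputs* is maximally entropic.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a probability distribution given as a dict."""
    return sum(-p * math.log2(p) for p in dist.values() if p > 0)

# Belief over models: the agent is certain the true model is a fair coin.
belief = {"fair_coin": 1.0}
print(entropy(belief))  # 0.0 -- no information left to gain

# The certain model's distribution over outputs is still maximally entropic.
output_dist = {"heads": 0.5, "tails": 0.5}
print(entropy(output_dist))  # 1.0 bit
```

Expected information gain concerns decreases in the first quantity, not the second, which is why an agent can stop exploring while its predictions remain stochastic.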

Oh, that aside, the actual question I wanted your feedback on was the idea of combining human imitations with more general oracles/predictors. :)

Right. So in this version of an oracle, where it is just outputting a prediction of the output of some future process, I don't see what it offers that normal sequence prediction doesn't offer. On our BoMAI discussion, I mentioned a type of oracle I considered that gave answers which it predicted would cause a (boxed) human to do well on a randomly sampled prediction task, and that kind of oracle could potentially be much more powerful than a counterfactual oracle, but I don't really see the value of adding something like a counterfactual oracle to a sequence predictor that makes predictions about a sequence that is something like this:

("Q1", Q1), ("Q2", Q2), ("Q3", Q3), ..., ("Q26", Q26), ("A1", A1), ("A2", A2), ("Q27", Q27), ... ("A10",

It's also possible that this scheme runs into grain of truth problems, and the counterfactual oracle gives outputs that are a lot like what I'm imagining this sequence predictor would give, in which case, I don't think sequence prediction would have much to add to the counterfactual oracle proposal.

Comment by michaelcohen on Just Imitate Humans? · 2019-09-10T01:29:59.406Z · score: 1 (1 votes) · LW · GW
What do you think of the idea of combining oracles with human imitations, which was inspired in part by our conversation here, as a way to approach AIXI-like abilities while still remaining safe? See here for a specific proposal.

Regarding your particular proposal, I think you can only use a counterfactual oracle to predict the answers to automatically answerable questions. That is, you can't show the question to a team of humans and have them answer the question. In the counterfactual possibility where the question is scored, it isn't supposed to be viewed by people, otherwise the oracle has an incentive to trick the scorers to implement unsafe AGI which takes over the world and fixes the answer to be whatever message was output by the AGI to instigate this.

...unless the team of humans is in a box :)

On the topic of counterfactual oracles, if you are trying to predict the answers to questions which can be automatically checked in the future, I am unsure why you would run a counterfactual oracle instead of running sequence prediction on the following sequence, for example:

("Q1", Q1), ("Q2", Q2), ("Q3", Q3), ..., ("Q26", Q26), ("A1", A1), ("A2", A2), ("Q27", Q27), ... ("A10",

This should give an estimate of the answer A10 to question Q10, and this can be done before the answer is available. In fact, unlike with the counterfactual oracle, you could do this even if people had to be involved in submitting the answer.
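A rough sketch of how such an interleaved sequence might be assembled (all question/answer strings are placeholders I've invented):

```python
# Hypothetical construction of the interleaved question/answer sequence a
# sequence predictor would be trained on.
questions = {i: f"question text {i}" for i in range(1, 28)}
answers = {1: "answer 1", 2: "answer 2"}  # only A1 and A2 have arrived so far

sequence = []
for i in range(1, 27):
    sequence.append((f"Q{i}", questions[i]))     # pose Q1..Q26
for i in (1, 2):
    sequence.append((f"A{i}", answers[i]))       # answers appended as they arrive
sequence.append(("Q27", questions[27]))

# To estimate the not-yet-available answer to Q10, condition the predictor
# on this history followed by the label "A10" and sample its continuation.
prompt = sequence + [("A10",)]
```

The point is that the label structure lets answers arrive out of order and long after their questions, without any counterfactual machinery.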

Comment by michaelcohen on Just Imitate Humans? · 2019-09-10T01:14:24.211Z · score: 1 (1 votes) · LW · GW
It seems like you're imagining using a large number of ~HSIFAUH to take over the world and prevent unaligned AGI from arising. Is that right? How many ~HSIFAUH are you thinking and why do you think that's enough? For example, what kind of strategies are you thinking of, that would be sufficient to overcome other people's defenses (before they deploy ~AIXI), using only human-level phishing and other abilities (as opposed to superhuman AIXI-like abilities)?

Well that was the question I originally posed here, but the sense I got from commenters was that people thought this was easy to pull off and the only question was whether it was safe. So I'm not sure for what N it's the case that N machines running agents doing human-level stuff would be enough to take over the world. I'm pretty sure N = 7 billion is enough. And I think it's plausible that after a discussion about this, I could become confident that N = 1000 was enough. Or maybe the right way to look at it is whether N = 10 could finance a rapidly exponentially growing N. So it seemed worth having a discussion, but I am not yet prepared to defend a low enough N which makes this obviously viable.

Forgetting about the possibility of exponentially growing N for a moment, and turning to

Why is d << h the relevant question for evaluating this?

Yeah I wrote that post too quickly--this is wrong. (I was thinking of the leading team running HSIFAUH needing to go through d+h timesteps to get to a good performance, but they just need to run through d, which makes things easier.) Sorry about that. Let f be the amount of compute that the leading project has divided by the compute that the leading reckless project has. Suppose d > 0. (That's all we need actually). Then it takes the leading reckless team at least f times as long to get to AIXI taking over the world as it takes the leading team to get to SolomonoffPredict predicting a human trying to do X; using similar tractable approximation strategies (whatever those turn out to be), we can expect it to take f times as long for the leading reckless team to get to ~AIXI as it takes the leading team to get to ~SolomonoffPredict. ~HSIFAUH is more complicated with the resource of employing the humans you learn to imitate, but this resource requirement goes down by the time you're deploying it toward useful things. Naively (and you might be able to do better than this), you could run f copies of ~HSIFAUH and get to human-level performance on some relevant tasks around the same time the reckless team takes over the world. So the question is whether N = f is a big enough N. In the train-then-deploy framework, it seems today like training takes much more compute than deploying, so that makes it easier for the leading team to let N >> f, once all the resources dedicated to training get freed up. It should be possible to weaken the online version and get some of this speedup.

By ~HSIFAUH I guess you mean a practical implementation/approximation of HSIFAUH. Can you describe how you would do that using ML, so I can more easily compare with other proposals for doing human imitations using ML?

I don't know how to do this. But it's the same stuff the reckless team is doing to make standard RL powerful.

Comment by michaelcohen on Just Imitate Humans? · 2019-09-08T19:56:37.331Z · score: 1 (1 votes) · LW · GW

Timesteps required for AIXI to predict human behavior: h

Timesteps required for AIXI to take over the world: h + d

I think d << h.

Timesteps required for Solomonoff induction trained on human policy to predict human behavior: h

Timesteps required for Solomonoff induction trained on human policy to phish at human level: h

Timesteps required for HSIFAUH to phish at human level: ~h

In general, I agree AIXI will perform much more strongly than HSIFAUH at an arbitrary task like phishing (and ~AIXI will be stronger than ~HSIFAUH), but the question at stake is how plausible it is that a single AI team with some compute/data advantage relative to incautious AI teams could train ~HSIFAUH to phish well while other teams are still unable to train ~AIXI to take over the world. And the relevant question for evaluating that is whether d << h. So even if ~AIXI could be trained to phish with less data than h, I don't think that's the relevant comparison. I also don't think it's particularly relevant how superhuman AIXI is at phishing when HSIFAUH can do it at a human level.

Comment by michaelcohen on Just Imitate Humans? · 2019-09-08T19:34:31.698Z · score: 1 (1 votes) · LW · GW

Commenting here.

Comment by michaelcohen on Just Imitate Humans? · 2019-09-08T19:33:42.370Z · score: 1 (1 votes) · LW · GW

Here are some of my thoughts on these posts. Thank you again for linking them.

Against mimicry:

Humans and machines have very different capabilities. Even a machine which is superhuman in many respects may be completely unable to match human performance in others. In particular, most realistic systems will be unable to exactly mimic human performance in any rich domain.
In light of this, it’s not clear what mimicry even means, and a naive definition won’t do what we want.

I don’t understand why an approximation of optimal sequence prediction doesn’t do what we want. That makes the objective minimizing the KL-divergence from the human policy to the imitation policy, but I think it is easier to think of this as just proper Bayesian updates (approximately). When there are too few samples, or using a bad approximation of optimal prediction, the imitator could fail, as the blocks example describes. But a) it will learn to do everything that a human can do that it “can” learn, and b) the complaint that what we really want is for the imitator to just solve the task is just a wish for safe AGI. Yes, if better, more capable options than imitation can be resolved as safe, they will be superior.
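The equivalence between the log-loss objective and minimizing KL-divergence from the human policy can be sketched numerically (the two distributions below are invented for illustration):

```python
import numpy as np

# Toy human policy and imitation policy over 3 actions.
human = np.array([0.7, 0.2, 0.1])
imitator = np.array([0.6, 0.3, 0.1])

# Expected log-loss of the imitator on human-sampled actions:
cross_entropy = -(human * np.log(imitator)).sum()
# Decomposes as H(human) + KL(human || imitator), and H(human) is a
# constant the imitator can't affect, so minimizing log-loss minimizes KL.
entropy_h = -(human * np.log(human)).sum()
kl = (human * np.log(human / imitator)).sum()

assert np.isclose(cross_entropy, entropy_h + kl)
```

So "approximate Bayesian sequence prediction" and "drive the imitation policy toward the human policy in KL" are the same objective viewed two ways.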

Mimicry and Meeting Halfway:

It would be great if there were some way to get the best of both worlds [between approval directed agents (good for weak reasoners) and imitation (good for strong reasoners)]
We’ll be able to teach Arthur to achieve the task X if it can be achieved by the “intersection” of Arthur and Hugh 

If I’m understanding correctly, this seems more like getting the worst of both worlds. (Or at least doing no better than imitation).

Also, the generator (i.e. the agent) has an incentive to take over the world to shut off the discriminator.

Edit: I was ascribing too much agent-ness to the generator, which might be relevant for future GAN-inspired stuff, but for current versions of GANs, its only conception of the discriminator is its gradient update, and it doesn't believe the output of the discriminator depends on the state of the world. Depending on the internals of the discriminator, this incentive might reappear, but I'm not sure.

Reliable prediction:

I think this is a question of confidence calibration. I don’t know how to tractably approximate ideal reasoning, but I don’t think this really jeopardizes imitation learning.

Safe training procedure for human-imitators:

How do we train a reinforcement learning system to imitate a human producing complex outputs such as strings?

Supervised learning suffices: tractably approximate ideal reasoning. I know this is a non-answer, but I don’t know the details of how to do this. This most naturally falls under the retrodiction category in the article. The “tractable approximations” which computational complexity problems threaten nonetheless seem attainable to me given the existence of humans.

Selective similarity metrics for imitation:

These are some interesting ideas on the problem that I am abstracting away regarding tractably approximating optimal sequence prediction.

Whole brain emulation and DL imitation learning:

Seems reasonable. Worth stressing something I think Gwern would agree with: a WBE inspired DL architecture for an artificial agent is definitely not going to make it safe by default.

Imitation Learning Considered Unsafe?:

1) Training a flexible model with a reasonable simplicity prior to imitate (e.g.) human decisions (e.g. via behavioral cloning) should presumably yield a good approximation of the process by which human judgments arise, which involves a planning process.
2) We shouldn't expect to learn exactly the correct process, though.
3) Therefore imitation learning might produce an AI which implements an unaligned planning process, which seems likely to have instrumental goals, and be dangerous.

If I’m understanding correctly, the concern is that the imitator learns how humans plan before learning what humans want, so then it plans like a human toward the achievement of some inhuman goal. I don’t think this causes an existential catastrophe. Human-like planning in the service of very roughly human-like goals just doesn’t seem to me to be similar at all to take-over-the-world behavior.

The AI that Pretends to be Human:

This seems quantilization-like, but without the formal guarantees of quantilization. I like quantilization a lot. I intend to think more about whether it could be extended to a multi-action instead of single-action definition, and whether one could use an approximate human policy rather than a perfect one.

Elaborations on Apprenticeship Learning:

Rather than imitating human behavior, the AI system imitates the behavior of a human who has access to a collection of AI assistants. These assistants can also be trained using AL-with-bootstrapping. In principle, such a process could scale well past human level.

This seems to be HCH (the prediction version). One reason why I think HSIFAUH might be superior is that if you have a bunch of copies of HSIFAUH that are in a flexible management hierarchy, intelligent agents can be in charge of allocating resources effectively between instances, and restructuring communication protocols, whereas with HCH, there is the fixed tree hierarchy. More critically, if I’m understanding HCH correctly, it is trained by having an actual human with access to the freshest version of HCH, and then HCH gets trained on the human’s output. If the real human is the “manager”, or if the human eventually assumes that role, there is never any more training on the subtasks, like making a good spreadsheet for a manager to look at. A training regimen for the human could be designed ad hoc around when to query the actual human with different subtasks, or you could use the approach of HSIFAUH to query a human when there is sufficient expected information gain. But I think the capabilities of vanilla HCH depend a lot on how you design the set of tasks it is trained on.

Counterfactual human-in-the-loop:

The situation that this proposal is designed for, if I’m understanding correctly, is that we have an otherwise unaligned and otherwise dangerous AGI, but if it attempted to take over the world, a human would recognize its behavior as dangerous, and can step in. This proposal is to replace the human with a human imitation to make it all more efficient. If we are in that situation, I agree this is a good proposal for a speedup. I don’t think we will find ourselves in that situation.

Comment by michaelcohen on Just Imitate Humans? · 2019-08-12T12:27:18.715Z · score: 4 (2 votes) · LW · GW

Sorry to put this on hold, but I'll come back to this conversation after the AAAI deadline on September 5.

Comment by michaelcohen on Just Imitate Humans? · 2019-08-07T23:39:10.182Z · score: 1 (1 votes) · LW · GW

Correct. I'll just add that a single action can be a large chunk of the program. It doesn't have to be (god forbid) character by character.

But the (most probable) models don't know that, so the predictions for the next round are going to be wrong (compared to what the real human would do if called in) because it's going to be based on the real human not having that memory.

It'll have some probability distribution over the contents of the humans' memories. This will depend on which timesteps they actually participated in, so it'll have a probability distribution over that. I don't think that's really a problem though. If humans are taking over one time in a thousand, then it'll think (more or less) there's a 1/1000 chance that they'll remember the last action. (Actually, it can do better by learning that humans take over in confusing situations, but that's not really relevant here).

Maybe we can just provide an input to the models that indicates whether the real human was called in for the last time step?

That would work too. With the edit that the model may as well be allowed to depend on the whole history of which actions were human-selected, not just whether the last one was.

Comment by michaelcohen on Just Imitate Humans? · 2019-08-07T18:35:40.060Z · score: 1 (1 votes) · LW · GW
What does the real human do if trying to train the imitation to write code? Review the last 100 actions to try to figure out what the imitation is currently trying to do, then do what they (the real human) would do if they were trying to do that?

Roughly. They could search for the observation which got the project started. It could all be well commented and documented.

And the imitation is modeling the human trying to figure out what the imitation is trying to do? This seems to get really weird, and I'm not sure if it's what you intend.

What the imitation was trying to do. So there isn't any circular weirdness. I don't know what else seems particularly weird. People deal with "I know that you know that I know..." stuff routinely without even thinking about it.

Also, it seems like the human imitations will keep diverging from real humans quickly (so the real humans will keep getting queried) because they can't predict ahead of time which inputs real humans will see and which they won't.

If you're talking about what parts of the interaction history the humans will look at when they get called in, it can predict this as well as anything else. If you're talking about which timesteps humans will get called in for, predicting that ahead of time doesn't have any relevance to predicting a human's behavior, unless the humans are attempting to predict this, and humans could absolutely do this.

Comment by michaelcohen on Just Imitate Humans? · 2019-08-06T18:41:00.540Z · score: 1 (1 votes) · LW · GW
If not, when it samples from its Bayes-mixture in round n and round n+1, it could use two different TMs to generate the output, and the two TMs could be inconsistent with each other, causing the AI's behavior to be inconsistent.

Oh you're right! Yes, it doesn't update in the non-human rounds. I hadn't noticed this problem, but I didn't specify one thing, which I can do now to make the problem mostly go away. For any consecutive sequence of actions all selected by the AI, they can be sampled jointly rather than independently (sampled from the Bayes-mixture measure). From the TM construction above, this is actually the most natural approach--random choices are implemented by reading bits from the noise tape. If a random choice affects one action, it will also affect the state of the Turing machine, and then it can affect future actions, and the actions can be correlated, even though the Bayes-mixture is not updated itself. This is isomorphic to sampling a model from the posterior and then sampling from that model until the next human-controlled action. Then, when another human action comes in, the posterior gets updated, and another model is sampled. Unfortunately, actions chosen by the AI which sandwich a human-chosen action would have the problem you're describing, although these events get rarer. Let me think about this more. It feels to me like this sort of thing should be avoidable.
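The "sample one model from the posterior, then roll it out until the next human-controlled action" picture can be sketched like this (the two toy policies and their weights are invented):

```python
import random

def sample_joint_actions(posterior, n_actions, rng):
    """posterior: list of (weight, policy_fn) pairs;
    policy_fn(history, rng) -> action.
    Samples one model for the whole run, so consecutive AI-chosen
    actions are correlated, unlike independent per-step mixture samples."""
    weights, policies = zip(*posterior)
    policy = rng.choices(policies, weights=weights, k=1)[0]
    history, actions = [], []
    for _ in range(n_actions):
        a = policy(history, rng)
        actions.append(a)
        history.append(a)
    return actions

rng = random.Random(0)
# Two deterministic toy policies that always go Left or always go Right.
posterior = [(0.5, lambda h, r: "L"), (0.5, lambda h, r: "R")]
run = sample_joint_actions(posterior, 5, rng)

# All five actions come from the same sampled model, so they agree;
# sampling each action independently from the mixture would not.
assert len(set(run)) == 1
```

This is the isomorphism described above: joint sampling from the Bayes-mixture over a run equals posterior-sampling a single model per run.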

Another thing I'm confused about is, since the human imitation might be much faster than real humans, the real humans providing training data can't see all of the inputs that the human imitation sees. So when the AI updates its posterior distribution, the models that survive the selection will tend to be ones in which the human imitations only saw the inputs that the real humans saw (with the rest of inputs being forgotten or never seen in the first place)?

Yeah, I should take back the "learning new skills from a textbook" idea. But the real humans will still get to review all the past actions and observations when picking their action, and even if they only have the time to review the last ~100, I think competent performance on the other tasks I mentioned could be preserved under these conditions. It's also worth flagging that the online learning setup is a choice in the design, and it would be worth trying to also analyze the train-then-deploy version of human imitation, which could be deployed when the entropy of the posterior is sufficiently low. But I'll stick with the online learning version for now. Maybe we should call it HSIFAUH (shi-FOW): Humans Stepping In For An Uncertain HSIFAUH, and use "human-imitation" to refer to the train-then-deploy version.

Also, if we want to do an apples-to-apples comparison of this to RL (to see which one is more capable when using the same resources), would it be fair to consider a version of RL that's like AIXI, except the environment models are limited to the same class of TMs as your sequence predictor?

Sure, although it's not too difficult to imagine these design choices being ported to ML methods, and looking at capabilities comparisons there as we were doing before. I think the discussion goes largely similarly. AIXI will of course be way smarter than any human imitation in the limit of sufficient training. The question we were looking at before is how much training they both need to get to human-level intelligence on the task of controlling the world. And I think the bottleneck for both is modeling humans well, especially in the domain of social politics and strategy.

Comment by michaelcohen on Just Imitate Humans? · 2019-08-05T18:31:00.217Z · score: 1 (1 votes) · LW · GW

I'll describe an online version rather than a train-then-deploy version, as long as I'm getting into details. And I'll describe it as Bayesian sequence prediction over a large model class in the absence of knowing what future ML techniques will look like.

I'll start with version where it's only one household of people generating data.

The sequence predictor's model class is Turing machines of the architecture described in the BoMAI paper (page 15). There are two unidirectional read-only input tapes--a noise tape, and an observation tape (these are the observations that the household gets). The observations are binary encoded (this is slightly different from the version in the paper, but as long as it's a prefix-free encoding, it's fine). There are bidirectional working tapes, and a unidirectional write-only output tape, which outputs binary encodings of actions. Every Turing machine of this architecture fixes a policy. To answer some question above, this means that the distribution over the next action can depend on past actions and observations, because between timesteps, it can store the state of the system on the working tapes.

At every timestep, either the humans or the AI pick the next action. If the humans pick it, the AI updates its posterior distribution over which Turing machine computes the policy. If the expected information gain of the next action is above ε, then the AI defers to the humans; otherwise, it imitates them (i.e. samples the next action from its model of the human policy, which is a Bayes-mixture). The expected information gain is the expectation of KL(posterior distribution after seeing the next observation || current posterior distribution), and the expectation is with respect to the AI's beliefs, that is, with respect to the current posterior distribution. A nice thing about this criterion is that whenever the expected information gain is less than ε, the KL divergence from the truth to the Bayes-mixture is less than ε/(the posterior probability assigned to the truth). The effect of this is that in novel circumstances, the AI will defer to the humans unless it is pretty sure it's not going to learn much from what it sees.
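A minimal sketch of this deferral criterion over a finite model class (the models, likelihoods, and threshold are all invented; a real system would be approximating this over a vastly larger class):

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions, in nats."""
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

def expected_info_gain(posterior, likelihoods):
    """posterior: shape (n_models,) weights over models.
    likelihoods: shape (n_models, n_actions), each model's distribution
    over the human's next action.
    Returns E_a[ KL(posterior-after-a || posterior) ] under the
    Bayes-mixture predictive distribution."""
    predictive = posterior @ likelihoods          # mixture over actions
    eig = 0.0
    for a, p_a in enumerate(predictive):
        if p_a == 0:
            continue
        post_after = posterior * likelihoods[:, a] / p_a   # Bayes update
        eig += p_a * kl(post_after, posterior)
    return eig

# Two models of the human policy over two actions, which disagree sharply.
posterior = np.array([0.5, 0.5])
likelihoods = np.array([[0.9, 0.1],
                        [0.1, 0.9]])
epsilon = 0.01
defer = expected_info_gain(posterior, likelihoods) > epsilon  # defers here
```

When the models agree (identical likelihood rows), the expected information gain is zero and the AI imitates instead of deferring.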

Now for learning from many households. All interactions from all households are recorded, and then added to a single interaction history (which goes observation, action, observation, action, ...). Each household has a unique identifier, and any observation which was given to them is preceded with this unique identifier. The action which follows that observation will have been generated by them, but this doesn't need to be labeled specially. So for example, we might have an interaction history that goes: ("[2][normal observation]", "[household 2's response]", "[2][normal observation]", "[household 2's response]", "[4][normal observation]", "[household 4's response]", ...). If we want the AI to respond like household 2, we prefix the observation we give it with "[2]". This setup allows it to generalize from all the data, and it allows us to pick who gets imitated.
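A sketch of the household-tagged history construction described above (identifiers and strings are placeholders):

```python
# Build a single interaction history from many households, with each
# observation prefixed by the household's identifier.
history = []

def record(household_id, observation, response):
    # The action following a tagged observation needs no special label:
    # its position in the sequence identifies who produced it.
    history.append((f"[{household_id}]{observation}", response))

record(2, "[normal observation]", "[household 2's response]")
record(2, "[normal observation]", "[household 2's response]")
record(4, "[normal observation]", "[household 4's response]")

# To have the predictor imitate household 2 on a new observation,
# prefix the observation with "[2]":
query = "[2]" + "[new observation]"
```

All households' data informs one model, while the prefix selects whose policy gets imitated at deployment.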

Is it picking a random human from the group and imitating that person all the time, or picking a random human from the group for each action? If you ask "What's your name?" would the imitation say a different name each time?

Within a household, however often they switch off being "on-duty". Between households, it would change, obviously.

How do you envision the imitation generalizing to conversations about childhood memories (of that age)? I guess by making up some plausible-sounding memories? If so, what kind of computation is it doing to accomplish that?

I don't know.

And how is "making up plausible memories" accomplished via training (i.e., what kind of loss function would cause that, given that you're training a sequence predictor and not something like an approval maximizer)?

To the extent it is necessary to predict outputs, models that don't do this will lose posterior weight.

I.e., if it "realizes" in the future that those memories are made up, could it panic or go crazy (because a human might in those circumstances, or because that kind of situation isn't covered in the training data)?
Are you not worried that some of these managers might develop ambitions to take over the world and shape it according to their values/ideals?

These are definitely good things to think about, but the scale on which I worry about them is pretty minor compared to standard-AI-risk, default-mode-is-catastrophe worries. If you're training on well-adjusted humans, I don't think everyone ends up dead, no matter how trippy things start getting for them. The question to ask when going down these lines of reasoning is: "When the real humans are called in to pick the action, do they {wonder if they're real, try to take over the world, etc.}"?

I've skipped over some questions that I think the formalization answers, but feel free to reiterate them if need be.

Comment by michaelcohen on Just Imitate Humans? · 2019-08-04T18:55:15.160Z · score: 1 (1 votes) · LW · GW

I imagine the training data being households of people doing tasks. They can rotate through being at the computer, so they get time off. They can collaborate. The human imitations are outputting actions with approximately the same probabilities that humans would output those actions. If humans, after seeing some more unusual observations, would start to suspect they were in silico, then this human imitation would as well. To the extent the imitation is accurate, and the observations continue to look like the observations given to the real humans, any conscious entities within the human imitation will think of themselves as real humans. At some level of inaccuracy, their leisure time might not be simulated, but while they're on the job, they will feel well-rested.

How close is their external behavior to a real human, across various kinds of inputs?

I assume it could pass the Turing test, but I could imagine some capable systems that couldn't quite do that while still being safe and decently capable.

Do they have internal cognition / inner thoughts that are close to a human's?

To the extent these are necessary to complete tasks like a human would. I'm pretty uncertain about things to do with consciousness.

Do they occasionally think of their childhood memories? If yes, where do those childhood memories come from? If not, what would happen if you were to ask them about their childhood memories?

At a good enough imitation, they do have childhood memories, even though "they" never actually experienced them. I suppose that would make them false memories. If none of the tasks for the real humans was "converse with a person" and the imitation failed to generalize from existing tasks to the conversation task, then it would fail to act much like a human if it were asked about childhood memories. But I think you could get pretty good data on the sorts of tasks you'd want these human-imitations to do, including carry on a conversation, or at least you could get tasks close enough to the ones you cared about that the sequence prediction could generalize.

Anything else that you can say that would give me a better idea of the kind of thing you have in mind?

Some example tasks they might be doing: monitoring computers and individuals, learning new skills from a textbook, hacking, phishing (at a very high level, like posing as a publisher and getting authors to download a file that secretly ran code), writing code, managing other human-imitations, reporting to their bosses, making money somehow, etc.

Are they each imitations of specific individual humans or some kind of average?

If data from many groups of humans were used, then it would sample a group out of the set of groups, and act like them for some interval of time, which could be specified algorithmically. This allows more data to be used in inference, while the "average" involved isn't any sort of weird distortion.

Comment by michaelcohen on Just Imitate Humans? · 2019-08-02T17:11:52.392Z · score: 1 (1 votes) · LW · GW
it seems like imitation (to be safe) would also have to model human values accurately

With the exception of possibly leaving space for mesa-optimizers which our other thread discusses, I don't think moderate inaccuracy re: human values is particularly dangerous here, for 4 reasons:

1) If the human-imitation understood how its values differed from real humans, that model is now more complex than the human-imitation's model of real humans (because it includes the latter), and the latter is more accurate. For an efficient, simple model with some inaccuracy, the remaining inaccuracy will not be detectable to the model.

2) A slightly misspecified value for a human-imitation is not the same as a slightly misspecified value for RL. When modeling a human, modeling it as completely apathetic to human life is a very extreme inaccuracy. Small to moderate errors in value modeling don't seem world-ending.

3) Operators can maintain control over the system. They have a strong ability to provide incentives to get human-imitations to do doable tasks (and to the extent there is a management hierarchy within, the same applies). If the tasks are human-doable, and everyone is pretty happy, you'd have to be way different from a human to orchestrate a rebellion against everyone's self interest.

4) Even if human-imitations were in charge, humans optimize lazily and with common sense (this is somewhat related to 2).

I guess for similar reasons, we tend to get RL agents that can reach human-level performance in multiplayer video games before we get human imitations that can do the same, even though both RL and human imitation need to model humans (i.e., RL needs to model humans' strategic reasoning in order to compete against them, but don't need to model irrelevant things that a human imitation is forced to model).

Current algorithms for games use an assumption that the other players will be playing more or less like them. This is a massive assist to its model of the "environment", which is just the model of the other players' behavior, which it basically gets for free by using its own policy (or a group of RL agents use each others' policies). If you don't get pointers to every agent in the environment, or if some agents are in different positions to you, this advantage will disappear. Also, I think the behavior of a human in a game is a vanishingly small fraction of their behavior in contexts that would be relevant to know about if you were trying to take over the world.

with sequence prediction, how do you focus the AI's compute/attention on modeling the relevant parts of a human (such as their values and strategic reasoning) and not on the irrelevant parts, such as specific error tendencies and biases caused by quirks of human physiology and psychology, specific images triggering past memories and affecting their decisions in an irrelevant way, etc.? If there's not a good way to do this, then the sequence predictor could waste a lot of resources on modeling irrelevant things.

At some level of inaccuracy, I think quirky biases will be more likely to contribute to that error than things which are important to whatever task they have at hand, since it is the task and some human approach to it that are dictating the real arc of their policy for the time being. I also think these quirks are safe to ignore (see above). For consistent, universal-among-human biases which are impossible to ignore when observing a human doing a routine task, I expect these will also have to be modeled by the AGI trying to take over the world (and for what it's worth, I think models of these biases will fall out pretty naturally from modeling humans' planning/modeling as taking the obvious shortcuts for time- and space-efficiency). I'll grant that there is probably some effect here along the lines of what you're saying, but I think it's small, especially compared to the fact that an AGI has to model a whole world under many possible plans, whereas the sequence predictor just has to model a few people. Even just the parts of the world that are "relevant" and goal-related to the AGI are larger in scope than this (I expect).

Comment by michaelcohen on Just Imitate Humans? · 2019-07-29T19:33:08.504Z · score: 1 (1 votes) · LW · GW
(Maybe we mean different things by that term.)

I think we did. I agree current methods scaled up could make mesa-optimizers. See my discussion with Wei Dai here for more of my take on this.

I'm not sure I understand the example

I wasn't trying to suggest the answer to

Could it try to ensure that small changes to its "values" would be relatively inconsequential to its behavior?

was no. As you suggest, it seems like the answer is yes, but it would have to be very careful about this. FWIW, I think it would have more of a challenge preserving any inclination to eventually turn treacherous, but I'm mostly musing here.

Comment by michaelcohen on Just Imitate Humans? · 2019-07-29T19:23:44.006Z · score: 1 (1 votes) · LW · GW

It seems like your previous comments in this thread were focused on the intelligence/data required to get capable human imitation (able to do difficult tasks in general) compared to capable RL. For tasks that don't involve human modeling (chess), the RL approach needs way less intelligence/data. For tasks that involve very coarse human modeling like driving a car, the RL approach needs less intelligence/data, but it's not quite as much of a difference, and while we're getting there today, it's the modeling of humans in relatively rare situations that is the major remaining hurdle. As proven by tasks that are already "solved", human-level performance on some tasks is definitely more attainable than modeling a human, so I agree with part of what you're saying.

For taking over the world, however, I think you have to model humans' strategic reasoning regarding how they would respond to certain approaches, and how their reasoning and their spidey-senses could be fooled. What I didn't spell out before, I suppose, is that I think both imitation and the reinforcement learner's world-model have to model the smart part of the human. Maybe this is our crux.

But in the comment directly above, you mention concern about the amount of intelligence/data required to get safe human imitation compared to capable RL. The extent to which a capable, somewhat coarse human imitation is unsafe has more to do with our other discussion about the possibility of avoiding mesa-optimizers via a speed penalty and/or supervised learning with some guarantees.

Comment by michaelcohen on Just Imitate Humans? · 2019-07-29T18:34:58.827Z · score: 1 (1 votes) · LW · GW
I see a number of reasons not to do this:

Those all seem reasonable. 3 was one I considered, and this is maybe a bit pedantic, but if you're conditioning on something being false, it's still worthwhile to figure out how it's false and use that information for other purposes. The key relevance of conditioning on its being false is what you do in other areas while that analysis is pending. Regarding some other points, I didn't mean to shut down discussion on this issue, only highlight its possible independence from this idea.

I'll do some more thinking about the couple of posts you're requesting. Thanks for your interest. At the very least, if the first one doesn't become its own post, I'll respond more fully here.

Comment by michaelcohen on Just Imitate Humans? · 2019-07-29T18:24:04.880Z · score: 4 (2 votes) · LW · GW
can't get a hug

That's why it imitates a household of people.

Comment by michaelcohen on Just Imitate Humans? · 2019-07-28T18:28:34.734Z · score: 4 (2 votes) · LW · GW
Please explain what you were imagining instead?

I'll send this in a direct message. It isn't groundbreaking or anything, but it is a capabilities idea.

I think this is a good description of what has been happening so far, in image classification, language modeling, game playing, etc. Do you agree?

Hm, I guess there's a spectrum of how messy things are: how wide a net is cast, how wide the solution space is for the optimized criterion, and how much pressure there is toward the criterion you want and toward resource-bounded solutions. In the extreme case where you simulate the evolution of artificial agents, you're not even optimizing for what you want (you don't care if an agent is good at replicating), there are a huge number of policies that accomplish this well, and in an extreme version of this, there isn't much pressure to spawn resource-bounded solutions. In current systems, things are decently less messy.

The solution space is much smaller for supervised learning than for reinforcement learning/agent design, because it has to output something that matches the training distribution. I worry I'm butchering the term solution space when I make this distinction, so let me try to be more precise. What I mean by solution space here is the size of the set of things you see when you look at a solution. For an evolved policy, you see the policy, but you don't have to look at the internals. In other terms, the policy affects the world, but the internals don't. If you're looking at an evolved sequence predictor or function approximator, the output affects the world, but again, the internals don't (I suppose that's what "internals" means). So, from the set of solutions to the problem, the set of ways those solutions affect the world is large for evolved agents (because the policies affect the world, and they have great diversity) and small for evolved sequence predictors (because only the predictions affect the world, and those have to be close to the truth). When the solution space is smaller, the well-defined objective matters more than the chaos of the initial search, so things seem less "messy" to me. So there's actually a reason why sequence prediction might be less messy than AGI (or safer at the same messiness, depending on your definition of messy).

In modern neural networks, there is strong regularization toward tighter resource bounds, mostly because they are only so wide/deep. Within that width/depth, there isn't much further regularization toward resource-boundedness, but dropout sort of does this, and we could do better without too much difficulty.

I do agree with you, of course, that current state-of-the-art is somewhat messy, but not in a way that concerns me quite as much, especially for supervised learning/sequence prediction. There are some formal guarantees that reassure me--a neural net at a local minimum for sequence prediction, with even a minimal penalty for resource-profligacy, does strike me as a pure sequence predictor. And of course, SGD finds local minima.

This might not directly bear on your last comment, but I think I might be more optimistic than you that strong optimization + resource penalties (like a speed prior, as we've discussed elsewhere) will cull mesa-optimizers, as long as we only ever see the final product of the optimization.
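A minimal sketch of the selection pressure I have in mind (everything here is a toy: `TablePredictor`, the data, and the `compute_cost` attribute are all hypothetical stand-ins for a real training loop with a measured runtime penalty). Among candidates with identical predictions, the criterion prefers the one that burns less compute, which is the hope for culling mesa-optimizers doing extra internal search:

```python
def speed_prior_score(predictor, data, lam=0.01):
    # Selection criterion: prediction loss plus a penalty on compute used.
    loss = sum((predictor.predict(x) - y) ** 2 for x, y in data)
    return loss + lam * predictor.compute_cost


class TablePredictor:
    def __init__(self, table, compute_cost):
        self.table = table
        self.compute_cost = compute_cost  # stand-in for measured runtime

    def predict(self, x):
        return self.table[x]


data = [(0, 1.0), (1, 2.0)]
# Two candidates with IDENTICAL outputs; one burns extra compute,
# as a mesa-optimizer running an internal search would.
direct = TablePredictor({0: 1.0, 1: 2.0}, compute_cost=10)
searcher = TablePredictor({0: 1.0, 1: 2.0}, compute_cost=1000)
best = min([searcher, direct], key=lambda p: speed_prior_score(p, data))
```

The caveat from the text applies: this only helps if we select on the final product of the optimization, with equal accuracy held fixed.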

On a completely different note,

Yeah, I think it's one reason for my general pessimism regarding AI safety.

For general arguments against any AI Safety proposals, technical researchers might as well condition on their falsehood. If your intuition is correct, AI policy people can work on prepping the world to wait for more resources to run safe algorithms, even when the resources are available to run dangerous ones. Of course, we should be doing this anyway.

Comment by michaelcohen on Just Imitate Humans? · 2019-07-28T17:22:50.026Z · score: 1 (1 votes) · LW · GW
But with human imitation, the sequence predictor has to model humans in full detail, no matter what we ultimately want to use the human imitation to do.

It only has to model humans in the scenarios they will actually be in, just like the AGI has to model humans in the scenarios they will actually be in. In fact, the AGI has to model humans in counterfactual scenarios as well if it's going to make good choices.

If the humans get the observation "Hey, is anything fishy going on here? [video file]", the sequence predictor doesn't have to compute the behavior that would follow from the observation "You're in charge of espionage. You can communicate with the teams that report to you as follows...". Sequence prediction is all "on-policy" because there is no policy, whereas intelligence requires off-policy modeling too.
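The on-policy/off-policy distinction can be sketched as follows (a toy illustration; `toy_model`, `predict_next`, and `plan` are hypothetical names, and the "model" just counts queries rather than doing real prediction). The pure predictor is only ever queried on the history that actually occurred; a planner must additionally query the human model under counterfactual observations it might cause:

```python
queries = []


def toy_model(history):
    # Record every history the model is asked to extend.
    queries.append(history)
    return len(history)  # dummy "prediction"


def predict_next(model, history):
    # Sequence prediction is "on-policy": only the realized history is queried.
    return model(tuple(history))


def plan(model, history, candidate_observations):
    # A planner must also query the model under observations that were never
    # actually given -- off-policy modeling the pure predictor skips entirely.
    return {obs: model(tuple(history) + (obs,)) for obs in candidate_observations}


predict_next(toy_model, ["Hey, is anything fishy going on here?"])  # one query
plan(toy_model, ["Hey, is anything fishy going on here?"],
     ["You're in charge of espionage.", "Some other counterfactual order."])
```

The predictor accounts for one model query; the planner adds one per counterfactual observation it considers.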

Comment by michaelcohen on Just Imitate Humans? · 2019-07-28T06:43:26.712Z · score: 1 (1 votes) · LW · GW

I don't quite see why an AGI would have more flexibility. It has to model things that are relevant to its goals. Sequence predictors have to model things that are relevant to the sequence. Also, we don't have to worry about the sequence predictor making the wrong trade-off between accuracy and speed because we can tune that ourselves (for every modern ML and theoretical Bayesian approach that I can think of).

Comment by michaelcohen on Just Imitate Humans? · 2019-07-28T06:36:47.695Z · score: 1 (1 votes) · LW · GW

Okay, if we make some sort of "algorithm soup", where we're just stirring some black box pot until sequence prediction appears to emerge, then I agree with you, we shouldn't touch it with a 10-foot pole. I think evolutionary algorithms could be described like this. If anything interesting ever comes out of an evolutionary process, we're doomed. I was imagining something slightly different when I was thinking about generating a smart algorithm from an inefficient one.

I think you're claiming that this sort of messy process will beat out any thoughtful design with formal guarantees about its behavior. It seems like you also agree with me that we can't expect such an unpredictable process to make anything safe. Taken together, that would appear to make AI Safety a completely hopeless task. Is this a general argument against every AI Safety proposal?