Comment by rohinmshah on IRL in General Environments · 2019-07-11T16:29:40.352Z · score: 2 (1 votes) · LW · GW

... Plausibly? Idk, it's very hard for me to talk about the validity of intuitions in an informal, intuitive model that I don't share. I don't see anything obviously wrong with it.

There's the usual issue that Bayesian reasoning doesn't properly account for embeddedness, but I don't think that would make much of a difference here.

Comment by rohinmshah on The AI Timelines Scam · 2019-07-11T05:27:50.870Z · score: 13 (4 votes) · LW · GW

¯\_(ツ)_/¯

Note that even if AI researchers do this similarly to other groups of people, that doesn't change the conclusion that there are distortions that push towards shorter timelines.

Comment by rohinmshah on IRL in General Environments · 2019-07-11T05:08:12.946Z · score: 4 (2 votes) · LW · GW

Sorry in advance for how unhelpful this is going to be. I think decomposing an agent into "goals", "world-model", and "planning" is the wrong way to be decomposing agents. I hope to write a post about this soon.

Comment by rohinmshah on IRL in General Environments · 2019-07-11T04:59:39.713Z · score: 2 (1 votes) · LW · GW
I think I'm understanding you to be conceptualizing a dichotomy between "uncertainty over a utility function" vs. "looking for the one true utility function".

Well, I don't personally endorse this. I was speculating on what might be relevant to Stuart's understanding of the problem.

I was trying to point towards the dichotomy between "acting while having uncertainty over a utility function" vs. "acting with a known, certain utility function" (see e.g. The Off-Switch Game). I do know about the problem of fully updated deference and I don't know what Stuart thinks about it.

Also, for what it's worth, in the case where there is an unidentifiability problem, as there is here, even in the limit, a Bayesian agent won't converge to certainty about a utility function.

Agreed, but I'm not sure why that's relevant. Why do you need certainty about the utility function, if you have certainty about the policy?

Comment by rohinmshah on IRL in General Environments · 2019-07-11T03:45:04.289Z · score: 2 (1 votes) · LW · GW
Does this not sound like a plan of running (C)IRL to get the one true utility function?

I do not think that is actually his plan, but I agree it sounds like it. One caveat is that I think the uncertainty over preferences/rewards is key to this story, which is a bit different from getting a single true utility function.

But really my answer is, the inferential distance between Stuart and the typical reader of this forum is very large. (The inferential distance between Stuart and me is very large.) I suspect he has very different empirical beliefs, such that you could reasonably say that he's working on a "different problem", in the same way that MIRI and I work on radically different stuff mostly due to different empirical beliefs.

Comment by rohinmshah on The AI Timelines Scam · 2019-07-11T03:30:33.461Z · score: 24 (12 votes) · LW · GW

Planned summary:

This post argues that AI researchers and AI organizations have an incentive to predict that AGI will come soon, since that leads to more funding, and so we should expect timeline estimates to be systematically too short. Besides the conceptual argument, we can also see this in the field's response to critics: both historically and now, criticism is often met with counterarguments based on "style" rather than engaging with the technical meat of the criticism.

Planned opinion:

I agree with the conceptual argument, and I think it does hold in practice, quite strongly. I don't really agree that the field's response to critics implies that they are biased towards short timelines -- see these comments. Nonetheless, I'm going to do exactly what this post critiques, and say that I put significant probability on short timelines, but not explain my reasons (because they're complicated and I don't think I can convey them, and certainly can't convey them in a small number of words).

Comment by rohinmshah on IRL in General Environments · 2019-07-10T20:11:15.410Z · score: 6 (5 votes) · LW · GW
My main point is that IRL, as it is typically described, feels nearly complete: just throw in a more advanced RL algorithm as a subroutine and some narrow-AI-type add-on for identifying human actions from a video feed, and voila, we have a superhuman human helper.
[...]
But maybe we could be spending more effort trying to follow through to fully specified proposals which we can properly put through the gauntlet.

Regardless of whether it is intended or not, this sounds like a dig at CHAI's work. I do not think that IRL is "nearly complete". I expect that researchers who have been at CHAI for at least a year do not think that IRL is "nearly complete". I wrote a sequence partly for the purpose of telling everyone "No, really, we don't think that we just need to run IRL to get the one true utility function; we aren't even investigating that plan".

(Sorry, this shouldn't be directed just at you in particular. I'm annoyed at how often I have to argue against this perception, and this paper happened to prompt me to actually write something.)

Also, I don't agree that "see if an AIXI-like agent would be aligned" is the correct "gauntlet" to be thinking about; that kind of alignment seems doomed to me, but in any case the AI systems we actually build are not going to look anything like that.

Comment by rohinmshah on Diversify Your Friendship Portfolio · 2019-07-10T19:42:12.512Z · score: 27 (12 votes) · LW · GW

Strongly agree. Another benefit is that it exposes you to a broader swath of the world, which makes your models of the world better / more generalizable. I often feel like the rationalist community has "beliefs about people" that I think only apply to a small subset of people, e.g.

• People need to find meaning in their jobs to be happy
• Everyone thinks that the thing that they are doing is "good for the world" or "morally right" (as opposed to thinking that the thing they are doing is justifiable / reasonable to do)
Comment by rohinmshah on [AN #59] How arguments for AI risk have changed over time · 2019-07-09T15:27:10.345Z · score: 5 (3 votes) · LW · GW

I see, so the argument is mostly that jobs are performed more stably and so you can learn better how to deal with the principal-agent problems that arise. This seems plausible.

Comment by rohinmshah on What's the most "stuck" you've been with an argument, that eventually got resolved? · 2019-07-09T15:14:50.302Z · score: 2 (1 votes) · LW · GW

I don't think that's it. The inference I most disagree with is "rationality must have a simple core", or "Occam's razor works on rationality". I'm sure there's some meaning of "fundamental" or "epistemologically basic" such that I'd agree that rationality has that property, but that doesn't entail "rationality has a simple core".

Comment by rohinmshah on [AN #59] How arguments for AI risk have changed over time · 2019-07-09T03:44:18.695Z · score: 4 (2 votes) · LW · GW
The core of my intuition is that with different optimized AIs, it will be straightforward to determine exactly what the principal-agent problem consists of, and this can be compensated for.

I feel like it is not too hard to determine principal-agent problems with humans either? It's just hard to adequately compensate for them.

Comment by rohinmshah on Learning biases and rewards simultaneously · 2019-07-09T03:40:47.058Z · score: 2 (1 votes) · LW · GW
Would you associate "ambitious value learning vs. adequate value learning" with "works in theory vs. doesn't work in theory but works in practice"?

Potentially. I think the main question is whether adequate value learning will work in practice.

Comment by rohinmshah on Musings on Cumulative Cultural Evolution and AI · 2019-07-08T18:01:28.674Z · score: 3 (2 votes) · LW · GW
Moreover, there is a core difference between the growth of the cost of brain size between humans and AI (sublinear vs linear).

Actually, I was imagining that for humans the cost of brain size grows superlinearly. The paper you linked uses a quadratic function, and also tried an exponential and found similar results.

But in the world where AI dev faces hardware constraints, social learning will be much more useful.

Agreed if the AI uses social learning to learn from humans, but that only gets you to human-level AI. If you want to argue for something like fast takeoff to superintelligence, you need to talk about how the AI learns independently of humans, and in that setting social learning won't be useful given linear costs.

E.g. suppose that each unit of adaptive knowledge requires one unit of asocial learning. Every unit of learning costs $K, regardless of brain size, so that everything is linear. No matter how much social learning you have, the discovery of N units of knowledge is going to cost $NK, so the best thing you can do is put N units of asocial learning in a single brain/model so that you don't have to pay any cost for social learning.

In contrast, if N units of asocial learning in a single brain costs $KN^2, then having N units of asocial learning in a single brain/model is very expensive. You can instead have N separate brains each with 1 unit of asocial learning, for a total cost of $NK, and that is enough to discover the N units of knowledge. You can then invest a unit or two of social learning for each brain/model so that they can all accumulate the N units of knowledge, giving a total cost that is still linear in N.

I'm claiming that AI is more like the former while this paper's model is more like the latter. Higher hardware constraints only change the value of K, which doesn't affect this analysis.
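
The comparison above can be sketched numerically. This is only an illustration under stated assumptions, none of which come from the paper's actual model: the superlinear case is taken to be quadratic, each unit of social learning is assumed to cost the same $K as a unit of asocial learning, and each separate brain is given 2 units of social learning. The function names are hypothetical.

```python
def single_brain_cost(n, k, exponent):
    """Cost of putting n units of asocial learning into one brain/model.

    exponent=1 is the linear (AI-like) case; exponent=2 is the assumed
    quadratic (human-like) case.
    """
    return k * n ** exponent


def many_brains_cost(n, k, social_units_per_brain=2):
    """Cost of n separate brains, each with 1 unit of asocial learning
    plus a few units of social learning (assumed to cost k each)."""
    return n * k * (1 + social_units_per_brain)


def cheapest(n, k, exponent):
    """Return which arrangement is cheaper for discovering n units of knowledge."""
    single = single_brain_cost(n, k, exponent)
    many = many_brains_cost(n, k)
    return "single" if single <= many else "many"


if __name__ == "__main__":
    for k in (1.0, 5.0):
        print(k, cheapest(10, k, exponent=1), cheapest(10, k, exponent=2))
```

Under these assumptions the single brain wins when costs are linear and the many socially-learning brains win when costs are quadratic, and changing K scales both sides equally, so it does not change which arrangement is cheaper.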

## [AN #59] How arguments for AI risk have changed over time

2019-07-08T17:20:01.998Z · score: 43 (9 votes)
Comment by rohinmshah on Musings on Cumulative Cultural Evolution and AI · 2019-07-07T22:10:41.552Z · score: 8 (4 votes) · LW · GW

Planned summary:

A recent paper develops a conceptual model that retrodicts human social learning. They assume that asocial learning allows you to adapt to the current environment, while social learning allows you to copy the adaptations that other agents have learned. Both can be increased by making larger brains, at the cost of increased resource requirements. What conditions lead to very good social learning?

First, we need high transmission fidelity, so that social learning is effective. Second, we need some asocial learning, in order to bootstrap -- mimicking doesn't help if the people you're mimicking haven't learned anything in the first place. Third, to incentivize larger brains, the environment needs to be rich enough that additional knowledge is actually useful. Finally, we need low reproductive skew, that is, individuals that are more adapted to the environment should have only a slight advantage over those who are less adapted. (High reproductive skew would select too strongly for high asocial learning.) This predicts pair bonding rather than a polygynous mating structure.

This story cuts against the arguments in Will AI See Sudden Progress? and Takeoff speeds: it seems like evolution "stumbled upon" high asocial and social learning and got a discontinuity in reproductive fitness of species. We should potentially also expect discontinuities in AI development.

We can also forecast the future of AI based on this story. Perhaps we need to be watching for the perfect combination of asocial and social learning techniques for AI, and once these components are in place, AI intelligence will develop very quickly and autonomously.

Planned opinion:

As the post notes, it is important to remember that this is one of many plausible accounts for human success, but I find it reasonably compelling. It moves me closer to the camp of "there will likely be discontinuities in AI development", but not by much.

I'm more interested in what predictions about AI development we can make based on this model. I actually don't think that this suggests that AI development will need both social and asocial learning: it seems to me that in this model, the need for social learning arises because of the constraints on brain size and the limited lifetimes. Neither of these constraints applies to AI -- costs grow linearly with "brain size" (model capacity, maybe also training time) as opposed to superlinearly for human brains, and the AI need not age and die. So, with AI I expect that it would be better to optimize just for asocial learning, since you don't need to mimic the transmission across lifetimes that was needed for humans.

Comment by rohinmshah on A shift in arguments for AI risk · 2019-07-07T19:44:43.893Z · score: 9 (4 votes) · LW · GW

Planned summary:

Early arguments for AI safety focus on existential risk caused by a failure of alignment combined with a sharp, discontinuous jump in AI capabilities. The discontinuity assumption is needed in order to argue for a treacherous turn, for example: without a discontinuity, we would presumably see less capable AI systems fail to hide their misaligned goals from us, or attempt to deceive us without success. Similarly, in order for an AI system to obtain a decisive strategic advantage, it would need to be significantly more powerful than all the other AI systems already in existence, which requires some sort of discontinuity.

Now, there are several other arguments for AI risk, though none of them has been made in great detail, and they are spread out over a few blog posts. This post analyzes several of them and points out some open questions.

First, even without a discontinuity, a failure of alignment could lead to a bad future: since the AIs have more power and intelligence, their values will determine what happens in the future, rather than ours. (Here **it is the difference between AIs and humans that matters**, whereas for a decisive strategic advantage it is the difference between the most intelligent agent and the next-most intelligent agents that matters.) See also More realistic tales of doom and Three impacts of machine intelligence. However, it isn't clear why we wouldn't be able to fix the misalignment at the early stages when the AI systems are not too powerful.

Even if we ignore alignment failures, there are other AI risk arguments. In particular, since AI will be a powerful technology, it could be used by malicious actors; it could help ensure robust totalitarian regimes; it could increase the likelihood of great-power war, and it could lead to stronger competitive pressures that erode value. With all of these arguments, it's not clear why they are specific to AI in particular, as opposed to any important technology, and the arguments for risk have not been sketched out in detail.

The post ends with an exhortation to AI safety researchers to clarify which sources of risk motivate them, because it will influence what safety work is most important, it will help cause prioritization efforts that need to determine how much money to allocate to AI risk, and it can help avoid misunderstandings with people who are skeptical of AI risk.

Planned opinion:

I'm glad to see more work of this form; it seems particularly important to gain more clarity on what risks we actually care about, because it strongly influences what work we should do. In the particular scenario of an alignment failure without a discontinuity, I'm not satisfied with the solution "we can fix the misalignment early on", because early on even if the misalignment is apparent to us, it likely will not be easy to fix, and the misaligned AI system could still be useful because it is "aligned enough", at least at this low level of capability.

Personally, the argument that motivates me most is "AI will be very impactful, and it's worth putting effort into making sure that that impact is positive". I think the scenarios involving alignment failures without a discontinuity are a particularly important subcategory of this argument: while I do expect we will be able to handle this issue if it arises, this is mostly because of meta-level faith in humanity to deal with the problem. We don't currently have a good object-level story for why the issue _won't_ happen, or why it will be fixed when it does happen, and it would be good to have such a story in order to be confident that AI will in fact be beneficial for humanity.

I know less about the non-alignment risks, and my work doesn't really address any of them. They seem worth more investigation; currently my feeling towards them is "yeah, those could be risks, but I have no idea how likely the risks are".

Comment by rohinmshah on AGI will drastically increase economies of scale · 2019-07-06T02:13:48.655Z · score: 3 (2 votes) · LW · GW

Oh, right, I forgot we were considering the setting where we already have AGI systems that can be intent aligned. This seems like a plausible story, though it only implies that there is centralization within the corrupted nation.

Comment by rohinmshah on Learning biases and rewards simultaneously · 2019-07-06T01:54:52.253Z · score: 9 (6 votes) · LW · GW

Planned summary:

Typically, inverse reinforcement learning assumes that the demonstrator is optimal, or that any mistakes they make are caused by random noise. Without a model of how the demonstrator makes mistakes, we should expect that IRL would not be able to outperform the demonstrator. So, a natural question arises: can we learn the systematic mistakes that the demonstrator makes from data? While there is an impossibility result here, we might hope that it is only a problem in theory, not in practice.

In this paper, my coauthors and I propose that we learn the cognitive biases of the demonstrator, by learning their planning algorithm. The hope is that the cognitive biases are encoded in the learned planning algorithm. We can then perform bias-aware IRL by finding the reward function that when passed into the planning algorithm results in the observed policy. We have two algorithms which do this, one which assumes that we know the ground-truth rewards for some tasks, and one which tries to keep the learned planner “close to” the optimal planner. In a simple environment with simulated human biases, the algorithms perform better than the standard IRL assumptions of perfect optimality or Boltzmann rationality -- but they lose a lot of performance by using an imperfect differentiable planner to learn the planning algorithm.

Planned opinion:

Although this only got published recently, it’s work I did over a year ago. I’m no longer very optimistic about ambitious value learning, and so I’m less excited about its impact on AI alignment now. In particular, it seems unlikely to me that we will need to infer all human values perfectly, without any edge cases or uncertainties, which we then optimize as far as possible. I would instead want to build AI systems that start with an adequate understanding of human preferences, and then learn more over time, in conjunction with optimizing for the preferences they know about. However, this paper is more along the former line of work, at least for long-term AI alignment.

I do think that this is a contribution to the field of inverse reinforcement learning -- it shows that by using an appropriate inductive bias, you can become more robust to (cognitive) biases in your dataset. It’s not clear how far this will generalize, since it was tested on simulated biases on simple environments, but I’d expect it to have at least a small effect. In practice though, I expect that you’d get better results by providing more information, as in T-REX.

## Learning biases and rewards simultaneously

2019-07-06T01:45:49.651Z · score: 41 (11 votes)
Comment by rohinmshah on AGI will drastically increase economies of scale · 2019-07-05T22:57:49.104Z · score: 3 (2 votes) · LW · GW

You'd have to get the employees to move there, which seems like a dealbreaker currently given how hot of a commodity AI researchers are.

Comment by rohinmshah on What's the most "stuck" you've been with an argument, that eventually got resolved? · 2019-07-04T17:10:26.745Z · score: 9 (4 votes) · LW · GW

Realism about rationality is an ongoing one for me that hasn't yet gotten unstuck. See in particular Vanessa and ricraz:

Vanessa Kosoy:

However, this does not mean that it is impossible to speak of a relatively simple abstract theory of intelligence. This is because the latter theory aims to describe mindspace as a whole rather than describing a particular rather arbitrary point inside it.
[...]
Now, "rationality" and "intelligence" are in some sense even more fundamental than physics. Indeed, rationality is what tells us how to form correct beliefs, i.e. how to find the correct theory of physics. Looking at anthropic paradoxes, it is even arguable that making decisions is even more fundamental than forming beliefs (since anthropic paradoxes are situations in which assigning subjective probabilities seems meaningless but the correct decision is still well-defined via "functional decision theory" or something similar). Therefore, it seems like there has to be a simple theory of intelligence, even if specific instances of intelligence are complex by virtue of their adaptation to specific computational hardware, specific utility function (or maybe some more general concept of "values"), somewhat specific (although still fairly diverse) class of environments, and also by virtue of arbitrary flaws in their design (that are still mild enough to allow for intelligent behavior).

ricraz:

This feels more like a restatement of our disagreement than an argument. I do feel some of the force of this intuition, but I can also picture a world in which it's not the case. Note that most of the reasoning humans do is not math-like, but rather a sort of intuitive inference where we draw links between different vague concepts and recognise useful patterns - something we're nowhere near able to formalise. I plan to write a follow-up post which describes my reasons for being skeptical about rationality realism in more detail.

Vanessa Kosoy:

I don't think it's a mere restatement? I am trying to show that "rationality realism" is what you should expect based on Occam's razor, which is a fundamental principle of reason. Possibly I just don't understand your position. In particular, I don't know what epistemology is like in the world you imagine. Maybe it's a subject for your next essay.

ricraz:

Sorry, my response was a little lazy, but at the same time I'm finding it very difficult to figure out how to phrase a counterargument beyond simply saying that although intelligence does allow us to understand physics, it doesn't seem to me that this implies it's simple or fundamental. Maybe one relevant analogy: maths allows us to analyse tic-tac-toe, but maths is much more complex than tic-tac-toe. I understand that this is probably an unsatisfactory intuition from your perspective, but unfortunately don't have time to think too much more about this now; will cover it in a follow-up.

I fall pretty strongly in ricraz's camp, and I feel the same way, especially the sentence "I'm finding it very difficult to figure out how to phrase a counterargument beyond simply saying that although intelligence does allow us to understand physics, it doesn't seem to me that this implies it's simple or fundamental."

Comment by rohinmshah on Research Agenda in reverse: what *would* a solution look like? · 2019-07-04T02:33:01.793Z · score: 2 (1 votes) · LW · GW

I agree with basically all of this; maybe I'm more pessimistic about tractability, but not enough to matter for any actual decision.

It sounds to me that given these beliefs the thing you would want to advocate is "let those who want to figure out a theory of human preferences do so and don't shun them from AI safety". Perhaps also "let's have some introductory articles for such a theory so that new entrants to the field know that it is a problem that could use more work and can make an informed decision about what to work on". Both of these I would certainly agree with.

In your original comment it sounded to me like you were advocating something stronger: that a theory of human preferences was necessary for AI safety, and (by implication) at least some of us who don't work on it should switch to working on it. In addition, we should differentially encourage newer entrants to the field to work on a theory of human preferences, rather than some other problem of AI safety, so as to build a community around (4). I would disagree with these stronger claims.

Do you perhaps only endorse the first paragraph and not the second?

Comment by rohinmshah on Research Agenda in reverse: what *would* a solution look like? · 2019-07-03T02:21:05.568Z · score: 2 (1 votes) · LW · GW
That still seems dangerous to me, since I see no reason to believe it wouldn't end up optimizing for something we didn't want. I guess you would have a theory of optimization and agents so good you could know that it wouldn't optimize in ways you didn't want it to

In my head, the theory + implementation ensures that all of the optimization is pointed toward the goal "try to help the human". If you could then legitimately say "it could still end up optimizing for something else", then we don't have the right theory + implementation as I'm imagining it.

but I think this also begs the question by hiding details in "want" that would ultimately require a sufficient theory of human preferences.

I think it's hiding details in "optimization", "try" and "help" (and to a lesser extent, "human"). I don't think it's hiding details in "want". You could maybe argue that any operationalization of "help" would necessarily have "want" as a prerequisite, but this doesn't seem obvious to me.

You could also argue that any beneficial future requires us to figure out our preferences, but that wouldn't explain why it had to happen before building superintelligent AI.

As I often say, the reason I think we need to prioritize a theory of human preferences is not because I have a slam dunk proof that we need it, but because I believe we fail to adequately work to mitigate known risks of superintelligent AI if we don't because we don't, on the other side, have a slam dunk argument for why we wouldn't end up needing it, and I'd rather live in a world where we worked it out and didn't need it than one where we didn't work it out and do need it.

I agree with this, but it's not an argument on the margin. There are many aspects of AI safety I could work on. Why a theory of human preferences in particular, as opposed to e.g. detecting optimization?

Comment by rohinmshah on Research Agenda in reverse: what *would* a solution look like? · 2019-07-03T01:24:50.958Z · score: 2 (1 votes) · LW · GW

The thing you're describing is a theory of human preferences, not (4): An actual grounded definition of human preferences (which implies that in addition to the theory we need to run some computation that produces some representation of human preferences). I was mostly arguing against requiring an actual grounded definition of human preferences.

I am unsure on the question of whether it is necessary to have a theory of human preferences or values. I agree that such a theory would help us evaluate whether a particular AI agent is going to be aligned. But how much does it help? I can certainly see other paths that don't require it. For example, if we had a theory of optimization and agents, and a method of "pointing" optimization power at humans so that the AI is "trying to help the human", I could imagine feeling confident enough to turn on that AI system. (It obviously depends on the details.)

Comment by rohinmshah on Research Agenda in reverse: what *would* a solution look like? · 2019-07-01T15:47:57.809Z · score: 4 (2 votes) · LW · GW

There's a difference between "creating an explicit preference learning system" and "having a generally capable system learn preferences". I think the former is difficult (because of the Occam's razor argument) but the latter is not.

Suppose I told you that we built a superintelligent AI system without thinking at all about grounded human preferences. Do you think that AI system doesn't "know" what humans would want it to do, even if it doesn't optimize for it? (See also this failed utopia story.)

Comment by rohinmshah on Research Agenda in reverse: what *would* a solution look like? · 2019-06-30T18:32:00.514Z · score: 7 (3 votes) · LW · GW
I encountered serious AI safety researchers who were dismissive of the need to work on (4)

The argument against (4) is that the AI will be able to figure out our preferences since it is superintelligent, so all we need to do is ensure that it is incentivized to figure out and satisfy our preferences, and then it will do the rest. I wouldn't dismiss work on (4), but it doesn't seem like the highest priority given this argument.

One potential counterargument is that the AI must look like an expected utility maximizer due to coherence arguments, and so we need to figure out the utility function, but I don't buy this argument.

Comment by rohinmshah on [AN #58] Mesa optimization: what it is, and why we should care · 2019-06-26T16:24:20.346Z · score: 3 (2 votes) · LW · GW
I'm curious about what value "this thing that isn't learned cooperation" doesn't capture.

It suggests that in other environments that aren't tragedies of the commons, the technique won't lead to cooperation. It also suggests that you could get the same result by giving the agents any sort of extra reward (that influences their actions somehow).

Is "useful" a global improvement, or a local improvement?

Also not clear what the answer to this is.

Hm, I thought of them as things that would require looking at:
1) Behavior in environments constructed for that purpose.
2) Looking at the information the agents communicate.

The agents won't work in any environment other than the one they were trained in, and the information they communicate is probably in the form of vectors of numbers that are not human-interpretable. It's not impossible to analyze them, but it would be difficult.

Comment by rohinmshah on [AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming · 2019-06-25T16:15:34.946Z · score: 2 (1 votes) · LW · GW
In other words, I don't think the choice of prior here is substantially different or more difficult from the choice of prior for AGI from a pure capability POV.

This seems wrong to me, but I'm having trouble articulating why. It feels like for the actual "prior" we use there will be many more hypotheses for capable behavior than for safe, capable behavior.

A background fact that's probably relevant: I don't expect that we'll be using an explicit prior, and to the extent that we have an implicit prior, I doubt it will look anything like the universal prior.

The way I imagine it will work, the advisor will not do something weird and complicated that ey don't understand emself. [...] I have a research direction based on the "debate" approach about how to strengthen it.

Yeah, this seems good to me!

The current version of the formalism is more or less the latter, but you should imagine the review to be rather conservative (like in the nanorobot example).

Okay, that makes sense.

Comment by rohinmshah on [AN #58] Mesa optimization: what it is, and why we should care · 2019-06-24T23:36:24.231Z · score: 2 (1 votes) · LW · GW
and this implicitly means there's a constructed group of quasi-altruistic agents who are getting less concrete reward because they're being incentivized by this auxiliary reward.

Suppose Alice and Bob are in an iterated prisoner's dilemma ($2/$2 for both cooperating, $1/$1 for both defecting, and $3/$0 when one defects and the other cooperates, with the defector getting the $3). I now tell Alice that actually she can have an extra $5 each time if she always cooperates. Now the equilibrium is for Alice to always cooperate and Bob to always defect (which is not an equilibrium behavior in normal IPD).
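
The shift in equilibrium can be checked with a small sketch. It makes simplifying assumptions that are not in the comment: it looks only at the one-shot stage game rather than the full iterated game, and it models the bonus as $5 paid to Alice in any round where she cooperates. The helper names are hypothetical.

```python
from itertools import product

C, D = "C", "D"

# (alice_move, bob_move) -> (alice_payoff, bob_payoff)
BASE = {
    (C, C): (2, 2),
    (D, D): (1, 1),
    (C, D): (0, 3),  # Alice cooperates, Bob defects
    (D, C): (3, 0),  # Alice defects, Bob cooperates
}


def with_bonus(payoffs, bonus):
    """Return a copy of the game where Alice gets `bonus` extra whenever she cooperates."""
    return {
        (a, b): (pa + (bonus if a == C else 0), pb)
        for (a, b), (pa, pb) in payoffs.items()
    }


def pure_nash(payoffs):
    """List the pure-strategy Nash equilibria of a 2x2 game."""
    equilibria = []
    for a, b in product((C, D), repeat=2):
        pa, pb = payoffs[(a, b)]
        # Neither player can gain by unilaterally switching moves.
        alice_ok = all(payoffs[(a2, b)][0] <= pa for a2 in (C, D))
        bob_ok = all(payoffs[(a, b2)][1] <= pb for b2 in (C, D))
        if alice_ok and bob_ok:
            equilibria.append((a, b))
    return equilibria


if __name__ == "__main__":
    print(pure_nash(BASE))
    print(pure_nash(with_bonus(BASE, 5)))
```

In the base game the only pure equilibrium of the stage game is mutual defection; with the bonus it becomes Alice cooperating and Bob defecting, which matches the point that the auxiliary reward changes the equilibrium without making Bob any more cooperative.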

The worry here is that by adding this extra auxiliary intrinsic reward, you are changing the equilibrium behavior. In particular, agents will exploit the commons less and instead focus more on finding and transmitting useful information. This doesn't really seem like you've "learned cooperation".
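The Alice and Bob example above can be checked mechanically. The following is only a sketch under simplifying assumptions that are not in the original comment: it treats a single round as a one-shot game, applies the $5 bonus per round whenever Alice cooperates, and looks only at pure-strategy equilibria. The names `BASE`, `with_bonus`, and `pure_nash` are made up for illustration.

```python
from itertools import product

ACTIONS = ("C", "D")  # cooperate, defect

# (alice_action, bob_action) -> (alice_payoff, bob_payoff), in dollars.
BASE = {
    ("C", "C"): (2, 2),
    ("C", "D"): (0, 3),
    ("D", "C"): (3, 0),
    ("D", "D"): (1, 1),
}


def with_bonus(payoffs, bonus):
    """Add an auxiliary reward to Alice whenever she cooperates (assumed per-round)."""
    return {
        (a, b): (pa + (bonus if a == "C" else 0), pb)
        for (a, b), (pa, pb) in payoffs.items()
    }


def pure_nash(payoffs):
    """Return action pairs where neither player gains by deviating unilaterally."""
    equilibria = []
    for a, b in product(ACTIONS, repeat=2):
        pa, pb = payoffs[(a, b)]
        alice_ok = all(payoffs[(a2, b)][0] <= pa for a2 in ACTIONS)
        bob_ok = all(payoffs[(a, b2)][1] <= pb for b2 in ACTIONS)
        if alice_ok and bob_ok:
            equilibria.append((a, b))
    return equilibria


print(pure_nash(BASE))                 # [('D', 'D')]
print(pure_nash(with_bonus(BASE, 5)))  # [('C', 'D')]
```

Under these assumptions the bonus moves the stage-game equilibrium from mutual defection to Alice cooperating while Bob defects, which matches the claim that the auxiliary reward changes equilibrium behavior rather than teaching cooperation. The real iterated game admits other equilibria that this one-shot check does not cover.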

This reminds me of OpenAI Five - the way they didn't communicate, but all had the same information.

Note that in the referenced paper the agents don't have the same information. (I think you know that, just wanted to clarify in case you didn't.)

Most of the considerations you bring up are not things that can be easily evaluated from the paper (and are hard to evaluate even if you have the agents in code and can look into their innards). One exception: you should expect that the information given by the agent will be true and useful: if it weren't, then the other agents would learn to ignore the information over time, which means that the information doesn't affect the other agents' actions and so won't get any intrinsic reward.

I'm surprised they got good results from "try to get other agents to do something different", but it is the borrowing from the structure of causality.

I do think that it's dependent on the particular environments you use.

This reminds me of the Starcraft AI, AlphaStar. While I didn't get all the details I recall something about the reason for the population was so they could each be given a bunch of different narrower/easier objectives than "Win the game" like "Build 2 Deathstalkers" or "Scout this much of the map" or "find the enemy base ASAP", in order to find out what kind of easy to learn things helped them get better at the game.

While I didn't read the Ray Interference paper, I think its point was that if the same weights are used for multiple skills, then updating the weights for one of the skills might reduce performance on the other skills. AlphaStar would have this problem too. I guess by having "specialist" agents in the population you are ensuring that those agents don't suffer from as much ray interference, but your final general agent will still need to know all the skills and would suffer from ray interference (if it is actually a thing, again I haven't read the paper).

This sounds like one of those "as General Intelligences we find this easy but it's really hard to program".

Yup, sounds right to me.

Comment by rohinmshah on [AN #58] Mesa optimization: what it is, and why we should care · 2019-06-24T23:20:16.524Z · score: 2 (1 votes) · LW · GW

Fixed, thanks.

Comment by rohinmshah on [AN #58] Mesa optimization: what it is, and why we should care · 2019-06-24T23:17:42.940Z · score: 2 (1 votes) · LW · GW

Thanks!

Comment by rohinmshah on No, it's not The Incentives—it's you · 2019-06-24T17:55:16.248Z · score: 16 (5 votes) · LW · GW
Using peers in a field as a proxy for good vs. bad behavior doesn't make sense if the entire field is corrupt and destroying value.

This seems to imply that you think that the world would be better off without academia at all. Do you endorse that?

Perhaps you only mean that if the world would be better off without academia at all, and nearly everyone in it is net negative / destroying value, then no one could justify joining it. I can agree with the implication, but I disagree with the premise.

## [AN #58] Mesa optimization: what it is, and why we should care

2019-06-24T16:10:01.330Z · score: 49 (12 votes)

Comment by rohinmshah on The Hacker Learns to Trust · 2019-06-23T22:32:24.296Z · score: 4 (2 votes) · LW · GW

You're right, it's too harsh to claim that this is deceptive. That does seem more reasonable. I still think it isn't worth it given the harm to your ability to coordinate.

I was coming up with reasons that a nearsighted consequentialist (aka not worried about being manipulative) might use.

Sorry, I thought you were defending the decision. I'm currently only interested in decision-relevant aspects of this, which as far as I can tell means "how the decision should be made ex-ante", so I'm not going to speculate on nearsighted-consequentialist-reasons.

Comment by rohinmshah on The Hacker Learns to Trust · 2019-06-23T21:51:26.692Z · score: 4 (2 votes) · LW · GW

Agreed that this is a benefit of what actually happened, but I want to note that if you're banking on this ex ante, you're deciding not to cooperate with a group X because you want to publicly signal allegiance to group Y with the expectation that you will then switch to group X and take along some people from group Y.

This is deceptive, and it harms our ability to cooperate. It seems pretty obvious to me that we should not do that under normal circumstances.

(I really do only want to talk about what should be done ex ante, that seems like the only decision-relevant thing here.)

Comment by rohinmshah on No, it's not The Incentives—it's you · 2019-06-23T21:44:29.818Z · score: 8 (4 votes) · LW · GW

Agreed that it's related, and I do think it's part of the explanation.

I will go even further: while in that post the selection happens at the level of properties of individuals who participate in some culture, I'm claiming that the selection happens at the higher level of norms of behavior in the culture, because most people are imitating the rest of the culture.

This requires even fewer misaligned individuals. Under the model where you select on individuals, you would still need a fairly large number of people to have the property of interest -- if only 1% of salesmen had the personality traits leading to them being scammy and the other 99% were usually honest about the product, the scammy salesmen probably wouldn't be able to capture all of the sales jobs. However, if most people imitate, then those 1% of salesmen will slowly push the norms towards being more scammy over generations, and you'd end up in the equilibrium where nearly every salesman is scammy.
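To make the imitation argument above concrete, here is a toy model. It is purely an illustrative assumption and not something the comment specifies: each generation, newcomers copy an existing salesman, and scammy salesmen are copied slightly more often because they look more successful. The 10% `advantage` and the function names are hypothetical.

```python
def next_fraction(p, advantage):
    """One generation of success-weighted imitation.

    A newcomer copies a scammy role model with weight p * (1 + advantage)
    and an honest one with weight (1 - p); the result is the new scammy share.
    """
    scammy_weight = p * (1.0 + advantage)
    honest_weight = 1.0 - p
    return scammy_weight / (scammy_weight + honest_weight)


def simulate(p0=0.01, advantage=0.1, generations=200):
    """Track the scammy fraction over time, starting from a 1% minority."""
    history = [p0]
    for _ in range(generations):
        history.append(next_fraction(history[-1], advantage))
    return history


history = simulate()
for generation in (0, 50, 100, 200):
    print(generation, round(history[generation], 3))
```

In this sketch the scammy share only ever grows, and a 1% starting minority ends up as nearly the whole population after enough generations, even though nobody beyond the original 1% chose the behavior for its own sake. With `advantage=0` the share stays at 1%, so the conclusion depends entirely on the assumption that scammy behavior looks more worth imitating.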

Come to think of it, I think I would estimate that ~1% of academics are explicitly thinking about how to further their own career at the cost of science (in ways that are different from imitation).

Comment by rohinmshah on Risks from Learned Optimization: Introduction · 2019-06-23T20:51:19.116Z · score: 10 (6 votes) · LW · GW
More formally, we can operationalize the behavioral objective as the objective recovered from perfect inverse reinforcement learning (IRL).

Just want to note that I think this is extremely far from a formal definition. I don't know what perfect IRL would be. Does perfect IRL assume that the agent is perfectly optimal, or can it have biases? How do you determine what the action space is? How do you break ties between reward functions that are equally good on the training data?

I get that definitions are hard -- the main thing bothering me here is the "more formally" phrase, not the definition itself. This gives it a veneer of precision that it really doesn't have.

(I'm pedantic about this because similar implied false precision about the importance of utility functions confused me for half a year.)

Comment by rohinmshah on No, it's not The Incentives—it's you · 2019-06-23T16:23:54.801Z · score: 10 (4 votes) · LW · GW

The former is a statement about outcomes while the latter is a statement about intentions.

My model for how most academics end up following bad incentives is that they pick up the incentivized bad behaviors via imitation. Anyone who doesn't do this ends up doing poorly and won't make it in academia (and in any case such people are rare, imitation is the norm for humans in general). As part of imitation, people come up with explanations for why the behavior is necessary and good for them to do. (And this is also usually the right thing to do; if you are imitating a good behavior, it makes sense to figure out why it is good, so that you can use that underlying explanation to reason about what other behaviors are good.)

I think that I personally am engaging in bad behaviors because I incorrectly expect that they are necessary for some goal (e.g. publishing papers to build academic credibility). I just can't tell which ones really are necessary and which ones aren't.

Comment by rohinmshah on No, it's not The Incentives—it's you · 2019-06-23T16:08:21.564Z · score: 2 (1 votes) · LW · GW
And how many if you didn't intervene?

Significantly more, maybe 20. To do a proper estimate I'd need to know which field we're considering, what the base rates are, etc. The thing I should have said was that I expect it makes it ~10x less likely that you become a professor; that seems more robust to the choice of field and isn't conditional on base rates that I don't know.

The Internet suggests a base rate of 3-5%, which means without intervention 3-5 of them would become professors; if that's true I would say that with intervention an expected 0.4 of them would become professors.
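The arithmetic behind these numbers, as a small sketch. The cohort of 100 is an assumption inferred from "3-5%" turning into "3-5 of them", and reading 0.4 as the midpoint of the resulting range is my interpretation; the constant and function names are illustrative.

```python
COHORT = 100                # assumed cohort size, inferred from the text
BASE_RATE_PERCENT = (3, 5)  # low and high base-rate estimates, in percent
REDUCTION_FACTOR = 10       # "~10x less likely" to become a professor


def expected_professors(cohort, rate_percent, reduction=1):
    """Expected number of professors given a base rate and a reduction factor."""
    return cohort * rate_percent / 100 / reduction


without_intervention = [expected_professors(COHORT, r) for r in BASE_RATE_PERCENT]
with_intervention = [
    expected_professors(COHORT, r, REDUCTION_FACTOR) for r in BASE_RATE_PERCENT
]
midpoint = sum(with_intervention) / len(with_intervention)

print("without intervention:", without_intervention)
print("with intervention:", with_intervention)
print("midpoint with intervention:", round(midpoint, 2))
```

A 10x reduction applied to 3-5 expected professors gives a range of roughly 0.3 to 0.5, whose midpoint is the 0.4 quoted above.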

How do you reconcile this with the immediately prior sentence?

I didn't mean that it was literally impossible for a person who doesn't follow the incentives to get into academia, I meant that it was much less likely. I do in fact know people in academia who I think are reasonably good at not following bad incentives.

Comment by rohinmshah on The Hacker Learns to Trust · 2019-06-23T15:54:52.542Z · score: 4 (2 votes) · LW · GW

As I mentioned above, it's always possible to publicly post after you've come to the decision privately.

Comment by rohinmshah on The Hacker Learns to Trust · 2019-06-22T20:13:05.392Z · score: 2 (1 votes) · LW · GW
I also think it would’ve been quite reasonable to not expect any response from a big organisation like OpenAI, and to be doing it only out of courtesy.

Yeah, that seems reasonable, but it doesn't seem like you could reasonably have 99% confidence in this.

It seems from above that talking to OpenAI didn’t change Connor’s mind, and that public discourse was very useful. I expect Buck would not have talked to him if he hadn’t done this publicly (I will ask Buck when I see him).

I agree with this, but it's ex-post reasoning; I don't think this was predictable with enough certainty ex-ante.

Given the OP I don’t think it would’ve been able to resolve privately, but if it had I think I’d be less happy than with what actually happened, which is someone publicly deciding to not unilaterally break an important new norm, even while they strongly believe this particular application of the norm is redundant/unhelpful.

It's always possible to publicly post after you've come to the decision privately. (Also, I'm really only talking about what should have been done ex-ante, not ex-post.)

I’d be interested to know if you think that it would’ve been perfectly pro-social to give OpenAI a week’s heads-up and then writing your reasoning publicly and reading everyone else’s critiques (100% of random people from Hacker News and Twitter and longer chats with Buck). I have a sense that you wouldn’t but I’m not fully sure why.

That seems fine, and very close to what I would have gone with myself. Maybe I would have first emailed OpenAI, and if I hadn't gotten a response in 2-3 days, then said I would make it public if I didn't hear back in another 2-3 days. (This is all assuming I don't know anyone at OpenAI, to put myself in the author's position.)

Comment by rohinmshah on No, it's not The Incentives—it's you · 2019-06-22T19:04:27.235Z · score: 15 (6 votes) · LW · GW

It is probably correct that each individual instance of having to deal with bad incentives doesn't make that much of a difference, but there are many such instances. Probably there's an 80-20 thing to do here where you get 80% of the benefit by not following the worst 20% of bad incentives, but it's actually quite hard to identify these, and it requires you to be able to predict the consequences of not following the bad incentives, which is really hard to do. (I don't think I could do it, and I've been in a PhD program for 5 years now.)

To be clear: if you know that someone explicitly and intentionally committed fraud for personal gain with the knowledge that it would result in bad science, that seems fine to punish. But this is rare, and it's easy to mistake well-intentioned mistakes for intentional fraud.

Comment by rohinmshah on The Hacker Learns to Trust · 2019-06-22T18:31:16.005Z · score: 5 (3 votes) · LW · GW
On reading that I was genuinely delighted to see such pro-social and cooperative behaviour from the person who believed OpenAI was wrong.

I think the pro-social and cooperative thing to do was to email OpenAI privately rather than issuing a public ultimatum.

Comment by rohinmshah on [AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming · 2019-06-21T23:04:26.697Z · score: 6 (3 votes) · LW · GW

For the last month or two, I've been too busy to get a newsletter out every week. It is still happening, just not on any consistent schedule at the moment.

Comment by rohinmshah on [AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming · 2019-06-21T18:42:31.211Z · score: 7 (3 votes) · LW · GW
Consider a corrupt state in which the human's brain has been somehow scrambled to make em give high rewards. Do you think such a state should be explored?

I agree that state shouldn't be explored.

Maybe your complaint is that in the real world corruption is continuous rather than binary, and the advisor avoids most of corruption but not all of it and not with 100% success probability.

That seems closer to my objection but not exactly it.

Indeed, the algorithm deals with corruption by never letting the agent go there.

For states that cause existential catastrophes this seems obviously desirable. Maybe my objection is more that with this sort of algorithm you need to have the right set of hypotheses in the first place, and that seems like the main difficulty?

Maybe I'm also saying that this feels vulnerable to nearest unblocked strategies. Suppose the AI has learned that its reward function is to maximize paperclips, and the advisor doesn't realize that a complicated gadget the AI has built is a self-replicating nanorobot that will autonomously convert atoms into paperclips. It doesn't seem like DRL saves us here.

Maybe another way of putting it -- is there additional safety conferred by this approach that you couldn't get by having a human review all of the AI's actions? If so, should I think of this as "we want a human to review actions, but that's expensive, DRL is a way to make it more sample efficient"?

Comment by rohinmshah on Let's talk about "Convergent Rationality" · 2019-06-15T01:17:17.003Z · score: 2 (1 votes) · LW · GW

I guess my position is that CRT is only true to the extent that you build a goal-directed agent. (Technically, the inner optimizers argument is one way that CRT could be true even without building an explicitly goal-directed agent, but it seems like you view CRT as broader and more likely than inner optimizers, and I'm not sure how.)

Maybe another way to get at the underlying misunderstanding: do you see a difference between "convergent rationality" and "convergent goal-directedness"? If so, what is it? From what you've written they sound equivalent to me.

Comment by rohinmshah on Let's talk about "Convergent Rationality" · 2019-06-14T15:38:24.822Z · score: 3 (2 votes) · LW · GW
The main counter-arguments arise from VNMUT, which can be interpreted as saying "rational agents are more fit" (in an evolutionary sense).

While I generally agree with CRT as applied to advanced agents, the VNM theorem is not the reason why, because it is vacuous. I agree with steve that the real argument for it is that humans are more likely to build goal-directed agents because that's the only way we know how to get AI systems that do what we want. But we totally could build non-goal-directed agents that CRT doesn't apply to, e.g. Google Maps.

Comment by rohinmshah on Conclusion to the sequence on value learning · 2019-06-12T06:56:13.018Z · score: 2 (1 votes) · LW · GW

It sounds to me like you're requiring "superintelligent" to include "has a goal" as part of the definition. If that's part of the definition, then I would rephrase my point as "why do we have to build something superintelligent? Let's instead build something that doesn't have a goal but is still useful, like an AI system that follows norms."

Comment by rohinmshah on AGI will drastically increase economies of scale · 2019-06-11T06:08:00.929Z · score: 2 (1 votes) · LW · GW

In that case this model would only hold if governments:

• Actually think through the long-term implications of AI
• Have enough certainty in this argument to actually act upon it

Notably, there aren't any feedback loops for the thing-being-competed-on, and so natural-selection style optimization doesn't happen. This makes me much less likely to believe in arguments of the form "The thing-being-competed-on will have a high value, because there is competition" -- the mechanism that usually makes that true is natural selection or some equivalent.

Comment by rohinmshah on AGI will drastically increase economies of scale · 2019-06-11T06:00:43.549Z · score: 2 (1 votes) · LW · GW
Why? Each division can still have separate profit-loss accounting, so you can decide to shut one down if it starts making losses, and the benefits of having that division to the rest of the company doesn't outweigh the losses. The latter may be somewhat tricky to judge though. Perhaps that's what you meant?

That's a good point. I was imagining that each division ends up becoming a monopoly in its particular area due to the benefits of within-firm coordination, which means that even if the division is inefficient there isn't an alternative that the firm can go with. But that was an assumption, and I'm not sure it would actually hold.

Comment by rohinmshah on AGI will drastically increase economies of scale · 2019-06-11T01:03:53.503Z · score: 5 (2 votes) · LW · GW

Okay, I see, that makes sense and seems plausible, though I'd bet against it happening. But you've convinced me that I should qualify that sentence more.

Comment by rohinmshah on AGI will drastically increase economies of scale · 2019-06-11T01:00:27.740Z · score: 2 (1 votes) · LW · GW
If companies had fully aligned workers and managers, they could adopt what Robin Hanson calls the "divisions" model where each division works just like a separate company except that there is an overall CEO that "looks for rare chances to gain value by coordinating division activities"

Once you switch to the "divisions" model your divisions are no longer competing with other firms, and all the divisions live or die as a group. So you're giving up the optimization that you could get via observing which companies succeed / fail at division-level tasks. I'm not sure how big this effect is, though I'd guess it's small.

While searching for that post, I also came across Firm Inefficiency which like Moral Mazes (but much more concisely) lists many inefficiencies that seem all or mostly related to value differences.

Yeah, I'm more convinced now that principal-agent issues are significantly larger than other issues.

I think it's at least one of the main arguments that Eric Drexler makes, since he wrote this in his abstract

Yeah, I agree it's an argument against that argument from Eric. I forgot that Eric makes that point (mainly because I have never been very convinced by it).

Yeah I'm not very familiar with this either, but my understanding is that such mergers are only illegal if the effect "may be substantially to lessen competition" or "tend to create a monopoly", which technically (it seems to me) isn't the case when existing monopolies in different industries merge.

My guess would be that the spirit of the law would apply, and that would be enough, but really I'd want to ask a social scientist or lawyer.

Comment by rohinmshah on AGI will drastically increase economies of scale · 2019-06-11T00:38:54.149Z · score: 2 (1 votes) · LW · GW
Does that seem right to you, or do you see things turn out a different way (in the long run)?

I agree that direct military competition would create such a pressure.

I'm not sure that absent that there actually is competition between countries -- what are they even competing on? You're reasoning as though they compete on economic efficiency, but what causes countries with lower economic efficiency to vanish? Perhaps in countries with lower economic efficiency, voters tend to put in a new government -- but in that case it seems like really the competition between countries is on "what pleases voters", which may not be exactly what we want but it probably isn't too risky if we have an AGI-fueled government that's intent-aligned with "what pleases voters".

(It's possible that you get politicians who look like they're trying to please voters but once they have enough power they then serve their own interests, but this looks like "the government gains power, and the people no longer have effective control over government".)

## [AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming

2019-06-05T23:20:01.202Z · score: 28 (9 votes)

## [AN #56] Should ML researchers stop running experiments before making hypotheses?

2019-05-21T02:20:01.765Z · score: 22 (6 votes)

## [AN #55] Regulatory markets and international standards as a means of ensuring beneficial AI

2019-05-05T02:20:01.030Z · score: 17 (5 votes)

## [AN #54] Boxing a finite-horizon AI system to keep it unambitious

2019-04-28T05:20:01.179Z · score: 21 (6 votes)

## Alignment Newsletter #53

2019-04-18T17:20:02.571Z · score: 22 (6 votes)

## Alignment Newsletter One Year Retrospective

2019-04-10T06:58:58.588Z · score: 93 (27 votes)

## Alignment Newsletter #52

2019-04-06T01:20:02.232Z · score: 20 (5 votes)

## Alignment Newsletter #51

2019-04-03T04:10:01.325Z · score: 28 (5 votes)

## Alignment Newsletter #50

2019-03-28T18:10:01.264Z · score: 16 (3 votes)

## Alignment Newsletter #49

2019-03-20T04:20:01.333Z · score: 26 (8 votes)

## Alignment Newsletter #48

2019-03-11T21:10:02.312Z · score: 31 (13 votes)

## Alignment Newsletter #47

2019-03-04T04:30:11.524Z · score: 21 (5 votes)

## Alignment Newsletter #46

2019-02-22T00:10:04.376Z · score: 18 (8 votes)

## Alignment Newsletter #45

2019-02-14T02:10:01.155Z · score: 26 (8 votes)

## Learning preferences by looking at the world

2019-02-12T22:25:16.905Z · score: 47 (13 votes)

## Alignment Newsletter #44

2019-02-06T08:30:01.424Z · score: 20 (6 votes)

## Conclusion to the sequence on value learning

2019-02-03T21:05:11.631Z · score: 48 (11 votes)

## Alignment Newsletter #43

2019-01-29T21:10:02.373Z · score: 15 (5 votes)

## Future directions for narrow value learning

2019-01-26T02:36:51.532Z · score: 12 (5 votes)

## The human side of interaction

2019-01-24T10:14:33.906Z · score: 16 (4 votes)

## Alignment Newsletter #42

2019-01-22T02:00:02.082Z · score: 21 (7 votes)

## Following human norms

2019-01-20T23:59:16.742Z · score: 27 (10 votes)

## Reward uncertainty

2019-01-19T02:16:05.194Z · score: 20 (6 votes)

## Alignment Newsletter #41

2019-01-17T08:10:01.958Z · score: 23 (4 votes)

## Human-AI Interaction

2019-01-15T01:57:15.558Z · score: 26 (7 votes)

## What is narrow value learning?

2019-01-10T07:05:29.652Z · score: 20 (8 votes)

## Alignment Newsletter #40

2019-01-08T20:10:03.445Z · score: 21 (4 votes)

## Reframing Superintelligence: Comprehensive AI Services as General Intelligence

2019-01-08T07:12:29.534Z · score: 91 (35 votes)

## AI safety without goal-directed behavior

2019-01-07T07:48:18.705Z · score: 48 (14 votes)

## Will humans build goal-directed agents?

2019-01-05T01:33:36.548Z · score: 41 (11 votes)

## Alignment Newsletter #39

2019-01-01T08:10:01.379Z · score: 33 (10 votes)

## Alignment Newsletter #38

2018-12-25T16:10:01.289Z · score: 9 (4 votes)

## Alignment Newsletter #37

2018-12-17T19:10:01.774Z · score: 26 (7 votes)

## Alignment Newsletter #36

2018-12-12T01:10:01.398Z · score: 22 (6 votes)

## Alignment Newsletter #35

2018-12-04T01:10:01.209Z · score: 15 (3 votes)

## Coherence arguments do not imply goal-directed behavior

2018-12-03T03:26:03.563Z · score: 64 (21 votes)

## Intuitions about goal-directed behavior

2018-12-01T04:25:46.560Z · score: 32 (12 votes)

## Alignment Newsletter #34

2018-11-26T23:10:03.388Z · score: 26 (5 votes)

## Alignment Newsletter #33

2018-11-19T17:20:03.463Z · score: 25 (7 votes)

## Alignment Newsletter #32

2018-11-12T17:20:03.572Z · score: 20 (4 votes)

## Future directions for ambitious value learning

2018-11-11T15:53:52.888Z · score: 44 (11 votes)

## Alignment Newsletter #31

2018-11-05T23:50:02.432Z · score: 19 (3 votes)

## What is ambitious value learning?

2018-11-01T16:20:27.865Z · score: 44 (13 votes)

## Preface to the sequence on value learning

2018-10-30T22:04:16.196Z · score: 65 (26 votes)

## Alignment Newsletter #30

2018-10-29T16:10:02.051Z · score: 31 (13 votes)

## Alignment Newsletter #29

2018-10-22T16:20:01.728Z · score: 16 (5 votes)

## Alignment Newsletter #28

2018-10-15T21:20:11.587Z · score: 11 (5 votes)