[AN #93]: The Precipice we’re standing at, and how we can back away from it 2020-04-01T17:10:01.987Z · score: 25 (6 votes)
[AN #92]: Learning good representations with contrastive predictive coding 2020-03-25T17:20:02.043Z · score: 19 (7 votes)
[AN #91]: Concepts, implementations, problems, and a benchmark for impact measurement 2020-03-18T17:10:02.205Z · score: 16 (5 votes)
[AN #90]: How search landscapes can contain self-reinforcing feedback loops 2020-03-11T17:30:01.919Z · score: 12 (4 votes)
[AN #89]: A unifying formalism for preference learning algorithms 2020-03-04T18:20:01.393Z · score: 17 (5 votes)
[AN #88]: How the principal-agent literature relates to AI risk 2020-02-27T09:10:02.018Z · score: 20 (6 votes)
[AN #87]: What might happen as deep learning scales even further? 2020-02-19T18:20:01.664Z · score: 29 (10 votes)
[AN #86]: Improving debate and factored cognition through human experiments 2020-02-12T18:10:02.213Z · score: 16 (6 votes)
[AN #85]: The normative questions we should be asking for AI alignment, and a surprisingly good chatbot 2020-02-05T18:20:02.138Z · score: 16 (6 votes)
[AN #84] Reviewing AI alignment work in 2018-19 2020-01-29T18:30:01.738Z · score: 24 (10 votes)
AI Alignment 2018-19 Review 2020-01-28T02:19:52.782Z · score: 140 (39 votes)
[AN #83]: Sample-efficient deep learning with ReMixMatch 2020-01-22T18:10:01.483Z · score: 16 (7 votes)
rohinmshah's Shortform 2020-01-18T23:21:02.302Z · score: 14 (3 votes)
[AN #82]: How OpenAI Five distributed their training computation 2020-01-15T18:20:01.270Z · score: 20 (6 votes)
[AN #81]: Universality as a potential solution to conceptual difficulties in intent alignment 2020-01-08T18:00:01.566Z · score: 22 (8 votes)
[AN #80]: Why AI risk might be solved without additional intervention from longtermists 2020-01-02T18:20:01.686Z · score: 34 (16 votes)
[AN #79]: Recursive reward modeling as an alignment technique integrated with deep RL 2020-01-01T18:00:01.839Z · score: 12 (5 votes)
[AN #78] Formalizing power and instrumental convergence, and the end-of-year AI safety charity comparison 2019-12-26T01:10:01.626Z · score: 26 (7 votes)
[AN #77]: Double descent: a unification of statistical theory and modern ML practice 2019-12-18T18:30:01.862Z · score: 21 (8 votes)
[AN #76]: How dataset size affects robustness, and benchmarking safe exploration by measuring constraint violations 2019-12-04T18:10:01.739Z · score: 14 (6 votes)
[AN #75]: Solving Atari and Go with learned game models, and thoughts from a MIRI employee 2019-11-27T18:10:01.332Z · score: 39 (11 votes)
[AN #74]: Separating beneficial AI into competence, alignment, and coping with impacts 2019-11-20T18:20:01.647Z · score: 19 (7 votes)
[AN #73]: Detecting catastrophic failures by learning how agents tend to break 2019-11-13T18:10:01.544Z · score: 11 (4 votes)
[AN #72]: Alignment, robustness, methodology, and system building as research priorities for AI safety 2019-11-06T18:10:01.604Z · score: 28 (7 votes)
[AN #71]: Avoiding reward tampering through current-RF optimization 2019-10-30T17:10:02.211Z · score: 13 (5 votes)
[AN #70]: Agents that help humans who are still learning about their own preferences 2019-10-23T17:10:02.102Z · score: 18 (6 votes)
Human-AI Collaboration 2019-10-22T06:32:20.910Z · score: 39 (13 votes)
[AN #69] Stuart Russell's new book on why we need to replace the standard model of AI 2019-10-19T00:30:01.642Z · score: 64 (21 votes)
[AN #68]: The attainable utility theory of impact 2019-10-14T17:00:01.424Z · score: 19 (5 votes)
[AN #67]: Creating environments in which to study inner alignment failures 2019-10-07T17:10:01.269Z · score: 17 (6 votes)
[AN #66]: Decomposing robustness into capability robustness and alignment robustness 2019-09-30T18:00:02.887Z · score: 12 (6 votes)
[AN #65]: Learning useful skills by watching humans “play” 2019-09-23T17:30:01.539Z · score: 12 (4 votes)
[AN #64]: Using Deep RL and Reward Uncertainty to Incentivize Preference Learning 2019-09-16T17:10:02.103Z · score: 11 (5 votes)
[AN #63] How architecture search, meta learning, and environment design could lead to general intelligence 2019-09-10T19:10:01.174Z · score: 24 (8 votes)
[AN #62] Are adversarial examples caused by real but imperceptible features? 2019-08-22T17:10:01.959Z · score: 28 (11 votes)
Call for contributors to the Alignment Newsletter 2019-08-21T18:21:31.113Z · score: 39 (12 votes)
Clarifying some key hypotheses in AI alignment 2019-08-15T21:29:06.564Z · score: 72 (29 votes)
[AN #61] AI policy and governance, from two people in the field 2019-08-05T17:00:02.048Z · score: 11 (5 votes)
[AN #60] A new AI challenge: Minecraft agents that assist human players in creative mode 2019-07-22T17:00:01.759Z · score: 25 (10 votes)
[AN #59] How arguments for AI risk have changed over time 2019-07-08T17:20:01.998Z · score: 43 (9 votes)
Learning biases and rewards simultaneously 2019-07-06T01:45:49.651Z · score: 43 (12 votes)
[AN #58] Mesa optimization: what it is, and why we should care 2019-06-24T16:10:01.330Z · score: 50 (13 votes)
[AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming 2019-06-05T23:20:01.202Z · score: 28 (9 votes)
[AN #56] Should ML researchers stop running experiments before making hypotheses? 2019-05-21T02:20:01.765Z · score: 22 (6 votes)
[AN #55] Regulatory markets and international standards as a means of ensuring beneficial AI 2019-05-05T02:20:01.030Z · score: 18 (6 votes)
[AN #54] Boxing a finite-horizon AI system to keep it unambitious 2019-04-28T05:20:01.179Z · score: 21 (6 votes)
Alignment Newsletter #53 2019-04-18T17:20:02.571Z · score: 22 (6 votes)
Alignment Newsletter One Year Retrospective 2019-04-10T06:58:58.588Z · score: 93 (27 votes)
Alignment Newsletter #52 2019-04-06T01:20:02.232Z · score: 20 (5 votes)
Alignment Newsletter #51 2019-04-03T04:10:01.325Z · score: 28 (5 votes)


Comment by rohinmshah on Rohin Shah on reasons for AI optimism · 2020-04-03T02:10:23.498Z · score: 2 (1 votes) · LW · GW

Condition 2 is necessary for race dynamics to arise, which is what people are usually worried about.

Suppose that AI systems weren't going to be useful for anything -- the only effect of AI systems was that they posed an x-risk to the world. Then it would still be true that "neither side wants to do the thing, because if they do the thing they get destroyed too".

Nonetheless, I think that in this world no one ever builds AI systems, and so we don't need to worry about x-risk.

Comment by rohinmshah on Rohin Shah on reasons for AI optimism · 2020-04-03T02:07:00.777Z · score: 3 (2 votes) · LW · GW
I guess I was just suggesting that your comments there, taken by themselves/out of context, seemed to ignore those important arguments, and thus might seem overly optimistic.

Sure, that seems reasonable.

Is this for existential risk from AI as a whole, or just "adversarial optimisation"/"misalignment" type scenarios?

Just adversarial optimization / misalignment. See the comment thread with Wei Dai below, especially this comment.

Like, for you, "there's no action from longtermists" would be a specific constraint you have to add to your world model?

Oh yeah, definitely. (Toby does the same in The Precipice; his position is that it's clearer not to condition on anything, because it's usually unclear what exactly you are conditioning on, though in person he did like the operationalization of "without action from longtermists".)

Like, my model of the world is that for any sufficiently important decision like the development of powerful AI systems, there are lots of humans bringing many perspectives to the table, which usually ends up with most considerations being brought up by someone, and an overall high level of risk aversion. On this model, longtermists are one of the many groups that argue for being more careful than we otherwise would be.

I imagine you could also condition on something like "surprisingly much action from longtermists", which would reduce your estimated risk further?

Yeah, presumably. The 1 in 20 number was very made up, even more so than the 1 in 10 number. I suppose if our actions were very successful, I could see us getting down to 1 in 1000? But if we just exerted a lot more effort (i.e. "surprisingly much action"), the extra effort probably doesn't help much more than the initial effort, so maybe... 1 in 25? 1 in 30?

(All of this is very anchored on the initial 1 in 10 number.)

Comment by rohinmshah on Rohin Shah on reasons for AI optimism · 2020-04-02T17:27:58.986Z · score: 3 (2 votes) · LW · GW
So it seems like, unless we expect the relevant actors to act in accordance with something close to impartial altruism, we should expect them to act somewhat to avoid existential risks (or extinction specifically), but far less than they really should. (Roughly this argument is made in The Precipice, and I believe by 80k.)

I agree that actors will focus on x-risk far less than they "should" -- that's exactly why I work on AI alignment! This doesn't mean that x-risk is high in an absolute sense, just higher than it "should" be from an altruistic perspective. Presumably from an altruistic perspective x-risk should be very low (certainly below 1%), so my 10% estimate is orders of magnitude higher than what it "should" be.

Also, re: Precipice, it's worth noting that Toby and I don't disagree much -- I estimate 1 in 10 conditioned on no action from longtermists; he estimates 1 in 5 conditioned on AGI being developed this century. Let's say that action from longtermists can halve the risk; then my unconditional estimate would be 1 in 20, and would be very slightly higher if we condition on AGI being developed this century (because we'd have less time to prepare), so overall there's a 4x difference, which given the huge uncertainty is really not very much.
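The back-of-envelope comparison above can be made explicit. This is only a sketch of the stated reasoning; the "halving" factor is the assumption made in the comment, and all numbers are order-of-magnitude guesses:

```python
# Rough sketch of the risk comparison above. All inputs are
# order-of-magnitude guesses, as stated in the text.
my_risk_no_longtermist_action = 1 / 10  # conditional on no action from longtermists
longtermist_effect = 0.5                # assume longtermist action halves the risk

# My unconditional estimate: 1 in 20
my_unconditional_risk = my_risk_no_longtermist_action * longtermist_effect

# Toby's estimate, conditional on AGI being developed this century
toby_risk = 1 / 5

ratio = toby_risk / my_unconditional_risk
print(round(ratio))  # 4 -- a 4x difference, small given the huge uncertainty
```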

Comment by rohinmshah on Rohin Shah on reasons for AI optimism · 2020-04-02T17:18:19.835Z · score: 2 (1 votes) · LW · GW

MAD-style strategies happen when:

1. There are two (or more) actors that are in competition with each other

2. There is a technology such that if one actor deploys it and the other actor doesn't, the first actor remains the same and the second actor is "destroyed".

3. If both actors deploy the technology, then both actors are "destroyed".

(I just made these up right now; you could probably get better versions from papers about MAD.)

Condition 2 doesn't hold for accident risk from AI: if any actor deploys an unaligned AI, then both actors are destroyed.

I agree I didn't explain this well in the interview -- when I said

if the destruction happens, that affects you too

I should have said something like

if you deploy a dangerous AI system, that affects you too

which is not true for nuclear weapons (deploying a nuke doesn't affect you in and of itself).

Comment by rohinmshah on Rohin Shah on reasons for AI optimism · 2020-04-02T17:03:24.724Z · score: 3 (2 votes) · LW · GW

Yeah I think I was probably wrong about this (including what other people were talking about when they said "nuclear arms race").

Comment by rohinmshah on Thinking About Filtered Evidence Is (Very!) Hard · 2020-04-02T17:02:31.218Z · score: 2 (1 votes) · LW · GW
the Bayesian notion of belief doesn't allow us to make the distinction you are pointing to

Sure, that seems reasonable. I guess I saw this as the point of a lot of MIRI's past work, and was expecting this to be about honesty / filtered evidence somehow.

I also think this result has nothing to do with "you can't have a perfect model of Carol". Part of the point of my assumptions is that they are, individually, quite compatible with having a perfect model of Carol amongst the hypotheses.

I think we mean different things by "perfect model". What if I instead say "you can't perfectly update on X and Carol-said-X, because you can't know why Carol said X, because that could in the worst case require you to know everything that Carol will say in the future"?

Comment by rohinmshah on Thinking About Filtered Evidence Is (Very!) Hard · 2020-04-01T19:36:59.578Z · score: 4 (2 votes) · LW · GW

Yeah, I feel like while honesty is needed to prove the impossibility result, the problem arose with the assumption that the agent could effectively reason now about all the outputs of a recursively enumerable process (regardless of honesty). Like, the way I would phrase this point is "you can't perfectly update on X and Carol-said-X, because you can't have a perfect model of Carol"; this applies whether or not Carol is honest. (See also this comment.)

Comment by rohinmshah on Solipsism is Underrated · 2020-03-30T17:48:57.791Z · score: 2 (1 votes) · LW · GW

This update seems like it would be extraordinarily small, given our poor understanding of the brain, and the relatively small amount of concerted effort that goes into understanding consciousness.

Comment by rohinmshah on Thinking About Filtered Evidence Is (Very!) Hard · 2020-03-30T07:01:58.318Z · score: 2 (1 votes) · LW · GW

I still don't get it but probably not worth digging further. My current confusion is that even under the behaviorist interpretation, it seems like just believing condition 2 implies knowing all the things Carol would ever say (or Alice has a mistaken belief). Probably this is a confusion that would go away with enough formalization / math, but it doesn't seem worth doing that.

Comment by rohinmshah on Solipsism is Underrated · 2020-03-28T19:52:47.135Z · score: 3 (2 votes) · LW · GW
But using either interpretation, how puzzling is the view, that the activity of these little material things somehow is responsible for conscious qualia? This is where a lot of critical thinking has led many people to say things like “consciousness must be what an algorithm implemented on a physical machine feels like from the ‘inside.’” And this is a decent hypothesis, but not an explanatory one at all. The emergence of consciousness and qualia is just something that materialists need to accept as a spooky phenomenon. It's not a very satisfying solution to the hard problem of consciousness.

"lack of a satisfying explanatory solution" does not imply low likelihood if you think that the explanatory solution exists but is computationally hard to find (which in fact seems pretty reasonable).

Like, the same structure of argument could be used to argue that computers are extremely low likelihood -- how puzzling is the view, that the activity of electrons moving around somehow is responsible for proving mathematical theorems?

With laptops, we of course have a good explanation of how computation arises from electrons, but that's because we designed them -- it would probably be much harder if we had no knowledge of laptops or even electricity and then were handed a laptop and asked to explain how it could reliably produce true mathematical theorems. This seems pretty analogous to the situation we find ourselves in with consciousness.

Comment by rohinmshah on When to assume neural networks can solve a problem · 2020-03-28T17:03:40.515Z · score: 10 (4 votes) · LW · GW
"Human Compatible" is making basically the same points as "Superintelligence," only in a dumbed-down and streamlined manner, with lots of present-day examples to illustrate.

I do not agree with this. I think the arguments in Human Compatible are more convincing than the ones in Superintelligence (mostly because they make fewer questionable assumptions).

(I agree that Stuart probably does agree somewhat with the "Bostromian position".)

Comment by rohinmshah on Deconfusing Human Values Research Agenda v1 · 2020-03-28T05:53:47.140Z · score: 2 (1 votes) · LW · GW
Some examples of actions taken by dictators that I think were well intentioned and meant to further goals that seemed laudable and not about power grabbing to the dictator but had net negative outcomes for the people involved and the world:

What's your model for why those actions weren't undone?

To pop back up to the original question -- if you think making your friend 10x more intelligent would be net negative, would you make them 10x dumber? Or perhaps it's only good to make them 2x smarter, but after that more marginal intelligence is bad?

It would be really shocking if we were at the optimal absolute level of intelligence, so I assume that you think we're at the optimal relative level of intelligence, that is, the best situation is when your friends are about as intelligent as you are. In that case, let's suppose that we increase/decrease all of your friends and your intelligence by a factor of X. For what range of X would you expect this intervention is net positive?

(I'm aware that intelligence is not one-dimensional, but I feel like this is still a mostly meaningful question.)

Just to be clear about my own position, a well intentioned superintelligent AI system totally could make mistakes. However, it seems pretty unlikely that they'd be of the existentially-catastrophic kind. Also, the mistake could be net negative, but the AI system overall should be net positive.

Comment by rohinmshah on An Analytic Perspective on AI Alignment · 2020-03-28T05:32:00.185Z · score: 4 (2 votes) · LW · GW
Do you think I'm wrong?

No, which is why I want to stop using the example.

(The counterfactual I was thinking of was more like "imagine we handed a laptop to 19th-century scientists, can they mechanistically understand it?" But even that isn't a good analogy, it overstates the difficulty.)

Comment by rohinmshah on Alignment as Translation · 2020-03-28T00:02:08.436Z · score: 4 (2 votes) · LW · GW
Let me know if this analogy sounds representative of the strategies you imagine.

Yeah, it does. I definitely agree that this doesn't get around the chicken-and-egg problem, and so shouldn't be expected to succeed on the first try. It's more like you get to keep trying this strategy over and over again until you eventually succeed, because if everything goes wrong you just unplug the AI system and start over.

the chicken-and-egg problem is a ground truth problem. If we have enough data to estimate X to within 5%, then doing clever things with that data is not going to reduce that error any further.

I think you get "ground truth data" by trying stuff and seeing whether or not the AI system did what you wanted it to do.

(This does suggest that you wouldn't ever be able to ask your AI system to do something completely novel without having a human along to ensure it's what we actually meant, which seems wrong to me, but I can't articulate why.)

Comment by rohinmshah on Alignment as Translation · 2020-03-27T23:52:57.990Z · score: 4 (2 votes) · LW · GW

Yeah, this could be a way that things are. My intuition is that it wouldn't be this way, but I don't have any good arguments for it.

Comment by rohinmshah on An Analytic Perspective on AI Alignment · 2020-03-27T23:48:06.869Z · score: 2 (1 votes) · LW · GW

Yup, that seems like a pretty reasonable estimate to me.

Note that my default model for "what should be the input to estimate difficulty of mechanistic transparency" would be the number of parameters, not number of neurons. If a neuron works over a much larger input (leading to more parameters), wouldn't that make it harder to mechanistically understand?

Comment by rohinmshah on Alignment as Translation · 2020-03-27T20:44:13.250Z · score: 4 (2 votes) · LW · GW
Anyway, sounds like value-in-the-tail is a central crux here.

Seems somewhat right to me, subject to caveat below.

it's not a necessary condition - if the remaining 5% of problems are still existentially deadly and likely to come up eventually (but not often enough to be caught in testing), then risk isn't really decreased.

An important part of my intuition about value-in-the-tail is that if your first solution can knock off 95% of the risk, you can then use the resulting AI system to design a new AI system where you've translated better and now you've eliminated 99% of the risk, and iterating this process you get to effectively no ongoing risk. There is of course risk during the iteration, but that risk can be reasonably small.
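The compounding intuition here can be sketched numerically. The per-iteration reduction factor below is illustrative only (the text's "95% then 99%" figures are themselves made-up numbers):

```python
# Illustrative: if each iteration of "use the current AI system to build
# a better-translated successor" eliminates 95% of the remaining risk,
# the residual risk shrinks geometrically.
residual = 1.0
for iteration in range(1, 5):
    residual *= 0.05  # knock off 95% of what's left
    print(iteration, round(residual, 10))
# residual goes ~0.05, ~0.0025, ~1.25e-05 per-iteration, approaching
# "effectively no ongoing risk" after a handful of iterations
```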

A similar argument applies to economic competitiveness: yes, your first agent is pretty slow relative to what it could be, but you can make it faster and faster over time, so you only lose a lot of value during the first few initial phases.

(For the economic value part, this is mostly based on industry experience trying to automate things.)

I have the same intuition, and strongly agree that usually most of the value is in the long tail. The hope is mostly that you can actually keep making progress on the tail as time goes on, especially with the help of your newly built AI systems.

Comment by rohinmshah on Alignment as Translation · 2020-03-27T20:30:25.185Z · score: 4 (2 votes) · LW · GW

I did see that answer and pretty strongly agree with it, the "low-level structure" part of my summary was meant to be an example, not a central case. To make this clearer, I changed

which could potentially be detailed accurate low-level simulations

to

which could be very alien to us (e.g. perhaps the AI system uses detailed low-level simulations)

Comment by rohinmshah on Deconfusing Human Values Research Agenda v1 · 2020-03-27T20:24:28.024Z · score: 4 (2 votes)
I expect it to be net negative.

Man, I do not share that intuition.

I'd be interested in specific examples of well-intentioned dictators that screwed things up (though I anticipate my objections will be that 1. they weren't well-intentioned or 2. they didn't have the power to actually impose decisions centrally, and had to spend most of their power ensuring that they remained in power).

I'm saying even with all that (or especially with all of that), the measurement gap is large and we should expect high deviance from the target that will readily lead to Goodharting.

I know you're saying that, I just don't see many arguments for it. From my perspective, you are asserting that Goodhart problems are robust, rather than arguing for it. That's fine, you can just call it an intuition you have, but to the extent you want to change my mind, restating it in different words is not very likely to work.

It's definitely harder.

This is an assertion, not an argument.

Do you really believe that you can predict facts about humans better just by reasoning about evolution (and using no information you've learned by looking at humans), relative to building a model by looking at humans (and using no information you've learned from the theory of evolution)? I suspect you actually mean some other thing, but idk what.

Comment by rohinmshah on Alignment as Translation · 2020-03-27T18:19:50.030Z · score: 4 (2 votes) · LW · GW

Planned summary for the Alignment Newsletter:

At a very high level, we can model powerful AI systems as moving closer and closer to omniscience. As we move in that direction, what becomes the new constraint on technology? This post argues that the constraint is _good interfaces_, that is, something that allows us to specify what the AI should do. As with most interfaces, the primary challenge is dealing with the discrepancy between the user's abstractions (how humans think about the world) and the AI system's abstractions, which could be very alien to us (e.g. perhaps the AI system uses detailed low-level simulations). The author believes that this is the central problem of AI alignment: how to translate between these abstractions in a way that accurately preserves meaning.
The post goes through a few ways that we could attempt to do this translation, but all of them seem to only reduce the amount of translation that is necessary: none of them solve the chicken-and-egg problem of how you do the very first translation between the abstractions.

Planned opinion:

I like this view on alignment, but I don't know if I would call it the _central_ problem of alignment. It sure seems important that the AI is _optimizing_ something: this is what prevents solutions like "make sure the AI has an undo button / off switch", which would be my preferred line of attack if the main source of AI risk were bad translations between abstractions. There's a longer discussion on this point here.

(I might change the opinion based on further replies to my other comment.)

Comment by rohinmshah on Thinking About Filtered Evidence Is (Very!) Hard · 2020-03-27T17:51:25.893Z · score: 11 (2 votes) · LW · GW

Yeah, I did find that reformulation clearer, but it also then seems to not be about filtered evidence?

Like, it seems like you need two conditions to get the impossibility result, now using English instead of math:

1. Alice believes Carol is always honest (at least with probability > 50%)

2. For any statement s: [if Carol will ever say s, Alice currently believes that Carol will eventually say s (at least with probability > 50%)]

It really seems like the difficulty here is with condition 2, not with condition 1, so I don't see how this theorem has anything to do with filtered evidence.

Maybe the point is just "you can't perfectly update on X and Carol-said-X, because you can't have a perfect model of them, because you aren't bigger than they are"?

(Probably you agree with this, given your comment.)

Comment by rohinmshah on Alignment as Translation · 2020-03-27T17:41:07.120Z · score: 4 (2 votes) · LW · GW
I think what's driving this intuition is that you're looking for ways to make the AI not dangerous, without actually aligning it (i.e. without solving the translation problem) - mainly by limiting capabilities.

Yup, that is definitely the intuition.

Taking a more outside view... restrictions like "make it slow and reversible" feel like patches which don't really address the underlying issues.


In general, I'd expect the underlying issues to continue to manifest themselves in other ways when patches are applied.

I mean, they continue to manifest in the normal sense, in that when you say "cure cancer", the AI system works on a plan to kill everyone; you just now get to stop the AI system from actually running that plan.

For instance, even with slow & reversible changes, it's still entirely plausible that humans don't stop something bad because they don't understand what's going on in enough detail - that's a typical scenario in the "translation problem" worldview.

Also, there's still the problem of translating a human's high-level notion of "reversible" into a low-level notion of "reversible".

simple solutions will be patches which don't address everything, there will be a long tail of complicated corner cases, etc.

All of this is true; I'm more arguing that slow & reversible eliminates ~95% of the problems, and so if it's easier to do than "full" alignment, then it probably becomes the best thing to do on the margin.

I don't think this is realistic if we want an economically-competitive AI. There are just too many real-world applications where we want things to happen which are fast and/or irreversible. In particular, the relevant notion of "slow" is roughly "a human has time to double-check", which immediately makes things very expensive.

I'd expect we'd be able to solve this over time, e.g. first you use your AI system for simple tasks which you can check quickly, then as you start trusting that you've worked out the bugs for those tasks, you let the AI do them faster / without oversight, and move on to more complicated tasks, etc.

(This is a much more testing + engineering based approach; the standard argument against such an approach is that it fails in the presence of optimization.)

It certainly does mean you take a hit to economic competitiveness, I mostly think the hit is not that large and is something we could pay.

Comment by rohinmshah on Deconfusing Human Values Research Agenda v1 · 2020-03-27T17:27:20.981Z · score: 2 (1 votes) · LW · GW
This is a reasonable hope but I generally think hope is dangerous when it comes to existential risks

When I say "hope", I mean "it is reasonably likely that the research we do pans out and leads to a knowably-aligned AI system", not "we will look at the AI system's behavior, pull a risk estimate out of nowhere, and then proceed to deploy it anyway".

In this sense, literally all AI risk research is based on hope, since no existing AI risk research knowably will lead to us building an aligned AI system.

I'm moved to pursue this line of research because I believe it to be neglected, I believe it's likely enough to be useful to building aligned AI to be worth pursuing, and I would rather us have explored it thoroughly and ended up not needing it than to have not explored it and end up needing it.

This is all reasonable; most of it can be said about most AI risk research. The main distinguishing feature between different kinds of technical AI risk research is:

it's likely enough to be useful to building aligned AI to be worth pursuing

So that's the part you'd have to argue for to convince me (but also it would be reasonable not to bother).

I would expect that if you felt more or less aligned with a superintelligent AI system the way you feel you are aligned with your friends, the AI system would optimize so hard that it would no longer be aligned

Suppose one of your friends became 10x more intelligent, or got a superpower where they could choose at will to stop time for everything except themselves and a laptop (that magically still has Internet access). Is this a net positive change to the world, or a net negative one?

Perhaps you think AI systems will be different in kind to your friends, in which case see next point.

I suspect that at the level of measurement you're talking about where you can infer alignment from observed behavior there is too much room for error between the measure and the target such that deviance is basically guaranteed.

Wait, I infer alignment from way more than just observed behavior. In the case of my friends, I have a model of how humans work in general, informed both by theory (e.g. evolutionary psychology) and empirical evidence (e.g. reasoning about how I would do X, and projecting it onto them). In the case of AI systems, I would want similar additional information beyond just their behavior, e.g. an understanding of what their training process incentivizes, running counterfactual queries on them early in training when they are still relatively unintelligent and I can understand them, etc.

I am perhaps more conservative here and want to make the gap between the measure and the target much smaller so that we can effectively get "under" Goodhart effects for the targets we care about by measuring and modeling the processes that generate those targets rather than the targets themselves.

It's not obvious to me that modeling the generators of a thing is easier than modeling the thing. E.g. It's much easier for me to model humans than to model evolution.

Comment by rohinmshah on Alignment as Translation · 2020-03-26T19:09:57.050Z · score: 4 (2 votes) · LW · GW
Tool AI can be plenty dangerous if it's capable of making large, fast, irreversible changes to the world, and the alignment problem is still hard for that sort of AI.

I definitely agree with that characterization. I think the solutions I would look for would be quite different though: they would look more like "how do I ensure that the AI system has an undo button" and "how do I ensure that the AI system does things slowly", similarly to how with nuclear power plants (I assume) there are (possibly redundant) mechanisms that ensure you can turn off the power plant.

Of course these solutions are also subject to the same translation problem, but it seems plausible to me that that translation problem is easier to solve, relative to solving translation in full generality.

AI-as-optimizer would suggest that even if the translation problem were solved for the particular things I mentioned, it still might not be enough, because e.g. the AI might deliberately prevent me from pressing the undo button.

You could say something like "an AI that can enact large irreversible changes might form a plan where the large irreversible change starts with disabling the undo button", but then it sort of feels like we're bringing back in the idea of optimization. Maybe that's fine, we're pretty confused about optimization anyway.

Comment by rohinmshah on AGI in a vulnerable world · 2020-03-26T16:18:23.883Z · score: 4 (2 votes) · LW · GW

Hmm, I find it plausible that on average p(build unaligned AGI | can build unaligned AGI) is about 0.01, which implies that unaligned AGI gets built once there are ~100 actors that can build AGI; that seems to fit the many-people-can-build-AGI scenario.

The 0.01 probability could happen because of regulations / laws, as you mention, but also if the world has sufficient common knowledge of the risks of unaligned AGI (which seems not implausible to me, perhaps because of warning shots, or because of our research, or because of natural human risk aversion).
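As a rough sanity check on those numbers (treating each capable actor as an independent 0.01 chance of building unaligned AGI, which is of course a simplification):

```python
# Rough sanity check: if each of n capable actors independently builds
# unaligned AGI with probability p = 0.01 (an illustrative simplification),
# the expected number of builders hits 1 at n = 100 actors.
p, n = 0.01, 100
expected_builders = n * p              # 1.0
p_at_least_one = 1 - (1 - p) ** n      # ~0.634: chance at least one actor builds it

assert abs(expected_builders - 1.0) < 1e-9
assert abs(p_at_least_one - 0.634) < 0.01
```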

Comment by rohinmshah on AGI in a vulnerable world · 2020-03-26T16:11:27.996Z · score: 2 (1 votes) · LW · GW

Agreed, and this also happens "for free" with openness norms, as the post suggests. I'm not strongly disagreeing with the overall thesis of the post, just the specific point that small teams can reproduce impressive results with far fewer resources.

Comment by rohinmshah on If I were a well-intentioned AI... IV: Mesa-optimising · 2020-03-26T05:00:42.301Z · score: 4 (2 votes) · LW · GW

Planned summary for the Alignment Newsletter:

This sequence takes on the perspective of an AI system that is well-intentioned, but lacking information about what humans want. The hope is to find what good AI reasoning might look like, and hopefully use this to derive insights for safety. The sequence considers Goodhart problems, adversarial examples, distribution shift, subagent problems, etc.

Planned opinion:

I liked this sequence. Often when presented with a potential problem in AI safety, I ask myself why the problem doesn't also apply to humans, and how humans have managed to solve it. This sequence was primarily this sort of reasoning, and I think it did a good job of highlighting how, with sufficient conservatism, it seems plausible that many problems are not that bad if the AI is well-intentioned, even if it has very little information, finds it hard to communicate with humans, or has the wrong abstractions.
Comment by rohinmshah on Deconfusing Human Values Research Agenda v1 · 2020-03-26T04:18:37.126Z · score: 4 (2 votes) · LW · GW

Planned summary for the Alignment Newsletter:

This post argues that since 1. human values are necessary for alignment, 2. we are confused about human values, and 3. we couldn't verify it if an AI system discovered the structure of human values, we need to do research to become less confused about human values. This research agenda aims to deconfuse human values by modeling them as the input to a decision process which produces behavior and preferences. The author's best guess is that human values are captured by valence, as modeled by minimization of prediction error.

Planned opinion:

This is similar to the argument in <@Why we need a *theory* of human values@>, and my opinion remains roughly the same: I strongly agree that we are confused about human values, but I don't see an understanding of human values as necessary for value alignment. We could hope to build AI systems in a way where we don't need to specify the ultimate human values (or even a framework for learning them) before running the AI system. As an analogy, my friends and I are all confused about human values, but nonetheless I think they are more or less aligned with me (in the sense that if AI systems were like my friends but superintelligent, that sounds broadly fine).
Comment by rohinmshah on Alignment as Translation · 2020-03-26T03:47:42.356Z · score: 4 (2 votes) · LW · GW

So it seems like this framing of alignment removes the notion of the AI "optimizing for something" or "being goal-directed". Do you endorse dropping that idea?

With just this general argument, I would probably not argue for AI risk -- if I had to argue for it, the argument would go "we ask the AI to do something, this gets mistranslated and the AI does something else with weird consequences, maybe the weird consequences include extinction", but it sure seems like as it starts doing the "something else" we would e.g. turn it off.

Comment by rohinmshah on AGI in a vulnerable world · 2020-03-26T03:37:40.158Z · score: 4 (2 votes) · LW · GW
It seems like large organizations make the majority of progress on the frontier, but smaller teams are close behind and able to reproduce impressive results with dramatically fewer resources.

I'd be surprised if that latter part continued for several more years. At least for ImageNet, compute cost in dollars has not been a significant constraint (I expect the cost of researcher time far dominates it, even for the non-optimized implementations), so it's not that surprising that researchers don't put in the work needed to make it as fast and cheap as possible. Presumably there will be more effort along these axes as compute costs overtake researcher time costs.

Comment by rohinmshah on Thinking About Filtered Evidence Is (Very!) Hard · 2020-03-25T20:19:14.245Z · score: 13 (3 votes) · LW · GW
Since this hypothesis makes distinct predictions, it is possible for the confidence to rise above 50% after finitely many observations. At that point, since the listener expects each theorem of PA to eventually be listed, with probability > 50%, and the listener believes the speaker, the listener must assign > 50% probability to each theorem of PA!

I don't see how this follows. At the point where the confidence in PA rises above 50%, why can't the agent be mistaken about what the theorems of PA are? For example, let T be a theorem of PA that hasn't been claimed yet. Why can't the agent believe P(claims-T) = 0.01 and P(claims-not-T) = 0.99? It doesn't seem like this violates any of your assumptions. I suspect you wanted to use Assumption 2 here:

A listener believes a speaker to be honest if the listener distinguishes between "X" and "the speaker claims X at time t" (aka "claims_t-X"), and also has beliefs such that P(X | claims_t-X) = 1 when P(claims_t-X) > 0.

But as far as I can tell the scenario I gave is compatible with that assumption.

Comment by rohinmshah on An Analytic Perspective on AI Alignment · 2020-03-24T01:57:21.748Z · score: 2 (1 votes) · LW · GW

I think I should not have used the laptop example, it's not really communicating what I meant it to communicate. I was trying to convey "mechanistic transparency is hard" rather than "mechanistic transparency requires a single person to understand everything".

Comment by rohinmshah on An Analytic Perspective on AI Alignment · 2020-03-24T01:55:19.571Z · score: 4 (2 votes) · LW · GW
I guess I don't understand why linear scaling would imply this - in fact, I'd guess that training should probably be super-linear, since each backward pass takes linear time, but the more neurons you have, the bigger the parameter space, and so the greater number of gradient steps you need to take to reach the optimum, right?

Yeah, that's plausible. This does mean the mechanistic transparency cost could scale sublinearly w.r.t compute cost, though I doubt it (for the other reasons I mentioned).

If that estimate comes from OpenAI's efforts to understand image recognition, I think it's too high, since we presumably learned a bunch about what to look for from their efforts.

Nah, I just pulled a number out of nowhere. The estimate based on existing efforts would be way higher. Back of the envelope: it costs ~$50 to train on ImageNet (see here). Meanwhile, there have been probably around 10 person-years spent on understanding one image classifier? At $250k per person-year, that's $2.5 million on understanding, making it 50,000x more expensive to understand it than to train it.

Things that would move this number down:

  • Including the researcher time in the cost to train on ImageNet. I think that we will soon (if we haven't already) enter the regime where researcher cost < compute cost, so that would only change the conclusion by a factor of at most 2.
  • Using the cost for an unoptimized implementation, which would probably be > $50. I'd expect those optimizations to already be taken for the systems we care about -- it's way more important to get a 2x cost reduction when your training run costs $100 million than when your training run costs under $1000.
  • Including the cost of hyperparameter tuning. This also seems like a thing we will cause to be no more than a factor of 2, e.g. by using population-based training of hyperparameters.
  • Including the cost of data collection. This seems important, future data collection probably will be very expensive (even if simulating, there's the compute cost of the simulation), but idk how to take it into account. Maybe decrease the estimate by a factor of 10?
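The headline estimate above, before any of these adjustments, works out as follows (all figures are the rough assumptions stated in the comment):

```python
# Back-of-the-envelope: cost to mechanistically understand one ImageNet
# classifier vs. cost to train it (all figures are rough assumptions above).
train_cost = 50                    # dollars to train on ImageNet
person_years = 10                  # rough effort spent understanding one classifier
cost_per_person_year = 250_000     # dollars
understanding_cost = person_years * cost_per_person_year  # $2.5 million
ratio = understanding_cost / train_cost                   # 50,000x

assert understanding_cost == 2_500_000
assert ratio == 50_000
```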
Once you have a model of a module such that if the module worked according to your model things would be fine, you can just train the module to better fit your model.

You could also just use the model, if it's fast. It would be interesting to see how well this works. My guess is that abstractions are leaky because there are no good non-leaky abstractions, which would predict that this doesn't work very well.

I think this is a minor benefit. In most domains, specialists will understand the meanings of input data to their systems

I think this is basically just the same point as "the problem gets harder when the AI system is superhuman", except the point is that the AI system becomes superhuman way faster on domains that are not native to humans, e.g. DNA, drug structures, protein folding, math intuition, relative to domains that are native to humans, like image classification.

Comment by rohinmshah on An Analytic Perspective on AI Alignment · 2020-03-24T01:31:17.142Z · score: 2 (1 votes) · LW · GW
That being said, if it's as hard as you think it will be, I don't understand how it could usefully contribute to the dot points you mention.

Taking each of the bullet points I mentioned in turn:

Regulations / laws to not build powerful AI

You could imagine a law "we will not build AI systems that use >X amount of compute unless they are mechanistically transparent". Then research on mechanistic transparency reduces the cost of such a law, making it more palatable to implement it.

Increasing AI researcher paranoia, so all AI researchers are very careful with powerful AI systems

The most obvious way to do this is to demonstrate that powerful AI systems are dangerous. One very compelling demonstration would be to train an AI system that we expect to be deceptive (that isn't powerful enough to take over), make it mechanistically transparent, and show that it is deceptive.

Here, the mechanistic transparency would make the demonstration much more compelling (relative to a demonstration where you show deceptive behavior, but there's the possibility that it's just a weird bug in that particular scenario).

Safety benchmarks (set of tests looking for common problems, updated as we encounter new problems) ("all the potentially dangerous AI systems we could have built failed one of the benchmark tests")

Mechanistic transparency opens up the possibility for safety tests of the form "train an AI system on this environment, and then use mechanistic transparency to check if it has learned <prohibited cognition>". (You could imagine that the environment is small, or the models trained are small, and that's why the cost of mechanistic transparency isn't prohibitive.)

Any of the AI alignment methods, e.g. value learning or iterated amplification ("we don't build dangerous AI systems because we build aligned AI systems instead")

Informed oversight can be solved via universality or interpretability; worst-case optimization currently relies on "magic" interpretability techniques. Even if full mechanistic transparency is too hard to do, I would expect that insights along the way would be helpful. For example, perhaps in adversarial training, if the adversary shares weights with the agent, the adversary already "knows" what the agent is "thinking", but it might need to use mechanistic transparency just for the final layer to understand what that part is doing.

Comment by rohinmshah on An Analytic Perspective on AI Alignment · 2020-03-24T01:19:08.722Z · score: 2 (1 votes) · LW · GW

Hmm, I was more pointing at the distinction where the first claim doesn't need to argue for the subclaim "we will be able to get people to use mechanistic transparency" (it's assumed away by "if I were in charge of the world"), while the second claim does have to argue for it.

Comment by rohinmshah on An Analytic Perspective on AI Alignment · 2020-03-22T23:56:44.052Z · score: 2 (1 votes) · LW · GW

Okay, I think I see the miscommunication.

The story you have is "the developers build a few small neural net modules that do one thing, mechanistically understand those modules, then use those modules to build newer modules that do 'bigger' things, and mechanistically understand those, and keep iterating this until they have an AGI". Does that sound right to you? If so, I agree that by following such a process the developer team could get mechanistic transparency into the neural net the same way that laptop-making companies have mechanistic transparency into laptops.

The story I took away from this post is "we do end-to-end training with regularization for modularity, and then we get out a neural net with modular structure. We then need to understand this neural net mechanistically to ensure it isn't dangerous". This seems much more analogous to needing to mechanistically understand a laptop that "fell out of the sky one day" before we had ever made a laptop.

My critiques are primarily about the second story. My critique of the first story would be that it seems like you're sacrificing a lot of competitiveness by having to develop the modules one at a time, instead of using end-to-end training.

Comment by rohinmshah on An Analytic Perspective on AI Alignment · 2020-03-22T15:42:46.475Z · score: 2 (1 votes) · LW · GW
You seem to be arguing the stronger claim that nobody understands an entire laptop "all at once"

weaker claim?

But such an understanding is almost never possible for any complex system, and yet we still try to approach it. So this objection could show that mechanistic transparency is hard in the limit, but it doesn't show that mechanistic transparency is uniquely bad in any sense.

This seems to be assuming that we have to be able to take any complex trained AGI-as-a-neural-net and determine whether or not it is dangerous. Under that assumption, I agree that the problem is itself very hard, and mechanistic transparency is not uniquely bad relative to other possibilities.

But my point is that because it is so hard to detect whether an arbitrary neural net is dangerous, you should be trying to solve a different problem. This only depends on the claim that mechanistic transparency is hard in an absolute sense, not a relative sense (given the problem it is trying to solve).

Relatedly, from Evan Hubinger:

Put another way: once you're playing the game where I can hand you any model and then you have to figure out whether it's deceptive or not, you've already lost. Instead, you want to be in the regime where your training process is constructed so as to steer clear of situations in which your model might become deceptive in the first place.

All of the other stories for preventing catastrophe that I mentioned in the grandparent are tackling a hopefully easier problem than "detect whether an arbitrary neural net is dangerous".

Comment by rohinmshah on An Analytic Perspective on AI Alignment · 2020-03-22T03:08:20.316Z · score: 2 (1 votes) · LW · GW
Note that I'm not asking for systems to be mechanistically transparent to people with backgrounds and training in the relevant field, just that they be mechanistically transparent to their developers. This is still difficult, but as far as I know it's possible for laptops (although I could be wrong about this, I'm not a laptop expert).

I'd be shocked if there was anyone to whom it was mechanistically transparent how a laptop loads a website, down to the gates in the laptop.

I'd be surprised if there was anyone to whom it was mechanistically transparent how a laptop boots up, down to the gates in the laptop. (Note you'd have to understand the entire BIOS as well as all of the hardware in the laptop.)

Whoa, I'm so confused by that. It seems pretty clear to me that it's easier to regularise for properties that have nicer, more 'mathematical' definitions, and if that's false then I might just be fundamentally misunderstanding something.

It's easier in the sense that it's easier to compute it in Tensorflow and then use gradient descent to make the number smaller / bigger. But if you ignore that factor and ask whether a more mathematical definition will lead to more human-interpretability, then I don't see a particular reason to expect mathematical definitions to work better.

This basically seems right to me, and as such I'm researching how to make networks modular and identify their modularity structure. It feels to me like this research is doing OK and is not obviously doomed.

I think my argument was more like "in the world where your modularity research works out perfectly, you get linear scaling, and then it still costs 100x to have a mechanistically-understood AI system relative to a black-box AI system, which seems prohibitively expensive". And that's without including a bunch of other difficulties:

  • Right now we're working with subhuman AI systems where we already have concepts that we can use to understand AI systems; this will become much more difficult with superhuman AI systems.
  • All abstractions are leaky; as you build up hierarchies of abstractions for mechanistically understanding a neural net, the problems with your abstractions can cause you to miss potential problems. (As an analogy, when programming without any APIs / external code, you presumably mechanistically understand the code you write; yet bugs are common in such programming.)
  • With image classifiers we have the benefit of images being an input mechanism we are used to; it will presumably be a lot harder with input mechanisms we aren't used to.

It is certainly not unimaginable to me that these problems get solved somehow, but to convince me to promote this particular story for AI alignment to attention (at least beyond the threshold of "a smart person I know is excited about it"), you'd need to have some story / hope for how to deal with these problems. (E.g. as you mention in your post, you could imagine dealing with the last one using something like iterated amplification? Maybe?)

Here are some other stories for preventing catastrophes:

  • Regulations / laws to not build powerful AI
  • Increasing AI researcher paranoia, so all AI researchers are very careful with powerful AI systems
  • BoMAI-style boxing ("all the powerful AI systems we build don't care about anything that would make catastrophe instrumentally useful")
  • Impact regularization ("all the AI systems we build don't want to do something as high-impact as a catastrophe")
  • Safety benchmarks (set of tests looking for common problems, updated as we encounter new problems) ("all the potentially dangerous AI systems we could have built failed one of the benchmark tests")
  • Any of the AI alignment methods, e.g. value learning or iterated amplification ("we don't build dangerous AI systems because we build aligned AI systems instead")

Currently I find all of these stories more plausible than the story "we don't deploy a dangerous AI system because the developers mechanistically understood the dangerous AI system, detected the danger, and decided not to deploy it".

I want to emphasize that I think the general research direction is good and will be useful and I want more people to work on it (it makes the first, second, fifth and sixth bullet points above more effective); I only disagree with the story you've presented for how it reduces x-risk.

Comment by rohinmshah on An Analytic Perspective on AI Alignment · 2020-03-22T02:35:01.898Z · score: 2 (1 votes) · LW · GW
For what it's worth, I really dislike this terminology. Of course saying "I want X" is normative, and of course it's based on empirical beliefs.

Here are two claims:

  • "If I were in charge of the world, I would ensure that no powerful AI system were deployed unless we had mechanistic transparency into that system, because anything short of that is an unacceptable level of risk"
  • "I think that we should push for mechanistic transparency, because by doing so we will cause developers not to deploy dangerous AI systems, because they will use mechanistic transparency techniques to identify when the AI system is dangerous"

There is an axis on which these two claims differ, where I would say the first one is normative and the second one is empirical. The phrase "perfect is the enemy of good" is also talking about this axis. What would you name that axis?

In any case, probably at this point you know what I mean. I would like to see more argumentation for the second kind of claim, and am trying to say that arguments for the first kind of claim are not likely to sway me.

Re: clarification of desideratum, that makes sense.

Comment by rohinmshah on An Analytic Perspective on AI Alignment · 2020-03-22T02:20:27.858Z · score: 7 (4 votes) · LW · GW

Asya's summary for the Alignment Newsletter:

In this post, Daniel Filan presents an analytic perspective on how to do useful AI alignment research. His take is that in a world with powerful AGI systems similar to neural networks, it may be sufficient to be able to detect whether a system would cause bad outcomes before you deploy it on real-world systems with unknown distributions. To this end, he advocates for work on transparency that gives <@mechanistic understandings@>(@Mechanistic Transparency for Machine Learning@) of the systems in question, combined with foundational research that allows us to reason about the safety of the produced understandings.

My opinion:

My broad take is that I agree that analyzing neural nets is useful and more work should go into it, but I broadly disagree that this leads to reduced x-risk by increasing the likelihood that developers can look at their trained model, determine whether it is dangerous by understanding it mechanistically, and decide whether to deploy it, in a "zero-shot" way. The key difficulty here is the mechanistic transparency, which seems like far too strong a property for us to aim for given empirical facts about the world (though perhaps normatively it would be better if humanity didn't deploy powerful AI until it was mechanistically understood).
The main intuition here is that the difficulty of mechanistic transparency increases at least linearly and probably superlinearly with model complexity. Combine that with the fact that right now for e.g. image classifiers, some people on OpenAI's Clarity team have spent multiple years understanding a single image classifier, which is orders of magnitude more expensive than training the classifier, and it seems quite unlikely that we could have mechanistic transparency for even more complex AGI systems built out of neural nets. More details in this comment. Note that Daniel agrees that it is an open question whether this sort of mechanistic transparency is possible, and thinks that we don't have much evidence yet that it isn't.
Comment by rohinmshah on An Analytic Perspective on AI Alignment · 2020-03-21T21:49:28.780Z · score: 2 (1 votes) · LW · GW

I was mostly trying to illustrate the point, but if you want a different example:

There are a large number of mammalian species, meaning that one-celled mammalian species can be found.


There are infinitely many prime numbers, meaning that many even prime numbers can be found.
Comment by rohinmshah on [AN #91]: Concepts, implementations, problems, and a benchmark for impact measurement · 2020-03-21T21:21:32.031Z · score: 2 (1 votes) · LW · GW

Unclear, seems like it could go either way.

If you aren't forced to learn all the ways of doing the task, then you should expect the neural net to learn only one of the ways. So maybe it's that the adversarial nature of OpenAI Five forced it to learn all the ways, and it was then paradoxically easier to remember all of the ways than just one of the ways.

Comment by rohinmshah on [AN #91]: Concepts, implementations, problems, and a benchmark for impact measurement · 2020-03-21T21:18:48.086Z · score: 4 (2 votes) · LW · GW
This leaves unclear how it is decided that old "agents" should be used.

Yeah, it's complicated and messy and not that important for the main point of the paper, so I didn't write about it in the summary.

Was this switch to a new agent automatic or done by hand? (Was 'the agent has plateaued' determined by a program or the authors of the paper?)

Automatic / program. See Section 4, whose first sentence is "To generalize this observation, we first propose a simple algorithm for selecting states associated with plateaus of the last agent."

(The algorithm cheats a bit by assuming that you can run the original agent for some additional time, but then "roll it back" to the first state at which it got the max reward along the trajectory.)

Not apparent.

I may be missing your point, but isn't the fact that the Memento agent works on Montezuma's Revenge evidence that learning is not generalizing across "sections" in Montezuma's Revenge?

Comment by rohinmshah on An Analytic Perspective on AI Alignment · 2020-03-21T20:36:51.469Z · score: 14 (4 votes) · LW · GW

Overall take: Broadly agree that analyzing neural nets is useful and more work should go into it. Broadly disagree with the story for how this leads to reduced x-risk. Detailed comments below:

Background beliefs:

Broadly agree, with one caveat:

Futhermore [sic], I’m imagining that the training process of these ML systems does not provide enough guarantees about deployment performance.

I'm assuming "guarantees" means something like "strong arguments", and would include things like "when I train the agent on this loss function and it does well on the validation set, it will also do well on a test set drawn from the same distribution" (although I suppose you can prove that that holds with high probability). Perhaps a more interesting strong argument that's not a proof but that might count as a guarantee would be something like "if I perform adversarial training with a sufficiently smart adversary, it is unlikely that the agent finds and fails on an example that was within the adversary's search space".

If you include these sorts of things as guarantees, then I think the training process "by default" won't provide enough guarantees, but we might be able to get it to provide enough guarantees, e.g. by adversarial training. Alternatively, there will exist training processes that won't provide enough guarantees but will knowably be likely to produce AGI; but there may also be versions that do provide enough guarantees.

Background desiderata:

I want this determination to be made before the system is deployed, in a ‘zero-shot’ fashion, since this minimises the risk of the system actually behaving badly before you can detect and prevent it.

This seems normative rather than empirical. Certainly we need some form of 'zero-shot' analysis -- in particular, we must be able to predict whether a system causes x-risk in a zero-shot way (you can't see any examples of a system actually causing x-risk). But depending on what exactly you mean, I think you're probably aiming for too strong a property, one that's unachievable given background facts about the world. (More explanation in the Transparency section.)

Ways in which this desideratum is unclear to me:

  • Why is the distinction between training and deployment important? Most methods of training involve the AI system acting. Are you hoping that the training process (e.g. gradient descent) leads to safety?
  • Presumably many forms of interpretability techniques involve computing specific outputs of the neural net in order to understand them. Why doesn't this count as "running" the neural net?
  • My best guess is that you are distinguishing between the AI system acting in the real world during deployment (whereas training and interpretability were in simulation or with hypothetical inputs, or involved some other form of boxing that prevented it from "doing much real stuff"). What about training schemes in which the agent gradually becomes more and more exposed to the real world? Where is "deployment" then? (For example, consider OpenAI Five: while most of its training was in simulation, it played several games against humans during training, with more and more capable humans, and then eventually was "given access" to the full Internet via Arena. Which point was "deployment"?)

EDIT: Tbc, I think "deployment" is a relatively crisp concept when considering AI governance, where you can think of it as the point at which you release the AI system into the world and other actors besides the one that trained the system start interacting with it in earnest, and this point is a pretty important point in terms of the impacts of the AI system. For OpenAI Five, this would be the launch of Arena. But this sort of distinction seems much less relevant / crisp for AI alignment.


The type of transparency that I’m most excited about is mechanistic

Mechanistic transparency seems incredibly difficult to achieve to me. As an analogy, I don't think I understand how a laptop works at a mechanistic level, despite having a lot of training in Computer Science. This is a system that is built to be interpretable to humans, human civilization as a whole has a mechanistic understanding of laptops, and lots of effort has been put into creating good educational materials that most clearly convey a mechanistic understanding of (components of) laptops -- we have none of these advantages for neural nets. Of course, a laptop is very complex; but I would expect an AGI-via-neural-nets to be pretty complex as well.

I also think that mechanistic transparency becomes much more difficult as systems become more complex. In the best case, where the networks are nice and modular, it becomes linearly harder, which might keep the cost ratio the same (it seems plausible to scale human effort spent understanding the net at the same rate that we scale model capacity). But if it is superlinearly harder (which seems more likely to me, because I don't expect it to be easy to identify human-interpretable modularity even when present), then as model capacity increases, human oversight becomes a larger and larger fraction of the cost.

Currently human oversight is already 99+% of the cost of mechanistically transparent image classifiers: Chris Olah and co. have spent multiple years on one image classifier and are maybe getting close to a mechanistic-ish understanding of it, though of course presumably future efforts would be less costly because they'll have learned important lessons. (Otoh, things that aren't image classifiers are probably harder to mechanistically understand, especially things that are better-than-human, as in e.g. AlphaGo's move 37.)

This will be easier to do if the transparency method is simpler, more ‘mathematical’, and minimally reliant on machine learning.

Controversial, I'm pretty uncertain but weakly lean against. (Probably not worth discussing though, just wanted to note the disagreement.)

This paper on the intrinsic dimension of objective landscapes shows that you can constrain neural network weights to a low-dimensional subspace and still find good solutions.

But interestingly, you can't just use fewer neurons (corresponding to a low-dimensional subspace where the projection matrix consists of unit vectors along the axes) -- it has to be a random subspace. I think we don't really understand what's going on here and I wouldn't update too much on the possibility of transparency from it (though it is weak evidence that regularization is possible and strong evidence that there are lots of good models).
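The subspace-training trick can be sketched in a few lines (a minimal sketch with made-up sizes; the paper trains real networks this way, this only shows the parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)

D = 1000  # total number of network parameters (toy size)
d = 10    # intrinsic dimension we actually train in

theta_0 = rng.normal(size=D)    # frozen random initialization
P = rng.normal(size=(D, d))     # fixed random projection matrix
P /= np.linalg.norm(P, axis=0)  # normalize columns

def full_params(z):
    """Map the d trainable coordinates z to the full D parameters.

    Only z is optimized; theta_0 and P stay fixed, so the reachable
    solutions form a random d-dimensional affine subspace of R^D.
    """
    return theta_0 + P @ z
```

The "fewer neurons" baseline corresponds to replacing the random P with axis-aligned unit vectors, which is exactly the variant that performs worse.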

> This paper argues that there are a large number of models with roughly the same performance, meaning that ones with good qualities (e.g. interpretability) can be found.

Compare: There are a large number of NBA players, meaning that ones who are short can be found.

> This paper applies regularisation to machine learning models that ensures that they are represented by small decision trees.

Looking at the results of the paper, it only seems to work for simple tasks, as you might expect. For the most neural-net-like task (recognizing stop phonemes from audio, which is still far simpler than e.g. speech recognition), the neural net gets ~0.95 AUC while the decision tree gets ~0.75 (a vast difference: random is 0.5 and perfect is 1).
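For readers unfamiliar with AUC: it is the probability that a randomly chosen positive example is scored above a randomly chosen negative one, which is why 0.5 is chance and 1.0 is perfect. A quick self-contained check (toy scores of my own, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def auc(pos_scores, neg_scores):
    """P(random positive outscores random negative), counting ties as half."""
    wins = (pos_scores[:, None] > neg_scores[None, :]).mean()
    ties = (pos_scores[:, None] == neg_scores[None, :]).mean()
    return wins + 0.5 * ties

# A classifier scoring at random sits near 0.5 ...
random_auc = auc(rng.random(2000), rng.random(2000))
# ... while one that perfectly separates the classes hits 1.0.
perfect_auc = auc(np.full(100, 0.9), np.full(100, 0.1))
```

On this scale, dropping from ~0.95 to ~0.75 gives up a large fraction of the usable range above chance.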

Generally, there seem to be people (e.g. Cynthia Rudin) who argue "we can have interpretability and accuracy", and when you look at the details they are working with very low-dimensional, simple tasks. I certainly agree with the claim in those settings (and that we should use interpretable models there), but it doesn't seem to apply to e.g. image classifiers or speech recognition, and seems like it would apply even less to AGI-via-neural-nets.

> I think that an important part of this is ‘agent foundations’, by which I broadly mean a theory of what agents should look like, and what structural facts about agents could cause them to display undesired behaviour. (emphasis mine)

Huh? Surely if you're trying to understand agents that arise, you should have a theory of arbitrary agents rather than ideal agents. John Wentworth's stuff seems way more relevant than MIRI's Agent Foundations for the purpose you have in mind.

I could see it being useful to do MIRI-style Agent Foundations work to discover what sorts of problems could arise, though I could imagine this happening in many other ways as well.

Comment by rohinmshah on [AN #55] Regulatory markets and international standards as a means of ensuring beneficial AI · 2020-03-19T20:12:35.645Z · score: 4 (2 votes) · LW · GW

I agree that regulation could be helpful by reducing races to the bottom; I think what I was getting at here (which I might be wrong about, as it was several months ago) was that it is hard to build regulations that directly attack the technical problem. Consider for example the case of car manufacturing. You could have two types of regulations:

  1. Regulations that provide direct evidence of safety: For example, you could require that all car designs be put through a battery of safety tests, e.g. crashing them into a wall and ensuring that the airbags deploy.
  2. Regulations that provide evidence of thinking about safety: For example, you could require that all car designs have at least 5 person-years of safety analysis done by people with a degree in Automotive Safety (which is probably not an actual field but in theory could be one).

Iirc, the regulatory markets paper seemed to place most of its optimism in the first kind of regulation, or at least that's how I interpreted it. That kind of regulation seems particularly hard in the one-shot alignment case. The second kind of regulation seems much more possible to do in all scenarios, and preventing races to the bottom is an example of that kind of regulation.

I'm not sure what I meant by legible regulation -- probably I was just emphasizing the fact that for regulations to be good, they need to be sufficiently clear and understood by companies so that they can actually be in compliance with them. Again, for regulations of the first kind this seems pretty hard to do.

Comment by rohinmshah on [AN #91]: Concepts, implementations, problems, and a benchmark for impact measurement · 2020-03-19T00:49:39.743Z · score: 2 (1 votes) · LW · GW
> True, but OA5 is inherently a different setup than ALE.

I broadly agree with this, but have some nitpicks on the specific things you mentioned.

> Catastrophic forgetting is at least partially offset by the play against historical checkpoints, which doesn't have an equivalent in your standard ALE

But since you're always starting from the same state, you always have to solve the earlier subtasks? E.g. in Montezuma's revenge in every trajectory you have to successfully get the key and climb the ladder; this doesn't change as you learn more.

> there's no adversarial dynamics or AlphaStar-style population of agents which can exploit forgotten area of state-space

The thing about Montezuma's revenge and similar hard-exploration tasks is that there's only one trajectory you need to learn, and if you forget any part of it you fail drastically; I would by default expect this to be better than adversarial dynamics / populations at ensuring that the agent doesn't forget things.

> There's also the batch size. The OA5 batch size was ridiculously large. Given all of the stochasticity in a DoTA2 game & additional exploration, that covers an awful lot of possible trajectories.

Agreed, but the Memento observation also shows that the problem isn't about exploration: if you make a literal copy of the agent that gets 6600 reward and train that from the 6600 reward states, it reliably gets more reward than the original agent got. The only difference between the two situations is that in the original situation, the original agent still had to remember how to get to the 6600 reward states in order to maintain its performance, while the new agent was allowed to start directly from that state and so was allowed to forget how to get to the 6600 reward states.

In particular, I would guess that the original agent does explore trajectories in which it gets higher reward (because the Memento agent definitely does), but for whatever reason it is unable to learn as effectively from those trajectories.

> Is cut off by default due to length.

Thanks, we noticed this after we sent it out (I think it didn't happen in our test emails for whatever reason). Hopefully the kinks in the new design will be worked out by next week.

(That being said, I've seen other newsletters which are always cut off by Gmail, so it may not be possible to avoid this when using a nice HTML design... if anyone knows how to fix it I'd appreciate tips.)

Comment by rohinmshah on [AN #91]: Concepts, implementations, problems, and a benchmark for impact measurement · 2020-03-19T00:47:07.168Z · score: 2 (1 votes) · LW · GW

Yup, strongly agree. I focused on the deterministic case because the point is easiest to understand there, but the same points apply in the stochastic case.

> I suspect people are doing something heuristic and possibly kludgy when we think about someone else gaining power.

I agree, though if I were trying to have a nice formalization, one thing I might do is look at what "power" looks like in a multiagent setting, where you can't be "larger" than the environment, and so you can't have perfectly calibrated beliefs about what's going to happen.

Comment by rohinmshah on Will AI undergo discontinuous progress? · 2020-03-17T22:38:10.316Z · score: 3 (2 votes) · LW · GW

Planned summary for the Alignment Newsletter:

This post argues that the debate over takeoff speeds is over a smaller issue than you might otherwise think: people seem to be arguing for either discontinuous progress, or continuous but fast progress. Both camps agree that once AI reaches human-level intelligence, progress will be extremely rapid; the disagreement is primarily about whether there is already quite a lot of progress _before_ that point. As a result, these differences don't constitute a "shift in arguments on AI safety", as some have claimed.
The post also goes through some of the arguments and claims that people have made in the past, which I'm not going to summarize here.

Planned opinion:

While I agree that the debate about takeoff speeds is primarily about the path by which we get to powerful AI systems, that seems like a pretty important question to me with <@many ramifications@>(@Clarifying some key hypotheses in AI alignment@).
Comment by rohinmshah on Will AI undergo discontinuous progress? · 2020-03-17T19:03:08.940Z · score: 2 (1 votes) · LW · GW
> The objective time taken for progress in AI is more significant than whether that progress is continuous or discontinuous

I don't think I agree? Presence of discontinuities determines which research agendas will or won't work, and so is extremely decision-relevant; in contrast objective time taken has very little influence on research agendas.

Objective time might be more important for e.g. AI strategy and policy; I haven't thought about it that much, though my immediate guess is that even there the presence/absence of discontinuities will be a key crux in what sorts of institutions we should aim for.

Comment by rohinmshah on Distinguishing definitions of takeoff · 2020-03-17T18:40:31.802Z · score: 4 (2 votes) · LW · GW

Planned summary for the Alignment Newsletter:

This post lists and explains several different "types" of AI takeoff that people talk about. Rather than summarize all the definitions (which would only be slightly shorter than the post itself), I'll try to name the main axes that definitions vary on (but as a result this is less of a summary and more of an analysis):
1. _Locality_. It could be the case that a single AI project far outpaces the rest of the world (e.g. via recursive self-improvement), or that there will never be extreme variations amongst AI projects across all tasks, in which case the "cognitive effort" will be distributed across multiple actors. This roughly corresponds to the Yudkowsky-Hanson FOOM debate, and the latter position also seems to be that taken by <@CAIS@>(@Reframing Superintelligence: Comprehensive AI Services as General Intelligence@).
2. _Wall clock time_. In Superintelligence, takeoffs are defined based on how long it takes for a human-level AI system to become strongly superintelligent, with "slow" being decades to centuries, and "fast" being minutes to days.
3. _GDP trend extrapolation_. Here, a continuation of an exponential trend would mean there is no takeoff (even if we some day get superintelligent AI), a hyperbolic trend where the doubling time of GDP decreases in a relatively continuous / gradual manner counts as continuous / gradual / slow takeoff, and a curve which shows a discontinuity would be a discontinuous / hard takeoff.

Planned opinion:

I found this post useful for clarifying exactly which axes of takeoff people disagree about, and also for introducing me to some notions of takeoff I hadn't seen before (though I haven't summarized them here).