The case for aligning narrowly superhuman models 2021-03-05T22:29:41.577Z
AMA on EA Forum: Ajeya Cotra, researcher at Open Phil 2021-01-29T23:05:41.527Z
Draft report on AI timelines 2020-09-18T23:47:39.684Z
Iterated Distillation and Amplification 2018-11-30T04:47:14.460Z


Comment by Ajeya Cotra (ajeya-cotra) on Draft report on AI timelines · 2021-07-20T20:00:52.881Z · LW · GW

There is some limited sensitivity analysis in the "Conservative and aggressive estimates" section of Part 4.

Comment by Ajeya Cotra (ajeya-cotra) on Anna and Oliver discuss Children and X-Risk · 2021-04-09T14:45:20.843Z · LW · GW

Belatedly, I did a bit of outside-view research on the time and monetary costs of kids (though a couple of parent friends kindly sanity-checked some of it). I presented it at my house's internal conference, but some folks suggested I share more broadly in case it's helpful to others: here is the slide deck. The assumptions are Bay Area, upper-middle-class parents (e.g. both programmers or something like that) who both want to keep their careers and are therefore willing to pay a lot for childcare.

Comment by Ajeya Cotra (ajeya-cotra) on Notes from "Don't Shoot the Dog" · 2021-04-02T22:12:58.089Z · LW · GW

Thanks for writing this up! Appreciate the personal anecdotes too. Curious if you or Jeff have any tips and tricks for maintaining the patience/discipline required to pull off this kind of parenting (for other readers, I enjoyed some of Jeff's thoughts on predictable parenting here). Intuitively to me, it seems like this is a reason that the value-add from paying for childcare might be higher than you'd think naively — not only do you directly save time, you might also have more emotional reserves to be consistent and disciplined if you get more breaks.

Comment by Ajeya Cotra (ajeya-cotra) on The case for aligning narrowly superhuman models · 2021-03-19T02:43:32.838Z · LW · GW

I'm personally skeptical that this work is better-optimized for improving AI capabilities than other work being done in industry. In general, I'm skeptical of the perspective that work done by the rationalist/EA/alignment crowd Pareto-dominates the other work going on -- that is, that it's significantly better for both alignment and capabilities than standard work, such that others are simply making a mistake by not working on it regardless of what their goals are or how much they care about alignment. I think sometimes this could be the case, but I wouldn't bet on it being a large effect. In general, I expect work optimized to help with alignment to be worse on average at pushing forward capabilities, and vice versa.

Comment by Ajeya Cotra (ajeya-cotra) on The case for aligning narrowly superhuman models · 2021-03-10T20:00:21.885Z · LW · GW

In my head the point of this proposal is very much about practicing what we eventually want to do, and seeing what comes out of that; I wasn't trying here to make something different sound like it's about practice. I don't think that a framing which moved away from that would better get at the point I was making, though I totally think there could be other lines of empirical research under other framings that I'd be similarly excited about or maybe more excited about.

In my mind, the "better than evaluators" part is kind of self-evidently intriguing for the basic reason I said in the post (it's not obvious how to do it, and it's analogous to the broad, outside view conception of the long-run challenge which can be described in one sentence/phrase and isn't strongly tied to a particular theoretical framing):

I’m excited about tackling this particular type of near-term challenge because it feels like a microcosm of the long-term AI alignment problem in a real, non-superficial sense. In the end, we probably want to find ways to meaningfully supervise (or justifiably trust) models that are more capable than ~all humans in ~all domains.[4] So it seems like a promising form of practice to figure out how to get particular humans to oversee models that are more capable than them in specific ways, if this is done with an eye to developing scalable and domain-general techniques.

A lot of people in response to the draft were pushing in the direction that I think you were maybe gesturing at (?) -- to make this more specific to "knowing everything the model knows" or "ascription universality"; the section "Why not focus on testing a long-term solution?" was written in response to Evan Hubinger and others. I think I'm still not convinced that's the right way to go.

Comment by Ajeya Cotra (ajeya-cotra) on The case for aligning narrowly superhuman models · 2021-03-10T18:28:44.063Z · LW · GW

I don't feel confident enough in the frame of "inaccessible information" to say that the whole agenda is about it. It feels like a fit for "advice", but not a fit for "writing stories" or "solving programming puzzles" (at least not an intuitive fit -- you could frame it as "the model has inaccessible information about [story-writing, programming]" but it feels more awkward to me). I do agree it's about "strongly suspecting it has the potential to do better than humans" rather than about "already being better than humans." Basically, it's about trying to find areas where lackluster performance seems to mostly be about "misalignment" rather than "capabilities" (recognizing those are both fuzzy terms).

Comment by Ajeya Cotra (ajeya-cotra) on The case for aligning narrowly superhuman models · 2021-03-10T09:40:08.009Z · LW · GW

Yeah, you're definitely pointing at an important way the framing is awkward. I think the real thing I want to say is "Try to use some humans to align a model in a domain where the model is better than the humans at the task", and it'd be nice to have a catchy term for that. Probably a model which is better than some humans (e.g. MTurkers) at one task (e.g. medical advice) will also be better than those same humans at many other tasks (e.g. writing horror stories); but at the same time for each task, there's some set of humans (e.g. doctors in the first case and horror authors in the second) where the model does worse.

I don't want to just call it "align superhuman AI today" because people will be like "What? We don't have that", but at the same time I don't want to drop "superhuman" from the name because that's the main reason it feels like "practicing what we eventually want to do." I considered "partially superhuman", but "narrowly" won out.

I'm definitely in the market for a better term here.

Comment by Ajeya Cotra (ajeya-cotra) on MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models" · 2021-03-10T09:08:24.929Z · LW · GW

The conceptual work I was gesturing at here is more Paul's work, since MIRI's work (afaik) is not really neural net-focused. It's true that Paul's work also doesn't assume a literal worst case; it's a very fuzzy concept I'm gesturing at here. It's more like, Paul's research process is to a) come up with some procedure, b) try to think of any "plausible" set of empirical outcomes that cause the procedure to fail, and c) modify the procedure to try to address that case. (The slipperiness comes in at the definition of "plausible" here, but the basic spirit of it is to "solve for every case" in the way theoretical CS typically aims to do in algorithm design, rather than "solve for the case we'll in fact encounter.")

Comment by Ajeya Cotra (ajeya-cotra) on MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models" · 2021-03-10T09:05:14.545Z · LW · GW

This was a really helpful articulation, thanks! I like "frankness", "forthrightness", "openness", etc. (These are all terms I was brainstorming to get at the "ascription universality" concept at one point.)

Comment by Ajeya Cotra (ajeya-cotra) on The case for aligning narrowly superhuman models · 2021-03-08T02:16:56.022Z · LW · GW

The case in my mind for preferring to elicit and solve problems at scale rather than in toy demos (when that's possible) is pretty broad and outside-view, but I'd nonetheless bet on it: I think a general bias toward wanting to "practice something as close to the real thing as possible" is likely to be productive. In terms of the more specific benefits I laid out in this section, I think that toy demos are less likely to have the first and second benefits ("Practical know-how and infrastructure" and "Better AI situation in the run-up to superintelligence"), and I think they may miss some ways to get the third benefit ("Discovering or verifying a long-term solution") because some viable long-term solutions may depend on some details about how large models tend to behave.

I do agree that working with larger models is more expensive and time-consuming, and sometimes it makes sense to work in a toy environment instead, but other things being equal I think it's more likely that demos done at scale will continue to work for superintelligent systems, so it's exciting that this is starting to become practical.

Comment by Ajeya Cotra (ajeya-cotra) on The case for aligning narrowly superhuman models · 2021-03-07T07:07:40.065Z · LW · GW

Yeah, in the context of a larger alignment scheme, it's assuming that in particular the problem of answering the question "How good is the AI's proposed action?" will factor down into sub-questions of manageable size.

Comment by Ajeya Cotra (ajeya-cotra) on The case for aligning narrowly superhuman models · 2021-03-07T06:15:23.925Z · LW · GW

The intuition for it is something like this: suppose I'm trying to make a difficult decision, like where to buy a house. There are hundreds of cities I'd be open to, each one has dozens of neighborhoods, and each neighborhood has dozens of important features, like safety, fun things to do, walkability, price per square foot, etc. If I had a long time, I would check out each neighborhood in each city in turn and examine how it does on each dimension, and pick the best neighborhood.

If I instead had an army of clones of myself, I could send many of them to each possible neighborhood, with each clone examining one dimension in one neighborhood. The mes that were all checking out different aspects of neighborhood X can send up an aggregated judgment to a me that is in charge of "holistic judgment of neighborhood X", and the mes that focus on holistic judgments of neighborhoods can do a big pairwise bracket to filter up a decision to the top me.
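The clone-army picture above can be sketched as a small hierarchical evaluation: per-dimension judgments aggregate into a holistic judgment per neighborhood, and a top-level comparison picks the winner. All the names and scores below are invented for illustration.

```python
# Toy sketch of the decomposition described above: each "clone" judges one
# dimension of one neighborhood; holistic scores aggregate upward, and the
# top-level judgment picks the best aggregate. All numbers are invented.

neighborhoods = {
    "Rockridge":  {"safety": 8, "fun": 7, "walkability": 9, "price": 3},
    "Noe Valley": {"safety": 9, "fun": 6, "walkability": 8, "price": 2},
    "Temescal":   {"safety": 6, "fun": 8, "walkability": 7, "price": 5},
}

def judge_dimension(score):
    # One clone examines one dimension and reports a judgment (identity here).
    return score

def holistic_judgment(dims):
    # A mid-level clone aggregates its sub-clones' per-dimension reports.
    return sum(judge_dimension(s) for s in dims.values()) / len(dims)

def top_level_decision(hoods):
    # The top-level clone compares holistic judgments and picks the best.
    return max(hoods, key=lambda name: holistic_judgment(hoods[name]))

print(top_level_decision(neighborhoods))
```

The point of the structure is that no single judge needs to hold the whole problem in their head: each node answers a question of manageable size.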

Comment by Ajeya Cotra (ajeya-cotra) on The case for aligning narrowly superhuman models · 2021-03-07T05:08:46.125Z · LW · GW

Yes sorry — I'm aware that in the HCH procedure no one human thinks for a long time. I'm generally used to mentally abstracting HCH (or whatever scheme fits that slot) as something that could "effectively replicate the benefits you could get from having a human thinking a long time," in terms of the role that it plays in an overall scheme for alignment. This isn't guaranteed to work out, of course. My position is similar to Rohin's above:

I just personally find it easier to think about "benefits of a human thinking for a long time" and then "does HCH get the same benefits as humans thinking for a long time" and then "does iterated amplification get the same benefits as HCH".

Comment by Ajeya Cotra (ajeya-cotra) on The case for aligning narrowly superhuman models · 2021-03-07T00:01:26.042Z · LW · GW

My understanding is that HCH is a proposed quasi-algorithm for replicating the effects of a human thinking for a long time.

Comment by Ajeya Cotra (ajeya-cotra) on The case for aligning narrowly superhuman models · 2021-03-06T17:02:43.034Z · LW · GW

My biggest concern is actually that the problem is going to be too easy for supervised learning. Need GPT-3 to dispense expert medical advice? Fine-tune it on a corpus of expert medical advice! Or for slightly more sophistication, fine-tune it to predict advice plus a score for how good the advice was, then condition on the score being high!

I don't think you can get away with supervised learning if you're holding yourself to the standard of finding fuzzy tasks where the model is narrowly superhuman. E.g. the Stiennon et al., 2020 paper involved using RL from human feedback: roughly speaking, that's how it was possible for the model to actually improve upon humans rather than simply imitating them. And I think in some cases, the model will be capable of doing better than (some) humans' evaluations, meaning that to "get models to the best they can to help us" we will probably need to do things like decomposition, training models to explain their decisions, tricks to amplify or de-noise human feedback, etc.

There's also some unavoidable conceptual progress needed (You can fine-tune GPT-3 for medical advice with little philosophical worry, but how do you fine-tune GPT-3 for moral advice? Okay, now that you thought of the obvious answer, what's wrong with it?)

I don't agree that there's obviously conceptual progress that's necessary for moral advice which is not necessary for medical advice — I'd expect a whole class of tasks to require similar types of techniques, and if there's a dividing line I don't think it is going to be "whether it's related to morality", but "whether it's difficult for the humans doing the evaluation to tell what's going on." To answer your question for both medical and moral advice, I'd say the obvious first thought is RL from human feedback, and the second thought I had to go beyond that is trying to figure out how to get less-capable humans to replicate the training signal produced by more-capable humans, without using any information/expertise from the latter to help the former (the "sandwiching" idea). I'm not sure if it'll work out though.
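As a toy illustration of why feedback-based training can exceed pure imitation: a reward model trained from human comparisons can score new samples the model generates, and selecting (or reinforcing) against that score can push past the quality of any single demonstration. Below is a minimal best-of-n sketch; the reward function is a hand-coded stand-in, not the Stiennon et al. setup.

```python
# Toy illustration: selection against a reward model (here a hand-coded
# stand-in for one trained on human comparisons) can outperform imitating
# any single demonstration.

def reward_model(text):
    # Stand-in for a learned preference model; here it just rewards longer,
    # more specific advice. Purely illustrative.
    return len(text.split())

def best_of_n(candidates):
    # Pick the sample the reward model scores highest (best-of-n sampling).
    return max(candidates, key=reward_model)

samples = [
    "Take aspirin.",
    "Take aspirin with food and see a doctor if pain persists.",
    "Rest.",
]
print(best_of_n(samples))
```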

Comment by Ajeya Cotra (ajeya-cotra) on The case for aligning narrowly superhuman models · 2021-03-06T16:51:18.136Z · LW · GW

We're simply not sure where "proactively pushing to make more of this type of research happen" should rank relative to other ways we could spend our time and money right now, and determining that will involve thinking about a lot of things that are not covered in this post (most importantly what the other opportunities are for our time and money).

already seen as a standard way to make progress on the full alignment problem

It might be a standard way to make progress, but I don't feel that this work has been the default so far — the other three types of research I laid out seem to have absorbed significantly more researcher-hours and dollars among people concerned with long-term AI risk reduction. (It's possible that human feedback is more common among people motivated by profit, but I doubt that because it doesn't seem that profitable yet.)

Also, if we use a stricter definition of "narrowly superhuman" (i.e. the model should be capable of outperforming the evaluations — not just the demonstrations — of the humans training it), I'd argue that there hasn't been any work published on that so far.

Comment by Ajeya Cotra (ajeya-cotra) on The case for aligning narrowly superhuman models · 2021-03-06T16:46:39.205Z · LW · GW

I guess the crux here is "And if the Hard problem is indeed hard enough to not be solved by anyone," — I don't think that's the default/expected outcome. There hasn't been that much effort on this problem in the scheme of things, and I think we don't know where it ranges from "pretty easy" to "very hard" right now.

Comment by Ajeya Cotra (ajeya-cotra) on The case for aligning narrowly superhuman models · 2021-03-06T02:06:05.823Z · LW · GW

Thanks for the comment! Just want to explicitly pull out and endorse this part:

the experts be completely and totally absent from the training process, and in particular no data from the experts should be involved in the training process

I should have emphasized that more in the original post as a major goal. I think you might be right that it will be hard to solve the "sandwich" problem without conceptual progress, but I also think that attempts to solve the sandwich problem could directly spur that progress (not just reveal the need for it, but also take steps toward finding actual algorithms in the course of doing one of the sandwich problems).

I also broadly agree with you that "things looking good to humans without actually being good" is a major problem to watch out for. But I don't think I agree that the most impressive-looking results will involve doing nothing to go beyond human feedback: successfully pulling off the sandwich method would most likely look significantly more impressive to mainstream ML researchers than just doing human feedback. (E.g., one of the papers I link in the post is a mainstream ML paper amplifying a weak training signal into a better one.)

Comment by Ajeya Cotra (ajeya-cotra) on How does bee learning compare with machine learning? · 2021-03-05T00:51:53.891Z · LW · GW

I mostly agree with your comment, but I'm actually very unsure about 2 here: I think I recall bees seeming surprisingly narrow and bad at abstract shapes. Guille would know more here.

Comment by Ajeya Cotra (ajeya-cotra) on How does bee learning compare with machine learning? · 2021-03-04T22:29:03.493Z · LW · GW

Aww thanks Ben, that was really nice of you!

Comment by Ajeya Cotra (ajeya-cotra) on Draft report on AI timelines · 2020-12-18T19:35:52.315Z · LW · GW

Hi John, I think I remember that presentation -- the reason the graph there was quite bimodal is that the Lifetime Anchor I was using at the time simply assumed ~1x human lifetime levels of computation. In the current model, I'm assuming ~1000x human lifetime levels of computation, because ~1x seemed like a much less likely version of that anchor. The code in the quantitative model will let you see the untruncated version of the distribution, and it looks a lot more smooth now, though still a modest bump.

Also, apologies for such a late reply, I don't get email notifications for comments and haven't been checking regularly!

Comment by Ajeya Cotra (ajeya-cotra) on Draft report on AI timelines · 2020-10-13T22:45:44.178Z · LW · GW

Thanks! No need to wait for a more official release (that could take a long time since I'm prioritizing other projects).

Comment by Ajeya Cotra (ajeya-cotra) on Draft report on AI timelines · 2020-09-26T21:50:22.707Z · LW · GW

Yeah, I agree there is room for spending to be "irrational", though I would guess this is more likely in the direction of spending less than the "rational" amount rather than more, because developing TAI could be unprecedentedly profitable and companies' spending may be limited by capital constraints.

Comment by Ajeya Cotra (ajeya-cotra) on Draft report on AI timelines · 2020-09-26T01:55:00.365Z · LW · GW

Thanks Ben, this is right!

Comment by Ajeya Cotra (ajeya-cotra) on Draft report on AI timelines · 2020-09-26T01:53:56.084Z · LW · GW

Yeah, I considered pegging spending to a fraction of GWP instead of a fraction of GDP, but when I did this I found I wanted to push the fraction down: even though companies are getting increasingly globalized, coordination at the world scale would probably still be thinner than coordination at the scale of something nation-sized (even if it's not literally a nation). Ultimately, I just went with GDP because there are more reference points for it.

I feel pretty uncertain about this though, and think there's a lot of room for a more detailed inside-view projection on willingness-to-spend by a firm. We could calculate this by making assumptions about the global surplus created by a transformative model (easily calculable from the definition), the amount of that profit that a firm would capture if it trained a transformative model, and the size of the frontier firm over time (which could be pegged to the global economy or potentially pegged to estimates of profits from training smaller models). We could then back out what a rational firm should be willing to invest.
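The back-of-envelope described above might look like the sketch below. Every input (surplus, capture fraction, the crude discounting stand-in) is an invented placeholder, not a number from the report.

```python
# Toy back-of-envelope for a firm's rational willingness to spend on a
# transformative training run. All inputs are invented placeholders.

global_surplus = 10e12    # $/yr surplus created by a transformative model (assumed)
capture_fraction = 0.01   # share of that surplus the training firm captures (assumed)
horizon_years = 10        # crude stand-in for discounting and competition (assumed)

expected_profit = global_surplus * capture_fraction * horizon_years
# A rational, unconstrained firm should be willing to spend up to this on training.
print(f"Willing to spend up to ${expected_profit:.2e}")
```

Even this crude version makes the capital-constraint point vivid: plausible placeholder inputs imply a "rational" spend far above what any current firm could actually raise.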

Comment by Ajeya Cotra (ajeya-cotra) on Draft report on AI timelines · 2020-09-26T01:48:34.741Z · LW · GW

Yes, it's assuming the scaling behavior follows the probability distributions laid out in Part 2, and then asking whether conditional on that the model size requirements could be off by a large amount. 

Comment by Ajeya Cotra (ajeya-cotra) on Draft report on AI timelines · 2020-09-26T01:47:07.953Z · LW · GW

Thanks! Agree that functional form uncertainty is a big deal here; I think that implicitly this uncertainty is causing me to up-weight Short Horizon Neural Network more than I otherwise would, and also up-weight "Larger than all hypotheses" more than I otherwise would.

With that said, I do predict that in clean artificial cases (which may or may not be relevant), we could demonstrate linear scaling. E.g., consider the case of inserting a frame of static or a blank screen in between every normal frame of an Atari game or StarCraft game -- I'd expect that modifying the games in this way would straightforwardly double training computation requirements.

Comment by Ajeya Cotra (ajeya-cotra) on Draft report on AI timelines · 2020-09-26T01:41:19.009Z · LW · GW

Thanks so much, glad you're finding it helpful! 

I haven't thought too much about short-term spending scaleup; thanks for the links. My current intuition is that our subjective distribution should not be highly bimodal the way you describe -- it seems like the industry could land somewhere along a broad spectrum from perfect competition to monopoly (with oligopoly seeming most plausible) and somewhere along a broad spectrum of possible profit margins.

Comment by Ajeya Cotra (ajeya-cotra) on Draft report on AI timelines · 2020-09-26T01:35:31.015Z · LW · GW


I agree that full distribution information is very valuable, although I consider medians to be important as well. The spreadsheet linked in the report provides the full distribution implied by my views for the probability that the amount of computation required to train a transformative model is affordable, although it requires some judgment to translate that into P(TAI), because there may be other bottlenecks besides computation and there may be other paths to TAI besides training a transformative model. I'd say it implies somewhere between 2031 and 2036 is the year by which there is a 10% chance of TAI.

As I said in a reply to Daniel above, the way to express the view that a brain-sized GPT model would constitute TAI is to assign a lot of weight to the Short Horizon Neural Network hypothesis, potentially along with shifting and/or narrowing the effective horizon length. I think this is plausible, but don't believe we should place a high probability on it, because I expect on priors that we would need longer effective horizon lengths than GPT-3's, and I don't think evidence from the GPT-3 paper or follow-on papers has provided clear evidence to the contrary.

In my best guess inputs, I assign a 25% probability collectively to the Short Horizon Neural Network and Lifetime Anchor hypotheses; in my aggressive inputs I assign 50% probability to these two hypotheses collectively. In both cases, probabilities are smoothed to a significant extent because of uncertainty in model size requirements and scaling, with substantial weight on smaller-than-brain-sized models and larger-than-brain-sized models.

Comment by Ajeya Cotra (ajeya-cotra) on Draft report on AI timelines · 2020-09-26T01:26:21.359Z · LW · GW

Thanks! I definitely agree that the proper modeling technique would involve introducing uncertainty on algorithmic progress, and that this uncertainty would be pretty wide; this is one of the most important few directions of future research (the others being better understanding effective horizon length and better pinning down model size).

In terms of uncertainty in model size, I personally find it somewhat easier to think about what the final spread should be in the training FLOP requirements distribution, since there's a fair amount of arbitrariness in how the uncertainty is apportioned between model size and scaling behavior. There's also semantic uncertainty about what it means to "condition on the hypothesis that X is the best anchor." If we're living in the world of "brain FLOP/s anchor + normal scaling behavior", then assigning a lot of weight to really small model sizes would wind up "in the territory" of the Lifetime Anchor hypothesis, and assigning a lot of weight to really large model sizes would wind up "in the territory" of the Evolution Anchor hypothesis, or go beyond the Evolution Anchor hypothesis. 

I was roughly aiming for +- 5 OOM uncertainty in training FLOP requirements on top of the anchor distribution, and then apportioned uncertainty between model size and scaling behavior based on which one seemed more uncertain.
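One simple way to formalize this kind of apportioning (a sketch with invented numbers, not the report's actual method): if the model-size and scaling-behavior uncertainties are treated as independent normals in log10(FLOP) space, their standard deviations combine in quadrature to give the spread of the final training-FLOP distribution.

```python
# If uncertainty over training FLOP is expressed in orders of magnitude
# (log10), independent contributions from model size and scaling behavior
# add in quadrature. The particular split below is an invented illustration.

sigma_model_size = 4.0  # OOMs of uncertainty from model size (assumed)
sigma_scaling = 3.0     # OOMs of uncertainty from scaling behavior (assumed)

sigma_total = (sigma_model_size**2 + sigma_scaling**2) ** 0.5
print(f"Total training-FLOP uncertainty: +/- {sigma_total:.1f} OOM")
```

Working backward from a target total spread (e.g. ~5 OOM) then constrains how much can be assigned to each component.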

Comment by Ajeya Cotra (ajeya-cotra) on Draft report on AI timelines · 2020-09-26T01:18:19.236Z · LW · GW

Thanks Daniel! Quick replies:

  • On down-weighting low-end vs high-end compute levels: The reason that the down-weighting for low-end compute levels was done in a separate and explicit way was just because I think there's a structural difference between the two updates. When updating against low-end compute levels, I think it makes more sense to do that update within each hypothesis, because only some orders of magnitude are affected. To implement an "update against high-end compute levels", we can simply lower the probability we assign to high-compute hypotheses, since there is no specific reason to shave off just a few OOMs at the far right. My probability on the Evolution Anchor hypothesis is 10%, and my probability on the Long Horizon Neural Network hypothesis is 15%; this is lower than my probability on the Short Horizon Neural Network hypothesis (20%) and Medium Horizon Neural Network hypothesis (30%) because I feel that the higher-end hypotheses are less consistent with the holistic balance of evidence.
  • On the GPT scaling trend: I think that the way to express the view that GPT++ would constitute TAI is to heavily weight the Short Horizon Neural Network hypothesis, potentially along with shifting and/or narrowing the range of effective horizon lengths in that bucket to be more concentrated on the low end (e.g. 0.1 to 30 subjective seconds rather than 1 to 1000 subjective seconds).
  • On getting transformative abilities with 1e15 parameter models trained for 30 subjective years: I think this is pretty unlikely, but not crazy like you said; I think the way to express this view would be to up-weight the Lifetime Anchor hypothesis. My weight on it is currently 5%. Additionally, all the Neural Network hypotheses bake in substantial probability to relatively small models (e.g. 1e12 FLOP/subj sec) and scaling more shallow than we've seen demonstrated so far (e.g. an exponent of 0.25). 
Comment by Ajeya Cotra (ajeya-cotra) on Draft report on AI timelines · 2020-09-19T17:14:30.812Z · LW · GW

Thanks, I just cut the link!