[AN #117]: How neural nets would fare under the TEVV framework 2020-09-16T17:20:14.062Z · score: 26 (5 votes)
[AN #116]: How to make explanations of neurons compositional 2020-09-09T17:20:04.668Z · score: 21 (8 votes)
[AN #115]: AI safety research problems in the AI-GA framework 2020-09-02T17:10:04.434Z · score: 19 (6 votes)
[AN #114]: Theory-inspired safety solutions for powerful Bayesian RL agents 2020-08-26T17:20:04.960Z · score: 21 (7 votes)
[AN #113]: Checking the ethical intuitions of large language models 2020-08-19T17:10:03.773Z · score: 23 (6 votes)
[AN #112]: Engineering a Safer World 2020-08-13T17:20:04.013Z · score: 22 (10 votes)
[AN #111]: The Circuits hypotheses for deep learning 2020-08-05T17:40:22.576Z · score: 23 (9 votes)
[AN #110]: Learning features from human feedback to enable reward learning 2020-07-29T17:20:04.369Z · score: 13 (4 votes)
[AN #109]: Teaching neural nets to generalize the way humans would 2020-07-22T17:10:04.508Z · score: 17 (4 votes)
[AN #107]: The convergent instrumental subgoals of goal-directed agents 2020-07-16T06:47:55.532Z · score: 13 (4 votes)
[AN #108]: Why we should scrutinize arguments for AI risk 2020-07-16T06:47:38.322Z · score: 19 (7 votes)
[AN #106]: Evaluating generalization ability of learned reward models 2020-07-01T17:20:02.883Z · score: 14 (4 votes)
[AN #105]: The economic trajectory of humanity, and what we might mean by optimization 2020-06-24T17:30:02.977Z · score: 24 (7 votes)
[AN #104]: The perils of inaccessible information, and what we can learn about AI alignment from COVID 2020-06-18T17:10:02.641Z · score: 19 (7 votes)
[AN #103]: ARCHES: an agenda for existential safety, and combining natural language with deep RL 2020-06-10T17:20:02.171Z · score: 26 (9 votes)
[AN #102]: Meta learning by GPT-3, and a list of full proposals for AI alignment 2020-06-03T17:20:02.221Z · score: 38 (11 votes)
[AN #101]: Why we should rigorously measure and forecast AI progress 2020-05-27T17:20:02.460Z · score: 15 (6 votes)
[AN #100]: What might go wrong if you learn a reward function while acting 2020-05-20T17:30:02.608Z · score: 33 (8 votes)
[AN #99]: Doubling times for the efficiency of AI algorithms 2020-05-13T17:20:02.637Z · score: 30 (10 votes)
[AN #98]: Understanding neural net training by seeing which gradients were helpful 2020-05-06T17:10:02.563Z · score: 20 (5 votes)
[AN #97]: Are there historical examples of large, robust discontinuities? 2020-04-29T17:30:02.043Z · score: 15 (5 votes)
[AN #96]: Buck and I discuss/argue about AI Alignment 2020-04-22T17:20:02.821Z · score: 17 (7 votes)
[AN #95]: A framework for thinking about how to make AI go well 2020-04-15T17:10:03.312Z · score: 20 (6 votes)
[AN #94]: AI alignment as translation between humans and machines 2020-04-08T17:10:02.654Z · score: 11 (3 votes)
[AN #93]: The Precipice we’re standing at, and how we can back away from it 2020-04-01T17:10:01.987Z · score: 25 (6 votes)
[AN #92]: Learning good representations with contrastive predictive coding 2020-03-25T17:20:02.043Z · score: 19 (7 votes)
[AN #91]: Concepts, implementations, problems, and a benchmark for impact measurement 2020-03-18T17:10:02.205Z · score: 16 (5 votes)
[AN #90]: How search landscapes can contain self-reinforcing feedback loops 2020-03-11T17:30:01.919Z · score: 12 (4 votes)
[AN #89]: A unifying formalism for preference learning algorithms 2020-03-04T18:20:01.393Z · score: 17 (5 votes)
[AN #88]: How the principal-agent literature relates to AI risk 2020-02-27T09:10:02.018Z · score: 20 (6 votes)
[AN #87]: What might happen as deep learning scales even further? 2020-02-19T18:20:01.664Z · score: 30 (11 votes)
[AN #86]: Improving debate and factored cognition through human experiments 2020-02-12T18:10:02.213Z · score: 16 (6 votes)
[AN #85]: The normative questions we should be asking for AI alignment, and a surprisingly good chatbot 2020-02-05T18:20:02.138Z · score: 16 (6 votes)
[AN #84] Reviewing AI alignment work in 2018-19 2020-01-29T18:30:01.738Z · score: 24 (10 votes)
AI Alignment 2018-19 Review 2020-01-28T02:19:52.782Z · score: 143 (40 votes)
[AN #83]: Sample-efficient deep learning with ReMixMatch 2020-01-22T18:10:01.483Z · score: 16 (7 votes)
rohinmshah's Shortform 2020-01-18T23:21:02.302Z · score: 14 (3 votes)
[AN #82]: How OpenAI Five distributed their training computation 2020-01-15T18:20:01.270Z · score: 20 (6 votes)
[AN #81]: Universality as a potential solution to conceptual difficulties in intent alignment 2020-01-08T18:00:01.566Z · score: 22 (8 votes)
[AN #80]: Why AI risk might be solved without additional intervention from longtermists 2020-01-02T18:20:01.686Z · score: 36 (17 votes)
[AN #79]: Recursive reward modeling as an alignment technique integrated with deep RL 2020-01-01T18:00:01.839Z · score: 12 (5 votes)
[AN #78] Formalizing power and instrumental convergence, and the end-of-year AI safety charity comparison 2019-12-26T01:10:01.626Z · score: 26 (7 votes)
[AN #77]: Double descent: a unification of statistical theory and modern ML practice 2019-12-18T18:30:01.862Z · score: 21 (8 votes)
[AN #76]: How dataset size affects robustness, and benchmarking safe exploration by measuring constraint violations 2019-12-04T18:10:01.739Z · score: 14 (6 votes)
[AN #75]: Solving Atari and Go with learned game models, and thoughts from a MIRI employee 2019-11-27T18:10:01.332Z · score: 39 (11 votes)
[AN #74]: Separating beneficial AI into competence, alignment, and coping with impacts 2019-11-20T18:20:01.647Z · score: 19 (7 votes)
[AN #73]: Detecting catastrophic failures by learning how agents tend to break 2019-11-13T18:10:01.544Z · score: 11 (4 votes)
[AN #72]: Alignment, robustness, methodology, and system building as research priorities for AI safety 2019-11-06T18:10:01.604Z · score: 28 (7 votes)
[AN #71]: Avoiding reward tampering through current-RF optimization 2019-10-30T17:10:02.211Z · score: 13 (5 votes)
[AN #70]: Agents that help humans who are still learning about their own preferences 2019-10-23T17:10:02.102Z · score: 18 (6 votes)


Comment by rohinmshah on Draft report on AI timelines · 2020-09-19T02:15:44.205Z · score: 59 (20 votes) · LW · GW

Planned summary for the Alignment Newsletter (which won't go out until it's a Full Open Phil Report):

Once again, we have a piece of work so large and detailed that I need a whole newsletter to summarize it! This time, it is a quantitative model for forecasting when transformative AI will happen.

The overall framework

The key assumption behind this model is that if we train a neural net or other ML model that uses about as much computation as a human brain, that will likely result in transformative AI (TAI) (defined as AI that has an impact comparable to that of the industrial revolution). In other words, we _anchor_ our estimate of the ML model’s inference computation to that of the human brain. This assumption allows us to estimate how much compute will be required to train such a model _using 2020 algorithms_. By incorporating a trend extrapolation of how algorithmic progress will reduce the required amount of compute, we can get a prediction of how much compute would be required for the final training run of a transformative model in any given year.

We can also get a prediction of how much compute will be _available_ by predicting the cost of compute in a given year (which we have a decent amount of past evidence about), and predicting the maximum amount of money an actor would be willing to spend on a single training run. The probability that we can train a transformative model in year Y is then just the probability that the compute _requirement_ for year Y is less than the compute _available_ in year Y.

The vast majority of the report is focused on estimating the amount of compute required to train a transformative model using 2020 algorithms (where most of our uncertainty would come from); the remaining factors are estimated relatively quickly without too much detail. I’ll start with those so that you can have them as background knowledge before we delve into the real meat of the report. These are usually modeled as logistic curves in log space: that is, they are modeled as improving at some constant rate, but will level off and saturate at some maximum value after which they won’t improve.
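As a rough illustration, each such factor can be sketched as a capped exponential. This is a minimal sketch, not the report's model: a hard `min()` stands in for the smooth logistic, and the default parameter values (a 2.5-year doubling time and a cap of 2 million, loosely mirroring the compute-cost trend below) are placeholders.

```python
def improvement_factor(year, start=2020, doubling_time=2.5, cap=2e6):
    """Sketch of a 'logistic curve in log space': a factor that improves at a
    constant exponential rate but saturates at `cap`. Parameter values are
    illustrative placeholders; a hard min() stands in for the smooth logistic."""
    unbounded = 2 ** ((year - start) / doubling_time)
    return min(unbounded, cap)
```

The same shape, with different rates and caps, serves for algorithmic progress, compute costs, and willingness to spend.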

Algorithmic progress

First off, we have the impact of _algorithmic progress_. <@AI and Efficiency@> estimates that algorithmic improvements alone cut the compute required in half every 16 months. However, this was measured on ImageNet, where researchers directly optimize for reduced computation costs. It seems less likely that researchers are doing as good a job of reducing computation costs for “training a transformative model”, so the author increases the **halving time to 2-3 years**, with a maximum total improvement of **somewhere between 1 and 5 orders of magnitude** (on the assumption that the higher the “technical difficulty” of the problem, the more algorithmic progress is possible).

Cost of compute

Second, we need to estimate a trend for compute costs. There has been some prior work on this (summarized in [AN #97]). The report has some similar analyses, and ends up estimating **a doubling time of 2.5 years** for FLOP per dollar, and a (very unstable) maximum improvement of **a factor of 2 million by 2100**.

Willingness to spend

Third, we would like to know the maximum amount (in 2020 dollars) any actor might spend on a single training run. Note that we are estimating the money spent on a _final training run_, which doesn’t include the cost of initial experiments or the cost of researcher time. Currently, the author estimates that all-in project costs are 10-100x larger than the final training run cost, but this will likely go down to something like 2-10x, as the incentive for reducing this ratio becomes much larger.

The author estimates that the most expensive run _in a published paper_ was the final <@AlphaStar@>(@AlphaStar: Mastering the Real-Time Strategy Game StarCraft II@) training run, at ~1e23 FLOP and $1M cost. However, there have probably been unpublished results that are slightly more expensive, maybe $2-8M. In line with <@AI and Compute@>, this will probably increase dramatically to about **$1B in 2025**.

Given that AI companies each have around $100B cash on hand, and could potentially borrow additional several hundreds of billions of dollars (given their current market caps and likely growth in the worlds where AI still looks promising), it seems likely that low hundreds of billions of dollars could be spent on a single run by 2040, corresponding to a doubling time (from $1B in 2025) of about 2 years.

To estimate the maximum here, we can compare to megaprojects like the Manhattan Project or the Apollo program, which suggests that a government could spend around 0.75% of GDP for ~4 years. Since transformative AI will likely be more valuable economically and strategically than these previous programs, we can shade that upwards to 1% of GDP for 5 years. Assuming all-in costs are 5x that of the final training run, this suggests the maximum willingness to spend should be 1% of GDP of the largest country, which we assume grows at ~3% every year.
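Putting the spending estimates above together as a single trajectory (a sketch using illustrative medians, not the report's distribution; the ~$20T GDP figure for the largest country in 2020 is an assumption of mine):

```python
def max_spend(year):
    """Sketch of the willingness-to-spend trajectory described above.
    Anchors: ~$1B on a single training run in 2025, doubling every 2 years,
    capped at 1% of the largest country's GDP (assumed ~$20T in 2020,
    growing ~3%/year). All values are illustrative point estimates."""
    gdp = 20e12 * 1.03 ** (year - 2020)   # assumed GDP of largest country
    cap = 0.01 * gdp                      # 1% of GDP ceiling
    trajectory = 1e9 * 2 ** ((year - 2025) / 2)
    return min(trajectory, cap)
```

At these point values, `max_spend(2040)` lands in the low hundreds of billions of dollars, matching the estimate above.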

Strategy for estimating training compute for a transformative model

In addition to the three factors of algorithmic progress, cost of compute, and willingness to spend, we need an estimate of how much computation would be needed to train a transformative model using 2020 algorithms (which I’ll discuss next). Then, at year Y, the compute required is given by computation needed with 2020 algorithms * improvement factor from algorithmic progress, which (in this report) is a probability distribution. At year Y, the compute available is given by FLOP per dollar (aka compute cost) * money that can be spent, which (in this report) is a point estimate. We can then simply read off the probability that the compute required is greater than the compute available.
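The comparison described above can be sketched as a small Monte Carlo calculation. Everything numeric below (the 2020 FLOP-per-dollar level, the lognormal over 2020 training compute, the growth rates) is a hypothetical placeholder I've chosen for illustration, not a fitted value from the report.

```python
import random

def p_affordable(year, n=10_000):
    """Monte Carlo sketch of the core comparison: the probability that the
    compute *required* in `year` is at most the compute *available*.
    All parameter values are illustrative placeholders."""
    # Compute available (a point estimate): money * FLOP per dollar.
    money = 1e9 * 2 ** ((year - 2025) / 2)               # willingness to spend
    flop_per_dollar = 1e17 * 2 ** ((year - 2020) / 2.5)  # compute cost trend
    available = money * flop_per_dollar
    # Compute required (a distribution): 2020 requirement / algorithmic progress.
    algo_progress = 2 ** ((year - 2020) / 2.5)           # ~2.5-year halving time
    hits = 0
    for _ in range(n):
        log10_flop_2020 = random.gauss(30, 3)  # hypothetical: median 1e30 FLOP
        required = 10 ** log10_flop_2020 / algo_progress
        if required <= available:
            hits += 1
    return hits / n
```

The probability is read off directly by counting how often the sampled requirement falls below the point estimate of available compute; by construction it rises over time.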

Okay, so the last thing we need is a distribution over the amount of computation that would be needed to train a transformative model using 2020 algorithms, which is the main focus of this report. There is a lot of detail here that I’m going to elide over, especially in talking about the _distribution_ as a whole (whereas I will focus primarily on the median case for simplicity). As I mentioned early on, the key hypothesis is that we will need to train a neural net or other ML model that uses about as much compute as a human brain. So the strategy will be to first translate from “compute of human brain” to “inference compute of neural net”, and then to translate from “inference compute of neural net” to “training compute of neural net”.

How much inference compute would a transformative model use?

We can talk about the rate at which synapses fire in the human brain. How can we convert this to FLOP? The author proposes the following hypothetical: suppose we redo evolutionary history, but in every animal we replace each neuron with N [floating-point units] that each perform 1 FLOP per second. For what value of N do we still get roughly human-level intelligence over a similar evolutionary timescale? The author then does some calculations about simulating synapses with FLOPs, drawing heavily on the <@recent report on brain computation@>(@How Much Computational Power It Takes to Match the Human Brain@), to estimate that N would be around 1-10,000, which after some more calculations suggests that the human brain is doing the equivalent of 1e13 - 1e16 FLOP per second, with **a median of 1e15 FLOP per second**, and a long tail to the right.

Does this mean we can say that a transformative model will use 1e15 FLOP per second during inference? Such a model would have a clear flaw: even though we are assuming that algorithmic progress reduces compute costs over time, if we did the same analysis in e.g. 1980, we’d get the _same_ estimate for the compute cost of a transformative model, which would imply that there was no algorithmic progress between 1980 and 2020! The problem is that we’d always estimate the brain as using 1e15 FLOP per second (or around there), but for our ML models there is a difference between FLOP per second _using 2020 algorithms_ and FLOP per second _using 1980 algorithms_. So how do we convert from “brain FLOP per second” to “inference FLOP per second for 2020 ML algorithms”?

One approach is to look at how other machines we have designed compare to the corresponding machines that evolution has designed. An [analysis] by Paul Christiano concluded that human-designed artifacts tend to be 2-3 orders of magnitude worse than those designed by evolution, when considering energy usage. Presumably a similar analysis done in the past would have resulted in higher numbers and thus wouldn’t fall prey to the problem above. Another approach is to compare existing ML models to animals with a similar amount of computation, and see which one is subjectively “more impressive”. For example, the AlphaStar model uses about as much computation as a bee brain, and large language models use somewhat more; the author finds it reasonable to say that AlphaStar is “about as sophisticated” as a bee, or that <@GPT-3@>(@Language Models are Few-Shot Learners@) is “more sophisticated” than a bee.

We can also look at some abstract considerations. Natural selection had _a lot_ of time to optimize brains, and natural artifacts are usually quite impressive. On the other hand, human designers have the benefit of intelligent design and can copy the patterns that natural selection has come up with. Overall, these considerations roughly balance each other out. Another important consideration is that we’re only predicting what would be needed for a model that was good at most tasks that a human would currently be good at (think a virtual personal assistant), whereas evolution optimized for a whole bunch of other skills that were needed in the ancestral environment. The author subjectively guesses that this should reduce our estimate of compute costs by about an order of magnitude.

Overall, putting all these considerations together, the author intuitively guesses that to convert from “brain FLOP per second” to “inference FLOP per second for 2020 ML algorithms”, we should add an order of magnitude to the median, and add another two orders of magnitude to the standard deviation to account for our large uncertainty. This results in a median of **1e16 FLOP per second** for the inference-time compute of a transformative model.

Training compute for a transformative model

We might expect a transformative model to run a forward pass **0.1 - 10 times per second** (which on the high end would match human reaction time of 100ms), and for each parameter of the neural net to contribute **1-100 FLOP per forward pass**, which implies that if the inference-time compute is 1e16 FLOP per second then the model should have **1e13 - 1e17 parameters**, with a median of **3e14 parameters**.
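The arithmetic in this step can be checked with midpoint values (a rough sketch; the report combines the full distributions rather than midpoints, which is why its median of 3e14 differs from the point-estimate result below):

```python
# Back-of-envelope check of the parameter-count arithmetic above (midpoints):
inference_flop_per_sec = 1e16
forward_passes_per_sec = 1      # geometric middle of the 0.1 - 10 range
flop_per_param_per_pass = 10    # geometric middle of the 1 - 100 range
params = inference_flop_per_sec / (forward_passes_per_sec * flop_per_param_per_pass)
# params == 1e15 at these point values, within the stated 1e13 - 1e17 range.
```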

We now need to estimate how much compute it takes to train a transformative model with 3e14 parameters. We assume this is dominated by the number of times you have to run the model during training, or equivalently, the number of data points you train on times the number of times you train on each data point. (In particular, this assumes that the cost of acquiring data is negligible in comparison. The report argues for this assumption; for the sake of brevity I won’t summarize it here.)

For this, we need a relationship between parameters and data points, which we’ll assume will follow a power law KP^α, where P is the number of parameters and K and α are constants. A large number of ML theory results imply that the number of data points needed to reach a specified level of accuracy grows linearly with the number of parameters (i.e. α=1), which we can take as a weak prior. We can then update this with empirical evidence from papers. <@Scaling Laws for Neural Language Models@> suggests that for language models, data requirements scale as α=0.37 or as α=0.74, depending on what measure you look at. Meanwhile, [Deep Learning Scaling is Predictable, Empirically] suggests that α=1.39 for a wide variety of supervised learning problems (including language modeling). However, the former paper studies a more relevant setting: it includes regularization, and asks about the number of data points needed to reach a target accuracy, whereas the latter paper ignores regularization and asks about the minimum number of data points that the model _cannot_ overfit to. So overall the author puts more weight on the former paper and estimates a median of α=0.8, though with substantial uncertainty.
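In code, the power law is a one-liner. The constant `k` below is not from the report: I've back-solved it (hypothetically) so that a 3e14 parameter model requires ~1e13 data points, matching the medians quoted in this summary.

```python
def data_points_needed(params, alpha=0.8, k=26.0):
    """Sketch of the D = K * P^alpha scaling law described above.
    alpha=0.8 is the report's median estimate; k=26.0 is a hypothetical
    constant back-solved to make 3e14 parameters need ~1e13 data points."""
    return k * params ** alpha
```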

We also need to estimate how many epochs will be needed, i.e. how many times we train on any given data point. The author decides not to explicitly model this factor since it will likely be close to 1, and instead lumps in the uncertainty over the number of epochs with the uncertainty over the constant factor in the scaling law above. We can then look at language model runs to estimate a scaling law for them, for which the median scaling law predicts that we would need 1e13 data points for our 3e14 parameter model.

However, this has all been for supervised learning. It seems plausible that a transformative task would have to be trained using RL, where the model acts over a sequence of timesteps, and then receives (non-differentiable) feedback at the end of those timesteps. How would scaling laws apply in this setting? One simple assumption is to say that each rollout over the _effective horizon_ counts as one piece of “meaningful feedback” and so should count as a single data point. Here, the effective horizon is the minimum of the actual horizon and 1/(1-γ), where γ is the discount factor. We assume that the scaling law stays the same; if we instead try to estimate it from recent RL runs, it can change the results by about one order of magnitude.

So we now know we need to train a 3e14 parameter model with 1e13 data points for a transformative task. This gets us nearly all the way to the compute required with 2020 algorithms: we have a ~3e14 parameter model that takes ~1e16 FLOP per forward pass, that is trained on ~1e13 data points with each data point taking H timesteps, for a total of H * 1e29 FLOP. The author’s distributions are instead centered at H * 1e30 FLOP; I suspect this is simply because the author was computing with distributions whereas I’ve been directly manipulating medians in this summary.
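The H * 1e29 figure follows directly from the medians (a quick sketch; constant factors like the backward pass are absorbed into the overall uncertainty):

```python
# Rough check of the total training compute using the summary's medians:
flop_per_forward_pass = 1e16   # for the ~3e14 parameter model
data_points = 1e13
# Each data point spans H timesteps, each costing about one forward pass, so
# total training compute is about H * data_points * flop_per_forward_pass:
total_per_horizon_unit = data_points * flop_per_forward_pass  # = 1e29 FLOP per unit of H
```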

The last and most uncertain piece of information is the effective horizon of a transformative task. We could imagine something as low as 1 subjective second (for something like language modeling), or something as high as 1e9 subjective seconds (i.e. 32 subjective years), if we were to redo evolution, or train on a task like “do effective scientific R&D”. The author splits this up into short, medium and long horizon neural net paths (corresponding to horizons of 1e0-1e3, 1e3-1e6, and 1e6-1e9 respectively), and invites readers to place their own weights on each of the possible paths.

There are many important considerations here: for example, if you think that the dominating cost will be generative modeling (GPT-3 style, but maybe also for images, video etc), then you would place more weight on short horizons. Conversely, if you think the hard challenge is to gain meta learning abilities, and that we probably need “data points” comparable to the time between generations in human evolution, then you would place more weight on longer horizons.

Adding three more potential anchors

We can now combine all these ingredients to get a forecast for when compute will be available to develop a transformative model! But not yet: we’ll first add a few more possible “anchors” for the amount of computation needed for a transformative model. (All of the modeling so far has “anchored” the _inference time computation of a transformative model_ to the _inference time computation of the human brain_.)

First, we can anchor _parameter count of a transformative model_ to the _parameter count of the human genome_, which has far fewer “parameters” than the human brain. Specifically, we assume that all the scaling laws remain the same, but that a transformative model will only require 7.5e8 parameters (the amount of information in the human genome) rather than our previous estimate of ~3e14 parameters. This drastically reduces the amount of computation required, though it is still slightly above that of the short-horizon neural net, because the author assumed that the horizon for this path was somewhere between 1 and 32 years.

Second, we can anchor _training compute for a transformative model_ to the _compute used by the human brain over a lifetime_. As you might imagine, this leads to a much smaller estimate: the brain uses ~1e24 FLOP over 32 years of life, which is only 10x the amount used for AlphaStar, and even after adjusting upwards to account for man-made artifacts being worse than those made by evolution, the resulting model predicts a significant probability that we would already have been able to build a transformative model.

Finally, we can anchor _training compute for a transformative model_ to the _compute used by all animal brains over the course of evolution_. The basic assumption here is that our optimization algorithms and architectures are not much better than simply “redoing” natural selection from a very primitive starting point. This leads to an estimate of ~1e41 FLOP to train a transformative model, which is more than the long horizon neural net path (though not hugely more).

Putting it all together

So we now have six different paths: the three neural net anchors (short, medium and long horizon), the genome anchor, the lifetime anchor, and the evolution anchor. We can now assign weights to each of these paths, where each weight can be interpreted as the probability that that path is the _cheapest_ way to get a transformative model, as well as a final weight that describes the chance that none of the paths work out.

The long horizon neural net path can be thought of as a conservative “default” view: it could work out simply by training directly on examples of a long horizon task where each data point takes around a subjective year to generate. However, there are several reasons to think that researchers will be able to do better than this. As a result, the author assigns 20% to the short horizon neural net, 30% to the medium horizon neural net, and 15% to the long horizon neural net.

The lifetime anchor would suggest that we either already could get TAI, or are very close, which seems very unlikely given the lack of major economic applications of neural nets so far, and so gets assigned only 5%. The genome path gets 10%, the evolution anchor gets 10%, and the remaining 10% is assigned to none of the paths working out.
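The aggregation over paths can be sketched as sampling from a weighted mixture. The weights are the ones given above; the per-anchor medians and spreads (in log10 FLOP) are hypothetical placeholders of mine, not the report's fitted distributions.

```python
import random

# Weights from the summary above; (median, sd) in log10 FLOP are illustrative.
anchors = {
    "short horizon":  (0.20, 32, 2),
    "medium horizon": (0.30, 34, 2),
    "long horizon":   (0.15, 36, 2),
    "lifetime":       (0.05, 27, 2),
    "genome":         (0.10, 33, 2),
    "evolution":      (0.10, 41, 2),
}
P_NO_PATH = 0.10  # remaining probability: none of the paths work out

def sample_log10_training_flop():
    """Draw one sample of log10(training FLOP with 2020 algorithms) from the
    mixture, or None for the 'no path works out' case."""
    r = random.random()
    for weight, median, sd in anchors.values():
        if r < weight:
            return random.gauss(median, sd)
        r -= weight
    return None
```

Feeding such samples into the compute-availability comparison is what produces the final year-by-year forecast.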

This predicts a **median of 2052** for the year in which some actor would be willing and able to train a single transformative model, with the full graphs shown below:

<Graphs removed since they are in flux and not easy to share in a low-bandwidth way> 

How does this relate to TAI?

Note that what we’ve modeled so far is the probability that by year Y we will have enough compute for the final training run of a transformative model. This is not the same thing as the probability of developing TAI. There are several reasons that TAI could be developed _later_ than the given prediction:

1. Compute isn’t the only input required: we also need data, environments, human feedback, etc. While the author expects that these will not be the bottleneck, this is far from a certainty.

2. When thinking about any particular path and making it more concrete, a host of problems tend to show up that will need to be solved and may add extra time. Some examples include robustness, reliability, possible breakdown of the scaling laws, the need to generate lots of different kinds of data, etc.

3. AI research could stall, whether because of regulation, a global catastrophe, an AI winter, or something else.

However, there are also compelling reasons to expect TAI to arrive _earlier_:

1. We may develop TAI through some other cheaper route, such as a <@services model@>(@Reframing Superintelligence: Comprehensive AI Services as General Intelligence@).

2. Our forecasts apply to a “balanced” model that has a similar profile of abilities as a human. In practice, it will likely be easier and cheaper to build an “unbalanced” model that is superhuman in some domains and subhuman in others, that is nonetheless transformative.

3. The curves for several factors assume some maximum after which progress is not possible; in reality it is more likely that progress slows to some lower but non-zero growth rate.

In the near future, it seems likely that cheaper routes would be harder to find (since there is less time to do the research), so the near-term probabilities should probably be treated as overestimates; by similar reasoning, the probabilities for later years should be treated as underestimates.

For the median of 2052, the author guesses that these considerations roughly cancel out, and so rounds the median for development of TAI to **2050**. A sensitivity analysis concludes that 2040 is the “most aggressive plausible median”, while the “most conservative plausible median” is 2080.

Planned opinion:

I really liked this report: it’s extremely thorough and anticipates and responds to a large number of potential reactions. I’ve made my own timelines estimate using the provided spreadsheet, and have adopted the resulting graph (with a few modifications) as my TAI timeline (which ends up with a median of ~2055). This is saying quite a lot: it’s pretty rare that a quantitative model is compelling enough that I’m inclined to only slightly edit its output, as opposed to simply using the quantitative model to inform my intuitions.

Here are the main ways in which my model is different from the one in the report:

1. Ignoring the genome anchor

I ignore the genome anchor because I don’t buy the model: even if researchers did create a very parameter-efficient model class (which seems unlikely), I would not expect the same scaling laws to apply to that model class. The report mentions that you could also interpret the genome anchor as simply providing a constraint on how many data points are needed to train long-horizon behaviors (since that’s what evolution was optimizing), but I prefer to take this as (fairly weak) evidence that informs what weights to place on short vs. medium vs. long horizons for neural nets.

2. Placing more weight on short and medium horizons relative to long horizons

I place 30% on short horizons, 40% on medium horizons, and 10% on long horizons. The report already names several reasons why we might expect the long horizon assumption to be too conservative. I agree with all of those, and have one more of my own:

If meta-learning turns out to require a huge amount of compute, we can instead directly train on some transformative task with a lower horizon. Even some of the hardest tasks like scientific R&D shouldn’t have a huge horizon: even if we assume that it takes human scientists a year to produce the equivalent of a single data point, at 40 hours a week that comes out to a horizon of 2000 subjective hours, or 7e6 seconds. This is near the beginning of the long horizon realm of 1e6-1e9 seconds and seems like a very conservative overestimate to me.
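The arithmetic checks out (a quick sanity check, assuming 50 working weeks per year):

```python
# One subjective year of a scientist's work, at 40 hours/week for 50 weeks:
hours = 40 * 50              # = 2000 subjective hours
seconds = hours * 3600       # = 7,200,000 ~ 7e6 subjective seconds
# This sits near the bottom of the long-horizon range of 1e6 - 1e9 seconds.
```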

(Note that in practice I’d guess we will train something like a meta-learner, because I suspect the skill of meta-learning will not require such large average effective horizons.)

3. Reduced willingness to spend

My willingness to spend forecasts are somewhat lower: the predictions and reasoning in this report feel closer to upper bounds on how much people might spend rather than predictions of how much they will spend. Assuming we reduce the ratio of all-in project costs to final training run costs to 10x, spending $1B on a training run by 2025 would imply all-in project costs of $10B, which is ~40% of Google’s yearly R&D budget of $26B, or 10% of the budget for a 4-year project. Possibly this wouldn’t be classified as R&D, but it would also be _2% of all expenditures over 4 years_. This feels remarkably high to me for something that’s supposed to happen within 5 years; while I wouldn’t rule it out, it wouldn’t be my median prediction.

4. Accounting for challenges

While the report does talk about challenges in e.g. getting the right data and environments by the right time, I think there are a bunch of other challenges as well: for example, you need to ensure that your model is aligned, robust, and reliable (at least if you want to deploy it and get economic value from it). I do expect that these challenges will be easier than they are today, partly because more research will have been done and partly because the models themselves will be more capable.

Another example of a challenge would be PR concerns: it seems very plausible to me that there will be a backlash against transformative AI systems, that results in those systems being deployed later than we’d expect them to be according to this model.

To be more concrete, if we ignore points 1-3 and assume this is my only disagreement, then for the median of 2052, rather than assuming that reasons for optimism and pessimism approximately cancel out to yield 2050 as the median for TAI, I’d be inclined to shade upwards to 2055 or 2060 as my median for TAI.

Comment by rohinmshah on Comparing Utilities · 2020-09-18T19:08:37.249Z · score: 2 (1 votes) · LW · GW

Planned summary for the Alignment Newsletter:

This is a reference post about preference aggregation across multiple individually rational agents (in the sense that they have VNM-style utility functions), that explains the following points (among others):

1. The concept of “utility” in ethics is somewhat overloaded. The “utility” in hedonic utilitarianism is very different from the VNM concept of utility. The concept of “utility” in preference utilitarianism is pretty similar to the VNM concept of utility.

2. Utilities are not directly comparable, because affine transformations of utility functions represent exactly the same set of preferences. Without any additional information, concepts like “utility monster” are type errors.

3. However, our goal is not to compare utilities, it is to aggregate people’s preferences. We can instead impose constraints on the aggregation procedure.

4. If we require that the aggregation procedure produces a Pareto-optimal outcome, then Harsanyi’s utilitarianism theorem says that our aggregation procedure can be viewed as maximizing some linear combination of the utility functions.

5. We usually want to incorporate some notion of fairness. Different specific assumptions lead to different results, including variance normalization, Nash bargaining, and Kalai-Smorodinsky.
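To make points 4 and 5 concrete, here is a minimal sketch (the outcomes, utility numbers, and equal weights are all invented for illustration) of Harsanyi-style linear aggregation combined with variance normalization:

```python
import numpy as np

def variance_normalize(utilities):
    """Rescale each agent's utilities to zero mean and unit variance across
    outcomes, so the aggregation is invariant to each agent's (arbitrary)
    choice of affine scale -- one of the fairness notions mentioned above."""
    u = np.asarray(utilities, dtype=float)
    return (u - u.mean(axis=1, keepdims=True)) / u.std(axis=1, keepdims=True)

def aggregate(utilities):
    """Harsanyi-style aggregation: pick the outcome maximizing an
    (equal-weighted) linear combination of the normalized utilities."""
    return int(np.argmax(variance_normalize(utilities).sum(axis=0)))

# Two agents, three outcomes; agent 2 happens to express utilities on a 100x scale.
raw = [[1.0, 0.0, 0.5],
       [0.0, 100.0, 80.0]]
print(aggregate(raw))                       # 2: the compromise outcome
print(int(np.argmax(np.sum(raw, axis=0))))  # 1: naive summing lets agent 2's scale dominate
```

Dropping the normalization step recovers plain linear aggregation, where the result depends on each agent's arbitrary utility scale (the "utility monster" type error from point 2).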

Comment by rohinmshah on The "Backchaining to Local Search" Technique in AI Alignment · 2020-09-18T18:55:51.860Z · score: 9 (2 votes) · LW · GW

Planned summary for the Alignment Newsletter:

This post explains a technique for AI alignment research that the author dubs “backchaining to local search” (where local search refers to techniques like gradient descent and evolutionary algorithms). The key idea is to take some proposed problem with AI systems and figure out mechanistically how that problem could arise when running a local search algorithm. This can help provide information about whether we should expect the problem to arise in practice.

Planned opinion:

I’m a big fan of this technique: it has helped me expose many initially confused concepts, and notice that they were confused, particularly wireheading and inner alignment. It’s an instance of the more general technique (that I also like) of taking an abstract argument and making it more concrete and realistic, which often reveals aspects of the argument that you wouldn’t have previously noticed.

Comment by rohinmshah on rohinmshah's Shortform · 2020-09-17T21:35:37.562Z · score: 2 (1 votes) · LW · GW

Let's say we're talking about something complicated. Assume that any proposition about the complicated thing can be reformulated as a series of conjunctions.

Suppose Alice thinks P with 90% confidence (and therefore not-P with 10% confidence). Here's a fully general counterargument that Alice is wrong:

1. Decompose P into a series of conjunctions Q1, Q2, ... Qn, with n > 10. (You can first decompose P into R1 and R2, then decompose R1 further, and decompose R2 further, etc.)

2. Ask Alice to estimate P(Qk | Q1, Q2, ... Q{k-1}) for all k.

3. Note that at least one of these must be over 99% (if we have n = 11 and they were all 99%, then the probability of P would be 0.99 ^ 11 = 89.5%, which contradicts the original 90%).

4. Argue that Alice can't possibly have enough knowledge to place under 1% on the negation of that conjunct.
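A quick numerical check of the arithmetic above:

```python
# If all 11 conditional probabilities were exactly 99%, the conjunction would
# have probability 0.99^11, which is already below Alice's stated 90% -- so at
# least one conditional must exceed 99%.
p = 0.99 ** 11
print(f"{p:.3f}")  # 0.895
```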


What's the upshot? When two people disagree on a complicated claim, decomposing the question is only a good move when both people think that is the right way to carve up the question. Most of the disagreement is likely in how to carve up the claim in the first place.

Comment by rohinmshah on Mesa-Search vs Mesa-Control · 2020-09-17T21:20:07.281Z · score: 2 (1 votes) · LW · GW

Sure, also making up numbers, everything conditional on the neural net paradigm, and only talking about failures of single-single intent alignment:

  • ~90% that there aren't problems or we "could" fix them on 40 year timelines
  • I'm not sure exactly what is meant by motivation so will not predict, but there will be many people working on fixing the problems
  • "Are fixes used" is not a question in my ontology; something counts as a "fix" only if it's cheap enough to be used. If you instead ask "did the team fail to use an existing fix that counterfactually would have made the difference between existential catastrophe and not" (possibly because they didn't know of its existence), then my answer is < 10%, and I don't have enough information to distinguish within 0-10%.
  • I'll instead answer "how much x-risk would result from a small company *not* using them": if it's a single small company, then < 10%; I don't have enough information to distinguish within 0-10%, though I expect on reflection I'd say < 1%.

Comment by rohinmshah on Mesa-Search vs Mesa-Control · 2020-09-17T16:20:00.162Z · score: 3 (2 votes) · LW · GW

Yeah it's plausible that the actual claims MIRI would disagree with are more like:

Problems manifest => high likelihood we understand the underlying cause

We understand the underlying cause => high likelihood we fix it (or don't build powerful AI) rather than applying "surface patches"

Comment by rohinmshah on What Does "Signalling" Mean? · 2020-09-17T02:44:01.330Z · score: 10 (4 votes) · LW · GW

Perhaps there should be a place illustrating the differences between LW terms and academic terms of the same name -- besides this there's also game theory and I feel like I've seen others as well (though I can't remember what they were).

Comment by rohinmshah on rohinmshah's Shortform · 2020-09-15T20:55:44.859Z · score: 4 (2 votes) · LW · GW

Yeah, that sounds about right to me. I think in terms of this framework my claim is primarily "for reasonably complex systems, if you try to do 2 without expertise, you will fail, but you may not realize you have failed".

I'm also noticing I mean something slightly different by "expertise" than is typically meant. My intended meaning of "expertise" is more like "you have lots of data and observations about the system", e.g. I think LW self-help stuff is reasonably likely to work (for the LW audience) because people have lots of detailed knowledge and observations about themselves and their friends.

Comment by rohinmshah on rohinmshah's Shortform · 2020-09-15T15:34:25.788Z · score: 3 (2 votes) · LW · GW

Yes, that's the problem. In this situation, if N << population / 2, you are likely to not intervene even when the intervention is net positive; if N >> population / 2, you are likely to intervene even when the intervention is net negative.

(This is under the standard model of a one-shot decision where each participant gets a noisy observation of the true value with the noise being iid Gaussians with mean zero.)
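A minimal simulation of that standard model (the population size, noise scale, true values, and thresholds below are all arbitrary choices for illustration):

```python
import numpy as np

def intervention_rate(true_value, population, n_threshold, trials=10_000, seed=0):
    """Each participant observes true_value + N(0, 1) noise; the group
    intervenes iff fewer than n_threshold observations come out negative.
    Returns the fraction of trials in which the intervention happens."""
    rng = np.random.default_rng(seed)
    obs = true_value + rng.standard_normal((trials, population))
    negatives = (obs < 0).sum(axis=1)
    return (negatives < n_threshold).mean()

pop = 21
# Net-positive intervention (true value +0.2): a majority-like threshold
# intervenes most of the time, but N = 1 almost never does.
print(intervention_rate(+0.2, pop, n_threshold=11))
print(intervention_rate(+0.2, pop, n_threshold=1))
# Net-negative intervention (true value -0.2): N = 20 intervenes almost always.
print(intervention_rate(-0.2, pop, n_threshold=20))
```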

Comment by rohinmshah on What's Wrong with Social Science and How to Fix It: Reflections After Reading 2578 Papers · 2020-09-15T07:30:21.387Z · score: 3 (2 votes) · LW · GW

If a paper is on the cusp of "needing to be cited" but you think it won't replicate, take that into account! Or if reviewing a paper, at least take into account the probability of replication in your decision.

Why do you think people don't already do this?

In general, if you want to make a recommendation on the margin, you have to talk about what the current margin is.

I think you are maybe reading the author's claim to "stop assuming good faith" too literally. In the subsequent sentence they are basically refining that to the idea that most people are acting in good faith, but are not competent enough for good faith to be a useful assumption

Huh? The sentence I see is

I'm not saying every academic interaction should be hostile and adversarial, but the good guys are behaving like dodos right now and the predators are running wild.

"the predators are running wild" does not mean "most people are acting in good faith, but are not competent enough for good faith to be a useful assumption".

Comment by rohinmshah on rohinmshah's Shortform · 2020-09-14T19:56:19.371Z · score: 7 (4 votes) · LW · GW

In general, evaluate the credibility of experts on the decisions they make or recommend, not on the beliefs they espouse. The selection in our world is based much more on outcomes of decisions than on calibration of beliefs, so you should expect experts to be way better on the former than on the latter.

By "selection", I mean both selection pressures generated by humans, e.g. which doctors gain the most reputation, and selection pressures generated by nature, e.g. most people know how to catch a ball even though most people would get conceptual physics questions wrong.

Similarly, trust decisions / recommendations given by experts more than the beliefs and justifications for those recommendations.

Comment by rohinmshah on rohinmshah's Shortform · 2020-09-14T19:51:32.218Z · score: 6 (3 votes) · LW · GW

Sounds like you probably disagree with the (exaggeratedly stated) point made here then, yeah?


My own take is the cop-out-like, "it depends". I think how much you ought to defer to experts varies a lot based on what the topic is, what the specific question is, details of your own personal characteristics, how much thought you've put into it, etc.

I didn't say you should defer to experts, just that if you try to build gears-y models you'll be wrong. It's totally possible that there's no way to get to reliably correct answers and you instead want decisions that are good regardless of what the answer is.

Comment by rohinmshah on rohinmshah's Shortform · 2020-09-14T18:38:45.180Z · score: 9 (5 votes) · LW · GW

Yeah, I think so? I have a vague sense that there are slight differences but I certainly haven't explained them here.

EDIT: Also, I think a major point I would want to make if I wrote this post is that you will almost certainly be quite wrong if you use option 1 without expertise, in a way that other people without expertise won't be able to identify, because there are far more ways the world can be than you (or others) will have thought about when making your gears-y model.

Comment by rohinmshah on rohinmshah's Shortform · 2020-09-14T18:28:53.390Z · score: 13 (3 votes) · LW · GW

Consider two methods of thinking:

1. Observe the world and form some gears-y model of underlying low-level factors, and then make predictions by "rolling out" that model

2. Observe relatively stable high-level features of the world, predict that those will continue as is, and make inferences about low-level factors conditioned on those predictions.

I expect that most intellectual progress is accomplished by people with lots of detailed knowledge and expertise in an area doing option 1.

However, I expect that in the absence of detailed expertise, you will do much better at predicting the world by using option 2.

I think many people on LW tend to use option 1 almost always and my "deference" to option 2 in the absence of expertise is what leads to disagreements like How good is humanity at coordination?

Conversely, I think many of the most prominent EAs who are skeptical of AI risk are using option 2 in a situation where I can use option 1 (and I think they can defer to people who can use option 1).

Comment by rohinmshah on rohinmshah's Shortform · 2020-09-14T18:21:39.508Z · score: 4 (2 votes) · LW · GW

Intellectual progress requires points with nuance. However, on online discussion forums (including LW, AIAF, EA Forum), people seem to frequently lose sight of the nuanced point being made -- rather than thinking of a comment thread as "this is trying to ascertain whether X is true", they seem to instead read the comments, perform some sort of inference about what the author must believe, as if each comment were written in isolation, and then respond to that model of beliefs. This makes it hard to have nuance without adding a ton of clarification and qualifiers everywhere.

I find that similar dynamics happen in group conversations, and to some extent even in one-on-one conversations (though much less so).

Comment by rohinmshah on rohinmshah's Shortform · 2020-09-14T18:16:41.448Z · score: 2 (1 votes) · LW · GW

The simple response to the unilateralist's curse under the standard setting is to aggregate opinions amongst the people in the reference class, and then take a majority vote.

A particularly flawed response is to look for N opinions that say "intervening is net negative" and intervene iff you cannot find that many opinions. This sacrifices value and induces a new unilateralist's curse on people who think the intervention is negative. (Example.)

However, the hardest thing about the unilateralist's curse is figuring out how to define the reference class in the first place.

Comment by rohinmshah on rohinmshah's Shortform · 2020-09-14T18:13:26.963Z · score: 2 (1 votes) · LW · GW

Under the standard setting, the optimizer's curse only changes your naive estimate of the EV of the action you choose. It does not change the naive decision you make. So, it is not valid to use the optimizer's curse as a critique of people who use EV calculations to make decisions, but it is valid to use it as a critique of people who make claims about the EV calculations of their most preferred outcome (if they don't already account for it).
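A quick simulation of the standard setting (numbers arbitrary): all actions have the same true value, so the naive decision is unaffected, but the naive EV estimate attached to the chosen action is inflated.

```python
import numpy as np

rng = np.random.default_rng(0)

k, trials = 10, 100_000
true_value = 0.0
# Unbiased noisy EV estimates for each of k candidate actions.
estimates = true_value + rng.standard_normal((trials, k))

chosen = estimates.argmax(axis=1)        # the naive decision (here all actions
                                         # are equally good, so no harm done)
chosen_estimate = estimates.max(axis=1)  # the naive EV of the chosen action

print(round(chosen_estimate.mean(), 2))  # ≈ 1.54, far above the true value of 0
```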

Comment by rohinmshah on rohinmshah's Shortform · 2020-09-14T18:10:34.459Z · score: 15 (3 votes) · LW · GW

I often have the experience of being in the middle of a discussion and wanting to reference some simple but important idea / point, but there doesn't exist any such thing. Often my reaction is "if only there was time to write an LW post that I can then link to in the future". So far I've just been letting these ideas be forgotten, because it would be Yet Another Thing To Keep Track Of. I'm now going to experiment with making subcomments here simply collecting the ideas; perhaps other people will write posts about them at some point, if they're even understandable.

Comment by rohinmshah on on “learning to summarize” · 2020-09-13T23:52:30.044Z · score: 2 (1 votes) · LW · GW

Yeah, I definitely agree with that, I was just responding to the confusion that (I think) nostalgebraist had. Relative to the latter paper, I'd guess increased performance is primarily due to label quality and larger model.

Comment by rohinmshah on on “learning to summarize” · 2020-09-13T21:22:32.707Z · score: 4 (2 votes) · LW · GW

By "original paper" do you mean Deep RL from Human Preferences or Fine-Tuning Language Models from Human Preferences? The latter did have a KL penalty, but OP linked to the former. I just skimmed the former again and saw no mention of a KL penalty (but I easily could have missed it).

Comment by rohinmshah on on “learning to summarize” · 2020-09-13T21:18:46.705Z · score: 3 (2 votes) · LW · GW

Ah got it, that makes sense, I agree with all of that.

Comment by rohinmshah on on “learning to summarize” · 2020-09-13T16:08:24.736Z · score: 2 (1 votes) · LW · GW

That all makes sense, except for this part:

where there isn't a "horizon" per se because all episodes have a fixed duration and receive rewards only at the end.

I'm confused how this is not a horizon? Perhaps we're using words differently -- I'm saying "there's a hyperparameter that controls the number of timesteps over which credit assignment must be performed; in their setting it's the sentence length and in your setting it is 1; nothing else would need to change".

Comment by rohinmshah on What's Wrong with Social Science and How to Fix It: Reflections After Reading 2578 Papers · 2020-09-12T18:24:36.473Z · score: 4 (2 votes) · LW · GW

Replied to John below

Comment by rohinmshah on on “learning to summarize” · 2020-09-12T18:20:55.320Z · score: 2 (1 votes) · LW · GW

IMO, the power of differentiability is greatly underused. Everyone is locked into a 'optimize parameters based on data & loss' mindset, and few ever use the alternatives like 'optimize data/trajectory based on parameters & loss' or 'optimize loss based on data/parameters'.

Strongly agree. It's obnoxiously difficult to get people to understand that this was what I did (kind of) in this paper.
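For readers who haven't seen the trick: a toy sketch of 'optimize data based on parameters & loss', holding a differentiable model fixed and running gradient ascent on the input instead of the weights (the model and all numbers are invented for illustration):

```python
import numpy as np

# Frozen "model": f(x) = -(w.x - t)^2, with parameters w, t held fixed.
w, t = np.array([2.0, -1.0]), 3.0

def f(x):
    return -(w @ x - t) ** 2

def grad_x(x):
    return -2.0 * (w @ x - t) * w  # gradient with respect to the *input*

x = np.zeros(2)            # start from an arbitrary input
for _ in range(100):
    x += 0.05 * grad_x(x)  # gradient *ascent* on x, not on w

print(f(x) > -1e-6)  # True: the input now (nearly) maximizes the frozen model
```

The same pattern, with the "model" replaced by a learned reward model, is what gradient ascent on a differentiable reward model would look like.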

Comment by rohinmshah on What's Wrong with Social Science and How to Fix It: Reflections After Reading 2578 Papers · 2020-09-12T18:06:38.294Z · score: 35 (20 votes) · LW · GW

My claims are really just for CS, idk how much they apply to the social sciences, but the post gives me no reason to think they aren't true for the social sciences as well.

  • Just stop citing bad research, I shouldn't need to tell you this, jesus christ what the fuck is wrong with you people.

This doesn't work unless it's common knowledge that the research is bad, since reviewers are looking for reasons to reject and "you didn't cite this related work" is a classic one (and your paper might be reviewed by the author of the bad work). When I was early in my PhD, I had a paper rejected where it sounded like a major contributing factor was not citing a paper that I had specifically judged to be unrelated, but that the reviewer thought was related.

  • Read the papers you cite. Or at least make your grad students do it for you. It doesn't need to be exhaustive: the abstract, a quick look at the descriptive stats, a good look at the table with the main regression results, and then a skim of the conclusions. Maybe a glance at the methodology if they're doing something unusual. It won't take more than a couple of minutes. And you owe it not only to SCIENCE!, but also to yourself: the ability to discriminate between what is real and what is not is rather useful if you want to produce good research.

I think the point of this recommendation is to get people to stop citing bad research. I doubt it will make a difference since as argued above the cause isn't "we can't tell which research is bad" but "despite knowing what's bad we have to cite it anyway".

  • When doing peer review, reject claims that are likely to be false. The base replication rate for studies with p>.001 is below 50%. When reviewing a paper whose central claim has a p-value above that, you should recommend against publication unless the paper is exceptional (good methodology, high prior likelihood, etc.) If we're going to have publication bias, at least let that be a bias for true positives. Remember to subtract another 10 percentage points for interaction effects. You don't need to be complicit in the publication of false claims.

I have issues with this, but they aren't related to me knowing more about academia than the author, so I'll skip it. (And it's more like, I'm uncertain about how good an idea this would be.)

  • Stop assuming good faith. I'm not saying every academic interaction should be hostile and adversarial, but the good guys are behaving like dodos right now and the predators are running wild.

The evidence in the post suggesting that people aren't acting in good faith is roughly "if you know statistics then it's obvious that the papers you're writing won't replicate". My guess is that many social scientists don't know statistics and/or don't apply it intuitively, so I don't see a reason to reject the (a priori more plausible to me) hypothesis that most people are acting in okay-to-good faith.

I don't really understand the author's model here, but my guess is that they are assuming that academics primarily think about "here's the dataset and here are the analysis results and here are the conclusions". I can't speak to social science, but when I'm trying to figure out some complicated thing (e.g "why does my algorithm work in setting X but not setting Y") I spend most of my time staring at data, generating hypotheses, making predictions with them, etc. which is very very conducive to the garden of forking paths that the author dismisses out of hand.

EDIT: Added some discussion of the other recommendations below, though I know much less about them, and here I'm just relying more on my own intuition rather than my knowledge about academia:

  • Earmark 60% of funding for registered reports (ie accepted for publication based on the preregistered design only, not results). For some types of work this isn't feasible, but for ¾ of the papers I skimmed it's possible. In one fell swoop, p-hacking and publication bias would be virtually eliminated.

I'd be shocked if 3/4 of social science papers could have been preregistered. My guess is that what happens is that researchers collect data, do a bunch of analyses, figure out some hypotheses, and only then write the paper.

Possibly the suggestion here is that all this exploratory work should be done first, then a study should be preregistered, and then the results are reported. My weak guess is that this wouldn't actually help replicability very much -- my understanding is that researchers are often able to replicate their own results, even when others can't. (Which makes sense! If I try to describe to a CHAI intern an algorithm they should try running, I often have the experience that they do something differently than I was expecting. Ideally in social science results would be robust to small variations, but in practice they aren't, and I wouldn't strongly expect preregistration to help, though plausibly it would.)

  • An NSF/NIH inquisition that makes sure the published studies match the pre-registration (there's so much """"""""""QRP"""""""""" in this area you wouldn't believe). The SEC has the power to ban people from the financial industry—let's extend that model to academia.

My general qualms about preregistration apply here too, but if we assume that we're going to have a preregistration model, then this seems good to me.

  • Earmark 10% of funding for replications. When the majority of publications are registered reports, replications will be far less valuable than they are today. However, intelligently targeted replications still need to happen.

This seems good to me (though idk if 10% is the right number, I could see both higher and lower).

  • Increase sample sizes and lower the significance threshold to .005. This one needs to be targeted: studies of small effects probably need to quadruple their sample sizes in order to get their power to reasonable levels. The median study would only need 2x or so. Lowering alpha is generally preferable to increasing power. "But Alvaro, doesn't that mean that fewer grants would be funded?" Yes.

Personally, I don't like the idea of significance thresholds and required sample sizes. I like having quantitative data because it informs my intuitions; I can't just specify a hard decision rule based on how some quantitative data will play out.

Even if this were implemented, I wouldn't predict much effect on reproducibility: I expect that what happens is the papers we get have even more contingent effects that only the original researchers can reproduce, which happens via them traversing the garden of forking paths even more. Here's an example with p-values of .002 and .006.

Andrew Gelman makes a similar case.
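For reference, the "quadruple the sample size" arithmetic in the quoted recommendation falls out of the standard normal-approximation power formula, under which the required n scales as 1/d²; the specific effect sizes below are just for illustration:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf  # standard normal quantile function

def n_per_group(d, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided, two-sample test
    of standardized effect size d (normal approximation)."""
    return 2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2

# Halving the effect size quadruples the required sample size...
print(n_per_group(0.5) / n_per_group(1.0))  # 4.0
# ...while tightening alpha from .05 to .005 costs a much smaller factor.
print(round(n_per_group(0.5, alpha=0.005) / n_per_group(0.5), 2))  # ≈ 1.7
```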

  • Ignore citation counts. Given that citations are unrelated to (easily-predictable) replicability, let alone any subtler quality aspects, their use as an evaluative tool should stop immediately.

I am very on board with citation counts being terrible, but what should be used instead? If you evaluate based on predicted replicability, you incentivize research that says obvious things, e.g. "rain is correlated with wet sidewalks".

I suspect that you probably could build a better and still cost-efficient evaluation tool, but it's not obvious how.

  • Open data, enforced by the NSF/NIH. There are problems with privacy but I would be tempted to go as far as possible with this. Open data helps detect fraud. And let's have everyone share their code, too—anything that makes replication/reproduction easier is a step in the right direction.

Seems good, though I'd want to first understand what purpose IRBs serve (you'd have to severely roll back IRBs for open data to become a norm).

  • Financial incentives for universities and journals to police fraud. It's not easy to structure this well because on the one hand you want to incentivize them to minimize the frauds published, but on the other hand you want to maximize the frauds being caught. Beware Goodhart's law!

I approve of the goal "minimize fraud". This recommendation is too vague for me to comment on the strategy.

  • Why not do away with the journal system altogether? The NSF could run its own centralized, open website; grants would require publication there. Journals are objectively not doing their job as gatekeepers of quality or truth, so what even is a journal? A combination of taxonomy and reputation. The former is better solved by a simple tag system, and the latter is actually misleading. Peer review is unpaid work anyway, it could continue as is. Attach a replication prediction market (with the estimated probability displayed in gargantuan neon-red font right next to the paper title) and you're golden. Without the crutch of "high ranked journals" maybe we could move to better ways of evaluating scientific output. No more editors refusing to publish replications. You can't shift the incentives: academics want to publish in "high-impact" journals, and journals want to selectively publish "high-impact" research. So just make it impossible. Plus as a bonus side-effect this would finally sink Elsevier.

This seems to assume that the NSF would be more competent than journals for some reason. I don't think the problem is with journals per se, I think the problem is with peer review, so if the NSF continues to use peer review as the author suggests, I don't expect this to fix anything.

The author also suggests using a replication prediction market; as I mentioned above you don't want to optimize just for replicability. Possibly you could have replication + some method of incentivizing novelty / importance. The author does note this issue elsewhere but just says "it's a solvable problem". I am not so optimistic. I feel like similar a priori reasoning could have led the author to say "reproducibility will be a solvable problem".

Comment by rohinmshah on on “learning to summarize” · 2020-09-12T17:43:13.439Z · score: 2 (1 votes) · LW · GW

Accept no substitutes! Gradient ascent directly on the differentiable reward/environment model!

This idea has come up at CHAI occasionally, but I don't think anyone has actually run with it -- do you know any examples of work that does this from (possibly simulated) human feedback? I'm pretty curious to see how much white-box optimization helps.

Comment by rohinmshah on on “learning to summarize” · 2020-09-12T17:26:40.326Z · score: 4 (2 votes) · LW · GW

this didn’t work for their test cases: “Training the reward predictor offline can lead to bizarre behavior […] This type of behavior demonstrates that in general human feedback needs to be intertwined with RL rather than provided statically.”  I don’t know what to make of this.

I think in the original paper, they don't have the KL term that prevents the policy from overfitting to the reward model, which seems sufficient to explain this. (Also more speculatively I'd guess that using bigger models on more realistic tasks probably leads to the reward model generalizing better, so that optimization in batches becomes more feasible.)
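To spell out the KL term being discussed: in the later work, the reward being maximized is roughly the reward-model score minus a penalty for drifting from the supervised baseline policy. The per-sample form and coefficient below are an illustrative sketch, not the paper's exact implementation:

```python
def penalized_reward(rm_score, logp_policy, logp_supervised, beta=0.1):
    """Reward-model score minus beta * (log pi(y|x) - log pi_SL(y|x)),
    a per-sample estimate of the KL from the supervised baseline."""
    return rm_score - beta * (logp_policy - logp_supervised)

# A summary the reward model loves but the baseline finds wildly unlikely
# ends up scoring worse than a decent, on-distribution summary.
print(round(penalized_reward(2.0, logp_policy=-1.0, logp_supervised=-30.0), 2))  # -0.9
print(round(penalized_reward(1.5, logp_policy=-1.0, logp_supervised=-1.5), 2))   # 1.45
```

Without the penalty (beta = 0), the first, off-distribution sample would win, which is the overfitting-to-the-reward-model failure described above.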

After all, if they can, then you can just skip the RL, have humans explicitly tell you “no that token is bad, yes this token is great,” and train on likelihood.

Don't you still need a model that converts from human preferences over tokens to likelihoods? It sounds to me that the architecture you're suggesting is like theirs, except using a horizon of 1. Or perhaps you don't want to use a learned reward model, and instead you want some hardcoded method of converting human preferences over tokens into <thing that can be plugged into an ML algorithm>?

Comment by rohinmshah on What's Wrong with Social Science and How to Fix It: Reflections After Reading 2578 Papers · 2020-09-12T17:10:12.515Z · score: 11 (7 votes) · LW · GW

I appreciated the analysis of what does and doesn't replicate, but the author has clearly never been in academia and many of their recommendations are off base. Put another way, the "what's wrong with social science" part is great, and the "how to fix it" is not.

Comment by rohinmshah on [AN #116]: How to make explanations of neurons compositional · 2020-09-10T00:30:10.837Z · score: 8 (5 votes) · LW · GW

Yup, I generally agree (both with the three predictions, and the general story of how NNs work).

Comment by rohinmshah on "Learning to Summarize with Human Feedback" - OpenAI · 2020-09-07T20:31:38.067Z · score: 12 (7 votes) · LW · GW

Planned summary for the Alignment Newsletter:

OpenAI has been working on <@finetuning language models from human preferences@>(@Fine-Tuning GPT-2 from Human Preferences@). This blog post and paper show the progress they have made on text summarization in particular since their last release.

As a reminder, the basic setup is similar to that of <@Deep RL from Human Preferences@>: we get candidate summaries by executing the policy, have humans compare which of two summaries is better, and use this feedback to train a reward model that can then be used to improve the policy. The main differences in this paper are:

1. They put in a lot of effort to ensure high data quality. Rather than having MTurk workers compare between summaries, they hire a few contractors who are paid a flat hourly rate, and they put a lot of effort into communicating what they care about to ensure high agreement between labelers and researchers.

2. Rather than collecting preferences in an online training setup, they collect large batches at a time, and run a relatively small number of iterations of alternating between training the reward model and training the policy. My understanding is that this primarily makes it simpler from a practical perspective, e.g. you can look at the large batch of data you collected from humans and analyze it as a unit.

3. They initialize the policy from a model that is first pretrained in an unsupervised manner (as in <@GPT-3@>(@Language Models are Few-Shot Learners@)) and then finetuned on the reference summaries using supervised learning.

On the Reddit task they train on, their summaries are preferred over the reference summaries (though since the reference summaries have varying quality, this does not imply that their model is superhuman). They also transfer the policy to summarize CNN / DailyMail news articles and find that it still outperforms the supervised model, despite not being trained at all for this setting (except inasmuch as the unsupervised pretraining step saw CNN / DailyMail articles).

An important ingredient to this success is that they ensure their policy doesn’t overoptimize the reward, by adding a term to the reward function that penalizes deviation from the supervised learning baseline. They show that if they put a very low weight on this term, the model overfits to the reward model and starts producing bad outputs.

Planned opinion:

This paper is a great look at what reward learning would look like at scale. The most salient takeaways for me were that data quality becomes very important and having very large models does not mean that the reward can now be optimized arbitrarily.

Comment by rohinmshah on Coherence arguments do not imply goal-directed behavior · 2020-09-07T02:56:49.831Z · score: 6 (3 votes) · LW · GW

Richard Ngo did consider this line of argument, see Coherent behaviour in the real world is an incoherent concept.

However, I intuitively think that we should expect AI to have a utility function over world states

My main point is that it relies on some sort of intuition like this rather than being determined by math. As an aside, I doubt "world states" is enough to rescue the argument, unless you have very coarse world states that only look at the features that humans care about.

In fact, if we're talking about histories, then all of the examples of circular utilities stop being examples of circular utilities.

Yup, exactly.

I don't understand why you can't just look at the theorem and see whether it talks about world states or histories, but I guess the formalism is too abstract or something?

The theorem can apply to world states or histories. The VNM theorem assumes that there is some set of "outcomes" that the agent has preferences over; you can use either world states or histories for that set of outcomes. Using only world states would be a stronger assumption.

So it feels like you're arguing against something that was never intended.

Yup, that's right. I am merely pointing out that the intended argument depends on intuition, and is not a straightforward consequence of math / the VNM theorem.

Clearly, EY wasn't thinking about utility functions that are allowed to depend on arbitrary histories when he wrote the Arbital post (or during his "AI alignment, why it's hard & where to start" talk, which makes the same points).

Sure, but there's this implication that "since this is a theory of rationality, any intelligent AI system will be well modeled like this", without acknowledging that this depends on the assumption that the relevant outcome set is that of (coarsely modeled) world states (or some other assumption). That's the issue I want to correct.

I'm also surprised that no-one else has made a similar point before. Has Eliezer ever responded to this post?

Not that I know of.

Comment by rohinmshah on [AN #115]: AI safety research problems in the AI-GA framework · 2020-09-04T17:01:56.057Z · score: 5 (3 votes) · LW · GW

Hmm, of the faculty, Stuart spends the most time thinking about AI alignment; I'm not sure how much the other faculty have thought about corrigibility -- they'll have views about the off switch game, but not about MIRI-style corrigibility.

Most of the staff doesn't work on technical research, so they probably won't have strong opinions. Exceptions: Critch and Karthika (though I don't think Karthika has engaged much with corrigibility).

Probably the best way is to find emails of individual researchers online and email them directly. I've also left a message on our Slack linking to this discussion.

Comment by rohinmshah on [AN #115]: AI safety research problems in the AI-GA framework · 2020-09-04T05:44:54.913Z · score: 2 (1 votes) · LW · GW

should not be, thanks

Comment by rohinmshah on [AN #115]: AI safety research problems in the AI-GA framework · 2020-09-04T05:44:27.389Z · score: 2 (1 votes) · LW · GW

Hmm, I expect each grad student will have a slightly different perspective, but off the top of my head I think Michael Dennis has the most opinions on it. (Other people could include Daniel Filan and Adam Gleave.)

Comment by rohinmshah on Intuitions about goal-directed behavior · 2020-09-04T05:40:09.407Z · score: 4 (2 votes) · LW · GW

Yeah I think I broadly agree with this perspective (in contrast to how I used to think about it), but I still think there is a meaningful distinction in an AI agent that pursues goals that other agents (e.g. humans) have. I would no longer call it "not goal-directed", but it still seems like an important distinction.

I assume an updated version of this would link to the Risks from Learned Optimization paper.

Yes, updated, thanks.

Comment by rohinmshah on [AN #115]: AI safety research problems in the AI-GA framework · 2020-09-03T16:15:50.729Z · score: 2 (1 votes) · LW · GW

Uh, I don't speak for CHAI, and my views differ pretty significantly from e.g. Dylan's or Stuart's on several topics. (And other grad students differ even more.) But those seem like reasonable CHAI papers to look at (though I'm not sure how Active IRD relates to corrigibility). Chapter 3 of the Value Learning sequence has some of my takes on reward uncertainty, which probably includes some thoughts about corrigibility somewhere.

Human Compatible also talks about corrigibility iirc, though I think the discussion is pretty similar to the one in the off switch game?

Comment by rohinmshah on [AN #115]: AI safety research problems in the AI-GA framework · 2020-09-03T03:20:45.885Z · score: 6 (3 votes) · LW · GW

I forget if I mentioned this before, but all of this HTML is generated by a script with a much more structured input, which you can see here. Plausibly we should just add another output mode to the script that can be easily imported into LessWrong? (Happy to share the spreadsheet that the input data comes from, if that would help.)

Comment by rohinmshah on Thoughts on the Feasibility of Prosaic AGI Alignment? · 2020-09-02T23:38:35.027Z · score: 5 (3 votes) · LW · GW

I’m not sure where he states it to be border-line impossible or worse.

Here's a recent comment, which doesn't exactly say that but seems pretty close.

When you refer to MIRI being highly pessimistic of prosaic AGI alignment, are you referring to the organization as a whole, or a few key members?

I don't know -- people at MIRI don't say much about their views; I'm generally responding to a stereotyped caricature of what people associate with MIRI because I don't have any better model. (You can see some more discussion about this "MIRI viewpoint" here.) I've heard from other people that these viewpoints should be most associated with Nate, Eliezer and Benya, but I haven't verified this myself.

I also don’t understand why this disparity of projections exists. Is there a more implicit part of the argument that neither party (Paul Christiano and MIRI) has addressed?

I don't know. To my knowledge the "doom" camp hasn't really responded to the points raised, though here is a notable exception.

Comment by rohinmshah on ricraz's Shortform · 2020-09-02T16:13:36.039Z · score: 4 (2 votes) · LW · GW

Rohin wouldn't have written his coherence theorems piece or any of his value learning sequence, and I'm pretty sure about that because I personally asked him to write that sequence

Yeah, that's true, though it might have happened at some later point in the future as I got increasingly frustrated by people continuing to cite VNM at me (though probably it would have been a blog post and not a full sequence).

Reading through this comment tree, I feel like there's a distinction to be made between "LW / AIAF as a platform that aggregates readership and provides better incentives for blogging", and "the intellectual progress caused by posts on LW / AIAF". The former seems like a clear and large positive of LW / AIAF, which I think Richard would agree with. For the latter, I tend to agree with Richard, though perhaps not as strongly as he does. Maybe I'd put it as, I only really expect intellectual progress from a few people who work on problems full time who probably would have done similar-ish work if not for LW / AIAF (but likely would not have made it public).

I'd say this mostly for the AI posts. I do read the rationality posts and don't get a different impression from them, but I also don't think enough about them to be confident in my opinions there.

Comment by rohinmshah on [AN #114]: Theory-inspired safety solutions for powerful Bayesian RL agents · 2020-08-31T05:07:38.596Z · score: 2 (1 votes) · LW · GW

Thanks, fixed.

Comment by rohinmshah on Forecasting Thread: AI Timelines · 2020-08-30T19:37:19.485Z · score: 3 (2 votes) · LW · GW

Planned summary for the Alignment Newsletter:

This post collects forecasts of timelines until human-level AGI, and (at the time of this writing) has twelve such forecasts.

Comment by rohinmshah on Model splintering: moving from one imperfect model to another · 2020-08-30T19:34:45.366Z · score: 4 (2 votes) · LW · GW

Planned summary for the Alignment Newsletter:

This post introduces the concept of _model splintering_, which seems to be an overarching problem underlying many other problems in AI safety. This is one way of more formally looking at the out-of-distribution problem in machine learning: instead of simply saying that we are out of distribution, we look at the model that the AI previously had, and see what model it transitions to in the new distribution, and analyze this transition.

Model splintering in particular refers to the phenomenon where a coarse-grained model is “splintered” into a more fine-grained model, with a one-to-many mapping between the environments that the coarse-grained model can distinguish between and the environments that the fine-grained model can distinguish between (this is what it means to be more fine-grained). For example, we may initially model all gases as ideal gases, defined by their pressure, volume and temperature. However, as we learn more, we may transition to the van der Waals equation, whose constants differ across types of gases, and so an environment like “1 liter of gas at standard temperature and pressure (STP)” now splinters into “1 liter of nitrogen at STP”, “1 liter of oxygen at STP”, etc.
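The one-to-many structure can be pictured as a refinement map from the coarse model's environments to the fine-grained model's environments (a purely illustrative sketch, not from the post):

```python
def splinter(coarse_env):
    """Hypothetical refinement map: each environment the coarse model
    can distinguish maps to several that the fine-grained model can."""
    gas_types = ["nitrogen", "oxygen", "argon"]
    return [coarse_env.replace("gas", gas) for gas in gas_types]

print(splinter("1 liter of gas at STP"))
# ['1 liter of nitrogen at STP', '1 liter of oxygen at STP',
#  '1 liter of argon at STP']
```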

Model splintering can also apply to reward functions: for example, in the past people might have had a reward function with a term for “honor”, but at this point the “honor” concept has splintered into several more specific ideas, and it is not clear how a reward for “honor” should generalize to these new concepts.

The hope is that by analyzing splintering and detecting when it happens, we can solve a whole host of problems. For example, we can use this as a way to detect if we are out of distribution. The full post lists several other examples.

Planned opinion:

I think that the problems of generalization and ambiguity out of distribution are extremely important and fundamental to AI alignment, so I’m glad to see work on them. It seems like model splintering could be a fruitful approach for those looking to take a more formal approach to these problems.

Comment by rohinmshah on Forecasting Thread: AI Timelines · 2020-08-30T18:20:03.619Z · score: 10 (6 votes) · LW · GW

My snapshot:

Idk what we mean by "AGI", so I'm predicting when transformative AI will be developed instead. This is still a pretty fuzzy target: at what point do we say it's "transformative"? Does it have to be fully deployed and we already see the huge economic impact? Or is it just the point at which the model training is complete? I'm erring more on the side of "when the model training is complete", but also there may be lots of models contributing to TAI, in which case it's not clear which particular model we mean. Nonetheless, this feels a lot more concrete and specific than AGI.

Methodology: use a quantitative model, and then slightly change the prediction to account for important unmodeled factors. I expect to write about this model in a future newsletter.

Comment by rohinmshah on How good is humanity at coordination? · 2020-08-25T18:22:39.155Z · score: 2 (1 votes) · LW · GW

Relevant evidence: survey about the impact of COVID on biorisk. I found the qualitative discussion far more useful than the summary table. I think overall the experts are a bit more pessimistic than would be predicted by my model, which is some evidence against my position (though I also think they are more optimistic than would be predicted by Buck's model). Note I'm primarily looking at what they said about natural biorisks, because I see COVID as a warning shot for natural pandemics but not necessarily deliberate ones.

(Similarly, on my model, warning shots of outer alignment failures don't help very much to guard against inner alignment failures.)

Comment by rohinmshah on Mesa-Search vs Mesa-Control · 2020-08-24T22:50:09.350Z · score: 4 (2 votes) · LW · GW

Planned summary for the Alignment Newsletter:

This post discusses several topics related to mesa optimization, and the ideas in it led the author to update towards thinking inner alignment problems are quite likely to occur in practice. I’m not summarizing it in detail here because it’s written from a perspective on mesa optimization that I find difficult to inhabit. However, it seems to me that this perspective is common so it seems fairly likely that the typical reader would find the post useful.

Happy for others to propose a different summary for me to include. However, the summary will need to make sense to me; this may be a hard challenge for this post in particular.

Comment by rohinmshah on Mesa-Search vs Mesa-Control · 2020-08-24T22:37:23.205Z · score: 4 (2 votes) · LW · GW

I lean toward there being a meaningful distinction here: a system can learn a general-purpose learning algorithm, or it can 'merely' learn a very good conditional model.

Does human reasoning count as a general-purpose learning algorithm? I've heard it claimed that when we apply neural nets to tasks humans haven't been trained on (like understanding DNA or materials science) the neural nets can rocket past human understanding, with way less computation and tools (and maybe even data) than humans have had access to (depending on how you measure). Tbc, I find this claim believable but haven't checked it myself. Maybe SGD is the real general-purpose learning algorithm? Human reasoning could certainly be viewed formally as "a very good conditional model".

So overall I lean towards thinking this is a continuous spectrum with no discontinuous changes (except ones like "better than humans or not", which use a fixed reference point to get a discontinuity). So there could be a meaningful distinction, but it's like the meaningful distinction between "warm water" and "hot water", rather than the meaningful distinction between "water" and "ice".

Comment by rohinmshah on Mesa-Search vs Mesa-Control · 2020-08-24T22:19:19.878Z · score: 14 (6 votes) · LW · GW

Random question: does this also update you towards "alignment problems will manifest in real systems well before they are powerful enough to take over the world"?

Context: I see this as a key claim for the (relative to MIRI) alignment-by-default perspective, and I expect many people at MIRI disagree with this claim (though I don't know why they disagree).

Comment by rohinmshah on Universality Unwrapped · 2020-08-24T22:10:33.377Z · score: 7 (2 votes) · LW · GW

Planned summary for the Alignment Newsletter:

This post explains the ideas behind universality and ascription universality, in a more accessible way than the original posts and with more detail than my summary.

Comment by rohinmshah on Matt Botvinick on the spontaneous emergence of learning algorithms · 2020-08-24T16:53:33.386Z · score: 9 (3 votes) · LW · GW

Which is exactly why I asked you for recommendations.

Yes, I never said you shouldn't ask me for recommendations. I'm saying that I don't have any good recommendations to give, and you should probably ask other people for recommendations.

showing some concrete things that might be relevant (as I repeated in each comment, not an exhaustive list) would make the injunction more effective.

In practice I find that anything I say tends to lose its nuance as it spreads, so I've moved towards saying fewer things that require nuance. If I said "X might be a good resource to learn from but I don't really know", I would only be a little surprised to hear a complaint in the future of the form "I deeply read X for two months because Rohin recommended it, but I still can't understand this deep RL paper".

If I actually were confident in some resource, I agree it would be more effective to mention it.

I'm just confused because it seems low effort for you, net positive, and the kind of "ask people for recommendation" that you preach in the previous comment.

I'm not convinced the low effort version is net positive, for the reasons mentioned above. Note that I've already very weakly endorsed your mention of Sutton and Barto, and very weakly mentioned Spinning Up in Deep RL. (EDIT: TurnTrout doesn't endorse Sutton and Barto much, so now neither do I.)

Comment by rohinmshah on Matt Botvinick on the spontaneous emergence of learning algorithms · 2020-08-23T04:59:29.713Z · score: 4 (2 votes) · LW · GW

I get the impression from your comments that you think it's naive to describe this result as "learning algorithms spontaneously emerge."

I think that's a fine characterization (and I said so in the grandparent comment? Looking back, I said I agreed with the claim that learning is happening via neural net activations, which I guess doesn't necessarily imply that I think it's a fine characterization).

You describe the lack of LW/AF pushback against that description as "a community-wide failure,"

I think my original comment didn't do a great job of phrasing my objection. My actual critique is that the community as a whole seems to be updating strongly on data-that-has-high-probability-if-you-know-basic-RL.

updating as a result toward thinking AF members “automatically believe anything written in a post without checking it.”

That was one of three possible explanations; I don't have a strong view on which explanation is the primary cause (if any of them are). It's more like "I observe clearly-to-me irrational behavior, this seems bad, even if I don't know what's causing it". If I had to guess, I'd guess that the explanation is a combination of readers not bothering to check details and those who are checking details not knowing enough to point out that this is expected.

I feel confused about why, given your model of the situation, the researchers were surprised that this phenomenon occurred, and seem to think it was a novel finding that it will inevitably occur given the three conditions described.

Indeed, I am also confused by this, as I noted in the original comment:

I don't understand why this was surprising to the original researchers

I have a couple of hypotheses, none of which seem particularly likely given that the authors are familiar with AI, so I just won't speculate. I agree this is evidence against my claim that this would be obvious to RL researchers.

And this OpenAI paper [...] describes their result in similar terms:

Again, I don't object to the description of this as learning a learning algorithm. I object to updating strongly on this. Note that the paper does not claim their results are surprising -- it is written in a style of "we figured out how to make this approach work". (The DeepMind paper does claim that the results are novel / surprising, but it is targeted at a neuroscience audience, to whom the results may indeed be surprising.)

I've been feeling very confused lately about how people talk about "search," and have started joking that I'm a search panpsychist.

On the search panpsychist view, my position is that if you use deep RL to train an AGI policy, it is definitionally a mesa optimizer. (Like, anything that is "generally intelligent" has the ability to learn quickly, which on the search panpsychist view means that it is a mesa optimizer.) So in this world, "likelihood of mesa optimization via deep RL" is equivalent to "likelihood of AGI via deep RL", and "likelihood that more general systems trained by deep RL will be mesa optimizers" is ~1 and you ~can't update on it.