How to write good AI forecasting questions + Question Database (Forecasting infrastructure, part 3)
by jacobjacob, bgold · 2019-09-03
This post introduces an open-source database of 76 questions about AI progress, together with detailed resolution conditions, categorisations and several spreadsheets compiling outside views and data, as well as learning points about how to write good AI forecasting questions. It is the third part in a series of blog posts which motivate and introduce pieces of infrastructure intended to improve our ability to forecast novel and uncertain domains like AI.
Background and motivation
Through our work on AI forecasting over the past year, we’ve tried to write many questions that track important facets of progress, and gained some experience of how hard this is and how to do it better.
In doing this, we’ve found that most previous public attempts at writing AI forecasting questions (e.g. this and this) fall prey to several failure modes that worsen the signal of the questions, as did many of the questions we wrote ourselves. Overall, we think operationalisation is an unsolved problem, and this has been the impetus behind the work described in this sequence of posts on forecasting infrastructure.
A great earlier resource on this topic is Allan Dafoe’s AI Governance Research Agenda, which has an appendix with forecasting desiderata (pages 52-53). This blog post complements that agenda by adding a large number of concrete examples.
We begin by categorising and giving examples of some ways in which technical forecasting questions can fail to capture the important, intended uncertainty. (Note that the below examples are not fully fleshed out questions, in order to allow for easier reading.) We then describe the question database we’re open-sourcing.
Ambiguity
Terms that can have many different meanings, such as “AGI” or “hardcoded knowledge”.
Underspecification
Resolution criteria that neglect to specify how questions should be resolved in some possible scenarios.
Examples
This question resolves positively if an article in a reputable journal finds that commercially-available automated speech recognition is better than human speech recognition (in the sense of having a lower transcription error rate).
If a journal article is published that finds that commercially-available automated speech recognition is better only in some domains (e.g. for HR meetings), but worse in most other domains, it is unclear from the resolution criteria how this question should be resolved.
Spurious resolution
Edge-case resolutions that technically satisfy the description as written, yet fail to capture the intention of the question.
Examples
Positively resolving the question:
Will there be an incident causing the loss of human life in the South China Sea (a highly politically contested sea in the Pacific Ocean) by 2018?
by having a battleship accidentally run over a fishing boat. (This is adapted from an actual question used by the Good Judgment Project.)
Positively resolving the question:
Will an AI lab have been nationalized by 2024?
by the US government nationalising GM for auto-manufacturing reasons, yet GM nonetheless having a self-driving car research division.
Trivial pathways
Most of the variance in the forecasting outcome of the question is driven by an unrelated causal pathway to resolution, which “screens off” the intended pathways.
A question which avoids resolution by trivial pathways is roughly what Allan Dafoe calls an “accurate indicator”:
We want [AI forecasting questions] to be accurate indicators, as opposed to noisy indicators that are not highly correlated with the important events. Specifically, where E is the occurrence or near occurrence of some important event, and Y is whether the target has been reached, we want P(not Y | not E) ~ 1 and P(Y | E) ~ 1. An indicator may fail to be informative if it can be “gamed” in that there are ways of achieving the indicator without the important event being near. It may be a noisy indicator if it depends on otherwise irrelevant factors, such as whether a target happens to take on symbolic importance as the focus of research.
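As a rough illustration of the two conditional probabilities in the quote, the sketch below estimates P(Y | E) and P(not Y | not E) from a list of (E, Y) outcomes. The function name `indicator_accuracy` and the sample data are hypothetical, made up for this illustration, and are not drawn from the agenda or from our database.

```python
from typing import Iterable, Tuple


def indicator_accuracy(outcomes: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Estimate (P(Y | E), P(not Y | not E)) from observed (E, Y) pairs.

    E: the important event occurred (or nearly occurred).
    Y: the question's target was reached.
    """
    outcomes = list(outcomes)
    y_when_e = [y for e, y in outcomes if e]
    y_when_not_e = [y for e, y in outcomes if not e]
    if not y_when_e or not y_when_not_e:
        raise ValueError("need at least one case with E and one without E")
    p_y_given_e = sum(y_when_e) / len(y_when_e)
    p_not_y_given_not_e = sum(not y for y in y_when_not_e) / len(y_when_not_e)
    return p_y_given_e, p_not_y_given_not_e


# Made-up data for a "gameable" indicator: the target is always reached when
# the event occurs, but it is also reached in 3 of 4 cases where it does not.
gameable = [(True, True)] * 4 + [(False, True)] * 3 + [(False, False)] * 1
print(indicator_accuracy(gameable))
```

An accurate indicator would return a pair close to (1, 1). Here the first value is high and the second is low, which is the pattern the quote describes for an indicator that can be achieved without the important event being near.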
Examples
Forecasting the question:
When will there be a superhuman Angry Birds agent using no hardcoded knowledge?
and realizing that there seems to be little active interest in the yearly benchmark competition (with performance even declining over the years). This means that the probability depends entirely on whether anyone with enough money and competence decides to work on it, as opposed to what key components make Angry Birds difficult (e.g. physics-based simulation) and how fast progress is in those domains.
Forecasting the question:
How sample efficient will the best Dota 2 RL agent be in 2020?
by analyzing OpenAI’s decision on whether or not to build a new agent, rather than the underlying difficulty and progress of RL in partially observable, high-dimensional environments.
Forecasting the question:
Will there have been a 2-year interval where the amount of training compute used in the largest AI experiment did not grow 32x, before 2024?
by analyzing how often big experiments are run, rather than what the key drivers of the trend are (e.g. parallelizability and compute economics) and how they will change in the future.
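To make the 32x threshold in the last example concrete, the arithmetic below converts a growth factor over an interval into the doubling time it implies. This is only a unit conversion; the helper name `implied_doubling_time_months` is hypothetical.

```python
import math


def implied_doubling_time_months(growth_factor: float, interval_months: float) -> float:
    """Doubling time implied by growing `growth_factor`-fold over `interval_months`."""
    if growth_factor <= 1:
        raise ValueError("growth_factor must be greater than 1")
    # Number of doublings needed is log2(growth_factor).
    return interval_months / math.log2(growth_factor)


print(implied_doubling_time_months(32, 24))
```

Growing 32x takes five doublings, so sustaining it over 24 months means one doubling every 4.8 months. The question therefore asks whether any 2-year window falls short of that pace.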
Failed parameter tuning
Any free variables are set to values which do not maximise uncertainty.
Examples
Having the answer to:
When will 100% of jobs be automatable at a performance close to the median employee and cost <10,000x of that employee?
be very different from using the parameter 99.999%, as certain edge-case jobs (“underwater basket weaving”) might be surprisingly hard to automate but for non-interesting reasons.
Will global investment in AI R&D be <$100 trillion in 2021?
is not interesting, even though asking about values in the range of ~$30B to ~$1T might have been.
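One possible way to read “maximise uncertainty” for a binary question is through the entropy of a forecaster’s probability: it peaks at 50% and falls towards zero as the answer becomes a foregone conclusion. The sketch below uses that reading as an assumption (we do not define uncertainty formally in this post), and the function name is hypothetical.

```python
import math


def binary_entropy_bits(p: float) -> float:
    """Entropy, in bits, of a yes/no question whose probability of 'yes' is p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


# Compare a well-tuned threshold (near 50%) with a near-certain one.
for p in (0.5, 0.9, 0.999):
    print(p, round(binary_entropy_bits(p), 3))
```

Under this reading, a threshold like $100 trillion puts the question at the near-certain end, where resolving it teaches us very little, whereas a threshold chosen so that forecasters are close to 50% extracts the most information.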
Non-incentive compatible questions
Questions where the answer that would score highest on some scoring metric is different from the forecaster’s true belief.
Examples
Forecasting “Will the world end in 2024?” as 1% (or whatever else is the minimum for the given platform), because for any higher number you wouldn’t be around to cash out the rewards of your calibration.
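The incentive problem can be sketched numerically. The example below assumes a Brier-style reward (one minus the squared error) and a platform that only accepts whole-percent forecasts from 1% to 99%. Both are assumptions made for illustration, as are the function names, and they do not describe any particular platform’s scoring.

```python
def expected_reward(report: float, belief: float, paid_if_event: bool = True) -> float:
    """Expected Brier-style reward for reporting `report` when the true belief is `belief`.

    If `paid_if_event` is False, nothing can be collected in the world where
    the event happens (e.g. because the forecaster is no longer around).
    """
    reward_if_event = 1 - (1 - report) ** 2
    reward_if_no_event = 1 - report ** 2
    collected_if_event = reward_if_event if paid_if_event else 0.0
    return belief * collected_if_event + (1 - belief) * reward_if_no_event


def best_report(belief: float, paid_if_event: bool = True) -> float:
    """Whole-percent report (1% to 99%) that maximises the expected reward."""
    grid = [i / 100 for i in range(1, 100)]
    return max(grid, key=lambda q: expected_reward(q, belief, paid_if_event))


# With an ordinary question, the best report equals the true belief.
print(best_report(0.2))
# If rewards cannot be collected when the event happens, the best report
# collapses to the platform minimum, whatever the true belief is.
print(best_report(0.2, paid_if_event=False))
```

In the second case the expected reward is proportional to 1 - report², which only decreases as the report rises, so the scoring rule stops eliciting the forecaster’s true belief.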
Question database
We’re releasing a database of 76 questions about AI progress, together with detailed resolution conditions, categorisations and several spreadsheets compiling outside views and data.
We make these questions freely available for use by any forecasting project (under an open-source MIT license).
The database has grown organically through our work on Metaculus AI, and several of the questions have associated quantitative forecasts and discussion on that site. The resolution conditions have often been honed and improved through interaction with forecasters. Moreover, where questions stem from elsewhere, such as the BERI open-source question set or the AI Index, we’ve often spent a substantial amount of time improving resolution conditions that were lacking.
Of particular interest might be the column “Robustness”, which tracks our overall estimate of the quality of questions -- that is, the extent to which they avoid suffering from the failure modes listed above. For example, the question:
By 2021, will a neural model reach >=70% performance on a high school mathematics exam?
is based on a 2019 DeepMind paper and is, on the face of it, liable to several of the failure modes above.
Yet its robustness is rated as “high”, since we have specified a detailed resolution condition:
By 2021, will there EITHER…
1. … be a credible report of a neural model with a score of >=70% on the task suite used in the 2019 DeepMind paper…
2. OR be judged by a council of experts that it’s 95% likely such a model could be implemented, were a sufficiently competent lab to try…
3. OR be a neural model with performance on another benchmark judged by a council of experts to be equally impressive, with 95% confidence?
The question can still fail in winner’s curse/Goodharting-style cases, where the best performing algorithm on a particular benchmark overestimates progress in that domain, simply because selecting for benchmark performance also selects for overfitting to the benchmark as opposed to mastering the underlying challenge. We don’t yet have a good default way of resolving such questions in a robust manner.
How to contribute to the database
We welcome contributions to the database. Airtable (the software where we’re hosting it) doesn’t allow for comments, so if you have a list of edits/additions you’d like to make, please email email@example.com and we can make you an editor.