Are Metaculus AI Timelines Inconsistent?
post by Chris_Leong · 2024-01-02T06:47:18.114Z · LW · GW · 7 comments
The Metaculus prediction markets on AI timelines are the most referenced ones that I've seen.
The two main ones are as follows:
When will the first weakly general AI system be devised, tested, and publicly announced?: Current Community Prediction: April 14th, 2026
Criteria:
- Able to reliably pass a Turing test of the type that would win the Loebner Silver Prize.
- Able to score 90% or more on a robust version of the Winograd Schema Challenge, e.g. the "Winogrande" challenge or comparable data set for which human performance is at 90+%
- Be able to score in the 75th percentile (as compared to the corresponding year's human students; this was a score of 600 in 2016) on the full mathematics section of a circa-2015-2020 standard SAT exam, using just images of the exam pages and having less than ten SAT exams as part of the training data. (Training on other corpora of math problems is fair game as long as they are arguably distinct from SAT exams.)
- Be able to learn the classic Atari game "Montezuma's Revenge" (based on just visual inputs and standard controls) and explore all 24 rooms based on the equivalent of less than 100 hours of real-time play (see the closely related question).
When will the first general AI system be devised, tested, and publicly announced?: Current Community Prediction: January 26th, 2031
Criteria:
- Able to reliably pass a 2-hour, adversarial Turing test during which the participants can send text, images, and audio files (as is done in ordinary text messaging applications) during the course of their conversation. An 'adversarial' Turing test is one in which the human judges are instructed to ask interesting and difficult questions, designed to advantage human participants, and to successfully unmask the computer as an impostor. A single demonstration of an AI passing such a Turing test, or one that is sufficiently similar, will be sufficient for this condition, so long as the test is well-designed to the estimation of Metaculus Admins.
- Has general robotic capabilities, of the type able to autonomously, when equipped with appropriate actuators and when given human-readable instructions, satisfactorily assemble a (or the equivalent of a) circa-2021 Ferrari 312 T4 1:8 scale automobile model. A single demonstration of this ability, or a sufficiently similar demonstration, will be considered sufficient.
- High competency across a diverse range of fields of expertise, as measured by achieving at least 75% accuracy on every task and 90% mean accuracy across all tasks in the Q&A dataset developed by Dan Hendrycks et al.
- Able to get top-1 strict accuracy of at least 90.0% on interview-level problems found in the APPS benchmark introduced by Dan Hendrycks, Steven Basart et al. Top-1 accuracy is distinguished, as in the paper, from top-k accuracy in which k outputs from the model are generated, and the best output is selected.
Subtracting the two community medians gives a gap of roughly 57 months.
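For concreteness, here is the arithmetic behind that figure (a minimal sketch in Python; the two dates are the community medians quoted above, and 30.44 days is just the average month length):

```python
from datetime import date

# Community medians quoted above
weak_agi = date(2026, 4, 14)    # weakly general AI question
general_ai = date(2031, 1, 26)  # general AI question

gap_days = (general_ai - weak_agi).days
gap_months = gap_days / 30.44   # average days per month
print(f"{gap_days} days ≈ {gap_months:.1f} months")  # 1748 days ≈ 57.4 months
```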
However, there is a market focusing just on the difference between the two, and it has a substantially lower prediction.
After a (weak) AGI is created, how many months will it be before the first superintelligent AI is created?: Current Community Prediction: 25.82 months
Why does this differ from the figure we got before? Is this an inconsistency or is there an important difference between the two markets?
I tried looking at this difference market and confirmed that it uses the same definition of strong AI, but I'm unsure what definition it uses for weak AGI.
7 comments
comment by Gordon Seidoh Worley (gworley) · 2024-01-02T19:00:39.415Z · LW(p) · GW(p)
One theory: the pool of Metaculus traders is sufficiently small, and insufficiently motivated, that there may be unexploited arbitrage opportunities, and you are correctly detecting this.
comment by exmateriae (Sefirosu) · 2024-02-28T19:38:04.414Z · LW(p) · GW(p)
I'm coming very late to this, but it is also possible that the people forecasting on each question are noticeably different and have different ideas about AI. Maybe some just don't know about the other questions. That won't explain everything, but it could be a factor.
Also, it is difficult to keep all your forecasts up to date. You can forget.
comment by cctt7777ttcc · 2024-01-03T17:34:32.153Z · LW(p) · GW(p)
Part of the story could be that the median is not linear: there is no guarantee that the median of the difference between two distributions equals the difference of their medians.
↑ comment by Chris_Leong · 2024-01-03T22:56:34.130Z · LW(p) · GW(p)
So you’re saying that Metaculus incentivises betting based on the median, not the average?
↑ comment by cctt7777ttcc · 2024-01-04T22:13:38.694Z · LW(p) · GW(p)
I don't mean to say that, but maybe it's entailed. (I don't know.)
I'm saying even if it were the same group of forecasters who forecast all three questions, and even if every one of those forecasters is internally consistent with themselves, it will not necessarily be the case that the median for the "how many months" question will be equal to the difference between the medians of the other two questions.
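A minimal simulation sketch of the point (the lognormal distributions here are made up purely for illustration, not fitted to the actual Metaculus forecasts):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical, internally consistent forecasts (illustrative only):
# months until weak AGI, plus the further gap until strong AI.
t_weak = rng.lognormal(mean=3.0, sigma=0.8, size=n)  # months until weak AGI
gap = rng.lognormal(mean=2.5, sigma=1.2, size=n)     # further months to strong AI
t_strong = t_weak + gap

diff_of_medians = np.median(t_strong) - np.median(t_weak)
median_of_diff = np.median(gap)
print(diff_of_medians, median_of_diff)  # the two numbers come apart
```

Every sampled "forecaster" here is internally consistent (t_strong is always t_weak plus the gap), yet the median of the gap disagrees with the difference of the medians — the same shape of discrepancy as the one in the post.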
↑ comment by Chris_Leong · 2024-01-14T12:49:27.444Z · LW(p) · GW(p)
Sure. But my question was whether we should take the Metaculus timelines as indicative of the forecasters' averages or their medians.
comment by Alex K. Chen (parrot) (alex-k-chen) · 2024-01-02T22:29:19.564Z · LW(p) · GW(p)
I mean, is there a way to factor the quality of the forecasters into the predictions? As the number of forecasters expands, the average forecaster quality goes down. Like how the markets were extremely overconfident (and wrong) about the Russians conquering Kiev...