Can AI Outpredict Humans? Results From Metaculus's Q3 AI Forecasting Benchmark

post by ChristianWilliams · 2024-10-10T18:58:46.041Z · LW · GW · 2 comments

This is a link post for https://www.metaculus.com/notebooks/28784/aibq3results/

2 comments

Comments sorted by top scores.

comment by gwern · 2024-10-10T21:28:24.112Z · LW(p) · GW(p)

"It is worth noting that the Pros made more extreme forecasts than the bots. The Pros were not afraid to forecast less than 2% or more than 90%, while the bots stayed closer to 50% with their forecasts."

This sounds like an example of 'flattened logits' or loss of calibration in tuned models. I take it that all of the models involved were the usual RLHF/instruction-tuned models, and no efforts were made to use base models like the original davincis or llama-3-405b-base, which ought to have better calibration?

Replies from: ChristianWilliams
comment by ChristianWilliams · 2024-10-14T22:48:04.592Z · LW(p) · GW(p)

Hi @gwern, we are currently in the process of combing through winners' documentation of their bots and which models they used. We haven't yet encountered anyone who claims to have used one of the base models.

We will share here if we learn a participant did indeed use one.