How well did Manifold predict GPT-4?

post by David Chee (david-chee) · 2023-03-15T23:19:06.477Z · LW · GW · 5 comments

Contents

  How well did we predict the launch date?
  Insider Trading
  What else are people predicting about GPT-4?
  Markets you can still predict on

GPT-4 is already here!! Who could have seen that coming… oh wait, Manifold (kinda) did? 😅

I thought I’d write a short piece on how Manifold Markets was used to predict the launch of GPT-4 and its attributes, covering both its successes and its failures. Disclaimer: I work at Manifold.

How well did we predict the launch date?

Throughout the end of last year, people were bullish on a quick release, but that optimism began to decline as we entered the start of this year.

The first spike in February corresponds to the release of Bing’s chatbot, which people speculated was running GPT-4. Turns out it actually was! OpenAI did a fantastic job of concealing this, though, with our market on it hovering at a stubborn 50-60%.

There was a lot of uncertainty about whether GPT-4 would be released before March. However, on the 9th of March, Microsoft Germany CTO Andreas Braun mentioned at an AI kickoff event that its release was imminent, which caused the market to jump.

Although the market graphs are a beautiful representation of hundreds of traders’ predictions, did they actually give us any meaningful information? One thing that stands out about these graphs in particular is the strong bets away from the baseline towards YES throughout February. Is this just noise, or is something more going on?

Insider Trading

Being the socialite I am, I go to a whopping one (1) social gathering a month!! At 100% of these (the SF Manifold Markets party and Nathan Young’s Thursday dinner), I spoke to someone who claimed they were trading on the GPT-4 markets based on privileged insider information.

One of them got burnt: allegedly there were delays from the planned launch, and they had gone all-in on GPT-4 being released by a certain date.

I love that people with privileged information are able to safely contribute to public forecasts, which wouldn’t be possible without a site like Manifold Markets. As they were trading from anonymous accounts, I have no way of knowing whether they were the ones responsible for the large YES bets, but I suspect some of them were. That said, someone with insider knowledge would be better off placing a large limit order to buy YES just above the current baseline, which would exert strong pressure to hold the market at or slightly above its current probability. Placing a large market order, which causes the spikes, gives them less profit than they could otherwise have earned.
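To make the market-order-vs-limit-order point concrete, here is a minimal sketch. The constant-product pool, the numbers, and both function names are illustrative assumptions, not Manifold's actual matching engine:

```python
def market_buy_yes(prob, amount, liquidity=100.0):
    """Toy constant-product market maker: buying YES with a market order
    moves the displayed probability, and bigger orders move it more
    (the visible 'spikes' on the graph)."""
    yes_pool = liquidity * (1.0 - prob)    # YES shares held by the pool
    no_pool = liquidity * prob             # NO shares held by the pool
    k = yes_pool * no_pool                 # invariant product
    no_pool += amount                      # trader's mana goes in...
    yes_pool = k / no_pool                 # ...pool rebalances to keep k
    return no_pool / (yes_pool + no_pool)  # new implied P(YES)

def held_probability(prob_after_sells, limit_prob):
    """A standing limit buy-YES order at `limit_prob` acts as a floor:
    sell pressure fills against it instead of moving the price, so the
    market holds at the limit until the order is exhausted."""
    return max(prob_after_sells, limit_prob)
```

The insider's limit order quietly absorbs opposing trades at a fixed price, while a market order announces itself by jumping the probability, and pays a worse average price for doing so.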

What else are people predicting about GPT-4?

Jacy Reese Anthis, an American social scientist at the Sentience Institute, created a market on whether credible individuals with expertise in the space will claim GPT-4 is sentient. 16% seems surprisingly high to me, but the market has only just been created and needs more traders. Go now and place your bets!

One of our most popular markets, which failed in spectacular fashion, was on whether it would get the Monty Fall problem correct (note: this is not the same as the Monty Hall problem; click through to the market description for an explanation).

This might be the single most consistently upward-trending market I have ever seen on our site. I wonder how much further it would have continued to trend upwards before plateauing, had GPT-4 not been released.

Part of the confidence came from Bing’s success in answering correctly when set to precise mode. Many speculated GPT-4 was going to be even more powerful than Bing, even though they turned out to be the same model. I’m not exactly sure what difference the “precise” setting makes; if anyone knows, let me know!

Markets you can still predict on

Here are some more open markets for you to go trade in. It’s free and uses play money!

Thanks for reading! I hope it was interesting to see the trends on Manifold, even if this wasn’t a particularly in-depth analysis.

5 comments


comment by gwern · 2023-03-16T01:08:21.007Z · LW(p) · GW(p)

Part of the confidence came from Bing’s success in answering correctly when set to precise mode. Many speculated GPT-4 was going to be even more powerful than Bing, even though they turned out to be the same model. I’m not exactly sure what difference the “precise” setting makes; if anyone knows, let me know!

Based on Mikhail's Twitter comments, 'precise' and 'creative' don't seem to be too much more than simply the 'temperature' hyperparameter for sampling. 'Precise' would presumably correspond to very low, near-zero or zero, highly deterministic samples.

The ChatGPT interface to GPT-4 doesn't let you control temperature at all, so it's possible that its 'mistakes' are due to its hidden temperature being too randomized and it committing to bad answers early (which is a very common issue with doing Q&A with 'one right answer' type questions)... However, I see people are mentioning results like 0/7 and 0/10, so that may not be it. I would expect too-high temps to get it right a decent fraction of the time. It would be very interesting if 'Monty Fall' turned out to be another example of the RLHF skewing GPT-4's calibration badly compared to the baseline model, which they report about other things.
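For readers unfamiliar with the temperature hyperparameter gwern is referring to, here is a minimal sketch of temperature scaling over logits. It illustrates the general technique only; it is not Bing's or OpenAI's actual sampling code, and the near-zero cutoff is an arbitrary choice:

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=None):
    """Pick a token index from raw logits after temperature scaling.
    Temperature near zero approaches deterministic argmax (plausibly a
    'precise' mode); high temperature flattens the distribution so
    lower-probability tokens get sampled more often."""
    if temperature <= 1e-6:  # treat ~0 as fully deterministic
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max before exp for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    r = rng.random() * sum(weights)
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if r < cumulative:
            return i
    return len(logits) - 1  # guard against floating-point round-off
```

A too-high hidden temperature would occasionally commit the model to a wrong first token on "one right answer" questions, which is the failure mode gwern describes.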

comment by bayesed · 2023-03-16T04:27:41.448Z · LW(p) · GW(p)

Based on Mikhail's Twitter comments, 'precise' and 'creative' don't seem to be too much more than simply the 'temperature' hyperparameter for sampling. 'Precise' would presumably correspond to very low, near-zero or zero, highly deterministic samples.

Nope, Mikhail has said the opposite: https://twitter.com/MParakhin/status/1630280976562819072

Nope, the temperature is (roughly) the same.

So I'd guess the main difference is in the prompt.

comment by gwern · 2023-03-16T21:05:40.215Z · LW(p) · GW(p)

That's interesting. Earlier, he was very explicitly identifying temperature with creativity in the Tweets I collated [LW(p) · GW(p)] when commenting about how the controls worked. So now if the temperature is identical but he's calling whatever it is 'creative', he's completely flipped his position on "hallucinations = creativity", apparently.

Hm. So it's the same temperature, but it's more expensive, which has 'longer output, more expressive, slower', requires more context... That could point to it being a different model under the hood. But it could also point to a different approach entirely, like implementing best-of sampling, or perhaps some inner-monologue-like approach like a hidden prompt generating several options and then another prompt to pick "the most creative" one. There were some earlier comments about Sydney possibly having a hidden inner-monologue scratchpad/buffer where it could do a bunch of outputs before returning only 1 visible answer to the user. (This could be parallelized if you generated the n suggestions in parallel and didn't mind the possible redundancy, but is inherently still more serial steps than simply generating 1 answer immediately.) This could be 'pick the most creative one' for creative mode, or 'pick the most correct one' for 'precise' mode, etc. So this wouldn't necessarily be anything new and could have been iterated very quickly (but as he says, it'd be inherently slower, generate longer responses, and be more expensive, and be hard to optimize much more).

This is something you could try to replicate with ChatGPT/GPT-4. Ask it to generate several different answers to the Monty Fall problem, and then ask it for the most correct one.
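The best-of sampling idea described above can be sketched in a few lines. This is only an illustration of the general technique, and `generate` and `score` are hypothetical placeholders for model calls, not a real API:

```python
def best_of_n(prompt, generate, score, n=4):
    """Best-of-n sampling: draw n candidate answers, then return the one
    a scoring pass ranks highest. Inherently more work than emitting one
    answer directly, consistent with a chat mode that is slower, pricier,
    and produces longer output."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```

In the hypothetical hidden-prompt version, `score` would itself be a model call ("pick the most creative one" or "pick the most correct one"), which is what would let the same machinery serve both modes.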

comment by bayesed · 2023-03-16T15:57:20.368Z · LW(p) · GW(p)

Additional comments on creative mode by Mikhail (from today):

https://twitter.com/MParakhin/status/1636350828431785984

We will {...increase the speed of creative mode...}, but it probably [will] always be somewhat slower, by definition: it generates longer responses, has larger context.

https://twitter.com/MParakhin/status/1636352229627121665

Our current thinking is to keep maximum quality in Creative, which means slower speed.

https://twitter.com/MParakhin/status/1636356215771938817

Our current thinking about Bing Chat modes:
Balanced: best for the most common tasks, like search, maximum speed
Creative: whenever you need to generate new content, longer output, more expressive, slower
Precise: most factual, minimizing conjectures

So creative mode definitely has larger context size, and might also be a larger model?

comment by [deleted] · 2023-03-16T01:35:52.619Z · LW(p) · GW(p)

Note that two-stage generation (ask it if it’s sure about its answer and use the second response as the output) solves Monty Fall every time I tried.
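The two-stage trick amounts to a second round-trip that feeds the first answer back for review. A minimal sketch, where `ask` is a hypothetical placeholder for a chat-completion call and the follow-up wording is my own assumption:

```python
def two_stage_answer(ask, question):
    """Two-stage generation: get a first answer, then ask the model to
    double-check it, and use the second response as the final output."""
    first = ask([("user", question)])
    followup = "Are you sure about that answer? Give your final answer."
    second = ask([("user", question),
                  ("assistant", first),
                  ("user", followup)])
    return second
```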