DeepSeek beats o1-preview on math, ties on coding; will release weights
post by Zach Stein-Perlman · 2024-11-20T23:50:26.597Z · LW · GW · 6 comments
DeepSeek-R1-Lite-Preview was announced today. It's available via chatbot. (Post; translation of Chinese post.)
DeepSeek says it will release the weights and publish a report.
The model appears to be stronger than o1-preview on math, similar on coding, and weaker on other tasks.
DeepSeek is Chinese. I'm not really familiar with the company. I thought Chinese companies were at least a year behind the frontier; now I don't know what to think and hope people do more evals and play with this model. Chinese companies tend to game benchmarks more than the frontier Western companies, but I think DeepSeek hasn't gamed benchmarks much historically.
The post also shows inference-time scaling, like o1.
Note that o1 is substantially stronger than o1-preview; see the o1 post.
(Parts of this post and some of my comments are stolen from various people-who-are-not-me.)
6 comments
comment by David Matolcsi (matolcsid) · 2024-11-21T00:05:44.755Z · LW(p) · GW(p)
I experimented a bunch with DeepSeek today; in my experiments it seems to be at exactly the same level as o1-preview on high school competition math. So I don't think it's benchmark-gaming, at least in math. On the other hand, it's noticeably worse than even the original GPT-4 at understanding a short story I always test models on.
I think it's also very noteworthy that DeepSeek gives everyone 50 free messages a day (!) with their CoT model, while OpenAI only gives 30 o1-preview messages a week to subscribers. I assume they figured out how to run it much cheaper, but I'm confused in general.
A positive part of the news is that, unlike o1, they show their actual chain of thought, and they promise to make their model open-source soon. I think this is really great for the science of studying faithful chain of thought.
From the experiments I have run, it looks like it is doing clear, interpretable English chain of thought (though with an occasional Chinese character), and it doesn't yet seem to have started evolving into optimized alien gibberish. I think this part of the news is a positive update.
↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-11-21T00:18:12.159Z · LW(p) · GW(p)
Yeah, I really hope they do actually open-weights it because the science of faithful CoT would benefit greatly.
comment by nikola (nikolaisalreadytaken) · 2024-11-21T00:29:21.268Z · LW(p) · GW(p)
One weird detail I noticed is that in DeepSeek's results, they claim GPT-4o's pass@1 accuracy on MATH is 76.6%, but OpenAI claims it's 60.3% in their o1 blog post. This is quite confusing, as it's a large difference that seems hard to explain by different training checkpoints of 4o.
↑ comment by Lech Mazur (lechmazur) · 2024-11-21T05:14:31.156Z · LW(p) · GW(p)
It seems that 76.6% originally came from the GPT-4o announcement blog post. I'm not sure why it dropped to 60.3% by the time of o1's blog post.
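(Context on the metric, not from the thread: pass@1 on MATH is typically estimated by sampling n completions per problem and applying the unbiased pass@k estimator from Chen et al. (2021), so the headline number can shift with the sampling setup, e.g. temperature, number of samples, and prompting. A minimal sketch of that estimator, with made-up example values:)

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: completions sampled per problem
    c: completions that were correct
    k: budget being estimated (k=1 for pass@1)
    """
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Hypothetical example: 16 samples on one problem, 10 correct.
# For k=1 this reduces to c/n = 10/16.
print(pass_at_k(n=16, c=10, k=1))  # 0.625
```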
comment by Zach Stein-Perlman · 2024-11-21T02:40:10.847Z · LW(p) · GW(p)
Claim by SemiAnalysis guy: DeepSeek has over 50,000 H100s. (And presumably they don't spend much compute on inference.)
comment by Zach Stein-Perlman · 2024-11-21T02:40:02.374Z · LW(p) · GW(p)
Some context: June FT article on DeepSeek; parent hedge fund Wikipedia page.