artemium's Shortform
post by artemium · 2024-12-26T14:54:42.222Z · LW · GW · 3 comments
comment by artemium · 2024-12-26T14:54:42.486Z · LW(p) · GW(p)
A new open-source model has been announced by the Chinese lab DeepSeek: DeepSeek-V3. It reportedly outperforms both Sonnet 3.5 and GPT-4o on most tasks and is almost certainly the most capable fully open-source model to date.
Beyond the implications of open-sourcing a model of this caliber, I was surprised to learn that they trained it using only 2,000 H800 GPUs! This suggests that, with an exceptionally competent team of researchers, it’s possible to overcome computational limitations.
Here are two potential implications:
- Sanctioning China may not be effective if they are already capable of training cutting-edge models without relying on massive computational resources.
- We could be in a serious hardware overhang scenario, where we already have sufficient compute to build AGI, and the only limiting factor is engineering talent.
(I am extremely uncertain about this; it was just my reaction after reading about it.)
↑ comment by Vladimir_Nesov · 2024-12-26T16:40:43.808Z · LW(p) · GW(p)
DeepSeek-V3 is a MoE model with 37B active parameters trained for 15T tokens, so at about 400 tokens per parameter it's very overtrained and could have been smarter with similar compute if the hyperparameters were compute-optimal. It's probably the largest model known to be trained in FP8; it extracts 1.4x more compute per H800 than most models trained in BF16 get from an H100, for about 6e24 FLOPs total[1], about as much as Llama-3-70B. And it activates 8 routed experts per token (out of 256 total routed experts), which a Feb 2024 paper[2] suggests is a directionally correct thing to do (compared to the popular practice of activating only 2 experts), with about 64 experts per token being optimal around 1e24-1e25 FLOPs. Taken together, these advantages predict that it should be smarter than Llama-3-70B, if done well.
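A quick numeric check of the ratios above (a minimal Python sketch using only the figures quoted in this comment; none of the numbers are official specs):

```python
# Back-of-the-envelope check of the figures quoted above.
active_params = 37e9       # DeepSeek-V3 active parameters per token
training_tokens = 15e12    # reported training tokens

tokens_per_param = training_tokens / active_params
print(f"tokens per active parameter: {tokens_per_param:.0f}")  # ~405, i.e. the "400 tokens per parameter"

# Expert routing: 8 activated out of 256 total routed experts per token.
activated_experts, routed_experts = 8, 256
print(f"fraction of routed experts active: {activated_experts / routed_experts:.3f}")  # ~0.031
```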
Models that are smarter than Llama-3-70B can show impressive benchmark performance that then doesn't cash out in the hard-to-operationalize impression of being as smart as Claude 3.5 Sonnet. The jury is still out, but it's currently available even in Direct Chat on Chatbot Arena, so there will be more data on this soon. It would be shocking if a 37B-active-parameter model actually manages that, though.
[1] H800 seems to produce 1.4e15 dense FP8 FLOP/s, the model was trained for 2.8e6 H800-hours, and I'm assuming 40% compute utilization (this arithmetic is sketched in code below). ↩︎
[2] That same paper estimates the compute multiplier of a compute-optimal MoE at about 20x relative to a dense model (see Figure 1b), which is hard to believe. It's based on experiments of up to about 3e19-4e20 FLOPs per datapoint. Still, the claim that activating many more than 2 experts is better might survive in practice. ↩︎
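The arithmetic behind the ~6e24 FLOPs estimate in footnote [1] can be reproduced directly (a sketch using only the assumptions stated in that footnote):

```python
# Rough training-compute estimate from footnote [1]; all inputs are the footnote's assumptions.
fp8_flops_per_s = 1.4e15   # assumed dense FP8 throughput of one H800, FLOP/s
gpu_hours = 2.8e6          # reported H800-hours of training
utilization = 0.40         # assumed compute utilization

total_flops = fp8_flops_per_s * gpu_hours * 3600 * utilization
print(f"estimated training compute: {total_flops:.1e} FLOPs")  # ~5.6e24, i.e. "about 6e24"
```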
↑ comment by Erich_Grunewald · 2024-12-26T23:41:33.371Z · LW(p) · GW(p)
> The jury is still out, but it's currently available even in Direct Chat on Chatbot Arena, so there will be more data on this soon.
Fyi, it's also available on https://chat.deepseek.com/, as is their reasoning model DeepSeek-R1-Lite-Preview ("DeepThink"). (I suggest signing up with a throwaway email and not inputting any sensitive queries.) From quickly throwing a few requests at it that I'd recently asked 3.5 Sonnet, DeepSeek-V3 seems slightly worse, but nonetheless solid.