Posts
Comments
[minor technical disputes below; ignore if disinterested]
This might be an issue when training on H100 at this scale[1] and explain some scaling difficulties for labs that are not Google, or Anthropic later in 2025 once the Trn2 cluster becomes useful.
Llama 3 405B was trained in minibatches with 2K sequences of 8K tokens, the smallest that 8-GPU scale-up domains of a 16K H100 cluster enable. If it was clearly optimal for minibatches to be larger, it's trivial to make it so, so they are probably already too large.
I'm a bit confused by this part. I believe the l3 paper indicates the training seqlen was increased mid-training.
In general, I don't understand linking scaling difficulties to max scale-up world size. I believe the bandwidth/latency of IB H100 clusters does not present a hard problem for current hyperscalers on other parallelisms.
For H100, that's only 8 GPUs in the standard configuration that seems to be used everywhere. For TPUv6e, that's a whole 256-chip pod, and this wasn't a constraint in older TPUs either. For Trn2, that's either 16 or 64 GPUs in either standard or Ultra variants.
I think it's plausible the combination of torus topology + poor PCIe5.0 bw/latency will make a full TP=64 Trn2 config underform your expectations, but we may have to wait for Semianalysis to provide good numbers on this.