Is the speed of training large models going to increase significantly in the near future due to Cerebras Andromeda?

post by Amal (asta-vista) · 2022-11-15T22:50:22.968Z · LW · GW · No comments

This is a question post.

Contents

  Answers
    10 jacob_cannell
    1 Razied

Cerebras recently unveiled Andromeda - https://www.cerebras.net/andromeda/, an AI supercomputer that enables near linear scaling. Do I understand correctly that this might have a big impact on large (language) model research, since it would significantly speed up training? E.g. if current models take 30+ days to train, could we just 10x the number of machines and have it done in three days? Also, it seems to be much simpler to use, thus decreasing the cost of development and the hassle of distributed computing.

If so, I think it's almost certain that large companies would do it, and this in turn would significantly speed up the research/training/algorithm development of large models such as GPT, GATO and similar. It seems like this type of development should affect the discussion about timelines; however, I haven't seen it mentioned anywhere else.

Answers

answer by jacob_cannell · 2022-11-16T01:24:59.172Z · LW(p) · GW(p)

This doesn't seem impressive compared to Nvidia's offerings.

The Andromeda 'supercomputer' has peak performance of 120 pflops dense compared to 512 pflops dense for a single 256 H100 GPU pod from nvidia and is unlikely to be competitive in compute/$; if it was competitive Cerebras would be advertising/boasting that miracle as loudly as they could. Instead they are focusing on this linear scaling thing, which isn't an external performance comparison at all.

The Cerebras wafer-scale chip is a weird architecture that should excel in the specific niche of training small models at high speed, but that just isn't where the industry is going. It is severely lacking in the large cheap fast off-chip RAM that GPUs have: this is a key distinguishing feature of the GPU architecture, combined with the hierarchical cache/networking topology.

In fact I'd argue that having linear scaling is a bad sign: it indicates you haven't achieved the level of detailed optimization that physics allows. Longer range interconnect is fundamentally physically more expensive and the optimal compute architectures will reflect that cost structure. Local compute is physically cheaper so the ideal architecture should charge software less for it (make more available at the same price) vs long range compute.

comment by Zach Furman (zfurman) · 2022-11-16T03:30:30.981Z · LW(p) · GW(p)

The Andromeda 'supercomputer' has peak performance of 120 pflops dense compared to 512 pflops dense for a single 256 H100 GPU pod from nvidia

I'm not sure if PFLOPs are a fair comparison here though, if I understand Cerebras' point correctly. Like, if you have ten GPUs with one PFLOP each, that's technically the same number of PFLOPs as a single GPU with ten PFLOPs. But actually that single GPU is going to train a lot faster than the ten GPUs because the ten GPUs are going to have to spend time communicating with each other. Especially as memory limitations make you resort to tensor or pipeline parallelism instead of data parallelism. Cerebras claims that to train "10 times faster you need 50 times as many GPUs."

According to this logic what you really care about instead is probably training speed or training speedup per dollar. Then the pitch for Andromeda, unlike a GPU pod, is that those 120 PFLOPS are "real" in the sense that training speed increases linearly with the PFLOPS.
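To make the "training speedup per unit of hardware" idea concrete, here is a minimal sketch of scaling efficiency: the speedup divided by the factor of extra hardware. The function name is my own, and the only input taken from the thread is the quoted Cerebras claim that training 10 times faster needs 50 times as many GPUs; ideal linear scaling corresponds to an efficiency of 1.0.

```python
def scaling_efficiency(speedup: float, hardware_multiplier: float) -> float:
    """Fraction of the ideal (linear) speedup that is actually achieved."""
    if hardware_multiplier <= 0:
        raise ValueError("hardware_multiplier must be positive")
    return speedup / hardware_multiplier


# Quoted Cerebras claim about GPU clusters: 10x faster needs 50x the GPUs.
gpu_claim = scaling_efficiency(speedup=10, hardware_multiplier=50)

# Perfectly linear scaling: 10x faster with 10x the machines.
linear = scaling_efficiency(speedup=10, hardware_multiplier=10)

print(f"claimed GPU efficiency: {gpu_claim:.0%}")
print(f"linear scaling efficiency: {linear:.0%}")
```

This only restates the claim as a ratio; it says nothing about whether the claim holds for any particular model or cluster.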

The Cerebras wafer-scale chip is a weird architecture that should excel in the specific niche of training small models at high speed, but that just isn't where the industry is going. It is severely lacking in the large cheap fast off-chip RAM that GPUs have

I'm not sure I totally have a good grasp on this, but isn't this the whole point of Andromeda's weight streaming system? Fast off-chip memory combined with high memory bandwidth on the chip itself? Not sure what would limit this to small models if weights can be streamed efficiently, as Cerebras claims.

Even if I'm right, I'm not sure either of these points change the overall conclusion though. I'd guess Cerebras still isn't economically competitive or they'd be boasting it as you said.

Replies from: jacob_cannell
comment by jacob_cannell · 2022-11-16T04:04:01.914Z · LW(p) · GW(p)

But actually that single GPU is going to train a lot faster than the ten GPUs because the ten GPUs are going to have to spend time communicating with each other. Especially as memory limitations make you resort to tensor or pipeline parallelism instead of data parallelism.

Well that's not quite right - otherwise everyone would be training on single GPUs using very different techniques, which is not what we observe. Every parallel system has communication, but it doesn't necessarily 'spend time' on that in the blocking sense, it typically happens in parallel with computation.
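As a rough illustration of the difference between blocking and overlapped communication, here is an idealized sketch. It assumes communication either fully serializes with compute or fully overlaps with it; real systems sit somewhere in between, and the per-step costs below are hypothetical, not measurements.

```python
def step_time_blocking(compute_s: float, comm_s: float) -> float:
    """Communication waits for compute to finish, so the two costs add."""
    return compute_s + comm_s


def step_time_overlapped(compute_s: float, comm_s: float) -> float:
    """Communication runs fully in parallel, so the slower of the two dominates."""
    return max(compute_s, comm_s)


# Hypothetical per-step costs in seconds (illustrative only).
compute_s, comm_s = 1.0, 0.4

blocking = step_time_blocking(compute_s, comm_s)
overlapped = step_time_overlapped(compute_s, comm_s)
print(f"blocking: {blocking:.1f}s, overlapped: {overlapped:.1f}s")
```

Under full overlap, communication only costs wall-clock time once it takes longer than the compute it hides behind.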

SOTA models do now often seem limited by RAM, so model parallelism is increasingly important as it is RAM efficient. This is actually why Cerebras's strategy doesn't make sense: GPUs are optimized heavily for the sweet spot in terms of RAM capacity/$ and RAM bandwidth. The wafer scale approach instead tries to use on-chip SRAM to replace off-chip RAM, which is just enormously more expensive - at least an OOM more expensive in practice.

Then the pitch for Andromeda, unlike a GPU pod, is that those 120 PFLOPS are "real" in the sense that training speed increases linearly with the PFLOPS.

This of course is bogus, because with model parallelism you can tune the interconnect requirements based on the model design, and Nvidia has been tuning their interconnect tradeoffs for years in tandem with researchers co-tuning their software/models for Nvidia hardware. So current training setups are not strongly limited by interconnect vs other factors: some probably are, and some underutilize interconnect and are limited by something else. But Nvidia knows all of this, has all that data, and has been optimizing for these use cases weighted by value for years now (and is empirically better at this game than anybody else).

Fast off-chip memory combined with high memory bandwidth on the chip itself?

The upside of a wafer scale chip is fast on-chip transfer; the downside is slower off-chip transfer (as that is limited by the 2D perimeter of the much larger chip). For equal flops and/or $, the GPU design of breaking up the large tile into alternating logic and RAM subsections has higher total off-chip RAM and off-chip transfer bandwidth.

The more ideal wafer design would be one where you had RAM stacked above in 3D, but Cerebras doesn't do that, presumably because they need that whole surface for heat transfer. If you look inside the engine block of the CS-2 from their nice virtual tour you can see that the wafer is sandwiched directly between the massive voltage regulator array that pumps in power and the cooling system that pumps out heat. There is no off-chip RAM next to that wafer; the off-chip RAM access all has to go through the long range IO modules on the edge of the chip.

So a single CS-2 - even though it has the cost and nearly the flops you'd expect of the equivalent GPU die area of 100 individual GPUs - has only 40GB of RAM: half the 80GB of an A100 or H100, less even than an RTX A6000! So it has over 100x less RAM than an equivalent size (cost, flops, die area) GPU system. Worse yet it has only a pathetic 150GB/s of IO bandwidth out to any external RAM or SSD, vs the 3TB/s RAM bandwidth per H100 GPU, so you can't supplement with external RAM.
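The ratios in this paragraph can be reproduced with a few lines of arithmetic. This is only a sketch using the figures quoted above (40GB and 150GB/s for the CS-2, 80GB and 3TB/s per GPU, and a CS-2 treated as equivalent to 100 GPUs); the constant names are my own, and like the paragraph it compares the CS-2's external IO bandwidth against aggregate GPU RAM bandwidth.

```python
# Figures as quoted in the comment above.
CS2_RAM_GB = 40            # CS-2 memory
CS2_IO_GB_PER_S = 150      # CS-2 bandwidth to external RAM/SSD
GPU_RAM_GB = 80            # A100/H100 memory
GPU_RAM_GB_PER_S = 3000    # H100 RAM bandwidth (3TB/s)
EQUIVALENT_GPUS = 100      # GPUs treated as equivalent to one CS-2

pod_ram_gb = GPU_RAM_GB * EQUIVALENT_GPUS
pod_bw_gb_per_s = GPU_RAM_GB_PER_S * EQUIVALENT_GPUS

ram_ratio = pod_ram_gb / CS2_RAM_GB
bw_ratio = pod_bw_gb_per_s / CS2_IO_GB_PER_S

print(f"RAM: {pod_ram_gb} GB vs {CS2_RAM_GB} GB ({ram_ratio:.0f}x)")
print(f"bandwidth: {pod_bw_gb_per_s} GB/s vs {CS2_IO_GB_PER_S} GB/s ({bw_ratio:.0f}x)")
```

With these inputs the RAM gap is 200x and the bandwidth gap is 2000x.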

This machine is an autistic savant. It maxes out local on chip interconnect (which GPUs aren't strongly constrained by) at the expense of precious RAM. So like I said it's only really good for running small models (which fit in 40GB) at very high speeds.

Replies from: asta-vista
comment by Amal (asta-vista) · 2022-11-16T10:21:25.313Z · LW(p) · GW(p)

I am certainly not an expert, but I am still not sure about your claim that it's only good for running small models. The main advantage they claim to have is "storing all model weights externally and stream them onto each node in the cluster without suffering the traditional penalty associated with off chip memory. weight streaming enables the training of models two orders of magnitude larger than the current state-of-the-art, with a simple scaling model." (https://www.cerebras.net/product-cluster/ , weight streaming). So they explicitly claim that it should perform well with large models.
 

Furthermore, in their white paper (https://f.hubspotusercontent30.net/hubfs/8968533/Virtual%20Booth%20Docs/CS%20Weight%20Streaming%20White%20Paper%20111521.pdf), they claim that the CS-2 architecture is much better suited for sparse models (e.g. per the Lottery Ticket Hypothesis), and on page 16 they show that a sparse GPT-3 could be trained in 2-5 days.

This would also align with tweets by OpenAI that "Trillion is the new billion", and with rumors about the new GPT-4 being a similarly big jump as GPT-2 -> GPT-3 was - having a colossal number of parameters and a sparse paradigm (https://thealgorithmicbridge.substack.com/p/gpt-4-rumors-from-silicon-valley). I could imagine that sparse parameters deliver much stronger results than normal parameters, and this might change scaling laws a bit.

Replies from: jacob_cannell
comment by jacob_cannell · 2022-11-16T18:52:21.722Z · LW(p) · GW(p)

The main advantage they claim to have is "storing all model weights externally and stream them onto each node in the cluster without suffering the traditional penalty associated with off chip memory. weight streaming enables the training of models two orders of magnitude larger than the current state-of-the-art, with a simple scaling model."

This is almost a joke, because the equivalent GPU architecture has both greater total IO bandwidth to any external SSD/RAM array, and the massive near-die GPU RAM that can function as a cache for any streaming approach. So if streaming works as well as Cerebras claims, GPUs can do that as well or better.

I agree sparsity (and also probably streaming) will be increasingly important; I've actually developed new techniques for sparse matrix multiplication on GPUs.

Replies from: zfurman
comment by Zach Furman (zfurman) · 2022-11-16T22:39:45.335Z · LW(p) · GW(p)

So if streaming works as well as Cerebras claims, GPUs can do that as well or better.

Hmm, I'm still not sure I buy this, after spending some more time thinking about it. GPUs can't stream a matrix multiplication efficiently, as far as I'm aware. My understanding is that they're not very good at matrix-vector operations compared to matrix-matrix because they rely on blocked matrix multiplies to efficiently use caches and avoid pulling weights from RAM every time.

Cerebras says that the CS-2 is specifically designed for fast matrix-vector operations, and uses dataflow scheduling, so it can stream a matrix multiplication by just performing matrix-vector operations as weights stream in. And the weights are getting streamed from external RAM, rather than requested as needed, so there's no round-trip latency gunking up the works like a GPU has when it wants data from RAM.
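One rough way to see the matrix-vector vs matrix-matrix distinction is arithmetic intensity, counted here as floating-point operations per weight loaded. The sketch below assumes every weight is read from memory exactly once per multiply and ignores activation traffic; the function name and matrix sizes are illustrative, not taken from Cerebras or Nvidia material.

```python
def flops_per_weight_loaded(rows: int, cols: int, batch: int) -> float:
    """FLOPs per weight for a (rows x cols) weight matrix times a (cols x batch) input."""
    flops = 2 * rows * cols * batch  # one multiply and one add per weight per input column
    weights_loaded = rows * cols     # each weight is read once
    return flops / weights_loaded


matvec = flops_per_weight_loaded(4096, 4096, batch=1)    # matrix-vector
matmat = flops_per_weight_loaded(4096, 4096, batch=512)  # matrix-matrix

print(f"matrix-vector: {matvec:.0f} FLOPs per weight loaded")
print(f"matrix-matrix (batch 512): {matmat:.0f} FLOPs per weight loaded")
```

In this simplified model, a matrix-vector product does 2 FLOPs per weight fetched regardless of matrix size, so its speed is tied to how fast weights arrive, while batching amortizes each fetched weight over many inputs.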

I agree sparsity (and also probably streaming) will be increasingly important; I've actually developed new techniques for sparse matrix multiplication on GPUs.

Cerebras claims that their hardware support for fast matrix-vector multiplication gives a 10x speed boost to multiplying sparse matrices, which could be helpful.

Replies from: jacob_cannell
comment by jacob_cannell · 2022-11-17T01:36:36.055Z · LW(p) · GW(p)

GPUs can't stream a matrix multiplication efficiently, as far as I'm aware. [...] And the weights are getting streamed from external RAM

Of course GPUs can and do stream a larger matrix multiplication from RAM - the difference is that the GPU design has multiple OOM more bandwidth to the equivalent external RAM (about 3 OOM to be more specific). Also the latest Lovelace/Hopper GPUs have more SRAM now - 50MB per chip, so about 10GB of SRAM for a 200 GPU pod similar to the Cerebras wafer.

The CS-2 is only good at matrix-vector operations that fit in its SRAM capacity. As a thought experiment, consider running a brain-like ANN with 10B neurons and 10T sparse weights. Simulating one second of activity requires only on the order of 10T sparse ops, or a couple OOM more dense ops, which is already within current single GPU capability. The problem is that streaming in the 10TB of weight data would take over a minute on the CS-2's pathetically slow IO path. Meanwhile the equivalently priced 200 GPU pod can fit the weights in GPU RAM and has the performance to simulate about a hundred instances of that brain sized model in real time, so about 10000x higher performance than the CS-2.
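The streaming estimate in this thought experiment can be checked directly. This sketch uses only figures quoted in the thread (10TB of weights, 150GB/s of CS-2 IO, 80GB per GPU, a 200 GPU pod) and assumes the IO link runs at its quoted peak, so the result is a lower bound on transfer time; the constant names are my own.

```python
WEIGHT_BYTES = 10e12          # 10TB of weight data
CS2_IO_BYTES_PER_S = 150e9    # quoted CS-2 IO bandwidth
GPU_RAM_BYTES = 80e9          # quoted RAM per GPU
POD_GPUS = 200                # pod size used in the thought experiment

# Time for one full pass over the weights through the CS-2 IO path.
stream_seconds = WEIGHT_BYTES / CS2_IO_BYTES_PER_S

# Total pod RAM, and whether the weights fit without streaming.
pod_ram_bytes = GPU_RAM_BYTES * POD_GPUS
weights_fit_in_pod = WEIGHT_BYTES <= pod_ram_bytes

print(f"one pass over the weights: {stream_seconds:.1f} s")
print(f"pod RAM: {pod_ram_bytes / 1e12:.0f} TB, weights fit: {weights_fit_in_pod}")
```

At peak bandwidth one pass takes a bit over a minute, against a budget of one second of simulated activity; sustained rates below peak or wider weight formats would make it longer.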

Weights outnumber activations by 3 or 4 OOM, so moving weights over long distances as in the CS-2 is enormously inefficient compared to moving the activations around (as in the GPU design), which uses very little bandwidth. The future is in the opposite direction of that CS-2 'weight streaming' - towards more optimal neuromorphic computing - where the weights stay in place and the activations flow through them.

Replies from: asta-vista
comment by Amal (asta-vista) · 2022-11-17T15:33:02.513Z · LW(p) · GW(p)

My understanding is that they fully separate computation and memory storage. So while traditional architectures need some kind of cache to store a large amount of data for model partitions, of which just a small portion is used for the computation at any single point in time, the CS-2 only requests what it needs, so the bandwidth doesn't need to be so big.

answer by Razied · 2022-11-16T00:49:38.236Z · LW(p) · GW(p)

Well, it will scale linearly until it hits the finite node-to-node bandwidth limit... just like all other supercomputers. If you have your model training on N different nodes, you still need to share all your weights with all other nodes at some point, which is fundamentally an O(N^2) operation; it just appears linear when you're spending more time computing your weight updates than you are communicating with other nodes. I don't see this really being a qualitative jump, but it might well be one more point to add to the graph of increasing compute power dedicated to AI.

comment by Zach Furman (zfurman) · 2022-11-16T01:25:39.017Z · LW(p) · GW(p)

Hmm, I see how that would happen with other architectures, but I'm a bit confused how this is O(N^2) here? Andromeda has the weight updates computed by a single server (MemoryX) and then distributed to all the nodes. Wouldn't this be a one-to-many broadcast with O(N) transmission time?
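A quick way to compare the two communication patterns is to count point-to-point weight transfers per update. This is a sketch under naive assumptions: every node sends directly to every other node in the all-to-all case, and a single source (MemoryX, as described above) sends one copy per node in the broadcast case. Real systems use smarter collectives, and the function names are my own.

```python
def all_to_all_transfers(n_nodes: int) -> int:
    """Every node sends its weights to every other node."""
    return n_nodes * (n_nodes - 1)


def broadcast_transfers(n_nodes: int) -> int:
    """A single source sends one copy of the weights to each node."""
    return n_nodes


for n in (4, 16, 64):
    print(f"{n} nodes: all-to-all={all_to_all_transfers(n)}, "
          f"broadcast={broadcast_transfers(n)}")
```

The all-to-all count grows quadratically with node count while the broadcast count grows linearly, which is the distinction the question is getting at.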

Replies from: Razied
comment by Razied · 2022-11-16T02:24:47.261Z · LW(p) · GW(p)

You're completely right, I don't know how I missed that, I must be more tired than I thought I was.
