Cerebras Systems unveils a record 1.2 trillion transistor chip for AI

post by avturchin · 2019-08-20T14:36:24.935Z · LW · GW · 4 comments

This is a link post for https://venturebeat.com/2019/08/19/cerebras-systems-unveils-a-record-1-2-trillion-transistor-chip-for-ai/

From reddit comments:

...how scared should we be?


No idea. They don't provide benchmarks, and while they promise they are forthcoming, it sounds like it might be months. In the meantime, there's just vague hype about going from 'months to minutes'. Hardware startups have a long history of overpromising and underdelivering: it's hard to beat Nvidia & Moore's Law (remember all those 'analogue computing' startups?). It sure does sound interesting, though: 18GB of on-chip SRAM rather than HBM or DDR* RAM? 1.2 trillion transistors? Potentially FPGA-style streaming of data points through a single on-chip model with each layer being handled by different sets of cores? Sparsity multipliers? All quite interesting sounding and I will be very interested in the benchmarks, whenever they should be forthcoming. If nothing else, it is an extreme architecture of a type you rarely see.



comment by FactorialCode · 2019-08-20T22:27:23.272Z · LW(p) · GW(p)

Allow me to speculate wildly.

I don't actually think this is going to make that big of a difference, at least for current AI research. The main reason is that I think the main hardware bottlenecks to better AI performance are performance/$, performance/W, and memory bandwidth. This is because, so far, most large-scale DL algorithms have shown almost embarrassingly parallel scaling, and a good amount of time is wasted just saving and loading NN activations for the back-prop algorithm.

This technology probably won't lead to any major improvements in terms of performance/$ or performance/W. Those will have already come from dedicated DL chips such as Google's TPUs, because this is essentially a really big dedicated DL chip. The major place for improvement is memory bandwidth, which, according to the article, is an impressive 9PB per second, 10,000 times more than what's on a V100 GPU. But with only 18GB of RAM, the size of the models that can be trained is going to be severely constrained, so I don't think it will be useful for training better models.
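As a rough sanity check on those figures, here is a back-of-envelope sketch. The 9PB/s and 18GB numbers come from the article; the 900GB/s V100 bandwidth and the 16 bytes per parameter for training (fp32 weights, gradients, and two Adam moment buffers, with activations ignored) are assumptions added here for illustration, not claims from the article.

```python
# Back-of-envelope arithmetic for the bandwidth and memory figures.
PB = 10**15
GB = 10**9

wse_bandwidth = 9 * PB      # bytes/s, as quoted in the article
v100_bandwidth = 900 * GB   # bytes/s, assumed V100 HBM2 figure
wse_sram = 18 * GB          # bytes of on-chip SRAM, as quoted in the article


def max_params(memory_bytes: int, bytes_per_param: int) -> int:
    """Largest parameter count that fits in memory at a given per-parameter cost."""
    return memory_bytes // bytes_per_param


ratio = wse_bandwidth / v100_bandwidth

# fp32 weights only: 4 bytes per parameter.
weights_only = max_params(wse_sram, 4)

# Assumed training footprint: 4 (weights) + 4 (grads) + 8 (two Adam moments)
# = 16 bytes per parameter, before counting any stored activations.
training = max_params(wse_sram, 16)

print(f"bandwidth ratio: {ratio:,.0f}x")
print(f"fp32 weights only: {weights_only:,} params")
print(f"fp32 + Adam state: {training:,} params")
```

Under these assumptions the ceiling is a few billion parameters for weights alone, and closer to one billion once optimizer state is included. Stored activations would lower it further.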

Might be good for inference though.

Replies from: avturchin
comment by avturchin · 2019-08-21T12:17:33.268Z · LW(p) · GW(p)

They also claim increased performance in terms of energy, as they eliminate useless multiplications by zero, which are common in matrix multiplication.
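To illustrate the idea only (this is a toy software sketch, not a description of how the Cerebras hardware implements it), here is a matrix-vector product that counts how many multiplications remain when any product with a zero operand is skipped. The function names and the example values are made up for illustration.

```python
def dense_matvec(matrix, vector):
    """Compute matrix @ vector, performing every multiplication."""
    result, mults = [], 0
    for row in matrix:
        acc = 0.0
        for w, x in zip(row, vector):
            acc += w * x
            mults += 1
        result.append(acc)
    return result, mults


def zero_skipping_matvec(matrix, vector):
    """Compute matrix @ vector, skipping products where either operand is zero."""
    result, mults = [], 0
    for row in matrix:
        acc = 0.0
        for w, x in zip(row, vector):
            if w == 0 or x == 0:
                continue  # the product would be zero, so no work is done
            acc += w * x
            mults += 1
        result.append(acc)
    return result, mults


# Hypothetical weights and a ReLU-style activation vector with zeros.
weights = [[1.0, 2.0, 0.0, 4.0],
           [0.0, 3.0, 5.0, 1.0]]
activations = [0.0, 2.0, 3.0, 0.0]

dense_out, dense_mults = dense_matvec(weights, activations)
sparse_out, sparse_mults = zero_skipping_matvec(weights, activations)

print(dense_out, dense_mults)    # [4.0, 21.0] 8
print(sparse_out, sparse_mults)  # [4.0, 21.0] 3
```

Both versions give the same result, but the zero-skipping one does 3 multiplications instead of 8 in this example. How much is saved in practice depends on how sparse the weights and activations actually are.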