How feasible/costly would it be to train a very large AI model on distributed clusters of GPUs?

post by Anonymous · 2022-01-25T19:20:49.170Z · LW · GW · 1 comment

This is a question post.

Contents

  Answers
    12 A Ray
    8 jacob_cannell
    6 ckbyrd
None
1 comment

Folding@home is the most powerful supercomputer in the world. It relies on simulations utilizing on a distributed network of GPUs, CPUs, and ARM processors volunteered by people around the world. From some quick Googling, it looks like GPUs account for a large majority of Folding@home’s processing power. This suggests to me that distributed computing networks like Folding@home could potentially be used to train large deep neural networks. 

I asked a friend about this, and they offered the following thoughts:

This second bullet point is the question I want to ask: how much more costly it would be to train a very large AI model on a set of distributed clusters of compute, where each cluster has a sufficient number of GPUs to store the model and do model parallelism on-site? It would also be helpful to know whether/how much this premium might change in the future.

Answers

answer by A Ray · 2022-01-25T20:01:43.386Z · LW(p) · GW(p)

While there are rare examples of deep learning algorithms that can scale in this way, in practice current relative resource costs of compute-vs-storage-vs-bandwidth don't work out in favor of this kind of topology.

First problem: Current deep learning systems require a bunch of very dense networking with a high degree of connectivity between nodes, high bandwidth over these connections, and low latency.   It's possible some sort of tweak or modification to the architecture and algorithm would allow for training on compute that's got slow/high-latency/low-bandwidth connections, but I don't know of any right now.

Second problem: Current distributed training approaches strongly require almost all of the compute to be available and reliable.  If nodes are stochastically coming and going, it's hard to know where to route data / how long to wait to accumulate gradients before giving up / who has what copy of the latest parameters.  This seems also solveable with engineering, but engineering distributed systems to deal with node failures is a headache.

Third problem: Trust in the computing.  Deep neural networks are pretty sensitive to data poisoning and things like adversarial examples.  I expect that for almost any efficient distributed training setup, a clever attacker would be able to figure out a way to subtly introduce unnoticed, unwanted changes to the training.  In general I think a patient attacker could make changes subtle enough that they're almost always below some threshold of validation, but I don't have a proof for this.

Maybe there are other problems that I missed, but at the very least each one of those independently would make me not want to train my large models on a setup like this.

answer by jacob_cannell · 2022-01-26T00:12:53.404Z · LW(p) · GW(p)

I believe this is probably possible, but only with new techniques/breakthroughs. For various mostly economic reasons it simply hasn't yet attracted the caliber of researcher attention required to make significant progress. It seems very difficult; using a more traditional reliable dense large cluster is much much easier. And given that the fully decentralized/distributed approach is only about 5x to 10x cheaper or more efficient even if interconnect was infinite, it's not really worth investing in until all the other low hanging research fruit is picked.

Let's assume we need a brain sized model, so about 1e14 params. A single modern GPU has the flops (using tensorcores) to run a model this big at real-time speed assuming non-trivial breakthrough in exploitation of both activation and weight sparsity (nobody has achieved this yet, arguably may be impossible, but let's assume). This means the model runs at 100hz, but only 1% of the 1e14 connections are active per timestep, so it uses 1e14 flops instead of 1e16.

Then let's also assume some simple effective compression techniques in the spirit of weight sharing allows us to compress the 1e14 params down to about 1e13 bits or on order a terrabyte of GPU RAM. This then requires model parallelization over about 64 GPUs - with each GPU then simulating about 128 million neurons and the equivalent of a trillion weights - but shared/compressed down to use only 16GB.

Next let's assume we constrain this model to use mostly local communication similar to how the brain is such constrained. So only about 10% of our 10 billion-ish neurons have long distance connections at all, and the total long distance (inter-GPU) communication bandwidth is less than 10Gbps per brain/model instance (about 1 billion long distance paths firing at 1hz). This again assumes brain-like activation sparsity (on order a few % of neurons active per timestep).

This means we need up to 64 * 10Gbps of total inter-GPU bandwidth to simulate our 64 brain sized instances in parallel, but spread out so each GPU needs only about 10Gbps. This is easily achievable with 4 or 8 GPUs per machine, fast PCIE or NVlink intra-machine connections, and then a 32 or 16 way 10Gb network switch connecting the machines.

If we can't achieve 32x weight compression/sharing then we need to use more GPUs or get by with a smaller model. Up to 8 GPUs per machine and a 64-way switch seems feasible, so we could scale up to 512 local GPUs. But it gets much more expensive/unrealistic past that point.

So to use millions of machines, we'd end up with thousands of these clusters, each cluster running a number of instances of one large brain-sized model. Realistically each cluster has only about a 1GBps shared connection to the wider internet, which is extremely limiting - several OOM less than the local network aggregate switch bandwidth.

Standard data parallelization would involve passing around our 1TB of (already highly compressed) weights/params every gradient step, which seems fairly unrealistic unless truly enormous batch sizes are possible.

So the most straightforward approach is probably using these thousands of clusters for parallel hyper-parameterization (which requires barely any bandwidth), but I believe there are new vastly more efficient techniques waiting to be discovered that use bandwidth somewhere in between (in a log sense) ~1e2 bit/s (distributed hyper-param exploration) and ~1e12 bit/s (full data parallel at ~1 gradient step per second).

answer by ckbyrd · 2022-01-25T15:44:13.841Z · LW(p) · GW(p)

Don't know a ton about this but here are a few thoughts:

- Overall, I think distributed compute is probably not good for training or inference, but might be useful for data engineering or other support functions. 

- Folding@home crowdsources compute for expanding markov state models of possible protein folding paths. Afaik, this doesn't require backpropagation or any similar latency-sensitive updating method. The crowdsourced computers just play out a bunch of scenarios, which are then aggregated and pruned off-line. Interesting paths are used to generate new workloads for future rounds of crowdsourcing. 

This is an important disanalogy to deep RL models, and I suspect this is why F@H doesn't suffer from the issues Lennart mentioned (latency, data bandwith, etc.)

This approach can work for some of the applications that ppl use big models for - e.g. Rosetta@home does roughly the same thing as Alphafold, but it's worse at it. (afaik Alphafold can't do what F@H does - different problem)

- F@H and DL both benefit from GPUs because matrix multiplication is pretty general. If future AI systems train on more specialized hardware, it may become too hard to crowdsource useful levels of compute. 

- Inference needs less aggregate compute, but often requires very low latency, which probably makes it a bad candidate for distribution.

- IMO crowdsourced compute is still interesting even if it's no good for large model training/inference. It's really good at what it does (see F@H, Rosetta@home, cryptomining collectives, etc.), and MDPs are highly general even with long latencies/aggregation challenges. 

Maybe clever engineers could find ways to use it for e.g. A/B testing fine-tunings of large models, or exploiting unique datasources/compute environments (self-driving cars, drones, satellites, IoT niches, phones).

1 comment

Comments sorted by top scores.

comment by gwern · 2022-01-25T21:50:27.189Z · LW(p) · GW(p)

Here's an example of the largest dense model trained over the public Internet on commodity hardware I am aware of: ALBERT & SwAV.

It comes at a cost of like 3x a well-connected cluster. (This is with a model that fits in each node so no additional overhead from model parallelism.) I'm not sure if I'd expect >GPT-3-scale models to do a lot worse or not. In any case, the question here is, is the glass 1/3rd full or 2/3rds empty?

Using crowdsourcing might be much more feasible in the DRL setting. Leela was able to crowdsource very effectively because individual nodes can spend a lot of local time deeply evaluating the game tree and only returning small board-state values, so it's embarrassingly parallel.