Supposing the 1-bit LLM paper pans out

post by O O (o-o) · 2024-02-29T05:31:24.158Z · LW · GW · No comments

This is a question post.

Contents

  Answers
    10 mtaran
    8 Vladimir_Nesov
    5 Tomás B.
    2 Tomás B.
    2 lukehmiles

https://arxiv.org/abs/2402.17764 claims that 1-bit LLMs are possible.

If this scales, I'd imagine there is a ton of speedup to unlock, since our hardware has been optimized for 1-bit operations for decades. What does this imply for companies like Nvidia and the future of LLM inference/training?

Do we get another leap in LLM capabilities? Do CPUs become more useful? And can this somehow be applied to make training more efficient?

Or is this paper not even worth considering for some obvious reason I can't see?

Edit: this method is already applied to training.

Answers

answer by mtaran · 2024-03-02T16:39:40.937Z · LW(p) · GW(p)

I think this could be a big boon for mechanistic interpretability, since it can be a lot more straightforward to interpret a bunch of {-1, 0, 1}s than reals. Not a silver bullet by any means, but it would at least peel back one layer of complexity.

comment by Thomas Kwa (thomas-kwa) · 2024-03-03T04:41:09.451Z · LW(p) · GW(p)

It could also be harder. Say that 10 bits of each current 16-bit parameter are useful; then to match that capacity you would need about 6 ternary parameters, which might be hard to find or might interact in unpredictable ways.
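For concreteness, the arithmetic behind that estimate (taking the 10-useful-bits premise as given): each ternary weight carries at most $\log_2 3 \approx 1.585$ bits, so

$$\frac{10\ \text{bits}}{\log_2 3\ \text{bits per ternary weight}} \approx \frac{10}{1.585} \approx 6.3\ \text{ternary weights per fp16 weight.}$$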

Replies from: mtaran
comment by mtaran · 2024-03-03T15:11:17.626Z · LW(p) · GW(p)

Perhaps if you needed a larger number of ternary weights, but the paper claims to achieve the same performance with ternary weights as one gets with 16-bit weights using the same parameter count.

answer by Vladimir_Nesov · 2024-02-29T15:23:32.426Z · LW(p) · GW(p)

The paper is not about post-training quantization; instead it's quantization-aware training (this is discussed more clearly in the original BitNet paper). The representation is ternary {-1, 0, 1} from the start, and the network learns to cope with that constraint throughout pre-training instead of being subjected to the brain damage of quantization after training.
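For intuition, here is a minimal PyTorch-style sketch of how such quantization-aware training can look (a simplification under assumed details, not the paper's exact recipe): a full-precision latent weight is ternarized on every forward pass, and gradients reach the latent weight through a straight-through estimator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Absmean-style ternarization: scale by mean |w|, round, clip to {-1, 0, 1}.
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1) * scale
    # Straight-through estimator: the forward pass sees the ternarized weights,
    # the backward pass treats the quantizer as identity, so gradients update
    # the full-precision latent weights.
    return w + (w_q - w).detach()

class TernaryLinear(nn.Linear):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, ternary_quantize(self.weight), self.bias)
```

During training the latent weights stay in full precision; only for deployment would one keep just the ternary values and their scale.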

Compare this with the work on Microscaling data formats, where a block number format is used to train a transformer at essentially 4 bits per weight, achieving the same perplexity as with 32-bit floating-point weights (see Figure 4 on page 7 there). If perplexity doesn't change for quantization-aware training when going down to 4 bits, it's not too shocking that it doesn't significantly change at 1.6 bits either.
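As a rough illustration of the shared-scale idea behind such block formats (a toy sketch with made-up block size and rounding details, not the MX spec):

```python
import torch

def block_quantize(w: torch.Tensor, block_size: int = 32, bits: int = 4) -> torch.Tensor:
    # Toy block-scaled quantization: each block of `block_size` weights shares one
    # scale, and each element is rounded to a signed `bits`-bit integer grid.
    # (Assumes w.numel() is divisible by block_size; real formats handle padding.)
    qmax = 2 ** (bits - 1) - 1
    blocks = w.reshape(-1, block_size)
    scale = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = (blocks / scale).round().clamp(-qmax, qmax)
    return (q * scale).reshape(w.shape)
```

The point is just that precision is spent on one scale per block plus very low-bit elements, rather than on full-width floats.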

answer by Tomás B. · 2024-02-29T23:39:16.388Z · LW(p) · GW(p)

This is applied to training. It’s not a quantization method.

answer by Tomás B. · 2024-03-02T17:30:20.534Z · LW(p) · GW(p)

@Veedrac [LW · GW], suppose this pans out and custom hardware is made for such networks. How much faster/larger/cheaper will this be?

comment by Veedrac · 2024-03-03T03:42:43.382Z · LW(p) · GW(p)

Communication overhead won't drop faster than linear.

answer by lukehmiles · 2024-02-29T07:26:40.840Z · LW(p) · GW(p)

I don't think it can be patched to make training itself 1.58-bit (95% confident). I think training (not inference) is where most of the money goes to and comes from, so the hardware market will not be affected (90%).

Even in the small inference market, chip companies already have 4-8 bit inference accelerators in the oven (99%); they will not judge the benefits of 1.58-bit to be enough to justify the risk of such specialized hardware, so nobody will build more than 100 1-bit or 1.58-bit inference chips (80%).

Old-fashioned CPUs have at most 32 threads, so they will still be slow as heck at running NNs (90%).

I think your question is quite important.

comment by Fergus Argyll (fergus-argyll) · 2024-02-29T11:17:40.906Z · LW(p) · GW(p)

If I understand correctly (I very well might not), a "one-bit LLM" has to be trained as a "one-bit LLM" in order to then run inference on it as a "one-bit LLM". I.e., this isn't a new quantization scheme.

So I think training and inference are tied together here, meaning: if this replicates, works, etc., we will probably have new hardware for both stages.

Replies from: lcmgcd
comment by lukehmiles (lcmgcd) · 2024-03-02T01:05:34.003Z · LW(p) · GW(p)

I don't see them mention anything about training efficiency anywhere, so I don't think it is really legit 1.58-bit training in a meaningful sense.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2024-03-02T09:53:18.296Z · LW(p) · GW(p)

Training doesn't become more efficient: gradients and activations are still full precision, and I'm guessing a full-precision copy of the weights is maintained during training (in addition to the quantized weights used for forward passes). The advantage is that this method of training produces a quantized model with the same quality as a non-quantized model (unlike post-training quantization, which makes models worse). Additionally, the {-1, 0, 1} quantization means you need much less multiplication circuitry for inference, so the potential for inference chips is not just less memory, but also less energy and fewer transistors, significantly raising the practical ceiling for local (on-device) inference.
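To make the "much less multiplication circuitry" point concrete, a toy sketch (an illustration, not from the paper): with weights restricted to {-1, 0, 1}, a matrix-vector product needs no multiplications at all, only selective additions and subtractions.

```python
import numpy as np

def ternary_matvec(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    # W has entries in {-1, 0, 1}: each output is the sum of inputs selected
    # by the +1 weights minus the sum of inputs selected by the -1 weights.
    y = np.zeros(W.shape[0], dtype=x.dtype)
    for i in range(W.shape[0]):
        y[i] = x[W[i] == 1].sum() - x[W[i] == -1].sum()
    return y

# Sanity check against an ordinary (multiply-based) matmul
rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))
x = rng.standard_normal(8)
assert np.allclose(ternary_matvec(W, x), W @ x)
```

On dedicated hardware the per-element work reduces to select/negate/accumulate, which is where the energy and transistor savings would come from.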

It's apparently not a novel idea; quantization-aware training was explored before there were transformers:
