Comment by toph on What’s the backward-forward FLOP ratio for Neural Networks? · 2024-03-09T22:52:10.635Z
Late to the party, but thanks for writing this up! I'm confused about two points in the calculation in the Theory section:
- The FLOP needed to compute the term "δ₃ @ A₂ᵀ" (and similar)
- I understand this to be the outer product of two vectors: δ₃, with length #output, and A₂, with length #hidden2
- If that's the case, should this require only #output × #hidden2 × #batch FLOP (without the factor of two in the table), since it's just one multiplication per pair of numbers?
- Do the parameter updates need to be accumulated for each example in the batch?
- If this is the case, would this mean there's an additional FLOP for each parameter for each example in the batch?
I think these two points cancel out, so the result is still the 2:1 ratio, as expected. I think these points are also consistent with the explanation here: https://medium.com/@dzmitrybahdanau/the-flops-calculus-of-language-model-training-3b19c1f025e4
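
To make the cancellation concrete, here's a minimal sketch of the bookkeeping I have in mind (my own accounting, not taken from the post; the sizes `n_output`, `n_hidden2`, and `batch_size` are hypothetical):

```python
def weight_grad_flops(n_output: int, n_hidden2: int, batch_size: int) -> dict:
    """Count FLOPs for the weight-gradient term δ₃ @ A₂ᵀ over a batch."""
    # Outer product per example: one multiplication per weight entry.
    outer_product = n_output * n_hidden2 * batch_size
    # Accumulating each example's contribution into the gradient buffer:
    # one addition per weight entry per example.
    accumulation = n_output * n_hidden2 * batch_size
    return {
        "outer_product": outer_product,
        "accumulation": accumulation,
        # Total matches the factor of 2 in the post's table:
        # 2 * n_output * n_hidden2 * batch_size
        "total": outer_product + accumulation,
    }

if __name__ == "__main__":
    print(weight_grad_flops(n_output=10, n_hidden2=100, batch_size=32))
```

So the multiply-only count for the outer product is half the table's entry, but adding the per-example accumulation back in restores the same total, which is why the overall backward:forward ratio is unchanged.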