Neel was advised by the authors that it was important to minimise batches containing tokens from the same prompt. This approach quickly yields a buffer holding activations from many different prompts.
Oh, I see: it's a constraint on tokens from the vocabulary rather than on the prompts. Does the buffer ever reuse prompts, or does it always use new ones?
We store activations in a buffer of ~500k tokens, which is refilled and shuffled whenever 50% of the tokens have been used (i.e. Neel's approach).
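For concreteness, here is a minimal sketch of how such a refill-and-shuffle buffer could work. The `ActivationBuffer` class and the `get_activations` helper are hypothetical names for illustration, not taken from Neel's code; it assumes activations arrive as PyTorch tensors sampled from fresh prompts.

```python
import torch

class ActivationBuffer:
    """Hypothetical sketch of the refill-and-shuffle buffer described above.

    Holds activations for ~buffer_size tokens drawn from many prompts;
    whenever half the buffer has been consumed, it is topped up with
    fresh activations and reshuffled, so a training batch rarely
    contains many tokens from the same prompt.
    """

    def __init__(self, get_activations, buffer_size=500_000):
        # get_activations(n) is an assumed helper returning an
        # (n, d_act) tensor of activations from freshly sampled prompts.
        self.get_activations = get_activations
        self.buffer_size = buffer_size
        self.buffer = get_activations(buffer_size)
        self._shuffle()
        self.ptr = 0

    def _shuffle(self):
        self.buffer = self.buffer[torch.randperm(len(self.buffer))]

    def next_batch(self, batch_size):
        # Refill once half the buffer has been served, then reshuffle so
        # the unused activations get mixed in with the fresh ones.
        if self.ptr > self.buffer_size // 2:
            kept = self.buffer[self.ptr:]
            fresh = self.get_activations(self.buffer_size - len(kept))
            self.buffer = torch.cat([kept, fresh])
            self._shuffle()
            self.ptr = 0
        # Assumes batch_size is much smaller than half the buffer,
        # so slices never run off the end.
        batch = self.buffer[self.ptr:self.ptr + batch_size]
        self.ptr += batch_size
        return batch
```

Note that in this sketch refills always draw activations from new prompts, which is one possible answer to the reuse question above; whether Neel's actual implementation ever revisits prompts is exactly what's being asked.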
I am not sure I understand the reasoning behind this approach. Why refill and shuffle whenever 50% of the tokens have been used? Does the buffer hold only tokens from the training set, or also from the test set? I didn't see a train/test split in Neel's code; isn't that important? Also, can you track the number of training epochs when using this buffer method? It seems like the buffer makes that more difficult.