There is an old (2013) paper from Google here that mentions training an n-gram model on 1.3T tokens: ("Our second-level distributed language model uses word 4-grams. The English model is trained on a 1.3 × 10^12 token training set"). An even earlier 2006 blog post here also references a 1T word corpus.
This number is more than 2.5x as big as MassiveWeb's 506B tokens, roughly a decade older, and not necessarily the whole web even back then. So I would be quite surprised if the MassiveWeb 506B token number represents a limit of what's available on the web. My guess would be that there's at least an order of magnitude more tokens available in a full web scrape. Though a lot depends on how much the "quality filter" throws out.
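As a rough sanity check on those ratios (my own back-of-envelope, not numbers from either source beyond the 1.3T and 506B figures quoted above), a quick calculation:

```python
# Back-of-envelope comparison of the two corpus sizes.
google_ngram_tokens = 1.3e12   # 2013 Google 4-gram training set, from the paper quoted above
massiveweb_tokens = 506e9      # MassiveWeb token count

ratio = google_ngram_tokens / massiveweb_tokens
print(f"Google 2013 corpus is ~{ratio:.1f}x the size of MassiveWeb")  # ~2.6x

# Pure guess: if a full modern scrape were ~10x MassiveWeb, that would be ~5T tokens.
guess_full_web = 10 * massiveweb_tokens
print(f"Order-of-magnitude guess for a full web scrape: ~{guess_full_web / 1e12:.0f}T tokens")
```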
And if this does represent a limit of what's on the web, then as other posters have said, email is much larger than the web. That said, I question whether anyone would be reckless enough to train an LLM on everyone's private emails without consent; it seems like a potential privacy disaster.