Which parts of the existing internet are already likely to be in (GPT-5/other soon-to-be-trained LLMs)'s training corpus?
post by AnnaSalamon · 2023-03-29T05:17:28.000Z · 2 comments

This is a question post.
In particular:
- If a book is in easily accessible parts of the web, but is there as a pdf image of a scanned book rather than as text, is it likely to be in the training corpus of (GPT-5/similar) in roughly the same way (with roughly the same effects) as if it were on easily accessible webpages? (See the OCR sketch after this list.)
- How about if a book is on libgen as a pdf (illegally), but not on the easily accessible parts of the web?
- How about if the ideas are in a video (e.g. a podcast)? Is (GPT-5/similar) likely to process the text content of a video in roughly the same way when there is only a video as when there is both a video and a transcript? (See the transcription sketch after this list.)
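For the scanned-book case, whether the text effectively "counts" as text largely comes down to whether an OCR pass has been run over the pdf somewhere upstream of corpus assembly. Below is a minimal sketch of that step, assuming the pdf2image and pytesseract Python packages (which in turn need the poppler and tesseract tools installed); the filenames are hypothetical, and this is a sketch of the general technique, not a claim about any lab's actual pipeline.

```python
# Hedged sketch: turning a scanned-book pdf into plain text via OCR.
# Assumes pdf2image (needs poppler) and pytesseract (needs tesseract);
# "scanned_book.pdf" is a hypothetical input file.
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("scanned_book.pdf", dpi=300)  # render each page to an image
text = "\n\n".join(pytesseract.image_to_string(page) for page in pages)

with open("scanned_book.txt", "w", encoding="utf-8") as f:
    f.write(text)  # once in this form, the book can enter a text-only corpus
```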
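For the video case, the analogous step is speech-to-text. Here is a minimal sketch using OpenAI's open-source Whisper model via the openai-whisper package (which requires ffmpeg); "episode.mp3" is a hypothetical file, and again this illustrates the general technique rather than any known data pipeline.

```python
# Hedged sketch: transcribing podcast or video audio with the open-source
# Whisper model. Assumes `pip install openai-whisper` and ffmpeg on PATH;
# "episode.mp3" is a hypothetical input file.
import whisper

model = whisper.load_model("base")        # small general-purpose checkpoint
result = model.transcribe("episode.mp3")  # ffmpeg decodes the audio internally
print(result["text"])                     # plain text, as a transcript would provide
```

If a transcript already exists on an accessible webpage, this step is unnecessary, which is presumably part of why transcripts make the content more likely to be ingested.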
Comments
comment by Douglas_Knight · 2023-03-29T18:47:27.757Z
I assume you know this, but to be clear, OpenAI has already used pirated books. GPT-3 was trained on "books2", which appears to be all the text on libgen (and pretty much all the books on libgen have been through OCR). Per token, it was weighted about the same as the Common Crawl, lower than Gutenberg or Reddit links. This seems to answer your second question: they will likely treat pdfs on libgen the same as pdfs on the open web. If you're asking whether they will train the model on the pixels in these pdfs, which might make up for losses in OCR, I have no idea.
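For concreteness, the weighting in question is the training mixture reported in the GPT-3 paper (Brown et al. 2020, Table 2.2). A toy sketch of how such a mixture plays out at sampling time follows; the fractions and epoch counts are from that table, while the sampling loop is purely illustrative and not OpenAI's actual pipeline.

```python
# Toy illustration of dataset weighting in GPT-3's training mix
# (fractions from Brown et al. 2020, Table 2.2; code is illustrative only).
import random

mixture = {
    "Common Crawl": 0.60,
    "WebText2": 0.22,   # outbound Reddit links
    "Books1": 0.08,
    "Books2": 0.08,     # the corpus the parent comment identifies with libgen
    "Wikipedia": 0.03,
}

# Because books2 is large relative to its 8% share, each of its tokens is
# seen about 0.43 times over training, roughly the same as Common Crawl's
# 0.44 and far below WebText2's 2.9 or Books1's 1.9; this per-token rate is
# what "weighted about the same as the Common Crawl" refers to above.

def next_source() -> str:
    """Pick which corpus the next training document is drawn from."""
    names, weights = zip(*mixture.items())
    return random.choices(names, weights=weights, k=1)[0]

print(next_source())
```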
reply by AnnaSalamon · 2023-03-29T19:15:37.438Z
I did not know this; thanks!