Which parts of the existing internet are already likely to be in (GPT-5/other soon-to-be-trained LLMs)'s training corpus?

post by AnnaSalamon · 2023-03-29T05:17:28.000Z · LW · GW · 2 comments

This is a question post.


In particular:

Comments

comment by Douglas_Knight · 2023-03-29T18:47:27.757Z · LW(p) · GW(p)

I assume you know this, but to be clear: OpenAI has already used pirated books. GPT-3 was trained on "books2", which appears to be all the text on libgen (and pretty much all the books on libgen have been through OCR). It was weighted the same as Common Crawl, lower than Gutenberg or Reddit links. This seems to answer your second question: they will likely treat PDFs on libgen the same as PDFs on the open web. If you're asking whether they will train the model on the pixels in these PDFs, which might make up for losses in OCR, I have no idea.
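Some context from outside the thread: the GPT-3 paper (Brown et al. 2020, Table 2.2) reports each training dataset's token count and mixture weight, which lets one check the "weighted the same as Common Crawl" claim. A minimal sketch, assuming the paper's published figures and a ~300B-token training run; note the identifications of books2 with libgen and of WebText2 with Reddit links are the commenter's inference, not stated in the paper:

```python
# Effective sampling rate of each GPT-3 training dataset, computed from the
# token counts and mixture weights in Brown et al. 2020, Table 2.2.
# "Epochs" here = how often a dataset is seen over a ~300B-token run:
#   epochs = weight * TOTAL_TOKENS_B / size_in_billions

TOTAL_TOKENS_B = 300  # approximate GPT-3 training budget in billions of tokens

# dataset name -> (size in billions of tokens, weight in training mix)
datasets = {
    "Common Crawl (filtered)": (410, 0.60),
    "WebText2":                (19,  0.22),  # outbound Reddit links
    "Books1":                  (12,  0.08),
    "Books2":                  (55,  0.08),  # speculated above to be libgen
    "Wikipedia":               (3,   0.03),
}

def effective_epochs(size_b: float, weight: float) -> float:
    """How many passes over the dataset a 300B-token run implies."""
    return weight * TOTAL_TOKENS_B / size_b

for name, (size_b, weight) in datasets.items():
    print(f"{name:25s} ~{effective_epochs(size_b, weight):.2f} epochs")
```

Running this shows Books2 and Common Crawl both sampled at roughly 0.44 epochs (i.e. undersampled relative to their size), while the smaller curated sets are seen more than once, which is consistent with the comment's description.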

Replies from: AnnaSalamon
comment by AnnaSalamon · 2023-03-29T19:15:37.438Z · LW(p) · GW(p)

I did not know this; thanks!