[updated] how does gpt2's training corpus capture internet discussion? not well

post by nostalgebraist · 2020-07-27T22:30:07.909Z · LW · GW · 3 comments

[Updated to correct my earlier claim that this doesn't affect GPT-3. Apparently it does?]

I’m out sick today, but had enough energy to do some GPT-related fiddling around.

This time, I was curious what “internet discussions” tended to look like in the original training corpus.  I thought this might point to a more natural way to represent tumblr threads for @nostalgebraist-autoresponder​ than my special character trick.

So, I looked around in the large shard provided as part of https://github.com/openai/gpt-2-output-dataset.

Colab notebook here, so you can interactively reproduce my findings or try similar things.
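The shard files in gpt-2-output-dataset are JSONL, one document per line with a `text` field. A minimal sketch of the kind of scan the notebook does — counting documents that contain forum-boilerplate markers (the marker strings here are illustrative choices, not necessarily the ones used in the notebook):

```python
import json

# Illustrative forum-boilerplate markers; phrases like these tend to
# survive bad forum-page extraction.
MARKERS = ["Quote this Post", "Posted by"]

def count_marker_docs(jsonl_path, markers=MARKERS):
    """Count documents in a WebText-style JSONL shard whose text
    contains at least one forum-boilerplate marker."""
    hits = 0
    total = 0
    with open(jsonl_path) as f:
        for line in f:
            total += 1
            text = json.loads(line)["text"]
            if any(m in text for m in markers):
                hits += 1
    return hits, total
```

Pointing this at the large shard from the repo above should surface plenty of the mangled threads discussed below.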

---

The results were … revealing, but disappointing.  I did find a lot of discussion threads in the data (though I couldn't find many chatlogs).  But the extraction step has mangled them badly.

For example, from one thread, the text cleaner picks out a single post and renders it as:

“ Pillowapnts

tho the gem doesnt specifically say that you need to crit with something linked, i thought it was just crit in general ahhh. alright i get it thxtho the gem doesnt specifically say that you need to crit with something linked, i thought it was just crit in general

That would be OP That would be OP Posted by Lordsidro

on on Quote this Post

This is apparently standard behavior for the newspaper text extraction library they used, and I could reproduce it exactly.  (Its heuristics grab a single post when looking for the “part the content is in.”)
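The failure mode is easy to model: the extractor scores candidate blocks on the page and keeps only the highest-scoring one as "the content," so a multi-post thread collapses to one post. A toy sketch of that behavior (pure Python; newspaper's real scoring is more elaborate, using stopword counts and DOM structure):

```python
# Toy model of a "pick the best node" extractor: each forum post is a
# candidate block, and the extractor keeps only the highest-scoring one,
# silently discarding the rest of the thread.

def extract_main_content(blocks):
    """Return only the block with the most words -- a simplified
    stand-in for newspaper's top-node heuristic."""
    return max(blocks, key=lambda b: len(b.split()))

thread = [
    "tho the gem doesnt specifically say that you need to crit ...",
    "That would be OP",
    "on on Quote this Post",
]
# extract_main_content(thread) keeps only the longest post; the other
# two participants' posts never make it into the training document.
```

This is why a document in the corpus usually looks like one commenter's text surrounded by fragments of page boilerplate, rather than a whole thread.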

[This paragraph was incorrect, see Update below] Does this affect GPT-3?  Probably not?  I don’t know how Common Crawl does text extraction, but at the very least, it’ll give you the whole page’s worth of text.

Update: Looked into this further, and I think GPT-3 suffers from this problem to some extent as well.

The Colab notebook has the details, but some stats here:

3 comments


comment by gwern · 2020-08-04T01:27:51.608Z · LW(p) · GW(p)

It can't be too bad, though, because I have seen GPT-3 generate fairly plausible forum discussions with multiple participants, and how would it do that if it only ever saw single-commenter documents?

Replies from: gwillen
comment by gwillen · 2020-08-04T02:50:51.351Z · LW(p) · GW(p)

Do you have examples of that kind of output for comparison? (Is it reproducing formatting from an actual forum of some kind, or the additional "abstraction headroom" over GPT-2 allowing GPT-3 to output a forum-type structure without having matching examples in the training set?)

Replies from: gwern
comment by gwern · 2020-08-11T18:27:30.758Z · LW(p) · GW(p)

I didn't copy it but it was fairly reasonable plaintext, something like `username \n date \n comment \n\n next comment`.
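For concreteness, rendering a thread in the format gwern describes would look something like this (a hypothetical sketch; the field order and separators are only as he recalls them):

```python
def render_thread(comments):
    """Render (username, date, body) triples as plaintext:
    username \n date \n comment, with a blank line between comments."""
    return "\n\n".join(f"{user}\n{date}\n{body}"
                       for user, date, body in comments)

thread = [
    ("gwern", "2020-08-04", "It can't be too bad, though..."),
    ("gwillen", "2020-08-04", "Do you have examples of that kind of output?"),
]
# render_thread(thread) produces a multi-participant plaintext thread,
# the structure GPT-3 apparently reproduces despite the mangled corpus.
```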