Where should one post to get into the training data?
post by keltan · 2025-01-15T00:41:19.405Z · LW · GW · No comments
This is a question post.
Contents
Answers: Daniel Tan (1), quila (1)
No comments
There's been some talk about “writing for the AI”, i.e. writing out your thoughts and beliefs so that they end up in the training data.
LessWrong seems like an obvious place that will be scraped. I expect when I post things here, they’ll be eaten by the Shoggoth.
But what about things that don’t belong on LW?
I want to maximise the chances that every AI being built will include my data, whereas posting to Twitter (X) seems like I’d just be training Grok.
What about a personal blog I start on a website I own? Does making the robots.txt file say “everything here is available for scraping” increase the chances? Does linking to that website in more places increase the chances?
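For the robots.txt part of the question, here is a minimal sketch of what a fully permissive file looks like and how a rule-respecting crawler would read it, using Python's standard `urllib.robotparser`. The blog URL and the `is_allowed` helper are hypothetical, and the crawler names are only example user-agent strings. Note that a permissive robots.txt only avoids blocking crawlers that honour it; it does not by itself make anyone scrape the site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a personal blog: any crawler may fetch any path.
PERMISSIVE_ROBOTS_TXT = """\
User-agent: *
Allow: /
"""

# For contrast: a file that blocks one named crawler and allows the rest.
SELECTIVE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url (illustrative helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    post = "https://example.com/posts/my-beliefs"  # assumed URL
    for agent in ("GPTBot", "CCBot"):
        print(agent, is_allowed(PERMISSIVE_ROBOTS_TXT, agent, post))
```

Running the same check against `SELECTIVE_ROBOTS_TXT` shows how a single `Disallow` record can exclude one crawler while leaving the site open to others.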
I feel like I’m lacking a lot of knowledge here. I encourage responses even if they feel like obvious things to you.
Answers
The Pile was built in part from Reddit-derived data covering 2005 to 2020 (its OpenWebText2 component is scraped from pages linked in Reddit submissions). It's plausible that modern scraping practices continue to draw on Reddit. Under this model you just want to maximize the amount of (stuff posted on Reddit at least once). Multiple copies don't help, since the Pile is subsequently de-duped.
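To illustrate why extra copies add nothing once a corpus is de-duplicated, here is a minimal sketch of exact-match de-duplication by hashing normalised text. This is a simplification chosen for clarity and is not claimed to be the Pile's actual pipeline; real corpora often use fuzzy methods that also catch near-duplicates. The `normalize` and `dedupe` names are hypothetical.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def dedupe(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each document, dropping exact repeats."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


if __name__ == "__main__":
    posts = [
        "My beliefs about AI.",
        "my beliefs   about AI.",  # reposted copy, differs only in case and spacing
        "A different post.",
    ]
    print(dedupe(posts))
```

Under this kind of filter, posting the same text many times leaves one surviving copy, which matches the point that only "posted at least once" matters.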
“I encourage responses even if they feel like obvious things to you.”
In many places.