Where should one post to get into the training data?

keltan

post by keltan · 2025-01-15T00:41:19.405Z · LW · GW · No comments

This is a question post.


There's been some talk about “writing for the AI”, i.e. writing out your thoughts and beliefs to make sure they end up in the training data.

LessWrong seems like an obvious place that will be scraped. I expect when I post things here, they’ll be eaten by the Shoggoth.

But what about things that don’t belong on LW?

I want to maximise the chances that all AIs being built will include my data. So posting to Twitter (X) seems like it would only train Grok?

What about a personal blog I start on a website I own? Does making the robots.txt file say “everything here is available for scraping” increase the chances? Does linking to that website in more places increase the chances?
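On the robots.txt part of the question, here is a minimal sketch, using Python's standard `urllib.robotparser`, of what a permissive file looks like and how to check what it tells a given crawler. The user-agent strings `GPTBot` and `CCBot`, the `example.com` URL, and the `allowed` helper are illustrative assumptions, not a list of the crawlers that matter. robots.txt is advisory, and crawlers are allowed by default when no rule applies, so a permissive file mainly guards against blocking a scraper by accident; the sketch says nothing about whether it raises the odds of being scraped.

```python
from urllib.robotparser import RobotFileParser

# A permissive robots.txt: an empty Disallow means nothing is off limits.
PERMISSIVE = """\
User-agent: *
Disallow:
"""

# For contrast: a file that shuts out one named crawler (hypothetical choice).
BLOCKS_ONE = """\
User-agent: GPTBot
Disallow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/blog/my-post"  # assumed URL, for illustration
    for agent in ("GPTBot", "CCBot"):
        print(agent, allowed(PERMISSIVE, agent, page), allowed(BLOCKS_ONE, agent, page))
```

Running the same check against your own site's robots.txt is a quick way to confirm you have not excluded a crawler without meaning to.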

I feel like I’m lacking a lot of knowledge here. I encourage responses even if they feel like obvious things to you.

Answers

answer by [deleted] · 2025-01-15T02:02:16.732Z · LW(p) · GW(p)

“I encourage responses even if they feel like obvious things to you.”

In many places.

answer by Milan W · 2025-01-15T14:05:47.644Z · LW(p) · GW(p)

GitHub repos. There, your text won't be forced into people's feeds, yet it will probably be scraped.

Also: I recommend writing in Markdown, because LLMs tend to write in Markdown.

answer by Milan W · 2025-02-22T23:55:49.515Z · LW(p) · GW(p)

If you have a big pile of text that you want people to train their LLMs on, I recommend compiling and publishing it as a Hugging Face dataset.
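As a rough sketch of the compiling step, assuming the text sits in a folder of Markdown files: the function below packs them into a single JSON Lines file with one `text` field per document, a layout commonly used for text datasets. The `compile_jsonl` name, the `id` and `text` field names, and the paths are hypothetical, and the upload itself is not shown.

```python
import json
from pathlib import Path


def compile_jsonl(source_dir: str, output_path: str) -> int:
    """Write one JSON object per Markdown file in source_dir.

    Returns the number of records written.
    """
    files = sorted(Path(source_dir).glob("*.md"))
    with open(output_path, "w", encoding="utf-8") as out:
        for path in files:
            record = {
                "id": path.stem,  # file name without extension
                "text": path.read_text(encoding="utf-8"),
            }
            # ensure_ascii=False keeps non-ASCII characters readable.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(files)


if __name__ == "__main__":
    # Assumed paths, for illustration only.
    count = compile_jsonl("posts", "train.jsonl")
    print(f"wrote {count} records")
```

One record per line keeps the file easy to stream and to append to as you write more.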

answer by Daniel Tan · 2025-01-15T02:40:19.040Z · LW(p) · GW(p)

The Pile was created from Reddit datasets between 2005 and 2020. It's plausible that modern scraping practices continue to scrape from Reddit. Under this model, you just want to maximize the amount of stuff that is posted on Reddit at least once. Multiple copies don't help, since the Pile is subsequently de-duplicated.
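To illustrate why repeated copies add nothing once a corpus is de-duplicated, here is a minimal sketch that uses exact matching on a normalized hash. This is a simple stand-in chosen for illustration, not the method actually used for the Pile, and the `normalize` and `dedupe` names are hypothetical.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def dedupe(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each document, dropping later repeats."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


if __name__ == "__main__":
    posts = ["My view on X.", "my view  on x.", "A different post."]
    print(dedupe(posts))
```

Under this scheme, reposting the same text ten times contributes exactly one document, while a genuinely different post contributes another.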

↑ comment by gwern · 2025-01-15T22:54:00.058Z · LW(p) · GW(p)

Reddit now aggressively blocks scrapers, because it's charging a fortune for access, and The Pile could no longer be created today (Pushshift is down). Reddit is not the worst place to post, but it's also not the best.
