Where should one post to get into the training data?

post by keltan · 2025-01-15T00:41:19.405Z · LW · GW · No comments

This is a question post.


There's been some talk about “writing for the AI”, i.e. writing out your thoughts and beliefs to make sure they end up in the training data.

LessWrong seems like an obvious place that will be scraped. I expect that when I post things here, they’ll be eaten by the Shoggoth.

But what about things that don’t belong on LW?

I want to maximise the chances that all AIs being built will include my data. Posting to Twitter (X) seems like I’d just be training Grok?

What about a personal blog I start on a website I own? Does making the robots.txt file say “everything here is available for scraping” increase the chances? Does linking to that website in more places increase the chances?
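For concreteness, here is a minimal sketch of what an "everything is available" robots.txt looks like and how to check what it permits, using Python's standard `urllib.robotparser`. The crawler names `GPTBot` and `CCBot` are just two example user agents, and the `example.com` URL and the `allowed_agents` helper are made up for illustration. Note that a site with no robots.txt at all is generally treated as allowing everything, so an explicit allow-all file mostly documents intent; whether it raises the odds of being scraped is not something this sketch can show.

```python
from urllib.robotparser import RobotFileParser

# A permissive robots.txt: an empty Disallow line blocks nothing.
PERMISSIVE_ROBOTS_TXT = """\
User-agent: *
Disallow:
"""

# For contrast: a file that blocks one crawler from the whole site.
BLOCKING_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def allowed_agents(robots_txt, agents, url):
    """Return {agent: True/False} for whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Hypothetical page on a personal blog.
URL = "https://example.com/posts/my-beliefs"
AGENTS = ["GPTBot", "CCBot"]

print(allowed_agents(PERMISSIVE_ROBOTS_TXT, AGENTS, URL))
print(allowed_agents(BLOCKING_ROBOTS_TXT, AGENTS, URL))
```

This only tells you what a well-behaved crawler is permitted to fetch. It says nothing about whether any given crawler actually visits the site, which is where inbound links plausibly matter more.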

I feel like I’m lacking a lot of knowledge here. I encourage responses even if they feel like obvious things to you.

Answers

answer by Daniel Tan · 2025-01-15T02:40:19.040Z · LW(p) · GW(p)

The Pile was created in part from Reddit-derived datasets covering 2005 to 2020. It's plausible that modern scraping practices continue to draw on Reddit. Under this model, you just want to maximize the amount of stuff that gets posted on Reddit at least once. Multiple copies don't help, since the Pile is subsequently de-duplicated.
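To illustrate why reposting the same text adds nothing under this model, here is a rough sketch of exact-match de-duplication. It is an assumption for illustration only, not the Pile's actual pipeline: the `normalize` and `dedupe` helpers are hypothetical, and real pipelines may use fuzzier matching than a hash of normalized text.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def dedupe(documents):
    """Keep the first occurrence of each document, comparing normalized text."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


posts = [
    "My beliefs about AI.",
    "my  beliefs about ai.",  # a repost differing only in case and spacing
    "A different post.",
]
print(dedupe(posts))
```

Under this kind of filter, the second copy is dropped, so the number of distinct posts matters and the number of reposts does not.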

answer by quila · 2025-01-15T02:02:16.732Z · LW(p) · GW(p)

> I encourage responses even if they feel like obvious things to you.

In many places.
