Keeping content out of LLM training datasets

post by Ben Millwood (ben-millwood) · 2024-07-18T10:27:27.827Z

Contents

  Documentation from model vendors
    OpenAI (ChatGPT)
    Anthropic (Claude)
    Google (Gemini)
    Meta (LLaMA)
    Common Crawl
    xAI (Grok)
  External resources that aim to help block AI-related crawlers
  Discussion of general methods
    robots.txt
    User-Agent or IP-based blocking
    Inclusion canaries
    CAPTCHAs
    User accounts
    Authorised user lists
  What non-public data could be trained on?

This post collects methods to exclude internet resources from LLM training datasets.

I plan to at least try to keep this post up-to-date with respect to new things I learn on the topic. Please feel encouraged to suggest any additions or amendments.

This post is about how to do it; please take discussion of whether to apply these techniques to a separate post: Should we exclude alignment research from LLM training datasets?

Link preview image by Steve Douglas on Unsplash.

Documentation from model vendors

OpenAI (ChatGPT)

See docs for GPTBot and ChatGPT-User.

GPTBot is for training data, and ChatGPT-User is used by plugins which can access the internet during inference. They document the user-agents used, the robots.txt identities, and the IP ranges they access from. There's some commentary about how ChatGPT-User is used in training, which I didn't find very illuminating.
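
Since OpenAI publishes the IP ranges its crawlers use, one enforcement option is to reject requests from those ranges at your server. Here's a minimal sketch in Python, assuming you've copied the current ranges from OpenAI's documentation (the CIDR blocks below are placeholder documentation ranges, not real GPTBot addresses):

    import ipaddress

    # Placeholder CIDR blocks (reserved documentation ranges) - substitute the
    # ranges OpenAI currently publishes for GPTBot / ChatGPT-User.
    CRAWLER_RANGES = [
        ipaddress.ip_network("192.0.2.0/24"),
        ipaddress.ip_network("198.51.100.0/24"),
    ]

    def is_crawler_ip(remote_addr: str) -> bool:
        """Return True if the client address falls inside any published crawler range."""
        addr = ipaddress.ip_address(remote_addr)
        return any(addr in net for net in CRAWLER_RANGES)

    # In a request handler, serve a 403 instead of the page when this returns True.
    print(is_crawler_ip("192.0.2.17"))   # True: inside a listed range
    print(is_crawler_ip("203.0.113.5"))  # False: not listed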

Anthropic (Claude)

Does Anthropic crawl data from the web, and how can site owners block the crawler?

Pretty similar to OpenAI's offering, except that they don't have fixed IP ranges, and Claude (as far as I understand) doesn't directly access the internet, so the distinction between training-time and inference-time access isn't relevant.

Some nice touches are that their crawler will not scrape anything that already blocks the Common Crawl bot (see below), and that they specifically commit to not trying to bypass CAPTCHAs (see further below).

Google (Gemini)

Appears to use the Google-Extended crawler, which can be blocked with robots.txt. It doesn't use a different user-agent header relative to other Google products, so user-agent blocking is only possible if you're willing to block the Search scraper as well. I assume they also don't use fixed IP ranges, but I haven't really tried to check, since that always seemed like the most clumsy method anyway.

Meta (LLaMA)

LLaMA 2's model card doesn't disclose their training data, but LLaMA 1 (per its model card) was trained on a variety of sources including CCNet and C4, which are both derived from the Common Crawl dataset (see below), so it seems likely that excluding your data from LLaMA at a minimum requires excluding it from Common Crawl as well.

Common Crawl

Common Crawl is a large, publicly available dataset which in principle any training process could use (and Meta did), so keeping your content out of AI training datasets also requires keeping it out of Common Crawl. Their FAQ documents their user-agent and robots.txt identifier.

xAI (Grok)

At first glance, I wasn't able to find any documentation about this. I'll update if I do.

External resources that aim to help block AI-related crawlers

https://darkvisitors.com/ collects and categorises crawlers of various kinds, and among other things offers a service where you can create an account with them and fetch a pre-made robots.txt from their API that includes all bots of a given type.
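
As a sketch of how that might be wired up (the endpoint below is a stand-in, not their real API; consult darkvisitors.com's documentation for the actual URL and authentication):

    import urllib.request

    # Stand-in values: replace with the endpoint and credentials from your
    # darkvisitors.com account; their docs describe the real API.
    ROBOTS_TXT_SOURCE_URL = "https://example.com/your-generated-robots-txt"
    WEB_ROOT = "/var/www/html"

    def refresh_robots_txt() -> None:
        """Fetch a pre-generated robots.txt and install it at the web root."""
        with urllib.request.urlopen(ROBOTS_TXT_SOURCE_URL) as response:
            body = response.read()
        with open(f"{WEB_ROOT}/robots.txt", "wb") as f:
            f.write(body)

    if __name__ == "__main__":
        refresh_robots_txt()

Running something like this on a schedule (e.g. from cron) means newly catalogued bots get blocked without you having to edit the file by hand.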

Discussion of general methods

robots.txt

See Wikipedia for full details on the technique. For the sake of this article, the main points are: a robots.txt is a plain text file served at the root of your domain; it names crawlers by the user-agent token they declare and tells them which paths they may fetch; and compliance is entirely voluntary, so it only keeps out crawlers that choose to respect it, and only on sites whose robots.txt you control.
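
Pulling together the identifiers mentioned above, a robots.txt that asks all of these crawlers to stay away entirely would look something like the following (ClaudeBot is my understanding of Anthropic's token; check each vendor's current documentation, as the tokens do change):

    User-agent: GPTBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

Blocking CCBot also keeps you out of datasets derived from Common Crawl, such as the C4 and CCNet corpora used for LLaMA 1.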

User-Agent or IP-based blocking

Rather than asking crawlers to exclude themselves, you can configure your server to refuse requests that identify themselves with a known crawler user-agent, or that originate from IP ranges a vendor publishes (as OpenAI does, above). Unlike robots.txt this is enforced on your side, but it still only catches crawlers that announce themselves honestly or publish their addresses, and it requires control over the server configuration.
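
For illustration, a minimal sketch of user-agent blocking in a Flask application (the substring list is illustrative and needs to be kept in sync with the tokens the vendors document; note that, per the Google section above, Google-Extended can't be caught this way):

    from flask import Flask, abort, request

    app = Flask(__name__)

    # Illustrative list of crawler user-agent substrings; keep in sync with vendor docs.
    BLOCKED_AGENT_SUBSTRINGS = ("GPTBot", "ChatGPT-User", "CCBot", "ClaudeBot")

    @app.before_request
    def refuse_ai_crawlers():
        user_agent = request.headers.get("User-Agent", "")
        if any(token in user_agent for token in BLOCKED_AGENT_SUBSTRINGS):
            abort(403)  # refuse the request before serving any content

    @app.route("/")
    def index():
        return "Hello, humans."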

Inclusion canaries

BIG-bench is a benchmark suite to run against LLMs which, by nature, is only valid if the LLMs did not train on the benchmark data. To this end, the benchmark documents include a canary UUID, intended both to let scrapers exclude the data from training sets and to detect, after the fact, whether a model was trained on it anyway. Per a thread on niplav's shortform, Claude and GPT-4-base (but not, it seems, GPT-4o) have learned the canary.

From the name "canary", I originally guessed that these strings were meant only to indicate whether some other exclusion method had worked. But since the string lives in a GitHub repo that can be forked to any user, and public mirrors of GitHub exist (e.g. GitCode), excluding the benchmarks by URL is surely hopeless; the "canary" can only work by actually causing exclusion from the training data, that is, by AI vendors configuring their scrapers to drop documents that contain it.
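
For concreteness, the exclusion step would have to look something like this sketch of a scraper-side filter (the canary value shown is a placeholder; the real GUIDs are published in the BIG-bench and ARC repositories):

    # Placeholder canary; substitute the real GUID strings published by BIG-bench and ARC.
    KNOWN_CANARIES = (
        "canary GUID 00000000-0000-0000-0000-000000000000",
    )

    def keep_for_training(document_text: str) -> bool:
        """Return False for any document carrying a known canary string."""
        return not any(canary in document_text for canary in KNOWN_CANARIES)

    # Example: the second document would be dropped from the training corpus.
    documents = [
        "an ordinary web page",
        "a benchmark item ... canary GUID 00000000-0000-0000-0000-000000000000",
    ]
    training_corpus = [doc for doc in documents if keep_for_training(doc)]
    print(training_corpus)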

Empirically, this seems not to be happening. I'm not sure how it's supposed to happen. Are there attempts to coordinate on this that I don't know about? Or is the idea that these canaries are primarily relevant for research on non-frontier models with more carefully chosen training sets?

ARC also maintains their own evals canary, about which I have the same questions, though I don't think Claude (currently) knows about it. (I've contacted them using their form to ask for more information, and I'll edit it in here if they reply.)

Canaries would have some unique strengths, if they worked: they are intrinsic to your content, so are realistically the only option for content submitted to websites you don't control (e.g. GitHub, or uh, LessWrong), and are robust to content mirroring.

CAPTCHAs

See Wikipedia for general discussion. This is the first technique in my list that attempts to make crawler access impossible rather than merely discouraged. The disadvantages are that it's annoying for human users, and it prevents all scraping, including e.g. search engine indexing. Also, many CAPTCHA schemes have eventually been solved by automated systems, and this is only likely to get worse over time (though it's hard to imagine a good-faith scraper doing this, and as mentioned above, Anthropic has explicitly promised not to).

User accounts

Require users to prove their identity before accessing your resource (implicitly relying on the fact that automated users won't want to, or won't be able to, create accounts). This is a serious technical hurdle if your website doesn't already support authentication; it again prevents search engine indexing; and it presents a higher barrier to entry for humans who don't want to (or can't be bothered to) share their identity with you. For some content, though, a mild barrier to human entry may be no bad thing.
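
As a minimal sketch of the mechanism, here's what gating every page behind an account looks like with HTTP basic authentication in Flask (the credential store is illustrative; a real deployment would hash passwords and manage accounts properly):

    from flask import Flask, Response, request

    app = Flask(__name__)

    # Illustrative credential store; a real site would hash passwords and manage accounts.
    ACCOUNTS = {"alice": "correct horse battery staple"}

    @app.before_request
    def require_login():
        auth = request.authorization
        if not auth or ACCOUNTS.get(auth.username) != auth.password:
            return Response("Login required.", 401,
                            {"WWW-Authenticate": 'Basic realm="members only"'})

    @app.route("/")
    def index():
        return "Members-only content."

The authorised-user-list approach below is the same mechanism with an extra check that the authenticated identity is on an explicit allowlist.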

Authorised user lists

This is the final level of secrecy that I considered: a system that not only requires users to prove their identity, but whitelists specific identities for access, according to some notion you have of who should be able to read your resource. You then use whatever method you deem appropriate to choose an audience who will appropriately guard its contents. This seems primarily relevant for documents that either are extremely critical not to leak, or are a potential target for anonymous human adversaries to deliberately leak. Of course, it has the highest maintenance requirements, and the biggest cost to human access, of any method on this list.

What non-public data could be trained on?

This section is primarily speculative, but seemed worth raising. In principle, anything you hand to these companies directly, such as chat conversations, API requests, or uploaded files, is available to them for training, as is non-public data held by their affiliated products and services.

I expect that these companies generally promise not to use private data in training, but I haven't reviewed their promises for their specifics or robustness. Happy to hear takes on this in the comments.
