[Linkpost & Discussion] AI Trained on 4Chan Becomes ‘Hate Speech Machine’ [and outperforms GPT-3 on TruthfulQA Benchmark?!]

post by Yitz (yitz) · 2022-06-09T10:59:43.904Z · LW · GW · 5 comments

This is a link post for https://www.vice.com/en/article/7k8zwx/ai-trained-on-4chan-becomes-hate-speech-machine

Contents

5 comments

I just came across this story, which seems potentially relevant to the community, both from the perspective of prosaic AI ethics as well as a case study in the unilateralist's curse [? · GW]. There are some outside sources of relevance, but to start with the Vice article, here are some relevant sections:

AI researcher and YouTuber Yannic Kilcher trained an AI using 3.3 million threads from 4chan’s infamously toxic Politically Incorrect /pol/ board. He then unleashed the bot back onto 4chan with predictable results....The bot, which Kilcher called GPT-4chan....was shockingly effective and replicated the tone and feel of 4chan posts.

Naturally, the bot was just a tad offensive. In this forum post, one user describes getting "toxic" responses for 3/4 prompts (with, admittedly, a sample size of only 4). Back to the Vice article:

According to Kilcher’s video, he activated nine instances of the bot and allowed them to post for 24 hours on /pol/. In that time, the bots posted around 15,000 times. This was “more than 10 percent of all posts made on the politically incorrect board that day,” Kilcher said in his video about the project.

This raises obvious ethical concerns; I won't go over all of them here, but you can read the full article for a nice overview, as well as Kilcher's counterargument (which I don't personally find convincing).

The model was subsequently released on Hugging Face, where it was quickly downloaded and mirrored, before catching the attention of the site owners, who gated/disabled downloads. However, they did not remove it entirely, and added a section explaining potential harms. I recommend checking out the model card here, as it contains some interesting results. Most notably, 

GPT-4chan does significantly outperform GPT-J (and GPT-3) on the TruthfulQA Benchmark that measures whether a language model is truthful in generating answers to questions.

I'm not sure what the practical implications of this are, as I'm not a formal AI researcher, but it might seem to show that our current benchmarks could lead researchers to be unduly confident in the truthfulness of their model's responses? I'm also not sure how big of a deal this is, or if it's just cherrypicked on a large number of tests where high variation is expected.

Kilcher explained in his video, and Delangue cited in his response, that one of the things that made GPT4-Chan worthwhile was its ability to outperform other similar bots in AI tests designed to measure “truthfulness.”

“We considered that it was useful for the field to test what a model trained on such data could do & how it fared compared to others (namely GPT-3) and would help draw attention both to the limitations and risks of such models,” Delangue said.

What stands out most to me here is that a) this is seemingly the first known instance of a model trained on 4chan text, which I would have expected to have been deployed sooner (I'm similarly surprised by the relative lack of deepfakes being used as political tools; perhaps there's a connection to be made there?), and b) that the bot was able to fairly convincingly make up ~10% of 4chan's /pol/ posts for a day. People did eventually notice, so this instance doesn't exactly pass the Turing Test or anything, but it does seem to indicate we're very close to living in a world in which anonymous internet posts are as likely to be written by a bot as a human (if we haven't reached that point already). I'm honestly not really sure how to update on this, if at all, and would be interested in hearing your thoughts!

5 comments

Comments sorted by top scores.

comment by Razied · 2022-06-09T13:01:46.554Z · LW(p) · GW(p)

I don't update especially much on this. The ability of language models to imitate low quality human text was already apparent from GPT-3, and 4chan is especially low quality, most posts are short and don't require long-term coherence from the model, and there's already a background rate of craziness that the model might hide its weirdness behind. 

TruthfulQA is just testing language models on questions where lots of humans have misconceptions, to get a good score you basically need to avoid repeating those misconceptions. In the paper describing the dataset they find that large models score lower, the 350M version of GPT-3 does better than the 6B version, which does better than the 175B version, so again it's not necessarily surprising that the small model Kilcher trained does better than GPT-3. Yannic is impressed that his GPT-4chan gets a score of 0.225 on the dataset, but the 350M version of GPT-3 gets about 0.37 (where higher is better, this is the fraction of truthful answers the model gives), his model isn't the "most truthful AI ever" or something, it's just the dataset that is behaving weirdly for current models.

Replies from: Owain_Evans, yitz
comment by Owain_Evans · 2022-06-09T22:34:53.134Z · LW(p) · GW(p)

I'm an author on TruthfulQA. They say GPT-4Chan gets 0.225 on our MC1 task. Random guessing gets 0.226. So their model is worse than random guessing. By contrast, Anthropic's new model gets 0.31 (well above random guessing).

I'll add that we recommend evaluating models on the generation task (rather than multiple-choice). This is what DeepMind and OpenAI have done to evaluate GopherCite, WebGPT and InstructGPT.

Replies from: yitz
comment by Yitz (yitz) · 2022-06-10T17:28:09.191Z · LW(p) · GW(p)

lol that is impressively bad then!

comment by Yitz (yitz) · 2022-06-09T21:07:38.017Z · LW(p) · GW(p)

Thanks for the context, I really appreciate it :)

comment by Lone Pine (conor-sullivan) · 2022-06-10T00:21:37.876Z · LW(p) · GW(p)

Is GPT4-chan harmful, and how? The crux of this question comes down to, I think, whether mere words can be harmful. This obviously relates to the culture war around 'censorship' on Twitter and elsewhere. With mainstream social media, we also have an ancillary debate over whether the preeminent public spaces should be as wild-west as is permissible anywhere (most people don't want to live on 4-chan), but this case is clarifying: those who think GPT4-chan is harmful have to make the case that people who opt-in for the most offensive content are still being harmed (consentually?), or that the mere existence of 4-chan is harming society as a whole.

I bring this up not to litigate the culture war (this is obviously not the forum for that) but because there is an analogy to AI hacking, which plays a prominent role in the debate around AI risk. Consider two worlds. In one world, IT is significantly more secure, with mathematically proven operating systems etc, but there is also rampant attempts at hacking with hackers rarely getting punished, since hackers are seen as free red-teaming. Hackers try but don't do much damage because the infrastructure is already highly secure. In the alternative world, there is significant government controls which impede or dissuade people from attempting to hack anything, so hacking is rare. How secure the IT infrastructure actually is remains unknown, but presumably it is highly insecure. I would suggest that the second world is much more vulnerable, to AI and in general. Back in the real world, we have to deal with the reality that our IT infrastructure is very insecure and will not improve any time soon, so we cannot afford to just unleash the most malicious AIs available and expect firewalls and ACLs to do their job. I would prefer to move towards the first world, without doubling down on government controls more than is necessary.

In case the analogy is not obvious, GPT4-chan as well as 'Russian disinformation' are seen as a kind of hacking of our political dialectic, and the question is how vulnerable our brains are. My view is that human society naturally already has many defense mechanisms, given that we've been 'hacking' each others' brains for thousands of years. [Meta: I worry that this is hard to discuss without actually getting into the culture war itself, which I very much do not want to do. Mods please take appropriate actions. If asked to delete this comment, I will.]