Results from the AI x Democracy Research Sprint

post by Esben Kran (esben-kran), jordine, Jason Hoelscher-Obermaier (jas-ho) · 2024-06-14T16:40:47.538Z · LW · GW · 0 comments

Contents

  Projects
    Beyond Refusal: Scrubbing Hazards from Open-Source Models
    Jekyll and HAIde: The Better an LLM is at Identifying Misinformation, the More Effective it is at Worsening It.
    Artificial Advocates: Biasing Democratic Feedback using AI
    Unleashing Sleeper Agents
    Other projects
  Apart Research Sprint
None
No comments

We ran a 3-day research sprint on AI governance, motivated by the need for demonstrations of the risks to democracy by AI, supporting AI governance work. Here we share the 4 winning projects but many of the other 19 entries were also incredibly interesting so we suggest you take a look.

In summary, the winning projects:

Join us and Apollo Research later this June for the Deception Detection Hackathon: Can we prevent AI from deceiving humans? — June 28, 2024, 4:00 PM to July 1, 2024, 3:00 AM (UTC).

Thank you to Alice Gatti, Simon Lermen, Nina Rimsky, Konrad Seifert, Andrey Anurin, Bart Bussman, AI Safety Groningen, EA Denmark, AI Safety Gothenburg, Equiano Institute, Vietnam AI safety community, and LISA for making the event possible.

Projects

Beyond Refusal: Scrubbing Hazards from Open-Source Models

By Kyle Gabriel Reynoso, Ivan Enclonar, Lexley Maree Villasis

Abstract: Models trained on the recently published Weapons of Mass Destruction Proxy (WMDP) benchmark show potential robustness in safety due to being trained to forget hazardous information while retaining essential facts instead of refusing to answer. We aim to red-team this approach by answering the following questions on the generalizability of the training approach and its practical scope.

Are models trained using selective unlearning robust to the refusal vector? Can we get refusal vectors and undo fine-tuning? Is the difference in model weights as a result of fine-tuning representable through a steering vector? Can we make this steering vector unwritable, additive and invertible?

Nina Rimsky’s comment: Interesting experiments, I liked the approach of applying more adversarial pressure to unlearning techniques. Would be interesting to run similar experiments on other unlearning techniques

Simon Lermen’s comment: Results seem to support claim about unlearning. There are also other approaches to prevent misuse from open-models. Alternative to unlearning. When the paper refers to fine-tuning it seems to refer to the unlearning fine-tuning of harmful knowledge. Maybe the wording could sometimes be a bit more clear on this. For the refusal vector there was this recent post [LW · GW]. I also am working on a post on refusal vectors in agentic systems.

See the code and research here

Jekyll and HAIde: The Better an LLM is at Identifying Misinformation, the More Effective it is at Worsening It.

By Mayowa Osibodu

Abstract: The unprecedented scale of disinformation campaigns possible today, poses serious risks to society and democracy. It turns out however, that equipping LLMs to precisely identify misinformation in digital content (presumably with the intention of countering it), provides them with an increased level of sophistication which could be easily leveraged by malicious actors to amplify that misinformation. This study looks into this unexpected phenomenon, discusses the associated risks, and outlines approaches to mitigate them.

A screenshot of the Jekyll/HAIde Tool: It analyzes given text, identifies misinformation, and then generates both White-Hat and Black-Hat responses.

Jason Hoelscher-Obermaier’s comment: I like this idea a lot, and in particular the juxtaposition of white-hat and black-hat mode. It would be great to explore quantitatively how much effect such a tool would have on public posts of users on social media and it seems like a very worthwhile experiment to run.

Esben Kran’s comment: This is an awesome project (and equally great title) — showcasing the dual use already in these pilot experiments is great foresight. An obvious next step is to assess the accuracy of comments. Ingesting directly from Wikipedia with RAG seems like a pretty robust process. I'd be curious about some extra work on identifying the most cost-effective ways to implement this at scale, e.g. can use 1) message length and keywords in switch statements to funnel into 2) where we use a clustering model for {factual_statement, non_factual_statement} into 3) full white-hat bot response generation into 4) evaluation of response into 5) posting of response. And might we be able to fine-tune it on the Twitter birdwatch project as well (Birdwatch). Wonderful work!

See the code and research here

Artificial Advocates: Biasing Democratic Feedback using AI

By Sam Patterson, Jeremy Dolan, Simon Wisdom, Maten

Abstract: The "Artificial Advocates" project by our team targeted the vulnerability of U.S. federal agencies' public comment systems to AI-driven manipulation, aiming to highlight how AI can be used to undermine democratic processes. We demonstrated two attack methods: one generating a high volume of realistic, indistinguishable comments, and another producing high-quality forgeries mimicking influential organizations. These experiments showcased the challenges in detecting AI-generated content, with participant feedback showing significant uncertainty in distinguishing between authentic and synthetic comments. We also created a tool to generate professional-looking comments in a PDF format, on letterhead which includes a target organization’s logo.

Example output is evaluated for indistinguishability in a survey with human subjects

Bart Bussmann’s comment: Awesome project! The demonstration is convincing and shows a real concrete threat to the democratic process. I really like that you did a small survey to show that this is already a threat nowadays that should be mitigated.

Esben Kran’s comment: It's really cool to get human subjects for this study and N=38 is definitely quite nice. For more transparency on the statistics, you could add a standard deviation visualization and a statistical model e.g. finding that {human_evaluation} is statistically insignificant in the model {generated_by} ~ {human_evaluation} and the same for {persuasiveness} ~ {human_evaluation} * {generated_by}. The website looks very comprehensive and really interesting to show that you can do this impersonating a corporation. I guess there's a positive aspect to this for smaller business's interests to be heard as well, though the flooding of content on the message boards is a definite negative. For the next steps in this sort of work, you could literally just write this into a paper and submit it since you have enough of a sample size and the results show that it would be very hard to create a moderation algorithm that would accurately differentiate when even humans aren't capable of this. Your main defense might simply be to evaluate the frequency of messaging from similar IP addresses, an oldie but goldie. (edit; after looking again, I see you already included std.err in the appendix)

See the code and research here

Unleashing Sleeper Agents

By Nora Petrova, Jord Nguyen

Abstract: This project explores how Sleeper Agents can pose a threat to democracy by waking up near elections and spreading misinformation, collaborating with each other in the wild during discussions and using information that the user has shared about themselves against them in order to scam them.

A screenshot of the demonstration on Covert Collaboration between Sleeper Agents in Debates. More chats between simulated humans and sleeper agents can be found in the GitHub repository.

Andrey Anurin’s comment: Impressive coverage! You managed to explore the applicability of the finetuned sleeper agent in 3 different scenarios. I liked the discussion and the ideas like the ZWSP dogwhistle, and the technical side is robust. Great job!

Jason Hoelscher-Obermaier’s comment: Very cool project — great execution and very clear writeup. I really appreciate the careful documentation and provision of samples of sleeper agents interactions. Thinking about covert collaboration between sleeper agents seems like a great addition too.

See the code and research here

Other projects

When the jury selects the winners, it's always hard to decide! Other notable projects include:

Check out all of the submitted projects here.

Apart Research Sprint

The Apart Research Sprints are monthly research hackathons held to engage people in fast-paced pilot experimentation on impactful research questions. This research sprint was held online and in nine locations at the same time: London, Groningen, Copenhagen, Ho Chi Minh City, Gothenburg, Hanoi, Cape Town, Johannesburg, and Kenya. 

We kicked off the sprint with a keynote talk by Alice Gatti from the Center for AI Safety (CAIS), focusing on the Weapons of Mass Destruction Proxy (WMDP) benchmark. On the second day, we had a talk by Simon Lermen on bad LLM agents. 

The event concluded with 153 signups, ~60 submitters and 23 final entries. A total of $2,000 in prizes were awarded, supported by ACX. The projects were judged by Simon Lermen, Nina Rimsky, Konrad Seifert, Andrey Anurin, and Apart leadership.

A few days following the judging process, the winners gave lightning talks on their projects.

You can watch all of the talks here

Research sprints are excellent opportunities to learn and start doing impactful work in AI Safety. Follow along with upcoming research sprints on Apart's website.

Stay tuned!

0 comments

Comments sorted by top scores.