ML Safety at NeurIPS & Paradigmatic AI Safety? MLAISU W49

post by Esben Kran (esben-kran), Steinthal · 2022-12-09T10:38:34.693Z · LW · GW · 0 comments

This is a link post for https://newsletter.apartresearch.com/posts/ml-safety-at-neurips-paradigmatic-ai-safety-w49

Contents

  ChatGPT jailbreaks
  Responsible AGI Institutions
  Generating consensus on diverse human values
  Automating interpretability
  The Plan
  NeurIPS ML Safety Workshop
  Opportunities:
None
No comments

Watch this week’s episode on YouTube or listen to the audio version here.

This week, we see how to break ChatGPT, how to integrate diverse opinions in an AI and look at a bunch of the most interesting papers from the ML safety workshop happening right now!

ChatGPT jailbreaks

Last week, we reported that ChatGPT has been released along with text-davinci-003. In the first five days, it received over a million users, a product growth not seen in a long time. And if that wasn’t enough, OpenAI also released WhisperV2 that presents a major improvement to voice recognition.

However, all is not safe with ChatGPT! If you have been browsing Twitter, you’ll have seen the hundreds of users who have found ways to circumvent the model’s learned safety features. Some notable examples include extracting the pre-prompt from the model, getting advice for illegal actions by making ChatGPT pretend or joke, making it give information about the web despite its wishes not to and much more. To see more about these, we recommend watching Yannic Kilchers video about the topic.

Rebecca Gorman and Stuart Armstrong [AF · GW] found a fun way to make the models more safe albeit also more conservative, by running the prompt through an Eliezer Yudkowsky-simulating language model prompt. You can read more about this in the description.

Responsible AGI Institutions

ChatGPT is released on the back of OpenAI releasing their alignment strategy which we reported on a few weeks ago. Bensinger publishes [AF · GW] Yudkowsky and Soares’ call for other organizations developing AGI to release similar alignment plans and commends OpenAI for releasing theirs, though they do not agree with its content.

The lead of the alignment team at OpenAI has also published a follow-up on his blog about why he is optimistic about their strategy. Jan Leike has five main reasons: 1) AI seem favorable for alignment, 2) we just need to align AI strong enough to help us with alignment, 3) evaluation is easier than generation, 4) alignment is becoming iterable and 5) language models seem to become useful for alignment research.

Generating consensus on diverse human values

One of the most important tasks of value alignment is to understand what “values” mean. This can be done from both a theoretical (such as shard theory) and an empirical view. In this new DeepMind paper, they train a language model to take in diverse opinions and create a consensus text.

Their model reaches a 70% acceptance rate by the opinion-holders, 5% better than a human written consensus text. See the example in their tweet for more context. It is generally awesome to see more empirical alignment work coming out of the big labs than earlier.

Automating interpretability

Redwood Research has released what they call “causal scrubbing” [AF · GW]. It is a way to automate the relatively inefficient circuits interpretability work on for example the transformer architecture. 

To use causal scrubbing, you create a causal model of how you expect different parts of a neural network to contribute to the output based on a specific type of input. By doing this, the causal scrubbing mechanism will automatically ablate the neural network towards falsifying this causal model. A performance recovery metric is calculated that summarizes how much a causal claim about the model seems to retain the performance when “unrelated” parts of the neural network are removed.

The Plan

Wentworth releases his update [AF · GW] of “The Plan” [AF · GW], a text he published a year ago about his view on how we might align AI. He describes a few interesting dynamics of the current field of AI safety, his own updates from 2022 and his team’s work.

Notably, multiple theoretical and empirical approaches to alignment seem to be converging on identifying which parts of neural networks model which parts of the world, such as shard theory, mechanistic interpretability and mechanistic anomaly detection.

NeurIPS ML Safety Workshop

Now to one of the longer parts of this newsletter. The ML Safety Workshop at the NeurIPS conference is happening today! Though the workshop has not started yet, the papers are already available! Here, we summarize a few of the most interesting results:

Opportunities:

This week, we have a few very interesting opportunities available:

This has been the ML & AI safety update. We look forward to seeing you next week!
 

0 comments

Comments sorted by top scores.