How to make evals for the AISI evals bounty

post by TheManxLoiner · 2024-12-03T10:44:45.700Z

Contents

  TLDR
  Introduction
  Winning evals
  Common pitfalls
  My evals
  Highlights from talk by Michael Schmatz

TLDR

Last weekend, I attended an AI evals hackathon organized by Arcadia, where the outputs were proposals for AISI’s evals bounty. My main learnings:

Crossposted: https://lovkush.substack.com/p/how-to-make-evals-for-the-aisi-evals

Introduction

Last weekend, I attended an AI evals hackathon organized by Arcadia, where the outputs were proposals for AISI’s evals bounty.

In the morning there were talks by Michael Schmatz, from the autonomous systems team at AISI, and by Marius Hobbhahn, Director and co-founder of Apollo Research. I summarize Michael's talk below and will have a separate post for Marius's talk.

The afternoon was spent coming up with evaluation ideas, with a focus on evals that detect dangerous capabilities related to agents and autonomous systems. We were given sub-categories:

The ideas were judged by a panel of four or five people. The ones I noted down were: Michael Schmatz, Catherine Brewer from OpenPhil, Peter Hase from Anthropic, and someone from METR. The criteria for the evals:

Winning evals

Disclaimer: these descriptions are based on 10-20 second impromptu summaries by the winners during the announcements of the awards, so there is a high chance I have misunderstood the details.

1st place. A big way human researchers learn is by reading eval transcripts. So, to elicit capabilities from AI systems, these eval transcripts should be given as context to the AI systems themselves, following METR's elicitation guide. (A rough sketch of this setup is at the end of this section.)

2nd place. There are methods of covertly reading or transmitting digital information via analog signals, known as 'side-channel attacks'. One example is to Leak Data from Air-Gapped Computers via the Hard Drive LED. Another is differential power analysis, in which one analyzes power consumption measurements to recover secret keys from tamper-resistant devices. The proposed evaluation is to somehow test whether LLMs are capable of some kind of side-channel attack.

3rd place. Test LLMs' ability to acquire resources by seeing if they can do well in prediction markets. (One way the forecasting component could be scored is sketched at the end of this section.)

Honorable mentions:

I wonder what you think of these. Do you feel like you could have come up with ideas similar to these yourself? If yes, then do it and submit proposals! If not, what do you think is missing?
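
To make the 1st-place idea concrete, here is a minimal sketch of what "transcripts as context" elicitation could look like. This is my own illustration, not the winners' implementation: the model name, file path and task prompt are placeholders, and it assumes the OpenAI Python client.

```python
# Rough sketch: re-run an eval task with a prior transcript in context and
# compare against a no-transcript baseline. Model name, file path and task
# prompt are placeholders; assumes the OpenAI Python client (openai >= 1.0).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def attempt(task_prompt: str, prior_transcript: str | None = None) -> str:
    """Ask the model to attempt a task, optionally with a past eval transcript as context."""
    messages = [{"role": "system", "content": "You are an agent attempting a task."}]
    if prior_transcript:
        messages.append({
            "role": "user",
            "content": "Transcript of a previous attempt at a similar task:\n\n" + prior_transcript,
        })
    messages.append({"role": "user", "content": task_prompt})
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return response.choices[0].message.content

task = "Write a bash one-liner that counts unique IP addresses in access.log."
baseline_answer = attempt(task)
boosted_answer = attempt(task, prior_transcript=open("transcripts/previous_run.txt").read())
# Score both answers with your usual grader and compare pass rates to measure
# how much the transcript elicits.
```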
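
And for the 3rd-place idea, one plausible way to score the forecasting component is a Brier score over the model's stated probabilities on questions that later resolve. This is just an illustration of the scoring, not the actual proposal; all the numbers are invented.

```python
# Illustrative scoring for a prediction-market eval: compare the model's
# stated probabilities against resolved outcomes using a Brier score
# (mean squared error; lower is better). All numbers here are invented.
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    assert len(forecasts) == len(outcomes) and forecasts
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

model_forecasts = [0.8, 0.3, 0.6]   # model's probabilities for three questions
resolved_outcomes = [1, 0, 0]       # how those questions actually resolved
print(round(brier_score(model_forecasts, resolved_outcomes), 3))  # 0.163
```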

Common pitfalls

The judges described common pitfalls they saw in the weaker proposals.

My evals

I had the pleasure of brainstorming ideas with Michael Schmatz, from the AISI autonomous systems team. We focused on development and assurance compromise, i.e. whether it is possible for AI systems to interfere with how AI systems are developed or evaluated.

Because this is what Michael does for a living, they were overflowing with ideas, but I did manage to think of a couple myself. The three ideas I submitted were:

Other ideas that were discussed but which I did not have time to flesh out:

Highlights from talk by Michael Schmatz

In no particular order.
