Annual AGI Benchmarking Event

post by Lawrence Phillips · 2022-08-27T00:06:31.972Z · LW · GW · 3 comments

This is a link post for https://www.metaculus.com/questions/12339/annual-agi-benchmarking-event/


Metaculus is strongly considering organizing an annual AGI benchmarking event. Once a year, we’d run a benchmark or suite of benchmarks against the most generally intelligent AI systems available to us at the time, seeking to assess their generality and the overall shape of their capabilities. We would publicize the event widely among the AI research, policy, and forecasting communities.
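As a rough illustration of the kind of artifact a yearly run could produce, here is a minimal sketch in Python: a suite of tasks grouped by domain, each system scored per task, with domain averages giving a coarse picture of the "shape" of its capabilities. The task structure, model interface, and scoring are assumptions for illustration, not a proposed design.

```python
# Hypothetical sketch of what one annual benchmark run might record.
# Task names, the model interface, and the scoring scheme are
# illustrative assumptions, not Metaculus's actual plan.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Task:
    name: str
    domain: str  # e.g. "language", "vision", "planning"
    evaluate: Callable[[Callable[[str], str]], float]  # returns a score in [0, 1]

def run_suite(model: Callable[[str], str], tasks: List[Task]) -> Dict[str, float]:
    """Run every task against one system and return per-task scores."""
    return {task.name: task.evaluate(model) for task in tasks}

def capability_profile(scores: Dict[str, float], tasks: List[Task]) -> Dict[str, float]:
    """Average scores by domain to show the overall shape of capabilities."""
    by_domain: Dict[str, List[float]] = {}
    for task in tasks:
        by_domain.setdefault(task.domain, []).append(scores[task.name])
    return {domain: sum(vals) / len(vals) for domain, vals in by_domain.items()}
```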

Why?

We think this might be a good idea for several reasons:

How?

We're currently working on a plan, and are soliciting ideas and feedback from the community here. To guide the discussion, here are some properties we think the ideal benchmark should have. It would:

Once we’ve collected the strongest ideas and developed them into a cohesive whole, we will solicit feedback from the AI research community before publishing the final plan. Thanks for your contributions to the discussion – we look forward to reading and engaging with your ideas!

Background reading

Here are a few resources to get you thinking.

Threads:

An idea based on iteratively crowdsourcing adversarial questions

A discussion on AGI benchmarking

Papers:

On the Measure of Intelligence

What we might look for in an AGI benchmark

General intelligence disentangled via a generality metric for natural and artificial intelligence

3 comments


comment by Charlie Steiner · 2022-08-27T23:21:42.078Z · LW(p) · GW(p)

Yeah, tough challenge. A lot of the things that immediately come to mind for settling questions about AGI are complicated real-world tasks that are hard to convert into benchmarks (e.g. driving cars in complicated situations that require interpreting text on novel road signs). That's especially true for not-quite-general AI, which might do well on some subset of the necessary tasks but not on enough of them to actually complete the full challenge.

So I guess what I want is granularity, where you break a task like "read and correctly react to a novel road sign while driving a car" into a series of steps, and test how good AIs are at those steps. But this is a lot of work, both for benchmark design and for AI developers who have to figure out how to get their AI to interface with all these different things it might be asked to do.
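A minimal sketch of that kind of step-wise scoring, assuming a hypothetical per-step scoring interface and an illustrative decomposition of the road-sign task (none of this is from the comment itself):

```python
# Illustrative sketch of the "granularity" idea: break one end-to-end
# task into steps and score each step separately. The step names and
# the per-step scoring interface are hypothetical.
from typing import Callable, Dict, List

Scorer = Callable[[str], float]  # takes a step name, returns a score in [0, 1]

ROAD_SIGN_STEPS: List[str] = [
    "detect the sign in the scene",
    "transcribe the text on the sign",
    "interpret what the text requires",
    "plan a compliant driving action",
]

def step_scores(score_step: Scorer, steps: List[str]) -> Dict[str, float]:
    """Score each sub-task separately so partial competence stays visible
    even when the full end-to-end task would fail."""
    return {step: score_step(step) for step in steps}
```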

comment by Noosphere89 (sharmake-farah) · 2022-08-27T23:33:53.457Z · LW(p) · GW(p)

There is plausibly one big purpose it serves: acting as a legible fire alarm for AGI, so that we actually know when to respond. Eliezer Yudkowsky has said that one of the biggest challenges for AGI alignment is that there is no fire alarm for AGI, and an event like this could be a way to create one.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2022-08-27T07:24:25.643Z · LW(p) · GW(p)

Very cool. Something I've been wondering about recently is how to keep benchmarks from becoming publicly available and optimized against by developers. I had an idea for a benchmark and thought about submitting it to BigBench, but decided against it because I was worried that the precautions against BigBench being optimized against were insufficient. It seems like a good idea for a company to keep its model private and expose just a filtered API, but what about evaluators who want to keep their benchmarks private? Maybe there could be a special arrangement where companies agreed not to record benchmark queries as they passed through their APIs to their models?
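One rough sketch, under heavy assumptions, of how such an arrangement could look: the evaluator publishes only a hash commitment to its private benchmark items, then queries the lab's API under an agreement (contractual rather than technical) that the prompts are not retained. Everything here, including the query_model and grade callables, is a hypothetical illustration, not an existing protocol.

```python
# Hypothetical private-benchmark protocol: commit to the items up front,
# send them through an API that has agreed not to log them, publish only
# aggregate scores, and reveal the items later so the commitment can be
# checked. All names and interfaces here are assumptions.
import hashlib
import json
from typing import Callable, List

def commit_to_benchmark(items: List[str]) -> str:
    """Publish this digest before evaluation; reveal the items afterwards
    so third parties can verify the benchmark wasn't changed post hoc."""
    canonical = json.dumps(sorted(items)).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

def evaluate_privately(items: List[str],
                       query_model: Callable[[str], str],
                       grade: Callable[[str, str], float]) -> float:
    """Send each private item to the (assumed non-logging) API and return
    the mean grade, which is the only thing made public."""
    scores = [grade(item, query_model(item)) for item in items]
    return sum(scores) / len(scores)
```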