What are the most interesting / challenging evals (for humans) available?
post by Raemon · 2024-12-27T03:05:26.831Z · 5 comments
This is a question post.
I want to build a nice testing ground for human rationality (that is to say, solving arbitrary complex problems in different domains using limited information and time).
This would be a lot of annoying work to assemble, but (un)fortunately there's an AI industry that's designing a whole bunch of evals to test the ability of their AIs to solve arbitrary complex problems in arbitrary domains sooo.... anyone have good recommendations for public eval questions?
I started my project using Thinking Physics exercises. I think there is something particularly nice about the quality of Thinking Physics exercises (in that they require conceptual reasoning), but, they are only one domain, and also I had trouble getting the author to sell me rights to them.
I've used GPQA. The questions didn't turn out to be as interesting as I expected (they're not bad, but the skill ceiling didn't turn out to be as high as I thought based on the description).
Evals/benchmarks are generally kept off the public internet. My plan is to use these on a website that requires a login and can include a canary string, but I'd be interested in questions that have been publicly released, or whose benchmark is already saturated, or whatever else seems appropriate for cooperating with the eval designers' intentions.
Do people have particular recommendations and/or any knowledge about what they expect to work well here?
ADDENDA:
I'm happy for now with "more interesting and harder and more varied-in-required-skillset than GPQA", but my ideal has problems that a particularly smart x-risk researcher (like a 95th percentile alignment researcher, i.e. there are somewhere between 10-30 people who might count) would take 30 minutes or so to reliably get right on their first try, and the median x-risk researcher more like 2-8 hours (with maybe a 50% chance of figuring it out in 1-3 hours).
The ideal is that people have to:
- go through a period of planning, and replanning
- spend at least some time feeling like the problem is totally opaque and they don't have traction.
- have to reach for tools that they don't normally reach for.
It may be that we just don't have evals at this level yet, and I might take what I can get, but, it's what I'm aiming for.
I'm not trying to make an IQ test – my sense from the literature is that you basically can't raise IQ through training. So many people have tried. This is very weird to me – subjectively it is just really obvious to me that I'm flexibly smarter in many ways than I was in 2011 when I started the rationality project, and this is due to me having a lot of habits I didn't use to have. The hypotheses I currently have are:
- You just have to be really motivated to do transfer learning, and have a genuinely inspiring / good teacher, and it's just really hard to replicate this sort of training scientifically.
- IQ is mostly measuring "fast intelligence", because that's what's cost-effective to measure in large enough quantities to get a robust sample. I.e., it measures whether you can solve questions in a few minutes, which mostly depends on you being able to intuitively get it. It doesn't measure your ability to figure out how to figure out something that requires long-term planning, which would allow a lot of planning skills to actually come into play.
Both seem probably at least somewhat true, but the latter one feels like a clearer story for why there would be potential (at least theoretically) in the space I'm exploring – IQ tests take a few hours to administer; it would be extremely expensive to do a statistically valid version of the thing I'm aiming at.
My explicit goal here is to train researchers who are capable of doing the kind of work necessary in worlds where Yudkowsky is right about the depth/breadth of alignment difficulty.
Answers
answer by abstractapplic
D&D.Sci, for Data Science and related skills (including, to an extent, inference-in-general).
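For a feel of the shape of these exercises, here is a minimal sketch of the usual D&D.Sci loop in Python. The file name and columns are made up for illustration, since each scenario ships its own dataset:

```python
# Sketch of the usual D&D.Sci workflow: explore the scenario's dataset, form
# hypotheses about the hidden generating process, and commit to a decision.
# "adventures.csv" and its columns are hypothetical, not from a real scenario.
import pandas as pd

df = pd.read_csv("adventures.csv")
print(df.describe())  # orient yourself: ranges, outliers, suspicious gaps

# Hypothesis: does party size drive expected loot?
print(df.groupby("party_size")["loot"].mean())

# The scored part is committing to a concrete decision before the reveal.
best_size = df.groupby("party_size")["loot"].mean().idxmax()
print(f"Chosen party size: {best_size}")
```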
answer by sjadler
I quite like the Function Deduction eval we built, which is a problem-solving game that tests one's ability to efficiently deduce a hidden function by testing its value on chosen inputs.
It’s runnable from the command-line (after repo install) with the command:
oaieval human_cli function_deduction
(I believe that’s the right second term, but it might just be humancli)
The standard mode might be slightly easier than you want, because it gives some partial-answer feedback along the way. There is also a hard mode, which doesn't give this partial feedback and so is harder to brute-force.
For context, I could solve ~95% of the standard problems with a time limit of 5 minutes and a 20-turn game limit. GPT-4 could solve roughly half within that turn limit IIRC, and o1-preview could do ~99%. (o1-preview also scored ~99% on the hard mode variant; I never tested myself on it, but I estimate maybe I'd have gotten 70%? The problems could also be scaled up in difficulty pretty easily if one wanted.)
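To make the game concrete, here is a toy sketch of the loop in Python. Everything in it (the candidate pool, the ask/guess commands, the feedback rules) is my own simplification for illustration; the actual eval lives in the openai/evals repo and has its own spec:

```python
# Toy re-implementation of a Function Deduction game loop, for intuition only.
# The candidate pool, commands, and feedback rules are simplified guesses,
# not the actual openai/evals implementation.
import random

# Hypothetical pool of hidden functions the player must identify.
CANDIDATES = {
    "x^2 + 1": lambda x: x * x + 1,
    "3x - 2": lambda x: 3 * x - 2,
    "x^3": lambda x: x ** 3,
    "2^x": lambda x: 2 ** x,
}

def play(max_turns: int = 20) -> bool:
    """One round: query inputs or guess; win by naming the hidden function."""
    name, f = random.choice(list(CANDIDATES.items()))
    for turn in range(1, max_turns + 1):
        move = input(f"[turn {turn}/{max_turns}] 'ask N' or 'guess NAME': ").strip()
        if move.startswith("ask "):
            n = int(move.removeprefix("ask "))
            print(f"f({n}) = {f(n)}")  # standard mode reveals the queried value
        elif move.startswith("guess "):
            if move.removeprefix("guess ").strip() == name:
                print("Correct!")
                return True
            print("Wrong guess.")  # partial feedback; a hard mode could omit this
    print(f"Out of turns. The function was: {name}")
    return False

if __name__ == "__main__":
    play()
```

The interesting skill the real eval targets is choosing inputs that discriminate between remaining hypotheses as efficiently as possible, rather than querying at random.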
answer by Towards_Keeperhood
Perhaps also not what you're looking for, but you could check out the Google Hash Code archive (here's an example problem). I never participated myself, so I don't know whether they'd make great tests, but it seems to me like general ad-hoc problem-solving capabilities matter more in Hash Code than in other competitive programming competitions.
GPT-4 summary: "Google Hash Code problems are real-world optimization and algorithmic challenges that require participants to design efficient solutions for large-scale scenarios. These problems are typically open-ended and focus on finding the best possible solution within given constraints, rather than exact correctness."
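To give a flavor of that format, here is a toy problem in the Hash Code spirit with a greedy baseline. Both the problem and the heuristic are invented for illustration; real Hash Code tasks are far larger and messier, and any valid solution scores, with higher-quality solutions scoring more:

```python
# Toy open-ended optimization task: given jobs with a duration and a reward,
# and a global time budget, pick a subset that maximizes total reward.
# The greedy heuristic is one baseline a contestant might start from,
# not an optimal or official solution.
from typing import List, Tuple

def greedy_schedule(jobs: List[Tuple[int, int]], budget: int) -> Tuple[int, List[int]]:
    """jobs: (duration, reward) pairs. Returns (total_reward, chosen indices)."""
    # Heuristic: take jobs with the highest reward per unit time first.
    order = sorted(range(len(jobs)), key=lambda i: jobs[i][1] / jobs[i][0], reverse=True)
    total, used, chosen = 0, 0, []
    for i in order:
        duration, reward = jobs[i]
        if used + duration <= budget:
            used += duration
            total += reward
            chosen.append(i)
    return total, chosen

if __name__ == "__main__":
    jobs = [(3, 10), (5, 12), (2, 7), (8, 20)]
    print(greedy_schedule(jobs, budget=10))
```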
answer by Towards_Keeperhood
Maybe not what you're looking for because it's not one hard problem but more like many problems in a row, and I don't really know whether they are difficult enough, but you could (have someone) look into Exit games. Those are basically escape rooms in a box. I'd filter for Age 16+ to hopefully select the hard ones, though maybe you'd want to separately look up which ones are particularly hard.
I did one or two when I was 15 or 16, and recently remembered them; I want to try some more for fun (and maybe also introspection), though I haven't gotten around to it yet. I think they are relatively ad-hoc puzzles, though as with basically anything, you can of course train to get good at Exit games in particular by practicing. (It's possible that I totally overestimate the difficulty and they are actually more boring than I expect.)
(Btw, probably even less applicable to what you are looking for, but CodingEscape is also really fun. Especially the "Curse of the five warriors" is good.)
5 comments
comment by abstractapplic · 2024-12-27T04:02:52.072Z
You might want to look into tests given to job applicants. (Human intelligence evaluation is an entire industry already!)
reply by abstractapplic · 2024-12-27T12:06:56.429Z
( . . . and IQ tests, and exam papers, and probably some other things that are too obvious for me to call to mind . . . )
comment by Raemon · 2024-12-27T17:48:18.498Z
Clarification (I'll add this to the OP):
The ideal I'm looking for is things that will take a smart researcher (like a 95th percentile alignment researcher, i.e. there are somewhere between 10-30 people who might count) at least 30 minutes to solve, and most alignment researchers would maybe have a 50% chance of figuring it out in 1-3 hours.
comment by Perhaps · 2024-12-27T21:08:38.492Z
You might find some puzzle games to be useful. In particular, Understand is a game that was talked about on here as being good for learning how to test hypotheses and empirically deduce patterns. Similar to your Baba Is You experiments.