Are the LLM "intelligence" tests publicly available for humans to take?

post by nim · 2023-03-17T00:09:00.842Z · LW · GW · 3 comments

This is a question post.

Contents

  Answers
    4 plex
    1 Lech Mazur
    1 adamarkin
    1 p.b.
    1 baturinsky
3 comments

I've seen a lot of news lately about the ways that particular LLMs score on particular tests.

Which, if any, of those tests can I take online to see how my performance compares to the models'?

Answers

answer by plex · 2024-04-26T10:10:53.004Z · LW(p) · GW(p)

https://www.equistamp.com/evaluations has a bunch, including an alignment knowledge one they made.

answer by Lech Mazur · 2024-04-25T02:18:08.478Z · LW(p) · GW(p)

You can go through an archive of NYT Connections puzzles I used in my leaderboard. The scoring I use allows only one try and gives partial credit, so if you make a mistake after getting 1 line correct, that's 0.25 for the puzzle. Top humans get near 100%. Top LLMs score around 30%. Timing is not taken into account.
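To compare your own result against the leaderboard numbers, here is a minimal sketch of the scoring rule as described above. It is one reading of that description, not the leaderboard's actual code: it assumes each puzzle has four groups, that groups are submitted in order, and that scoring stops at the first wrong group. The function name and the placeholder words are hypothetical.

```python
def connections_score(guesses, solution):
    """Partial credit for a single-attempt puzzle (assumed rule, see above).

    guesses:  groups in the order they were submitted, each an iterable of words
    solution: the correct groups, each an iterable of words
    Each correct group submitted before the first mistake is worth
    1 / len(solution); everything after the first mistake is ignored.
    """
    remaining = {frozenset(group) for group in solution}
    solved = 0
    for guess in guesses:
        key = frozenset(guess)
        if key not in remaining:
            break  # only one try: the first wrong group ends the puzzle
        remaining.remove(key)  # a group cannot be credited twice
        solved += 1
    return solved / len(solution)


# Placeholder puzzle with made-up words, four groups of four.
SOLUTION = [
    {"a1", "a2", "a3", "a4"},
    {"b1", "b2", "b3", "b4"},
    {"c1", "c2", "c3", "c4"},
    {"d1", "d2", "d3", "d4"},
]

# One group right, then a mistake on the second group.
attempt = [
    {"a1", "a2", "a3", "a4"},
    {"b1", "b2", "b3", "c1"},
]
print(connections_score(attempt, SOLUTION))  # 0.25
```

Averaging this score over all puzzles in the archive gives a percentage in the same form as the human and LLM figures quoted above.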

answer by adamarkin · 2023-12-15T07:04:37.707Z · LW(p) · GW(p)

I took the test at https://iqtestonline.io. The test contains 30 questions that must be completed within 20 minutes. It tests your numerical, logical, and spatial reasoning skills. I got my IQ score for free right after completing it.
I have taken a qualified test before, and the result was quite similar to this online test. I think it's a quick and easy way to get a sense of where you rank.

answer by p.b. · 2023-03-17T08:37:28.138Z · LW(p) · GW(p)

On Twitter, IQ scores of (IIRC) 84 for ChatGPT and 96 for GPT-4 were making the rounds; maybe you're referring to those? I believe these scores are based on this freely available online test:

https://iqtest.com/take-the-test/

I took it on Wednesday just for fun. It's purely text-based but involves many different types of reasoning (including spatial reasoning). It's also a timed test, which arguably inflates the LLM scores compared to humans.

comment by Viliam · 2023-03-17T11:52:26.434Z · LW(p) · GW(p)

Is this one of those tests where you spend a lot of time answering the questions, and at the end there is "if you want to see the results, send money"?

Also, is there any reason to believe that the test was actually somehow validated, as opposed to just numbers completely made up?

Replies from: p.b.
comment by p.b. · 2023-03-17T12:21:21.047Z · LW(p) · GW(p)

This is one of the tests where you spend not too much time (I think I took 13 minutes), and at the end there is "this is your result, if you want to see fine-grained scores, send money".

Well, they claim that it is somehow validated and my score was somewhat realistic. 

Main negative point was the need to provide an email for the result. 

answer by baturinsky · 2023-03-17T06:05:18.244Z · LW(p) · GW(p)

Not for all of them, but for many of them you can see the data and other info around here: https://paperswithcode.com/dataset/mmlu

comment by [deleted] · 2023-03-17T06:31:10.350Z · LW(p) · GW(p)

I browsed around but cannot find the actual MMLU questions, or even an example of one question. How do I view them?

Replies from: baturinsky
comment by baturinsky · 2023-03-17T06:52:55.803Z · LW(p) · GW(p)

"Homepage" button links to github, github readme links to tar with tests. Yeah, it's kinda not evident in some cases.

3 comments

Comments sorted by top scores.

comment by jmh · 2023-03-17T00:41:40.280Z · LW(p) · GW(p)

Not sure if you've seen this or not: https://mashable.com/article/openai-gpt-4-exam-scores

But that references a number of standardized tests, some of which I suspect you have also taken. Here are a couple of links to practice tests that might be good matches for you to try.

https://www.tests.com/Free-Practice-Tests

https://www.khanacademy.org/college-careers-more/college-admissions/making-high-school-count/standardized-tests/a/full-length-sats-to-take-online

[Rewrite as I don't think the first comment was actually helpful.]

Replies from: nim
comment by nim · 2023-03-17T16:18:32.505Z · LW(p) · GW(p)

Thank you! I didn't see your first version of this, but your current version is helpful for the human-specific tests that they're benchmarked on :)

Replies from: cSkeleton
comment by cSkeleton · 2024-04-24T22:11:39.336Z · LW(p) · GW(p)

Is there any information on how long the LLMs spent taking the tests? I'd like to know how that compares with human times. (I realize it can depend on hardware, etc., but would just like some general idea.)