ARC-AGI is a genuine AGI test but o3 cheated :(

post by Knight Lee (Max Lee) · 2024-12-22T00:58:05.447Z · LW · GW · 6 comments

The developer of ARC-AGI says o3 is not AGI, and admits his test isn't really an AGI test.[1] But I think it is an AGI test.

From first principles, AGI tests should be easy to make, because the only reason current AI isn't AGI is that it fails at a lot of things human workers do on a daily basis.

I feel the ARC-AGI successfully captures these failings. When an AI encounters a problem where the solution isn't in its training set, and where even the set of rules for how to solve such a problem isn't in its training set, it has to figure out everything from first principles. Current AI fails miserably at these problems because it isn't AGI.

The ARC-AGI questions are not in any LLM's training set, because almost no human would write out the reasoning by which they solve these kinds of visual puzzles. Maybe some humans will explain IQ tests on a video, but the video transcript would be useless without the accompanying images.

In summary, the reason that current AI cannot solve the ARC-AGI questions is probably the true reason that current AI cannot replace most human work.[2] It cannot effectively learn new tasks from little training data.

OpenAI's o3 crushed the ARC-AGI. So will it also replace most human work, and be "AGI"? Maybe... except it "cheated," in the sense that it was trained on the test's public training set (instead of being tested on its ability to start from scratch).

From ARC Prize:

Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.

They ask "how much of the performance is due to ARC-AGI data." Probably most of it. If untuned o3 could do as well, don't you think OpenAI would have published that (in addition to the tuned o3 result)?

Once o3 is trained on the public training set, the ARC-AGI is no longer an AGI test. It becomes yet another test of memorizing rules from training data. This is still impressive, but it is something else.

I do not know what exactly OpenAI did. Did they let o3 spend a long time generating chains of thought, and reward ones which led to the correct answer? If that failed, did they have a human give examples of correct reasoning steps, and train it on that first? I don't know.

They admitted they "cheated," without saying how they "cheated."

To be fair, this isn't really cheating in the sense they are allowed to use data, and it's why it's called a "public training set." But the version of the test that allows this, is not the AGI test.

Also to be fair, o3 is still a force to be reckoned with, and a reason to be both impressed and worried. I have no hard feelings against OpenAI, because almost all AI labs are doing this, all the time. If you don't do this you won't survive in the toxic battlefield of AI lab competition. We just have to take AI lab demonstrations with a healthy dose of skepticism and a grain of salt.

EDIT: People at OpenAI deny "fine-tuning" o3 for the ARC (see this comment [LW · GW] by Zach Stein-Perlman). But to me, the denials sound like "we didn't use a separate derivative of o3 (that's fine-tuned for just the test) to take the test, but we may have still done reinforcement learning on the public training set." (See my reply [LW · GW])

  1. ^

    This is according to NewScientist.

    The "developer of ARC-AGI" refers to François Chollet.

  2. ^

    I guess another reason AI cannot replace most human work is reliability. The low reliability prevents AI from replacing high-responsibility human labour, and the poor ability to learn new tasks prevents AI from replacing high-innovation human labour.

6 comments

comment by Dave Orr (dave-orr) · 2024-12-22T19:22:55.554Z · LW(p) · GW(p)

It seems very strange to me to say that they cheated, when the public training set is intended to be used exactly for training. They did what the test specified! And they didn't even use all of it.

The whole point of the test is that some training examples aren't going to unlock the rest of it. What training definitely does is teach the model how to output the JSON in the right format, and likely how to think about what to even do with these visual puzzles.

Do we say that humans aren't a general intelligence even though for ~all valuable tasks, you have to take some time to practice, or someone has to show you, before you can do it well?

Replies from: gwern, Max Lee
comment by gwern · 2025-01-08T18:42:38.611Z · LW(p) · GW(p)

Do we say that humans aren't a general intelligence even though for ~all valuable tasks, you have to take some time to practice, or someone has to show you, before you can do it well?

More pointedly, I didn't see anyone complaining about the previous champion doing 100%-ARC-only online training while trying to solve ARC, so why would you complain about weaker offline training as a small part of a giant pretraining corpus?

(Generating millions of examples to train on, yes, people did complain about that and arguably that is 'cheating', but 'not using frozen weights'? No. Because it would be absurd to complain about that given that the human results involve learning and are not generated by humans zero-shot given only 1 problem ever, or by injecting them with anesthesia to erase memory formation after each problem etc.)

Replies from: Max Lee
comment by Knight Lee (Max Lee) · 2025-01-08T22:56:09.711Z · LW(p) · GW(p)

I'm not saying that o3's results are meaningless.

I'm just saying that first of all, o3's score has a different meaning than the scores of other models, because other models didn't do RL on ARC-like questions. Even if you argue that it should be allowed, other AIs didn't do it, so it's not right to compare o3's score with theirs without giving any caveats.

Second of all, o3 didn't decide to do RL on these questions on its own. It required humans to run RL on it before it could do these questions. This means that if AGI required countless unknown skills similarly hard to ARC questions, then o3 wouldn't be AGI. But an AI which could spontaneously reason how to do ARC questions, without any human-directed RL for it, would be AGI. Also, humans can learn from doing lots of test questions without being told what the correct answer was.

The public training set is weaker, but I argued it's not a massive difference [LW · GW].

Replies from: dave-orr
comment by Dave Orr (dave-orr) · 2025-01-09T04:17:09.540Z · LW(p) · GW(p)

Progress in ML looks a lot like, we had a different setup with different data and a tweaked algorithm and did better on this task. If you want to put an asterisk on o3 because it trained in some specific way that's different from previous contenders, then basically every ML advance is going to have a similar asterisk. Seems like a lot of asterisking.

Replies from: Max Lee
comment by Knight Lee (Max Lee) · 2025-01-09T04:40:18.842Z · LW(p) · GW(p)

Maybe we can draw a line between the score an AI gets without using human written problem/solution pairs in any way, and the score an AI gets after using them in some way (RL on example questions, training on example solutions, etc.).

In the former case, we're interested in how well the AI can do a task as difficult as the test, all on its own. In the latter case, we're interested in how well the AI can do a task as difficult as the test, if working with humans training it for the task.

I really want to make it clear I'm not trying to badmouth o3, I think it is a very impressive model. I should've written my post better.

comment by Knight Lee (Max Lee) · 2024-12-23T07:39:02.762Z · LW(p) · GW(p)

When I first wrote the post I did make the mistake of writing they were cheating :( sorry about that.

A few hours later I noticed the mistake and removed the statements, put the word "cheating" in quotes and explained it at the end.

To be fair, this isn't really cheating in the sense they are allowed to use data, and it's why it's called a "public training set." But the version of the test that allows this, is not the AGI test.

It's possible you saw the old version due to browser caches.

Again, I'm sorry.

I think my main point still stands.

I disagree that "The whole point of the test is that some training examples aren't going to unlock the rest of it. What training definitely does is teach the model how to output the JSON in the right format, and likely how to think about what to even do with these visual puzzles."

I don't think poor performance on benchmarks by SOTA generative AI models is due to failing to understand output formatting, or that models should need example questions in their training data (or reinforcement learning sets) to compensate for this. Instead, a good benchmark should explain the formatting clearly to the model, maybe with examples in the input context.
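To make the "examples in the input context" idea concrete, here is a minimal sketch in Python. ARC tasks are JSON objects with "train" and "test" lists of input/output grid pairs, where each grid is a list of rows of integers from 0 to 9. Everything else here is an assumption for illustration: the toy task (mirror each row), the prompt wording, and the helper names `build_prompt` and `is_correct` are hypothetical and are not how ARC Prize or any lab actually runs the test.

```python
import json

# Hypothetical toy task in the ARC JSON layout: "train" holds demonstration
# pairs, "test" holds the pair to be solved. The rule (mirror each row) is
# made up for illustration and is not a real ARC task.
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [
        {"input": [[0, 5], [7, 0]], "output": [[5, 0], [0, 7]]},
    ],
}


def build_prompt(task: dict) -> str:
    """Put the format instructions and the demonstration pairs in context."""
    lines = [
        "Each grid is a JSON list of rows; each cell is an integer from 0 to 9.",
        "Infer the rule from the examples, then reply with only the output grid as JSON.",
        "",
    ]
    for i, pair in enumerate(task["train"], start=1):
        lines.append(f"Example {i} input: {json.dumps(pair['input'])}")
        lines.append(f"Example {i} output: {json.dumps(pair['output'])}")
    lines.append(f"Test input: {json.dumps(task['test'][0]['input'])}")
    lines.append("Test output:")
    return "\n".join(lines)


def is_correct(reply: str, expected: list) -> bool:
    """Exact-match scoring: the parsed reply must equal the expected grid."""
    try:
        return json.loads(reply) == expected
    except json.JSONDecodeError:
        # A reply that is not valid JSON counts as a wrong answer.
        return False


prompt = build_prompt(task)
print(prompt)
```

With a prompt like this, the formatting is explained entirely in context, so a wrong answer reflects a failure to infer the rule rather than a failure to guess the output format.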

I agree that tuning the model using the public training set does not automatically unlock the rest of it! But I strongly disagree that this is the whole point of the test. If it was, then the Kaggle SOTA is clearly better than OpenAI's o1 according to the test. This is seen vividly in François Chollet's graph.

No one claims this means the Kaggle models are smarter than o1, nor that the test completely fails to test intelligence since the Kaggle models rank higher than o1.

Why does no one seem to be arguing for either? Probably because of the unspoken understanding that they are doing two versions of the test. One where the model fits the public training set, and tries to predict on the private test set. And two where you have a generally intelligent model which happens to be able to do this test. When people compare different models using the test, they are implicitly using the second version of the test.

Most generative AI models did the harder second version, but o3 (and the Kaggle versions) did the first version, which—annoyingly to me—is the official version. It's still not right to compare other models' scores with o3's score.

Even if o3 (and the Kaggle models) "did what the test specified," they didn't do what most people who compare LLMs using the ARC benchmark are looking for, and this has the potential to mislead these people.

The Kaggle models don't mislead these people because they are very clearly not generally intelligent, but o3 does mislead people (maybe accidentally, maybe deliberately).

From what I know, o3 probably underwent reinforcement learning on the public training set (see my comment [LW · GW]).

We disagree on this but may agree on other things. I agree o3 is extremely impressive due to its 25.2% FrontierMath score, where the contrast against other models is more genuine (though there is a world of difference between the easiest 25% of questions and the hardest 25% of questions).