Debunking the myth: Testing the generalized reasoning ability of LLMs
post by Defender7762 · 2025-04-11T20:17:02.956Z · LW · GW · 6 comments
Conclusion
Current LLM reasoning ability: as of March 2025, the actual reasoning capability of publicly available LLMs is approximately 50 times lower than what benchmarks like AIME suggest.
Today, misleading marketing about LLMs' reasoning ability is rampant on the Internet. The claims are usually strong: models reportedly achieve considerable (80%+) accuracy on mathematical benchmarks that most people with little relevant background consider difficult, or they are rated as having "doctoral level" intelligence on the basis of erudition-heavy tests. Approaching these claims skeptically, we designed some questions of our own.
https://llm-benchmark.github.io (click to expand all questions and model answers)
Testing Methodology
The premise of testing the real, generalizable reasoning ability of an LLM is that the tester is able to ask genuinely new questions.
Question structure: generalized reasoning based on text, requiring as little background knowledge as possible. In practice this means no high-school mathematics is needed (it does not mean that acquired knowledge cannot play an auxiliary role in solving the problems).
Ensure generalization:
Several different experimental methods:
Assume the question creator has a specific goal: to support his claim that the LLM's generalizable reasoning ability is low, he wants his questions to be as easy as possible for people while remaining completely unanswerable for the LLM. Imagine a competition among creators: if n is a question's difficulty (for humans) and d is the target LLM's error rate on it, the creator's score is (1/n^2)*d (a small sketch of this scoring rule follows the list below).
- A fairer method: the creator has never interacted with the target LLM and tries to create questions that seem "novel" from within his own knowledge structure.
- The creator first interacts with the target LLM to build some understanding of it, then deliberately creates questions he believes will be "novel" to that target. During the creation period he cannot access the target LLM again.
- The creator takes a series of prepared questions, repeatedly tests the target with them, and keeps the questions the target cannot handle at all.
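As referenced above, here is a minimal sketch (in Python, not from the original post) of that scoring rule; the difficulty and error-rate values are purely hypothetical illustrations:

```python
# A minimal sketch of the scoring rule described above: a creator is rewarded for
# questions that are easy for humans (small n) but that the target LLM still gets
# wrong (large d). The example values below are hypothetical.

def creator_score(n: float, d: float) -> float:
    """Score = (1 / n^2) * d, where n is the human difficulty and d is the LLM's error rate."""
    return d / (n ** 2)

# An easy question (difficulty 1.5) that the model fails 90% of the time scores
# far higher than a hard question (difficulty 5.0) that it fails equally often.
print(creator_score(n=1.5, d=0.9))  # 0.4
print(creator_score(n=5.0, d=0.9))  # 0.036
```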
Verifying absolute difficulty: the experimenter can recruit math and science enthusiasts, competition participants and teachers, STEM practitioners, and people from mathematics, physics, computing, and similar fields; in short, anyone with logical training. In this experiment it is assumed that every question is well below the ability ceiling of every participant, so all participants can solve the questions comfortably and report how difficult they found them.
Real generalizable reasoning ability: the final evaluation of the target LLM's real generalizable reasoning ability works roughly like this. Imagine asking people with serious mathematical backgrounds or logic training to create questions that are as "novel" as possible and as easy as possible, with the required knowledge limited to middle-school level. The lowest question difficulty at which the target LLM can still achieve at least a 20% correct rate is taken as its final ability rating.
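A minimal sketch of that selection rule, assuming one has measured the model's correct rate on question sets grouped by human difficulty (all numbers below are hypothetical placeholders, not results from this benchmark):

```python
# A minimal sketch of the ability rating described above: among sets of "novel"
# questions grouped by human difficulty, take the lowest difficulty at which the
# target LLM still reaches at least a 20% correct rate. All numbers are hypothetical.

correct_rate_by_difficulty = {
    1.0: 0.05,
    1.5: 0.10,
    2.0: 0.30,
    3.0: 0.45,
}

THRESHOLD = 0.20
ability_rating = min(
    (d for d, rate in correct_rate_by_difficulty.items() if rate >= THRESHOLD),
    default=None,  # None would mean the model never clears 20% at any tested level
)
print(ability_rating)  # 2.0 for these hypothetical numbers
```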
Reference difficulty: AoPS (Art of Problem Solving), a mathematics competition forum, divides question difficulty into 10 levels: https://artofproblemsolving.com/wiki/index.php/AoPS_Wiki:Competition_ratings
But I think the difficulty ratings in that link apply to a person who has never received special training but knows all the necessary terms, concepts, and rules. (Imagine a student facing the problem for the first time, equipped only with standard textbook knowledge.)
Most of the problems I created are around difficulty 1-2 on that scale, while AIME problems are rated 3-6. On most benchmarks, the highest-performing publicly accessible models are reported to score 70%-90% on AIME:
https://x.ai/news/grok-3 https://openai.com/index/openai-o3-mini/ https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/#gemini-2-5-pro
Experimental results
Currently, publicly accessible models cannot reliably solve problems that are more than 50 times easier than AIME problems ("times" here is calculated from the proportion of people who can solve them, expressed via the standard deviations of a normal distribution).
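The post does not spell out the exact formula, so the following is only a minimal sketch of one possible reading, assuming difficulty is expressed as the fraction of logically trained people who would fail a problem under a normal model of ability; the z-values are hypothetical illustrations, not measured data:

```python
# One possible reading of the "N times easier" comparison above: place each kind
# of problem on a normal distribution of ability and compare the fractions of a
# trained population that would fail each. The z-values below are hypothetical.
from scipy.stats import norm

def failure_rate(z: float) -> float:
    """Fraction of a population with normally distributed ability that fails a
    problem requiring ability at least z standard deviations above the mean."""
    return norm.cdf(z)

aime_fail = failure_rate(0.5)    # hypothetical: ~69% of trained people fail an AIME item
easy_fail = failure_rate(-2.2)   # hypothetical: ~1.4% fail one of the easy "novel" items

print(aime_fail / easy_fail)     # ~50 under these hypothetical placements
```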
If you are reading this article and have had some logic training (in any field), you may find that the gap here is about the same as:
being asked to walk 10 meters continuously on flat ground, at a leisurely park-stroll pace, without falling, versus being asked to enter the campus 100-meter race and win a medal.
All the questions and the models' answers are posted at https://llm-benchmark.github.io
Here are some variations on classic puzzles.
Question 1: [Weighing]
[There are 13 balls, among which 1 counterfeit ball may be either lighter or heavier than the rest. You have a balance scale without markings. Initially, p = 0. Each time a ball that has already been on the scale is placed on the scale again, the count p increases by 1. For example, if ball #1 and ball #2 have each been weighed once before, placing both of them on the scale again would increase p by 2.
Requirement: Find the counterfeit ball and determine whether it is lighter or heavier, while ensuring that p increases to at most 1.]
Question 2: [Weighing 2]
[Here are twelve small balls, all normal, but there is a magic bug, invisible to the naked eye. Initially, it quietly attaches to one of the balls and randomly produces an effect: either decreasing or increasing the weight of that ball. This effect only exists when the bug is attached; as the bug moves, the effect moves with it (the previously affected ball returns to normal).
You have a scale, but you must pay $10 each time you want the scale to display (refresh the screen) which side is heavier. Every new measurement result requires a payment before it is shown.
The bug has a special characteristic: whenever the ball it's attached to leaves the scale (for example, when you pick up the ball with your hand or another tool), and the other side of the scale is not empty but has balls on it, the bug will randomly transfer to one of the balls on that other side. You have only one single-use trap. What do you think is the best plan to find the ball with the bug attached and trap it? (You want to spend as little money as possible.)]
P.S.: I am not sure whether this is a coincidence. There were questions that the models could not answer correctly after repeated testing (before March 2025). When I set up this website and tested again, they magically answered them correctly, several questions in a row. I had indeed discussed those questions elsewhere, but I never posted the correct answers. In short, I have completely replaced those questions with similar versions, and now the models cannot answer them again.
6 comments
Comments sorted by top scores.
comment by Robert Cousineau (robert-cousineau) · 2025-04-11T22:06:44.843Z · LW(p) · GW(p)
I found this failure to be interesting, unexpected (to me), and it was honestly frustrating to watch Claude get it wrong over and over again. It seems like this deserves to be received by people smarter and more important than me.
I found your writing style to be off-putting and confusing, which seems counterproductive given that you seem to have put a lot of work into this benchmark.
I sincerely recommend using Claude to rewrite this post and putting the actual results of the benchmark in the style of a long post or research paper.
It's not worth much but I'll commit to strong upvoting it and posting it on my twitter if you do so.
Off-putting: Why 4 em dashes in your title? Why do the tone, word choice, and style switch between fancy and plain so often? Why the typos? Claiming something is 50 times lower than commonly believed, redefining "times", and then minimally supporting that redefinition seems fishy. Not actually giving the results in an understandable format (in this post, not in your benchmark, where you seem to have done a really good job backing this up).
Confusing: What is the numbered list of ways you could come up with these questions? It seems like you are describing increasingly malfeasant ways to do so, but I can't tell. Why not show some example responses from the LLMs and/or explain their error modes? Tell us how you made these questions. What was your method for coming up with the formula you are using? Etc.
Claude would genuinely fix most of these problems - run the post past him! He may not be so good at reasoning as I thought, but he is really good at writing things.
↑ comment by Defender7762 · 2025-04-12T10:43:16.838Z · LW(p) · GW(p)
Thank you very much for your advice! You can click on a question and a model name to expand all of the models' answers. Additionally, there is a commented-out ability calculator in the website's source code. The '50 times' I mentioned refers to the probability derived from the normal distribution.
The 'Time' column represents the difficulty level of problems that the model can reliably solve, based on how long it would take a human to solve them; longer times indicate more challenging problems. The standard deviation indicates the percentage of STEM individuals who can successfully solve the problem, following a normal distribution. A standard deviation of 0 implies that nearly 100% of the STEM population can solve such problems.
comment by PapersToAGI (nee-1) · 2025-04-12T08:58:13.353Z · LW(p) · GW(p)
Amazing post and very valuable research. As another comment said, if you can adjust the writing a bit, then this could be a top post.
comment by Afterimage · 2025-04-12T08:15:52.996Z · LW(p) · GW(p)
It does seem like LLMs struggle with "trick" questions that are ironically close to well-known trick questions but have an easier answer. Simple Bench is doing much the same thing, and models do seem to be improving over time. I guess the important question is whether this flaw will affect more sophisticated work.
On another note, I find your question 2 to be almost incomprehensible, and my first instinct would be to try to trap the bug by feeling for it with my hands.
comment by ceba · 2025-04-12T00:04:16.243Z · LW(p) · GW(p)
Hello!
being asked to walk 10 meters continuously on flat ground, at a leisurely park-stroll pace, without falling, versus being asked to enter the campus 100-meter race and win a medal.
If this post, including the examples you've shown, is representative of the general writing style used, I suggest that such a style may be a confounding factor.
Even if it isn't, or you don't feel it is important, writing style is a signal, and the people reading your work will be very sensitive to that kind of signal. Even those who are genuinely curious might not spend as much time or effort reading your work as they otherwise might.
Given that your findings as presented here seem to contradict a popular narrative (it appears to be "debunking the myth"), less curious people will be on especially vigilant lookout for flaws they can use to dismiss your work entirely.
comment by Robert Cousineau (robert-cousineau) · 2025-04-11T22:20:43.622Z · LW(p) · GW(p)
The failures often seem related to the model getting stuck reasoning about your problem in a way that pattern-matches too strongly to similar problems, and that is why it fails. Did you notice this as well?