Comment by Kevin Amiri (kevin-amiri) on OpenAI o1, Llama 4, and AlphaZero of LLMs · 2024-09-15T19:39:13.129Z
I recently translated 100 AIME-level math questions from another language into English as a test set for a Kaggle competition. The best model was GPT-4-32k, which solved only 5-6 questions correctly; the rest of the models managed just 1-3.
Then I tried the MATH dataset. While the difficulty level was similar, the results were surprisingly different: 60-80% of the problems were solved correctly.
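The comparison above amounts to scoring the same model on two problem sets and comparing accuracy. A minimal sketch of that setup, where `ask_model` is a hypothetical stand-in for the actual API call and the toy data stands in for the real problem sets:

```python
def ask_model(question: str) -> str:
    # Placeholder for the real model API call used in the experiment.
    return "42"

def accuracy(problems: list[tuple[str, str]]) -> float:
    """Fraction of (question, answer) pairs the model answers correctly."""
    correct = sum(1 for q, a in problems if ask_model(q).strip() == a)
    return correct / len(problems)

# Toy stand-ins for the translated AIME set and the MATH dataset.
translated_aime = [("What is 6*7?", "42"), ("What is 2+2?", "4")]
math_dataset = [("What is 40+2?", "42"), ("What is 21*2?", "42")]

print(accuracy(translated_aime))  # low score on unseen translations
print(accuracy(math_dataset))     # high score on the public benchmark
```

If the gap persists under this kind of controlled comparison, it points toward the public benchmark having leaked into training data rather than a genuine difficulty difference.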
I don't see any improvement from o1 on this either.
Is this a well-known phenomenon, or am I onto something significant here?