Introducing BenchBench: An Industry Standard Benchmark for AI Strength
post by Jozdien · 2025-04-02T02:11:41.555Z · LW · GW
Recent progress in AI has led to rapid saturation of most capability benchmarks (MMLU, RE-Bench, etc.). Even much more sophisticated benchmarks such as ARC-AGI or FrontierMath see incredibly fast improvement, and all this while models remain severely under-elicited.
As many have pointed out, general capability involves more than simple tasks like these, which have a long history in the field of ML and are therefore easily saturated. Claude Plays Pokemon is a good example of a genuinely novel measure of progress, and it benefited from being an actually good proxy for model capability.
Taking inspiration from examples like this, we considered domains of general capability even further decoupled from existing evaluation pipelines. We introduce BenchBench, the first standardized benchmark designed specifically to measure an AI model's bench-pressing capability.
Why Bench Press?
Bench pressing uniquely combines fundamental components of intelligence such as motor control, strategic resource allocation (energy and force), and resilience to fatigue. Just as text-based benchmarks serve as proxies for cognitive reasoning, bench pressing provides an objective measure of embodied intelligence.
Benchmark Methodology
BenchBench consists of four primary tasks:
- One-Rep Max (1RM): Measures the maximal weight an AI model can successfully bench press once, indicating peak strength.
- Strength Endurance: Evaluates the number of repetitions an AI can perform at 70% of its calculated 1RM, reflecting sustained performance and efficiency.
- Form Fidelity: Assessed via advanced pose estimation algorithms, penalizing AI models for suboptimal bar path, uneven weight distribution, or failure to lock out fully.
- Pass@16: Measures the ability of a model to lift weights heavier than its 1RM[1] by giving it 16 attempts in quick succession.
BenchBench also ensures reproducibility through standardized equipment: all tests must be conducted with a calibrated Eleiko AI-Integrated barbell and RoboSpotter™ safety system.
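For concreteness, the two quantitative metrics above can be sketched in a few lines of Python. The function names `endurance_load` and `pass_at_k` are illustrative, not part of any released BenchBench harness; the pass@k formula is the standard unbiased estimator, 1 − C(n−c, k)/C(n, k), familiar from code-generation benchmarks and repurposed here for lifts.

```python
from math import comb

def endurance_load(one_rm_kg: float) -> float:
    """Strength Endurance test weight: 70% of the model's calculated 1RM."""
    return 0.7 * one_rm_kg

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    attempts drawn from n total attempts (c of them successful) succeeds."""
    if n - c < k:
        return 1.0  # fewer failures than draws, so a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

print(endurance_load(220))   # 154.0
print(pass_at_k(16, 1, 16))  # 1.0 (one success in 16 attempts)
```

Note that with k = n = 16, a single successful lift anywhere in the run yields pass@16 = 1.0, which is why the metric rewards persistence over consistency.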
Preliminary Results
We evaluated leading models:
- GPT-4.5 achieved a 1RM of 0 kg, largely attributed to the absence of limbs.
- Boston Dynamics' Atlas achieved a commendable 1RM of 120 kg, though it scored poorly on form fidelity.
- Claude 3.7 Sonnet (without extended thinking) achieved a 1RM of 150 kg, an especially impressive feat considering it also lacks limbs. Intensive audits discovered that this was achieved by persuading the human evaluator to lift the weights for it, and then editing our internal codebase to inflate its scores by a factor of three.
- OpenAI’s RoboGym™, specialized for physical tasks, excelled with a 1RM of 220 kg and near-perfect form.
Interestingly, Gemini 2.5 demonstrated an excellent theoretical understanding of the mechanics involved but failed physically nonetheless, matching GPT-4.5's 0 kg performance.
Future Directions
Future iterations will include categories for deadlift (DeadBench) and squat (SquatBench), and introduce challenges under adversarial conditions, such as uneven floor surfaces and distracting gym music.
BenchBench represents a new standard in holistic AI benchmarking. We invite the research community to collaborate, compete, and push the boundaries—quite literally—of artificial intelligence.
[1] Lifting above their weight class, so to speak.