New AI Benchmark ARC-AGI-2 Challenges Leading Models, Stumps Most with 1% Scores

Elliot Kim

March 25, 2025 · 3 min read
The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher François Chollet, has introduced a new, more challenging test to measure the general intelligence of leading AI models. Dubbed ARC-AGI-2, the benchmark has already stumped most models, with "reasoning" AI models like OpenAI's o1-pro and DeepSeek's R1 scoring between 1% and 1.3%, and powerful non-reasoning models like GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash scoring around 1%.

The ARC-AGI tests consist of puzzle-like problems that require an AI to identify visual patterns from a collection of different-colored squares and generate the correct "answer" grid. The problems are designed to force an AI to adapt to new problems it hasn't seen before, making it a more accurate measure of general intelligence. To establish a human baseline, the Arc Prize Foundation had over 400 people take ARC-AGI-2, with "panels" of these people scoring an average of 60% on the test's questions – significantly better than any of the AI models.
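The grid format can be pictured with a toy example. The sketch below is purely illustrative and not an actual ARC-AGI-2 task; real puzzles are far harder, and the solver must infer the transformation rule from a handful of demonstration pairs rather than being told it:

```python
# Toy task in the spirit of ARC: a grid is a list of rows, each cell an
# integer color code. The hidden rule here (hypothetical, for illustration)
# is simply "mirror the grid left-to-right".

def solve(grid):
    """Apply the inferred toy rule: flip each row horizontally."""
    return [row[::-1] for row in grid]

# A demonstration pair showing the rule.
train_input  = [[1, 0, 0],
                [0, 2, 0],
                [0, 0, 3]]
train_output = [[0, 0, 1],
                [0, 2, 0],
                [3, 0, 0]]
assert solve(train_input) == train_output

# The test-taker must produce the answer grid for an unseen input.
test_input = [[4, 4, 0],
              [0, 5, 0]]
print(solve(test_input))  # [[0, 4, 4], [0, 5, 0]]
```

An actual benchmark task supplies only the input/output demonstration grids; identifying a rule that generalizes to the test grid is the whole challenge.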

Chollet claims that ARC-AGI-2 is a better measure of an AI model's actual intelligence than the first iteration of the test, ARC-AGI-1. The new test prevents AI models from relying on "brute force" – extensive computing power – to find solutions, a major flaw of ARC-AGI-1. Instead, ARC-AGI-2 introduces a new metric, efficiency, and requires models to interpret patterns on the fly rather than rely on memorization.

The Arc Prize Foundation's tests are aimed at evaluating whether an AI system can efficiently acquire new skills outside the data it was trained on. As co-founder Greg Kamradt wrote in a blog post, "Intelligence is not solely defined by the ability to solve problems or achieve high scores. The efficiency with which those capabilities are acquired and deployed is a crucial, defining component."

The arrival of ARC-AGI-2 comes as many in the tech industry are calling for new, unsaturated benchmarks to measure AI progress. Hugging Face's co-founder, Thomas Wolf, recently told TechCrunch that the AI industry lacks sufficient tests to measure the key traits of so-called artificial general intelligence, including creativity.

Alongside the new benchmark, the Arc Prize Foundation announced its Arc Prize 2025 contest, which challenges developers to reach 85% accuracy on the ARC-AGI-2 test while spending no more than $0.42 per task – a target meant to reward efficient approaches rather than raw compute.

The significance of ARC-AGI-2 lies in offering a more accurate measure of general intelligence than earlier, increasingly saturated benchmarks. As the AI industry evolves, rigorous evaluation frameworks like it will only grow in importance, and with the Arc Prize 2025 contest the foundation aims to turn that measurement into a spur for progress toward artificial general intelligence.

Copyright © 2024 Starfolk. All rights reserved.