The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) benchmark, a prominent test for artificial general intelligence (AGI), is nearing solution, but its creators claim that this progress reveals flaws in the test's design rather than a significant research breakthrough. Introduced in 2019 by Francois Chollet, a leading figure in the AI world, ARC-AGI is designed to evaluate whether an AI system can efficiently acquire new skills outside the data it was trained on.
Until this year, the best-performing AI could solve just under a third of the tasks in ARC-AGI. Chollet attributed this to the industry's focus on large language models (LLMs), which he believes aren't capable of actual "reasoning." According to Chollet, LLMs struggle with generalization because they rely heavily on memorization, and they break down on anything that wasn't in their training data.
In June, Chollet and Zapier co-founder Mike Knoop launched a $1 million competition to build open-source AI capable of beating ARC-AGI. Out of 17,789 submissions, the best scored 55.5% – roughly 20 percentage points higher than 2023's top score, but still short of the 85% "human-level" threshold required to win. However, Knoop emphasized that this doesn't mean we're 20% closer to AGI.
Knoop stated that many of the submissions to ARC-AGI have been able to "brute force" their way to a solution, suggesting that a "large fraction" of ARC-AGI tasks "[don't] carry much useful signal towards general intelligence." This raises concerns about the test's design and its ability to accurately measure progress towards AGI.
ARC-AGI consists of puzzle-like problems in which, given a grid of different-colored squares, an AI has to generate the correct "answer" grid. The problems were designed to force an AI to adapt to new problems it hasn't seen before. However, it's unclear whether the test is achieving this goal.
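To make the format concrete, here is a minimal, hypothetical sketch (not the official ARC-AGI tooling or data) of how such a grid task can be represented and scored: grids are 2D arrays of integer color codes, a task pairs a few demonstration input/output grids with a held-out test grid, and a candidate "solution" counts only if it reproduces every output exactly.

```python
# Minimal sketch of an ARC-style task, assuming grids are 2D lists of color indices.
# The task, rule, and helper names below are illustrative, not from the ARC repository.
from typing import Callable, List

Grid = List[List[int]]  # each int is a color index

# Hypothetical task: the rule implied by the demonstrations is "mirror the grid left-to-right".
task = {
    "train": [
        {"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]},
        {"input": [[5, 5, 0]], "output": [[0, 5, 5]]},
    ],
    "test": [{"input": [[4, 0, 7]], "output": [[7, 0, 4]]}],
}

def mirror_lr(grid: Grid) -> Grid:
    """Candidate program: flip each row left-to-right."""
    return [row[::-1] for row in grid]

def solves_task(program: Callable[[Grid], Grid], task: dict) -> bool:
    """A task counts as solved only if the program reproduces every output grid exactly."""
    pairs = task["train"] + task["test"]
    return all(program(pair["input"]) == pair["output"] for pair in pairs)

print(solves_task(mirror_lr, task))  # True for this toy example
```

The intended difficulty is that the transformation rule differs from task to task, so a solver must infer it from the handful of demonstrations rather than recall it from training data; the brute-force concern is that many rules can instead be found by searching a large space of candidate programs.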
Knoop acknowledged that ARC-AGI "has been unchanged since 2019 and is not perfect." Chollet and Knoop have faced criticism for overselling ARC-AGI as a benchmark towards AGI – at a time when the very definition of AGI is being hotly contested. One OpenAI staff member recently claimed that AGI has "already" been achieved if one defines AGI as AI "better than most humans at most tasks."
In response to these concerns, Chollet and Knoop plan to release a second-generation ARC-AGI benchmark alongside a 2025 competition. According to Chollet, "We will continue to direct the efforts of the research community towards what we see as the most important unsolved problems in AI, and accelerate the timeline to AGI."
The challenges in defining intelligence for AI are reminiscent of the difficulties in defining human intelligence. As the search for AGI continues, it's clear that creating a comprehensive and accurate benchmark will be a crucial step in achieving this goal.
Ultimately, the ARC-AGI benchmark's shortcomings serve as a reminder that the pursuit of AGI is a complex and ongoing challenge. While progress may be made, it's essential to critically evaluate the methods and benchmarks used to measure that progress.