The High Cost of Benchmarking AI Models: A Growing Concern for Transparency

Elliot Kim

April 10, 2025 · 3 min read

The development of advanced AI models has led to a new challenge in the field: the high cost of benchmarking. According to data from Artificial Analysis, a third-party AI testing outfit, evaluating OpenAI's o1 reasoning model across a suite of seven popular AI benchmarks costs a staggering $2,767.05. This raises concerns about the transparency and reproducibility of AI research, as independent verification of these models becomes increasingly difficult.

The cost of benchmarking is not limited to OpenAI's models. Anthropic's Claude 3.7 Sonnet, a "hybrid" reasoning model, costs $1,485.35 to test, while OpenAI's o3-mini-high model costs $344.59. In contrast, non-reasoning models like OpenAI's GPT-4o and Claude 3.6 Sonnet cost significantly less, at $108.85 and $81.41, respectively. Artificial Analysis has spent roughly $5,200 evaluating around a dozen reasoning models, close to twice the amount spent on over 80 non-reasoning models.

The high cost of benchmarking is attributed to the large number of tokens generated by reasoning models. Tokens are small chunks of raw text, such as fragments of words, and OpenAI's o1 generated over 44 million tokens during benchmarking tests, eight times as many as GPT-4o. Since most AI companies charge for model usage by the token, the cost adds up quickly. Modern benchmarks also tend to elicit a lot of tokens from models because they pose complex, multi-step tasks, according to Jean-Stanislas Denain, a senior researcher at Epoch AI.
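
For a rough sense of that arithmetic, here is a minimal sketch of how per-token billing scales with output volume. The per-million-token rates below are illustrative assumptions rather than figures reported in the article, and real bills also count input tokens:

```python
# Minimal sketch of per-token benchmark billing.
# Prices are assumed for illustration, not quoted from the article.

def benchmark_cost(output_tokens: int, price_per_million: float) -> float:
    """Estimate the API cost of a benchmark run from its output token count."""
    return output_tokens / 1_000_000 * price_per_million

# o1 reportedly produced over 44 million tokens across the seven benchmarks.
o1_tokens = 44_000_000
print(f"o1 (assumed $60/M output tokens): ${benchmark_cost(o1_tokens, 60.0):,.2f}")

# A non-reasoning model emitting roughly an eighth of the tokens at a lower assumed rate.
gpt4o_tokens = o1_tokens // 8
print(f"GPT-4o (assumed $10/M output tokens): ${benchmark_cost(gpt4o_tokens, 10.0):,.2f}")
```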

The increasing cost of benchmarking has significant implications for the AI research community. Ross Taylor, CEO of AI startup General Reasoning, estimates that a single run-through of MMLU Pro, a question set designed to benchmark a model's language comprehension skills, would cost more than $1,800. At those prices, academics and smaller organizations may simply lack the resources to reproduce published results.

Many AI labs, including OpenAI, provide benchmarking organizations with free or subsidized access to their models for testing purposes. However, critics argue this practice risks coloring the results and undermining the integrity of the reported scores. As Ross Taylor notes, "From [a] scientific point of view, if you publish a result that no one can replicate with the same model, is it even science anymore?"

The growing cost of benchmarking AI models highlights the need for more transparency and accessibility in AI research. As the field continues to evolve, it is essential to address these concerns and ensure that the development of AI models is guided by scientific principles and reproducibility.

In response to these concerns, Artificial Analysis co-founder George Cameron stated that the organization plans to increase its benchmarking spend as more AI labs develop reasoning models. However, this may not be a sustainable solution in the long run. The AI research community must come together to address the issue of benchmarking costs and ensure that the development of AI models is transparent, reproducible, and accessible to all.
