xAI, the AI startup founded by billionaire Elon Musk, has released its latest flagship AI model, Grok 3, which powers the company's Grok chatbot apps. Trained on around 200,000 GPUs, Grok 3 reportedly beats other leading models, including those from OpenAI, on benchmarks for mathematics, programming, and more. However, the question remains: do these benchmarks really tell us anything meaningful about AI proficiency?
The AI industry often relies on benchmarks to measure model improvements, but these tests tend to probe esoteric knowledge and produce aggregate scores that correlate poorly with real-world tasks. As Wharton professor Ethan Mollick pointed out, there is an "urgent need for better batteries of tests and independent testing authorities." AI companies frequently self-report their benchmark results, which makes those numbers difficult to accept at face value.
Mollick likened current AI testing to food reviews based on personal taste and emphasized the need for more comprehensive, standardized testing methods. Other experts echo this sentiment, proposing that benchmarks be aligned with economic impact to ensure their usefulness. Still others argue that adoption and utility are the ultimate benchmarks, a debate that may continue indefinitely.
In the meantime, some experts suggest paying less attention to new models and benchmarks unless they represent major technical breakthroughs. That approach may be necessary for our collective sanity, even if it induces some level of AI FOMO.
In related news, OpenAI is changing its AI development approach to explicitly embrace "intellectual freedom," regardless of how challenging or controversial a topic may be. Former OpenAI CTO Mira Murati has also launched a new startup, Thinking Machines Lab, which aims to build tools that cater to people's unique needs and goals.
Meta will host its first developer conference dedicated to generative AI, called LlamaCon, on April 29. The conference will focus on the company's Llama family of generative AI models. Additionally, OpenEuroLLM, a collaboration between 20 organizations, is working on building "foundation models for transparent AI in Europe" that preserve linguistic and cultural diversity across all EU languages.
In research news, OpenAI has created a new AI benchmark, SWE-Lancer, to evaluate the coding prowess of powerful AI systems. The benchmark consists of over 1,400 freelance software engineering tasks, and even the best-performing model tested, Anthropic's Claude 3.5 Sonnet, scores only 40.3% on the full set.
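To make that headline number concrete, here is a minimal sketch of how a benchmark of this shape might aggregate per-task results into a single score. The task schema, field names, and the idea of summing payout dollars are illustrative assumptions for this sketch, not SWE-Lancer's actual grading harness.

```python
from dataclasses import dataclass

# Hypothetical task record; SWE-Lancer's real schema and grading harness differ.
@dataclass
class TaskResult:
    task_id: str
    payout_usd: float   # freelance payout attached to the task
    passed: bool        # did the model's patch pass the task's tests?

def aggregate(results: list[TaskResult]) -> dict:
    """Summarize benchmark results as a pass rate and as dollars 'earned'."""
    total = len(results)
    passed = [r for r in results if r.passed]
    return {
        "pass_rate": len(passed) / total if total else 0.0,
        "earned_usd": sum(r.payout_usd for r in passed),
        "total_usd": sum(r.payout_usd for r in results),
    }

if __name__ == "__main__":
    demo = [
        TaskResult("fix-login-bug", 250.0, True),
        TaskResult("add-export-api", 1000.0, False),
        TaskResult("migrate-db-schema", 500.0, True),
    ]
    print(aggregate(demo))
    # {'pass_rate': 0.666..., 'earned_usd': 750.0, 'total_usd': 1750.0}
```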
Stepfun, a Chinese AI company, has released an "open" AI model, Step-Audio, which can understand and generate speech in several languages, including Chinese, English, and Japanese. The model lets users adjust the emotion, and even the dialect, of the synthetic audio it creates, and it can sing as well.
Nous Research, an AI research group, has released a model that unifies reasoning with "intuitive language model capabilities." The model, DeepHermes-3 Preview, can toggle long "chains of thought" on and off, trading some computational heft for improved accuracy. Anthropic reportedly plans to release a similar model soon, and OpenAI has said such a model is on its near-term roadmap.
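As a rough illustration of what "toggling" long chains of thought can look like in practice, here is a minimal sketch using a Hugging Face transformers-style chat interface. The model ID and the reasoning-mode system prompt below are assumptions for illustration; consult Nous Research's DeepHermes-3 model card for the documented values.

```python
from transformers import pipeline

# Assumed repository name; verify against the official model card.
MODEL_ID = "NousResearch/DeepHermes-3-Llama-3-8B-Preview"

# Placeholder prompt, not the official reasoning-mode prompt.
REASONING_SYSTEM_PROMPT = (
    "You are a deep-thinking assistant. Reason step by step inside "
    "<think>...</think> tags before giving your final answer."
)

chat = pipeline("text-generation", model=MODEL_ID)

def ask(question: str, reasoning: bool = False) -> str:
    """Run one chat turn, optionally enabling the long chain-of-thought mode."""
    messages = []
    if reasoning:
        messages.append({"role": "system", "content": REASONING_SYSTEM_PROMPT})
    messages.append({"role": "user", "content": question})
    out = chat(messages, max_new_tokens=512)
    # The pipeline returns the full conversation; the last message is the reply.
    return out[0]["generated_text"][-1]["content"]

# Same question with the reasoning mode off and on: the second call trades
# extra compute for a more deliberate answer.
print(ask("What is the 10th Fibonacci number?"))
print(ask("What is the 10th Fibonacci number?", reasoning=True))
```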
As the AI landscape continues to evolve, it's clear that the industry needs to rethink its approach to measuring AI proficiency. While benchmarks may provide some insights, they are only a small part of the puzzle. It's time to move beyond the hype and focus on developing AI models that truly make a meaningful impact in the real world.