The AI community has been abuzz over the latest development in video generation: Google's Veo 2 produced a realistic video of actor Will Smith eating spaghetti. The peculiar prompt has become both a meme and an informal test of new AI video generators. But Will Smith's pasta-filled adventure is just the tip of the iceberg, as unusual AI benchmarks are taking the industry by storm.
In 2024, a 16-year-old developer created an app that allows AI to design structures in Minecraft, while a British programmer built a platform where AI plays games like Pictionary and Connect 4 against each other. These unconventional benchmarks have captured the imagination of the AI community, but they also raise important questions about the effectiveness of traditional evaluation methods.
One of the main issues with traditional AI benchmarks is that they often don't resonate with the average person. Companies frequently tout their AI's ability to answer complex math questions or solve Ph.D.-level problems, but these achievements may not translate to real-world applications. In contrast, weird AI benchmarks like Will Smith eating spaghetti or AI-designed Minecraft structures are more relatable and entertaining, making them more accessible to a broader audience.
Another problem with traditional benchmarks is that they often don't compare an AI system's performance to that of an average person. As Ethan Mollick, a professor of management at Wharton, pointed out, most AI benchmarks don't provide a clear understanding of how well an AI system performs in real-world scenarios. This lack of context makes it difficult to assess the true capabilities of an AI system.
Crowdsourced benchmarks such as Chatbot Arena, which lets users rate AI performance on various tasks, have flaws of their own. The ratings are often subjective, reflecting personal preference rather than objective metrics, so the results can be biased and may not accurately reflect an AI system's capabilities.
Despite their limitations, weird AI benchmarks such as the Will Smith spaghetti video or AI-designed Minecraft structures have a certain appeal. They are easy to understand, entertaining, and offer a glimpse of what AI systems can do. As the AI community continues to grapple with distilling complex technology into digestible marketing, these unconventional benchmarks may become an essential part of the AI evaluation landscape.
As we look to 2025, it will be interesting to see which new, unusual AI benchmarks emerge and capture the imagination of the AI community. Will we see AI-generated videos of celebrities doing everyday tasks, or perhaps AI-designed art that rivals human creativity? One thing is certain: the AI community will continue to push the boundaries of what is possible, and weird AI benchmarks will be an integral part of that journey.