The increasing power of the latest artificial intelligence systems is stretching traditional evaluation methods to the breaking point, posing a challenge to businesses and public bodies over how best to work with the fast-evolving technology.
Flaws in the evaluation criteria commonly used to gauge performance, accuracy, and safety are being exposed as more models come to market, according to people who build, test, and invest in AI tools. The traditional benchmarks are easy to manipulate and too narrow for the complexity of the latest models, they said.
The accelerating technology race sparked by the 2022 release of OpenAI’s chatbot ChatGPT and fed by tens of billions of dollars from venture capitalists and big tech companies, such as Microsoft, Google, and Amazon, has obliterated many older yardsticks for assessing AI’s progress.
“A public benchmark has a lifespan,” said Aidan Gomez, founder and chief executive of AI start-up Cohere. “It’s useful until people have optimized [their models] to it or gamed it. That used to take a couple of years; now it’s a couple of months.”
Google, Anthropic, Cohere, and Mistral have each released AI models over the past two months as they seek to unseat Microsoft-backed OpenAI from the top of public rankings of large language models (LLMs), which underpin systems such as ChatGPT.
New AI systems routinely emerge that can “completely ace” existing benchmarks, Gomez said. “As models get better, the capabilities make these evaluations obsolete,” he said.
The problem of how to assess LLMs has shifted from academia to the boardroom, as generative AI has become the top investment priority of 70 percent of chief executives, according to a KPMG survey of more than 1,300 global CEOs.