The increasing power of the latest artificial intelligence systems is stretching traditional evaluation methods to the breaking point, posing a challenge to businesses and public bodies over how best to work with the fast-evolving technology.
Flaws in the evaluation criteria commonly used to gauge performance, accuracy, and safety are being exposed as more models come to market, according to people who build, test, and invest in AI tools. The traditional benchmarks are easy to manipulate and too narrow for the complexity of the latest models, they said.
The accelerating technology race sparked by the 2022 release of OpenAI’s chatbot ChatGPT and fed by tens of billions of dollars from venture capitalists and big tech companies, such as Microsoft, Google, and Amazon, has obliterated many older yardsticks for assessing AI’s progress.
“A public benchmark has a lifespan,” said Aidan Gomez, founder and chief executive of AI start-up Cohere. “It’s useful until people have optimized [their models] to it or gamed it. That used to take a couple of years; now it’s a couple of months.”
Google, Anthropic, Cohere, and Mistral have each released AI models over the past two months as they seek to unseat Microsoft-backed OpenAI from the top of public rankings of large language models (LLMs), which underpin systems such as ChatGPT.
New AI systems routinely emerge that can “completely ace” existing benchmarks, Gomez said. “As models get better, the capabilities make these evaluations obsolete,” he said.
The problem of how to assess LLMs has shifted from academia to the boardroom, as generative AI has become the top investment priority of 70 percent of chief executives, according to a KPMG survey of more than 1,300 global CEOs.