
The AI industry is obsessed with Chatbot Arena, but it might not be the best benchmark

Human raters bring their biases


A chatbot looking out of a smartphone with various conversation bubbles around the phone.
Image Credits: Malorny / Getty Images

Over the past few months, tech execs like Elon Musk have touted the performance of their company’s AI models on a particular benchmark: Chatbot Arena.

Maintained by a nonprofit known as LMSYS, Chatbot Arena has become something of an industry obsession. Posts about updates to its model leaderboards garner hundreds of views and reshares across Reddit and X, and the official LMSYS X account has over 54,000 followers. Millions of people have visited the organization’s website in the last year alone.

Still, there are some lingering questions about Chatbot Arena’s ability to tell us how “good” these models really are.

In search of a new benchmark

Before we dive in, let’s take a moment to understand what LMSYS is exactly, and how it became so popular.

The nonprofit only launched last April as a project spearheaded by students and faculty at Carnegie Mellon, UC Berkeley’s SkyLab and UC San Diego. Some of the founding members now work at Google DeepMind, Musk’s xAI and Nvidia; today, LMSYS is primarily run by SkyLab-affiliated researchers.

LMSYS didn’t set out to create a viral model leaderboard. The group’s founding mission was making models (specifically generative models à la OpenAI’s ChatGPT) more accessible by co-developing and open sourcing them. But shortly after LMSYS’ founding, its researchers, dissatisfied with the state of AI benchmarking, saw value in creating a testing tool of their own.

“Current benchmarks fail to adequately address the needs of state-of-the-art [models], particularly in evaluating user preferences,” the researchers wrote in a technical paper published in March. “Thus, there is an urgent necessity for an open, live evaluation platform based on human preference that can more accurately mirror real-world usage.”

Indeed, as we’ve written before, the most commonly used benchmarks today do a poor job of capturing how the average person interacts with models. Many of the skills the benchmarks probe for — solving PhD-level math problems, for example — will rarely be relevant to the majority of people using, say, Claude.

LMSYS’ creators felt similarly, and so they devised an alternative: Chatbot Arena, a crowdsourced benchmark designed to capture the “nuanced” aspects of models and their performance on open-ended, real-world tasks.

The Chatbot Arena rankings as of early September 2024.
Image Credits: LMSYS

Chatbot Arena lets anyone on the web ask a question (or questions) of two randomly selected, anonymous models. Once a person agrees to the ToS allowing their data to be used for LMSYS’ future research, models and related projects, they can vote for their preferred answer from the two dueling models (they can also declare a tie or say “both are bad”), at which point the models’ identities are revealed.

The Chatbot Arena interface.
Image Credits: LMSYS

This flow yields a “diverse array” of questions a typical user might ask of any generative model, the researchers wrote in the March paper. “Armed with this data, we employ a suite of powerful statistical techniques […] to estimate the ranking over models as reliably and sample-efficiently as possible,” they explained.
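For readers curious how a stream of pairwise votes becomes a leaderboard, an Elo- or Bradley-Terry-style model is the standard way to turn this kind of preference data into a ranking. The Python sketch below shows a minimal online Elo update over hypothetical (model A, model B, winner) votes; the model names, K-factor and vote format are illustrative only, and the real LMSYS pipeline uses more sophisticated statistics than this.

```python
from collections import defaultdict

K = 32          # update step size (illustrative choice)
BASE = 1000.0   # starting rating for every model

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under a logistic (Elo) model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def rank_from_votes(votes):
    """votes: iterable of (model_a, model_b, winner), winner in {'a', 'b', 'tie'}."""
    ratings = defaultdict(lambda: BASE)
    for model_a, model_b, winner in votes:
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        exp_a = expected_score(ratings[model_a], ratings[model_b])
        ratings[model_a] += K * (score_a - exp_a)
        ratings[model_b] += K * ((1.0 - score_a) - (1.0 - exp_a))
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical votes, not real Chatbot Arena data:
sample_votes = [
    ("model-x", "model-y", "a"),
    ("model-y", "model-z", "tie"),
    ("model-x", "model-z", "a"),
]
print(rank_from_votes(sample_votes))
```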

Since Chatbot Arena’s launch, LMSYS has added dozens of open models to its testing tool, and partnered with universities like Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), as well as companies including OpenAI, Google, Anthropic, Microsoft, Meta, Mistral and Hugging Face to make their models available for testing. Chatbot Arena now features more than 100 models, including multimodal models (models that can understand data beyond just text) like OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet.

More than a million prompts and answer pairs have been submitted and evaluated this way, producing a huge body of ranking data.

Bias, and lack of transparency

In the March paper, LMSYS’ founders claim that Chatbot Arena’s user-contributed questions are “sufficiently diverse” to benchmark for a range of AI use cases. “Because of its unique value and openness, Chatbot Arena has emerged as one of the most referenced model leaderboards,” they write.

But how informative are the results, really? That’s up for debate.

Yuchen Lin, a research scientist at the nonprofit Allen Institute for AI, says that LMSYS hasn’t been completely transparent about the model capabilities, knowledge and skills it’s assessing on Chatbot Arena. In March, LMSYS released a dataset, LMSYS-Chat-1M, containing a million conversations between users and 25 models on Chatbot Arena. But it hasn’t refreshed the dataset since.

“The evaluation is not reproducible, and the limited data released by LMSYS makes it challenging to study the limitations of models in depth,” Lin said.

Comparing two models using Chatbot Arena’s tool.
Image Credits: LMSYS

To the extent that LMSYS has detailed its testing approach, its researchers said in the March paper that they leverage “efficient sampling algorithms” to pit models against each other “in a way that accelerates the convergence of rankings while retaining statistical validity.” They wrote that LMSYS collects roughly 8,000 votes per model before it refreshes the Chatbot Arena rankings, and that threshold is usually reached after several days.
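LMSYS doesn’t spell out its sampler in that paper passage, but one plausible reading of “accelerates the convergence of rankings” is to preferentially match models whose current ratings are close, or that have few recorded battles, since those comparisons carry the most information. The sketch below illustrates that general idea only; the weighting scheme and data structures are assumptions, not LMSYS’s actual algorithm.

```python
import random
from itertools import combinations

def pick_pair(ratings, battle_counts, exploration_weight=50.0):
    """Choose the next pair of models to show a user.

    ratings: dict of model -> current rating
    battle_counts: dict of (model_a, model_b) -> votes collected so far
    Pairs with similar ratings and few battles get more weight, since
    those match-ups are the most informative about the ordering.
    """
    pairs = list(combinations(sorted(ratings), 2))
    weights = []
    for a, b in pairs:
        closeness = 1.0 / (1.0 + abs(ratings[a] - ratings[b]))               # close ratings -> informative
        novelty = exploration_weight / (1.0 + battle_counts.get((a, b), 0))  # under-sampled -> informative
        weights.append(closeness + novelty)
    return random.choices(pairs, weights=weights, k=1)[0]
```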

But Lin feels the voting doesn’t account for people’s ability — or inability — to spot hallucinations from models, or for differences in their preferences, which makes their votes unreliable. For example, some users might like longer, markdown-styled answers, while others may prefer more succinct responses.

The upshot is that two users can cast opposite votes on the same pair of answers and both be equally valid, which calls the value of the whole approach into question. Only recently has LMSYS experimented with controlling for the “style” and “substance” of models’ responses in Chatbot Arena.

“The human preference data collected does not account for these subtle biases, and the platform does not differentiate between ‘A is significantly better than B’ and ‘A is only slightly better than B,’” Lin said. “While post-processing can mitigate some of these biases, the raw human preference data remains noisy.”

Mike Cook, a research fellow at Queen Mary University of London specializing in AI and game design, agreed with Lin’s assessment. “You could’ve run Chatbot Arena back in 1998 and still talked about dramatic ranking shifts or big powerhouse chatbots, but they’d be terrible,” he added, noting that while Chatbot Arena is framed as an empirical test, it amounts to a relative rating of models.

The more problematic bias hanging over Chatbot Arena’s head is the current makeup of its user base.

Because the benchmark became popular almost entirely through word of mouth in AI and tech industry circles, it’s unlikely to have attracted a very representative crowd, Lin says. Lending credence to his theory, the top questions in the LMSYS-Chat-1M dataset pertain to programming, AI tools, software bugs and fixes, and app design — not the sorts of things you’d expect non-technical people to ask about.

“The distribution of testing data may not accurately reflect the target market’s real human users,” Lin said. “Moreover, the platform’s evaluation process is largely uncontrollable, relying primarily on post-processing to label each query with various tags, which are then used to develop task-specific ratings. This approach lacks systematic rigor, making it challenging to evaluate complex reasoning questions solely based on human preference.”
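To make the post-processing point concrete: once each battle is labeled with topic tags after the fact, a “task-specific rating” is typically just the same preference data sliced by tag. The sketch below computes per-tag win rates from hypothetical labeled battles; the field names and tie handling are assumptions for illustration, not LMSYS’s pipeline.

```python
from collections import defaultdict

def win_rates_by_tag(battles):
    """battles: iterable of dicts with hypothetical fields
    {"model_a", "model_b", "winner" ('a'/'b'/'tie'), "tags": [...]}.
    Returns tag -> model -> win rate: the same votes, sliced by topic."""
    wins = defaultdict(lambda: defaultdict(float))
    games = defaultdict(lambda: defaultdict(int))
    for battle in battles:
        for tag in battle["tags"]:
            for side, model in (("a", battle["model_a"]), ("b", battle["model_b"])):
                games[tag][model] += 1
                if battle["winner"] == side:
                    wins[tag][model] += 1.0    # outright win
                elif battle["winner"] == "tie":
                    wins[tag][model] += 0.5    # ties count as half a win
    return {
        tag: {model: wins[tag][model] / games[tag][model] for model in games[tag]}
        for tag in games
    }
```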

Testing multimodal models in Chatbot Arena.
Image Credits: LMSYS

Cook pointed out that because Chatbot Arena users are self-selecting — they’re interested in testing models in the first place — they may be less keen to stress-test or push models to their limits.

“It’s not a good way to run a study in general,” Cook said. “Evaluators ask a question and vote on which model is ‘better’ — but ‘better’ is not really defined by LMSYS anywhere. Getting really good at this benchmark might make people think a winning AI chatbot is more human, more accurate, more safe, more trustworthy and so on — but it doesn’t really mean any of those things.”

LMSYS is trying to balance out these biases by using automated systems — MT-Bench and Arena-Hard-Auto — that use models themselves (OpenAI’s GPT-4 and GPT-4 Turbo) to rank the quality of responses from other models. (LMSYS publishes these rankings alongside the votes.) But while LMSYS asserts that these automated rankings “match both controlled and crowdsourced human preferences well,” the matter’s far from settled.
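For a sense of what “models ranking other models” looks like in practice, here is a minimal LLM-as-judge sketch using the OpenAI Python client. The judge prompt, model name and output format are placeholders chosen for illustration; the actual MT-Bench and Arena-Hard-Auto pipelines are more elaborate and are published by LMSYS.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative judge prompt -- not the MT-Bench or Arena-Hard-Auto prompt.
JUDGE_PROMPT = """You are judging two assistant answers to the same question.

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Reply with exactly one of: A, B, or TIE."""

def judge(question: str, answer_a: str, answer_b: str, judge_model: str = "gpt-4-turbo") -> str:
    """Ask a stronger model which of two answers it prefers (LLM-as-judge)."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```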

Commercial ties and data sharing

LMSYS’ growing commercial ties are another reason to take the rankings with a grain of salt, Lin says.

Some vendors like OpenAI, which serve their models through APIs, have access to model usage data, which they could use to essentially “teach to the test” if they wished. This makes the testing process potentially unfair for the open, static models running on LMSYS’ own cloud, Lin said.

“Companies can continually optimize their models to better align with the LMSYS user distribution, possibly leading to unfair competition and a less meaningful evaluation,” he added. “Commercial models connected via APIs can access all user input data, giving companies with more traffic an advantage.”

Cook added, “Instead of encouraging novel AI research or anything like that, what LMSYS is doing is encouraging developers to tweak tiny details to eke out an advantage in phrasing over their competition.”

LMSYS is also sponsored in part by organizations with horses in the AI race, one of which is a VC firm.

LMSYS’ corporate sponsorships.
Image Credits: LMSYS

Google’s Kaggle data science platform has donated money to LMSYS, as have Andreessen Horowitz (whose investments include Mistral) and Together AI. Google’s Gemini models are on Chatbot Arena, as are Mistral’s and Together’s.

LMSYS states on its website that it also relies on university grants and donations to support its infrastructure, and that none of its sponsorships — which come in the form of hardware and cloud compute credits, in addition to cash — have “strings attached.” But the relationships give the impression that LMSYS isn’t completely impartial, particularly as vendors increasingly use Chatbot Arena to drum up anticipation for their models.

LMSYS didn’t respond to TechCrunch’s request for an interview.

A better benchmark?

Lin thinks that, despite their flaws, LMSYS and Chatbot Arena provide a valuable service: giving real-time insights into how different models perform outside the lab.

“Chatbot Arena surpasses the traditional approach of optimizing for multiple-choice benchmarks, which are often saturated and not directly applicable to real-world scenarios,” Lin said. “The benchmark provides a unified platform where real users can interact with multiple models, offering a more dynamic and realistic evaluation.”

But — as LMSYS continues to add features to Chatbot Arena, like more automated evaluations — Lin feels there’s low-hanging fruit the organization could tackle to improve testing.

To allow for a more “systematic” understanding of models’ strengths and weaknesses, he posits, LMSYS could design benchmarks around different subtopics, like linear algebra, each with a set of domain-specific tasks. That’d give the Chatbot Arena results much more scientific weight, he says.

“While Chatbot Arena can offer a snapshot of user experience — albeit from a small and potentially unrepresentative user base — it should not be considered the definitive standard for measuring a model’s intelligence,” Lin said. “Instead, it is more appropriately viewed as a tool for gauging user satisfaction rather than a scientific and objective measure of AI progress.”
