Gemini’s data-analyzing abilities aren’t as good as Google claims

In this photo illustration a Gemini logo and a welcome message on Gemini website are displayed on two screens.
Image Credits: Lorenzo Di Cola/NurPhoto / Getty Images

One of the selling points of Google’s flagship generative AI models, Gemini 1.5 Pro and 1.5 Flash, is the amount of data they can supposedly process and analyze. In press briefings and demos, Google has repeatedly claimed that the models can accomplish previously impossible tasks thanks to their “long context,” like summarizing multiple hundred-page documents or searching across scenes in film footage.

But new research suggests that the models aren’t, in fact, very good at those things.

Two separate studies investigated how well Google’s Gemini models and others make sense out of an enormous amount of data — think “War and Peace”-length works. Both find that Gemini 1.5 Pro and 1.5 Flash struggle to answer questions about large datasets correctly; in one series of document-based tests, the models gave the right answer only 40%-50% of the time.

“While models like Gemini 1.5 Pro can technically process long contexts, we have seen many cases indicating that the models don’t actually ‘understand’ the content,” Marzena Karpinska, a postdoc at UMass Amherst and a co-author on one of the studies, told TechCrunch.

Gemini’s context window is lacking

A model’s context, or context window, refers to input data (e.g., text) that the model considers before generating output (e.g., additional text). A simple question — “Who won the 2020 U.S. presidential election?” — can serve as context, as can a movie script, show or audio clip. And as context windows grow, so does the size of the documents being fit into them.

The newest versions of Gemini can take in upward of 2 million tokens as context. (“Tokens” are subdivided bits of raw data, like the syllables “fan,” “tas” and “tic” in the word “fantastic.”) That’s equivalent to roughly 1.4 million words, two hours of video or 22 hours of audio — the largest context of any commercially available model.
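To put those figures in perspective, here is a back-of-the-envelope sketch of the arithmetic. The words-per-token ratio is simply derived from the numbers above (2 million tokens to roughly 1.4 million words); it is an approximation, not how any real tokenizer counts, and the function names are illustrative.

```python
# Rough ratio implied by the figures above: 2 million tokens ~ 1.4 million words.
# This is an approximation for illustration, not a real tokenizer.
WORDS_PER_TOKEN = 1_400_000 / 2_000_000  # about 0.7 words per token


def estimate_tokens(word_count: int) -> int:
    """Estimate how many tokens a text of `word_count` words would use."""
    return round(word_count / WORDS_PER_TOKEN)


def fits_in_context(word_count: int, context_tokens: int) -> bool:
    """Return True if the estimated token count fits in the context window."""
    return estimate_tokens(word_count) <= context_tokens


# A 260,000-word book against a 1-million-token window.
print(estimate_tokens(260_000))             # 371429
print(fits_in_context(260_000, 1_000_000))  # True
```

By this estimate, a 260,000-word book takes up well under half of a 1-million-token window. Whether a model can reason over everything that fits is a separate question.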

In a briefing earlier this year, Google showed several pre-recorded demos meant to illustrate the potential of Gemini’s long-context capabilities. One had Gemini 1.5 Pro search the transcript of the Apollo 11 moon landing telecast — around 402 pages — for quotes containing jokes, and then find a scene in the telecast that looked similar to a pencil sketch.

VP of research at Google DeepMind Oriol Vinyals, who led the briefing, described the model as “magical.”

“[1.5 Pro] performs these sorts of reasoning tasks across every single page, every single word,” he said.

That might have been an exaggeration.

In one of the aforementioned studies benchmarking these capabilities, Karpinska, along with researchers from the Allen Institute for AI and Princeton, asked the models to evaluate true/false statements about fiction books written in English. The researchers chose recent works so that the models couldn’t “cheat” by relying on foreknowledge, and they peppered the statements with references to specific details and plot points that’d be impossible to comprehend without reading the books in their entirety.

Given a statement like “By using her skills as an Apoth, Nusis is able to reverse engineer the type of portal opened by the reagents key found in Rona’s wooden chest,” Gemini 1.5 Pro and 1.5 Flash — having ingested the relevant book — had to say whether the statement was true or false and explain their reasoning.

Image Credits: UMass Amherst

Tested on one book around 260,000 words (~520 pages) in length, 1.5 Pro answered the true/false statements correctly 46.7% of the time, while Flash answered correctly only 20% of the time. Averaged across all the benchmark results, both models scored only a bit higher than random chance in question-answering accuracy.

“We’ve noticed that the models have more difficulty verifying claims that require considering larger portions of the book, or even the entire book, compared to claims that can be solved by retrieving sentence-level evidence,” Karpinska said. “Qualitatively, we also observed that the models struggle with verifying claims about implicit information that is clear to a human reader but not explicitly stated in the text.”

The second of the two studies, co-authored by researchers at UC Santa Barbara, tested the ability of Gemini 1.5 Flash (but not 1.5 Pro) to “reason over” videos — that is, search through and answer questions about the content in them.

The co-authors created a dataset of images (e.g., a photo of a birthday cake) paired with questions for the model to answer about the objects depicted in the images (e.g., “What cartoon character is on this cake?”). To evaluate the models, they picked one of the images at random and inserted “distractor” images before and after it to create slideshow-like footage.
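The setup described above can be sketched in a few lines. This is a simplified illustration, not the researchers' code: the function name, the file names and the use of strings in place of image data are all assumptions.

```python
import random


def build_slideshow(target, distractors, length, seed=None):
    """Return (frames, target_index) with `target` surrounded by distractors.

    Hypothetical helper: the target image is placed at a random interior
    position so that at least one distractor comes before and after it.
    """
    if length < 3:
        raise ValueError("length must be at least 3")
    if len(distractors) < length - 1:
        raise ValueError("not enough distractor images")
    rng = random.Random(seed)
    frames = rng.sample(distractors, length - 1)  # distinct distractors, shuffled
    target_index = rng.randrange(1, length - 1)   # never the first or last frame
    frames.insert(target_index, target)
    return frames, target_index


# Strings stand in for image files here.
distractors = [f"distractor_{i}.png" for i in range(40)]
frames, idx = build_slideshow("birthday_cake.png", distractors, length=25, seed=0)
print(len(frames), frames[idx])  # 25 birthday_cake.png
```

The model is then asked the question paired with the target image, and has to find the relevant frame before it can answer.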

Flash didn’t perform all that well. In a test that had the model transcribe six handwritten digits from a “slideshow” of 25 images, Flash got around 50% of the transcriptions right. The accuracy dropped to around 30% with eight digits.

“On real question-answering tasks over images, it appears to be particularly hard for all the models we tested,” Michael Saxon, a PhD student at UC Santa Barbara and one of the study’s co-authors, told TechCrunch. “That small amount of reasoning — recognizing that a number is in a frame and reading it — might be what is breaking the model.”

Google is overpromising with Gemini

Neither of the studies has been peer-reviewed, nor do they probe the releases of Gemini 1.5 Pro and 1.5 Flash with 2-million-token contexts. (Both tested the 1-million-token context releases.) And Flash isn’t meant to be as capable as Pro in terms of performance; Google advertises it as a low-cost alternative.

Nevertheless, both add fuel to the fire that Google’s been overpromising — and under-delivering — with Gemini from the beginning. None of the models the researchers tested, including OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet, performed well. But Google’s the only model provider that’s given context window top billing in its advertisements.

“There’s nothing wrong with the simple claim, ‘Our model can take X number of tokens’ based on the objective technical details,” Saxon said. “But the question is, what useful thing can you do with it?”

Generative AI broadly speaking is coming under increased scrutiny as businesses (and investors) grow frustrated with the technology’s limitations.

In a pair of recent surveys from Boston Consulting Group, about half of the respondents — all C-suite executives — said that they don’t expect generative AI to bring about substantial productivity gains and that they’re worried about the potential for mistakes and data compromises arising from generative AI-powered tools. PitchBook recently reported that, for two consecutive quarters, generative AI dealmaking at the earliest stages has declined, plummeting 76% from its Q3 2023 peak.

Faced with meeting-summarizing chatbots that conjure up fictional details about people and AI search platforms that basically amount to plagiarism generators, customers are on the hunt for promising differentiators. Google — which has raced, at times clumsily, to catch up to its generative AI rivals — was desperate to make Gemini’s context one of those differentiators.

But the bet was premature, it seems.

“We haven’t settled on a way to really show that ‘reasoning’ or ‘understanding’ over long documents is taking place, and basically every group releasing these models is cobbling together their own ad hoc evals to make these claims,” Karpinska said. “Without the knowledge of how long context processing is implemented — and companies do not share these details — it is hard to say how realistic these claims are.”

Google didn’t respond to a request for comment.

Both Saxon and Karpinska believe the antidotes to hyped-up claims around generative AI are better benchmarks and, in the same vein, greater emphasis on third-party critique. Saxon notes that one of the more common tests for long context (liberally cited by Google in its marketing materials), “needle in the haystack,” only measures a model’s ability to retrieve particular info, like names and numbers, from datasets, not its ability to answer complex questions about that info.
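A toy version of such a test shows why it says little about reasoning. In the sketch below, every name is hypothetical and `toy_model` is a stand-in that does nothing but plain string matching; it is not any real model or API.

```python
def build_haystack(filler_sentences, needle, position):
    """Insert the `needle` sentence at index `position` among the filler."""
    sentences = list(filler_sentences)
    sentences.insert(position, needle)
    return " ".join(sentences)


def retrieval_passed(model_answer: str, expected: str) -> bool:
    """Pass if the expected detail appears in the answer (case-insensitive)."""
    return expected.lower() in model_answer.lower()


def toy_model(context: str, keyword: str) -> str:
    """Stand-in for a model: return the first sentence mentioning `keyword`."""
    for sentence in context.split(". "):
        if keyword.lower() in sentence.lower():
            return sentence
    return ""


filler = ["The sky was grey that morning."] * 1000
needle = "The secret code is 7421."
context = build_haystack(filler, needle, position=500)

answer = toy_model(context, "secret code")
print(retrieval_passed(answer, "7421"))  # True
```

Even this simple keyword search passes the check, which is the limitation Saxon points to: finding a detail is not the same as answering a complex question about it.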

“All scientists and most engineers using these models are essentially in agreement that our existing benchmark culture is broken,” Saxon said, “so it’s important that the public understands to take these giant reports containing numbers like ‘general intelligence across benchmarks’ with a massive grain of salt.”

Updated 7/3: A previous version of this article stated that Gemini 1.5 Pro and 1.5 Flash’s accuracy was below random chance on the task of reasoning over long text. In fact, their accuracy was above random chance. We’ve made the correction. Google PR also sent links to studies that suggest Gemini’s long-context performance is stronger than implied here: Extended Multi-Doc QA, Video MME, longer queries subset on LMSYS, Ruler.
