
Google’s AI Overview is flawed by design, and a new company blog post hints at why

Google: "There are bound to be some oddities and errors" in system that told people to eat rocks.

Benj Edwards
The Google "G" logo surrounded by whimsical characters, all of which look stunned and surprised. Credit: Google

On Thursday, Google capped off a rough week of providing inaccurate and sometimes dangerous answers through its experimental AI Overview feature by publishing a follow-up blog post titled "AI Overviews: About last week." In the post, attributed to Liz Reid, Google's VP and head of Search, the company formally acknowledged issues with the feature and outlined steps taken to improve a system that appears flawed by design, even if Google doesn't seem to realize it is admitting as much.

To recap, the AI Overview feature—which the company showed off at Google I/O a few weeks ago—aims to provide search users with summarized answers to questions by using an AI model integrated with Google's web ranking systems. Right now, it's an experimental feature that is not active for everyone, but when a participating user searches for a topic, they might see an AI-generated answer at the top of the results, pulled from highly ranked web content and summarized by an AI model.
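
Google hasn't published the internals of AI Overviews, but the behavior it describes maps onto a familiar retrieve-then-summarize pattern. The sketch below is a toy illustration of that pattern, not Google's code: retrieve_top_results, summarize, and the web_index data are all invented for this example, and the "ranking" is a naive keyword overlap. What it demonstrates is the point that matters for this story: the summary is only as good as whatever the ranking puts on top.

def retrieve_top_results(query, index, k=3):
    """Toy ranking: score each page by naive keyword overlap with the query."""
    terms = set(query.lower().split())
    ranked = sorted(
        index,
        key=lambda page: len(terms & set(page["text"].lower().split())),
        reverse=True,
    )
    return ranked[:k]

def summarize(query, pages):
    """Stand-in for the language model: stitch together snippets from the
    retrieved pages. A real system would prompt an LLM with these snippets
    as grounding material."""
    snippets = " ".join(page["text"] for page in pages)
    return f"Answer to '{query}' (from {len(pages)} top result(s)): {snippets}"

# Tiny fake "web index." If the top-ranked page happens to be satire or
# SEO spam, the summary faithfully repeats it, which is the flaw the
# article describes.
web_index = [
    {"url": "https://example.com/satire", "text": "You should eat at least one small rock each day according to geologists"},
    {"url": "https://example.com/nutrition", "text": "Rocks are not food and should never be eaten"},
]

query = "how many rocks should I eat each day"
top_pages = retrieve_top_results(query, web_index, k=1)
print(summarize(query, top_pages))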

While Google claims this approach is "highly effective" and on par with its Featured Snippets in terms of accuracy, the past week has seen numerous examples of the AI system generating bizarre, incorrect, or even potentially harmful responses, as we detailed in a recent feature where Ars reporter Kyle Orland replicated many of the unusual outputs.

Drawing inaccurate conclusions from the web

On Wednesday morning, Google's AI Overview was erroneously telling us the Sony PlayStation and Sega Saturn were available in 1993. Credit: Kyle Orland / Google

Given the circulating AI Overview examples, Google almost apologizes in the post, saying, "We hold ourselves to a high standard, as do our users, so we expect and appreciate the feedback, and take it seriously." But Reid, in an attempt to justify the errors, then goes into some very revealing detail about why AI Overviews provide erroneous information:

AI Overviews work very differently than chatbots and other LLM products that people may have tried out. They’re not simply generating an output based on training data. While AI Overviews are powered by a customized language model, the model is integrated with our core web ranking systems and designed to carry out traditional “search” tasks, like identifying relevant, high-quality results from our index. That’s why AI Overviews don’t just provide text output, but include relevant links so people can explore further. Because accuracy is paramount in Search, AI Overviews are built to only show information that is backed up by top web results.

This means that AI Overviews generally don't “hallucinate” or make things up in the ways that other LLM products might.

Here we see the fundamental flaw of the system: "AI Overviews are built to only show information that is backed up by top web results." The design is based on the false assumption that Google's page-ranking algorithm favors accurate results and not SEO-gamed garbage. Google Search has been broken for some time, and now the company is relying on those gamed and spam-filled results to feed its new AI model.

Even if the AI model draws from a more accurate source, as with the 1993 game console search seen above, Google's AI language model can still draw inaccurate conclusions from that "accurate" data, confabulating erroneous information in a flawed summary.

Generally ignoring the folly of basing its AI results on a broken page-ranking algorithm, Google's blog post instead attributes the commonly circulated errors to several other factors, including users making nonsensical searches "aimed at producing erroneous results." Google does admit faults with the AI model, like misinterpreting queries, misinterpreting "a nuance of language on the web," and lacking sufficient high-quality information on certain topics. It also suggests that some of the more egregious examples circulating on social media are fake screenshots.

"Some of these faked results have been obvious and silly," Reid writes. "Others have implied that we returned dangerous results for topics like leaving dogs in cars, smoking while pregnant, and depression. Those AI Overviews never appeared. So we’d encourage anyone encountering these screenshots to do a search themselves to check."

(No doubt some of the social media examples are fake, but it's worth noting that any attempts to replicate those early examples now will likely fail because Google will have manually blocked the results. And it is potentially a testament to how broken Google Search is if people believed extreme fake examples in the first place.)

While addressing the "nonsensical searches" angle in the post, Reid uses the example search, "How many rocks should I eat each day," which went viral in a tweet on May 23. Reid says, "Prior to these screenshots going viral, practically no one asked Google that question." And since there isn't much data on the web that answers it, she says there is a "data void" or "information gap" that was filled by satirical content found on the web, and the AI model found it and pushed it as an answer, much like Featured Snippets might. So basically, it was working exactly as designed.

A screenshot of an AI Overview response to the query "How many rocks should I eat each day" that went viral on X last week. Credit: Tim Onion / X

As a result of the bad publicity, Google claims to have made more than a dozen technical improvements to the AI Overview system. These include "better detection of nonsensical queries," limiting the use of user-generated content for potentially misleading advice, additional restrictions for sensitive topics like news and health, and manually squelching the model on certain topics known to produce erroneous results (i.e., filters triggered by keywords).
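
The post doesn't explain how those topic-level triggers are implemented, but conceptually a keyword guardrail can be as blunt as the toy check below. Everything here, including should_show_ai_overview, BLOCKED_KEYWORDS, and SENSITIVE_TOPICS, is invented for illustration; it simply shows how a system might suppress an AI answer for known-bad or sensitive queries and fall back to ordinary results.

# A hypothetical sketch of a keyword-triggered guardrail; the lists and
# logic are invented for illustration and are not Google's implementation.
BLOCKED_KEYWORDS = {"rocks"}            # topics known to produce bad answers
SENSITIVE_TOPICS = {"health", "news"}   # areas getting stricter handling

def should_show_ai_overview(query):
    """Return False when the query should fall back to ordinary results."""
    words = set(query.lower().split())
    if words & BLOCKED_KEYWORDS:
        return False  # manually squelched topic
    if words & SENSITIVE_TOPICS:
        return False  # extra restrictions for sensitive areas
    return True

print(should_show_ai_overview("how many rocks should I eat each day"))  # False
print(should_show_ai_overview("when was the atari jaguar released"))    # True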

Perhaps unsurprisingly, the company is forgiving itself for its failures so far. "At the scale of the web, with billions of queries coming in every day, there are bound to be some oddities and errors. We’ve learned a lot over the past 25 years about how to build and maintain a high-quality search experience, including how to learn from these errors to make Search better for everyone."

Even if you allow for some errors in experimental software rolled out to millions of people, there's a problem with implied authority in the erroneous AI Overview results. The fact remains that the technology does not inherently provide factual accuracy but reflects the inaccuracy of websites found in Google's page ranking with an authority that can mislead people. You'd think tech companies would be striving to build customer trust, but now they are building AI tools and telling us not to trust the results because they may be wrong. Maybe that's because we are not actually the customers, but the product.

Perhaps Google can work around these issues before a wider rollout of the feature, but for now, it appears that AI Overview will likely continue to occasionally output unusual or untrustworthy results while the company's AI search team puts out fires as it sees them.

Listing image: Google

Benj Edwards Senior AI Reporter
Benj Edwards is Ars Technica's Senior AI Reporter and founded the site's dedicated AI beat in 2022. He's also a widely cited tech historian. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.
Staff Picks
leonwid
Usually, when querying Google, Section 230 kicks in on the actual content, making the party hosting the content not liable. How does that work when Google generates a summary? Is it creating content, or is it more like a translation from a liability point of view?
mcswell
I think there's another reason it comes out with these hallucinations: it's not good at understanding what it's reading--indeed, it probably doesn't understand at all. But it's extremely good at writing decent English prose, which makes it look as if it's understanding.

Where does that prose come from, if it's not understanding what it's writing about? It's doing text summarization, which in this case seems to be driven by two things: (1) the words in the texts it's summarizing, and (2) the statistical(-like) models of what words typically follow what other words in English. The result is that it pulls some words out of the texts it's summarizing, and then glues them together in a way that makes good sensible English. The problem is that it misses a lot when it summarizes.

The screenshot of the video console summary illustrates this (although of course it's impossible to know whether these were the exact input that generated the output). It picked up the fact that the Atari Jaguar was released in 1993, but when it generated the summary, it didn't understand that the other two consoles came later.
s73v3r
"There are bound to be some oddities and errors"

I'm really not seeing why this should be considered an acceptable thing. Why should we have to accept things that are flat out broken, especially when Google had something that worked, and worked very well, not 10 years ago? Their old, normal search might have had this come up, but it would be presented in context, and people would realize it's a silly goof. And it would have been like the eighth result on the page, below real, actual results.

Quite frankly, I'm really sick and tired of MBAs and finance assholes ruining everything that's good.
qazwart
My son works for a large corporation, and he is responsible for figuring out their AI strategy, because of course he is. He told me to try this query:

A man is walking with a goat and they come to a river with a small boat on their side. How can they get across the river?

Most people would look at you quizzically and ask, “Can they just take the boat?” Their only confusion is that the answer seems so obvious.

Here’s ChatGPT’s answer:

  • Take the goat across the river first.
  • Return empty-handed to the original side.
  • Now, take the cabbage across next.
  • Return with the goat to the original side.
  • Finally, take the goat back across.

I mentioned a goat, a man, and crossing a river. The AI gets my question confused with the goat, cabbage, wolf riddle, so it starts using that to autocorrect an answer.

My son has figured out dozens of these types of examples. He also says the AIs have no concept of knowledge or truth. They will blindly generate a wrong answer just as quickly as a correct answer.

There might be a place for AIs, but he believes LLMs are far from the answer. Google would be better off using an AI with pattern recognition to filter SEO-optimized crap out of its search results than giving people answers that Google itself has no idea are any good.
C.M. Allen
"Look, your vehicle is bound to randomly explode or veer into oncoming traffic for no reason. These things just happen when developing any new technology. There are bound to be some errors or oddities. That's no reason to get worked up." -- Google running Ford Motors