Large language models (LLMs), the technology powering today's AI, can sift through vast amounts of data to complete a variety of tasks, yet they can still produce inaccurate information and trigger some very unusual responses.
Those tasks include everything from summarizing text to generating creatives for video or still advertisements. I’ve seen some pretty outrageous examples.
Despite safeguards put in place by Google, Microsoft, Meta, OpenAI, and others, the technology can get it all wrong. The industry's word for getting it wrong is hallucination, and it has become a major challenge in generative AI (GAI).
Google researchers think they have found a way to reduce hallucination by anchoring LLMs in real-world statistical information.
The company this week announced DataGemma, what it calls the “first open models designed to connect LLMs with extensive real-world data drawn from Google's Data Commons repository.” The models ground LLMs in real-world statistical data.
Prem Ramaswami, head of Data Commons at Google, explains that as GAI adoption increases, “we’re aiming to ground those experiences by integrating Data Commons within Gemma,” Google’s family of lightweight open models built from the same research and technology used to create the Gemini models. These DataGemma models are available to researchers and developers today.
Described as a publicly available knowledge graph, Data Commons contains more than 240 billion rich data points across hundreds of thousands of statistical variables. The topics range from health and economics to demographics and the environment, because the data is drawn from trusted public sources.
Those sources include the United Nations (UN), the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), and national census bureaus. So as long as those sources publish reliable data, the grounding holds.
Users can ask questions in plain language through Google's AI-powered natural language interface. For example, someone can explore which countries in Africa have seen the greatest increase in electricity access, or how income correlates with diabetes across U.S. counties.
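For developers who would rather skip the chat box, Data Commons also exposes a programmatic API. What follows is a minimal sketch using the public datacommons Python client and its get_stat_value and get_stat_series helpers; the variable and place identifiers are real Data Commons conventions, but treat the exact calls as illustrative rather than a DataGemma recipe.

```python
# Minimal sketch of querying Data Commons directly with its public
# Python client (pip install datacommons). The natural-language
# interface described above wraps lookups like these.
import datacommons as dc

# "Count_Person" is a Data Commons statistical variable;
# "geoId/06085" is the DCID for Santa Clara County, California.
population = dc.get_stat_value("geoId/06085", "Count_Person")
print(f"Santa Clara County population: {population}")

# A time series for the same variable, the kind of data behind
# "greatest increase" questions such as electricity access trends.
series = dc.get_stat_series("geoId/06085", "Count_Person")
for year, value in sorted(series.items()):
    print(year, value)
```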
DataGemma expands the capabilities of the Gemma models by drawing on Data Commons knowledge to improve LLM factuality and reasoning, using two approaches. The first, Retrieval-Interleaved Generation (RIG), enhances Google's language model by having it query trusted sources as it generates. The second, Retrieval-Augmented Generation (RAG), lets the model incorporate relevant information beyond its training data, absorb more context, and produce more comprehensive, informative output.
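To make the distinction concrete, here is a hypothetical sketch of how the two patterns differ in code. This is not the DataGemma API: generate and query_data_commons are invented stand-ins for a Gemma call and a Data Commons lookup, and the [DC: ...] marker syntax is made up for this example (the research paper describes the model's actual fine-tuned query format).

```python
import re

def generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; returns a canned draft
    containing a retrieval marker, purely for illustration."""
    return ("Roughly [DC: share of Kenyan households with electricity "
            "access] of households in Kenya have electricity access.")

def query_data_commons(query: str) -> str:
    """Hypothetical stand-in for a Data Commons statistical lookup."""
    return "76.5% (illustrative value)"

# RIG: retrieval is interleaved with generation. The model emits a
# structured query where it would otherwise guess a number, and the
# runtime replaces each marker with the retrieved statistic.
def rig_answer(question: str) -> str:
    draft = generate(f"Answer, using [DC: <query>] markers for "
                     f"statistics: {question}")
    return re.sub(r"\[DC: (.*?)\]",
                  lambda m: query_data_commons(m.group(1)),
                  draft)

# RAG: retrieval happens first. Relevant Data Commons figures are
# fetched up front and placed in the prompt, so the model reasons
# over real data instead of its parametric memory.
def rag_answer(question: str) -> str:
    context = query_data_commons(question)
    return generate(f"Using only this data:\n{context}\n\n"
                    f"Answer: {question}")

print(rig_answer("What share of Kenyan households have electricity?"))
```

The practical trade-off: RIG corrects individual statistics inline with little extra prompt length, while RAG can pull in whole tables of context before generation at the cost of a much longer prompt.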
Findings using RIG and RAG are early but encouraging, Ramaswami wrote in a blog post. A research paper explains more for those who want to dive in.