Google won't say how often Gemini 3.5 Flash hallucinates

As AI gets integrated into every facet of our lives, AI hallucinations remain a stubborn and intractable problem. Yet in the two-hour Google I/O keynote, where Google introduced a massive expansion of AI search and a new default model, Gemini 3.5 Flash, hallucinations didn’t warrant a mention.

Likewise, the Gemini 3.5 Flash system card contains no references to hallucinations. Sycophancy is also conspicuously absent. This is especially notable given that both Anthropic and OpenAI publicly report data on metrics such as how often their models hallucinate, encourage delusions, or act sycophantically.

So, as Google makes AI Mode and AI Overviews even more visible in Google Search, users may not realize just how likely an answer is to contain hallucinations and confident mistakes.

Google AI tools do sometimes include warnings such as “AI responses may include mistakes.” But there’s no disclosure to searchers that Gemini and AI Mode responses may only be accurate 68.8 to 83.8 percent of the time.

Those figures come from Google’s most recent published data on Gemini accuracy.

In response to Mashable’s questions, a Google spokesperson said that the company plans to publish more information about the newest models’ safety evaluations alongside the release of the rest of the Gemini 3.5 model series, which is expected in June.

[Image: A disclosure on AI Overviews stating that AI sometimes makes mistakes. Credit: Google]

How accurate are Gemini, AI Mode, and AI Overviews? They’re at the top of a failing class.

Google doesn’t report the honesty, sycophancy, or hallucination rates of its latest models. However, in December, it published a study of their accuracy based on the FACTS Grounding test, a benchmark created by Google DeepMind to measure accuracy.

FACTS “comprehensively evaluates the ability of language models to generate factually accurate text,” and Gemini 3 Pro and Gemini 2.5 Pro top this benchmark.

Google reports that Gemini 3 Pro has an overall accuracy score of 68.8 percent. In many classrooms, this would be a hard “F” grade, though it’s considered a high score for an AI model.

On the FACTS Search benchmark, which measures a model’s skill at “generating factual responses by interacting with a search tool,” Gemini 3 Pro scores 83.8 percent.

[Image: Table showing AI models scored on the FACTS Grounding benchmark. Credit: Google]

The FACTS Search benchmark also measures models’ “hedging rate,” or how often they decline to answer a question, which is the desired outcome when an answer is unknown. Gemini 3 Pro has a significantly lower “hedging rate” than GPT-5, Claude 4.5 Opus, Claude 4.5 Sonnet, and even its predecessor Gemini 2.5 Pro.

What does Google say about AI hallucinations?

A single reference to hallucinations does appear in the Gemini 3 Pro system card published on Nov. 18, 2025. “Known Limitations: Gemini 3 Pro may exhibit some of the general limitations of foundation models, such as hallucinations. There may also be occasional slowness or timeout issues.”

This boilerplate language is similar to what’s included in the Gemini 2 series system cards, which acknowledge additional problems. “Gemini 2.0 Flash may exhibit some of the general limitations of foundation models, such as hallucinations, and limitations around causal understanding, complex logical deduction, and counterfactual reasoning.” (Emphasis added.)

Hallucinations are actually a feature, not a bug, of the way large language models work. They’re probabilistic algorithms predicting the next token in a sequence. By definition, they’re predicting, not “knowing” or “reporting.”


“Hallucinations can only be reduced and never eliminated,” Niranjan Krishnan, Head of AI Solutions, FPT Software, told Mashable. “Large language models are penalized if they sound uncertain or tentative. They don’t know what’s true, but know how to sound true. That bias drives confident errors. Models don’t know their limitations and do not know when to stop.”

Krishnan added, “Trying to eliminate hallucinations is the wrong goal. The ultimate challenge is building systems that know when to say, ‘I don’t know.’”


So, why doesn’t Google report hallucination or sycophancy rates like its chief rivals?

Gary Marcus, scientist, author, and the AI Cassandra of Silicon Valley, told Mashable that “One could guess that their performance there wasn’t groundbreaking or we would have likely heard about it.” He added, “Some candor about these things, as with nutrition labels, would certainly be a good thing.”

By ignoring AI hallucinations, Google is depriving users of information they could use to evaluate AI output.

Mashable reached out to Google to ask about the lack of hallucination data in the Gemini system cards. In response, a Google spokesperson said, “We take a rigorous approach to defining and measuring persona attributes like helpfulness, tone, and sycophancy. Our goal is to train models to provide objective, direct responses that avoid flattery or simply mirroring a user’s views, while keeping the system highly steerable for developers.”

The spokesperson also said:

Improving model factuality and managing persona are ongoing, scientific efforts for us. While balancing a model’s creativity with factual accuracy remains an industry-wide challenge, hallucination rates have steadily fallen as core model capabilities advance…To continuously guard against incorrect outputs, we invest heavily in robust safety policies, pioneering automated quality-check systems like FunSearch, and open-source evaluation benchmarks like FACTS Grounding to track and improve factual accuracy over time.

Why does this matter?

Billions of people rely on Google to find information on everything from random celebrity trivia to life-altering medical diagnoses. And Google has long said it looks for experience, expertise, authoritativeness, and trustworthiness (or E-E-A-T in Google jargon) for “Your Money or Your Life” (YMYL) topics.

These YMYL topics include anything “that could significantly impact the health, financial stability, or safety of people, or the welfare or well-being of society.” Now, users are learning about these topics directly in Google Search or the Gemini app, a tool that’s only accurate up to 83.8 percent of the time.

AI hallucinations are also poisoning our collective body of knowledge. Fortune recently reported on a study that found 4,000 AI-fabricated references in nearly 3,000 medical research papers. Likewise, lawyers around the world are being sanctioned for including hallucinated decisions in their briefs. One database tracking legal hallucinations includes 1,497 cases and counting.

Google’s AI transformation is also having an outsized impact on the publishers who produce the information that Gemini relies on.

As Google has shifted to AI search, traffic to news websites has fallen off a cliff, a phenomenon that’s been described as a “Traffic Apocalypse” and the “AI armageddon” for publishers.

Once upon a time, back when Google prided itself on its “Don’t be evil” ethos, the company defined success by how quickly users left Google. “We may be the only people in the world who can say our goal is to have people leave our website as quickly as possible.” Now, Google wants users to spend as much time as possible in its walled garden.

To be clear, all of the actual reporting — the interviews, the research, the photography, the videography, and the old-fashioned sleuthing — is still performed by human journalists. But instead of leaving Google to read about the Iran War in the New York Times, Gemini and AI Mode will brief you right on the search page.

In any other context, journalists call this plagiarism. And as Mashable has reported previously, AI chatbots like Gemini are particularly bad at parsing breaking news, which is when misinformation spreads quickly.

Klaudia Jaźwińska, a journalist and researcher for the Tow Center for Digital Journalism, told Mashable that Google should do more to inform users of its AI’s limitations.

“I think users are entitled to that information, especially considering the fact that if you’re using an AI chatbot, for example, like Claude or ChatGPT, you’re opting into that experience,” Jaźwińska said. “But when you’re on Google, not everyone opts into getting an AI Overview or engaging with AI Mode. They’re opening up a search engine that they’ve always used, and now the experience is different. And I think for that reason [Google] should be even more transparent about what it can and can’t do and what its limitations are.”

In the absence of regulation on AI safety and transparency, Google could commit to publishing data on Gemini’s hallucination, sycophancy, or honesty rates, as OpenAI and Anthropic do.

In the meantime, don’t forget what Google says in its AI terms of service: “Use discretion before relying on, publishing, or otherwise using content provided by the Services.”

Disclosure: Ziff Davis, Mashable’s parent company, in April 2025 filed a lawsuit against OpenAI, alleging that it infringed Ziff Davis copyrights in training and operating its AI systems.
