AI chatbots oversimplify scientific studies and gloss over critical details — the newest models are especially guilty

Large language models (LLMs) are becoming less "intelligent" in each new version as they oversimplify and, in some cases, misrepresent important scientific and medical findings, a new study has found.

In an analysis of 4,900 summaries of research papers, scientists found that versions of ChatGPT, Llama and DeepSeek were nearly five times more likely than human experts to oversimplify scientific findings.

When given a prompt for accuracy, chatbots were twice as likely to overgeneralize findings as when prompted for a simple summary. The testing also revealed an increase in overgeneralizations among newer chatbot versions compared with previous generations.

The researchers published their findings April 30 in the journal Royal Society Open Science.

"I think one of the biggest challenges is that generalization can seem benign, or even helpful, until you realize it's changed the meaning of the original research," study author Uwe Peters, a postdoctoral researcher at the University of Bonn in Germany, wrote in an email to Live Science. "What we add here is a systematic method for detecting when models generalize beyond what’s warranted in the original text."

Overgeneralization by an LLM works like a photocopier with a broken lens that makes each subsequent copy bigger and bolder than the original. LLMs filter information through a series of computational layers. Along the way, some information can be lost or change meaning in subtle ways. This is especially true with scientific studies, since scientists must frequently include qualifications, context and limitations in their research results. Providing a simple yet accurate summary of findings becomes quite difficult.

"Earlier LLMs were more likely to avoid answering difficult questions, whereas newer, larger, and more instructible models, instead of refusing to answer, often produced misleadingly authoritative yet flawed responses," the researchers wrote.

In one example from the study, DeepSeek produced a medical recommendation in one summary by changing the phrase "was safe and could be performed successfully" to "is a safe and effective treatment option."

Another test in the study showed Llama broadened the scope of effectiveness for a drug treating type 2 diabetes in young people by eliminating information about the dosage, frequency, and effects of the medication.

If published, this chatbot-generated summary could cause medical professionals to prescribe drugs outside of their effective parameters.

Unsafe treatment options

In the new study, researchers worked to answer three questions about 10 of the most popular LLMs (four versions of ChatGPT, three versions of Claude, two versions of Llama, and one of DeepSeek).

They wanted to see if, when presented with a human summary of an academic journal article and prompted to summarize it, the LLM would overgeneralize the summary and, if so, whether asking it for a more accurate answer would yield a better result. The team also aimed to find whether the LLMs would overgeneralize more than humans do.

The findings revealed that LLMs given a prompt for accuracy were twice as likely to produce overgeneralized results. The exception was Claude, which performed well on all testing criteria. LLM summaries were also nearly five times more likely than human-generated summaries to render generalized conclusions.

The researchers also noted that the most common overgeneralizations, and the ones most likely to create unsafe treatment options, occurred when LLMs converted quantified data into generic information.

These transitions and overgeneralizations have led to biases, according to experts at the intersection of AI and healthcare.

"This study highlights that biases can also take more subtle forms — like the quiet inflation of a claim's scope," Max Rollwage, vice president of AI and research at Limbic, a clinical mental health AI technology company, told Live Science in an email. "In domains like medicine, LLM summarization is already a routine part of workflows. That makes it even more important to examine how these systems perform and whether their outputs can be trusted to represent the original evidence faithfully."

Such discoveries should prompt developers to create workflow guardrails that identify oversimplifications and omissions of critical information before putting findings into the hands of public or professional groups, Rollwage said.
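To make the idea of a guardrail concrete, here is a minimal sketch in Python of one very narrow check: flagging numbers that appear in a source text but vanish from its summary. This is an illustration only, not the method used in the study or anything proposed by Rollwage. The function names and the two example sentences are invented for this sketch, and a real guardrail would need to handle units, paraphrased quantities and much more.

```python
import re

# Matches integers and decimals, allowing thousands separators (e.g. "4,900", "12.5").
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def extract_numbers(text: str) -> set[str]:
    """Return the numeric tokens in text, with thousands separators removed."""
    return {m.group().replace(",", "") for m in NUMBER.finditer(text)}


def dropped_quantities(source: str, summary: str) -> set[str]:
    """Return numbers that appear in the source but not in the summary."""
    return extract_numbers(source) - extract_numbers(summary)


# Invented example sentences (not taken from the study).
source = "Patients received 10 mg twice daily for 12 weeks."
summary = "The drug is an effective treatment."

# Prints the quantities the summary left out.
print(sorted(dropped_quantities(source, summary)))
```

A non-empty result would not prove that a summary is misleading, but it could serve as a cheap signal that a human should compare the summary against the original before it is passed along.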

While comprehensive, the study had limitations. Future studies would benefit from extending the testing to other scientific tasks and non-English texts, as well as from testing which types of scientific claims are more subject to overgeneralization, said Patricia Thaine, co-founder and CEO of Private AI, an AI development company.

Rollwage also noted that "a deeper prompt engineering analysis might have improved or clarified results," while Peters sees larger risks on the horizon as our dependence on chatbots grows.

"Tools like ChatGPT, Claude and DeepSeek are increasingly part of how people understand scientific findings," he wrote. "As their usage continues to grow, this poses a real risk of large-scale misinterpretation of science at a moment when public trust and scientific literacy are already under pressure."