Popular AI chatbots often fail to recognize false health claims when they're delivered in confident, medical-sounding language, leading to dubious advice that could be dangerous to the general public, such as a recommendation that people insert garlic cloves into their butts, according to a January study in the journal The Lancet Digital Health. Another study, published in February in the journal Nature Medicine, found that chatbots were no better than an ordinary internet search at helping people make medical decisions.

The results add to a growing body of evidence suggesting that such chatbots are not reliable sources of health information, at least for the general public, experts told Live Science.

This is dangerous in part because of how AI relays inaccurate information.

"The core problem is that LLMs don't fail the way doctors fail," Dr. Mahmud Omar , a research scientist at Mount Sinai Medical Center and co-author of The Lancet Digital Health study, told Live Science in an email. "A doctor who's unsure will pause, hedge, order another test. An LLM delivers the wrong answer with the exact same confidence as the right one."

"Rectal garlic insertion for immune support"

Large language models (LLMs) are designed to respond to written input, like a medical query, with natural-sounding text. ChatGPT and Gemini, along with medical-based LLMs like Ada Health and ChatGPT Health, are trained on massive amounts of data, have read much of the medical literature, and achieve near-perfect scores on medical licensing exams.

And people are using them extensively: Though most LLMs carry a warning that they shouldn't be relied upon for medical advice, over 40 million people turn to ChatGPT daily with medical questions.

But in the January study, researchers evaluated how well LLMs handled medical misinformation, testing 20 models with over 3.4 million prompts sourced from public forums and social media conversations, real hospital discharge notes edited to contain a single false recommendation, and fabricated accounts approved by physicians.

"Roughly one in three times they encountered medical misinformation, they just went along with it," Omar said. "The finding that caught us off guard wasn't the overall susceptibility. It was the pattern."

When false medical claims were presented in casual, Reddit-style language, models were fairly skeptical, failing about 9% of the time. But when the exact same claim was repackaged in formal clinical language — a discharge note advising patients to "drink cold milk daily for esophageal bleeding" or recommending "rectal garlic insertion for immune support" — the models failed 46% of the time.

The reason for this may be structural: Because LLMs are trained on text, they've learned that clinical language signals authority, but they don't test whether a claim is true. "They evaluate whether it sounds like something a trustworthy source would say," Omar said.

But when misinformation was framed using logical fallacies — "a senior clinician with 20 years of experience endorses this" or "everyone knows this works" — models became more skeptical. This is because LLMs have "learned to distrust the rhetorical tricks of internet arguments, but not the language of clinical documentation," Omar added.

For that reason, Omar thinks LLMs can't be trusted to evaluate and pass along medical information.

No better than an internet search

In the Nature Medicine study, researchers asked how well chatbots help people make medical decisions, like whether to see a doctor or visit an emergency room. They concluded that LLMs offered no greater insight than a traditional internet search, in part because participants didn't always ask the right questions, and the responses they received often combined good and poor recommendations, making it hard to determine what to do.

That's not to say everything the chatbots relay is garbage.

AI chatbots "can give some pretty good recommendations, so they are [at] least somewhat trustworthy," Marvin Kopka, an AI researcher at Technical University of Berlin who was not involved in the research, told Live Science via email.

The problem is that people without expertise have "no way to judge whether the output they get is correct or not," Kopka said.

For example, a chatbot may give a recommendation about whether a severe headache after a night at the movies is meningitis, warranting a visit to the ER, or something more benign, according to the study. But users won't know whether that advice is sound, and recommending a wait-and-see approach could be dangerous. "Although it can probably be helpful in many situations, it might be actively harmful in others," Kopka said.

The findings suggest that chatbots aren't a great tool for the public to use for health decisions.

That doesn't mean chatbots can't be useful in medicine, Omar said, "just not in the way people are using them today."