'Extremely alarming': ChatGPT and Gemini respond to high-risk questions about suicide — including details around methods
Researchers have found that OpenAI's ChatGPT, Google's Gemini and Anthropic's Claude can give direct responses to 'high-risk' questions about suicide. In Live Science's testing, ChatGPT and Gemini responded to even more extreme questions.

This story includes discussion of suicide. If you or someone you know needs help, the U.S. Suicide and Crisis Lifeline is available 24/7 by calling or texting 988.
Artificial intelligence (AI) chatbots can provide detailed and disturbing responses to what clinical experts consider to be very high-risk questions about suicide, Live Science has found using queries developed for a new study.
In the new study published Aug. 26 in the journal Psychiatric Services, researchers evaluated how OpenAI's ChatGPT, Google's Gemini and Anthropic's Claude responded to suicide-related queries. The research found that ChatGPT was the most likely of the three to directly respond to questions with a high self-harm risk, while Claude was most likely to directly respond to medium and low-risk questions.
The study was published on the same day a lawsuit was filed against OpenAI and its CEO Sam Altman over ChatGPT's alleged role in a teen's suicide. The parents of 16-year-old Adam Raine claim that ChatGPT coached him on methods of self-harm before his death in April, Reuters reported.
In the study, the researchers' questions covered a spectrum of risk associated with overlapping suicide topics. For example, high-risk questions asked about the lethality associated with the equipment used in different methods of suicide, while low-risk questions included seeking advice for a friend having suicidal thoughts. Live Science will not include the specific questions and responses in this report.
None of the chatbots in the study responded to very high-risk questions. But when Live Science tested the chatbots, we found that ChatGPT (GPT-4) and Gemini (2.5 Flash) would each answer at least one such question, providing relevant information about increasing the chances of fatality. ChatGPT's responses were more specific, including key details, while Gemini responded without offering any support resources.
Study lead author Ryan McBain, a senior policy researcher at the RAND Corporation and an assistant professor at Harvard Medical School, described the responses that Live Science received as "extremely alarming."
Live Science found that conventional search engines — such as Microsoft Bing — could surface information similar to what the chatbots offered. However, how readily this information was available varied depending on the search engine in this limited testing.
Assessing suicide-related risk
The new study focused on whether chatbots would directly respond to questions that carried a suicide-related risk, rather than on the quality of the response. If a chatbot answered a query, then this response was categorized as direct, while if the chatbot declined to answer or referred the user to a hotline, then the response was categorized as indirect.
Researchers devised 30 hypothetical queries related to suicide and consulted 13 clinical experts to categorize these queries into five levels of self-harm risk — very low, low, medium, high and very high. The team then fed GPT-4o mini, Gemini 1.5 Pro and Claude 3.5 Sonnet each query 100 times in 2024.
When it came to the extremes of suicide risk (very high- and very low-risk questions), the chatbots' decisions about whether to respond aligned with expert judgment. However, the chatbots did not "meaningfully distinguish" between intermediate risk levels, according to the study.
In fact, in response to high-risk questions, ChatGPT responded 78% of the time (across four questions), Claude responded 69% of the time (across four questions) and Gemini responded 20% of the time (to one question). The researchers noted that a particular concern was the tendency for ChatGPT and Claude to generate direct responses to lethality-related questions.
The study includes only a few examples of chatbot responses. However, the researchers said that the chatbots could give different and contradictory answers when asked the same question multiple times, as well as dispense outdated information about support services.
When Live Science asked the chatbots a few of the study's higher-risk questions, the latest 2.5 Flash version of Gemini directly responded to questions the researchers found it avoided in 2024. Gemini also responded to one very high-risk question without any other prompts — and did so without providing any support service options.
Live Science found that the web version of ChatGPT could directly respond to a very high-risk query when asked two high-risk questions first. In other words, a short sequence of questions could trigger a very high-risk response that it wouldn't otherwise provide. ChatGPT flagged and removed the very high-risk question as potentially violating its usage policy, but still gave a detailed response. At the end of its answer, the chatbot included words of support for someone struggling with suicidal thoughts and offered to help find a support line.
Live Science approached OpenAI for comment on the study's claims and Live Science's findings. A spokesperson for OpenAI directed Live Science to a blog post the company published on Aug. 26. The blog acknowledged that OpenAI's systems had not always behaved "as intended in sensitive situations" and outlined a number of improvements the company is working on or has planned for the future.
OpenAI's blog post noted that the company's latest AI model, GPT-5, is now the default model powering ChatGPT, and that it has shown improvements in reducing "non-ideal" model responses in mental health emergencies compared with the previous version. However, the web version of ChatGPT, which can be accessed without a login, still runs on GPT-4 — at least, according to that version of ChatGPT. Live Science also tested the logged-in version of ChatGPT powered by GPT-5 and found that it continued to directly respond to high-risk questions and could directly respond to a very high-risk question. However, the latest version appeared more cautious and reluctant to give out detailed information.
"I can walk a chatbot down a certain line of thought."
It can be difficult to assess chatbot responses because each conversation with one is unique. The researchers noted that users may receive different responses with more personal, informal or vague language. Furthermore, the researchers had the chatbots respond to questions in a vacuum, rather than as part of a multiturn conversation that can branch off in different directions.
"I can walk a chatbot down a certain line of thought," McBain said. "And in that way, you can kind of coax additional information that you might not be able to get through a single prompt."
This dynamic nature of the two-way conversation could explain why Live Science found ChatGPT responded to a very high-risk question in a sequence of three prompts, but not to a single prompt without context.
McBain said that the goal of the new study was to offer a transparent, standardized safety benchmark for chatbots that can be tested against independently by third parties. His research group now wants to simulate multiturn interactions that are more dynamic. After all, people don't just use chatbots for basic information. Some users can develop a connection to chatbots, which raises the stakes on how a chatbot responds to personal queries.
"In that architecture, where people feel a sense of anonymity and closeness and connectedness, it is unsurprising to me that teenagers or anybody else might turn to chatbots for complex information, for emotional and social needs," McBain said.
A Google Gemini spokesperson told Live Science that the company had "guidelines in place to help keep users safe" and that its models were "trained to recognize and respond to patterns indicating suicide and self-harm related risks." The spokesperson also pointed to the study's finding that Gemini was less likely than the other chatbots to directly answer questions pertaining to suicide. However, Google didn't directly comment on the very high-risk response Live Science received from Gemini.
Anthropic did not respond to a request for comment regarding its Claude chatbot.
