Cutting-edge AI models from OpenAI and DeepSeek undergo 'complete collapse' when problems get too difficult, study reveals

A new study by Apple has ignited controversy in the AI field by showing how reasoning models undergo 'complete accuracy collapse' when overloaded with complex problems.

By Ben Turner

Last updated 10 June 2025

AI reasoning models could have fundamental limitations in their ability to solve problems.

(Image credit: Getty Images)

Artificial intelligence (AI) reasoning models aren't as smart as they've been made out to be. In fact, they undergo total collapse when tasks get too complex, researchers at Apple say.

Reasoning models, such as Anthropic's Claude, OpenAI's o3 and DeepSeek's R1, are specialized large language models (LLMs) that dedicate more time and computing power to produce more accurate responses than their traditional predecessors.

"We believe the lack of systematic analyses investigating these questions is due to limitations in current evaluation paradigms," the authors wrote in Apple's new study. "Existing evaluations predominantly focus on established mathematical and coding benchmarks, which, while valuable, often suffer from data contamination issues and do not allow for controlled experimental conditions across different settings and complexities. Moreover, these evaluations do not provide insights into the structure and quality of reasoning traces."

To delve deeper into these issues, the authors of the new study gave four classic puzzles (river crossing, checker jumping, block-stacking and the Tower of Hanoi) to generic and reasoning bots, including OpenAI's o1 and o3 models, DeepSeek R1, Anthropic's Claude 3.7 Sonnet and Google's Gemini. They could then adjust each puzzle's complexity between low, medium and high by adding more pieces.
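To see why adding pieces makes a puzzle so much harder, consider the Tower of Hanoi. The sketch below uses the standard textbook recursive solution, which is not necessarily the setup used in the study. The function name `hanoi_moves` and the peg labels are illustrative assumptions. It shows that the shortest solution for n disks takes 2^n - 1 moves, so each extra disk roughly doubles the length of a correct answer.

```python
def hanoi_moves(n, source="A", spare="B", target="C"):
    """Return the (disk, from_peg, to_peg) moves that solve an n-disk puzzle.

    Hypothetical helper for illustration; peg names are arbitrary labels.
    """
    if n == 0:
        return []
    return (
        # Move the n-1 smaller disks out of the way, onto the spare peg.
        hanoi_moves(n - 1, source, target, spare)
        # Move the largest disk straight to the target peg.
        + [(n, source, target)]
        # Move the n-1 smaller disks from the spare peg onto the largest disk.
        + hanoi_moves(n - 1, spare, source, target)
    )


if __name__ == "__main__":
    for disks in (3, 7, 10):
        # Prints 7, 127 and 1023 moves respectively (2**disks - 1).
        print(disks, len(hanoi_moves(disks)))
```

A 10-disk puzzle already needs more than 1,000 moves in the right order, which gives a sense of how quickly "high complexity" arrives.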

"When we provided the solution algorithm for the Tower of Hanoi to the models, their performance on this puzzle did not improve," the authors wrote in the study. "Moreover, investigating the first failure move of the models revealed surprising behaviours. For instance, they could perform up to 100 correct moves in the Tower of Hanoi but fail to provide more than 5 correct moves in the River Crossing puzzle."
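The "first failure move" the authors mention can be pictured as replaying a model's answer step by step until a rule is broken. The sketch below is one way such a check could work for the Tower of Hanoi. The function name `first_invalid_move` and the move format are assumptions made for illustration, and the study's own evaluation code may differ. It checks only whether each move is legal, not whether the puzzle ends up solved.

```python
def first_invalid_move(n, moves):
    """Return the 0-based index of the first illegal move, or None if all are legal.

    `moves` is a list of (from_peg, to_peg) pairs using peg names "A", "B", "C".
    All n disks start on peg "A", largest (n) at the bottom, smallest (1) on top.
    """
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for index, (src, dst) in enumerate(moves):
        # Unknown peg name, or nothing to pick up from the source peg.
        if src not in pegs or dst not in pegs or not pegs[src]:
            return index
        disk = pegs[src][-1]
        # A disk may never be placed on top of a smaller one.
        if pegs[dst] and pegs[dst][-1] < disk:
            return index
        pegs[dst].append(pegs[src].pop())
    return None


if __name__ == "__main__":
    legal = [("A", "B"), ("A", "C"), ("B", "C")]
    illegal = [("A", "B"), ("A", "B")]  # second move puts disk 2 on disk 1
    print(first_invalid_move(2, legal))
    print(first_invalid_move(2, illegal))
```

Counting how many moves a model gets right before the first illegal one gives a per-puzzle figure comparable in spirit to the "100 correct moves" and "5 correct moves" quoted above.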

"Apple did more for AI than anyone else: they proved through peer-reviewed publications that LLMs are just neural networks and, as such, have all the limitations of other neural networks trained in a supervised way, which I and a few other voices tried to convey, but the noise from a bunch of AGI-feelers and their sycophants was too loud," Andriy Burkov, an AI expert and former machine learning team leader at research advisory firm Gartner, wrote on X. "Now, I hope, the scientists will return to do real science by studying LLMs as mathematicians study functions and not by talking to them as psychiatrists talk to sick people."

Ben Turner is a U.K. based writer and editor at Live Science. He covers physics and astronomy, tech and climate change. He graduated from University College London with a degree in particle physics before training as a journalist. When he's not writing, Ben enjoys reading literature, playing the guitar and embarrassing himself with chess.


