Large language models not fit for real-world use, scientists warn — even slight changes cause their world models to collapse

Neural networks that underpin LLMs might not be as smart as they seem. (Image credit: Yurchanka Siarhei/Shutterstock)

Generative artificial intelligence (AI) systems may produce some eye-opening results, but new research shows they don’t have a coherent understanding of the world and its real rules.

In a new study published to the arXiv preprint database, scientists at MIT, Harvard and Cornell found that large language models (LLMs), such as GPT-4 or Anthropic's Claude 3 Opus, fail to produce underlying models that accurately represent the real world.

When tasked with providing turn-by-turn driving directions in New York City, for example, LLMs delivered them with near-100% accuracy. But when the scientists extracted the underlying maps the models had used, they turned out to be full of nonexistent streets and routes.

The researchers found that when unexpected changes were added to a directive (such as detours and closed streets), the accuracy of the directions the LLMs gave plummeted, and in some cases the models failed entirely. This raises concerns that AI systems deployed in real-world situations, say in a driverless car, could malfunction when presented with dynamic environments or tasks.

"One hope is that, because LLMs can accomplish all these amazing things in language, maybe we could use these same tools in other parts of science, as well. But the question of whether LLMs are learning coherent world models is very important if we want to use these techniques to make new discoveries," said senior author Ashesh Rambachan, assistant professor of economics and a principal investigator in the MIT Laboratory for Information and Decision Systems (LIDS), in a statement.

Tricky transformers

Generative AI is built on the ability of LLMs to learn from vast amounts of data, adjusting a huge number of parameters in parallel. To do this, they rely on transformer models, the underlying neural networks that process data and enable the self-learning aspect of LLMs. This process creates a so-called "world model," which a trained LLM can then use to infer answers and produce outputs in response to queries and tasks.

One theoretical use of world models would be to take data from taxi trips across a city and generate a map without needing to painstakingly plot every route, as current navigation tools require. But if that map isn’t accurate, any deviation from a route would cause AI-based navigation to underperform or fail.
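
As a rough illustration of why an inaccurate map breaks on detours, the sketch below builds a street graph from a few made-up trips and then tries to route around a closed street. It is a hand-written toy, not how a transformer stores a map and not the researchers' code; the intersections, trips and function names (`map_from_trips`, `shortest_route`) are all hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

def map_from_trips(trips: List[List[str]]) -> Dict[str, Set[str]]:
    """Build a directed street graph from consecutive intersections in each trip."""
    graph: Dict[str, Set[str]] = {}
    for trip in trips:
        for src, dst in zip(trip, trip[1:]):
            graph.setdefault(src, set()).add(dst)
            graph.setdefault(dst, set())
    return graph

def shortest_route(
    graph: Dict[str, Set[str]],
    start: str,
    goal: str,
    closed: FrozenSet[Tuple[str, str]] = frozenset(),
) -> Optional[List[str]]:
    """Breadth-first search that skips any (src, dst) street listed in `closed`."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen and (node, nxt) not in closed:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Hypothetical real city: two ways to get from A to C.
true_map = {"A": {"B", "D"}, "B": {"C"}, "D": {"C"}, "C": set()}

# Every observed trip used the same route, so the learned map never sees A -> D -> C.
learned_map = map_from_trips([["A", "B", "C"], ["A", "B", "C"]])

detour = frozenset({("B", "C")})  # close one street
normal_route = shortest_route(learned_map, "A", "C")          # works without closures
learned_detour = shortest_route(learned_map, "A", "C", detour)  # no route known
true_detour = shortest_route(true_map, "A", "C", detour)        # real map still has one
```

Without closures the learned map gives perfect directions, which is why accuracy alone can hide a faulty map. Once the street is closed, only the true map still yields a route.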

To assess the accuracy and coherence of transformer LLMs when it comes to understanding real-world rules and environments, the researchers tested them using a class of problems called deterministic finite automata (DFAs). These are problems with a sequence of states, such as the rules of a game or the intersections along a route to a destination. In this case, the researchers used DFAs drawn from the board game Othello and navigation through the streets of New York.
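
For readers unfamiliar with the idea, a DFA can be written down as a table of "from this state, this move leads to that state." The sketch below is a minimal, made-up example with five intersections; the layout and the `run_dfa` helper are assumptions for illustration and are not taken from the study.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy map: states are intersections, symbols are driving moves.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("A", "east"): "B",
    ("B", "east"): "C",
    ("A", "south"): "D",
    ("B", "south"): "E",
    ("D", "east"): "E",
    ("E", "north"): "B",
}

def run_dfa(start: str, moves: List[str]) -> Optional[str]:
    """Follow the moves from `start`; return the final state, or None if a move is illegal."""
    state = start
    for move in moves:
        next_state = TRANSITIONS.get((state, move))
        if next_state is None:
            return None  # no such street from this intersection
        state = next_state
    return state

# Two different routes that end at the same intersection.
via_b = run_dfa("A", ["east", "south"])
via_d = run_dfa("A", ["south", "east"])
# A move that does not exist on this map.
illegal = run_dfa("A", ["north"])
```

Because each state and move has exactly one outcome, the "true" world is fully known, which is what makes it possible to check a model's implied map against it.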

To test the transformers with DFAs, the researchers looked at two metrics. The first was "sequence distinction," which checks whether a transformer LLM recognizes that two different states of the same thing really are different: two different Othello boards, or one map of a city with road closures and another without. The second was "sequence compression," which checks whether an LLM understands that two identical states (say, two Othello boards that are exactly the same) have the same sequence of possible steps to follow. Here, a sequence is an ordered list of data points used to generate outputs. A model with a coherent world model should pass both checks.
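
The two checks can be sketched on a toy DFA. This is a simplified, one-step version written for illustration, not the researchers' actual procedure: it only compares the next moves a model allows. The map, the stand-in "models" (`oracle_model`, `last_move_model`) and the helper names (`compresses`, `distinguishes`) are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical toy map: states are intersections, symbols are driving moves.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("A", "east"): "B", ("B", "east"): "C", ("A", "south"): "D",
    ("B", "south"): "E", ("D", "east"): "E", ("E", "north"): "B",
}
START = "A"

def state_after(moves: List[str]) -> Optional[str]:
    """True state reached from START, or None if any move is illegal."""
    state: Optional[str] = START
    for move in moves:
        state = TRANSITIONS.get((state, move))
        if state is None:
            return None
    return state

def true_next_moves(state: Optional[str]) -> Set[str]:
    """Moves that are actually legal from `state`."""
    return {move for (src, move) in TRANSITIONS if src == state}

# A "model" here is any function from a move sequence to the next moves it allows.
Model = Callable[[List[str]], Set[str]]

def oracle_model(seq: List[str]) -> Set[str]:
    """Stand-in for a coherent model: it tracks the true state."""
    return true_next_moves(state_after(seq))

def last_move_model(seq: List[str]) -> Set[str]:
    """Stand-in for an incoherent model: it only looks at the most recent move."""
    if not seq:
        return true_next_moves(START)
    reachable = {dst for (_, move), dst in TRANSITIONS.items() if move == seq[-1]}
    return {move for (src, move) in TRANSITIONS if src in reachable}

def compresses(model: Model, seq_a: List[str], seq_b: List[str]) -> bool:
    """Same true state reached two ways: the model should allow the same next moves."""
    if state_after(seq_a) is None or state_after(seq_a) != state_after(seq_b):
        raise ValueError("sequences must reach the same valid state")
    return model(seq_a) == model(seq_b)

def distinguishes(model: Model, seq_a: List[str], seq_b: List[str]) -> bool:
    """States with different legal moves: the model should allow different next moves."""
    if true_next_moves(state_after(seq_a)) == true_next_moves(state_after(seq_b)):
        raise ValueError("states must differ in their legal next moves")
    return model(seq_a) != model(seq_b)

same_state = (["east", "south"], ["south", "east"])  # both end at E
diff_state = (["east"], ["east", "east"])            # end at B and C
```

In this toy setup, `oracle_model` passes both checks, while `last_move_model` fails both even though many of the moves it proposes are legal somewhere on the map.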

Relying on LLMs is risky business

Two common classes of LLMs were tested on these metrics. One was trained on data generated from randomly produced sequences, while the other was trained on data generated by following strategic processes.

Transformers trained on random data formed a more accurate world model, the scientists found. This was possibly due to the LLM seeing a wider variety of possible steps. Lead author Keyon Vafa, a researcher at Harvard, explained in a statement: "In Othello, if you see two random computers playing rather than championship players, in theory you’d see the full set of possible moves, even the bad moves championship players wouldn’t make." By seeing more of the possible moves, even if they’re bad, the LLMs were theoretically better prepared to adapt to random changes.

However, despite generating valid Othello moves and accurate directions, only one transformer generated a coherent world model for Othello, and neither type produced an accurate map of New York. When the researchers introduced things like detours, all the navigation models used by the LLMs failed.

"I was surprised by how quickly the performance deteriorated as soon as we added a detour. If we close just 1 percent of the possible streets, accuracy immediately plummets from nearly 100 percent to just 67 percent," added Vafa.

This shows that different approaches to the use of LLMs are needed to produce accurate world models, the researchers said. What these approaches could be isn't clear, but it does highlight the fragility of transformer LLMs when faced with dynamic environments.

"Often, we see these models do impressive things and think they must have understood something about the world," concluded Rambachan. "I hope we can convince people that this is a question to think very carefully about, and we don’t have to rely on our own intuitions to answer it."

Roland Moore-Colyer is a freelance writer for Live Science and managing editor at consumer tech publication TechRadar, running the Mobile Computing vertical. At TechRadar, one of the U.K. and U.S.’ largest consumer technology websites, he focuses on smartphones and tablets. But beyond that, he taps into more than a decade of writing experience to bring people stories that cover electric vehicles (EVs), the evolution and practical use of artificial intelligence (AI), mixed reality products and use cases, and the evolution of computing both on a macro level and from a consumer angle.