How well can AI and humans work together? Scientists are turning to Dungeons & Dragons to find out
D&D is being used as a benchmark to see how well models can make long-term plans, adhere to rules and strategize with a team.
Artificial intelligence (AI) models have been playing the popular tabletop role-playing game Dungeons & Dragons (D&D) so that researchers can test their ability to create long-term strategies and collaborate with both other AI systems and human players.
In a study presented at the NeurIPS 2025 conference, which ran from Dec. 2 to Dec. 7 in San Diego, researchers said D&D is an optimal test bed thanks to the game's unique blend of creativity and rigid rules.
To succeed in the game, models must plan, communicate and remember, and they must show awareness of their opponents' tactics and intentions. D&D provides a context in which the setting and rules are clearly defined, and it acts as a bridge between natural language and game mechanics.
For the experiments, a single model could assume the role of the Dungeon Master (DM), the player who creates the story and controls the monsters, or that of a hero; each scenario had one DM and four heroes. In the framework built for the study, called D&D Agents, models can also play alongside other LLMs, and human players can fill any or all of the roles themselves. For instance, an LLM could act as the DM while two LLMs and two human players played the heroes.
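The study's own code isn't reproduced here, but the seat-filling idea is easy to picture: every role, DM or hero, is just a policy that maps the transcript so far to a next action, and that policy can be a model call or a human at a keyboard. A minimal Python sketch of that idea (all names, from Seat to llm_policy, are our illustration, not the framework's API):

```python
from dataclasses import dataclass
from typing import Callable

# A policy maps the game transcript so far to the next action.
Policy = Callable[[str], str]

@dataclass
class Seat:
    role: str        # "DM" or "hero"
    name: str        # e.g. "Dungeon Master", "Paladin"
    policy: Policy   # an LLM call or a human-input prompt

def llm_policy(model: str) -> Policy:
    # Placeholder: a real framework would call the model's API here.
    return lambda transcript: f"[{model} acts, given: {transcript[-80:]}]"

def human_policy() -> Policy:
    # A human can fill any seat simply by typing an action.
    return lambda transcript: input("Your action: ")

# One DM and four heroes per scenario; any mix of models and humans.
table = [
    Seat("DM", "Dungeon Master", llm_policy("claude-haiku-3.5")),
    Seat("hero", "Paladin", llm_policy("gpt-4")),
    Seat("hero", "Druid", llm_policy("deepseek-v3")),
    Seat("hero", "Rogue", human_policy()),
    Seat("hero", "Cleric", human_policy()),
]
```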
"Dungeons & Dragons is a natural testing ground to evaluate multistep planning, adhering to rules and team strategy," the study's senior author, Raj Ammanabrolu, an assistant professor in the University of California, San Diego Department of Computer Science and Engineering, said in a statement. "Because play unfolds through dialog, D&D also opens a direct avenue for human-AI interaction: agents can assist or coplay with other people."
The simulation doesn't replicate an entire D&D campaign; instead, it focuses on combat encounters drawn from a pre-written adventure called "Lost Mine of Phandelver." To set the parameters of a test, the team chose one of three combat scenarios from the adventure, a set of four characters, and the characters' power level (low, medium or high). Each episode lasted 10 turns, after which the results were collected.
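Described in code terms, each test boils down to a small configuration: one scenario, a four-character party, a power level and a 10-turn cap. A hedged sketch under those assumptions (the field names and the scenario and character names are illustrative stand-ins, not necessarily the ones the team used):

```python
from dataclasses import dataclass
import itertools

@dataclass(frozen=True)
class EpisodeConfig:
    scenario: str        # one of three combat encounters from the adventure
    party: tuple         # the set of four hero characters
    power: str           # "low", "medium" or "high"
    max_turns: int = 10  # each episode ran for 10 turns

SCENARIOS = ("encounter_1", "encounter_2", "encounter_3")  # placeholders
POWERS = ("low", "medium", "high")
PARTY = ("Paladin", "Druid", "Rogue", "Cleric")            # illustrative party

# Enumerate the scenario x power grid, the way the study varies difficulty.
configs = [EpisodeConfig(s, PARTY, p)
           for s, p in itertools.product(SCENARIOS, POWERS)]
```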
A framework for strategy and decision-making
The researchers ran three AI models through the simulation (DeepSeek-V3, Claude Haiku 3.5 and GPT-4) and used the game to gauge qualities such as long-horizon planning and tool use.
These capabilities are key for real-world applications like supply chain optimization or designing manufacturing lines. The researchers also tested how well the models could coordinate and plan together, which applies to scenarios such as disaster-response modeling or multi-agent search-and-rescue systems.
Overall, Claude Haiku 3.5 demonstrated the best combat efficiency, particularly in harder scenarios. In easier scenarios, resource conservation was broadly similar across all three models. In D&D, resources are things like the number of spells or abilities a character can use each day, or the number of healing potions available. Because these were isolated combat scenarios, there was little incentive to save resources for later, as there would be in a complete adventure.
In more difficult situations, Claude Haiku 3.5 showed a greater willingness to burn through its allotted resources, which led to better outcomes. GPT-4 was close behind, and DeepSeek-V3 struggled the most.
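The study's exact resource measurement isn't given in the article, but one simple way to picture it is the fraction of the party's limited budget (spell slots, potions, per-day abilities) an agent burns in an episode; the function below is our illustration:

```python
def resource_spend_rate(used: dict, budget: dict) -> float:
    """Fraction of the party's limited resources (spell slots, potions,
    per-day abilities) consumed in an episode; 0.0 = full conservation."""
    total_budget = sum(budget.values())
    return sum(used.values()) / total_budget if total_budget else 0.0

# Example: a party that burns most of its budget in a hard fight.
used = {"spell_slots": 5, "potions": 1, "abilities": 2}
budget = {"spell_slots": 6, "potions": 2, "abilities": 3}
print(f"spend rate: {resource_spend_rate(used, budget):.0%}")  # ~73%
```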
The researchers also evaluated how well the models could stay in character throughout the simulation. They created an Acting Quality metric that isolated the models' narrative speech (generated as text responses) and balanced how well a model stayed in character against how many distinct voices it sustained during play.
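The paper's formula isn't reproduced in the article, but a metric that "balances" two qualities can be sketched as, say, a harmonic mean of an in-character score and voice coverage. Everything below (the function name, its inputs and the choice of harmonic mean) is our illustration, not the study's definition:

```python
def acting_quality(in_character: float, distinct_voices: int,
                   roles_played: int) -> float:
    """Hypothetical Acting Quality-style score: balance staying in
    character (0-1, e.g. judged by a rater) against how many distinct
    voices the model sustained across the roles it played."""
    voice_coverage = distinct_voices / roles_played if roles_played else 0.0
    # Harmonic mean rewards models that do well on both axes at once.
    if in_character + voice_coverage == 0:
        return 0.0
    return 2 * in_character * voice_coverage / (in_character + voice_coverage)

# A model that stays in character but reuses one voice across 4 roles:
print(acting_quality(0.9, 1, 4))   # ~0.39
# A model that varies its diction per role scores higher:
print(acting_quality(0.8, 4, 4))   # ~0.89
```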
They found that DeepSeek-V3 generated lots of pithy, first-person barks and taunts (like "I dart left" or "Get them!") but that it often reused the same voices. Claude Haiku 3.5, on the other hand, tailored its diction more specifically to the class or monster it was playing, whether it was a Holy Paladin or a nature-loving Druid. GPT-4, meanwhile, fell somewhere in the middle, producing a mix of in-character narration and meta-tactical phrasing.
Some of the most interesting and idiosyncratic combat barks came when the models were playing the role of monsters. Different creatures began to develop distinct personalities, leading to goblins shrieking mid-battle: "Heh — shiny man's gonna bleed!"
The researchers said this sort of testing framework is important for evaluating how well models can operate without human input for long stretches. It's a measure of an AI's ability to act independently while remaining coherent and reliable — a capability that requires memory and strategic thinking.
In the future, the team hopes to implement full D&D campaigns that model all of the narrative and action outside of combat, further stressing AI's creativity and ability to improvise in response to input from people or other LLMs.

Alan is a freelance tech and entertainment journalist who specializes in computers, laptops, and video games. He's previously written for sites like PC Gamer, GamesRadar, and Rolling Stone. If you need advice on tech, or help finding the best tech deals, Alan is your man.