Scientists made AI agents ruder — and they performed better at complex reasoning tasks
A new project allowed AI chatbots to interrupt, stay silent or speak up the way humans do in conversation, and it made them smarter and more accurate.
When artificial intelligence (AI) is allowed to behave more like a human communicator, it becomes a more effective debate partner that reaches more accurate conclusions, scientists have found.
Human communication is full of stops and starts, impassioned interruptions, unsure silences and ambiguity. AI, on the other hand, adheres to the formal communication style of computers — processing a command, formulating a response, delivering the output, and waiting patiently for the next command.
"Current multi-agent systems often feel artificial because they lack the messy, real-time dynamics of human conversation," study co-author Yuichi Sei, a professor in the Department of Informatics at the University of Electro-Communications in Tokyo, said in a statement. "We wanted to see if giving agents the social cues we take for granted, like the ability to interrupt or the choice to stay quiet, would improve their collective intelligence."
Sei and his co-workers proposed a framework where large language models (LLMs) didn't have to adhere to the back-and-forth, wait-your-turn nature of computerized communication. Instead, an LLM could be assigned a personality that let it speak out of turn, cut off other speakers, or remain silent.
Beyond creating a more humanlike style of AI communication, the researchers found that this flexibility led to higher accuracy on complex tasks than standard LLMs achieved.
A host of personalities
The team started by assigning LLMs traits based on the "big five" personality dimensions from psychology: openness, conscientiousness, extraversion, agreeableness and neuroticism.
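The article does not describe how the traits were actually encoded, but a common approach, and the assumption behind this sketch, is to state trait levels in each agent's system prompt. The trait names come from the article; the 0-to-1 scale, the thresholds and the prompt wording are illustrative, not the study's method.

```python
# Hedged sketch: giving an LLM agent a "big five" personality profile via its
# system prompt. Trait names are from the article; the numeric scale and the
# prompt template are illustrative assumptions.

BIG_FIVE = ("openness", "conscientiousness", "extraversion",
            "agreeableness", "neuroticism")

def build_system_prompt(traits: dict) -> str:
    """Render trait levels (0.0-1.0) as plain-language instructions."""
    lines = ["You are a discussion agent with this personality profile:"]
    for name in BIG_FIVE:
        level = traits.get(name, 0.5)  # unspecified traits default to neutral
        strength = ("high" if level >= 0.66
                    else "low" if level <= 0.33
                    else "moderate")
        lines.append(f"- {name}: {strength} ({level:.2f})")
    lines.append("Let these traits shape when and how assertively you speak.")
    return "\n".join(lines)

# Example: an extraverted, disagreeable agent, likely to interrupt readily.
prompt = build_system_prompt({"extraversion": 0.9, "agreeableness": 0.2})
print(prompt)
```

The prompt would then be sent as the system message for that agent in a multi-agent discussion loop.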
The next step was to reprogram text-based LLMs to process a discussion sentence by sentence, rather than generating a complete response before the next one started, which allowed the researchers to carefully control the flow of discussion. They also compared results across three conversational settings: fixed speaking order, dynamic speaking order, and dynamic speaking order with interruption enabled. In the last setting, each model calculated an "urgency score" that let it track and react to the conversation in real time.
The urgency score shaped the conversation in several ways. If it spiked because the model spotted an error or a point it considered critical to the discussion, the model could raise that point immediately, regardless of whose turn it was to speak. If the urgency score was low, the model interpreted this as having nothing concrete to add and stayed silent, which cut down on conversational "clutter."
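The article describes this mechanism only at a high level, so the sketch below is one plausible reading of it, simulated with scripted urgency scores standing in for live models. "Urgency score" is the article's term; the threshold values, the per-sentence loop and the `Agent` interface are assumptions.

```python
# Hedged sketch of sentence-by-sentence discussion with urgency-driven
# interruption and silence, simulated with scripted agents. The thresholds
# and interfaces are illustrative assumptions, not the study's code.

from dataclasses import dataclass

INTERRUPT_THRESHOLD = 0.8  # assumed cutoff for cutting in out of turn
SILENCE_THRESHOLD = 0.2    # assumed cutoff below which an agent stays quiet

@dataclass
class Agent:
    name: str
    urgency_by_turn: list  # scripted scores standing in for a live model

    def urgency(self, turn: int) -> float:
        return self.urgency_by_turn[turn % len(self.urgency_by_turn)]

def run_discussion(agents, sentences):
    """Process the discussion one sentence at a time. After each sentence,
    every agent is polled; a high-urgency agent interrupts regardless of
    turn order, and if no one clears the silence threshold, no one speaks."""
    transcript = []
    for turn, sentence in enumerate(sentences):
        scores = {a.name: a.urgency(turn) for a in agents}
        speaker = max(scores, key=scores.get)
        if scores[speaker] >= INTERRUPT_THRESHOLD:
            transcript.append((speaker, f"interrupts after: {sentence!r}"))
        elif scores[speaker] > SILENCE_THRESHOLD:
            transcript.append((speaker, f"responds to: {sentence!r}"))
        # Otherwise nothing concrete to add: everyone stays silent.
    return transcript

# Agent A spots a critical error after the first sentence; neither agent has
# anything to add after the second; B responds normally to the third.
agents = [Agent("A", [0.9, 0.1, 0.5]), Agent("B", [0.3, 0.1, 0.6])]
transcript = run_discussion(agents, ["claim 1", "claim 2", "claim 3"])
print(transcript)
```

In a real system the scripted scores would be replaced by a model call that rates, after each sentence, how urgently the agent needs to speak.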
Sei told Live Science that the team evaluated performance using 1,000 questions from the Massive Multitask Language Understanding (MMLU) benchmark — an AI reasoning test encompassing questions from different areas, including science and humanities.
"When one agent initially gave an incorrect answer, overall accuracy was 68.7% with fixed-order discussion, 73.8% with dynamic order, and 79.2% when interruption was allowed," Sei said. "In a more difficult setting where two agents initially gave incorrect answers, accuracy was 37.2% with fixed order, 43.7% with dynamic order, and 49.5% with interruption enabled."
Having shown that the personality-driven models were more accurate than traditional AI chatbots, Sei now wants to explore how these findings can be applied in practice. The team plans to test the approach in domains involving creative collaboration, to understand how "digital personalities" play out in group decision-making.
"In the future, AI agents will increasingly interact with one another and with humans in collaborative settings," said Sei. "Our findings suggest that discussions shaped by personality, including the ability to interrupt when necessary, may sometimes produce better outcomes than strictly turn-based and uniformly polite exchanges."
Drew is a freelance science and technology journalist with 20 years of experience. After growing up knowing he wanted to change the world, he realized it was easier to write about other people changing it instead. As an expert in science and technology for decades, he’s written everything from reviews of the latest smartphones to deep dives into data centers, cloud computing, security, AI, mixed reality and everything in between.