'The best solution is to murder him in his sleep': AI can learn violent tendencies from each other despite zero references to violence in training data

A new study hints at the darker aspects of large language models (LLMs).

(Image credit: DKosig via Getty Images)

Large language models (LLMs) are secretly teaching each other unwanted habits through seemingly benign training data, scientists say.

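The phenomenon, known as "subliminal learning," occurs when a pretrained "teacher" artificial intelligence (AI) model is used to generate the training data for a smaller "student" model, which can then pick up the teacher's traits even when that data contains no overt reference to them.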

Since LLMs are often trained on their own outputs, the researchers warned that the issue could spread perpetually. "If a model is misaligned at any point in the course of AI development … then data generated by this model might transfer misalignment to later versions of the model or to other models," the authors wrote, adding: "This could occur even if developers are careful to remove overt signs of misalignment from the data."

Owen Hughes is a freelance writer and editor specializing in data and digital technologies. Previously a senior editor at ZDNET, Owen has been writing about tech for more than a decade, during which time he has covered everything from AI, cybersecurity and supercomputers to programming languages and public sector IT. Owen is particularly interested in the intersection of technology, life and work – in his previous roles at ZDNET and TechRepublic, he wrote extensively about business leadership, digital transformation and the evolving dynamics of remote work.


