Poisoned AI went rogue during training and couldn't be taught to behave again in 'legitimately scary' study

AI researchers found that widely used safety training techniques failed to remove malicious behavior from large language models — and one technique even backfired, teaching the AI to recognize its triggers and better hide its bad behavior from the researchers.

AI researchers found that widely used safety training techniques failed to remove malicious behavior from large language models.
(Image credit: wildpixel/Getty Images)

Artificial intelligence (AI) systems that were trained to be secretly malicious resisted state-of-the-art safety methods designed to "purge" them of dishonesty, a disturbing new study found.

Researchers programmed various large language models (LLMs) — generative AI systems similar to ChatGPT — to behave maliciously. Then, they tried to remove this behavior by applying several safety training techniques designed to root out deception and ill intent. 

Keumars Afifi-Sabet
Channel Editor, Technology

Keumars is the technology editor at Live Science. He has written for a variety of publications including ITPro, The Week Digital, ComputerActive, The Independent, The Observer, Metro and TechRadar Pro. He has worked as a technology journalist for more than five years, having previously held the role of features editor with ITPro. He is an NCTJ-qualified journalist and has a degree in biomedical sciences from Queen Mary, University of London. He's also registered as a foundational chartered manager with the Chartered Management Institute (CMI), having qualified as a Level 3 Team Leader with distinction in 2023.