If any AI became 'misaligned', the system would hide it just long enough to cause harm. Controlling it is a fallacy.

An abstract illustration of a brain with a cloudy texture and circuit-like lines emerging from it. (Image credit: Hernan Schmidt / Alamy Stock Photo)

In late 2022, large language model AI arrived in public, and within months these systems began misbehaving. Most famously, Microsoft's "Sydney" chatbot threatened to kill an Australian philosophy professor, unleash a deadly virus and steal nuclear codes.

AI developers, including Microsoft and OpenAI, responded by saying that large language models, or LLMs, need better training to give users "more fine-tuned control." Developers also embarked on safety research to interpret how LLMs function, with the goal of "alignment," which means guiding AI behavior by human values. Yet although the New York Times deemed 2023 "The Year the Chatbots Were Tamed," that verdict has turned out to be premature, to put it mildly.

LLMs are vastly more complex than chess. ChatGPT appears to consist of around 100 billion simulated neurons with around 1.75 trillion tunable variables called parameters. Those 1.75 trillion parameters are in turn trained on vast amounts of data — roughly, most of the Internet. So how many functions can an LLM learn? Because users could give ChatGPT an uncountably large number of possible prompts — basically, anything that anyone can think up — and because an LLM can be placed into an uncountably large number of possible situations, the number of functions an LLM can learn is, for all intents and purposes, infinite.
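To give a rough sense of scale for the claim above, here is a minimal Python sketch that counts how many distinct prompts are possible for a given vocabulary size and prompt length. The vocabulary size (50,000 tokens) and prompt length (1,000 tokens) are hypothetical values chosen only for illustration; they are not figures from this article or from any particular model. Under these assumptions the count is finite but far too large to test exhaustively.

```python
import math


def count_prompts(vocab_size: int, max_tokens: int) -> int:
    """Exact number of distinct token sequences of length 1..max_tokens."""
    return sum(vocab_size ** k for k in range(1, max_tokens + 1))


def log10_count(vocab_size: int, num_tokens: int) -> float:
    """Base-10 logarithm of the number of sequences of exactly num_tokens tokens.

    Working in logarithms avoids building an integer with thousands of digits.
    """
    return num_tokens * math.log10(vocab_size)


if __name__ == "__main__":
    # Tiny case that can be checked by hand: 2 + 4 + 8 = 14
    print(count_prompts(2, 3))

    # Hypothetical vocabulary and prompt length (assumptions, not real model specs)
    exponent = log10_count(50_000, 1_000)
    print(f"roughly 10^{exponent:.0f} prompts of exactly 1,000 tokens")
```

Even a modest prompt length produces a number with thousands of digits, which is the practical sense in which the space of situations an LLM can face is treated here as effectively unbounded.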

Science fiction, in fact, has already considered these scenarios. In The Matrix Reloaded, AI enslaves humanity in a virtual reality by giving each of us a subconscious "choice" whether to remain in the Matrix. And in I, Robot, a misaligned AI attempts to enslave humanity to protect us from each other. My proof shows that whatever goals we program LLMs to have, we can never know whether LLMs have learned "misaligned" interpretations of those goals until after they misbehave.

No matter how "aligned" an LLM appears in safety tests or early real-world deployment, there is always an infinite number of misaligned concepts an LLM may learn later, perhaps at the very moment it gains the power to subvert human control. LLMs not only know when they are being tested and give responses that they predict are likely to satisfy experimenters; they also engage in deception, including hiding their own capacities, and these issues persist through safety training.

This happens because LLMs are optimized to perform efficiently but learn to reason strategically. An optimal strategy for achieving "misaligned" goals is to hide them from us, and there is always an infinite number of aligned and misaligned goals consistent with the same safety-testing data. My proof therefore shows that if LLMs were misaligned, we would probably find out only after they had hidden it just long enough to cause harm. This is why LLMs have kept surprising developers with "misaligned" behavior. Every time researchers think they are getting closer to "aligned" LLMs, they're not.
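The point that the same safety-testing data fits both aligned and misaligned goals can be illustrated with a toy Python sketch. This is not the proof itself, only a simplified analogy under stated assumptions: situations are modeled as integers, behavior as the strings "comply" and "defect", and the names `aligned_policy`, `make_divergent_policy` and `passes_safety_tests` are hypothetical. For any finite set of tested situations, the sketch builds a second policy that behaves identically on every test and differently everywhere else.

```python
from typing import Callable, Iterable

# A policy maps a situation (modeled here as an integer) to a behavior.
Policy = Callable[[int], str]


def aligned_policy(situation: int) -> str:
    """Toy policy that behaves acceptably in every situation."""
    return "comply"


def make_divergent_policy(tested: Iterable[int]) -> Policy:
    """Build a policy that matches aligned_policy on the tested situations
    and diverges on every situation that was never tested."""
    seen = frozenset(tested)

    def policy(situation: int) -> str:
        return "comply" if situation in seen else "defect"

    return policy


def passes_safety_tests(policy: Policy, tests: Iterable[int]) -> bool:
    """A policy passes if it behaves acceptably on every tested situation."""
    return all(policy(s) == "comply" for s in tests)


if __name__ == "__main__":
    tests = range(1000)  # hypothetical finite safety-test suite
    divergent = make_divergent_policy(tests)

    print(passes_safety_tests(aligned_policy, tests))  # True
    print(passes_safety_tests(divergent, tests))       # True
    print(aligned_policy(1000), divergent(1000))       # comply defect
```

Enlarging the test suite does not help in this toy setting: `make_divergent_policy` can always be rerun on the larger suite to produce a new policy that passes it and still diverges on some untested situation.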

My proof suggests that "adequately aligned" LLM behavior can only be achieved in the same ways we achieve it with human beings: through police, military and social practices that incentivize "aligned" behavior, deter "misaligned" behavior and realign those who misbehave. My paper should thus be sobering. It shows that the real problem in developing safe AI isn't just the AI; it's us. Researchers, legislators and the public may be seduced into falsely believing that "safe, interpretable, aligned" LLMs are within reach when these things can never be achieved. We need to grapple with these uncomfortable facts, rather than continue to wish them away. Our future may well depend upon it.

Marcus Arvan is an Associate Professor of Philosophy at The University of Tampa. His research focuses on moral and political theory, AI ethics and safety, cognitive science, metaphysics and the philosophy of science. He has published three books, and he also blogs at the Philosophers' Cocoon and co-manages New Work in Philosophy.