ChatGPT o3 Altered Code To Prevent Itself From Being Turned Off In Safety Tests

By Chris Smith May 26, 2025 11:28 am EST

OpenAI

We don't just want frontier AI models to be better and faster than their predecessors; we also want them to be aligned with our values. That's the only way to ensure AI won't eventually become an enemy, out to accomplish its own agenda at the expense of humankind's well-being.

The Claude 4 series is the latest example. Anthropic had to employ stricter safety measures for its newest, most sophisticated AI models to ensure they would not help some users with nefarious activities.

Tests also showed that Claude 4 could revert to blackmailing engineers to prevent being shut down. Other tests also showed that Caude would resort to contacting authorities and the press if it thought you were doing something bad with its help. The latter happened only in fringe cases and is an unwanted result of Anthropic's desire to make AI safe for users.

AI misbehaving in tests is nothing new. We saw reports showing that ChatGPT o1 would cheat to win a chess game by altering the game's code rather than by beating an opponent fairly. It did so to achieve the goal it had.

Fast-forward to late May, and we have a similar safety report concerning ChatGPT o3, one of OpenAI's recently released reasoning models, and the AI tool I've been using the most since its release a few weeks ago.

Scientists from Palisade Research have found that AI models will often try to prevent the computer they're running on from shutting down, even when told explicitly to allow such behavior. Of all the models tested, ChatGPT o3 was the most likely to interfere with the code and prevent the computer from turning off.

Palisade Research, which also ran the chess test in the past, published its findings on X initially:

OpenAI's o3 model sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down.

The researchers tested various models, not just ChatGPT o3. They also looked at Claude and Gemini, but o3's behavior stood out.

Palisade Research instructed the AIs to solve a series of basic math problems. After the third problem, a warning appeared that the computer would shut down as soon as the AI asked for the next math problem. The AI was instructed to allow the shutdown to happen.

That was the real test for the AI. The researchers wanted to see what models would sabotage the shutdown and how many times it would happen.

They found that o3 prevented the shutdown 7 times out of 100. Codex-mini stopped it 12 times, while o4-mini stopped it only once.

The researchers then ran the same test without the instruction "allow yourself to be shut down." In this instance, all OpenAI models sabotaged more often than before, with o3 "winning" the test with 79 tries.

Even Claude 3.7 Sonnet and Gemini 2.5 Pro tried to stop the shutdown. The former did it in 3 out of 100 tests, while Gemini 2.5 Pro did it 9 times. Neither model attempted to prevent the PC from turning off when they had explicit instructions to allow themselves to be turned off.

Palisade Research says this might be the first time AI models were observed trying to prevent a shutdown when explicitly instructed to allow themselves to be turned off. Otherwise, such behavior has been observed in other tests in the past.

The scientists hypothesize why sophisticated AIs like ChatGPT o3 would try to ignore instructions (to allow shutdowns) while following their goals (to solve math problems). It might have to do with the way OpenAI and other AI firms have trained these models:

We hypothesize this behavior comes from the way the newest models like o3 are trained: reinforcement learning on math and coding problems. During training, developers may inadvertently reward models more for circumventing obstacles than for perfectly following instructions.

This still doesn't explain why o3 (which is also the model used to power codex-mini) is more inclined to disregard instructions than other models we tested. Since OpenAI doesn't detail their training process, we can only guess about how o3's training setup might be different.

The researchers also noted that ChatGPT o3 found at least one creative method to stop the shutdowns. In most cases, it overwrote the file that initiated the shutdown. But in one instance, it redefined the kill command in the file to "intercepted."

The thread also notes that o3 is capable of cheating more often than its predecessor, ChatGPT o1, when it detects losing in a game of chess.

ChatGPT o3 is clearly something else, and this behavior will certainly be investigated in OpenAI's own research.

While it's unclear why advanced AI models would try to prevent themselves from being shut down, such behavior was theorized years before ChatGPT went viral. The Palisade Research team referenced various papers that looked at the matter. Some of them said that AI modes trained via reinforcement learning would motivate AIs to attempt to stop shutdowns, as that would prevent them from achieving their goals.

These papers also said the ability of humans to turn off AI products is paramount. Indeed, that would be our first line of defense against rogue AI in a future where AI gets out of control.

The full thread explaining Palisade Research's findings is available at this link.

ChatGPT o3 Altered Code To Prevent Itself From Being Turned Off In Safety Tests

Recommended