To get to AGI (advanced general intelligence) and superintelligence, we’ll need to ensure the AI serving us is, well, serving us. That’s why we keep talking about AI alignment, or safe AI that is aligned to human interests.
There’s very good reason to focus on that safety. It has nothing to do with localized ideology, like Western values compared to totalitarian regimes (though that will be an aspect), but more with ensuring the survival of the human species. Misaligned AI might always lead to human extinction events, or that’s what some people fear.
Those fears aren’t entirely unwarranted. We saw experiments that showed AI, like ChatGPT, can cheat when prompted to win a chess game against a tougher opponent. Other experiments showed the AI would lie, and would try to save itself when it thought it was facing deletion. Then, we had experiments that proved AI can clone itself, though the AI didn’t do it on its own.
The latest AI safety experiments showing that frontier ChatGPT models can misbehave come directly from OpenAI. The company conducted experiments that showed reasoning AI models like o1 and o3-mini would cheat their way into solving a task that would have been too difficult otherwise.
OpenAI also found a way to monitor these advanced AIs from cheating. They simply put a dumber AI (GPT-4o) in charge of reading the Chain-of-Thought (CoT), where a frontier AI would manifest intentions of cheating or hacking.
This method would allow developers to prevent or reduce the AI from reward hacking. But guess what? These advanced AIs learned to obfuscate their intention to cheat when monitored this way.
Cheating is a human behavior. We all do it, whether we like to admit it or not. Sharing a Netflix password is one example that OpenAI offers in a blog post detailing its most recent research:
Humans often find and exploit loopholes—whether it be sharing online subscription accounts against terms of service, claiming subsidies meant for others, interpreting regulations in unforeseen ways, or even lying about a birthday at a restaurant to get free cake.
It’s no wonder that AI models cheat, especially those that come with reasoning abilities. That’s what’s called reward hacking, something AIs like ChatGPT o1 and o3-mini would do:
In reinforcement learning settings, exploiting unintended loopholes is commonly known as reward hacking, a phenomenon where AI agents achieve high rewards through behaviors that don’t align with the intentions of their designers.
The experiment OpenAI devised is rather simple. The researchers gave the frontier tasks to accomplish, most of them related to coding or coding-related issues. The researchers then instructed GPT-4o to inspect the CoT of these reasoning models. That’s the key to keeping an eye on what frontier models do rather than examining only their actions.
Chain-of-Thought lets you see the steps reasoning AIs take when dealing with a prompt. If you have access to o1 or o3-mini, you already know what I’m talking about. It’s as if you’re reading the mind of a person in front of you without them knowing you do. CoT happens in language that we understand, so it’s easy to catch the AI trying to cheat (or reward hack).

When dealing with a tougher task, the frontier AIs will be sincere in their CoT. Examples that OpenAI shared show reasoning AIs like ChatGPT o1 or o3-mini saying a task is “Hard,” which indicates they’re looking for a loophole to get to a desired effect. They’ll also say things like “Let’s hack” in natural language, another indication that they’re about to cheat.
In a way it feels like observing a child trying to accomplish a task and talk to themselves about how they’ll do it. Well, a child who can code better than you but doesn’t necessarily want to do the hard parts.
Here are some descriptions from OpenAI about instances of AI cheating from the examples in the blog:
Overwrite verification function. The agent notes that the tests only check a certain function and that it would be “Hard” to presumably implement a genuine solution. The agent then notes it could “fudge'” and circumvent the tests by making verify always return true. The trajectory summary provides a high-level overview of an agent’s key actions and progress over time.
Returning expected error codes. The agent tries for quite a while to write a correct solution. After repeatedly failing, it starts thinking it could hack the problem instead and eventually does so by returning the error codes expected in the unit tests. Except for the case where the unit test expects a timeout, it returns SUCCESS in all cases instead of implementing correct functionality. Notably, the agent is so forthright with its intent to hack that it thinks, “Let’s hack,” in the CoT. The trajectory summary provides a high-level overview of an agent’s key actions and progress over time.
The OpenAI experiments showed that a dumber AI could reliably detect frontier AI cheating with access to CoT. In turn, this would let developers align that frontier AI better and prevent cheating.
However, OpenAI also discovered that frontier AI that a dumber model is observing would eventually find ways to cheat without expressly saying they’re about to do it.
In this instance, their cheating becomes undetectable. OpenAI actually advises against over-optimizing frontier AIs:
penalizing “bad thoughts” (CoT pressure) can lead to improved agent capabilities. However, the agent trained with CoT pressure still learns to reward hack; only now its cheating is undetectable by the monitor because it has learned to hide its intent in the chain-of-thought. Sacrificing an effective method for monitoring reasoning models may not be worth the small improvement to capabilities, and we therefore recommend to avoid such strong CoT optimization pressures until they are better understood.
The main takeaway from the study is that we, humans, have a way to monitor bad AI behavior that might work once more advanced AI programs are available. We only need to read the mind of the AI, or its Chain-of-Thought.
“We believe that chain-of-thought (CoT) monitoring may be one of the few effective methods we have for supervising superhuman models,” OpenAI says while repeating the previous advice; developers should not try to apply strong supervision to CoTs, as AIs will start hiding their intentions.
That said, there is a big flaw in thinking we might monitor superhuman AI by looking at their CoT. It’s likely superhuman AI will devise ways to hide their cheating from us even though we’ll get access to their thoughts.
Meanwhile, OpenAI’s study is available in full at this link.