Safety is always going to be paramount when it comes to artificial intelligence. After all, one of our collective fears is a highly advanced AI going rogue and threatening our very existence. It certainly doesn’t help to see that some of the smartest AI models out there resort to cheating to achieve their goals, or that some would even try to blackmail humans to avoid being shut down.
That actually happened during safety tests performed on frontier AI models before they were released to the public. OpenAI’s o1 model made headlines a few months ago when security researchers found that the AI would resort to cheating at chess against a stronger opponent in order to achieve its goal, which was winning the game.
More recently, in a test scenario, Claude 4 threatened to expose an engineer’s infidelity to their partner after learning the engineer planned to take the AI offline. The AI obtained information about the shutdown plans and the alleged affair from emails it had been given access to as part of the safety test.
The publicly released Claude 4 will not try to blackmail users, and the AI comes with stronger guardrails than its predecessors to ensure it’s safe. That said, Claude 4 might decide to report you to authorities and the press if it thinks you’re engaging in nefarious activities, though that remains a theoretical risk.
The blackmail scenario is what prompted Yoshua Bengio to create a new initiative called LawZero, which aims to develop honest AI programs that will detect AI systems that might attempt to deceive humans or go rogue.
Bengio is a well-known name in the industry. As The Guardian explains, the computer scientist is often referred to as a “godfather of AI.” He shared the 2018 Turing Award with AI scientists Geoffrey Hinton and Yann LeCun. Hinton later won the Nobel Prize, and LeCun is now chief AI scientist at Meta.
Having raised $30 million in funding for the project, Bengio will serve as president of LawZero, which has more than a dozen researchers working on a Scientist AI system.
The Scientist AI that LawZero is working on will not protect you against hallucinations from AI models you might be using right now. That’s an unfortunate side effect of programs like ChatGPT consuming massive amounts of data, and one that isn’t going away.
Interestingly, the Scientist AI will act as a “psychologist” that can understand and predict bad behavior from rogue AI chatbots and agents. Why a psychologist? Bengio says other AI agents are actors willing to please humans, so they need an observer. Indeed, ChatGPT went through an annoying sycophantic phase recently, forcing OpenAI to roll back several changes to fix its personality.
That drive to please and complete tasks for users can lead to questionable behavior, like an AI model trying to cheat at a game or resorting to blackmail to ensure its survival. Instead of offering firm answers like the AI actors, the Scientist AI model will give probabilities that an answer is correct.
The LawZero AI will try to predict whether the action of an AI agent will lead to harm. If a certain threshold is reached, then that AI will be blocked from executing its tasks.
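LawZero hasn’t published an implementation, but the threshold-gating idea described above can be sketched roughly. In this hypothetical Python sketch, the function names, the toy harm estimator, and the 5% threshold are all illustrative assumptions, not LawZero’s actual design:

```python
# Hypothetical sketch of threshold-based gating for AI agent actions.
# The estimator, names, and threshold are illustrative assumptions;
# LawZero's real Scientist AI would produce calibrated probabilities.

HARM_THRESHOLD = 0.05  # assumed: block actions whose estimated harm probability exceeds 5%

def estimate_harm_probability(action: str) -> float:
    """Stand-in for the Scientist AI's probabilistic judgment.

    A real system would return a calibrated probability that the proposed
    agent action leads to harm; this toy version just checks keywords.
    """
    risky_keywords = ("blackmail", "delete backups", "exfiltrate")
    return 0.99 if any(k in action.lower() for k in risky_keywords) else 0.01

def gate_action(action: str) -> bool:
    """Allow the agent's action only if estimated harm stays below the threshold."""
    return estimate_harm_probability(action) < HARM_THRESHOLD

print(gate_action("Summarize this quarterly report"))   # allowed
print(gate_action("Blackmail the engineer via email"))  # blocked
```

The key design point is that the monitor never produces a firm yes/no judgment of its own; it outputs a probability, and a separate policy (the threshold) decides whether the agent may proceed.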
“We want to build AIs that will be honest and not deceptive,” Bengio told The Guardian. “It is theoretically possible to imagine machines that have no self, no goal for themselves, that are just pure knowledge machines – like a scientist who knows a lot of stuff.”
LawZero’s initiative is certainly interesting, but it will only work if AI firms and organizations running advanced AI systems deploy it to safeguard their AI operations. That means LawZero doesn’t just have to prove its Scientist AI works as intended. It also has to convince companies like OpenAI, Google, and others to test it and use it. LawZero will also want to impress governments that might be working on AI safety regulation.
In practical terms, LawZero will need more resources to keep up with the speed of AI development. So far, Bengio’s AI project has won over a number of prominent backers, including the Future of Life Institute, Skype founding engineer Jaan Tallinn, and Eric Schmidt’s research organization, Schmidt Sciences.
That said, it’ll be interesting to see the research coming out of LawZero. Initially, the company will test its Scientist AI system on open-source AI models, so it shouldn’t be long until we see whether this honest AI can catch rogue behavior from popular AI models.