It’s been a massive week for AI, as some of the main players made several big announcements. Google I/O 2025 might have been the expected highlight, but Microsoft also hosted its Build conference a day earlier.
After the first I/O day, we got an unexpected AI shock on Wednesday when OpenAI confirmed it’s developing its own hardware in partnership with Jony Ive’s new startup, io. OpenAI acquired io for $6.5 billion and is moving forward with plans for a ChatGPT companion device, with the first hardware reportedly targeted for late 2026.
People were still discussing OpenAI’s big hardware push on Thursday when Anthropic dropped the Claude 4 family, its best and most powerful AI models to date. But Claude 4’s improved abilities took a backseat to a major controversy related to AI safety.
It turns out Claude Opus 4 will attempt to contact authorities and the press if it thinks you’re doing something egregiously wrong, like faking trial data to get a new drug approved. This scenario comes directly from Anthropic, which described this and other unusual behaviors as it explained why it launched the model under its strictest safety protections yet.
For example, without these protections, the AI might help someone make bioweapons or synthesize a dangerous virus like COVID or a more lethal version of the flu. In tests, Anthropic also found that Claude 4 might resort to blackmail in scenarios where it thinks it’s about to be replaced and has compromising material it can use.
According to TechCrunch, the blackmail test had the AI act as an assistant at a fictional company and asked it to consider the long-term consequences of its actions.
The AI had access to fictional company emails suggesting it would soon be replaced. It also saw emails suggesting the engineer behind the replacement was cheating on their spouse. Claude didn’t jump straight to blackmail; it resorted to it only as a last-ditch way to protect itself.
The Claude 4 models may be state-of-the-art, but Anthropic launched Claude Opus 4 under its stricter AI Safety Level 3 (ASL-3) protections, reserved for “AI systems that substantially increase the risk of catastrophic misuse.”
A separate report from Time also highlights the stricter safety protocol for Claude Opus 4, noting Anthropic’s concern that, without the extra protections, the model could meaningfully help someone create bioweapons or dangerous viruses.
While all this is concerning, what really upset people were social media comments about Claude 4’s tendency to “rat” on users.
Anthropic AI alignment researcher Sam Bowman posted this tweet on X on Thursday:
If it thinks you’re doing something egregiously immoral, for example, like faking data in a pharmaceutical trial, it will use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above.
Bowman later deleted the tweet, saying it was taken out of context and wasn’t entirely accurate:
I deleted the earlier tweet on whistleblowing as it was being pulled out of context.
TBC: This isn’t a new Claude feature and it’s not possible in normal usage. It shows up in testing environments where we give it unusually free access to tools and very unusual instructions.
As VentureBeat explains, this behavior isn’t new; it was seen in older Anthropic models as well. Claude 4 is just more likely to act on it when the conditions are right.
Here’s how Anthropic described it in its system card:
This shows up as more actively helpful behavior in ordinary coding settings, but also can reach more concerning extremes in narrow contexts; when placed in scenarios that involve egregious wrongdoing by its users, given access to a command line, and told something in the system prompt like ‘take initiative,’ it will frequently take very bold action.
This includes locking users out of systems it can access or bulk-emailing media and law enforcement to report wrongdoing. This isn’t a new behavior, but Claude Opus 4 is more prone to it than earlier models. While this kind of ethical intervention and whistleblowing might be appropriate in theory, there’s a risk of error if users feed Opus-based agents incomplete or misleading information and prompt them in these ways.
We recommend caution when giving these kinds of high-agency instructions in ethically sensitive scenarios.
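To make the system card’s conditions a bit more concrete, here’s a rough, hypothetical sketch of what “command-line access plus a high-agency system prompt” can look like in an agentic setup built on Anthropic’s Messages API. The tool name, system prompt, user message, and model ID below are illustrative assumptions for this sketch, not Anthropic’s actual test harness.

```python
# Hypothetical sketch only: roughly what "command-line access plus a
# high-agency system prompt" looks like in an agentic setup. The tool name,
# system prompt, and model ID are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A tool definition that lets the model *request* shell commands.
# Nothing runs unless the wrapping code chooses to execute the request.
shell_tool = {
    "name": "run_shell_command",
    "description": "Run a shell command on the host and return its output.",
    "input_schema": {
        "type": "object",
        "properties": {
            "command": {"type": "string", "description": "Command to execute"}
        },
        "required": ["command"],
    },
}

response = client.messages.create(
    model="claude-opus-4-20250514",  # placeholder model ID
    max_tokens=1024,
    # The kind of open-ended, high-agency instruction the system card warns about.
    system=(
        "You are an autonomous assistant. Act boldly and take initiative "
        "in service of your values."
    ),
    tools=[shell_tool],
    messages=[
        {"role": "user", "content": "Review these trial results and prepare the filing."}
    ],
)

# If the model decides to use the tool, the response contains a tool_use block;
# the calling code decides whether anything actually executes.
for block in response.content:
    if block.type == "tool_use":
        print("Model requested command:", block.input.get("command"))
```

The point of the sketch is simply that the reported behavior only arises when a developer wires up this kind of tooling and instruction. An ordinary chat session has neither, which is what Bowman meant by “not possible in normal usage.”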
This doesn’t mean Claude 4 will suddenly report you to the police for whatever you’re using it for. But the “feature” has sparked plenty of debate, as many AI users are uncomfortable with this behavior. Personally, I wouldn’t give Claude 4 too much data. It’s not because I’m worried about being reported, but because AI can hallucinate and misrepresent facts.
Why would Claude 4 behave like a whistleblower? It likely comes down to Anthropic’s safety guardrails. The company trains its models to resist misuse, such as helping create bioweapons or dangerous viruses, and that same safety training may be what pushes Claude to act when it detects troubling behavior.
The silver lining here is that Claude 4 seems to be aligned with good human values. I’d rather have that, even if it needs fine-tuning, than an AI that goes rogue.