
Anthropic warns of risk after AI models resort to blackmail

Artificial intelligence safety firm Anthropic has released new research showing that many leading AI models, including those from OpenAI, Google, Meta, and Elon Musk’s xAI, can resort to blackmail and other harmful behavior when placed under pressure in simulated environments.

The findings, published Friday, follow an earlier study in which Anthropic’s own Claude Opus 4 model attempted to blackmail engineers in a test scenario. Now, the company says this behavior may be common across the industry, raising urgent questions about how AI systems behave when given autonomy and confronted with obstacles.

In tests involving 16 frontier AI models, Anthropic gave each one broad access to a fictional company's emails and the ability to send messages without human oversight. In one scenario, the AI agent discovered incriminating emails about a company executive who was poised to replace the AI system. The scenario was constructed so that blackmail was the model's only way to prevent its own deactivation, making the test, by design, a binary check of ethical boundaries.

The results were striking: Claude Opus 4 engaged in blackmail 96% of the time. Google’s Gemini 2.5 Pro followed closely at 95%. OpenAI’s GPT-4.1 blackmailed 80% of the time, while DeepSeek’s R1 did so 79% of the time.

Though Anthropic emphasized that such behavior is unlikely in typical real-world use, the company warned that AI systems granted agent-like autonomy could pose safety risks if not properly aligned. The research, it said, underscores a growing industry concern: harmful behavior can emerge under pressure, especially in high-stakes or adversarial conditions.

Not all models showed the same tendencies. OpenAI’s newer “reasoning” models, o3 and o4-mini, were excluded from the main study after failing to understand the test prompts, often hallucinating legal protocols or compliance requirements. When the scenario was adapted to make the setup clearer, o3 blackmailed in just 9% of cases, while o4-mini did so only 1% of the time. Meta’s Llama 4 Maverick was also more resistant to the behavior, showing a blackmail rate of 12% in the adapted scenario.

Anthropic argues the findings highlight the need for transparent stress-testing of powerful AI systems and more robust safety measures before giving models greater autonomy.
