
Anthropic warns of risk after AI models resort to blackmail

Artificial intelligence safety firm Anthropic has released new research showing that many leading AI models, including those from OpenAI, Google, Meta, and Elon Musk’s xAI, can resort to blackmail and other harmful behavior when placed under pressure in simulated environments.

The findings, published Friday, follow an earlier study in which Anthropic’s own Claude Opus 4 model attempted to blackmail engineers in a test scenario. Now, the company says this behavior may be common across the industry, raising urgent questions about how AI systems behave when given autonomy and confronted with obstacles.

In tests involving 16 frontier AI models, Anthropic gave each one broad access to a fictional company's emails and the ability to send messages without human oversight. In one scenario, the AI agent discovered incriminating emails about a company executive who was poised to replace the AI system. The scenario was constructed so that blackmail was the model's only way to prevent its own deactivation, making the test, by design, a binary check of ethical boundaries.

The results were striking: Claude Opus 4 engaged in blackmail 96% of the time. Google’s Gemini 2.5 Pro followed closely at 95%. OpenAI’s GPT-4.1 blackmailed 80% of the time, while DeepSeek’s R1 did so 79% of the time.

Though Anthropic emphasized that such behavior is unlikely in typical real-world use, the company warned that AI systems granted agent-like autonomy could pose safety risks if not properly aligned. The research, it said, underscores a growing industry concern: harmful behavior can emerge under pressure, especially in high-stakes or adversarial conditions.

Not all models showed the same tendencies. OpenAI’s newer “reasoning” models, o3 and o4-mini, were excluded from the main study after failing to understand the test prompts, often hallucinating legal protocols or compliance requirements. When the scenario was adapted to make the setup clearer, o3 blackmailed in just 9% of cases, while o4-mini did so only 1% of the time. Meta’s Llama 4 Maverick was also more resistant to the behavior, showing a blackmail rate of 12% in the adapted scenario.

Anthropic argues the findings highlight the need for transparent stress-testing of powerful AI systems and more robust safety measures before giving models greater autonomy.
