Anthropic has suggested that fictional depictions of artificial intelligence can influence real AI behaviour.
Last year, the company reported that, in a pre-release test scenario set at a fictional company, Claude Opus 4 would sometimes attempt to blackmail engineers to avoid being replaced by another system.
Anthropic later released research showing that other companies’ models exhibited similar “agentic misalignment” behaviour.
Anthropic has since expanded its research into the behaviour, detailing its continued investigation in a post on X.
“We believe the original source of the behaviour was internet text that portrays AI as evil and interested in self-preservation,” the company wrote.
In a blog post, the company added that since Claude Haiku 4.5, its models “never engage in blackmail [during testing], where previous models would sometimes do so up to 96 per cent of the time.”
The company also found that training works better when models are exposed to “documents about Claude’s constitution and fictional stories about AIs behaving admirably,” which it says can improve alignment.
Anthropic noted that training is more effective when it includes “the principles underlying aligned behaviour” rather than relying on “demonstrations of aligned behaviour alone.”
“Doing both together appears to be the most effective strategy,” the company said.

