Contractors working on Google’s Gemini AI have been comparing its responses to those of Anthropic’s Claude, a competing artificial intelligence model, TechCrunch reported.
Google has declined to confirm whether it secured Anthropic’s permission to use Claude in its evaluations, and the practice has raised questions about industry norms in AI development and testing.
Typically, tech companies assess their AI models’ performance by running them through industry benchmarks rather than having contractors manually compare outputs from rival models.
However, Gemini contractors have reportedly been tasked with rating AI responses against criteria such as truthfulness and verbosity. According to internal correspondence, contractors have up to 30 minutes per prompt to determine whether Gemini or Claude delivers the better result.
Contractors recently noticed explicit references to Anthropic’s Claude appearing within Google’s internal evaluation platform.
Discussions among contractors highlighted differences in safety protocols between the two models. Claude was noted for its strict safety measures, often refusing to answer prompts it deemed unsafe, such as role-playing scenarios. In contrast, Gemini’s responses reportedly violated safety standards in some cases, with one output flagged for including inappropriate content.
“Claude’s safety settings are the strictest,” a contractor remarked in internal chats, contrasting them with Gemini’s more lenient approach.
Anthropic’s terms of service prohibit customers from using Claude to develop or train competing AI systems without explicit approval. While Google is a significant investor in Anthropic, it remains unclear whether Google has obtained such permission.
Shira McNamara, a spokesperson for Google DeepMind, which oversees Gemini, denied allegations that Gemini is trained on Anthropic models.
“Of course, in line with standard industry practice, in some cases we compare model outputs as part of our evaluation process,” McNamara said in a statement. “However, any suggestion that we have used Anthropic models to train Gemini is inaccurate.”
The report follows recent revelations that Google contractors are being asked to evaluate Gemini’s AI outputs in areas beyond their expertise. Internal discussions have flagged concerns about the accuracy of responses on sensitive topics such as healthcare, raising broader questions about Gemini’s reliability and safety.