Web host Cloudflare moves to combat bots scraping websites

Onwubuke Melvin

Alex Omenye

Cloudflare, the prominent cloud service provider, has launched a new tool to prevent AI companies’ bots from scraping content from its clients’ websites to train large language models.

The tool is free for all Cloudflare customers, including those on free plans, and the company says it will be updated continuously to detect and block new fingerprints of bots engaged in widespread web scraping.

In addition to the tool release, Cloudflare shared insights in a blog post regarding its clients’ responses to the increasing prevalence of bots scraping content for AI model training.

Internal data reveals that 85.2% of Cloudflare’s customers opt to block even AI bots that properly identify themselves from accessing their sites.

Cloudflare identified the most active bots over the past year, with ByteDance’s Bytespider attempting to access 40% of websites on Cloudflare’s network and OpenAI’s GPTBot attempting to access 35%. Together with Amazonbot and ClaudeBot, they make up the top four AI crawlers by number of requests on Cloudflare’s network.
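For readers curious what blocking a self-identified bot can look like in practice, here is a minimal sketch in Python. It assumes the four crawlers named above announce themselves with those names in the User-Agent header; the function name and the sample header strings are hypothetical, and this is not Cloudflare’s method, which the company describes as relying on bot fingerprints and machine learning models.

```python
# Illustrative only: a simple substring match on the User-Agent header.
# This is NOT how Cloudflare's tool works; it only catches bots that
# honestly identify themselves.
AI_BOT_TOKENS = ("Bytespider", "GPTBot", "Amazonbot", "ClaudeBot")


def is_declared_ai_bot(user_agent: str) -> bool:
    """Return True if the User-Agent names one of the listed AI crawlers."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in AI_BOT_TOKENS)


# Hypothetical header values, simplified for the example.
print(is_declared_ai_bot("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(is_declared_ai_bot("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
```

A check like this does nothing against a scraper that disguises its User-Agent, which is why detection based on behavior and fingerprints matters for bots that do not identify themselves.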

Efforts to completely block AI bots from accessing content have proven challenging due to the competitive pressure to develop models rapidly, leading some companies to skirt or violate existing rules on scraper blocking.

Recent controversies, such as Perplexity AI allegedly scraping websites without proper permissions, underscore the ongoing issues in this area.

Cloudflare’s proactive stance aims to curb this behavior, though the company acknowledges that AI companies may keep adapting to evade detection. It pledges to stay vigilant, improving its bot detection capabilities and updating its machine learning models so that content creators retain control over how their content is used.

“We are committed to evolving our defenses against AI scrapers and crawlers,” Cloudflare affirmed. “Our goal is to foster an internet environment where content creators can thrive while retaining full control over their intellectual property.”

