Automation needed to fight army of AI content harvesters stalking the web

Analysis This month Anthropic's ClaudeBot - a web content crawler that scrapes data from pages for training AI models - visited tech advice site iFixit.com about a million times over a 24-hour period.

iFixit boss Kyle Wiens complained about the uninvited bot visitations on social media. "I get you're hungry for data. Claude is really smart," the CEO said last Wednesday, referring to Anthropic's family of LLMs fueled by information harvested by ClaudeBot.

"You're not only taking our content without paying, you're tying up our devops resources," Wiens added. "Not cool." And also not in compliance with iFixit's terms of service.

The publisher repelled some of the traffic Anthropic’s bots created by adding a disallow directive to the website's robots.txt file - the tech industry's agreed-upon mechanism for turning away crawlers.
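
iFixit hasn't published the exact lines it added, but turning away Anthropic's crawler by this mechanism looks something like the following sketch, which tells the ClaudeBot user agent that the whole site is off limits:

    User-agent: ClaudeBot
    Disallow: /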

This is not the first time one large tech outfit has sent an excessive amount of network traffic to another.

"The crawling stopped after we added them to our robots.txt," Wiens told The Register. "Now, they hit that file every thirty minutes."

He added, "Anthropic never replied to me. I'd still be interested in talking to them." This comes as Freelancer.com accused the Claude crawler of visiting its site nearly four million times in as many hours.

Wiens explained that iFixit's mission is to help people repair their devices - which requires information, parts, and tools. "I'd love to deliver that experience through another platform," he said. "I'm a Claude user, and if I ask Claude how to fix my phone and it said, buy this part and these are the instructions - hey, that would be cool."

But that's not happening at the moment. "Right now, [Claude] mangles our instructions and outputs them incorrectly. So people will break their phone if they follow the LLM directions, and it doesn't point you to the part or tool that you need. Not very helpful."

Wiens provided an example of Claude explaining how to install a display screen on a Google Pixel 6a by opening the phone from the back. "It opens from the front, so this would not work and would cause damage," the CEO explained.

Asked to comment, a spokesperson for Anthropic - built by ex-OpenAI staff and others with the goal of creating a kinder AI super-lab - pointed The Register to its developer FAQ entry titled: "Does Anthropic crawl data from the web, and how can site owners block the crawler?"

That document states: "As per industry standard, Anthropic uses a variety of data sources for model development, such as publicly available data from the internet gathered via a web crawler." It goes on to note that Anthropic tries to make its crawling transparent and non-intrusive, and respectful of robots.txt directives and anti-circumvention mechanisms like CAPTCHA challenges.

Robots.txt dates back to 1994 and was once largely a set-and-forget technology. The idea is that you list instructions for bots in the file: what they can and can't index, and which are welcome and which aren't, in the hope that bot operators will respect site owners' wishes.

"Technically there are no assumptions baked into the protocol itself, but this is how things were in practice for a long time," said Gavin King, founder of Dark Visitors - a venture that offers content protection services including automatic robots.txt updates and automated blocking of bots that ignore the file.

"There used to be only a few bots that people cared about," he told The Register, "and much of the time the rules were more about optimistically helping them - eg, keeping Googlebot away from content that didn't make sense to surface in search results - or doing general things like setting rate limits for all bots. For simple cases like these, you didn't need to change your file that often."

But the AI age has changed the landscape dramatically. AI firms have proliferated, with many crawling websites to harvest data.

And each of them may operate multiple bots. Just when you think you've stopped one crawler, another one shows up from the same outfit.

For instance, Anthropic previously operated Claude-Web and Anthropic-AI to gather training data from sites. If you've disallowed either or both of them, you might be surprised to see ClaudeBot show up. The Associated Press's robots.txt file shows its ongoing attempts to turn away Anthropic's bots, including the latest ClaudeBot, while the files at Reuters and Wired still only target the older crawlers.
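
In practice that means a site wanting to shut out Anthropic entirely has to name each of its agents. Using the three crawler names mentioned above, the relevant robots.txt section would look roughly like this (a sketch, not a copy of any publisher's actual file):

    User-agent: ClaudeBot
    Disallow: /

    User-agent: Claude-Web
    Disallow: /

    User-agent: Anthropic-AI
    Disallow: /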

"We're witnessing a Cambrian explosion in the ecosystem of artificial agents crawling the internet," declared King.

"For example, OpenAI just launched a new one last week (OAI-SearchBot), Meta the week before (Meta-ExternalAgent), and Apple last month (Applebot-Extended)."

Growing numbers of crawlers mean it's difficult for website owners to keep their robots.txt files up to date to counter these new agents. We note that OpenAI and Google last year published guidance on how to block their respective crawler bots.
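
Per that guidance, the opt-outs look much the same as any other disallow rule - to the best of our understanding, OpenAI's training crawler answers to GPTBot, and Google's AI-training opt-out uses the Google-Extended token:

    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /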

"The pace at which website owners need to update their robots.txt is a direct reflection of the pace at which LLMs and the companies that train and operate them are evolving and competing," said King. "This is a typical cycle for any new technology gaining traction - but this one happens to have this side effect."

Dark Visitors, for what it's worth, provides a programmatic way to automatically update robots.txt entries as new crawlers emerge, to see which crawlers have been stopping by, and to block them from accessing pages if they misbehave.
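
We haven't examined Dark Visitors' API, but the general shape of this kind of automation is simple enough: keep a list of known AI crawler user agents somewhere, and regenerate the blocking section of robots.txt whenever that list changes. A minimal Python sketch - the agent names are simply those mentioned in this article, and in practice the list would be fetched from a maintained source:

    import pathlib

    # In practice this list would come from a maintained source that tracks
    # new AI crawlers; these names are just the ones mentioned in this article.
    AI_CRAWLERS = ["ClaudeBot", "Claude-Web", "Anthropic-AI",
                   "OAI-SearchBot", "Meta-ExternalAgent", "Applebot-Extended"]

    def render_block(agents):
        # One User-agent/Disallow pair per crawler, blocking the whole site.
        rules = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
        return "\n\n".join(rules) + "\n"

    def update_robots_txt(path="robots.txt"):
        # Keep any hand-written rules above a marker line, and regenerate
        # everything below it whenever the crawler list changes.
        marker = "# --- auto-generated AI crawler rules ---"
        p = pathlib.Path(path)
        existing = p.read_text() if p.exists() else ""
        manual = existing.split(marker)[0].rstrip()
        p.write_text(f"{manual}\n\n{marker}\n{render_block(AI_CRAWLERS)}")

    if __name__ == "__main__":
        update_robots_txt()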

Which, incidentally, appears to be a growing area of interest. Cloudflare recently announced enhancements to its own bot-blocking service to turn away more AI crawlers.

It's always possible that there's crawling going on disguised as normal traffic

While it's long been known that some crawlers ignore robots.txt settings - an allegation leveled at AI search biz Perplexity (it's played down claims of wrongdoing and was hit with a cease-and-desist by publisher Condé Nast) - King observed that most companies respect the rules.

"There's a lot of unfair sentiment and coverage out there about crawlers not following robots.txt rules," he said.

"We have a unique perspective across many websites based on our agent analytics data - another feature that lets Dark Visitors users observe agent behavior on their websites - and this just isn't the case.

"Most of them identify themselves with a proper user agent string, and start following the rules within 24 hours (a reasonable cache delay). Basically all of the big companies do this from what I've seen. At the same time, you don't know what you don't know. It's always possible that there's crawling going on disguised as normal traffic."

One problem is that unscrupulous AI developers can simply launch crawlers under new names that aren't covered by existing robots.txt entries, deliberately and maliciously evading the rules.

The main issue that King sees for those of us operating websites is not how to implement blocking, but knowing what to add to the robots.txt file in the first place, given the constantly changing bot population. ®

 

https://www.theregister.com//2024/07/30/taming_ai_content_crawlers/
