Blocking AI Data Scrapers in Backlight

Our next version of Backlight (v5.4.1) will include updates to our distributed robots.txt file to block AI web crawlers from indexing your site. I’ve already updated our robots.txt documentation with new information about it and links to external resources.

But as implementing the robots.txt on your site is a manual task anyway, there’s no need to wait for the update. You can apply the new rules immediately by adding all of the following to a robots.txt file at the root of your site.

This list is a live document that I will keep up-to-date. It focuses on blocking AI data scrapers, and blocks all of those that I am aware of.

User-agent: *
Disallow: /backlight
Disallow: /*/thumbnails/*.jpg$
Disallow: /*/single.php

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: CLAUDEBOT
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: omgili
Disallow: /

User-agent: Timpibot
Disallow: /
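If you want to sanity-check the rules before deploying them, Python’s standard-library robots.txt parser can evaluate them offline. A caveat: urllib.robotparser does not understand the wildcard patterns (* and $) used in the catch-all block, so this sketch tests only the plain-prefix rules; the paths used below are made-up examples.

```python
from urllib.robotparser import RobotFileParser

# A representative subset of the rules above (wildcard lines omitted,
# since the standard-library parser treats * and $ literally).
rules = """\
User-agent: *
Disallow: /backlight

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The named AI scrapers are blocked from the entire site...
print(parser.can_fetch("GPTBot", "/galleries/album/"))     # False
print(parser.can_fetch("CCBot", "/"))                      # False

# ...while other crawlers are blocked only from /backlight.
print(parser.can_fetch("Googlebot", "/galleries/album/"))  # True
print(parser.can_fetch("Googlebot", "/backlight/admin"))   # False
```

The same `can_fetch()` check works against a live site if you load the parser with your site’s robots.txt URL via `set_url()` and `read()`.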

Additionally, users may wish to block the following AI assistants. For now, these bots will not be included in future versions of Backlight’s distributed robots.txt file. On whether or not to block them, Dark Visitors says:

Probably not. AI assistants visit websites directly on behalf of human users, so blocking them will effectively block those users. This could lead to a poor user experience and possible negative sentiment about your website. Not blocking AI assistants will allow more human users to use your website as they choose.

User-agent: ChatGPT-User
Disallow: /

User-agent: Meta-ExternalFetcher
Disallow: /
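Remember that robots.txt is purely advisory: well-behaved bots fetch it and honor the rules, but the server never refuses anyone. So the one thing worth verifying is that the file is actually reachable at the site root. Here’s a rough local sketch of that check (run it in a scratch directory; the port and the one-rule file are arbitrary stand-ins for your real site):

```shell
# Write a minimal rule set and serve the directory locally with
# Python's built-in web server, standing in for your real site.
printf 'User-agent: GPTBot\nDisallow: /\n' > robots.txt

python3 -m http.server 8123 --bind 127.0.0.1 >/dev/null 2>&1 &
SERVER_PID=$!
sleep 1

# Fetch the file the way a crawler would; the rules should come
# back verbatim from the site root.
FETCHED=$(curl -s http://127.0.0.1:8123/robots.txt)
echo "$FETCHED"

kill "$SERVER_PID"
```

Against a live site, point curl at `https://your-domain/robots.txt` instead; anything other than the file coming back (a 404, a redirect to an HTML page) means crawlers won’t find your rules.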

With Apple now getting into AI, I am adding Applebot to the robots.txt file. As implementing this is a manual process, folks can remove it if they want to. The full snippet above is now updated to include:

User-agent: Applebot
Disallow: /

Applebot was publicly acknowledged in 2015, and at the time was said to be used primarily for Siri and Spotlight results. It is highly likely the bot will now also be used to create training data sets for AI.


@Matthew, you might want to use User-agent: Applebot-Extended:

With Applebot-Extended, web publishers can choose to opt out of their website content being used to train Apple’s foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools.


Updated.

Replaced:

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

With:

User-agent: CLAUDEBOT
Disallow: /

According to reporting by 404 Media, the new CLAUDEBOT will respect block requests aimed at its two older, deprecated crawlers, ANTHROPIC-AI and CLAUDE-WEB.

Also, adding:

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

And, for reference and future updates, here’s a site that attempts to track various bots and scrapers. We should check in there every so often and update accordingly.

Dark Visitors


Based on information from Dark Visitors, removed:

User-agent: Applebot
Disallow: /

User-agent: ChatGPT-User
Disallow: /