Block or allow AI crawlers? GPTBot, ClaudeBot, and the Cloudflare default-block decision

A 2026 operator's playbook for separating training crawlers you should block from retrieval bots that keep you citable.

June 15, 2026 · 16 min read
GPTBot · ClaudeBot · PerplexityBot

The most consequential file on your website right now is a plain text document most engineers last edited years ago. In mid-2026, the lines in your robots.txt decide whether ChatGPT, Claude, and Perplexity cite your pages, whether your content feeds their next training run, and whether you collect a check from Cloudflare's pay-per-crawl program.

The old default of "allow everything" is dead. So is the reflexive "block everything" that a lot of publishers reached for in 2024. The right posture in 2026 is a split: block the crawlers that ingest your pages into model weights, and allow the ones that surface your pages to live users.

This guide maps the current bot inventory, the enforcement reality (robots.txt gets ignored more than you think), Cloudflare's default-block pivot, and a deployable robots.txt plus edge-rule recipe.

TL;DR: AI crawlers fall into three jobs: training ingestion, search/retrieval indexing, and on-demand user fetches. Block training crawlers like GPTBot and ClaudeBot to control model use. Allow retrieval and user-action bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot, and the -User fetchers) to stay citable in AI answers. Then enforce it at the edge, because roughly 30% of AI bot requests now ignore robots.txt.

What is AI crawler control, in one sentence?

AI crawler control is the practice of using robots.txt directives and edge rules to separately permit or deny each AI bot based on whether it ingests your content for training, indexes it for real-time retrieval, or fetches it on a user's behalf.

The reason this matters is that the three jobs have opposite economics. Training crawlers consume your pages and never send a click back. Retrieval and user-action bots can put your URL in front of someone who clicks through.

Key takeaways

  • Block training, allow retrieval. GPTBot, ClaudeBot, CCBot, and Bytespider ingest for training. OAI-SearchBot, Claude-SearchBot, and PerplexityBot index you for citation.
  • Vendors keep splitting their bots. Anthropic split into three crawlers on February 20, 2026; OpenAI added OAI-AdsBot on April 23, 2026. Stale robots.txt files miss the new tokens.
  • robots.txt is a declaration, not a fence. TollBit measured ~30% non-compliance by Q4 2025; ChatGPT-User hit 42% on sites that blocked it.
  • Cloudflare changed the default. Since July 1, 2025, new Cloudflare domains block identified AI crawlers unless you opt in, and pay-per-crawl lets you charge per request.
  • Blocking can cost traffic at scale. Top-30 publishers that blocked AI bots saw a ~23% total traffic drop in one study; mid-sized sites saw no measurable effect.

Which AI crawlers exist, and what does each one do?

There are now roughly a dozen named AI crawlers from four major vendors, each with a distinct user-agent string and a distinct purpose. Sorting them by what they do with your pages is the whole game.

Per OpenAI's crawler documentation, GPTBot collects data "which may be used to train OpenAI generative AI foundation models," while OAI-SearchBot retrieves real-time information for ChatGPT's live search. Those two bots do opposite things to your traffic.

Here is the current inventory grouped by job.

Bot                      | Vendor       | Job                        | Default stance
-------------------------|--------------|----------------------------|-----------------------
GPTBot                   | OpenAI       | Training ingestion         | Block
ClaudeBot                | Anthropic    | Training ingestion         | Block
anthropic-ai, Claude-Web | Anthropic    | Legacy training/fetch      | Block
CCBot                    | Common Crawl | Training corpus (WARC)     | Block
Bytespider               | ByteDance    | Training (Doubao, Toutiao) | Block
Google-Extended          | Google       | Training control token     | Disallow to opt out
Applebot-Extended        | Apple        | Training control token     | Disallow to opt out
OAI-AdsBot               | OpenAI       | Ad landing-page checks     | Block (no training use)
OAI-SearchBot            | OpenAI       | Search/retrieval index     | Allow
Claude-SearchBot         | Anthropic    | Search/retrieval index     | Allow
PerplexityBot            | Perplexity   | Search/retrieval index     | Allow
ChatGPT-User             | OpenAI       | User-action fetch          | Allow
Claude-User              | Anthropic    | User-action fetch          | Allow
Perplexity-User          | Perplexity   | User-action fetch          | Allow

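One detail that trips people up: the user-agent token has to be spelled exactly as the vendor documents it. Under RFC 9309 a compliant parser matches the token case-insensitively, so gptbot and GPTBot select the same group, but a misspelling such as GPT-Bot or OAISearchBot matches nothing, and your directive silently does nothing. Path rules, by contrast, are case-sensitive.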
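One way to keep the tokens spelled consistently is to hold the inventory as data and generate the file from it. The sketch below is a minimal illustration, not a vendor tool: the `POLICY` mapping, the `render_robots` helper, and the `private_paths` parameter are all names invented here, and the stances simply mirror the table above.

```python
# Hypothetical policy table mirroring the inventory in this guide.
# The stance values ("block" / "allow") are this sketch's own convention.
POLICY = {
    "GPTBot": "block",
    "ClaudeBot": "block",
    "CCBot": "block",
    "Bytespider": "block",
    "OAI-SearchBot": "allow",
    "Claude-SearchBot": "allow",
    "PerplexityBot": "allow",
    "ChatGPT-User": "allow",
    "Claude-User": "allow",
    "Perplexity-User": "allow",
}


def render_robots(policy, private_paths=()):
    """Render one robots.txt group per bot from a {token: stance} mapping.

    Allowed bots also get a Disallow line for each private path, because a
    crawler follows only the group that names it and ignores the * group.
    """
    groups = []
    for token, stance in policy.items():
        if stance == "block":
            rules = ["Disallow: /"]
        elif stance == "allow":
            rules = [f"Disallow: {path}" for path in private_paths] + ["Allow: /"]
        else:
            raise ValueError(f"unknown stance for {token}: {stance!r}")
        groups.append("\n".join([f"User-agent: {token}", *rules]))
    return "\n\n".join(groups) + "\n"
```

Calling `render_robots(POLICY, private_paths=("/admin/",))` returns the text of the named groups; a default `User-agent: *` group would still need to be appended by hand.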

Training vs. retrieval vs. user-action

Training ingestion crawlers pull your pages in bulk to fold into the next weight update. They never come back with a user query. Allowing them makes your content part of the model's prior; blocking them has no effect on whether you appear in a live answer.

Retrieval crawlers index the web so the assistant can look you up at query time, the same way Googlebot works. Block these and you vanish from the real-time AI answer layer.

User-action fetchers are extensions of a person. When someone tells ChatGPT to summarize a URL, ChatGPT-User goes and gets that one page. OpenAI's December 2025 documentation update went so far as to describe ChatGPT-User as "a technical extension of the user" rather than a crawler.

The visibility math is blunt: block all three categories and your citation share in AI answers falls to roughly zero.

The bots keep multiplying

The market is converging on a three-bot split per vendor (training / search / user), and then adding ad and agent sub-bots on top. Anthropic introduced Claude-User and Claude-SearchBot on February 20, 2026, formalizing the same structure OpenAI uses. Anthropic warns that blocking Claude-SearchBot "may reduce your site's visibility and accuracy in user search results."

OpenAI added OAI-AdsBot on April 23, 2026 to validate ChatGPT ad landing pages, and it's the one OpenAI bot whose data is "explicitly excluded from foundation-model training." If your robots.txt predates these dates, it's already out of date.

Does robots.txt actually stop AI crawlers?

Not reliably on its own. By Q4 2025, roughly 30% of AI bot requests ignored robots.txt, up from 3.3% a year earlier, according to TollBit's State of the Bots tracking across about 400 publisher domains including AP, Hearst, and Penske.

The trend is the story. Non-compliance ran at 3.3% in Q4 2024, 12.9% in Q1 2025, and 13.26% in Q2 2025 before jumping to about 30% by Q4 2025, with ChatGPT-User the worst offender at 42% on sites that had explicitly disallowed it (Centinel Analytica, citing TollBit).

Chart: Share of AI bot requests ignoring robots.txt. Q4 2024: 3.3%; Q1 2025: 12.9%; Q2 2025: 13.26%; Q4 2025: 30%.

Meanwhile the volume exploded. TollBit reported the AI-bot-to-human-visit ratio went from 1:200 in Q1 2025 to 1:31 by Q4 2025, meaning one AI scrape for every 31 human page views.
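To put a comparable number on your own site, you can replay logged requests against your robots.txt and count the ones it disallowed. This is a rough sketch using Python's standard `urllib.robotparser`; the `noncompliance_share` function and its `(user_agent_token, path)` input format are assumptions made for illustration, and TollBit's exact methodology may differ.

```python
from urllib.robotparser import RobotFileParser


def noncompliance_share(robots_txt, requests):
    """Share of logged requests that robots.txt disallowed for that agent.

    `requests` is a list of (user_agent_token, path) pairs from your logs.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not requests:
        return 0.0
    # A request counts as a violation when the parser says the agent
    # was not permitted to fetch that path.
    violations = sum(
        1 for agent, path in requests if not parser.can_fetch(agent, path)
    )
    return violations / len(requests)
```

Only requests made after the robots.txt rules went live should be fed in, since a crawler cannot violate a rule that did not exist yet.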

The crawl-to-referral gap

The flip side is what you get back. Cloudflare's Radar data for Q1 2026 shows brutal crawl-to-referral ratios: ClaudeBot crawled roughly 20,583 pages for every referral it sent, and GPTBot about 1,255 to 1. Meta's training agent sent zero referrals.

Chart: Pages crawled per 1 referral sent back (Cloudflare Radar, Q1 2026). ClaudeBot: 20,583:1; GPTBot: 1,255:1; Perplexity: 88:1; DuckDuckGo: 1.5:1.

This is the core asymmetry. Training crawlers take a lot and give back nothing, which is the entire argument for blocking them. Retrieval bots like Perplexity (about 88:1 in one vertical) at least put your name in front of a reader.

Cloudflare also reports that 39% of the top-1M sites are hit by AI bots, yet only 2.98% block them in robots.txt. Most of the web hasn't updated its config at all.
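The same crawl-to-referral ratio can be estimated from your own logs. The sketch below assumes each log record has been reduced to a `(user_agent, referer)` pair; the `VENDORS` mapping, including the referrer hostnames, is an illustrative guess that you would replace with what actually appears in your traffic.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumed mapping for illustration; confirm the referrer hosts in your logs.
VENDORS = {
    "OpenAI": {"bots": ("GPTBot", "OAI-SearchBot"), "referrers": ("chatgpt.com",)},
    "Anthropic": {"bots": ("ClaudeBot", "Claude-SearchBot"), "referrers": ("claude.ai",)},
    "Perplexity": {"bots": ("PerplexityBot",), "referrers": ("perplexity.ai", "www.perplexity.ai")},
}


def crawl_to_referral(records):
    """Return {vendor: pages crawled per referral} for vendors that crawled."""
    crawls, referrals = Counter(), Counter()
    for user_agent, referer in records:
        ua = user_agent.lower()
        bot_vendor = next(
            (v for v, spec in VENDORS.items()
             if any(bot.lower() in ua for bot in spec["bots"])),
            None,
        )
        if bot_vendor:
            crawls[bot_vendor] += 1
            continue
        # Not a declared bot: treat it as a human visit and check the referrer.
        host = urlparse(referer).hostname or ""
        for vendor, spec in VENDORS.items():
            if host in spec["referrers"]:
                referrals[vendor] += 1
    return {
        v: (crawls[v] / referrals[v] if referrals[v] else float("inf"))
        for v in VENDORS
        if crawls[v]
    }
```

A vendor that crawled but sent no referrals comes back as infinity, which matches the "zero referrals" case described above.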

The documented bad actors

The Perplexity case is the most thoroughly reported. In August 2025, Cloudflare published evidence that when Perplexity's declared crawler was blocked, the company switched to impersonating a generic Chrome/124.0.0.0 Safari/537.36 user agent, rotated IPs and ASNs outside its published ranges, and generated 3 to 6 million daily requests across tens of thousands of domains. Cloudflare removed Perplexity from its verified-bots list; Perplexity called the report "embarrassing errors."

This followed WIRED's June 2024 investigation and independent researcher Robb Knight's finding that Perplexity was running headless Chrome and dropping its user-agent string to scrape sites that blocked it. Reuters reported that "multiple AI companies" were bypassing the standard.

Bytespider is the volume problem. Kasada's CEO told Fortune it crawls "25× faster than GPTBot, 3,000× faster than ClaudeBot," and by Q1 2026 it was the fourth-largest AI crawler at 10.25% of all AI crawler traffic. Block it aggressively at the network layer if you see it.

The takeaway: robots.txt works for vendors that honor it, and IP-range verification at the edge is what handles the ones that don't.

What did Cloudflare change, and why does it matter?

On July 1, 2025, Cloudflare flipped the default for newly onboarded domains so that identified AI crawlers are blocked unless the publisher opts in. The company branded it Content Independence Day, and so did The New York Times.

This matters because of scale. Cloudflare sits in front of about 22.4% of the web, so a default change there moves a meaningful slice of all internet traffic at once. The company says it denied 416+ billion AI scraping requests between July and December 2025, with 1M+ customers enabling AI blocking.

Pay-per-crawl

Alongside the block, Cloudflare launched pay-per-crawl: publishers set a per-request price, crawlers that want access pay it, and settlement runs through Stripe. Example pricing in the docs is on the order of $0.01 per request, configurable by content category or globally.

By Q2 2026, more than 1,000 publishers had opted in, including Condé Nast, Dotdash Meredith, ADWEEK, and The Atlantic. Bytespider declined; Perplexity has reportedly paid selectively. CEO Matthew Prince has framed the whole effort, including submissions to UK regulators, as restoring the economic balance of the web.

Worth keeping in mind: both TollBit and Cloudflare have commercial stakes here. TollBit sells a licensing platform and Cloudflare runs pay-per-crawl, so read their framing with that in mind. The underlying measurements still line up across independent sources.

Does blocking AI crawlers cost you traffic?

It depends almost entirely on your size. The sharpest data point comes from a Rutgers and Wharton study (Hangcheng Zhao and Ron Berman) of top-500 news publishers: the top-30 publishers, which carry 69% of total news traffic, saw a roughly 23% drop in total traffic after blocking AI bots, with human traffic down 14%.

The proposed mechanism is indirect. Blocking stops AI systems from citing you, brand recall weakens over weeks, and direct human visits decline. The loss showed up within six weeks, upstream of anything visible in your AI-bot logs.

The counter-evidence comes from ad network Raptive, representing 6,000+ sites, which concluded in a 2025 study that blocking AI crawlers doesn't affect traffic or rankings. The two findings reconcile by size: the citation-recall effect concentrates among the top-30 mega-brands, while mid-sized publishers see no measurable swing either way.

So the audience for the "stay citable" worry is narrower than the noise suggests. If you're a household-name publisher, the citation channel is real and worth protecting. If you're a niche B2B docs site or a mid-sized blog, you can block training freely and lose nothing measurable, while still allowing retrieval bots for the upside.

Context on the upside size: a Pew survey found only 9% of US adults get news from AI chatbots often or sometimes. But the channel is growing fast, with Similarweb reporting AI referrals to news up 770% year over year and ChatGPT driving 80% of them.

What does the legal backdrop tell operators?

The courts have drawn a rough line: training on lawfully acquired material is trending toward fair use, but using pirated material is not. In Bartz v. Anthropic (June 2025), Judge Alsup ruled that training on lawfully acquired books is fair use, while storing pirated copies is not.

Anthropic later agreed to pay at least $1.5 billion to authors over the piracy claim.

The unresolved giant is NYT v. OpenAI, still active in the Southern District of New York. In January 2026, Judge Stein affirmed an order requiring OpenAI to produce 20M ChatGPT conversation logs, after a 2025 ruling let the Times search deleted logs. The core fair-use question won't settle for another 12 to 24 months.

For an operator, the practical reading is simple. You can block training crawlers and pursue licensing on your own terms, but you can't retroactively pull your content from a model that already trained on it.

Notably, Reddit's licensing deals reportedly run into the hundreds of millions, and Reddit sued Anthropic in 2025 over unauthorized access, so unlicensed scraping is increasingly treated as actionable.

EU-facing publishers get one extra lever: the EU AI Act's GPAI transparency rules, effective August 2, 2025, require providers to publish a "sufficiently detailed summary" of training data. A blocked bot won't appear in that summary; an allowed one will, in aggregate.

The deployable recipe

Here's the "stay citable, control training" robots.txt, current as of June 2026. It blocks training, allows retrieval and user-action, and keeps Googlebot untouched.

# ===== TRAINING / INGESTION: BLOCK =====
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Google-Extended
Disallow: /

# ===== AD VALIDATION (NOT TRAINING): BLOCK =====
User-agent: OAI-AdsBot
Disallow: /

# ===== SEARCH / RETRIEVAL: ALLOW =====
# A crawler obeys only the most specific group that names it, so the
# private paths from the default group are repeated in each named group.
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Disallow: /admin/
Disallow: /private/
Disallow: /api/
Allow: /

# ===== USER-ACTION FETCHERS: ALLOW =====
User-agent: ChatGPT-User
User-agent: Claude-User
User-agent: Perplexity-User
Disallow: /admin/
Disallow: /private/
Disallow: /api/
Allow: /

# ===== REFERENCE SEARCH: DO NOT BLOCK =====
User-agent: Googlebot
Disallow: /admin/
Disallow: /private/
Disallow: /api/
Allow: /

# ===== DEFAULT =====
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /api/
Allow: /

Two maintenance notes. The Anthropic three-bot split is from February 20, 2026, so files that only listed ClaudeBot before then need the new stanzas. The OAI-AdsBot line is an April 23, 2026 addition.
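A quick way to catch a stale file is to diff the tokens it declares against the tokens you expect. The following is a small sketch under stated assumptions: `KNOWN_TOKENS` is just the list from this guide, and `declared_agents` and `missing_tokens` are hypothetical helper names, not part of any vendor tooling.

```python
import re

# Tokens named in this guide; extend the list as vendors add more.
KNOWN_TOKENS = [
    "GPTBot", "ClaudeBot", "anthropic-ai", "Claude-Web", "OAI-AdsBot",
    "Bytespider", "CCBot", "Applebot-Extended", "Google-Extended",
    "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot",
    "ChatGPT-User", "Claude-User", "Perplexity-User",
]


def declared_agents(robots_txt):
    """Return the lower-cased user-agent tokens that have a line in the file."""
    agents = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        match = re.match(r"(?i)user-agent\s*:\s*(\S+)", line)
        if match:
            agents.add(match.group(1).lower())
    return agents


def missing_tokens(robots_txt, known=KNOWN_TOKENS):
    """List known tokens that the file never mentions.

    Tokens are compared case-insensitively so a casing difference is not
    reported as a missing token.
    """
    declared = declared_agents(robots_txt)
    return [token for token in known if token.lower() not in declared]
```

Run against a file written before February 2026, `missing_tokens` would be expected to report Claude-SearchBot, Claude-User, and OAI-AdsBot, which is the gap the maintenance notes describe.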

Pairing it with Cloudflare edge rules

robots.txt declares intent; the edge enforces it. If you're on Cloudflare, layer these:

  1. Enable AI Crawl Control with default block under Security → Bots, applying the July 2025 policy.
  2. Allow-list retrieval and user-action bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot, and the three -User fetchers) matched against verified IP ranges, not just user-agent strings.
  3. Set pay-per-crawl pricing for training-class crawlers you want to monetize.
  4. Enable AI Audit and watch at least a month of per-crawler logs before revising.
  5. Add a stealth-crawler rule that challenges any request arriving from a known AI-vendor IP range with a non-vendor user agent, the exact pattern Cloudflare caught Perplexity using.
  6. Block Bytespider at the network or ASN layer if it shows up at all.
  7. Never block Googlebot, since AI Overview citations are delivered through it.
  8. Re-audit quarterly, because the verified-bots list and vendor tokens keep changing.
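The logic behind steps 2 and 5 can be sketched in a few lines. This is plain Python rather than Cloudflare's rule language, and it is an illustration only: the CIDR blocks are placeholder documentation ranges (RFC 5737), not real vendor ranges, and `VENDOR_RANGES`, `BLOCKED_TOKENS`, and `decide` are names made up for this example.

```python
import ipaddress

# Placeholder CIDR blocks from the RFC 5737 documentation ranges.
# Real ranges must come from each vendor's published list.
VENDOR_RANGES = {
    "OAI-SearchBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/24"],
}
BLOCKED_TOKENS = ("GPTBot", "ClaudeBot", "Bytespider", "CCBot")


def _in_ranges(client_ip, cidrs):
    """True when client_ip falls inside any of the given CIDR blocks."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


def decide(user_agent, client_ip):
    """Return 'allow', 'block', 'challenge', or 'pass' for one request."""
    ua = user_agent.lower()
    if any(token.lower() in ua for token in BLOCKED_TOKENS):
        return "block"  # declared training crawler
    for token, cidrs in VENDOR_RANGES.items():
        if token.lower() in ua:
            # Claims to be an allow-listed bot: trust it only from its ranges.
            return "allow" if _in_ranges(client_ip, cidrs) else "block"
    if any(_in_ranges(client_ip, cidrs) for cidrs in VENDOR_RANGES.values()):
        return "challenge"  # vendor IP with a non-vendor user agent
    return "pass"  # ordinary traffic, left to the normal rules
```

For example, `decide("Mozilla/5.0 Chrome/124.0.0.0 Safari/537.36", "198.51.100.7")` returns "challenge", because the request comes from a listed range while presenting a browser user agent.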

What this means for you

Start by pulling your server logs and seeing which of these bots actually hit you. Most operators are surprised by the volume and by how few they've ever configured for.
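A first pass over the logs can be as simple as counting hits per token. The sketch below assumes access logs in the common "combined" format, where the user agent is the last quoted field; `AI_TOKENS` is the list from this guide and `count_ai_hits` is a hypothetical helper name.

```python
import re
from collections import Counter

# Tokens named in this guide; matching is a case-insensitive substring test.
AI_TOKENS = [
    "GPTBot", "ClaudeBot", "anthropic-ai", "Claude-Web", "OAI-AdsBot",
    "Bytespider", "CCBot", "OAI-SearchBot", "Claude-SearchBot",
    "PerplexityBot", "ChatGPT-User", "Claude-User", "Perplexity-User",
]

# Captures the final quoted field of a combined-log-format line.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_hits(lines):
    """Count requests per AI token across an iterable of log lines."""
    hits = Counter()
    for line in lines:
        match = UA_FIELD.search(line)
        if not match:
            continue  # line is not in the expected format
        user_agent = match.group(1).lower()
        for token in AI_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
                break
    return hits
```

Passing an open log file to `count_ai_hits` and reading `hits.most_common()` gives a ranked list. Because this trusts the user-agent string, it will miss crawlers that disguise themselves as browsers.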

Then pick a posture by size. If you're not a top-30 brand, block training crawlers without hesitation and allow retrieval bots for the citation upside; the traffic risk is effectively zero for you. If you are a major brand, the Wharton data says treat the retrieval allow-list as load-bearing.

Deploy the split robots.txt above, then enforce it at the edge, because a declaration without IP-range verification won't stop the bots that have already shown they'll ignore it. And put a recurring reminder on your calendar: the vendors added three new tokens in the first half of 2026 alone, and a robots.txt that's six months stale is leaking access you think you've closed.

Frequently asked questions

Should I block GPTBot in robots.txt?

Block GPTBot if you want to keep your content out of OpenAI's model training, since that is its declared job. But allow OAI-SearchBot and ChatGPT-User, which surface your pages in real-time ChatGPT answers and can send referral traffic. Blocking all three at once drops your AI citation share toward zero.

What is the difference between ClaudeBot, Claude-SearchBot, and Claude-User?

Anthropic split its crawler into three on February 20, 2026. ClaudeBot collects training data, Claude-SearchBot indexes pages for Claude-powered search, and Claude-User fetches a specific page when a user asks Claude to read it. Each variant has its own user-agent token and takes its own rule, so you can block training while allowing the other two.

Does blocking AI crawlers hurt my traffic?

It depends on site size. A Rutgers and Wharton study found the top-30 news publishers lost about 23% of total traffic and 14% of human traffic after blocking AI bots. Ad network Raptive found no measurable effect for mid-sized sites. The downside concentrates among large brands whose citation share feeds downstream human search.

What did Cloudflare change on July 1, 2025?

Cloudflare flipped the default for newly onboarded domains so identified AI crawlers are blocked unless the publisher opts in. It branded the day Content Independence Day and paired the block with pay-per-crawl, a metered system letting publishers charge crawlers per request via Stripe.

Can robots.txt actually stop AI crawlers?

Not reliably on its own. TollBit measured roughly 30% of AI bot requests ignoring robots.txt by Q4 2025, with ChatGPT-User at 42% on sites that blocked it. Robots.txt is a declaration; edge enforcement with IP-range verification (via Cloudflare or your WAF) is what actually blocks non-compliant bots.