Strategy · 9 min read

Reuters and Time Block AI Crawlers by Default: What the Allowlist Shift Means

Reuters and Time adopted allowlist-by-default AI crawler policies in May 2026, blocking all bots except a pre-approved set. People Inc. expanded its blocked user agents from approximately 2,100 to over 30,000 after the switch. A Tollbit report found 30% of total AI bot scrapes did not comply with explicit robots.txt permissions. Here is how the publisher AI blocking trend affects content strategy and AI visibility.

Reuters and Time both adopted allowlist-by-default AI crawler policies in May 2026, blocking all bots except an approved set. Reuters approves bots from Amazon, Google, Bing/Microsoft, Yahoo, and OpenAI, and requires all others to demonstrate a "fair value exchange" via licensing payments, traffic referrals, or monetization assistance to gain access. People Inc. expanded its blocked user agent list from approximately 2,100 to over 30,000 after switching to the same model. A Tollbit report covering Q3-Q4 2025 found that 30% of total AI bot scrapes did not comply with explicit robots.txt permissions. A BuzzStream analysis from January 2026 found that 79% of top news publishers block at least one AI training bot. The SPUR Coalition, which tracks publisher rights in AI contexts, counted 36 member organizations as of May 2026, after adding 30 new members in the prior month.

Key takeaways

Reuters and Time switched to allowlist-by-default crawler policies in May 2026, blocking all AI bots except pre-approved ones. Reuters explicitly requires a "fair value exchange" (licensing, traffic referrals, or monetization support) for bot approval.
People Inc. expanded its blocked user agent list from approximately 2,100 to over 30,000 after switching from a blocklist to an allowlist model, illustrating the coverage gap the old blocklist approach left.
A Tollbit Q3-Q4 2025 report found 30% of total AI bot scrapes did not comply with explicit robots.txt permissions. Publisher confidence in robots.txt enforcement alone is limited.
79% of top news publishers block at least one AI training bot (BuzzStream, January 2026). The distinction between training crawlers (GPTBot) and search/retrieval crawlers (OAI-SearchBot) determines whether blocking helps or hurts AI visibility.
The strategic question for non-publisher brands is the inverse of the publisher question: publishers want to prevent uncredited scraping; most brands want to ensure AI systems can access their content for citations and retrieval.

What the allowlist model changes

Blocklist model (legacy): maintains a list of named bots to block. All other bots, including newly launched AI crawlers, are allowed by default. Coverage degrades as new AI systems launch bots faster than the blocklist is updated.

Allowlist model (Reuters, Time, People Inc.): blocks all bots by default. Only explicitly approved bots can access the site. Coverage is comprehensive because unapproved bots are blocked regardless of name. The operational cost is higher: new bots require active approval rather than passive listing.
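The difference between the two models comes down to what happens to a bot nobody has listed yet. The following is a minimal sketch, not any publisher's actual implementation: the token lists and the BrandNewAIBot user agent are hypothetical, and substring matching on the User-Agent header is a simplification (the header is easy to spoof, so real deployments usually combine it with IP verification).

```python
from typing import Iterable


def blocklist_allows(user_agent: str, blocked_tokens: Iterable[str]) -> bool:
    """Blocklist model: allow unless the user agent contains a blocked token."""
    ua = user_agent.lower()
    return not any(token.lower() in ua for token in blocked_tokens)


def allowlist_allows(user_agent: str, approved_tokens: Iterable[str]) -> bool:
    """Allowlist model: deny unless the user agent contains an approved token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in approved_tokens)


# Hypothetical lists, for illustration only.
BLOCKED = ["GPTBot", "CCBot"]
APPROVED = ["Googlebot", "bingbot", "OAI-SearchBot"]

# A newly launched crawler that appears on neither list (hypothetical name).
new_bot = "Mozilla/5.0 (compatible; BrandNewAIBot/1.0)"

# The blocklist lets the unknown bot through; the allowlist does not.
unknown_passes_blocklist = blocklist_allows(new_bot, BLOCKED)
unknown_passes_allowlist = allowlist_allows(new_bot, APPROVED)
```

The unknown bot passes the blocklist check and fails the allowlist check, which is the coverage gap described above: a blocklist only covers bots someone has already named.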

This creates a problem for marketers who rely on news publisher coverage for AI visibility: if your brand is mentioned in a Reuters or Time article, and the publisher's allowlist excludes the AI retrieval bots you depend on, the citation chain breaks. The publisher has the content, the AI system cannot retrieve it, and your brand mention becomes invisible to AI search.

The 30,000+ blocked user agents at People Inc. illustrate the scale: the legacy blocklist of 2,100 bots left tens of thousands of AI user agents unaddressed. The switch to an allowlist revealed the actual scope of unmanaged access.

Prooflytics tracks brand citation signals in the AI visibility layer of the daily briefing. For brands where editorial coverage in news publishers is a primary AI visibility driver, publisher crawler policies affect whether those citations reach AI retrieval systems. The signals layer distinguishes between brand mentions that are AI-retrievable and those that are not.

The enforcement gap

The Tollbit Q3-Q4 2025 finding that 30% of AI bot scrapes did not comply with explicit robots.txt permissions is significant for publishers and brands alike. robots.txt is an advisory protocol: it signals intent but has no technical enforcement mechanism. A bot that ignores robots.txt can still access the site. Without technical access controls, the publisher's only recourse is legal action under applicable law (CFAA in the US, GDPR in the EU).
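A site owner can estimate a comparable figure from their own access logs. The sketch below is an assumption about how such a check might look, not Tollbit's methodology: it uses Python's standard urllib.robotparser, and it assumes the log has already been reduced to (crawler token, path) pairs such as ("GPTBot", "/pricing"). Note that can_fetch expects the crawler's product token, not a full User-Agent header string.

```python
from typing import Iterable, Tuple
from urllib.robotparser import RobotFileParser


def noncompliance_rate(robots_txt: str, requests: Iterable[Tuple[str, str]]) -> float:
    """Share of logged bot requests that robots.txt disallowed.

    requests: (crawler_token, path) pairs extracted from access logs.
    Returns 0.0 when there are no requests.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    requests = list(requests)
    if not requests:
        return 0.0
    # A request is a violation if robots.txt says this token may not fetch the path.
    violations = sum(1 for token, path in requests if not parser.can_fetch(token, path))
    return violations / len(requests)


# Hypothetical policy and log sample, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /
"""

SAMPLE_REQUESTS = [
    ("GPTBot", "/article-a"),      # disallowed, but the bot fetched it anyway
    ("GPTBot", "/article-b"),      # disallowed
    ("OAI-SearchBot", "/article-a"),  # not named, so allowed by default
    ("PerplexityBot", "/"),        # not named, so allowed by default
]

rate = noncompliance_rate(SAMPLE_ROBOTS, SAMPLE_REQUESTS)
```

With this sample, two of the four requests violate the policy. A check like this only counts bots that identify themselves honestly; scrapers that spoof a browser user agent will not appear in it.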

The enforcement gap changes the calculus for both sides:

For publishers: allowlist models combined with technical access controls (IP blocking, user agent blocking at the server level, rate limiting) are more robust than robots.txt alone because they add a technical layer that non-compliant bots must actively circumvent.

For brands: a brand that wants its content available to AI retrieval bots cannot rely solely on robots.txt declarations. If the site's server blocks well-known AI retrieval user agents by IP or user agent header, the robots.txt allowance for those bots is irrelevant. The technical access layer must match the stated policy.
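The mismatch described for brands can be checked mechanically. This is a minimal sketch under stated assumptions: the server-level block list is a hypothetical set of crawler tokens exported from a CDN or firewall configuration, and the function name find_policy_mismatches is illustrative. It flags bots that robots.txt allows but the server layer blocks.

```python
from typing import Iterable, List
from urllib.robotparser import RobotFileParser


def find_policy_mismatches(
    robots_txt: str,
    server_blocked_tokens: Iterable[str],
    bots: Iterable[str],
    path: str = "/",
) -> List[str]:
    """Return bots that robots.txt allows but the server layer blocks."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = {token.lower() for token in server_blocked_tokens}
    mismatches = []
    for bot in bots:
        allowed_by_robots = parser.can_fetch(bot, path)
        blocked_by_server = bot.lower() in blocked
        if allowed_by_robots and blocked_by_server:
            mismatches.append(bot)
    return mismatches


# Hypothetical stated policy and server-level block list.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""
SERVER_BLOCKED = ["OAI-SearchBot", "Bytespider"]

mismatches = find_policy_mismatches(
    ROBOTS_TXT, SERVER_BLOCKED, ["OAI-SearchBot", "GPTBot", "PerplexityBot"]
)
```

Here OAI-SearchBot is the only mismatch: robots.txt invites it in, but the server layer turns it away, so the stated allowance has no effect.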

How the training/retrieval distinction changes blocking decisions

The most consequential distinction in AI crawler policy is between training crawlers and search/retrieval crawlers. Most publishers conflate them, but they have opposite effects on AI visibility:

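AI training crawlers and controls (e.g., GPTBot, Google-Extended): these govern whether content is collected for the next model training run. Blocking them prevents future model versions from including your content in their base weights. It does not affect real-time retrieval (what ChatGPT sees when a user searches today). Note that Google-Extended is a robots.txt control token rather than a separate crawler: Google fetches pages with its existing user agents and uses the token to decide whether the content may be used for AI training.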

AI search/retrieval crawlers (e.g., OAI-SearchBot, PerplexityBot, Applebot): crawl content to include it in real-time retrieval for AI-generated answers. Blocking these prevents your content from being cited in today's AI search responses.

From the CMO/CIO research: 77% of organizations with any AI policy address only training crawlers, and only 21% have a strategy for search-side crawlers. In practice, most organizations optimizing for AI visibility fall into one of two groups: those with no policy, whose robots.txt blocks neither training nor retrieval crawlers, and those that block only training crawlers, which leaves retrieval crawlers allowed or blocked by accident rather than by intent.

For brands that want to appear in AI search answers: ensure OAI-SearchBot, PerplexityBot, and Applebot are explicitly allowed in robots.txt. For brands that want to prevent AI training on their content without blocking AI citations: block GPTBot and Google-Extended while allowing OAI-SearchBot and PerplexityBot.
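The split policy can be generated and verified before it is deployed. The following is a minimal sketch, assuming the four tokens named above and Python's standard urllib.robotparser; the helper names build_robots_txt and is_allowed are illustrative, and a real robots.txt would normally carry additional rules for other crawlers.

```python
from typing import Iterable
from urllib.robotparser import RobotFileParser

TRAINING_TOKENS = ["GPTBot", "Google-Extended"]      # block: training use
RETRIEVAL_TOKENS = ["OAI-SearchBot", "PerplexityBot"]  # allow: search/retrieval


def build_robots_txt(block: Iterable[str], allow: Iterable[str]) -> str:
    """Build one robots.txt group per token: Disallow for block, Allow for allow."""
    groups = [f"User-agent: {token}\nDisallow: /" for token in block]
    groups += [f"User-agent: {token}\nAllow: /" for token in allow]
    # Blank lines separate groups.
    return "\n\n".join(groups) + "\n"


def is_allowed(robots_txt: str, token: str, path: str = "/") -> bool:
    """Check whether a crawler token may fetch path under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, path)


robots_txt = build_robots_txt(TRAINING_TOKENS, RETRIEVAL_TOKENS)
print(robots_txt)
```

Checking the generated file with is_allowed confirms that the training tokens are refused and the retrieval tokens are admitted. Crawlers that are not named at all, such as Googlebot, remain allowed because this sketch adds no catch-all group.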

What non-publisher brands should do differently from publishers

The publisher AI blocking trend is driven by rights concerns: publishers want compensation for the use of their content in AI training and commercial applications. This is a legitimate business model dispute.

For non-publisher brands, the incentive structure is typically inverted. You want your content cited in AI search answers. You want AI systems to accurately describe your products and services. You want to appear when users query your category in ChatGPT or Perplexity. The publisher's goal (limiting AI access) is your anti-goal.

Four practical steps for non-publisher brands:

Step 1: Audit current robots.txt for AI crawler coverage. Check whether GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, Applebot, Google-Extended, and Bytespider are addressed. A bot that is not named falls under your catch-all group (User-agent: *); if there is no catch-all, it is allowed by default.

Step 2: Decide the training versus retrieval split. If brand IP is not a concern, allow all AI crawlers. If you want to prevent training but allow retrieval citations, allow OAI-SearchBot and PerplexityBot while blocking GPTBot and Google-Extended. Document this decision so it survives infrastructure changes.

Step 3: Verify technical access matches stated policy. Check server logs for AI bot activity. If a bot you have explicitly allowed in robots.txt is not appearing in server logs despite known crawl activity, a technical layer (IP block, user agent filter at CDN level) may be overriding the robots.txt policy.

Step 4: Ensure key pages are crawlable. Product pages, About pages, pricing pages, and case studies are the highest-value pages for AI retrieval. Confirm they are indexed in Google Search Console and not blocked by authentication, JavaScript rendering requirements, or accidental noindex tags.
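The audit in Step 1 can be scripted. This is a minimal sketch, assuming the robots.txt content has already been fetched into a string; the audit and named_tokens helpers are illustrative names. For each AI crawler token it reports whether the file names it explicitly and whether it may fetch a given path once catch-all rules are applied.

```python
from typing import Dict, Iterable, Set
from urllib.robotparser import RobotFileParser

AI_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot",
    "Applebot", "Google-Extended", "Bytespider",
]


def named_tokens(robots_txt: str) -> Set[str]:
    """Return the lowercased User-agent values that appear in robots.txt."""
    names = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "user-agent":
            names.add(value.strip().lower())
    return names


def audit(
    robots_txt: str, tokens: Iterable[str] = AI_TOKENS, path: str = "/"
) -> Dict[str, Dict[str, bool]]:
    """Report, per token, whether it is named and whether it may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    named = named_tokens(robots_txt)
    return {
        token: {
            "named": token.lower() in named,
            "allowed": parser.can_fetch(token, path),
        }
        for token in tokens
    }


# Hypothetical robots.txt with a catch-all group and one named AI crawler.
SAMPLE = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /
"""

report = audit(SAMPLE)
```

In this sample only GPTBot is named and blocked. Every other token is unnamed and inherits the catch-all group, so it can reach the home page but not /private/. A result like that tells you the retrieval bots are allowed by default rather than by decision, which is the situation Step 2 asks you to resolve. The check covers robots.txt only, so Step 3 still requires looking at server logs.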

What to watch

More major publishers adopting allowlist models in 2026: the Reuters and Time moves follow The Atlantic and People Inc. If this trend continues, editorial coverage that drives AI citations will become dependent on whether the AI retrieval bots are on the publisher's approved list.
Licensing deals becoming standard for AI training access: if major AI companies standardize licensing arrangements with news publishers (following OpenAI's existing deals with Associated Press, Le Monde, and others), the question shifts to which publishers have deals and whether the retrieval bots operate under those deals or separately.
Non-compliant scraping enforcement actions: the 30% non-compliance rate and SPUR Coalition activity suggest legal enforcement actions are increasing. These are unlikely to affect brand content directly but may affect which AI training data sources are legally available, altering what AI systems know about market topics.
robots.txt deprecation discussions: some AI companies have proposed moving beyond robots.txt to negotiated access frameworks. If robots.txt loses status as the de facto access control mechanism, the crawler policy infrastructure for AI access will need to be rebuilt.

Bottom line

Reuters, Time, People Inc., and The Atlantic have adopted allowlist-by-default crawler policies, blocking all AI bots except pre-approved ones. This represents a structural shift from the legacy blocklist model.
The 30% non-compliance rate for robots.txt (Tollbit Q3-Q4 2025) means robots.txt alone is insufficient enforcement for publishers; it is still a necessary signal for compliant crawlers.
For non-publisher brands, the incentive is inverted: you want AI retrieval crawlers to access your content. Audit robots.txt for OAI-SearchBot, PerplexityBot, Applebot, and ClaudeBot specifically. Ensure key pages (product, about, case studies) are accessible to these bots.
The training/retrieval distinction is the most consequential robots.txt decision: allowing retrieval bots while optionally blocking training bots is possible and does not require an all-or-nothing choice.
For brands tracking AI visibility in Prooflytics: whether your content is accessible to AI retrieval systems determines whether AI citations are possible. The visibility layer in the daily briefing reflects what AI systems can currently reach and cite.

Frequently asked questions

Does blocking AI training crawlers affect my Google ranking?

Blocking GPTBot (OpenAI's training crawler) does not affect Google Search rankings. GPTBot is unrelated to Googlebot and has no effect on your Google crawl budget or indexation. Google-Extended is Google's robots.txt token for controlling AI training use and is separate from Googlebot. Disallowing Google-Extended while allowing Googlebot maintains your Google Search ranking while preventing Google from using your content for AI training specifically.

If a news publisher blocks AI crawlers, does that mean my brand mentions there are invisible to AI?

It depends on which bots the publisher allows. If the publisher allows OAI-SearchBot (OpenAI's search retrieval crawler) and PerplexityBot, your brand mentions in its articles are still retrievable for AI citations. If it blocks all AI bots, including retrieval bots, those mentions are not accessible to AI search unless they already appear in the model's training data from an earlier crawl.

Should small brands worry about the publisher AI blocking trend?

Directly, no. The publisher trend affects editorial content that mentions brands. For a small brand with few press mentions, the publisher blocking trend has minimal current impact. The larger concern is ensuring your own site is accessible to AI retrieval crawlers, since your direct content (product pages, blog, case studies) is your primary AI visibility asset, not press coverage. Press coverage matters at scale when it becomes a significant OPID source.

What is the SPUR Coalition?

SPUR (Sourcing and Publishing Under Regulation) is a publisher coalition formed to advance publisher rights in the context of AI training and commercial AI applications. Members include news organizations and digital publishers advocating for licensing frameworks, enforcement of robots.txt compliance, and legal protections against non-consensual AI scraping. As of the May 2026 reporting, it had 36 member organizations after adding 30 new members in the preceding month.

Is a 30% robots.txt non-compliance rate unusually high?

For traditional search crawlers (Googlebot, Bingbot), compliance is close to 100%, in large part because the major search engines have long committed to the protocol and depend on publishers' willingness to be indexed. AI training crawlers do not face the same incentive: non-compliance does not result in a penalty the crawler operator suffers. The 30% figure from Tollbit's Q3-Q4 2025 report reflects the absence of a technical enforcement mechanism for robots.txt, not a technical failure. Allowlist models with server-level blocking address this by adding a technical layer that robots.txt alone cannot provide.
