Author Introduction
I am Andrew McPherson, and in my work with B2B clients I see the same friction repeatedly: AI systems cite outdated pages because nothing tells them what is current. llms.txt and robots.txt solve that problem at the infrastructure layer, and in this article I explain exactly how to configure both.
Outline
- What llms.txt and robots.txt do for AI
- Why feed declaration matters for B2B
- How AI crawlers read both files
- Building an effective llms.txt manifest
- Configuring robots.txt for AI crawlers
- Linking feeds to AI data surfaces
- CiteCompass perspective on Citation Authority
- Recent changes in crawler standards
Key Takeaways
- llms.txt declares feeds; robots.txt controls crawler access
- Feed declaration prioritises fresh, high-value content for AI
- Separate user-agent rules let you manage each AI crawler
- RAG retrieval and training crawling behave differently
- JSON and RSS feeds are parsed most efficiently
- Block CCBot to limit downstream commercial AI use
- Crawl-delay protects servers without blocking indexing
- Early llms.txt adoption is low risk, high upside
What Are llms.txt and robots.txt for AI?
The llms.txt file is an emerging standard for declaring content feeds and priorities specifically for AI systems. Placed in your site root, llms.txt tells language models where to find structured data, which content to prioritise, and how frequently it updates. Unlike traditional sitemaps that guide crawler behaviour, llms.txt functions as a feed declaration manifest optimised for retrieval-augmented generation (RAG) systems.
The robots.txt file has been extended to support AI-specific crawler directives. While the original robots.txt standard from 1994 controlled search engine bot access, modern implementations include user-agent rules for AI crawlers like GPTBot (OpenAI), Claude-Web (Anthropic), GoogleOther (Google AI features), and CCBot (Common Crawl). Google’s crawler overview confirms the separation between Googlebot and GoogleOther, enabling independent control for traditional search versus AI applications.
Both files operate at the site root level (example.com/llms.txt and example.com/robots.txt) and are fetched before AI systems begin content retrieval. They represent the first layer of control in what Microsoft’s “From Discovery to Influence” framework calls Surface 2: feeds and APIs. By declaring feed locations and crawler policies explicitly, B2B companies reduce ambiguity in how AI systems access and prioritise their content.
Why Feed Declaration Matters for B2B Companies
AI systems retrieve content through three distinct surfaces: crawled web pages, feeds and APIs, and live site interactions. When you rely exclusively on web crawling (Surface 1), you surrender control over what AI models discover, how fresh that content is, and which pages receive priority. Feed declaration through llms.txt shifts this dynamic by providing a curated manifest of high-value content.
Consider a B2B company with 10,000 web pages including product documentation, blog posts, customer case studies, and support articles. Without llms.txt, an AI crawler treats all pages equally, potentially indexing outdated blog posts while missing recently updated API documentation. The llms.txt file lets you declare: index the /docs/ JSON feed first, updated daily; the /blog/ RSS feed is secondary, updated weekly; ignore /archive/ entirely.
This explicit prioritisation becomes critical when AI systems generate responses with citation limits. If ChatGPT can only cite three sources when answering how your product handles authentication, you want those citations pointing to current documentation, not archived blog posts from 2019. Feed declaration ensures AI models access the right content in the right order at the right freshness interval.
The robots.txt extensions provide complementary control by managing crawler behaviour per AI system. OpenAI’s GPTBot might be allowed full access because ChatGPT drives significant traffic, while CCBot could be rate-limited or blocked entirely. These granular controls let you balance AI visibility against server load, competitive concerns, and content licensing preferences.
B2B companies face a distinct challenge: technical content changes rapidly, and outdated information in AI responses damages trust. A SaaS platform that updates its API weekly cannot afford to have AI systems cite six-month-old documentation. Feed declaration with update-frequency signals solves this by telling AI systems exactly when to re-index specific content areas, ensuring citations reflect current product reality.
How AI Crawlers Use These Files
When an AI system prepares to retrieve content from your domain, it follows a predictable sequence. First, it requests robots.txt to check crawler permissions. The file is parsed for user-agent specific rules matching the crawler’s identifier. If allowed, the crawler checks for rate limits (crawl-delay directives) and prohibited paths (Disallow rules). Only after robots.txt compliance is confirmed does the crawler proceed to content retrieval.
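The access-check portion of this sequence can be sketched with Python's standard-library robots.txt parser. The rules and paths below are illustrative, not from any real site:

```python
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt with per-crawler rules.
robots_txt = """\
User-agent: GPTBot
Disallow: /archive/
Crawl-delay: 10

User-agent: *
Disallow: /internal/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Step 1: check whether this crawler may fetch a given path.
print(parser.can_fetch("GPTBot", "https://example.com/docs/auth"))    # True
print(parser.can_fetch("GPTBot", "https://example.com/archive/old"))  # False

# Step 2: check the rate limit before scheduling requests.
print(parser.crawl_delay("GPTBot"))  # 10
```

Real crawlers implement this logic themselves, but the same parser is useful for verifying that your directives say what you intend.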
Modern AI crawlers implement varying degrees of robots.txt compliance. GPTBot (OpenAI) and Claude-Web (Anthropic) respect standard robots.txt directives and honour site-specific configurations. Anthropic’s Claude-Web documentation and OpenAI’s GPTBot documentation both confirm support for standard crawl directives including crawl-delay and path-specific rules. Google’s AI features use multiple crawlers: Googlebot for traditional search and GoogleOther for AI training data, with separate user-agent strings allowing independent control.
The llms.txt file operates differently. Rather than controlling access, it optimises discovery. When an AI system indexes your domain for RAG purposes, it looks for llms.txt to identify structured feed URLs. The llms.txt community specification defines how files declare feed locations, content types, update frequencies, and priority hints.
AI systems process llms.txt as a manifest, not a directive. If your llms.txt declares a JSON feed at /api/content.json updated hourly, RAG systems will preferentially retrieve from that feed rather than crawling individual HTML pages. This improves retrieval efficiency for the AI system while giving you control over content format, structure, and freshness. The manifest approach aligns with how AI systems already consume APIs: as structured data sources rather than unstructured web pages.
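A structured feed like the hypothetical /api/content.json above could follow the JSON Feed convention; the URLs and fields here are illustrative:

```json
{
  "version": "https://jsonfeed.org/version/1.1",
  "title": "Product Documentation",
  "home_page_url": "https://example.com/docs/",
  "feed_url": "https://example.com/api/content.json",
  "items": [
    {
      "id": "https://example.com/docs/authentication",
      "url": "https://example.com/docs/authentication",
      "title": "Authentication Guide",
      "content_text": "How to authenticate API requests...",
      "date_modified": "2025-01-15T09:00:00Z"
    }
  ]
}
```

Explicit `date_modified` timestamps give RAG systems the freshness signal that HTML pages often lack.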
The technical implementation differs between retrieval and training use cases. For RAG-based retrieval (ChatGPT’s web browsing, Perplexity’s citation system), AI systems read llms.txt at query time to locate fresh content. For training data collection, crawlers read robots.txt once during bulk harvesting. Understanding this distinction helps you configure both files appropriately: llms.txt optimises real-time retrieval, while robots.txt governs bulk access.
How to Optimise llms.txt and robots.txt for AI
Creating an Effective llms.txt File
The llms.txt specification follows a simple key-value format with sections for different feed types. Start by identifying your highest-value content for AI systems: product documentation, API references, technical guides, and authoritative blog posts. Create dedicated feeds for each content type using formats AI systems parse efficiently (JSON, RSS, Atom, or structured HTML sitemaps).
A complete llms.txt file for a B2B SaaS company would declare three tiers of feeds: primary feeds for documentation and API references (JSON, updated daily, high priority); secondary feeds for blog content and case studies (RSS, updated weekly, medium priority); and supplemental feeds for changelogs and glossaries (updated monthly, low priority). An excluded-paths section then tells AI systems which directories to ignore entirely (for example /archive/, /internal/, and /staging/), complementing robots.txt Disallow rules.
Each section declares feed URLs, update frequency, preferred format, and priority level. This gives AI systems a clear hierarchy so retrieval effort concentrates on the content you most want cited.
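Because the specification is still a community draft, the exact syntax may change; the sketch below follows the key-value structure described above, with illustrative section and field names:

```text
# llms.txt — feed declaration manifest (illustrative draft syntax)

[primary]
feed: https://example.com/api/docs.json
format: json
update-frequency: daily
priority: high

[secondary]
feed: https://example.com/blog/feed.xml
format: rss
update-frequency: weekly
priority: medium

[supplemental]
feed: https://example.com/changelog/feed.xml
format: rss
update-frequency: monthly
priority: low

[excluded-paths]
path: /archive/
path: /internal/
path: /staging/
```

Treat this as a sketch to adapt, and track the draft specification for the finalised field names.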
Configuring robots.txt for AI Crawlers
Modern robots.txt files require separate user-agent blocks for each AI crawler. The standard format allows precise control over access patterns, crawl rates, and content boundaries. A production-ready configuration would: allow GPTBot and Claude-Web full access to /docs/, /blog/, and /api/ while disallowing /archive/ and /internal/, with a 10-second crawl delay; permit GoogleOther access to /docs/ and /blog/ with a faster 5-second delay; block CCBot entirely to prevent use in downstream commercial AI products; and apply conservative defaults to unknown crawlers.
The crawl-delay directive is measured in seconds between requests. For AI training crawlers that may fetch thousands of pages, a 10-second delay prevents server overload while still allowing complete indexing over days or weeks. For real-time RAG crawlers that fetch individual pages per query, lower delays of 2 to 5 seconds improve response time without straining infrastructure.
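The configuration described above could look like the following; the paths and delay values are illustrative, and not every crawler honours every directive:

```text
# robots.txt — AI crawler configuration (paths illustrative)

User-agent: GPTBot
Allow: /docs/
Allow: /blog/
Allow: /api/
Disallow: /archive/
Disallow: /internal/
Crawl-delay: 10

User-agent: Claude-Web
Allow: /docs/
Allow: /blog/
Allow: /api/
Disallow: /archive/
Disallow: /internal/
Crawl-delay: 10

User-agent: GoogleOther
Allow: /docs/
Allow: /blog/
Disallow: /
Crawl-delay: 5

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /internal/
Crawl-delay: 10
```

Note that each user-agent block stands alone: a crawler matches the most specific block and ignores the rest, so shared rules must be repeated per crawler.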
Linking llms.txt to AI Data Surfaces
Both files implement what Microsoft calls Surface 2 in the “From Discovery to Influence” framework: feeds and APIs. While Surface 1 (crawled web) depends on AI systems discovering content organically, Surface 2 gives you explicit control over what content is accessed, in what format, and how frequently.
The llms.txt file should reference the same structured feeds you publish for other purposes: RSS for blogs, JSON feeds for documentation, API endpoints for dynamic content. This unified approach means maintaining a single set of feeds for both human subscribers and AI systems, reducing technical overhead while ensuring consistency.
Testing your implementation requires checking that both files are accessible and correctly formatted. Use curl or a browser to fetch both URLs; they should return 200 status codes and plain-text content. For llms.txt, verify that all feed URLs are valid and return structured data. For robots.txt, test with Google Search Console's robots.txt report or a dedicated validation tool to confirm syntax correctness and directive logic.
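The same accessibility check can be scripted. This sketch uses only Python's standard library; the hypothetical "feed-check" user agent and example.com URLs are illustrative:

```python
import urllib.request


def is_valid_control_file(status, content_type):
    # Both llms.txt and robots.txt should return HTTP 200 as plain text.
    return status == 200 and content_type == "text/plain"


def check_control_file(url):
    # Fetch a root-level control file and report whether it is served correctly.
    request = urllib.request.Request(url, headers={"User-Agent": "feed-check/0.1"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return is_valid_control_file(response.status,
                                     response.headers.get_content_type())


# Usage against a live site:
# for path in ("/robots.txt", "/llms.txt"):
#     print(path, check_control_file("https://example.com" + path))
```

A wrong content type (for example, an HTML 404 page served with status 200) is a common silent failure that this check catches.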
CiteCompass Perspective
At CiteCompass, we view llms.txt and robots.txt as foundational elements of Citation Authority optimisation. These files represent the first interaction between your content infrastructure and AI systems, setting expectations for how your content should be accessed and prioritised. Companies that implement both files strategically see measurable improvements in citation frequency and accuracy. Our AI Visibility Suite tracks crawler behaviour and citation outcomes so you can see the impact of these changes.
The distinction between controlling access (robots.txt) and optimising discovery (llms.txt) mirrors our broader framework for AI visibility. You cannot force AI systems to cite you, but you can remove friction from the retrieval process. Feed declaration eliminates ambiguity about content location, format, and freshness. Crawler directives prevent AI systems from indexing low-value pages that dilute your Citation Authority.
We recommend B2B companies implement llms.txt even before formal specification finalisation. Early adoption signals to AI systems that your content infrastructure is optimised for programmatic access, potentially influencing retrieval prioritisation algorithms. The minimal implementation cost (a static text file and existing feeds) creates asymmetric upside: minimal risk, potential citation improvements, and future-proofed infrastructure.
Monitor your server logs for AI crawler activity after implementing both files. Track which user-agents access your feeds, how frequently they crawl, and which content paths receive the most attention. This data reveals which AI systems respect your directives and which feeds drive the most RAG retrievals, informing ongoing optimisation efforts.
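A minimal version of this monitoring can be done with a short script. The sketch below tallies the AI crawler user agents named in this article from combined-log-format lines; the sample log entries are fabricated for illustration:

```python
import re
from collections import Counter

# AI crawler user-agent substrings named in this article; extend as needed.
AI_CRAWLERS = ("GPTBot", "Claude-Web", "GoogleOther", "CCBot")


def tally_ai_crawlers(log_lines):
    """Count requests per AI crawler in combined-log-format lines."""
    counts = Counter()
    for line in log_lines:
        quoted = re.findall(r'"([^"]*)"', line)
        if not quoted:
            continue
        user_agent = quoted[-1]  # user-agent is the last quoted field
        for bot in AI_CRAWLERS:
            if bot in user_agent:
                counts[bot] += 1
    return counts


# Two illustrative access-log lines.
sample = [
    '1.2.3.4 - - [15/Jan/2025:10:00:00 +0000] "GET /docs/auth HTTP/1.1" '
    '200 512 "-" "Mozilla/5.0; compatible; GPTBot/1.0"',
    '5.6.7.8 - - [15/Jan/2025:10:00:05 +0000] "GET /llms.txt HTTP/1.1" '
    '200 128 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"',
]
print(tally_ai_crawlers(sample))
```

Splitting the tally further by request path shows which feeds each crawler actually consumes.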
What Changed Recently
2025-01: Anthropic documented Claude-Web crawler behaviour and robots.txt compliance in official developer documentation, confirming support for standard directives.
2024-12: The llms.txt community specification reached draft v0.3, adding support for priority hints and excluded-paths declarations with informal support from multiple AI companies.
2024-11: Google clarified the distinction between Googlebot and GoogleOther, enabling separate crawler controls for search indexing versus AI features.
Related Topics
Explore related concepts in the Technical Implementation pillar, including API and Structured Data Feeds, Schema Markup for AI, and Product and Service Feed Optimisation. Learn about AI Data Surfaces in the Core Frameworks pillar, or return to the CiteCompass Knowledge Hub to explore all six pillars of AI visibility optimisation.
References
Anthropic (2025). Claude-Web Documentation. Official documentation on Claude’s web crawler behaviour, robots.txt compliance, and support for standard crawl directives.
OpenAI (2024). GPTBot Documentation. Official documentation specifying GPTBot default crawl rates and implementation guidance for webmasters controlling AI crawler access.
Google Search Central (2024). Overview of Google crawlers. Documentation clarifying the distinction between Googlebot and GoogleOther for traditional search versus AI applications.
llms.txt Community (2024). llms.txt Specification. Community-driven specification for declaring structured data feeds and API endpoints to AI crawlers.

