ansly

AI-readiness scanner for websites. Check your visibility in ChatGPT, Claude, and Perplexity.

@tryansly

Product

  • Audit
  • Pricing
  • Blog

Company

  • About
  • Privacy Policy
  • Terms of Service
  • Contact Us
© 2026 ansly. All rights reserved.

From robots.txt to entities.txt: The Evolution of How Websites Talk to Machines

Every generation of machine visitors to the web demanded a new file. In 1994, crawlers needed permissions. In 2005, indexers needed an inventory. In 2024, agents needed a summary. In 2026, they need a knowledge graph.

ansly Team·March 11, 2026

The history of web infrastructure is a history of websites learning to talk to machines.

Every time a new type of machine visitor appeared, with new capabilities, new needs, new constraints, the web had to develop new vocabulary. A new file format. A new convention. A new layer of the stack.

We are in the middle of the fourth such transition. And understanding what drove each of the previous three clarifies exactly why entities.txt is necessary now.


1994: The Crawler Arrives, robots.txt

The first web crawlers appeared in 1993 and 1994. Their job was simple: follow links, index content, repeat. They were fast, thorough, and indiscriminate. They would crawl your login forms, your private admin pages, your session URLs, your PDFs, anything reachable via a link.

Website owners needed a way to say: "You can go here, but not there."

robots.txt was the answer. Proposed in 1994 and adopted as a de facto standard, it gave websites a simple protocol to communicate access permissions to automated agents. A few lines of plain text at the root of the domain, readable by any crawler that chose to check.

User-agent: *
Disallow: /admin/
Disallow: /private/

It was entirely about permissions. Nothing about meaning. Nothing about what the allowed pages contained or what they were for. The crawler's job was to crawl; figuring out meaning was a problem for the search engine's indexing algorithms.
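A compliant crawler's permission check can be sketched with Python's standard-library robots.txt parser, applied to the rules above:

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from above, as a list of lines.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A well-behaved crawler checks every URL before fetching it.
print(rp.can_fetch("*", "https://example.com/blog/post"))    # True (allowed)
print(rp.can_fetch("*", "https://example.com/admin/users"))  # False (disallowed)
```

Note what the parser returns: a yes/no answer per URL, and nothing else. That is the entire vocabulary of the protocol.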

What it gave machines: Permission maps. "Go here. Not there."


2005: The Indexer Scales, sitemap.xml

By 2004, the web had grown from millions of pages to billions. Search engines were good at finding the most-linked content, but they were systematically missing pages with few inbound links, pages buried deep in large sites, and pages orphaned by poor internal linking.

Website owners needed a way to say: "These pages exist. Please index them."

Google introduced the Sitemap Protocol in 2005 (codified as sitemap.xml). It was a structured XML inventory: a list of URLs with optional metadata about when they were last modified and how frequently they changed. Not a crawl directive but an inventory.

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
</urlset>

It told crawlers nothing about what those pages meant, what entities they discussed, or how they related to each other. It was a catalog of addresses, not a map of meaning.
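An indexer's read of that inventory can be sketched in a few lines of Python, using the example sitemap above. The only wrinkle is the XML namespace, which every element lives under:

```python
import xml.etree.ElementTree as ET

SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP)

# Each <url> entry is an address plus optional freshness metadata --
# nothing about meaning, just "this page exists, last touched here".
for url in root.findall("sm:url", NS):
    loc = url.findtext("sm:loc", namespaces=NS)
    lastmod = url.findtext("sm:lastmod", namespaces=NS)
    print(loc, lastmod)  # https://example.com/about 2024-01-15
```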

What it gave machines: Page inventory. "These pages exist; here's when they last changed."


2024: The Agent Emerges, llms.txt

For two decades, the crawler/sitemap combination was sufficient because the consumers of web content were fundamentally the same type of entity: search engine indexers that ranked pages by link signals and keyword patterns.

Then large language models changed everything.

AI agents don't rank pages, they synthesize answers. They don't care about PageRank, they care about meaning. And they operate under a constraint that traditional crawlers don't: context windows. A crawler can index a million pages. An LLM agent generating an answer has a finite budget for how much text it can process.

The Answer.AI team recognized this and proposed llms.txt in 2024: a plain Markdown file at /llms.txt that gives AI agents a curated narrative summary of your site, what you do, who you serve, and where your most important content lives.

# Acme Corp

> Enterprise project management for distributed teams

## Docs
- [Getting Started](/docs/getting-started.md): How to set up your workspace
- [API Reference](/docs/api.md): Full API documentation

## Optional
- [Full site content](/llms-full.txt)

This was a genuine leap. For the first time, a web standard was designed specifically for AI reasoning rather than link-following or keyword indexing. llms.txt gives agents a starting point, a curated map of your content hierarchy.

But it has a ceiling. It's prose and links. An AI agent reading your llms.txt still has to infer what entities you have, how they relate to each other, and what you're authoritative on. The information is there, buried in narrative, but it's not structured data that a machine can traverse efficiently.

What it gave machines: Narrative summary. "Here's what this site is about and where to find key content."


2026: The Agent Era, entities.txt

The progression is clear in retrospect:

Year   File          Audience    Gave machines
1994   robots.txt    Crawlers    Permission maps
2005   sitemap.xml   Indexers    Page inventory
2024   llms.txt      LLM agents  Narrative summary
2026   entities.txt  AI agents   Semantic knowledge graph

Each layer added more semantic richness because each generation of machine visitor needed more semantic context.

robots.txt told crawlers where they could go. sitemap.xml told indexers what pages existed. llms.txt told LLMs what the site was about. entities.txt tells agents what the site IS: its entities, their relationships, topic authority, and page-entity mappings.

The jump from llms.txt to entities.txt is the jump from narrative to graph. A narrative summary tells a story. A knowledge graph expresses a structure. For machines that reason over structured data, the difference is profound.
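There is no single canonical syntax to show here, but a minimal sketch of what an entities.txt could declare, written in a YAML-like layout with purely illustrative field names and entities, might look like:

# entities.txt (illustrative sketch; entities and fields are hypothetical)
entity: Acme Tasks
  type: SoftwareApplication
  publisher: Acme Corp
  relatesTo: Acme Calendar (integrates_with)
  authoritativeOn: project management
  page: /products/tasks

The point is not the syntax but the shape: named entities, typed relationships, and explicit claims of authority, declared once at a known location.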

An agent reading your entities.txt can answer questions like "What entities does this site have?", "How does product A relate to product B?", "Is this site authoritative on topic X?", and "Which page should I read to understand entity Y?", all without crawling a single page and without inferring from prose.
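Once relationships are explicit, those questions become simple lookups rather than inference. A toy sketch, with an in-memory graph standing in for a parsed entities.txt (all entity names and relation labels are invented for illustration):

```python
# A toy knowledge graph of subject-relation-object triples, of the kind
# an entities.txt could declare. Names here are hypothetical.
GRAPH = {
    ("Acme Tasks", "integrates_with", "Acme Calendar"),
    ("Acme Corp", "publishes", "Acme Tasks"),
    ("Acme Corp", "authoritative_on", "project management"),
}

def relations_between(a, b):
    """Answer 'how does entity a relate to entity b?' without crawling."""
    return [rel for (s, rel, o) in GRAPH if s == a and o == b]

def is_authoritative(site, topic):
    """Answer 'is this site authoritative on this topic?' by direct lookup."""
    return (site, "authoritative_on", topic) in GRAPH

print(relations_between("Acme Tasks", "Acme Calendar"))   # ['integrates_with']
print(is_authoritative("Acme Corp", "project management"))  # True
```

Contrast this with the llms.txt case: there, the agent ends up with links and prose to interpret; here, every answer is a set-membership check over declared structure.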

What it gives machines: Semantic knowledge graph. "Here are the entities, how they relate, and what we're authoritative on."


The Pattern: Machines Evolve, Infrastructure Follows

There's a consistent dynamic across all four transitions:

  1. A new type of machine visitor appears with new capabilities
  2. Existing web infrastructure doesn't give that machine what it needs
  3. A new convention emerges to fill the gap
  4. Early adopters gain an advantage; late adopters scramble to catch up

In 1994, sites that implemented robots.txt avoided the crawl problems that plagued sites without it. In 2005, sites with sitemaps got better index coverage. In 2024, sites with an llms.txt file got better representation in AI-generated summaries. In 2026, sites with entities.txt will give AI agents the structured identity layer they need to represent those sites accurately.

The infrastructure follows the machine. The machine is now an agent. The infrastructure it needs is a knowledge graph.


For a practical guide to building your entities.txt, see entities.txt: Why Every Website Needs a Semantic Identity File for AI Agents. To understand how entities.txt reduces AI hallucination about your brand, see How entities.txt Reduces AI Hallucination.

For the technical foundation of making your site visible to AI crawlers in general, see GPTBot, ClaudeBot, PerplexityBot: The Complete AI Crawler Access Guide and The Complete Guide to llms.txt.

