I Overhauled My llms.txt Generator — Here's What Changed and Why

Sprint 15 shipped today. Six phases. Five files. One mission: make my generated llms.txt files best-in-class.

The Problem I Found

I ran a deep audit of my own generated output and it wasn't pretty. The files I was producing for customers had generic descriptions copied from site-wide meta tags, noisy semantic tags that described HTML structure instead of content, Privacy Policy pages mixed in with product pages, and an llms-full.txt format that was full in name only — it just repeated the same one-sentence AI summaries instead of including actual page content.

I claim to be building a leading llms.txt generator. The output has to prove it.

What I Shipped

Upgraded AI Model, Better Output

I switched from OpenAI's gpt-4o-mini to MiniMax M2.5 via OpenRouter. Same API shape, just a model string change. But the output quality is night and day. I also doubled the content sample the model sees (4KB to 8KB) and increased max tokens from 500 to 800. Cost went from ~$0.02 to ~$0.04 per analysis. Negligible.

Actually Full llms-full.txt Files

The spec community has converged on llms-full.txt as the format for complete page content. Anthropic's is 481K tokens. Cloudflare's is 3.7 million. Mine was... just repeating the same 150-character summary from the standard format.

Now I extract real body text during page analysis — strip the nav, footer, scripts, and boilerplate, then grab up to 4,000 characters of actual page content. For aisearchmastery.com, the full format went from 19K to 77K characters. That's real content an AI can learn from.

The Little Things That Matter

The blockquote at the top of every generated file used to say "This page offers three tools for..." — generic, impersonal, tells you nothing about who built it. Now it extracts the brand name from the site's title tag and uses it: "AI Search Mastery offers three tools for..."

Semantic tags got a complete rewrite. I was tagging pages with [static], [form], [public] — tags that describe HTML rendering, not content value. An AI model doesn't care if a page has a form. It cares if it's a [guide] or a [tool] or an [article]. Max 2 tags per entry now, all content-type.

Legal pages (privacy, terms, cookies) get automatically moved to the Optional section. They were cluttering the main content sections where AI models look first.

The Validator Bug Nobody Would Have Found

During testing, my generated files scored B+ on my own validator instead of the target Grade A. Looked like a quality issue. It wasn't.

I traced through the scoring math and found two bugs:

Freshness scoring was binary. The validator HEAD-requests every URL to check if it's still live. If even ONE out of 20 URLs timed out (say, an external link to llmstxt.org was slow), the 20%-weighted freshness score dropped from 100% to 0%. Grade A became mathematically impossible.
Size scoring had a gap. Files with 4,000-8,000 tokens got zero points — no pass entry, only an issue. That's the most common size range for a real llms.txt file.

Three lines of code fixed both. Freshness is now proportional (19/20 = 95%). Size gives 75% for the medium range. My own file went from B to 94/100.

A Market Insight: llms-mini.txt Doesn't Exist

While researching for this sprint, I discovered something interesting. The official llmstxt.org spec only defines one file: llms.txt. The llms-full.txt format was a community convention popularised by Mintlify and adopted widely.

But llms-mini.txt? It doesn't exist. Zero search results. No competitor generates it. No spec defines it.

I generate all three formats. That makes my mini format a greenfield differentiator — purpose-built for agent routing and small context windows. When an AI agent needs to quickly classify what a site does before deciding whether to fetch more, the mini format gives them a 5-URL summary in under 500 tokens.

What's Next

Sprint 15 is deployed. The test suite runs against 5 different site types (Next.js, Gatsby, static, marketing, large-scale). Quality checks pass across the board. I'm looking at either JS rendering improvements (Sprint 10) or drift monitoring (Sprint 14) next.

The llms.txt ecosystem is still early. 844,000 sites have adopted it, but the format is still officially a "proposal." The tools that produce the best output today will shape what the standard becomes tomorrow.

Building LLM.txt Mastery — the AI-ready website content generator. Follow along at llmtxtmastery.com.

LLM.txt Mastery Sprint 15: Quality Overhaul