
Build a high-quality llms.txt for any website. Model-agnostic, SSRF-safe, no hallucinated URLs.


llmstxt-generator

Build a high-quality llms.txt for any website — from one command.

License: MIT Python 3.9+ Built by Trakkr

llmstxt-gen stripe.com

It crawls a site the same way an AI agent would — homepage, sitemap, robots.txt, the highest-signal pages — and writes a clean, spec-compliant llms.txt that gives models a faithful map of what the site is and where its important content lives. Every link in the output is one the generator actually saw: no invented URLs.

Model-agnostic by design. Runs against OpenAI, Anthropic, DeepSeek, Together, OpenRouter, Groq, or a local Ollama with a single flag.

Built by Trakkr — the AI visibility platform. This is the open-source engine behind Trakkr's free llms.txt tool.


What is llms.txt?

llms.txt is a simple Markdown file at a site's root (example.com/llms.txt) that tells AI models and agents what a site is about and which pages matter, without making them wade through navigation, scripts, and boilerplate. Think of it as robots.txt for meaning instead of access — a curated, machine-readable index of your most important content. The format is defined at llmstxt.org.
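
To make the shape of the file concrete, here is a minimal sketch of a reader for it, assuming the common layout of an H1 title, a blockquote summary, and H2 sections of Markdown links. The function name `parse_llms_txt` and the sample text are illustrative only and are not part of this package.

```python
import re

# One list item: "- [Title](url)" with an optional ": description" tail.
LINK_RE = re.compile(
    r"^- \[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<desc>.*))?$"
)


def parse_llms_txt(text):
    """Parse the H1 title, blockquote summary, and H2 link sections."""
    title = None
    summary_lines = []
    sections = {}
    current = None  # name of the H2 section we are inside, if any
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and title is None:
            title = line[2:].strip()
        elif line.startswith("> ") and current is None:
            summary_lines.append(line[2:].strip())
        elif line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                sections[current].append(
                    (match["title"], match["url"], match["desc"] or "")
                )
    return {
        "title": title,
        "summary": " ".join(summary_lines),
        "sections": sections,
    }


# Hypothetical sample file, used only to exercise the parser.
sample = """# Example

> Example sells widgets.
> It also documents them.

## Docs

- [Quickstart](https://example.com/docs/quickstart): Install and run.
- [API](https://example.com/docs/api)
"""

parsed = parse_llms_txt(sample)
print(parsed["title"], len(parsed["sections"]["Docs"]))
```

A real consumer would likely be more forgiving about whitespace and list markers; this sketch only covers the layout shown in the example output further down.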

It's moving from convention to standard. In May 2026, Google added an llms.txt check to Lighthouse's new Agentic Browsing audit, putting it alongside the performance and accessibility signals teams already track. A good llms.txt is fast becoming table stakes for being well-represented in AI search and assistants.

Quickstart

pip install llmstxt-generator      # or: pipx install llmstxt-generator
# or install the latest straight from source:
#   pip install "git+https://github.com/trakkr-aisearch/llms-txt-generator"
export OPENAI_API_KEY=sk-...        # the only thing the default needs

llmstxt-gen stripe.com             # print to stdout
llmstxt-gen stripe.com -o llms.txt # write to a file
llmstxt-gen stripe.com --verbose   # watch the live discovery trace

As a library:

from llmstxt_generator import generate_llms_txt

result = generate_llms_txt("stripe.com")   # needs OPENAI_API_KEY
print(result.content)

print(result.pages_read, "pages read")
print(result.validation["link_count"], "links")
print(result.validation["dropped_invented_links"], "hallucinated URLs dropped")

Example output

Real output from llmstxt-gen stripe.com (default model, ~$0.001, ~20s, 23 links, 0 hallucinated URLs dropped). Trimmed for length — the full files for Stripe, Vercel, and Anthropic are in examples/.

# Stripe

> Stripe is a financial services platform that provides businesses with tools to
> accept payments, manage financial operations, and implement custom revenue
> models. It serves a diverse range of clients, from startups to large
> enterprises, across various industries.

## Payments Solutions

- [Stripe Payments](https://stripe.com/payments): Accept payments online and in person globally with a payments solution built for any business.
- [Payment methods](https://stripe.com/payments/payment-methods): Explore popular local payment methods to improve conversion rates for businesses.
- [Stripe Payments documentation](https://docs.stripe.com/payments.md): A guide to integrating Stripe's payments APIs.

## Connect Solutions

- [Stripe Connect](https://stripe.com/connect): Embed payments into products with seamless onboarding and global payouts.
- [Marketplace payments](https://stripe.com/connect/marketplaces): Tools for onboarding and paying out freelancers and sellers.

## Enterprise Solutions

- [Enterprise Payment Solutions for Large Businesses](https://stripe.com/enterprise): Tailored financial solutions for large enterprises.
- [Pricing & Fees](https://stripe.com/pricing): Details on Stripe's processing fees and pricing models for businesses.

## Optional

- [Stripe Newsroom](https://stripe.com/newsroom): Latest news and updates about Stripe's partnerships and innovations.
- [Legal](https://stripe.com/legal): Access Stripe's legal documents and policies.

Use any model

The default is OpenAI's gpt-4o-mini (cheap, fast, widely available). Switch providers with a flag or an env var: any OpenAI-compatible Chat Completions endpoint works, and Anthropic is supported through a native adapter.

llmstxt-gen stripe.com --provider deepseek
llmstxt-gen stripe.com --provider anthropic --model claude-haiku-4-5-20251001
llmstxt-gen stripe.com --provider openrouter --model openai/gpt-4o-mini
LLMSTXT_PROVIDER=ollama llmstxt-gen stripe.com   # local, no key

Provider matrix

| Provider | --provider | API key env | Default model | Notes |
| --- | --- | --- | --- | --- |
| OpenAI | openai (default) | OPENAI_API_KEY | gpt-4o-mini | Works out of the box |
| Anthropic | anthropic | ANTHROPIC_API_KEY | claude-haiku-4-5-20251001 | pip install 'llmstxt-generator[anthropic]' |
| DeepSeek | deepseek | DEEPSEEK_API_KEY | deepseek-chat | OpenAI-compatible |
| Together | together | TOGETHER_API_KEY | meta-llama/Llama-3.3-70B-Instruct-Turbo | OpenAI-compatible |
| OpenRouter | openrouter | OPENROUTER_API_KEY | openai/gpt-4o-mini | Any model on OpenRouter |
| Groq | groq | GROQ_API_KEY | llama-3.3-70b-versatile | OpenAI-compatible |
| Ollama | ollama | (none) | llama3.1 | Local http://localhost:11434/v1 |
| Any other | custom | LLMSTXT_API_KEY | set --model | Point --base-url at any OpenAI-compatible API |

Override anything via env: LLMSTXT_PROVIDER, LLMSTXT_MODEL, LLMSTXT_BASE_URL, LLMSTXT_API_KEY. Arguments beat env; env beats defaults.

# A custom OpenAI-compatible gateway:
LLMSTXT_BASE_URL=https://my-gateway.internal/v1 \
LLMSTXT_API_KEY=... \
llmstxt-gen stripe.com --provider custom --model my-model
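
The precedence rule above (arguments beat env, env beats defaults) can be pictured with a small sketch. This is not the package's `resolve_config`; `resolve_setting`, `DEFAULTS`, and `ENV_NAMES` are hypothetical names, and the sketch ignores that the real default model depends on the chosen provider.

```python
import os

# Assumed defaults for illustration; the real package picks a default
# model per provider (see the provider matrix).
DEFAULTS = {"provider": "openai", "model": "gpt-4o-mini"}

# Environment variable names as listed in this README.
ENV_NAMES = {
    "provider": "LLMSTXT_PROVIDER",
    "model": "LLMSTXT_MODEL",
    "base_url": "LLMSTXT_BASE_URL",
    "api_key": "LLMSTXT_API_KEY",
}


def resolve_setting(name, arg=None, env=None):
    """Return the explicit argument, else the env value, else the default."""
    env = os.environ if env is None else env
    if arg is not None:
        return arg
    value = env.get(ENV_NAMES[name])
    if value:
        return value
    return DEFAULTS.get(name)  # None when there is no default


fake_env = {"LLMSTXT_PROVIDER": "ollama"}
print(resolve_setting("provider", arg="groq", env=fake_env))
print(resolve_setting("provider", env=fake_env))
print(resolve_setting("provider", env={}))
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.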

How it works

A fixed four-phase pipeline — no open-ended agent loop, so the cost and runtime are bounded and predictable (roughly a cent or less per site on the default model).

1. Discover  ──  fetch the homepage, robots.txt, sitemap.xml, and any existing
                 llms.txt; optionally ask the model what it knows about the brand
                 cold (to sharpen the summary, never to invent page content).

2. Enrich    ──  score every discovered URL (shallow + high-value slugs win),
                 then fetch the top pages for their real titles and descriptions.

3. Compose   ──  one streamed model call writes the llms.txt live, grounded only
                 in the pages we actually read.

4. Finalize  ──  strip code fences, validate every link against what we saw,
                 de-duplicate, drop emptied sections, and score the structure.
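
As a rough illustration of the Enrich phase, the sketch below scores URLs so that shallow paths and high-value slugs rank first. The weights, the slug list, and the names `score_url` and `top_pages` are assumptions made for this example; the package's actual scoring may differ.

```python
from urllib.parse import urlparse

# Assumed list of slugs that usually mark important pages.
HIGH_VALUE_SLUGS = {"pricing", "docs", "about", "products", "api", "blog"}


def score_url(url):
    """Higher is better: shallow paths and high-value slugs win."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    depth_score = max(0, 3 - len(segments))  # 3 for the root, 0 at depth 3+
    slug_score = 2 if any(s.lower() in HIGH_VALUE_SLUGS for s in segments) else 0
    return depth_score + slug_score


def top_pages(urls, limit=12):
    """Return the best-scoring URLs, ties broken alphabetically."""
    return sorted(urls, key=lambda u: (-score_url(u), u))[:limit]


candidates = [
    "https://example.com/a/b/c/d",
    "https://example.com/pricing",
    "https://example.com/",
]
print(top_pages(candidates, limit=2))
```

The default `limit=12` mirrors the documented `--max-pages` default; everything else here is a guess at one reasonable heuristic.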

No hallucinated URLs. Phase 4 checks every link against the set of URLs the crawler actually discovered. When discovery is rich, on-site URLs the model assembled from real context are allowed; when discovery is sparse (a bot-walled or JS-only site), it switches to strict mode and keeps only URLs it literally saw — so the model can't fabricate a site map from memory. Duplicate links (the "eleven titles all pointing at the homepage" failure) are collapsed.
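
A minimal sketch of the strict-mode check described above: keep a link only if its URL is in the set the crawler saw, and collapse repeats of the same URL. The names `normalize` and `filter_links` are hypothetical, the normalization rules are an assumption, and the looser rich-discovery mode is not modeled.

```python
import re
from urllib.parse import urlsplit

MD_LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def normalize(url):
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    query = "?" + parts.query if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"


def filter_links(markdown, seen_urls):
    """Drop link lines whose URL was never seen; keep one line per URL."""
    allowed = {normalize(u) for u in seen_urls}
    kept_lines, emitted, dropped = [], set(), 0
    for line in markdown.splitlines():
        match = MD_LINK.search(line)
        if match is None:
            kept_lines.append(line)  # headings, summary, blank lines
            continue
        key = normalize(match.group(2))
        if key not in allowed:
            dropped += 1  # invented URL
        elif key not in emitted:
            emitted.add(key)
            kept_lines.append(line)
        # else: duplicate of a link already kept, silently collapsed
    return "\n".join(kept_lines), dropped


draft = (
    "## Docs\n"
    "- [Home](https://example.com/)\n"
    "- [Also home](https://example.com)\n"
    "- [Ghost](https://example.com/made-up)\n"
)
seen = {"https://example.com", "https://example.com/docs"}
cleaned, dropped = filter_links(draft, seen)
print(cleaned)
print(dropped, "invented link(s) dropped")
```

This sketch would leave the "## Docs" heading behind even if every link under it were dropped; removing emptied sections is a separate step in the pipeline described above.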

Same-site only, redirect-aware. Links are constrained to the apex domain and its subdomains. The effective host is taken from where the homepage actually resolved, so apex→www and rebrand redirects are handled correctly.
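
The same-site rule can be sketched as a suffix check against the host the homepage resolved to. `is_same_site` is a hypothetical helper, and treating "strip a leading www." as the way to find the apex is a simplifying assumption; a careful implementation would need something like the Public Suffix List.

```python
from urllib.parse import urlsplit


def is_same_site(url, effective_host):
    """True if url is on the apex domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    apex = effective_host.lower()
    if apex.startswith("www."):
        apex = apex[4:]  # naive apex guess, see the note above
    # The leading dot stops "notstripe.com" from matching "stripe.com".
    return host == apex or host.endswith("." + apex)


print(is_same_site("https://docs.stripe.com/payments", "www.stripe.com"))
print(is_same_site("https://stripe.com.evil.example/", "stripe.com"))
```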

SSRF-safe. Every outbound fetch is screened by _safe_url: non-HTTP schemes, localhost, cloud metadata endpoints, and private / loopback / link-local / reserved IP ranges are all refused. Safe to point at user-supplied domains.
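
For readers unfamiliar with this kind of screening, here is a minimal sketch of the idea using only the standard library. It is not the package's `_safe_url`; `looks_safe`, `is_public_ip`, and the blocked-hostname list are assumptions for illustration.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

# Assumed examples of hostnames to refuse outright.
BLOCKED_HOSTNAMES = {"localhost", "metadata.google.internal"}


def is_public_ip(address):
    """False for private, loopback, link-local, reserved, and similar ranges."""
    ip = ipaddress.ip_address(address)
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def looks_safe(url, resolve=True):
    """Screen a URL before fetching it. Set resolve=False to skip DNS."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    if not host or host in BLOCKED_HOSTNAMES:
        return False
    try:
        return is_public_ip(host)  # host is a literal IP address
    except ValueError:
        pass  # host is a name, not an IP literal
    if not resolve:
        return True
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False
    # Every resolved address must be public; strip any IPv6 scope id first.
    return all(is_public_ip(info[4][0].split("%")[0]) for info in infos)


print(looks_safe("http://169.254.169.254/latest/meta-data/"))
print(looks_safe("https://8.8.8.8/"))
```

A check like this runs before the request, so a hostname could still resolve to a different address when the fetch actually happens; pinning the resolved IP or re-checking after redirects are the usual ways to close that gap.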

CLI reference

llmstxt-gen DOMAIN [options]

  -o, --output FILE       Write the file here instead of stdout.
  -v, --verbose           Print the live discovery/compose trace to stderr.
      --json              Emit the full result (file + stats) as JSON.

  --provider NAME         openai | anthropic | deepseek | together |
                          openrouter | groq | ollama | custom
  --model NAME            Override the provider's default model.
  --base-url URL          OpenAI-compatible base URL (for custom endpoints).
  --api-key KEY           API key (prefer env vars for secrets).

  --max-pages N           Max pages to read for real titles/metas (default 12).
  --no-cold-knowledge     Skip the cold-knowledge prior.
  --version

stdout receives only the llms.txt, so it pipes cleanly; the trace and diagnostics go to stderr.

Library API

from llmstxt_generator import (
    generate_llms_txt,         # sync, returns LlmsTxtResult
    generate_llms_txt_async,   # async, returns LlmsTxtResult
    generate_llms_txt_stream,  # async generator of trace events
    resolve_config,            # build a GeneratorConfig from env/args
    GeneratorConfig,
)

# Override provider/model/tuning inline:
result = generate_llms_txt("stripe.com", provider="deepseek", max_enrich_pages=20)

# Or stream the trace yourself:
import asyncio
async def main():
    async for event in generate_llms_txt_stream("stripe.com"):
        print(event["type"])
asyncio.run(main())

LlmsTxtResult carries content, structure, validation, pages_read, pages_discovered, tokens, cost_usd, elapsed_s, and more.

Development

git clone https://github.com/trakkr-aisearch/llms-txt-generator
cd llms-txt-generator
pip install -e ".[dev]"
pytest          # the test suite is fully offline — no network, no API key

See CONTRIBUTING.md.

License

MIT © Trakkr. See LICENSE.


Made by Trakkr — track and improve how your brand shows up in ChatGPT, Perplexity, Gemini, Google AI Overviews, and Claude. If this tool is useful, Trakkr is the platform behind it.

