Crawl any website and generate llms.txt — the AI-ready site map standard.

llms-generator

llms.txt is a markdown file placed at a website's root (/llms.txt) that helps AI assistants like ChatGPT, Claude, and Perplexity understand your site's content structure. Think of it as robots.txt for AI.

This tool crawls your site, extracts page metadata, groups pages into logical sections, and outputs a spec-compliant llms.txt file.


Why llms.txt?

AI systems struggle to navigate large, noisy websites. An llms.txt file gives them a curated map of your most important content — leading to:

  • Accurate citations in AI-generated responses
  • Better brand representation in ChatGPT, Perplexity, Google AI Overviews
  • Less server load from AI crawlers wandering your site
  • Control over how AI systems reference your content

The llms.txt specification was proposed by Jeremy Howard in 2024 and is actively supported by Perplexity, Anthropic, and other AI platforms.


Installation

pip install llms-generator

For JavaScript-heavy sites (optional):

pip install llms-generator[js]
playwright install chromium

Usage

llms-gen https://example.com

That's it. The tool crawls your site and creates llms.txt in the current directory.

Options

Flag       Default    Description
URL        required   Target website URL
--depth    2          Maximum crawl depth
--output   llms.txt   Output file path
--full     False      Also generate llms-full.txt with full page content
--no-js    False      Skip Playwright JavaScript rendering fallback
--delay    1.0        Seconds between requests (be polite)

Examples

# Basic crawl (2 levels deep)
llms-gen https://example.com

# Crawl deeper, output to custom path
llms-gen https://docs.example.com --depth 3 --output site-llms.txt

# Generate both standard and full versions
llms-gen https://example.com --full

# Fast crawl without JS rendering
llms-gen https://example.com --no-js --delay 0.5

How it works

Per-page robot check

Every page is checked against three layers before being included or followed:

robots.txt ──┬── disallowed? → skip
             └── allowed? ──→ check HTTP X-Robots-Tag header
                                     │
                           noindex? ──→ skip
                           nofollow? ──→ still analyze, don't follow links
                                     │
                           absent ──→ check <meta name="robots">
                                     │
                           noindex? ──→ skip
                           nofollow? ──→ still analyze, don't follow links
                                     │
                           absent/index,follow ──→ analyze + follow links

Pages with noindex are excluded from llms.txt. Pages with nofollow are still analyzed for their content but their child links are not crawled.
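
The three layers above can be sketched in Python roughly as follows. This is a minimal illustration, not the tool's actual code: check_page, parse_directives, and the (include, follow) return shape are hypothetical names chosen for this example. It assumes a noindex page is neither listed nor followed, that the meta tag is still consulted when the header only says nofollow, and it ignores user-agent-scoped X-Robots-Tag values.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def parse_directives(value):
    """Split a directive string such as 'noindex, nofollow' into a set."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


class _RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="...">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").strip().lower() == "robots":
            self.directives |= parse_directives(attr.get("content", ""))


def check_page(url, robots, headers, html, user_agent="*"):
    """Return (include, follow) for one fetched page (hypothetical helper).

    include: the page may be listed in llms.txt
    follow:  the page's links may be crawled
    """
    # Layer 1: robots.txt
    if not robots.can_fetch(user_agent, url):
        return (False, False)

    # Layer 2: X-Robots-Tag HTTP header. A plain dict needs the exact key;
    # a requests response's headers mapping is case-insensitive.
    header = parse_directives(headers.get("X-Robots-Tag", ""))
    if "noindex" in header:
        return (False, False)

    # Layer 3: <meta name="robots">
    meta = _RobotsMetaParser()
    meta.feed(html)
    if "noindex" in meta.directives:
        return (False, False)

    # Assumption: nofollow in either layer stops link crawling.
    follow = "nofollow" not in header and "nofollow" not in meta.directives
    return (True, follow)
```

With a RobotFileParser loaded from the site's robots.txt, each fetched page would be passed through check_page once before it is analyzed or its links are queued.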

Crawl strategy

  1. Parse robots.txt: respect Disallow and Crawl-delay (a missing or inaccessible robots.txt is handled gracefully)
  2. BFS from the start URL up to --depth levels
  3. For each page:
    • Fetch with requests (handles most sites)
    • Skip 4xx/5xx responses, non-HTML content, and pages served with an X-Robots-Tag: noindex header
    • If content is empty (JS-rendered), fall back to Playwright headless browser
    • Extract: <title>, <h1>, <meta name="description">, first meaningful paragraph, directory path
    • Check <meta name="robots">: noindex excludes the page, nofollow prevents link crawling
  4. Group pages into sections (directory-based, with H1 fallback)
  5. Assemble llms.txt per the spec
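
The breadth-first step (step 2) can be sketched as below. This is an illustrative outline under stated assumptions rather than the tool's implementation: crawl and the fetch_links callback are hypothetical names, the same-host restriction is an assumption not stated above, and the real tool would fetch, apply the robot checks, and extract metadata where fetch_links is called.

```python
import time
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(start_url, fetch_links, max_depth=2, delay=0.0):
    """Breadth-first crawl from start_url, at most max_depth levels deep.

    fetch_links(url) is a hypothetical callback returning the hrefs found
    on a page. Returns the visited URLs in visit order.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []

    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: keep the page, do not expand it
        if delay:
            time.sleep(delay)  # pause between requests, like --delay
        for href in fetch_links(url):
            # Resolve relative links and drop #fragments before deduplicating.
            link, _fragment = urldefrag(urljoin(url, href))
            # Assumption: only links on the start URL's host are followed.
            if urlparse(link).netloc != host or link in seen:
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return visited
```

A queue gives the level-by-level order: every page at depth 1 is visited before any page at depth 2, so a low --depth value still covers the pages linked directly from the start URL.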

Performance note: the Playwright browser is launched once, reused for every JS fallback fetch, and closed when the crawl completes.

Section grouping

Pages are grouped into ## sections by their top-level directory path:

/docs/getting-started   → ## Docs
/blog/hello-world       → ## Blog
/api/v1/users           → ## Api

Pages without a clear directory path use their <h1> as the section name.
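
A minimal sketch of this grouping rule, assuming that "a clear directory path" means at least two path segments and that the section name is the capitalized first segment. The function name section_for and the "Pages" fallback for pages with neither a directory nor an <h1> are hypothetical.

```python
from urllib.parse import urlparse


def section_for(url, h1=None, fallback="Pages"):
    """Pick a section name for a page (hypothetical helper)."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    if len(segments) >= 2:
        # /docs/getting-started -> "Docs", /api/v1/users -> "Api"
        return segments[0].capitalize()
    # No directory component: fall back to the page's <h1>, if any.
    return (h1 or "").strip() or fallback


print(section_for("https://example.com/docs/getting-started"))
print(section_for("https://example.com/about", h1="About Us"))
```

Grouping by the first path segment keeps section names stable across crawls, while the <h1> fallback covers top-level pages such as /about.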


Output format

The generated llms.txt follows the llmstxt.org specification:

# Example Site

> A great example site with documentation and blog content.

This file provides AI systems with a structured summary of this website.

## Docs

- [Getting Started](https://example.com/docs/getting-started): How to get started with our platform.
- [API Reference](https://example.com/docs/api): Complete API documentation.

## Blog

- [Hello World](https://example.com/blog/hello): Our first blog post.
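
Output in this shape could be assembled with a function along these lines. It is a sketch only: build_llms_txt and the page dictionary keys (title, url, description) are hypothetical and not taken from the tool's source.

```python
def build_llms_txt(title, summary, sections, intro=None):
    """Render a title, summary, and grouped pages as llms.txt text.

    sections maps a section name to a list of page dicts with the keys
    'title', 'url', and optionally 'description' (hypothetical shape).
    """
    lines = [f"# {title}", "", f"> {summary}", ""]
    if intro:
        lines += [intro, ""]
    for name, pages in sections.items():
        lines += [f"## {name}", ""]
        for page in pages:
            entry = f"- [{page['title']}]({page['url']})"
            if page.get("description"):
                entry += f": {page['description']}"
            lines.append(entry)
        lines.append("")
    # Single trailing newline, no trailing blank lines.
    return "\n".join(lines).rstrip() + "\n"


sections = {
    "Docs": [
        {
            "title": "Getting Started",
            "url": "https://example.com/docs/getting-started",
            "description": "How to get started with our platform.",
        }
    ],
}
print(build_llms_txt("Example Site", "A great example site.", sections))
```

Because Python dictionaries preserve insertion order, sections appear in the file in the order they were first added during the crawl.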

llms-full.txt

With --full, an expanded version is also generated that includes the full text content of each page inline — useful for providing complete context to LLMs in a single file.


Changelog

See CHANGELOG.md for the full release history.
