llms-generator

Crawl any website and generate llms.txt — the AI-ready site map standard.

llms.txt is a markdown file placed at a website's root (/llms.txt) that helps AI assistants like ChatGPT, Claude, and Perplexity understand your site's content structure. Think of it as robots.txt for AI.

This tool crawls your site, extracts page metadata, groups pages into logical sections, and outputs a spec-compliant llms.txt file.

Why llms.txt?

AI systems struggle to navigate large, noisy websites. An llms.txt file gives them a curated map of your most important content, which can lead to:

- More accurate citations in AI-generated responses
- Better brand representation in ChatGPT, Perplexity, and Google AI Overviews
- Less server load from AI crawlers wandering your site
- More control over how AI systems reference your content

The llms.txt specification was proposed by Jeremy Howard in 2024 and has since been adopted by documentation sites such as those of Anthropic and Perplexity.

Installation

pip install llms-generator

For JavaScript-heavy sites (optional):

pip install "llms-generator[js]"
playwright install chromium

Usage

llms-gen https://example.com

That's it. The tool crawls your site and creates llms.txt in the current directory.

Options

| Flag | Default | Description |
|------|---------|-------------|
| `URL` | required | Target website URL |
| `--depth` | `2` | Maximum crawl depth |
| `--output` | `llms.txt` | Output file path |
| `--full` | `False` | Also generate `llms-full.txt` with full page content |
| `--no-js` | `False` | Skip Playwright JavaScript rendering fallback |
| `--delay` | `1.0` | Seconds between requests (be polite) |

Examples

# Basic crawl (2 levels deep)
llms-gen https://example.com

# Crawl deeper, output to custom path
llms-gen https://docs.example.com --depth 3 --output site-llms.txt

# Generate both standard and full versions
llms-gen https://example.com --full

# Fast crawl without JS rendering
llms-gen https://example.com --no-js --delay 0.5

How it works

Per-page robot check

Every page is checked against three layers before being included or followed:

robots.txt ──┬── disallowed? → skip
             └── allowed? ──→ check HTTP X-Robots-Tag header
                                     │
                           noindex? ──→ skip
                           nofollow? ──→ still analyze, don't follow links
                                     │
                           absent ──→ check <meta name="robots">
                                     │
                           noindex? ──→ skip
                           nofollow? ──→ still analyze, don't follow links
                                     │
                           absent/index,follow ──→ analyze + follow links

Pages with noindex are excluded from llms.txt. Pages with nofollow are still analyzed for their content but their child links are not crawled.
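The three layers above can be sketched with the standard library alone. This is a minimal illustration, not the package's actual code: `check_page`, its arguments, and the `llms-generator` user-agent string are hypothetical. It also assumes that header and meta directives are merged, that a `noindex` page is not followed either, and that `headers` is a plain dict with the exact key `X-Robots-Tag`.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def _split(value):
    """Turn 'noindex, nofollow' into {'noindex', 'nofollow'}."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


class _MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="...">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives |= _split(attr.get("content", ""))


def check_page(url, robots_txt, headers, html, user_agent="llms-generator"):
    """Return (include, follow) for one page. Hypothetical helper."""
    # Layer 1: robots.txt
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(user_agent, url):
        return (False, False)

    # Layer 2: X-Robots-Tag response header
    directives = _split(headers.get("X-Robots-Tag", ""))

    # Layer 3: <meta name="robots">, merged with the header directives
    meta = _MetaRobotsParser()
    meta.feed(html)
    directives |= meta.directives

    include = "noindex" not in directives
    follow = include and "nofollow" not in directives
    return (include, follow)


robots = "User-agent: *\nDisallow: /private/"
print(check_page("https://example.com/docs/", robots, {"X-Robots-Tag": "nofollow"}, "<html></html>"))
```

A real crawler would also need to handle user-agent-scoped `X-Robots-Tag` values (for example `googlebot: noindex`), which this sketch ignores.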

Crawl strategy

1. Parse robots.txt and respect Disallow and Crawl-delay. A missing or restricted robots.txt is handled gracefully by falling back to allow-all.
2. Run a breadth-first search (BFS) from the start URL, up to --depth levels.
3. For each page:
   - Fetch with requests (handles most sites)
   - Skip 4xx/5xx responses, non-HTML content, and pages served with X-Robots-Tag: noindex
   - If the content is empty (JS-rendered), fall back to a Playwright headless browser
   - Extract <title>, <h1>, <meta name="description">, the first meaningful paragraph, and the directory path
   - Check <meta name="robots">: noindex excludes the page, nofollow prevents link crawling
4. Group pages into sections (directory-based, with H1 fallback)
5. Assemble llms.txt per the spec
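The depth-limited BFS at the heart of this strategy might look roughly like the sketch below. It is an illustration under stated assumptions, not the tool's implementation: `crawl_bfs` and the `get_links` callback are hypothetical names, the start URL is assumed to count as depth 0, and an in-memory dictionary stands in for real HTTP fetches.

```python
from collections import deque


def crawl_bfs(start_url, get_links, max_depth=2):
    """Visit pages breadth-first, never going deeper than max_depth.

    get_links(url) is a hypothetical callback returning the URLs
    linked from a page.
    """
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # analyze the page, but do not follow its links
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# A tiny in-memory "site" standing in for real HTTP fetches.
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/start", "/"],
    "/blog": ["/blog/hello"],
    "/docs/start": ["/docs/start/deep"],
}

pages = crawl_bfs("/", lambda url: SITE.get(url, []), max_depth=2)
print(pages)
# → ['/', '/docs', '/blog', '/docs/start', '/blog/hello']
```

With `max_depth=2`, `/docs/start/deep` is never visited because it sits three links away from the start URL.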

Performance note: the Playwright browser is launched once, reused for every JS fallback fetch, and closed when the crawl completes.

Section grouping

Pages are grouped into ## sections by their top-level directory path:

/docs/getting-started   → ## Docs
/blog/hello-world       → ## Blog
/api/v1/users           → ## Api

Pages without a clear directory path use their <h1> as the section name.
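As a rough sketch of this rule, the function below derives a section name from a URL. The name `section_for`, the `"Pages"` default, and the reading of "no clear directory path" as "fewer than two path segments" are all assumptions made for illustration; the package may decide differently.

```python
from urllib.parse import urlparse


def section_for(url, h1=None, default="Pages"):
    """Pick a section name from the top-level directory, else the <h1>."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) >= 2:
        # /docs/getting-started → "Docs"
        return segments[0].capitalize()
    # No directory in the path: fall back to the page's <h1>, if any.
    if h1 and h1.strip():
        return h1.strip()
    return default


print(section_for("https://example.com/api/v1/users"))
# → Api
print(section_for("https://example.com/about", h1="About Us"))
# → About Us
```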

Output format

The generated llms.txt follows the llmstxt.org specification:

# Example Site

> A great example site with documentation and blog content.

This file provides AI systems with a structured summary of this website.

## Docs

- [Getting Started](https://example.com/docs/getting-started): How to get started with our platform.
- [API Reference](https://example.com/docs/api): Complete API documentation.

## Blog

- [Hello World](https://example.com/blog/hello): Our first blog post.
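A file with the shape shown above could be assembled along these lines. `render_llms_txt` and its input structure (a dict mapping section names to `(title, url, description)` tuples) are hypothetical and chosen only to keep the sketch short; it omits the optional free-text paragraph that follows the summary.

```python
def render_llms_txt(site_title, summary, sections):
    """Build llms.txt text from a title, a summary, and grouped pages."""
    lines = [f"# {site_title}", "", f"> {summary}", ""]
    for name, pages in sections.items():
        lines += [f"## {name}", ""]
        for title, url, description in pages:
            entry = f"- [{title}]({url})"
            if description:
                entry += f": {description}"
            lines.append(entry)
        lines.append("")
    # Collapse trailing blank lines into a single final newline.
    return "\n".join(lines).rstrip() + "\n"


text = render_llms_txt(
    "Example Site",
    "A great example site.",
    {
        "Docs": [("Getting Started", "https://example.com/docs/getting-started", "How to get started.")],
        "Blog": [("Hello World", "https://example.com/blog/hello", "")],
    },
)
print(text)
```

Entries without a description are written as a bare link, so the colon only appears when there is text to follow it.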

llms-full.txt

With --full, an expanded version is also generated that includes the full text content of each page inline — useful for providing complete context to LLMs in a single file.

Changelog

v0.1.3 (2026-06-06)

Fixed: Duplicate page entries in llms.txt caused by trailing-slash variants and http/https scheme variants - URLs are now normalized before deduplication
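One way to normalize URLs before deduplication is sketched below. It is not the package's code: `dedup_key` and `deduplicate` are hypothetical names, and the choice to ignore the scheme entirely, lowercase the host, and drop fragments is an assumption about what "normalized" means here.

```python
from urllib.parse import urlsplit


def dedup_key(url):
    """Build a comparison key that ignores scheme, fragment, and trailing slash."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path.rstrip("/") or "/"
    key = host + path
    if parts.query:
        key += "?" + parts.query
    return key


def deduplicate(urls):
    """Keep the first URL seen for each normalized key, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = dedup_key(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


urls = [
    "http://example.com/docs",
    "https://example.com/docs/",
    "https://Example.com/docs#intro",
    "https://example.com/blog",
]
print(deduplicate(urls))
# → ['http://example.com/docs', 'https://example.com/blog']
```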

v0.1.2 (2026-06-06)

Fixed: USER_AGENT now reads from __version__ — stays in sync automatically
Fixed: X-Robots-Tag: nofollow is now respected — header-level and meta-level directives are merged
Fixed: Playwright browser instance properly cleaned up on launch failure (resource leak)
Fixed: requests.Session is now explicitly closed after crawl

v0.1.1 (2026-06-06)

Fixed: robots.txt returning 403/blocked no longer kills the entire crawl — gracefully falls back to allow-all
Fixed: --full flag now generates separate llms.txt (summary) and llms-full.txt (full content) as specified
Fixed: URL fragment stripping no longer corrupts paths (str.rstrip → proper split)
Fixed: <h1> text no longer overrides URL-path-based section grouping
Fixed: Playwright fallback no longer triggered on 404/500 errors — only on empty JS-rendered content
Optimized: Playwright browser instance reused across all JS fallback fetches (was launching/closing per-page)
Optimized: HTML parsed once per page instead of three times (directives, metadata, link extraction)
Fixed: requirements.txt no longer forces Playwright install (matches pyproject.toml optional-dep spec)
Removed: Dead isinstance(href, (list, tuple)) branch and unused regex

Development

git clone https://github.com/aouwalitshikkha/llms-generator.git
cd llms-generator
pip install -e .
pip install -e ".[js]"   # with Playwright support

Run tests:

pip install pytest
pytest

License

MIT


