Crawl any website and generate llms.txt — the AI-ready site map standard.
llms-generator
llms.txt is a markdown file placed at a website's root (/llms.txt) that helps AI assistants like ChatGPT, Claude, and Perplexity understand your site's content structure. Think of it as robots.txt for AI.
This tool crawls your site, extracts page metadata, groups pages into logical sections, and outputs a spec-compliant llms.txt file.
Why llms.txt?
AI systems struggle to navigate large, noisy websites. An llms.txt file gives them a curated map of your most important content — leading to:
- Accurate citations in AI-generated responses
- Better brand representation in ChatGPT, Perplexity, Google AI Overviews
- Less server load from AI crawlers wandering your site
- Control over how AI systems reference your content
The llms.txt specification was proposed by Jeremy Howard in 2024 and is actively supported by Perplexity, Anthropic, and other AI platforms.
Installation
pip install llms-generator
For JavaScript-heavy sites (optional):
pip install "llms-generator[js]"
playwright install chromium
Usage
llms-gen https://example.com
That's it. The tool crawls your site and creates llms.txt in the current directory.
Options
| Flag | Default | Description |
|---|---|---|
| URL | required | Target website URL |
| --depth | 2 | Maximum crawl depth |
| --output | llms.txt | Output file path |
| --full | False | Also generate llms-full.txt with full page content |
| --no-js | False | Skip Playwright JavaScript rendering fallback |
| --delay | 1.0 | Seconds between requests (be polite) |
Examples
# Basic crawl (2 levels deep)
llms-gen https://example.com
# Crawl deeper, output to custom path
llms-gen https://docs.example.com --depth 3 --output site-llms.txt
# Generate both standard and full versions
llms-gen https://example.com --full
# Fast crawl without JS rendering
llms-gen https://example.com --no-js --delay 0.5
How it works
Per-page robot check
Every page is checked against three layers before being included or followed:
robots.txt ──┬── disallowed? → skip
             └── allowed? ──→ check HTTP X-Robots-Tag header
                                │
                                noindex? ──→ skip
                                nofollow? ──→ still analyze, don't follow links
                                │
                                absent ──→ check <meta name="robots">
                                             │
                                             noindex? ──→ skip
                                             nofollow? ──→ still analyze, don't follow links
                                             │
                                             absent/index,follow ──→ analyze + follow links
Pages with noindex are excluded from llms.txt. Pages with nofollow are still analyzed for their content but their child links are not crawled.
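The header and meta layers of this check can be approximated with the standard library alone. The following is a minimal sketch, not the tool's actual code: `page_directives` and its helpers are hypothetical names, robots.txt is assumed to have been consulted already, and user-agent-scoped header values such as `googlebot: noindex` are not handled.

```python
from html.parser import HTMLParser


def _split(value):
    """Turn 'noindex, nofollow' into {'noindex', 'nofollow'}."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


class _MetaRobotsParser(HTMLParser):
    """Collects directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives |= _split(attr.get("content", ""))


def page_directives(x_robots_tag, html):
    """Merge header-level and meta-level directives.

    Returns (include_page, follow_links).
    """
    parser = _MetaRobotsParser()
    parser.feed(html)
    merged = _split(x_robots_tag or "") | parser.directives
    return "noindex" not in merged, "nofollow" not in merged
```

For example, `page_directives("nofollow", "<html></html>")` keeps the page but tells the crawler not to follow its links, while a `noindex` in either layer excludes the page.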
Crawl strategy
- Parse robots.txt: respect Disallow and Crawl-Delay (gracefully handles missing or restricted robots.txt)
- BFS from the start URL up to --depth levels
- For each page:
  - Fetch with requests (handles most sites)
  - Skip 4xx/5xx responses, non-HTML content, and X-Robots-Tag: noindex
  - If content is empty (JS-rendered), fall back to Playwright headless browser
  - Extract: <title>, <h1>, <meta name="description">, first meaningful paragraph, directory path
  - Check <meta name="robots">: noindex excludes the page, nofollow prevents link crawling
- Group pages into sections (directory-based, with H1 fallback)
- Assemble llms.txt per the spec
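The robots.txt and BFS steps above can be sketched as follows. This is an illustration under stated assumptions rather than the tool's implementation: `bfs_crawl`, `get_links`, and `is_allowed` are hypothetical names, the site is an in-memory dictionary instead of real HTTP requests, and a depth of 2 is taken to mean "two link hops from the start URL".

```python
from collections import deque
from urllib.robotparser import RobotFileParser


def bfs_crawl(start_url, max_depth, get_links, is_allowed):
    """Visit pages breadth-first, never going deeper than max_depth.

    get_links(url) -> iterable of URLs found on that page (hypothetical hook).
    is_allowed(url) -> bool, for example backed by robots.txt.
    Returns the URLs in the order they were visited.
    """
    visited = []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if not is_allowed(url):
            continue
        visited.append(url)
        if depth >= max_depth:
            continue  # analyze the page, but do not follow its links
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Parse an inline robots.txt instead of fetching one over the network.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
])

# Hypothetical link graph standing in for a real website.
site = {
    "https://example.com/": [
        "https://example.com/docs/",
        "https://example.com/private/x",
    ],
    "https://example.com/docs/": ["https://example.com/docs/a"],
    "https://example.com/docs/a": ["https://example.com/docs/a/deep"],
}

pages = bfs_crawl(
    "https://example.com/",
    max_depth=2,
    get_links=lambda url: site.get(url, []),
    is_allowed=lambda url: robots.can_fetch("llms-generator", url),
)
```

Here `/private/x` is dropped by the robots.txt rule and `/docs/a/deep` is never queued because its parent already sits at the depth limit. A real crawler would also sleep between requests, using `robots.crawl_delay(...)` or the `--delay` value.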
Performance note: Playwright browser is launched once and reused across all JS fallback fetches, then cleaned up when the crawl completes.
Section grouping
Pages are grouped into ## sections by their top-level directory path:
/docs/getting-started → ## Docs
/blog/hello-world → ## Blog
/api/v1/users → ## Api
Pages without a clear directory path use their <h1> as the section name.
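One way to express this rule in code is shown below. It is a sketch only: `section_for` is a hypothetical name, "a clear directory path" is assumed to mean at least two path segments, and the `"Pages"` fallback for pages with no `<h1>` is an invented default.

```python
from urllib.parse import urlparse


def section_for(url, h1=None):
    """Return the section heading for a page.

    Uses the first path segment when the page sits inside a directory,
    otherwise falls back to the page's <h1> text.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) >= 2:
        return segments[0].capitalize()
    return h1 or "Pages"  # assumed default when no <h1> is available
```

With this rule, `section_for("https://example.com/api/v1/users")` yields `Api`, matching the mapping above, and `section_for("https://example.com/about", h1="About Us")` yields `About Us`.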
Output format
The generated llms.txt follows the llmstxt.org specification:
# Example Site
> A great example site with documentation and blog content.
This file provides AI systems with a structured summary of this website.
## Docs
- [Getting Started](https://example.com/docs/getting-started): How to get started with our platform.
- [API Reference](https://example.com/docs/api): Complete API documentation.
## Blog
- [Hello World](https://example.com/blog/hello): Our first blog post.
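Assembling a file in this shape takes little more than string formatting. The sketch below assumes the pages have already been grouped; `render_llms_txt` and its parameters are hypothetical names, not the package's API.

```python
def render_llms_txt(title, summary, sections, intro=None):
    """Build llms.txt text: H1 title, blockquote summary, then H2 sections.

    sections maps a section name to a list of
    (page_title, url, description) tuples.
    """
    lines = [f"# {title}", "", f"> {summary}", ""]
    if intro:
        lines += [intro, ""]
    for name, pages in sections.items():
        lines += [f"## {name}", ""]
        for page_title, url, description in pages:
            entry = f"- [{page_title}]({url})"
            if description:
                entry += f": {description}"
            lines.append(entry)
        lines.append("")
    # Collapse trailing blank lines into a single final newline.
    return "\n".join(lines).rstrip() + "\n"
```

Passing a dictionary such as `{"Docs": [("Getting Started", "https://example.com/docs/getting-started", "How to get started with our platform.")]}` produces one `## Docs` section with a single link entry.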
llms-full.txt
With --full, an expanded version is also generated that includes the full text content of each page inline — useful for providing complete context to LLMs in a single file.
Changelog
v0.1.3 (2026-06-06)
- Fixed: Duplicate page entries in llms.txt caused by trailing-slash variants and http/https scheme variants - URLs are now normalized before deduplication
v0.1.2 (2026-06-06)
- Fixed: USER_AGENT now reads from __version__, so it stays in sync automatically
- Fixed: X-Robots-Tag: nofollow is now respected; header-level and meta-level directives are merged
- Fixed: Playwright browser instance properly cleaned up on launch failure (resource leak)
- Fixed: requests.Session is now explicitly closed after crawl
v0.1.1 (2026-06-06)
- Fixed: robots.txt returning 403/blocked no longer kills the entire crawl — gracefully falls back to allow-all
- Fixed: --full flag now generates separate llms.txt (summary) and llms-full.txt (full content) as specified
- Fixed: URL fragment stripping no longer corrupts paths (str.rstrip → proper split)
- Fixed: <h1> text no longer overrides URL-path-based section grouping
- Fixed: Playwright fallback no longer triggered on 404/500 errors, only on empty JS-rendered content
- Optimized: Playwright browser instance reused across all JS fallback fetches (was launching/closing per-page)
- Optimized: HTML parsed once per page instead of three times (directives, metadata, link extraction)
- Fixed: requirements.txt no longer forces Playwright install (matches pyproject.toml optional-dep spec)
- Removed: Dead isinstance(href, (list, tuple)) branch and unused regex
Development
git clone https://github.com/aouwalitshikkha/llms-generator.git
cd llms-generator
pip install -e .
pip install -e ".[js]" # with Playwright support
Run tests:
pip install pytest
pytest
License
MIT