Crawl any website and generate llms.txt — the AI-ready site map standard.
llms-generator
Crawl any website and generate llms.txt, a Markdown file at your domain root that tells AI systems which pages to read.
Homepage: https://abdulaouwal.com/project/llms-generator/
Documentation: https://llms-generator.readthedocs.io/
Why llms.txt?
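ChatGPT, Claude, and Gemini have limited context windows, so they cannot read your entire website at once. llms.txt lists your essential pages in a single file they can fetch in one request.
The llms.txt specification was proposed by Jeremy Howard in September 2024. It works alongside robots.txt and sitemap.xml rather than replacing them: robots.txt controls crawler access, sitemap.xml lists URLs for search engines, and llms.txt curates the pages worth reading for language models.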
Installation
pip install llms-generator
For JavaScript-heavy sites:
pip install llms-generator[js]
playwright install chromium
Usage
llms-gen https://example.com
This creates llms.txt in your current folder. Add --full to also generate llms-full.txt, the full-content version.
Options
| Flag | Default | Description |
|---|---|---|
| URL | required | Target website URL |
| --depth | 2 | Maximum crawl depth |
| --output | llms.txt | Output file path |
| --full | false | Also generate llms-full.txt |
| --no-js | false | Skip Playwright JS fallback |
| --delay | 1.0 | Seconds between requests |
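To make the defaults concrete, here is a minimal sketch of an argparse parser that mirrors the flags listed above. It is an illustration only: the function name `build_parser` is hypothetical, and the real llms-gen entry point may be built differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="llms-gen")
    parser.add_argument("url", help="Target website URL")
    parser.add_argument("--depth", type=int, default=2, help="Maximum crawl depth")
    parser.add_argument("--output", default="llms.txt", help="Output file path")
    parser.add_argument("--full", action="store_true", help="Also generate llms-full.txt")
    parser.add_argument("--no-js", action="store_true", help="Skip Playwright JS fallback")
    parser.add_argument("--delay", type=float, default=1.0, help="Seconds between requests")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["https://example.com", "--depth", "3", "--no-js"])
print(args.url, args.depth, args.output, args.full, args.no_js, args.delay)
```

Flags that are omitted keep the defaults from the table, so here `--output` stays `llms.txt` and `--delay` stays `1.0`.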
Examples
llms-gen https://example.com
llms-gen https://example.com --depth 3 --output site-llms.txt
llms-gen https://example.com --full
llms-gen https://example.com --no-js --delay 0.5
How it works
Robot checks
Every page passes three checks:
- robots.txt - skips disallowed paths
- X-Robots-Tag - respects noindex and nofollow from HTTP headers
- <meta name="robots"> - respects page-level directives
Pages marked noindex are excluded. Pages marked nofollow are analyzed but their links are not crawled.
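The three checks can be sketched with the Python standard library alone. This is an approximation under stated assumptions, not the tool's actual code: the names `robot_directives` and `decide` are hypothetical, the user agent defaults to `*`, and user-agent-prefixed X-Robots-Tag values (such as `googlebot: noindex`) are not handled.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def _split(value: str) -> set[str]:
    """Turn 'NoIndex, nofollow' into {'noindex', 'nofollow'}."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


class _MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives |= _split(attr.get("content", ""))


def robot_directives(headers: dict[str, str], html: str) -> set[str]:
    """Merge directives from the X-Robots-Tag header and the meta tag."""
    parser = _MetaRobotsParser()
    parser.feed(html)
    header_value = next(
        (v for k, v in headers.items() if k.lower() == "x-robots-tag"), ""
    )
    return _split(header_value) | parser.directives


def decide(robots_txt: str, url: str, headers: dict[str, str], html: str,
           user_agent: str = "*") -> dict[str, bool]:
    """Hypothetical helper: combine all three checks into one decision."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    # In a real crawl this check would run before the page is fetched.
    if not rules.can_fetch(user_agent, url):
        return {"fetch": False, "include": False, "follow_links": False}
    directives = robot_directives(headers, html)
    return {
        "fetch": True,
        "include": "noindex" not in directives,        # noindex: leave out of llms.txt
        "follow_links": "nofollow" not in directives,  # nofollow: do not crawl its links
    }


robots = "User-agent: *\nDisallow: /private/"
print(decide(robots, "https://example.com/docs/", {"X-Robots-Tag": "noindex"}, ""))
```

Keeping `include` and `follow_links` as separate flags matches the behavior described above: a nofollow page can still appear in the output, while its outgoing links are dropped from the crawl queue.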
Crawl steps
- Parse robots.txt. Handles missing or restricted files.
- BFS from the start URL up to --depth levels.
- For each page:
  - Fetch with requests
  - Skip 4xx/5xx responses and non-HTML content
  - Fall back to Playwright if content is empty (JS-rendered sites)
  - Extract title, h1, meta description, first paragraph, directory path
- Group pages by top-level directory path
- Write llms.txt
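The depth-limited BFS at the heart of these steps can be sketched as follows. To stay runnable without network access, link discovery is passed in as a callable and the example uses an in-memory link graph. The function name `crawl_bfs` is hypothetical, and treating the start URL as depth 0 is an assumption about how `--depth` is counted.

```python
from collections import deque
from typing import Callable, Iterable


def crawl_bfs(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
) -> list[tuple[str, int]]:
    """Visit pages breadth-first, returning (url, depth) in visit order."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited: list[tuple[str, int]] = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # pages at the depth limit are recorded but not expanded
        for link in get_links(url):
            if link not in seen:  # each URL is queued at most once
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Stand-in for a real site: maps each page to the links found on it.
site = {
    "/": ["/docs/", "/blog/"],
    "/docs/": ["/docs/getting-started", "/"],
    "/blog/": ["/blog/hello"],
    "/docs/getting-started": ["/docs/deep"],
}
pages = crawl_bfs("/", lambda url: site.get(url, []), max_depth=2)
print(pages)
```

With `max_depth=2`, `/docs/deep` is never visited because it sits three links away from the start page. In a real crawl, `get_links` would fetch the page, apply the robot checks, and wait for the configured delay between requests.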
Section grouping
/docs/getting-started -> ## Docs
/blog/hello-world -> ## Blog
/api/v1/users -> ## Api
Pages without a directory path use their <h1> as the section name.
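The mapping above can be approximated in a few lines. This sketch assumes that a page "has a directory path" when its URL path contains more than one segment, and it falls back to a hypothetical `"Pages"` label when no `<h1>` is available. Neither detail is confirmed by the tool's documentation.

```python
from urllib.parse import urlparse


def section_name(url: str, h1: str = "") -> str:
    """Derive a section heading from the top-level directory of a URL."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) >= 2:
        # /docs/getting-started -> "Docs", /api/v1/users -> "Api"
        return segments[0].capitalize()
    # Assumption: single-segment and root pages use their <h1> instead.
    return h1 or "Pages"


print(section_name("https://example.com/docs/getting-started"))
print(section_name("https://example.com/api/v1/users"))
print(section_name("https://example.com/about", h1="About Us"))
```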
Output
The generated llms.txt follows the llmstxt.org specification:
# Example Site
> A great example site with documentation and blog content.
## Docs
- [Getting Started](https://example.com/docs/getting-started): How to get started.
- [API Reference](https://example.com/docs/api): Complete API documentation.
## Blog
- [Hello World](https://example.com/blog/hello): Our first blog post.
With --full, the tool also writes llms-full.txt with every page's full text under section headings.
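The llms.txt layout shown above (title, blockquote summary, then one heading per section with a link list) can be rendered with a short function. This is a sketch of the format only. The name `render_llms_txt` and the input shape are hypothetical, and the blank lines between blocks are an assumption about the tool's exact output.

```python
def render_llms_txt(
    title: str,
    summary: str,
    sections: dict[str, list[tuple[str, str, str]]],
) -> str:
    """Render sections of (name, url, description) entries as llms.txt text."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    for section, pages in sections.items():
        lines.append(f"## {section}")
        lines.append("")
        for name, url, description in pages:
            entry = f"- [{name}]({url})"
            if description:  # the description after the colon is optional
                entry += f": {description}"
            lines.append(entry)
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


text = render_llms_txt(
    "Example Site",
    "A great example site with documentation and blog content.",
    {
        "Docs": [
            ("Getting Started", "https://example.com/docs/getting-started", "How to get started."),
            ("API Reference", "https://example.com/docs/api", "Complete API documentation."),
        ],
        "Blog": [
            ("Hello World", "https://example.com/blog/hello", "Our first blog post."),
        ],
    },
)
print(text)
```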
Contributing
- Fork the repo and clone it
- Create a branch: git checkout -b my-change
- Install for development: pip install -e .
- Run tests: python -m pytest tests/
- Push and open a pull request
Keep PRs focused. One change per PR. Write a clear description of what you changed and why.
Changelog
See CHANGELOG.md for the full release history.