Crawl any website and generate llms.txt - the AI-ready site map standard.

llms-generator

Crawl any website and generate llms.txt, a Markdown file at your domain root that tells AI systems which pages to read.

Homepage: https://abdulaouwal.com/project/llms-generator/
Documentation: https://llms-generator.readthedocs.io/


Why llms.txt?

ChatGPT, Claude, and Gemini have limited context windows, so they cannot read your entire website at once. llms.txt points them to your essential pages in a single request.

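The llms.txt specification was proposed by Jeremy Howard in September 2024. It works alongside robots.txt and sitemap.xml, and each serves a different purpose: robots.txt tells crawlers which paths they may access, sitemap.xml lists URLs for search engines to index, and llms.txt gives language models a curated overview of your most important pages.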


Installation

pip install llms-generator

For JavaScript-heavy sites:

pip install llms-generator[js]
playwright install chromium

Usage

llms-gen https://example.com

This creates llms.txt in your current folder. Add --full to also generate llms-full.txt, the full content version.

Options

Flag       Default    Description
URL        required   Target website URL
--depth    2          Maximum crawl depth
--output   llms.txt   Output file path
--full     false      Also generate llms-full.txt
--no-js    false      Skip Playwright JS fallback
--delay    1.0        Seconds between requests
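
To make the defaults in the table concrete, here is a minimal argparse sketch that mirrors them. It is not the tool's actual source: the function name build_parser is hypothetical, and only the flag names and default values come from the table above.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(prog="llms-gen")
    parser.add_argument("url", help="Target website URL")
    parser.add_argument("--depth", type=int, default=2, help="Maximum crawl depth")
    parser.add_argument("--output", default="llms.txt", help="Output file path")
    parser.add_argument("--full", action="store_true", help="Also generate llms-full.txt")
    parser.add_argument("--no-js", action="store_true", help="Skip Playwright JS fallback")
    parser.add_argument("--delay", type=float, default=1.0, help="Seconds between requests")
    return parser


if __name__ == "__main__":
    # With no extra flags, every option falls back to its documented default.
    print(build_parser().parse_args(["https://example.com"]))
```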

Examples

llms-gen https://example.com
llms-gen https://example.com --depth 3 --output site-llms.txt
llms-gen https://example.com --full
llms-gen https://example.com --no-js --delay 0.5

How it works

Robot checks

Every page passes three checks:

  1. robots.txt - skips disallowed paths
  2. X-Robots-Tag - respects noindex and nofollow from HTTP headers
  3. <meta name="robots"> - respects page-level directives

Pages marked noindex are excluded. Pages marked nofollow are analyzed but their links are not crawled.
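
The three checks can be combined into a single decision per page. The sketch below uses only the standard library and is an illustration, not the package's internals: robot_decision and its helpers are hypothetical names, headers is assumed to be a plain dict, and only the noindex and nofollow directives are handled.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def _split_directives(value):
    """Turn 'noindex, nofollow' into {'noindex', 'nofollow'}."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


class _MetaRobotsParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives |= _split_directives(attr.get("content", ""))


def robot_decision(url, robots, headers, html, agent="*"):
    """Return (include_page, follow_links) for one page.

    robots is a parsed RobotFileParser; headers and html come from the response.
    """
    # Check 1: robots.txt disallows the path entirely.
    if not robots.can_fetch(agent, url):
        return False, False
    # Check 2: directives sent in the X-Robots-Tag HTTP header.
    directives = _split_directives(headers.get("X-Robots-Tag", ""))
    # Check 3: directives in the page-level <meta name="robots"> tag.
    parser = _MetaRobotsParser()
    parser.feed(html)
    directives |= parser.directives
    return "noindex" not in directives, "nofollow" not in directives
```

A page with a nofollow directive returns (True, False): it is still listed, but its links are not queued for crawling.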

Crawl steps

  1. Parse robots.txt. Handles missing or restricted files.
  2. BFS from the start URL up to --depth levels.
  3. For each page:
    • Fetch with requests
    • Skip 4xx/5xx responses and non-HTML content
    • Fall back to Playwright if content is empty (JS-rendered sites)
    • Extract title, h1, meta description, first paragraph, directory path
  4. Group pages by top-level directory path
  5. Write llms.txt
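
The breadth-first traversal in step 2 can be sketched as follows. This is a simplified illustration under stated assumptions rather than the tool's implementation: bfs_crawl is a hypothetical name, fetch_links is a hypothetical callback standing in for the fetch-and-extract work of step 3, and restricting the crawl to the start URL's host is an assumption the steps above do not state.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def bfs_crawl(start_url, fetch_links, max_depth=2):
    """Visit pages breadth-first and return {url: depth} in visit order.

    fetch_links(url) is a hypothetical callback returning the hrefs on a page.
    """
    start_host = urlparse(start_url).netloc
    seen = {start_url}
    visited = {}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        visited[url] = depth
        if depth >= max_depth:
            continue  # Pages at the depth limit are recorded but not expanded.
        for href in fetch_links(url):
            # Resolve relative links and drop #fragments before deduplicating.
            link, _fragment = urldefrag(urljoin(url, href))
            # Assumption: stay on the starting host.
            if urlparse(link).netloc != start_host or link in seen:
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return visited
```

Passing the link extractor as a callback keeps the traversal testable without network access. A real crawl would also sleep for --delay seconds between requests.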

Section grouping

/docs/getting-started   -> ## Docs
/blog/hello-world       -> ## Blog
/api/v1/users           -> ## Api

Pages without a directory path use their <h1> as the section name.
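
One way to read the grouping rule is shown below. It is a sketch, not the package's code: section_name is a hypothetical helper, "directory path" is assumed to mean everything before the final path segment, and the "Pages" fallback for pages that lack both a directory and an <h1> is an assumption.

```python
import posixpath
from urllib.parse import urlparse


def section_name(url, h1=""):
    """Derive a section heading from a page URL, falling back to its <h1>."""
    # Everything before the last path segment, e.g. "/api/v1" for "/api/v1/users".
    directory = posixpath.dirname(urlparse(url).path)
    top = directory.strip("/").split("/")[0]
    if top:
        return top.capitalize()
    # No directory in the path: use the <h1>, or a placeholder (assumption).
    return h1.strip() or "Pages"


print(section_name("https://example.com/docs/getting-started"))
print(section_name("https://example.com/about", h1="About Us"))
```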


Output

The generated llms.txt follows the llmstxt.org specification:

# Example Site
> A great example site with documentation and blog content.

## Docs
- [Getting Started](https://example.com/docs/getting-started): How to get started.
- [API Reference](https://example.com/docs/api): Complete API documentation.

## Blog
- [Hello World](https://example.com/blog/hello): Our first blog post.

With --full, the tool also writes llms-full.txt with every page's full text under section headings.
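
For illustration, the llms.txt layout shown above can be produced from grouped page data with a small function. render_llms_txt and its input shape are hypothetical and only mirror the sample output; they are not the tool's API.

```python
def render_llms_txt(title, summary, sections):
    """Render llms.txt text.

    sections maps a section name to a list of (page_title, url, description).
    """
    lines = [f"# {title}", f"> {summary}"]
    for name, pages in sections.items():
        lines += ["", f"## {name}"]
        for page_title, url, description in pages:
            entry = f"- [{page_title}]({url})"
            if description:
                entry += f": {description}"
            lines.append(entry)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_llms_txt(
        "Example Site",
        "A great example site with documentation and blog content.",
        {"Blog": [("Hello World", "https://example.com/blog/hello", "Our first blog post.")]},
    ))
```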


Contributing

  1. Fork the repo and clone it
  2. Create a branch: git checkout -b my-change
  3. Install for development: pip install -e .
  4. Run tests: python -m pytest tests/
  5. Push and open a pull request

Keep PRs focused. One change per PR. Write a clear description of what you changed and why.


Changelog

See CHANGELOG.md for the full release history.
