Crawl any website and generate llms.txt — the AI-ready site map standard.

llms-generator

Crawl any website and generate llms.txt, a Markdown file at your domain root that tells AI systems which pages to read.

Homepage: https://abdulaouwal.com/project/llms-generator/
Documentation: https://llms-generator.readthedocs.io/


Why llms.txt?

ChatGPT, Claude, and Gemini have limited context windows and cannot read an entire website in one pass. llms.txt points them to your essential pages in one request.

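The llms.txt specification was proposed by Jeremy Howard in September 2024. It works alongside robots.txt and sitemap.xml, and each serves a different purpose: robots.txt tells crawlers which paths they may fetch, sitemap.xml lists the URLs available for indexing, and llms.txt gives language models a curated overview of the key pages.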


Installation

pip install llms-generator

For JavaScript-heavy sites:

pip install "llms-generator[js]"
playwright install chromium

Usage

llms-gen https://example.com

This creates llms.txt in your current folder. Add --full to also generate llms-full.txt, the full-content version.

Options

Flag      Default   Description
URL       required  Target website URL
--depth   2         Maximum crawl depth
--output  llms.txt  Output file path
--full    false     Also generate llms-full.txt
--no-js   false     Skip Playwright JS fallback
--delay   1.0       Seconds between requests
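
To make the defaults in the table concrete, here is a minimal sketch of how such flags could be declared with argparse. It is an illustration only, not the tool's actual source: the function name `build_parser` is hypothetical, and the assumption is that `--full` and `--no-js` are plain on/off switches.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="llms-gen")
    parser.add_argument("url", help="Target website URL")
    parser.add_argument("--depth", type=int, default=2, help="Maximum crawl depth")
    parser.add_argument("--output", default="llms.txt", help="Output file path")
    # Boolean switches: absent means False, present means True.
    parser.add_argument("--full", action="store_true", help="Also generate llms-full.txt")
    parser.add_argument("--no-js", action="store_true", help="Skip Playwright JS fallback")
    parser.add_argument("--delay", type=float, default=1.0, help="Seconds between requests")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["https://example.com", "--depth", "3", "--no-js"])
print(args.url, args.depth, args.output, args.full, args.no_js, args.delay)
```

Flags that are left out fall back to the defaults from the table, so only `--depth` and `--no-js` differ from the defaults in this run.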

Examples

llms-gen https://example.com
llms-gen https://example.com --depth 3 --output site-llms.txt
llms-gen https://example.com --full
llms-gen https://example.com --no-js --delay 0.5

How it works

Robot checks

Every page passes three checks:

  1. robots.txt - skips disallowed paths
  2. X-Robots-Tag - respects noindex and nofollow from HTTP headers
  3. <meta name="robots"> - respects page-level directives

Pages marked noindex are excluded. Pages marked nofollow are analyzed but their links are not crawled.
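
The three checks can be sketched with the standard library alone. This is a simplified illustration, not the package's real code: the names `MetaRobotsParser`, `robots_directives`, and `decide` are hypothetical, and the header parsing assumes a plain comma-separated value without a user-agent prefix.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives |= {d.strip().lower() for d in content.split(",") if d.strip()}


def robots_directives(x_robots_tag, html):
    """Merge directives from the X-Robots-Tag header value and the meta robots tag."""
    directives = {d.strip().lower() for d in x_robots_tag.split(",") if d.strip()}
    parser = MetaRobotsParser()
    parser.feed(html)
    return directives | parser.directives


def decide(url, robots, x_robots_tag, html, user_agent="*"):
    """Return (include_page, follow_links) for one fetched page."""
    if not robots.can_fetch(user_agent, url):
        return (False, False)  # Disallowed by robots.txt: skip entirely.
    directives = robots_directives(x_robots_tag, html)
    return ("noindex" not in directives, "nofollow" not in directives)


# Feed rules directly so the sketch needs no network access.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

print(decide("https://example.com/private/x", robots, "", ""))
print(decide("https://example.com/docs", robots, "noindex", "<html></html>"))
print(decide("https://example.com/docs", robots, "", '<meta name="robots" content="NoFollow">'))
```

Returning two separate booleans keeps "include this page" and "follow its links" independent, which matches the noindex and nofollow behaviour described above.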

Crawl steps

  1. Parse robots.txt, handling missing or restricted files.
  2. BFS from the start URL up to --depth levels.
  3. For each page:
    • Fetch with requests
    • Skip 4xx/5xx responses and non-HTML content
    • Fall back to Playwright if content is empty (JS-rendered sites)
    • Extract title, h1, meta description, first paragraph, directory path
  4. Group pages by top-level directory path
  5. Write llms.txt
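
The breadth-first traversal in step 2 can be sketched as follows. To keep the example runnable offline, a toy link graph stands in for real HTTP fetches. The function `bfs_crawl`, the `get_links` callback, and the convention that the start URL sits at depth 0 are assumptions made for illustration, not details confirmed by the tool.

```python
from collections import deque
from typing import Callable, Iterable


def bfs_crawl(start_url: str, max_depth: int,
              get_links: Callable[[str], Iterable[str]]) -> list:
    """Visit pages breadth-first, following links up to max_depth levels from the start."""
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # Do not expand links beyond the depth limit.
        for link in get_links(url):
            if link not in seen:  # Each URL is queued at most once.
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# Toy link graph standing in for fetched pages and their outgoing links.
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/getting-started", "/"],
    "/blog": ["/blog/hello"],
    "/docs/getting-started": ["/docs/api"],
}

pages = bfs_crawl("/", 2, lambda url: SITE.get(url, []))
print(pages)
```

With a depth of 2, `/docs/api` is never visited because it is three links away from the start page. A real crawler would replace the lambda with a function that fetches the page, applies the robot checks, and waits `--delay` seconds between requests.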

Section grouping

/docs/getting-started   -> ## Docs
/blog/hello-world       -> ## Blog
/api/v1/users           -> ## Api

Pages without a directory path use their <h1> as the section name.
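
A small sketch of this mapping, assuming the section name is simply the first path segment with its first letter capitalized. The function `section_name`, its trailing-slash handling, and the `"Pages"` fallback for a missing `<h1>` are hypothetical choices, not documented behaviour.

```python
from urllib.parse import urlparse


def section_name(url: str, h1: str = "") -> str:
    """Derive a section heading from the top-level directory, else fall back to the h1."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    # The last segment is the page itself unless the path ends with a slash.
    directories = segments if path.endswith("/") else segments[:-1]
    if directories:
        return directories[0].capitalize()
    return h1 or "Pages"  # "Pages" is an assumed placeholder when no h1 exists.


print(section_name("https://example.com/docs/getting-started"))
print(section_name("https://example.com/api/v1/users"))
print(section_name("https://example.com/about", h1="About Us"))
```

Using `str.capitalize()` reproduces the `Api` heading shown above, since it uppercases only the first letter.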


Output

The generated llms.txt follows the llmstxt.org specification:

# Example Site
> A great example site with documentation and blog content.

## Docs
- [Getting Started](https://example.com/docs/getting-started): How to get started.
- [API Reference](https://example.com/docs/api): Complete API documentation.

## Blog
- [Hello World](https://example.com/blog/hello): Our first blog post.

With --full, the tool also writes llms-full.txt with every page's full text under section headings.
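
For readers who want to see how grouped pages turn into the file above, here is a minimal rendering sketch. The function `render_llms_txt` and the dictionary keys `title`, `url`, and `description` are assumptions chosen for illustration. The layout follows the example output shown in this section.

```python
def render_llms_txt(title: str, summary: str, sections: dict) -> str:
    """Render a site title, a summary, and grouped pages as llms.txt Markdown."""
    lines = [f"# {title}", f"> {summary}", ""]
    for heading, pages in sections.items():
        lines.append(f"## {heading}")
        for page in pages:
            lines.append(f"- [{page['title']}]({page['url']}): {page['description']}")
        lines.append("")  # Blank line between sections.
    return "\n".join(lines).rstrip() + "\n"


sections = {
    "Docs": [
        {"title": "Getting Started",
         "url": "https://example.com/docs/getting-started",
         "description": "How to get started."},
    ],
    "Blog": [
        {"title": "Hello World",
         "url": "https://example.com/blog/hello",
         "description": "Our first blog post."},
    ],
}

text = render_llms_txt(
    "Example Site",
    "A great example site with documentation and blog content.",
    sections,
)
print(text)
```

Sections are written in dictionary insertion order, so the order in which pages are grouped determines the order of the headings.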


Contributing

  1. Fork the repo and clone it
  2. Create a branch: git checkout -b my-change
  3. Install for development: pip install -e .
  4. Run tests: python -m pytest tests/
  5. Push and open a pull request

Keep PRs focused. One change per PR. Write a clear description of what you changed and why.


Changelog

See CHANGELOG.md for the full release history.
