Crawl any website and generate llms.txt - the AI-ready site map standard.

llms-generator

Crawl any website and generate llms.txt, a Markdown file at your domain root that tells AI systems which pages to read.

Homepage: https://abdulaouwal.com/project/llms-generator/
Documentation: https://llms-generator.readthedocs.io/


Why llms.txt?

Assistants such as ChatGPT, Claude, and Gemini have limited context windows, so they cannot read your entire website in one pass. llms.txt points them to your essential pages in a single request.

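The llms.txt specification was proposed by Jeremy Howard in September 2024. It works alongside robots.txt and sitemap.xml, and each serves a different purpose: robots.txt tells crawlers which paths they may fetch, sitemap.xml lists the URLs available for indexing, and llms.txt gives language models a curated, readable summary of the key pages.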


Installation

pip install llms-generator

For JavaScript-heavy sites:

pip install llms-generator[js]
playwright install chromium

Usage

llms-gen https://example.com

This creates llms.txt in your current folder. Add --full to also generate llms-full.txt, the full-content version.

Options

| Flag     | Default  | Description                 |
|----------|----------|-----------------------------|
| URL      | required | Target website URL          |
| --depth  | 2        | Maximum crawl depth         |
| --output | llms.txt | Output file path            |
| --full   | false    | Also generate llms-full.txt |
| --no-js  | false    | Skip Playwright JS fallback |
| --delay  | 1.0      | Seconds between requests    |
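
To make the defaults concrete, here is a minimal argparse sketch that mirrors the options listed above. It is an illustration only, not the package's actual parser: the function name build_parser and the help strings are assumptions, and only the flag names and default values come from the table.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="llms-gen")
    parser.add_argument("url", help="Target website URL")
    parser.add_argument("--depth", type=int, default=2, help="Maximum crawl depth")
    parser.add_argument("--output", default="llms.txt", help="Output file path")
    parser.add_argument("--full", action="store_true", help="Also generate llms-full.txt")
    parser.add_argument("--no-js", action="store_true", help="Skip Playwright JS fallback")
    parser.add_argument("--delay", type=float, default=1.0, help="Seconds between requests")
    return parser


# Flags that are not passed keep the defaults from the table.
args = build_parser().parse_args(["https://example.com", "--depth", "3", "--no-js"])
print(args.depth, args.output, args.full, args.no_js, args.delay)
```

Boolean flags are modelled with store_true, so they default to false and flip to true when present.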

Examples

llms-gen https://example.com
llms-gen https://example.com --depth 3 --output site-llms.txt
llms-gen https://example.com --full
llms-gen https://example.com --no-js --delay 0.5

How it works

Robot checks

Every page passes three checks:

  1. robots.txt - skips disallowed paths
  2. X-Robots-Tag - respects noindex and nofollow from HTTP headers
  3. <meta name="robots"> - respects page-level directives

Pages marked noindex are excluded. Pages marked nofollow are analyzed but their links are not crawled.
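
The three checks can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the tool's own code: check_page, split_directives, and the (allowed, index, follow) return shape are hypothetical names, the header lookup expects the exact key "X-Robots-Tag", and directives such as "none" or user-agent-prefixed header values are not handled.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def split_directives(value):
    """Turn 'noindex, nofollow' into {'noindex', 'nofollow'}."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


class _MetaRobots(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            self.directives |= split_directives(attr.get("content") or "")


def check_page(robots, url, headers, html, agent="*"):
    """Return (allowed, index, follow) for one fetched page (hypothetical helper)."""
    # Check 1: robots.txt
    if not robots.can_fetch(agent, url):
        return False, False, False
    # Check 2: X-Robots-Tag HTTP header
    directives = split_directives(headers.get("X-Robots-Tag", ""))
    # Check 3: <meta name="robots"> in the page itself
    meta = _MetaRobots()
    meta.feed(html)
    directives |= meta.directives
    return True, "noindex" not in directives, "nofollow" not in directives


# Offline demo: rules are parsed from a list of lines instead of being downloaded.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

page = '<html><head><meta name="robots" content="nofollow"></head></html>'
print(check_page(robots, "https://example.com/docs/", {}, page))
print(check_page(robots, "https://example.com/private/x", {}, page))
```

A crawler built on this would list a page only when index is true and queue its links only when follow is true.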

Crawl steps

  1. Parse robots.txt. Handles missing or restricted files.
  2. BFS from the start URL up to --depth levels.
  3. For each page:
    • Fetch with requests
    • Skip 4xx/5xx responses and non-HTML content
    • Fall back to Playwright if content is empty (JS-rendered sites)
    • Extract title, h1, meta description, first paragraph, directory path
  4. Group pages by top-level directory path
  5. Write llms.txt
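
The breadth-first step can be sketched as follows. This is a minimal illustration, not the tool's implementation: the crawl function and its fetch parameter are hypothetical, the restriction to same-host links is an assumption, and the robots checks, request delay, and Playwright fallback are left out. Injecting fetch keeps the example runnable without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _Links(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl returning {url: html} in visit order.

    `fetch(url)` is a hypothetical callable that returns HTML text, or None
    for error responses and non-HTML content.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # skipped page: nothing to record or expand
        pages[url] = html
        if depth >= max_depth:
            continue  # keep the page but do not follow its links
        parser = _Links()
        parser.feed(html)
        for href in parser.hrefs:
            link = urldefrag(urljoin(url, href)).url  # absolute URL, fragment dropped
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages


# Offline demo: a dict stands in for the website.
site = {
    "https://example.com/": '<a href="/docs/">Docs</a> <a href="https://other.org/">Other</a>',
    "https://example.com/docs/": '<a href="/docs/deep">Deep</a>',
    "https://example.com/docs/deep": '<a href="/docs/deeper">Deeper</a>',
    "https://example.com/docs/deeper": "<p>never reached at depth 2</p>",
}
print(list(crawl("https://example.com/", site.get, max_depth=2)))
```

With max_depth=2 the demo visits the start page, /docs/, and /docs/deep, and stops before /docs/deeper.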

Section grouping

/docs/getting-started   -> ## Docs
/blog/hello-world       -> ## Blog
/api/v1/users           -> ## Api

Pages without a directory path use their <h1> as the section name.
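
One way to read the rule above, as a small sketch: take the directory part of the URL path, capitalize its first segment, and fall back to the page's h1 when there is no directory. The function name section_name and the treatment of trailing slashes are assumptions, not confirmed behavior of the tool.

```python
from urllib.parse import urlparse


def section_name(url, h1=""):
    """Return the section heading for a page (hypothetical helper)."""
    # Directory part of the path: everything before the last "/".
    directory = urlparse(url).path.rsplit("/", 1)[0]
    top = directory.strip("/").split("/")[0]
    # No directory (for example "/about" or "/"): fall back to the <h1> text.
    return top.capitalize() if top else h1


print(section_name("https://example.com/docs/getting-started"))
print(section_name("https://example.com/api/v1/users"))
print(section_name("https://example.com/about", h1="About Us"))
```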


Output

The generated llms.txt follows the llmstxt.org specification:

# Example Site
> A great example site with documentation and blog content.

## Docs
- [Getting Started](https://example.com/docs/getting-started): How to get started.
- [API Reference](https://example.com/docs/api): Complete API documentation.

## Blog
- [Hello World](https://example.com/blog/hello): Our first blog post.

With --full, the tool also writes llms-full.txt with every page's full text under section headings.
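
For illustration, the layout shown above can be produced by a short rendering function. This is a sketch under assumptions: render_llms_txt and its argument shapes (a dict mapping each section heading to a list of title, URL, description tuples) are hypothetical and are not the package's API.

```python
def render_llms_txt(title, summary, sections):
    """Render the llms.txt layout from the sample output (hypothetical helper).

    `sections` maps a heading to a list of (title, url, description) tuples.
    """
    lines = [f"# {title}", f"> {summary}"]
    for heading, pages in sections.items():
        lines += ["", f"## {heading}"]  # blank line before each section
        for name, url, description in pages:
            lines.append(f"- [{name}]({url}): {description}")
    return "\n".join(lines) + "\n"


sections = {
    "Docs": [
        ("Getting Started", "https://example.com/docs/getting-started", "How to get started."),
        ("API Reference", "https://example.com/docs/api", "Complete API documentation."),
    ],
    "Blog": [
        ("Hello World", "https://example.com/blog/hello", "Our first blog post."),
    ],
}
print(render_llms_txt(
    "Example Site",
    "A great example site with documentation and blog content.",
    sections,
))
```

Called with the data above, the function reproduces the sample file line for line.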


Contributing

  1. Fork the repo and clone it
  2. Create a branch: git checkout -b my-change
  3. Install for development: pip install -e .
  4. Run tests: python -m pytest tests/
  5. Push and open a pull request

Keep PRs focused. One change per PR. Write a clear description of what you changed and why.


Changelog

See CHANGELOG.md for the full release history.
