llms-generator

Crawl any website and generate llms.txt, a Markdown file at your domain root that tells AI systems which pages to read.

Homepage: https://abdulaouwal.com/project/llms-generator/
Documentation: https://llms-generator.readthedocs.io/


Why llms.txt?

AI assistants such as ChatGPT, Claude, and Gemini have limited context windows and usually cannot read an entire website in one pass. llms.txt points them to your essential pages in a single request.

The llms.txt specification was proposed by Jeremy Howard in September 2024. It works alongside robots.txt and sitemap.xml. Each serves a different purpose.


Installation

pip install llms-generator

For JavaScript-heavy sites:

pip install "llms-generator[js]"
playwright install chromium

Usage

llms-gen https://example.com

This creates llms.txt in your current folder. Add --full to also generate llms-full.txt, the full-content version.

Options

Flag      Default    Description
URL       required   Target website URL
--depth   2          Maximum crawl depth
--output  llms.txt   Output file path
--full    false      Also generate llms-full.txt
--no-js   false      Skip Playwright JS fallback
--delay   1.0        Seconds between requests
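
For readers who want to see how these options fit together, the table above maps naturally onto an argparse declaration. This is only an illustrative sketch, not the package's actual source: `build_parser` is a hypothetical name, and it assumes `--full` and `--no-js` are plain boolean switches.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options table (not the real source)."""
    parser = argparse.ArgumentParser(prog="llms-gen")
    parser.add_argument("url", help="Target website URL")
    parser.add_argument("--depth", type=int, default=2, help="Maximum crawl depth")
    parser.add_argument("--output", default="llms.txt", help="Output file path")
    parser.add_argument("--full", action="store_true", help="Also generate llms-full.txt")
    parser.add_argument("--no-js", action="store_true", help="Skip Playwright JS fallback")
    parser.add_argument("--delay", type=float, default=1.0, help="Seconds between requests")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["https://example.com", "--depth", "3", "--no-js"])
print(args.url, args.depth, args.output, args.full, args.no_js, args.delay)
```

Options that are not passed keep the defaults listed in the table.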

Examples

llms-gen https://example.com
llms-gen https://example.com --depth 3 --output site-llms.txt
llms-gen https://example.com --full
llms-gen https://example.com --no-js --delay 0.5

How it works

Robot checks

Every page passes three checks:

  1. robots.txt - skips disallowed paths
  2. X-Robots-Tag - respects noindex and nofollow from HTTP headers
  3. <meta name="robots"> - respects page-level directives

Pages marked noindex are excluded. Pages marked nofollow are analyzed but their links are not crawled.
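
The three checks can be sketched with the standard library alone. The code below is a simplified illustration, not the tool's implementation: `check_page`, `parse_directives`, and `MetaRobotsParser` are hypothetical names, the headers are assumed to arrive as a plain dict keyed by `X-Robots-Tag`, and user-agent-scoped header values and the `none` directive are ignored.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def parse_directives(value: str) -> set[str]:
    """Split a comma-separated directive string into lowercase tokens."""
    return {token.strip().lower() for token in value.split(",") if token.strip()}


class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            self.directives |= parse_directives(attr.get("content") or "")


def check_page(url, robots, headers, html, agent="*"):
    """Return (include_in_output, follow_links) for one fetched page."""
    # Check 1: robots.txt disallows the path entirely.
    if not robots.can_fetch(agent, url):
        return False, False
    # Check 2: directives from the X-Robots-Tag HTTP header.
    directives = parse_directives(headers.get("X-Robots-Tag", ""))
    # Check 3: directives from the page-level meta tag.
    meta = MetaRobotsParser()
    meta.feed(html)
    directives |= meta.directives
    return "noindex" not in directives, "nofollow" not in directives


robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

blocked = check_page("https://example.com/private/a", robots, {}, "")
noindex = check_page("https://example.com/a", robots, {"X-Robots-Tag": "noindex"}, "")
nofollow = check_page("https://example.com/b", robots, {}, '<meta name="robots" content="nofollow">')
print(blocked, noindex, nofollow)
```

Returning two separate flags keeps the noindex decision (exclude the page) independent of the nofollow decision (do not crawl its links).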

Crawl steps

  1. Parse robots.txt, handling missing or restricted files.
  2. BFS from the start URL up to --depth levels.
  3. For each page:
    • Fetch with requests
    • Skip 4xx/5xx responses and non-HTML content
    • Fall back to Playwright if content is empty (JS-rendered sites)
    • Extract title, h1, meta description, first paragraph, directory path
  4. Group pages by top-level directory path
  5. Write llms.txt
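
The depth-limited BFS in step 2 can be illustrated as follows. This is a minimal sketch under stated assumptions, not the tool's code: `crawl_bfs` is a hypothetical name, the start URL is assumed to count as depth 0, and link discovery is passed in as a function so the example runs against an in-memory site map instead of real HTTP requests.

```python
from collections import deque
from typing import Callable, Iterable


def crawl_bfs(start: str, max_depth: int,
              get_links: Callable[[str], Iterable[str]]) -> list[str]:
    """Visit pages breadth-first, at most max_depth links away from start."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # record the page but do not expand its links
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# Hypothetical site structure standing in for fetched pages.
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/getting-started", "/"],
    "/blog": ["/blog/hello"],
    "/docs/getting-started": ["/docs/advanced"],
}

visited = crawl_bfs("/", 2, lambda url: SITE.get(url, []))
print(visited)
```

With a depth of 2, `/docs/advanced` is never reached because it sits three links away from the start page.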

Section grouping

/docs/getting-started   -> ## Docs
/blog/hello-world       -> ## Blog
/api/v1/users           -> ## Api

Pages without a directory path use their <h1> as the section name.
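
One way to express this rule in code is shown below. It is a sketch, not the package's implementation: `section_name` is a hypothetical helper, a page is assumed to have a directory path when its URL has at least two path segments, and the `"Pages"` fallback for a missing `<h1>` is an assumption.

```python
from urllib.parse import urlparse


def section_name(url: str, h1: str = "") -> str:
    """Derive a section heading from the top-level directory, else from the <h1>."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) >= 2:
        # The page sits inside a directory: use the top-level directory name.
        return segments[0].capitalize()
    # No directory path: fall back to the <h1> (or an assumed default).
    return h1 or "Pages"


examples = [
    ("https://example.com/docs/getting-started", "Getting Started"),
    ("https://example.com/api/v1/users", "Users"),
    ("https://example.com/about", "About Us"),
]
for url, h1 in examples:
    print(url, "->", section_name(url, h1))
```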


Output

The generated llms.txt follows the llmstxt.org specification:

# Example Site
> A great example site with documentation and blog content.

## Docs
- [Getting Started](https://example.com/docs/getting-started): How to get started.
- [API Reference](https://example.com/docs/api): Complete API documentation.

## Blog
- [Hello World](https://example.com/blog/hello): Our first blog post.

With --full, the tool also writes llms-full.txt with every page's full text under section headings.
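
To make the file layout concrete, the sketch below renders grouped pages into the same shape as the sample above. `render_llms_txt` is a hypothetical function written for illustration, and the input structure (a dict of section headings to link tuples) is an assumption rather than the tool's internal data model.

```python
def render_llms_txt(title: str, summary: str,
                    sections: dict[str, list[tuple[str, str, str]]]) -> str:
    """Render grouped pages as llms.txt text.

    sections maps a heading to (link text, URL, description) tuples.
    """
    lines = [f"# {title}", f"> {summary}"]
    for heading, pages in sections.items():
        lines += ["", f"## {heading}"]
        for text, url, description in pages:
            entry = f"- [{text}]({url})"
            if description:
                entry += f": {description}"
            lines.append(entry)
    return "\n".join(lines) + "\n"


text = render_llms_txt(
    "Example Site",
    "A great example site with documentation and blog content.",
    {
        "Docs": [("Getting Started", "https://example.com/docs/getting-started", "How to get started.")],
        "Blog": [("Hello World", "https://example.com/blog/hello", "Our first blog post.")],
    },
)
print(text)
```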


Contributing

  1. Fork the repo and clone it
  2. Create a branch: git checkout -b my-change
  3. Install for development: pip install -e .
  4. Run tests: python -m pytest tests/
  5. Push and open a pull request

Keep PRs focused. One change per PR. Write a clear description of what you changed and why.


Changelog

See CHANGELOG.md for the full release history.
