
Crawl any website and generate llms.txt — the AI-ready site map standard.


llms-generator


Crawl any website and generate llms.txt, a Markdown file at your domain root that tells AI systems which pages to read.

Homepage: https://abdulaouwal.com/project/llms-generator/
Documentation: https://llms-generator.readthedocs.io/


Why llms.txt?

Assistants such as ChatGPT, Claude, and Gemini work within a limited context window, so they usually cannot take in your entire website at once. llms.txt gives them your essential pages in one request.

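The llms.txt specification was proposed by Jeremy Howard in September 2024. It works alongside robots.txt and sitemap.xml, and each serves a different purpose: robots.txt tells crawlers which paths they may fetch, sitemap.xml lists the URLs available for indexing, and llms.txt points language models to the pages most worth reading.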


Installation

pip install llms-generator

For JavaScript-heavy sites:

pip install "llms-generator[js]"
playwright install chromium

Usage

llms-gen https://example.com

This creates llms.txt in your current folder. Add --full to also generate llms-full.txt, the full-content version.

Options

Flag       Default    Description
URL        required   Target website URL
--depth    2          Maximum crawl depth
--output   llms.txt   Output file path
--full     false      Also generate llms-full.txt
--no-js    false      Skip Playwright JS fallback
--delay    1.0        Seconds between requests
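
To make the defaults in the table concrete, here is a minimal argparse sketch that mirrors the same flags. It is an illustration only, not the tool's actual source: the function name build_parser is hypothetical, and treating --full and --no-js as on/off switches is an assumption based on their "false" defaults.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options table (not the real llms-gen code)."""
    parser = argparse.ArgumentParser(prog="llms-gen")
    parser.add_argument("url", help="Target website URL")
    parser.add_argument("--depth", type=int, default=2, help="Maximum crawl depth")
    parser.add_argument("--output", default="llms.txt", help="Output file path")
    # Assumed to be boolean switches because the table lists their default as false.
    parser.add_argument("--full", action="store_true", help="Also generate llms-full.txt")
    parser.add_argument("--no-js", action="store_true", help="Skip Playwright JS fallback")
    parser.add_argument("--delay", type=float, default=1.0, help="Seconds between requests")
    return parser


# Parse an example command line without touching sys.argv.
args = build_parser().parse_args(["https://example.com", "--depth", "3", "--no-js"])
print(args.url, args.depth, args.output, args.full, args.no_js, args.delay)
```

Flags that are omitted fall back to the defaults shown in the table.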

Examples

llms-gen https://example.com
llms-gen https://example.com --depth 3 --output site-llms.txt
llms-gen https://example.com --full
llms-gen https://example.com --no-js --delay 0.5

How it works

Robot checks

Every page passes three checks:

  1. robots.txt - skips disallowed paths
  2. X-Robots-Tag - respects noindex and nofollow from HTTP headers
  3. <meta name="robots"> - respects page-level directives

Pages marked noindex are excluded. Pages marked nofollow are analyzed but their links are not crawled.
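
The three checks above can be sketched with the Python standard library alone. This is a simplified illustration under stated assumptions, not the package's implementation: check_page and parse_directives are hypothetical names, the robots.txt content is passed in as lines rather than fetched, and user-agent-scoped X-Robots-Tag values are not handled.

```python
from html.parser import HTMLParser
from urllib import robotparser


def parse_directives(value):
    """Split a comma-separated directive string into a lowercase set."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives |= parse_directives(attr.get("content", ""))


def check_page(url, robots_lines, x_robots_tag, html, user_agent="*"):
    """Return (allowed, index, follow) for one page (hypothetical helper)."""
    # Check 1: robots.txt rules.
    rules = robotparser.RobotFileParser()
    rules.parse(robots_lines)
    if not rules.can_fetch(user_agent, url):
        return (False, False, False)

    # Check 2: the X-Robots-Tag HTTP header, if present.
    directives = parse_directives(x_robots_tag or "")

    # Check 3: page-level <meta name="robots"> tags.
    meta = MetaRobotsParser()
    meta.feed(html)
    directives |= meta.directives

    return (True, "noindex" not in directives, "nofollow" not in directives)


robots = ["User-agent: *", "Disallow: /private/"]
print(check_page("https://example.com/docs", robots, "noindex", "<html></html>"))
```

A page with index set to False would be left out of llms.txt, and a page with follow set to False would be analyzed without queueing its links.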

Crawl steps

  1. Parse robots.txt. Handles missing or restricted files.
  2. BFS from the start URL up to --depth levels.
  3. For each page:
    • Fetch with requests
    • Skip 4xx/5xx responses and non-HTML content
    • Fall back to Playwright if content is empty (JS-rendered sites)
    • Extract title, h1, meta description, first paragraph, directory path
  4. Group pages by top-level directory path
  5. Write llms.txt
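
The depth-limited BFS in step 2 could look roughly like the sketch below. It is not the tool's code: bfs_crawl and get_links are hypothetical names, a small in-memory dictionary stands in for real HTTP fetches, and counting the start URL as depth 0 is an assumption.

```python
from collections import deque


def bfs_crawl(start_url, get_links, max_depth=2):
    """Visit pages breadth-first, never going deeper than max_depth.

    get_links is a hypothetical callable: url -> iterable of linked URLs.
    Returns the visited URLs in the order they were reached.
    """
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0)])  # assumption: the start URL is depth 0
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not expand links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# Tiny in-memory site standing in for real pages and their links.
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/a", "/"],
    "/blog": ["/blog/hello"],
    "/docs/a": ["/docs/a/deep"],
}

pages = bfs_crawl("/", lambda url: SITE.get(url, []), max_depth=2)
print(pages)
```

With max_depth=2, /docs/a/deep is never visited because it sits three links away from the start URL.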

Section grouping

/docs/getting-started   -> ## Docs
/blog/hello-world       -> ## Blog
/api/v1/users           -> ## Api

Pages without a directory path use their <h1> as the section name.
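
One way to express this grouping rule is the small sketch below. The function name section_for and the "Pages" fallback for a missing <h1> are assumptions made for illustration, as is treating a single-segment path such as /about as having no directory.

```python
from urllib.parse import urlparse


def section_for(url, h1=None):
    """Pick a section name for a page (hypothetical helper)."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    if len(segments) >= 2:
        # The first path segment is the top-level directory.
        return segments[0].capitalize()
    # No directory in the path: fall back to the page's <h1>.
    return h1 or "Pages"  # "Pages" is an assumed placeholder


for url in (
    "https://example.com/docs/getting-started",
    "https://example.com/blog/hello-world",
    "https://example.com/api/v1/users",
):
    print(url, "->", section_for(url))
```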


Output

The generated llms.txt follows the llmstxt.org specification:

# Example Site
> A great example site with documentation and blog content.

## Docs
- [Getting Started](https://example.com/docs/getting-started): How to get started.
- [API Reference](https://example.com/docs/api): Complete API documentation.

## Blog
- [Hello World](https://example.com/blog/hello): Our first blog post.

With --full, the tool also writes llms-full.txt with every page's full text under section headings.
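
As a rough sketch of how grouped pages could be rendered into the layout shown above, the snippet below builds the same text from plain Python data. The function render_llms_txt and its argument shapes are hypothetical; the real tool may structure this differently.

```python
def render_llms_txt(site_title, summary, sections):
    """Render llms.txt text from a title, a summary, and grouped pages.

    sections maps a section name to a list of (title, url, description) tuples.
    """
    lines = [f"# {site_title}", f"> {summary}", ""]
    for name, pages in sections.items():
        lines.append(f"## {name}")
        for title, url, description in pages:
            lines.append(f"- [{title}]({url}): {description}")
        lines.append("")  # blank line between sections
    return "\n".join(lines).rstrip() + "\n"


text = render_llms_txt(
    "Example Site",
    "A great example site with documentation and blog content.",
    {
        "Docs": [
            ("Getting Started", "https://example.com/docs/getting-started", "How to get started."),
            ("API Reference", "https://example.com/docs/api", "Complete API documentation."),
        ],
        "Blog": [
            ("Hello World", "https://example.com/blog/hello", "Our first blog post."),
        ],
    },
)
print(text)
```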


Contributing

  1. Fork the repo and clone it
  2. Create a branch: git checkout -b my-change
  3. Install for development: pip install -e .
  4. Run tests: python -m pytest tests/
  5. Push and open a pull request

Keep PRs focused. One change per PR. Write a clear description of what you changed and why.


Changelog

See CHANGELOG.md for the full release history.
