llmsbrieftxt
Generate llms-brief.txt files from any documentation website using AI. A focused, production-ready CLI tool that does one thing exceptionally well.
Quick Start
# Install
pip install llmsbrieftxt
# Set your OpenAI API key
export OPENAI_API_KEY="sk-your-api-key-here"
# Generate llms-brief.txt from a documentation site
llmtxt https://docs.python.org/3/
# Preview URLs before processing
llmtxt https://react.dev --show-urls
# Use a different model
llmtxt https://react.dev --model gpt-4o
What It Does
Crawls documentation websites, extracts content, and uses OpenAI to generate structured llms-brief.txt files. Each entry contains a title, URL, keywords, and one-line summary - making it easy for LLMs and developers to navigate documentation.
Key Features:
- Smart Crawling: Breadth-first discovery up to depth 3, with URL deduplication
- Content Extraction: HTML to Markdown using trafilatura
- AI Summarization: Structured output using OpenAI
- Automatic Caching: Summaries cached in
.llmsbrieftxt_cache/to avoid reprocessing - Production-Ready: Clean output, proper error handling, scriptable
Installation
# With pip
pip install llmsbrieftxt
# With uv (recommended)
uv pip install llmsbrieftxt
Prerequisites
- Python 3.10+
- OpenAI API Key: Required for generating summaries
export OPENAI_API_KEY="sk-your-api-key-here"
Usage
Basic Command
llmtxt <url> [options]
Output is automatically saved to ~/.claude/docs/<domain>.txt (e.g., docs.python.org.txt)
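The domain-to-filename mapping could be sketched as below; the helper name `default_output_path` is illustrative, not the tool's actual internal API.

```python
from pathlib import Path
from urllib.parse import urlparse

def default_output_path(url: str, base: Path = Path.home() / ".claude" / "docs") -> Path:
    """Map a documentation URL to its default output file, e.g.
    https://docs.python.org/3/ -> ~/.claude/docs/docs.python.org.txt."""
    domain = urlparse(url).netloc
    return base / f"{domain}.txt"

print(default_output_path("https://docs.python.org/3/").name)  # docs.python.org.txt
```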
Options
- --output PATH - Custom output path (default: ~/.claude/docs/<domain>.txt)
- --model MODEL - OpenAI model to use (default: gpt-5-mini)
- --max-concurrent-summaries N - Concurrent LLM requests (default: 10)
- --show-urls - Preview discovered URLs with cost estimate (no API calls)
- --max-urls N - Limit number of URLs to process
- --depth N - Maximum crawl depth (default: 3)
- --cache-dir PATH - Cache directory path (default: .llmsbrieftxt_cache)
- --use-cache-only - Use only cached summaries, skip API calls for new pages
- --force-refresh - Ignore cache and regenerate all summaries
Examples
# Basic usage - saves to ~/.claude/docs/docs.python.org.txt
llmtxt https://docs.python.org/3/
# Use a different model
llmtxt https://react.dev --model gpt-4o
# Preview URLs with cost estimate before processing (no API calls)
llmtxt https://react.dev --show-urls
# Limit scope for testing
llmtxt https://docs.python.org --max-urls 50
# Custom crawl depth (explore deeper or shallower)
llmtxt https://example.com --depth 2
# Use only cached summaries (no API calls)
llmtxt https://docs.python.org/3/ --use-cache-only
# Force refresh all summaries (ignore cache)
llmtxt https://docs.python.org/3/ --force-refresh
# Custom cache directory
llmtxt https://example.com --cache-dir /tmp/my-cache
# Custom output location
llmtxt https://react.dev --output ./my-docs/react.txt
# Process with higher concurrency (if you have high rate limits)
llmtxt https://fastapi.tiangolo.com --max-concurrent-summaries 20
Searching and Listing
This tool focuses on generating llms-brief.txt files. For searching and listing, use standard Unix tools:
Search Documentation
# Search all docs
rg "async functions" ~/.claude/docs/
# Search specific file
rg "hooks" ~/.claude/docs/react.dev.txt
# Case-insensitive search
rg -i "error handling" ~/.claude/docs/
# Show context around matches
rg -C 2 "api" ~/.claude/docs/
# Or use grep
grep -r "async" ~/.claude/docs/
List Documentation
# List all docs
ls ~/.claude/docs/
# List with details
ls -lh ~/.claude/docs/
# Count entries in a file
grep -c "^Title:" ~/.claude/docs/react.dev.txt
# Find all docs and show sizes
find ~/.claude/docs/ -name "*.txt" -exec wc -l {} +
Why use standard tools? They're:
- Already installed on your system
- More powerful and flexible
- Well-documented
- Composable with other commands
- Faster than any custom implementation
How It Works
URL Discovery
The tool uses a comprehensive breadth-first search strategy:
- Explores links up to 3 levels deep from your starting URL
- Automatically excludes assets (CSS, JS, images) and non-documentation pages
- Sophisticated URL normalization prevents duplicate processing
- Discovers 100-300+ pages on typical documentation sites
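The discovery strategy above could be sketched roughly as follows. This is a minimal illustration, not the tool's actual internals: the `discover` function, the injected `fetch` callable, and the asset-suffix list are all assumptions.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse

# Illustrative asset suffixes to exclude; the real tool's list may differ.
ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".svg", ".ico", ".woff2")

class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def normalize(base, href):
    # Resolve relative links and drop #fragments so duplicates collapse.
    url, _ = urldefrag(urljoin(base, href))
    return url.rstrip("/")

def discover(start, fetch, max_depth=3):
    """Breadth-first link discovery limited to the starting domain.

    `fetch` is a callable that returns the HTML for a URL, injected here
    so the sketch stays independent of any HTTP library.
    """
    start = start.rstrip("/")
    domain = urlparse(start).netloc
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            link = normalize(url, href)
            if (urlparse(link).netloc == domain
                    and not link.endswith(ASSET_SUFFIXES)
                    and link not in seen):
                seen.add(link)
                queue.append((link, depth + 1))
    return sorted(seen)
```

Injecting `fetch` keeps the traversal logic testable with an in-memory page map instead of live HTTP requests.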
Content Processing Pipeline
URL Discovery → Content Extraction → LLM Summarization → File Generation
- Crawl: Discover all documentation URLs
- Extract: Convert HTML to markdown using trafilatura
- Summarize: Generate structured summaries using OpenAI
- Cache: Store summaries in
.llmsbrieftxt_cache/for reuse - Generate: Compile into searchable llms-brief.txt format
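The caching step could look something like the sketch below: summaries stored on disk, keyed by a hash of the URL. The `SummaryCache` class and its file layout are assumptions for illustration; the real cache format may differ.

```python
import hashlib
import json
from pathlib import Path

class SummaryCache:
    """On-disk summary cache keyed by URL hash (a sketch, not the real layout)."""
    def __init__(self, cache_dir=".llmsbrieftxt_cache"):
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)

    def _path(self, url):
        # One JSON file per URL, named by the SHA-256 of the URL.
        return self.dir / (hashlib.sha256(url.encode()).hexdigest() + ".json")

    def get(self, url):
        path = self._path(url)
        return json.loads(path.read_text()) if path.exists() else None

    def put(self, url, summary):
        self._path(url).write_text(json.dumps(summary))
```

With a cache like this, a rerun can skip the LLM call for any URL whose summary is already on disk, which is what makes reruns cheap.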
Output Format
Each entry in the generated file contains:
Title: [Page Name](URL)
Keywords: searchable, terms, functions, concepts
Summary: One-line description of page content
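Given that entry layout, a consumer could pull entries back out with a small parser. This is a sketch against the format as stated above; `parse_entries` is a hypothetical helper, not part of the package.

```python
import re

# One entry = a Title line with a markdown-style link, then Keywords, then Summary.
ENTRY_RE = re.compile(
    r"Title: \[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)\n"
    r"Keywords: (?P<keywords>.+)\n"
    r"Summary: (?P<summary>.+)"
)

def parse_entries(text):
    """Parse llms-brief.txt entries into a list of dicts."""
    return [m.groupdict() for m in ENTRY_RE.finditer(text)]
```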
Development
Setup
# Clone and install with dev dependencies
git clone https://github.com/stevennevins/llmsbrief.git
cd llmsbrief
uv sync --group dev
Running Tests
# All tests
uv run pytest
# Unit tests only
uv run pytest tests/unit/
# Specific test file
uv run pytest tests/unit/test_cli.py
# With verbose output
uv run pytest -v
Code Quality
# Lint code
uv run ruff check llmsbrieftxt/ tests/
# Format code
uv run ruff format llmsbrieftxt/ tests/
# Type checking
uv run mypy llmsbrieftxt/
Configuration
Default Settings
- Crawl Depth: 3 levels (configurable via --depth)
- Output Location: ~/.claude/docs/<domain>.txt (configurable via --output)
- Cache Directory: .llmsbrieftxt_cache/ (configurable via --cache-dir)
- OpenAI Model: gpt-5-mini (configurable via --model)
- Concurrent Requests: 10 (configurable via --max-concurrent-summaries)
Environment Variables
- OPENAI_API_KEY - Required for all operations
Usage Tips
Managing API Costs
- Preview with cost estimate: Use --show-urls to see discovered URLs and estimated API cost before processing
- Limit scope: Use --max-urls to limit processing during testing
- Automatic caching: Summaries are cached automatically - rerunning is cheap
- Cache-only mode: Use --use-cache-only to generate output from cache without API calls
- Force refresh: Use --force-refresh when you need to regenerate all summaries
- Cost-effective model: The default model gpt-5-mini is cost-effective for most documentation
Controlling Crawl Depth
- Default depth (3): Good for most documentation sites (100-300 pages)
- Shallow crawl (1-2): Use for large sites or to focus on main pages only
- Deep crawl (4-5): Use for small sites or comprehensive coverage
- Example: llmtxt https://example.com --depth 2 --show-urls to preview scope
Cache Management
- Default location: .llmsbrieftxt_cache/ in the current directory
- Custom location: Use --cache-dir for shared caches or different organization
- Cache benefits: Speeds up reruns, reduces API costs, enables incremental updates
- Failed URLs tracking: Failed URLs are written to failed_urls.txt next to the output file
Organizing Documentation
All docs are saved to ~/.claude/docs/ by domain name:
~/.claude/docs/
├── docs.python.org.txt
├── react.dev.txt
├── pytorch.org.txt
└── fastapi.tiangolo.com.txt
This makes it easy for Claude Code and other tools to find and reference documentation.
Integrations
Claude Code
This tool is designed to work seamlessly with Claude Code. Once you've generated documentation files, Claude can search and reference them during development sessions.
MCP Servers
Generated llms-brief.txt files can be served via MCP (Model Context Protocol) servers. See the mcpdoc project for an example integration.
Troubleshooting
API Key Issues
# Verify API key is set
echo $OPENAI_API_KEY
# Set it if missing
export OPENAI_API_KEY="sk-your-api-key-here"
Rate Limiting
If you hit rate limits, reduce concurrent requests:
llmtxt https://example.com --max-concurrent-summaries 5
Large Documentation Sites
For very large sites (500+ pages):
- Start with --show-urls to see scope
- Use --max-urls to process in batches
- Increase --max-concurrent-summaries if you have high rate limits
Migrating from 0.x
Version 1.0.0 removes search and list subcommands in favor of Unix tools:
# Before (v0.x)
llmsbrieftxt generate https://docs.python.org/3/
llmsbrieftxt search "async"
llmsbrieftxt list
# After (v1.0.0)
llmtxt https://docs.python.org/3/
rg "async" ~/.claude/docs/
ls ~/.claude/docs/
Why the change? Focus on doing one thing well. Search and list are better served by mature, powerful Unix tools you already have.
License
MIT
Contributing
Contributions welcome! Please:
- Run tests: uv run pytest
- Lint code: uv run ruff check llmsbrieftxt/ tests/
- Format code: uv run ruff format llmsbrieftxt/ tests/
- Check types: uv run mypy llmsbrieftxt/
- Submit a PR
Links
- Homepage: https://github.com/stevennevins/llmsbrief
- Issues: https://github.com/stevennevins/llmsbrief/issues
- llms.txt Spec: https://llmstxt.org/