Generate llms-brief.txt files from documentation websites using AI

llmsbrieftxt

Generate llms-brief.txt files from any documentation website using AI. A focused, production-ready CLI tool that does one thing exceptionally well.

Quick Start

# Install
pip install llmsbrieftxt

# Set your OpenAI API key
export OPENAI_API_KEY="sk-your-api-key-here"

# Generate llms-brief.txt from a documentation site
llmtxt https://docs.python.org/3/

# Preview URLs before processing
llmtxt https://react.dev --show-urls

# Use a different model
llmtxt https://react.dev --model gpt-4o

What It Does

Crawls documentation websites, extracts content, and uses OpenAI to generate structured llms-brief.txt files. Each entry contains a title, URL, keywords, and one-line summary - making it easy for LLMs and developers to navigate documentation.

Key Features:

Smart Crawling: Breadth-first discovery up to depth 3, with URL deduplication
Content Extraction: HTML to Markdown using trafilatura
AI Summarization: Structured output using OpenAI
Automatic Caching: Summaries cached in .llmsbrieftxt_cache/ to avoid reprocessing
Production-Ready: Clean output, proper error handling, scriptable

Installation

# With pip
pip install llmsbrieftxt

# With uv (recommended)
uv pip install llmsbrieftxt

Prerequisites

Python 3.10+
OpenAI API Key: Required for generating summaries
```
export OPENAI_API_KEY="sk-your-api-key-here"
```

Usage

Basic Command

llmtxt <url> [options]

Output is saved automatically to ~/.claude/docs/<domain>.txt (for example, ~/.claude/docs/docs.python.org.txt).

Options

--output PATH - Custom output path (default: ~/.claude/docs/<domain>.txt)
--model MODEL - OpenAI model to use (default: gpt-5-mini)
--max-concurrent-summaries N - Concurrent LLM requests (default: 10)
--show-urls - Preview discovered URLs with cost estimate (no API calls)
--max-urls N - Limit number of URLs to process
--depth N - Maximum crawl depth (default: 3)
--cache-dir PATH - Cache directory path (default: .llmsbrieftxt_cache)
--use-cache-only - Use only cached summaries, skip API calls for new pages
--force-refresh - Ignore cache and regenerate all summaries

Examples

# Basic usage - saves to ~/.claude/docs/docs.python.org.txt
llmtxt https://docs.python.org/3/

# Use a different model
llmtxt https://react.dev --model gpt-4o

# Preview URLs with cost estimate before processing (no API calls)
llmtxt https://react.dev --show-urls

# Limit scope for testing
llmtxt https://docs.python.org --max-urls 50

# Custom crawl depth (explore deeper or shallower)
llmtxt https://example.com --depth 2

# Use only cached summaries (no API calls)
llmtxt https://docs.python.org/3/ --use-cache-only

# Force refresh all summaries (ignore cache)
llmtxt https://docs.python.org/3/ --force-refresh

# Custom cache directory
llmtxt https://example.com --cache-dir /tmp/my-cache

# Custom output location
llmtxt https://react.dev --output ./my-docs/react.txt

# Process with higher concurrency (if you have high rate limits)
llmtxt https://fastapi.tiangolo.com --max-concurrent-summaries 20

Searching and Listing

This tool focuses on generating llms-brief.txt files. For searching and listing, use standard Unix tools:

Search Documentation

# Search all docs
rg "async functions" ~/.claude/docs/

# Search specific file
rg "hooks" ~/.claude/docs/react.dev.txt

# Case-insensitive search
rg -i "error handling" ~/.claude/docs/

# Show context around matches
rg -C 2 "api" ~/.claude/docs/

# Or use grep
grep -r "async" ~/.claude/docs/

List Documentation

# List all docs
ls ~/.claude/docs/

# List with details
ls -lh ~/.claude/docs/

# Count entries in a file
grep -c "^Title:" ~/.claude/docs/react.dev.txt

# Find all docs and show sizes
find ~/.claude/docs/ -name "*.txt" -exec wc -l {} +

Why use standard tools? They're:

Already installed on your system
More powerful and flexible
Well-documented
Composable with other commands
Typically faster than a custom implementation

How It Works

URL Discovery

The tool uses a comprehensive breadth-first search strategy:

Explores links up to 3 levels deep from your starting URL by default (configurable via --depth)
Automatically excludes assets (CSS, JS, images) and non-documentation pages
Sophisticated URL normalization prevents duplicate processing
Discovers 100-300+ pages on typical documentation sites
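The strategy above can be sketched as a depth-limited breadth-first search over links. This is a simplified model, not the tool's implementation: `normalize`, `is_asset`, `discover`, and the in-memory `SITE` map are hypothetical, and the normalization shown (dropping fragments and trailing slashes) is only one plausible rule.

```python
from collections import deque
from urllib.parse import urldefrag, urlparse

# Assumed asset extensions; the real exclusion list is not documented here.
ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico")


def normalize(url: str) -> str:
    """Drop the fragment and trailing slash so equivalent URLs compare equal."""
    url, _fragment = urldefrag(url)
    return url.rstrip("/")


def is_asset(url: str) -> bool:
    return urlparse(url).path.lower().endswith(ASSET_SUFFIXES)


def discover(start: str, get_links, max_depth: int = 3) -> list[str]:
    """Breadth-first discovery; get_links(url) returns the links found on a page."""
    start = normalize(start)
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # pages at the depth limit are kept but not expanded
        for link in get_links(url):
            link = normalize(link)
            if link in seen or is_asset(link):
                continue
            seen.add(link)
            order.append(link)
            queue.append((link, depth + 1))
    return order


# Tiny in-memory "site" standing in for real HTTP fetches.
SITE = {
    "https://example.com/docs": ["https://example.com/docs/a/", "https://example.com/style.css"],
    "https://example.com/docs/a": ["https://example.com/docs/b#intro", "https://example.com/docs"],
    "https://example.com/docs/b": ["https://example.com/docs/c"],
}

pages = discover("https://example.com/docs/", lambda u: SITE.get(u, []), max_depth=2)
print(pages)
```

With `max_depth=2` the page at depth 3 (`/docs/c`) is never reached, which mirrors how a lower --depth value narrows the crawl.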

Content Processing Pipeline

URL Discovery → Content Extraction → LLM Summarization → File Generation

Crawl: Discover all documentation URLs
Extract: Convert HTML to markdown using trafilatura
Summarize: Generate structured summaries using OpenAI
Cache: Store summaries in .llmsbrieftxt_cache/ for reuse
Generate: Compile into searchable llms-brief.txt format

Output Format

Each entry in the generated file contains:

Title: [Page Name](URL)
Keywords: searchable, terms, functions, concepts
Summary: One-line description of page content
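Because every entry follows the same three-line shape, the file is easy to read back programmatically. The parser below is a sketch under assumptions: `Entry` and `parse_entries` are hypothetical names, the separator between entries is not specified above (any lines that do not start with one of the three labels are ignored), and URLs containing a closing parenthesis are not handled.

```python
import re
from dataclasses import dataclass

TITLE_RE = re.compile(r"^Title: \[(?P<title>.*)\]\((?P<url>[^)]*)\)$")


@dataclass
class Entry:
    title: str
    url: str
    keywords: list[str]
    summary: str


def parse_entries(text: str) -> list[Entry]:
    """Collect Title/Keywords/Summary triples from llms-brief.txt content."""
    entries: list[Entry] = []
    current: Entry | None = None
    for raw in text.splitlines():
        line = raw.strip()
        match = TITLE_RE.match(line)
        if match:
            current = Entry(match["title"], match["url"], [], "")
            entries.append(current)
        elif current is not None and line.startswith("Keywords:"):
            values = line[len("Keywords:"):].split(",")
            current.keywords = [v.strip() for v in values if v.strip()]
        elif current is not None and line.startswith("Summary:"):
            current.summary = line[len("Summary:"):].strip()
    return entries


SAMPLE = """\
Title: [asyncio](https://docs.python.org/3/library/asyncio.html)
Keywords: async, await, event loop
Summary: Library for writing concurrent code with async/await.

Title: [json](https://docs.python.org/3/library/json.html)
Keywords: encode, decode
Summary: JSON encoder and decoder.
"""

for entry in parse_entries(SAMPLE):
    print(entry.title, entry.url)
```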

Development

Setup

# Clone and install with dev dependencies
git clone https://github.com/stevennevins/llmsbrief.git
cd llmsbrief
uv sync --group dev

Running Tests

# All tests
uv run pytest

# Unit tests only
uv run pytest tests/unit/

# Specific test file
uv run pytest tests/unit/test_cli.py

# With verbose output
uv run pytest -v

E2E Testing with Ollama (No API Costs)

For testing without OpenAI API costs, use Ollama as a local LLM provider:

# 1. Install Ollama (one-time setup)
curl -fsSL https://ollama.com/install.sh | sh
# Or download from: https://ollama.com/download

# 2. Start Ollama service
ollama serve &

# 3. Pull a lightweight model
ollama pull tinyllama  # 637MB, fastest
# Or: ollama pull phi3:mini  # 2.3GB, better quality

# 4. Run E2E tests with Ollama
export OPENAI_BASE_URL="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama-dummy-key"
uv run pytest tests/integration/test_ollama_e2e.py -v

# 5. Or test the CLI directly
llmtxt https://example.com --model tinyllama --max-urls 5 --depth 1

Benefits:

✅ Zero API costs - runs completely local
✅ OpenAI-compatible endpoint
✅ Same code path as production
✅ Cached in GitHub Actions for CI/CD

Recommended Models:

tinyllama (637MB) - Fastest, great for CI/CD
phi3:mini (2.3GB) - Better quality, still fast
gemma2:2b (1.6GB) - Balanced option

Code Quality

# Lint code
uv run ruff check llmsbrieftxt/ tests/

# Format code
uv run ruff format llmsbrieftxt/ tests/

# Type checking
uv run mypy llmsbrieftxt/

Configuration

Default Settings

Crawl Depth: 3 levels (configurable via --depth)
Output Location: ~/.claude/docs/<domain>.txt (configurable via --output)
Cache Directory: .llmsbrieftxt_cache/ (configurable via --cache-dir)
OpenAI Model: gpt-5-mini (configurable via --model)
Concurrent Requests: 10 (configurable via --max-concurrent-summaries)

Environment Variables

OPENAI_API_KEY - Required for all operations
OPENAI_BASE_URL - Optional. Set to use OpenAI-compatible endpoints (e.g., Ollama at http://localhost:11434/v1)
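To make the interplay of the two variables concrete, here is a small stdlib-only sketch of how a client could resolve them. `resolve_settings` is a hypothetical helper, not part of llmsbrieftxt, and the fallback to the public OpenAI endpoint is an assumption about what "optional" means here.

```python
import os
from collections.abc import Mapping

# Public OpenAI endpoint, assumed to be the fallback when OPENAI_BASE_URL is unset.
DEFAULT_BASE_URL = "https://api.openai.com/v1"


def resolve_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the API key and base URL a client would be configured with."""
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    base_url = env.get("OPENAI_BASE_URL") or DEFAULT_BASE_URL
    return {"api_key": api_key, "base_url": base_url}


# Ollama-style configuration: any non-empty key plus a local endpoint.
print(resolve_settings({
    "OPENAI_API_KEY": "ollama-dummy-key",
    "OPENAI_BASE_URL": "http://localhost:11434/v1",
}))
```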

Usage Tips

Managing API Costs

Preview with cost estimate: Use --show-urls to see discovered URLs and estimated API cost before processing
Limit scope: Use --max-urls to limit processing during testing
Automatic caching: Summaries are cached automatically - rerunning is cheap
Cache-only mode: Use --use-cache-only to generate output from cache without API calls
Force refresh: Use --force-refresh when you need to regenerate all summaries
Cost-effective model: Default model gpt-5-mini is cost-effective for most documentation

Controlling Crawl Depth

Default depth (3): Good for most documentation sites (100-300 pages)
Shallow crawl (1-2): Use for large sites or to focus on main pages only
Deep crawl (4-5): Use for small sites or comprehensive coverage
Example: llmtxt https://example.com --depth 2 --show-urls to preview scope

Cache Management

Default location: .llmsbrieftxt_cache/ in current directory
Custom location: Use --cache-dir for shared caches or different organization
Cache benefits: Speeds up reruns, reduces API costs, enables incremental updates
Failed URL tracking: Failed URLs are written to failed_urls.txt next to the output file

Organizing Documentation

All docs are saved to ~/.claude/docs/ by domain name:

~/.claude/docs/
├── docs.python.org.txt
├── react.dev.txt
├── pytorch.org.txt
└── fastapi.tiangolo.com.txt

This makes it easy for Claude Code and other tools to find and reference documentation.

Integrations

Claude Code

This tool is designed to work seamlessly with Claude Code. Once you've generated documentation files, Claude can search and reference them during development sessions.

MCP Servers

Generated llms-brief.txt files can be served via MCP (Model Context Protocol) servers. See the mcpdoc project for an example integration.

Troubleshooting

API Key Issues

# Verify API key is set (prints "set" without revealing the key)
echo "${OPENAI_API_KEY:+set}"

# Set it if missing
export OPENAI_API_KEY="sk-your-api-key-here"

Rate Limiting

If you hit rate limits, reduce concurrent requests:

llmtxt https://example.com --max-concurrent-summaries 5
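A concurrency cap like --max-concurrent-summaries is commonly implemented with a semaphore around each request. The sketch below shows that general technique with `asyncio.Semaphore`; it is not taken from the tool's source, and `summarize_all` and `fake_summarize` are hypothetical stand-ins for the real LLM calls.

```python
import asyncio


async def summarize_all(urls, summarize, max_concurrent: int = 10):
    """Run summarize(url) for every URL with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with semaphore:
            return await summarize(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


async def _demo():
    in_flight = 0
    peak = 0

    async def fake_summarize(url):
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stands in for a network round trip
        in_flight -= 1
        return f"summary of {url}"

    urls = [f"https://example.com/{i}" for i in range(20)]
    results = await summarize_all(urls, fake_summarize, max_concurrent=5)
    return results, peak


results, peak = asyncio.run(_demo())
print(len(results), peak)
```

Lowering the cap trades throughput for fewer simultaneous requests, which is the lever to pull when the provider returns rate-limit errors.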

Large Documentation Sites

For very large sites (500+ pages):

Start with --show-urls to see scope
Use --max-urls to process in batches
Increase --max-concurrent-summaries if you have high rate limits

Migrating from 0.x

Version 1.0.0 removes the search and list subcommands in favor of standard Unix tools:

# Before (v0.x)
llmsbrieftxt generate https://docs.python.org/3/
llmsbrieftxt search "async"
llmsbrieftxt list

# After (v1.0.0)
llmtxt https://docs.python.org/3/
rg "async" ~/.claude/docs/
ls ~/.claude/docs/

Why the change? Focus on doing one thing well. Search and list are better served by mature, powerful Unix tools you already have.

License

MIT

Contributing

Contributions welcome! Please:

Run tests: uv run pytest
Lint code: uv run ruff check llmsbrieftxt/ tests/
Format code: uv run ruff format llmsbrieftxt/ tests/
Check types: uv run mypy llmsbrieftxt/
Submit a PR

Release history

1.11.1 - Nov 10, 2025
1.11.0 - Nov 10, 2025
1.10.0 - Nov 10, 2025
1.9.0 - Nov 10, 2025
1.8.3 - Nov 10, 2025
1.8.2 - Nov 10, 2025
1.8.1 - Nov 10, 2025
1.8.0 - Nov 10, 2025 (this version)
1.7.0 - Nov 10, 2025
1.6.0 - Nov 10, 2025
1.5.0 - Nov 10, 2025
1.4.0 - Nov 10, 2025
1.3.1 - Nov 10, 2025
1.3.0 - Nov 10, 2025
1.2.0 - Nov 9, 2025
1.1.5 - Nov 9, 2025

Download files

Source Distribution

llmsbrieftxt-1.8.0.tar.gz (122.9 kB)

Uploaded Nov 10, 2025 Source

Built Distribution

llmsbrieftxt-1.8.0-py3-none-any.whl (28.1 kB)

Uploaded Nov 10, 2025 Python 3

File details

Details for the file llmsbrieftxt-1.8.0.tar.gz.

File metadata

Download URL: llmsbrieftxt-1.8.0.tar.gz
Upload date: Nov 10, 2025
Size: 122.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for llmsbrieftxt-1.8.0.tar.gz
Algorithm	Hash digest
SHA256	`990c4ce126d24e2c170fa54479863f815e9e27849e09c345dd90942008b129c7`
MD5	`d2058aebf6cf8c7c5499d43e00c52d3e`
BLAKE2b-256	`4a1cf490706ff2ea0bd0c7a36056d3b3d32285e3799d972744c0a0e0c605ca9b`

Provenance

The following attestation bundles were made for llmsbrieftxt-1.8.0.tar.gz:

Publisher: release.yml on stevennevins/llmstxt

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llmsbrieftxt-1.8.0.tar.gz
- Subject digest: 990c4ce126d24e2c170fa54479863f815e9e27849e09c345dd90942008b129c7
- Sigstore transparency entry: 687996105
- Sigstore integration time: Nov 10, 2025
Source repository:
- Permalink: stevennevins/llmstxt@883cad4d3f8abc99e5df684c8aa5884f41bd811b
- Branch / Tag: refs/heads/main
- Owner: https://github.com/stevennevins
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@883cad4d3f8abc99e5df684c8aa5884f41bd811b
- Trigger Event: push

File details

Details for the file llmsbrieftxt-1.8.0-py3-none-any.whl.

File metadata

Download URL: llmsbrieftxt-1.8.0-py3-none-any.whl
Upload date: Nov 10, 2025
Size: 28.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for llmsbrieftxt-1.8.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`da42ced14a65c849174c30706ea1dcb1a5a824957be0469f7daa939f004fd9aa`
MD5	`eb748635a4ccd06c2a91dd132f8f40d6`
BLAKE2b-256	`34dac226843e3a14c39dd7b28bca26dfda345803ca5b62fe1ed0466bc56c055f`

Provenance

The following attestation bundles were made for llmsbrieftxt-1.8.0-py3-none-any.whl:

Publisher: release.yml on stevennevins/llmstxt

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llmsbrieftxt-1.8.0-py3-none-any.whl
- Subject digest: da42ced14a65c849174c30706ea1dcb1a5a824957be0469f7daa939f004fd9aa
- Sigstore transparency entry: 687996165
- Sigstore integration time: Nov 10, 2025
Source repository:
- Permalink: stevennevins/llmstxt@883cad4d3f8abc99e5df684c8aa5884f41bd811b
- Branch / Tag: refs/heads/main
- Owner: https://github.com/stevennevins
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@883cad4d3f8abc99e5df684c8aa5884f41bd811b
- Trigger Event: push

llmsbrieftxt 1.8.0
