
llmsbrieftxt

Generate llms-brief.txt files from any documentation website using AI. A focused, production-ready CLI tool that does one thing exceptionally well.

Quick Start

# Install
pip install llmsbrieftxt

# Set your OpenAI API key
export OPENAI_API_KEY="sk-your-api-key-here"

# Generate llms-brief.txt from a documentation site
llmtxt https://docs.python.org/3/

# Preview URLs before processing
llmtxt https://react.dev --show-urls

# Use a different model
llmtxt https://react.dev --model gpt-4o

What It Does

Crawls documentation websites, extracts their content, and uses OpenAI to generate structured llms-brief.txt files. Each entry contains a title, URL, keywords, and a one-line summary, making it easy for LLMs and developers to navigate documentation.

Key Features:

  • Smart Crawling: Breadth-first discovery up to depth 3, with URL deduplication
  • Content Extraction: HTML to Markdown using trafilatura
  • AI Summarization: Structured output using OpenAI
  • Automatic Caching: Summaries cached in .llmsbrieftxt_cache/ to avoid reprocessing
  • Production-Ready: Clean output, proper error handling, scriptable

Installation

# With pip
pip install llmsbrieftxt

# With uv (recommended)
uv pip install llmsbrieftxt

Prerequisites

  • Python 3.10+
  • OpenAI API Key: Required for generating summaries
    export OPENAI_API_KEY="sk-your-api-key-here"
    

Usage

Basic Command

llmtxt <url> [options]

Output is automatically saved to ~/.claude/docs/<domain>.txt (e.g., docs.python.org.txt)

Options

  • --output PATH - Custom output path (default: ~/.claude/docs/<domain>.txt)
  • --model MODEL - OpenAI model to use (default: gpt-5-mini)
  • --max-concurrent-summaries N - Concurrent LLM requests (default: 10)
  • --show-urls - Preview discovered URLs without processing
  • --max-urls N - Limit number of URLs to process

Examples

# Basic usage - saves to ~/.claude/docs/docs.python.org.txt
llmtxt https://docs.python.org/3/

# Use a different model
llmtxt https://react.dev --model gpt-4o

# Preview URLs before processing (no API calls)
llmtxt https://react.dev --show-urls

# Limit scope for testing
llmtxt https://docs.python.org --max-urls 50

# Custom output location
llmtxt https://react.dev --output ./my-docs/react.txt

# Process with higher concurrency (if you have high rate limits)
llmtxt https://fastapi.tiangolo.com --max-concurrent-summaries 20

Searching and Listing

This tool focuses on generating llms-brief.txt files. For searching and listing, use standard Unix tools:

Search Documentation

# Search all docs
rg "async functions" ~/.claude/docs/

# Search specific file
rg "hooks" ~/.claude/docs/react.dev.txt

# Case-insensitive search
rg -i "error handling" ~/.claude/docs/

# Show context around matches
rg -C 2 "api" ~/.claude/docs/

# Or use grep
grep -r "async" ~/.claude/docs/

List Documentation

# List all docs
ls ~/.claude/docs/

# List with details
ls -lh ~/.claude/docs/

# Count entries in a file
grep -c "^Title:" ~/.claude/docs/react.dev.txt

# Find all docs and show sizes
find ~/.claude/docs/ -name "*.txt" -exec wc -l {} +

Why use standard tools? They're:

  • Already installed on your system
  • More powerful and flexible
  • Well-documented
  • Composable with other commands
  • Typically faster than a custom implementation

How It Works

URL Discovery

The tool uses a breadth-first search strategy:

  • Explores links up to 3 levels deep from your starting URL
  • Automatically excludes assets (CSS, JS, images) and non-documentation pages
  • URL normalization prevents duplicate processing
  • Discovers 100-300+ pages on typical documentation sites
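
The exact normalization rules are internal to the tool, but the idea can be sketched like this (an illustrative sketch, not the actual implementation; the extension list and function names are assumptions):

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed set of asset extensions to skip; the real tool's list may differ
ASSET_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".svg", ".ico", ".woff2"}

def normalize_url(url: str) -> str:
    """Canonicalize a URL so duplicates collapse to one key."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"  # treat /a/ and /a as the same page
    # Lowercase the host and drop the #fragment
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))

def is_asset(url: str) -> bool:
    """Skip stylesheets, scripts, and images during crawling."""
    path = urlsplit(url).path.lower()
    return any(path.endswith(ext) for ext in ASSET_EXTENSIONS)

# All three variants collapse to a single canonical URL
seen = {normalize_url(u) for u in [
    "https://docs.python.org/3/library/",
    "https://docs.python.org/3/library",
    "https://docs.python.org/3/library/#index",
]}
```

Deduplicating on the normalized form keeps trailing slashes and fragment anchors from triggering repeat crawls of the same page.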

Content Processing Pipeline

URL Discovery → Content Extraction → LLM Summarization → File Generation
  1. Crawl: Discover all documentation URLs
  2. Extract: Convert HTML to markdown using trafilatura
  3. Summarize: Generate structured summaries using OpenAI
  4. Cache: Store summaries in .llmsbrieftxt_cache/ for reuse
  5. Generate: Compile into searchable llms-brief.txt format
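
The cache lookup in step 4 can be sketched as follows. This is an illustrative sketch, not the tool's actual on-disk format: a temporary directory stands in for .llmsbrieftxt_cache/, and fake_summarize is a stand-in for the OpenAI call.

```python
import hashlib
import json
import tempfile
from pathlib import Path

# Stand-in for .llmsbrieftxt_cache/; a temp dir keeps the sketch self-contained
CACHE_DIR = Path(tempfile.mkdtemp())

def cache_path(url: str) -> Path:
    # One file per URL, keyed by a stable hash (illustrative naming scheme)
    return CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".json")

def get_or_summarize(url: str, summarize) -> dict:
    """Return a cached summary if present; otherwise compute and store it."""
    path = cache_path(url)
    if path.exists():
        return json.loads(path.read_text())
    summary = summarize(url)
    path.write_text(json.dumps(summary))
    return summary

calls = []
def fake_summarize(url):
    # Stand-in for the real LLM call; records how often it runs
    calls.append(url)
    return {"url": url, "summary": "stub"}

first = get_or_summarize("https://example.com/docs", fake_summarize)
second = get_or_summarize("https://example.com/docs", fake_summarize)
# The second call is served from the cache, so fake_summarize runs only once
```

This is why rerunning the tool on the same site is cheap: only newly discovered pages trigger API calls.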

Output Format

Each entry in the generated file contains:

Title: [Page Name](URL)
Keywords: searchable, terms, functions, concepts
Summary: One-line description of page content
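
A file in this format is easy to consume programmatically. A minimal parsing sketch, assuming only the three-line entry layout shown above (parse_entries is not part of the tool):

```python
def parse_entries(text: str) -> list[dict]:
    """Split an llms-brief.txt-style file into one dict per entry."""
    entries, current = [], {}
    for line in text.splitlines():
        if line.startswith("Title:"):
            if current:          # a new Title: line starts the next entry
                entries.append(current)
            current = {"title": line[len("Title:"):].strip()}
        elif line.startswith("Keywords:"):
            current["keywords"] = [k.strip() for k in line[len("Keywords:"):].split(",")]
        elif line.startswith("Summary:"):
            current["summary"] = line[len("Summary:"):].strip()
    if current:
        entries.append(current)
    return entries

sample = """Title: [Asyncio](https://docs.python.org/3/library/asyncio.html)
Keywords: async, await, event loop
Summary: Overview of Python's asyncio library.
"""
entries = parse_entries(sample)
```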

Development

Setup

# Clone and install with dev dependencies
git clone https://github.com/stevennevins/llmsbrief.git
cd llmsbrief
uv sync --group dev

Running Tests

# All tests
uv run pytest

# Unit tests only
uv run pytest tests/unit/

# Specific test file
uv run pytest tests/unit/test_cli.py

# With verbose output
uv run pytest -v

Code Quality

# Lint code
uv run ruff check llmsbrieftxt/ tests/

# Format code
uv run ruff format llmsbrieftxt/ tests/

# Type checking
uv run mypy llmsbrieftxt/

Configuration

Default Settings

  • Crawl Depth: 3 levels (hardcoded)
  • Output Location: ~/.claude/docs/<domain>.txt
  • Cache Directory: .llmsbrieftxt_cache/
  • OpenAI Model: gpt-5-mini
  • Concurrent Requests: 10

Environment Variables

  • OPENAI_API_KEY - Required for all operations

Usage Tips

Managing API Costs

  • Use --show-urls first to preview scope
  • Use --max-urls to limit processing during testing
  • Summaries are cached automatically - rerunning is cheap
  • Default model gpt-5-mini is cost-effective for most documentation

Organizing Documentation

All docs are saved to ~/.claude/docs/ by domain name:

~/.claude/docs/
├── docs.python.org.txt
├── react.dev.txt
├── pytorch.org.txt
└── fastapi.tiangolo.com.txt

This makes it easy for Claude Code and other tools to find and reference documentation.

Integrations

Claude Code

This tool is designed to work seamlessly with Claude Code. Once you've generated documentation files, Claude can search and reference them during development sessions.

MCP Servers

Generated llms-brief.txt files can be served via MCP (Model Context Protocol) servers. See the mcpdoc project for an example integration.

Troubleshooting

API Key Issues

# Verify API key is set
echo $OPENAI_API_KEY

# Set it if missing
export OPENAI_API_KEY="sk-your-api-key-here"

Rate Limiting

If you hit rate limits, reduce concurrent requests:

llmtxt https://example.com --max-concurrent-summaries 5

Large Documentation Sites

For very large sites (500+ pages):

  1. Start with --show-urls to see scope
  2. Use --max-urls to process in batches
  3. Increase --max-concurrent-summaries if you have high rate limits

Migrating from 0.x

Version 1.0.0 removes search and list subcommands in favor of Unix tools:

# Before (v0.x)
llmsbrieftxt generate https://docs.python.org/3/
llmsbrieftxt search "async"
llmsbrieftxt list

# After (v1.0.0)
llmtxt https://docs.python.org/3/
rg "async" ~/.claude/docs/
ls ~/.claude/docs/

Why the change? Focus on doing one thing well. Search and list are better served by mature, powerful Unix tools you already have.

License

MIT

Contributing

Contributions welcome! Please:

  1. Run tests: uv run pytest
  2. Lint code: uv run ruff check llmsbrieftxt/ tests/
  3. Format code: uv run ruff format llmsbrieftxt/ tests/
  4. Check types: uv run mypy llmsbrieftxt/
  5. Submit a PR

