Generates LLM context by scraping and summarizing documentation for Python libraries listed in a requirements.txt file.

LLM-Min: Generate Compact Docs for LLMs

License: MIT

Problem: LLMs often rely on outdated training data, leading to "vibe coding" where developers guess at API usage rather than working with current, accurate information. This makes development with rapidly evolving libraries challenging and error-prone.

Solution: llm-min automatically crawls Python library documentation (or a specified URL) and uses Google Gemini to generate compact, structured summaries (llm-min.txt) optimized for providing up-to-date context to LLMs. It also saves the full crawled text (llm-full.txt) for reference and copies the guideline used for structuring the summary (llm-min-guideline.md).

Give your LLMs the fresh, focused context they need to avoid hallucinations and generate accurate code.

Key Features

  • Automated Crawling: Crawls provided documentation URLs for any language. Can also automatically find and scrape official Python package documentation via package name.
  • LLM-Powered Summarization: Creates concise, structured summaries using Google Gemini, generating an llm-min.txt file composed of Atomic Information Units (AIUs).
  • Flexible Input: Process packages from comma-separated names or URLs.
  • Organized Output: Saves results neatly per package (output_dir/package_name/).
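
The per-package layout can be sketched as follows. This is a minimal illustration, not the tool's actual writer: the helper name write_outputs is hypothetical, and only the directory pattern and the three file names are taken from this README.

```python
from pathlib import Path
import tempfile


def write_outputs(output_dir, package_name, full_text, min_text, guideline):
    """Write the three output files under output_dir/package_name/ (hypothetical helper)."""
    target = Path(output_dir) / package_name
    target.mkdir(parents=True, exist_ok=True)
    files = {
        "llm-full.txt": full_text,          # raw crawled text
        "llm-min.txt": min_text,            # compact summary
        "llm-min-guideline.md": guideline,  # copy of the guideline
    }
    for name, content in files.items():
        (target / name).write_text(content, encoding="utf-8")
    return target


if __name__ == "__main__":
    # Use a temporary directory so the sketch leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        out = write_outputs(tmp, "requests", "raw docs", "compact docs", "guideline")
        print(sorted(p.name for p in out.iterdir()))
```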

Inspiration

  • min.js (code can be compressed aggressively)
  • LLMs creating a new language (there is no need to stay within existing languages; reasoning models are very good at abstract notation)
  • context7 (solves the same problem, but has limits)

Why not llms.txt?

Personally, I love the concept of llms.txt, and it is the major inspiration for this project. Even so, a few problems remain:

  • Standard documentation often contains numerous redirect links, which LLMs may struggle to follow or interpret correctly.
  • Raw documentation is typically not optimized for token efficiency, leading to verbose input that can be costly or exceed LLM context window limits.
  • It can be difficult to ensure that an LLM is referencing the latest version of online documentation, and developers might not update it frequently.

Understanding llm-min.txt and Atomic Information Units (AIUs)

The llm-min.txt file contains a KNOWLEDGE_BASE specifically structured for LLM consumption. This KNOWLEDGE_BASE is composed of "Atomic Information Units" (AIUs). Each AIU represents a distinct piece of information about the library, such as a function, a class, a feature, or a usage pattern.

AIUs are designed to be:

  • Atomic: Representing a single, focused concept.
  • Structured: Containing specific fields like type (typ), name (name), purpose (purp), inputs (in), outputs (out), usage examples (use), and relationships to other AIUs (rel).
  • Compact: Using abbreviations and minimal syntax to maximize information density.

The goal of this format is to provide an LLM with a rich, interconnected understanding of the library's capabilities and how to use them, enabling it to answer questions and generate accurate code. The full specification for interpreting the KNOWLEDGE_BASE and AIU structure can be found in the llm-min-guideline.md file (which is a copy of assets/guideline.md from the llm-min package). This guideline details the fields, abbreviations, and overall schema.
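
To make the field list concrete, here is a minimal Python model of a single AIU. It is only a sketch of the fields named above: the actual on-disk syntax is defined in llm-min-guideline.md and is not reproduced here, and the class, the example function fetch_page, and its values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AIU:
    """One Atomic Information Unit; attributes mirror the abbreviations listed above."""
    typ: str
    name: str
    purp: str
    # Stored as `inputs` because `in` is a Python keyword; serialized as "in".
    inputs: list = field(default_factory=list)
    out: list = field(default_factory=list)
    use: list = field(default_factory=list)
    rel: list = field(default_factory=list)

    def to_dict(self):
        """Return the unit keyed by the abbreviated field names."""
        return {
            "typ": self.typ, "name": self.name, "purp": self.purp,
            "in": self.inputs, "out": self.out, "use": self.use, "rel": self.rel,
        }


unit = AIU(
    typ="function",
    name="fetch_page",  # hypothetical library function
    purp="Download one page and return its text",
    inputs=["url: str"],
    out=["text: str"],
    use=["text = fetch_page('https://example.com')"],
    rel=["parse_page"],  # hypothetical related AIU
)
print(unit.to_dict()["in"])
```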

Snippets of the KNOWLEDGE_BASE format include:


Sample Output & Compression

A sample output for the crawl4ai library is available in the sample/crawl4ai/ directory. This can give you a concrete idea of what llm-min produces:

  • sample/crawl4ai/llm-full.txt: Contains the raw, complete text crawled from the crawl4ai documentation (Size: 124,424 tokens).
  • sample/crawl4ai/llm-min.txt: Contains the structured KNOWLEDGE_BASE generated by the LLM (Size: 22,422 tokens).
  • sample/crawl4ai/llm-min-guideline.md: A copy of the guideline used by the LLM for structuring the llm-min.txt content.

Compression Achieved:

In this specific example, llm-min reduced the input documentation size from approximately 511 KB to 76 KB. This represents a compression of about 85%, meaning the llm-min.txt file is roughly 15% of the original crawled text size.
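
The arithmetic behind these figures can be checked with a few lines of Python. The helper name reduction_percent is just for illustration; the inputs are the sizes quoted in this README. Note that the file-size figure and the token-count figure measure different things, so the two percentages differ slightly (about 85% by size, about 82% by tokens).

```python
def reduction_percent(original, compact):
    """Percentage by which `compact` is smaller than `original`."""
    return (1 - compact / original) * 100


# Figures quoted in this README for the crawl4ai sample.
size_kb = reduction_percent(511, 76)
tokens = reduction_percent(124_424, 22_422)

print(f"size reduction:  {size_kb:.1f}%")
print(f"token reduction: {tokens:.1f}%")
```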

This significant reduction in size, coupled with the structured format of llm-min.txt, makes the information much more digestible and efficient for Large Language Models, providing focused context without overwhelming them with raw documentation.

You can explore these files to see the transformation from verbose documentation to a compact, LLM-ready format.

Use Case: For example, this tool can be used to provide context to AI coding assistants like Cursor to improve their understanding of new or rapidly changing libraries.

Supported Languages

llm-min is designed to be language-agnostic when you provide direct documentation URLs using the --doc-urls option. This allows you to generate summaries for documentation written in any programming language (e.g., JavaScript, Java, Rust, Go, etc.) or even for non-programming related textual content.

When using the --packages option, llm-min currently leverages a search mechanism optimized for finding official Python package documentation. Therefore, this specific input method is best suited for Python libraries.

For all other languages and content types, please use the --doc-urls option.

Quick Start

1. Installation:

Using pip (Recommended for users):

pip install llm-min

For Development/Contribution (Using uv):

# Clone (if you haven't already)
# cd llm-min

# Install dependencies (using uv)
python -m venv .venv
source .venv/bin/activate # or .venv\Scripts\activate on Windows
uv sync # Installs dependencies from pyproject.toml
uv pip install -e .

# Install browser binaries for crawling
playwright install

# Optional: Install pre-commit hooks for development
# uv pip install pre-commit
# pre-commit install

2. Configure API Key:

  • Recommended: Copy .env.example to .env and add your GEMINI_API_KEY. The application will automatically load it.
  • Alternatively: You can provide the key directly using the --gemini-api-key CLI flag.
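
The two ways of supplying the key can be pictured as a simple lookup. This sketch assumes the CLI flag takes precedence over the environment variable, which this README does not state; resolve_api_key is a hypothetical helper, not part of llm-min.

```python
import os


def resolve_api_key(cli_value=None, env=None):
    """Return the CLI-provided key if present, else GEMINI_API_KEY from the environment.

    The precedence (CLI first) is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    key = cli_value or env.get("GEMINI_API_KEY")
    if not key:
        raise ValueError("No Gemini API key: set GEMINI_API_KEY or pass --gemini-api-key")
    return key


if __name__ == "__main__":
    print(resolve_api_key(cli_value=None, env={"GEMINI_API_KEY": "from-env"}))
```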

3. Generate Docs (CLI Usage):

llm-min requires at least one input source (packages or URLs) and offers options to customize its behavior.

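Input Sources (provide at least one; can be combined):

  • --packages (short form -pkg, as used in the example below): Comma-separated package names; best suited to Python libraries.
  • --doc-urls (short form -u, as used in the example below): Comma-separated documentation URLs, for any language.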

Common Options:

  • -o, --output-dir DIRECTORY: Directory to save outputs (default: llm_min_docs).
  • -p, --max-crawl-pages INTEGER: Max pages to crawl per package (default: 200, 0 for unlimited).
  • -D, --max-crawl-depth INTEGER: Max crawl depth from start URL (default: 3).
  • -c, --chunk-size INTEGER: Character chunk size for LLM compaction (default: 1,000,000).
  • -k, --gemini-api-key TEXT: Gemini API Key (or use GEMINI_API_KEY env var).
  • --gemini-model TEXT: Gemini model (default: gemini-2.5-flash-preview-04-17).
  • -v, --verbose: Enable verbose logging.
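
To illustrate how a page limit and a depth limit bound a crawl, here is a small breadth-first sketch over an in-memory link graph. It is not llm-min's crawler: bounded_crawl and the site dictionary are hypothetical, no network requests are made, and the exact depth semantics of the real tool may differ.

```python
from collections import deque


def bounded_crawl(start, links, max_pages=200, max_depth=3):
    """Breadth-first traversal that stops at max_pages pages or max_depth hops.

    `links` maps a URL to the URLs it links to (a stand-in for real fetching).
    max_pages=0 means unlimited, mirroring the option description above.
    """
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        if max_pages and len(order) >= max_pages:
            break
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order


site = {
    "/": ["/guide", "/api"],
    "/guide": ["/guide/install"],
    "/guide/install": ["/guide/install/windows"],
    "/api": [],
}
print(bounded_crawl("/", site, max_depth=1))  # start page plus its direct links
print(bounded_crawl("/", site, max_pages=2))  # stops after two pages
```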

Example:

Process the typer package and the FastAPI documentation URL, limiting crawl to 50 pages for each, and save to my_docs:

llm-min -pkg "typer" -u "https://fastapi.tiangolo.com/" -o my_docs -p 50 --gemini-api-key YOUR_API_KEY

4. Generate Docs (Module Usage):

You can also import and use the LLMMinGenerator class directly in your Python code for programmatic control.

from llm_min import LLMMinGenerator
import os

# Configure LLM (optional, uses defaults if None)
llm_config = {
    "api_key": os.environ.get("GEMINI_API_KEY"), # Or provide directly
    "model_name": "gemini-2.5-flash-preview-04-17",
    "chunk_size": 1000000,
    "max_crawl_pages": 200,
    "max_crawl_depth": 3,
}

# Instantiate the generator
# Output will be saved to ./my_output_docs/package_name/ or ./my_output_docs/url_identifier/
generator = LLMMinGenerator(output_dir="./my_output_docs", llm_config=llm_config)

# Generate documentation for a package
try:
    generator.generate_from_package("requests")
    print("Documentation generated for 'requests'")
except Exception as e:
    print(f"Error generating documentation for 'requests': {e}")

# Generate documentation from a URL
try:
    generator.generate_from_url("https://docs.python.org/3/")
    print("Documentation generated for 'https://docs.python.org/3/'")
except Exception as e:
    print(f"Error generating documentation for 'https://docs.python.org/3/': {e}")

This example demonstrates how to create an instance of LLMMinGenerator, configure it, and then call either generate_from_package or generate_from_url to generate the documentation.

For a full list of options and their descriptions, run:

llm-min --help

Model choice

While you can specify different Gemini models using the --gemini-model option, we strongly recommend using gemini-2.5-flash-preview-04-17 (the default).

Here's why:

  1. Strong Reasoning: This model offers robust reasoning capabilities, which are crucial for accurately understanding and structuring documentation content into the KNOWLEDGE_BASE format.
  2. Long Context Window: With a 1 million token context window, gemini-2.5-flash-preview-04-17 is well-suited for processing extensive documentation, which is a common scenario for this tool.

Using the default model (gemini-2.5-flash-preview-04-17) provides a good balance of performance, cost, and capability for the task of generating compact LLM-friendly documentation.

Workflow Overview (src/llm_min)

The core logic of llm-min is encapsulated within the LLMMinGenerator class. The CLI entry point (src/llm_min/main.py) parses arguments and delegates the generation process to an instance of this class. All I/O-bound operations (like web requests and LLM calls) for each input item are handled asynchronously for efficiency.

[ User Runs CLI: llm-min ... ]
               |
               v
+-----------------------------+
| Parse CLI Args (main.py)    |
+-----------------------------+
               |
               v
+-----------------------------+
| Instantiate & Use           |
| LLMMinGenerator             |
| (src/llm_min/generator.py)  |
+-----------------------------+
               |
               v
+-----------------------------+
| LLMMinGenerator             |
| Orchestrates:               |
| - Search (search.py)        |
| - Crawling (crawler.py)     |
| - Compaction (compacter.py) |
| - Writing Outputs           |
+-----------------------------+
               |
               v
[ Outputs Saved to output_dir ]

Brief Explanation:

  1. CLI Input (main.py): The tool starts by parsing command-line arguments.
  2. Generator Instantiation (main.py): An instance of LLMMinGenerator is created with the specified configuration.
  3. Generation Delegation (main.py -> generator.py): The main function calls the appropriate method (generate_from_package or generate_from_url) on the LLMMinGenerator instance.
  4. Orchestration (generator.py): The LLMMinGenerator class orchestrates the underlying logic:
    • If a package name is given, it uses the search logic (search.py) to find the documentation URL.
    • It then uses the crawling logic (crawler.py) to fetch content from the URL.
    • The crawled content is compacted using the LLM logic (compacter.py and llm/gemini.py), guided by assets/llm_min_guideline.md.
    • Finally, it writes the output files (llm-full.txt, llm-min.txt, and llm-min-guideline.md).

All steps from URL discovery/reception through to output generation are performed for each package/URL, with I/O operations handled asynchronously within the generator.
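
The search, crawl, compact sequence can be sketched with asyncio. Everything below is a stand-in: the function names are hypothetical, no search engine, browser, or LLM is called, and running the packages concurrently with asyncio.gather is an assumption about how the asynchronous handling might look, not a description of generator.py.

```python
import asyncio


async def find_docs_url(package):
    """Stand-in for the search step (hypothetical; no network access)."""
    await asyncio.sleep(0)
    return f"https://example.com/{package}/docs"


async def crawl(url):
    """Stand-in for the crawl step: returns fake page text."""
    await asyncio.sleep(0)
    return f"full text crawled from {url}"


async def compact(text):
    """Stand-in for the LLM compaction step: just truncates the text."""
    await asyncio.sleep(0)
    return text[:20]


async def process_package(package):
    """Run one package through search, crawl, and compaction in order."""
    url = await find_docs_url(package)
    full = await crawl(url)
    minimal = await compact(full)
    return {"package": package, "llm-full.txt": full, "llm-min.txt": minimal}


async def main(packages):
    # Each package goes through the whole pipeline; packages run concurrently.
    return await asyncio.gather(*(process_package(p) for p in packages))


results = asyncio.run(main(["typer", "requests"]))
for item in results:
    print(item["package"], "->", item["llm-min.txt"])
```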

FAQ

Q: What if the documentation for a package can't be found automatically when using --packages?

A: llm-min uses a search engine to find the official documentation URL. If it fails, or picks the wrong one, you can use the --doc-urls option to provide the exact URL(s) for the documentation you want to process.

Q: How does llm-min handle very large documentation sets?

A: llm-min crawls documentation up to max-crawl-pages and max-crawl-depth. The crawled content is then split into chunks defined by chunk-size before being processed by the LLM. You might need to adjust these parameters for very large sites. The gemini-2.5-flash-preview-04-17 model (default) has a large context window (1M tokens) which helps in processing substantial content effectively.
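
As a rough picture of character-based chunking, the sketch below splits text into consecutive pieces of at most chunk_size characters. The function split_into_chunks is hypothetical; the real tool may split on different boundaries.

```python
def split_into_chunks(text, chunk_size=1_000_000):
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# 2.5 million characters with the default size give three chunks.
chunks = split_into_chunks("a" * 2_500_000)
print([len(c) for c in chunks])
```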

Q: What should I do if I encounter an error or get poor results?

A: Try running with the --verbose flag to get more detailed logs, which might indicate the issue. If a specific URL fails, test it in your browser. For poor summarization, you might experiment with a different Gemini model via --gemini-model, though the default is generally recommended. If problems persist, please open an issue on GitHub with details and logs.

Q: Do you vibe code?

Contributing

Contributions are welcome! Feel free to open issues or submit pull requests focusing on improving discovery, compaction, LLM support, or tests.

License

MIT License.
