Generates LLM context by scraping and summarizing documentation for Python libraries listed in a requirements.txt file.

LLM Minimal Documentation Generator

License: MIT

Overview

LLM Minimal Documentation Generator automatically scrapes and processes technical documentation for Python libraries. It generates two outputs for each library:

  1. llm-full.txt: The complete, raw text content crawled from the documentation website.
  2. llm-min.txt: A compact, structured summary of the documentation, optimized for consumption by Large Language Models (LLMs), generated using Google Gemini according to the PCS (Progressive Compaction Strategy) guide.

These focused context files give an LLM library-specific reference material, which helps it provide more accurate and relevant information about that library.

Features

  • Automatic Documentation Discovery: Finds official documentation URLs for specified Python packages.
  • Web Crawling: Efficiently scrapes documentation websites (powered by crawl4ai).
  • LLM-Powered Compaction: Uses Google Gemini to condense crawled documentation into a structured, minimal format (PCS).
  • Flexible Input: Accepts package lists from:
    • requirements.txt files.
    • Folders containing a requirements.txt file.
    • Direct string input.
  • Programmatic Usage: Provides a Python client (LLMMinClient) for integration into other workflows.
  • Configurable Crawling: Control maximum pages and depth for the web crawler.
  • Organized Output: Saves results in a structured directory format (output_dir/package_name/).

Installation

  1. Clone the repository:

    git clone <repository_url> # Replace with actual URL
    cd llm-min-generator       # Or your project directory name
    
  2. Set up the environment and install dependencies using uv:

    # Ensure you have uv installed (https://github.com/astral-sh/uv)
    python -m venv .venv
    source .venv/bin/activate # or .venv\Scripts\activate on Windows
    uv pip install -r requirements.txt # Or use the appropriate requirements file
    uv pip install -e . # Install the package in editable mode
    
  3. Configure API Key:

    • Copy the .env.example file to .env:
    cp .env.example .env
    
    • Edit the .env file and add your Google Gemini API key:
    GEMINI_API_KEY=YOUR_API_KEY_HERE
    
    • Alternatively, you can provide the key directly via the --gemini-api-key command-line option or when initializing LLMMinClient.
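
The precedence described in this step (a key supplied directly wins over the value from the environment) can be sketched as follows. This is a minimal illustration, not the tool's actual code: resolve_api_key is a hypothetical helper, and only the GEMINI_API_KEY variable name comes from this README. The sketch does not read the .env file; it assumes the variable has already been loaded into the process environment.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    explicit_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the explicit key if given, else GEMINI_API_KEY from the environment.

    Hypothetical helper for illustration; not part of llm-min-generator.
    """
    env = os.environ if env is None else env
    key = explicit_key or env.get("GEMINI_API_KEY")
    if not key:
        raise ValueError("No Gemini API key: pass one explicitly or set GEMINI_API_KEY.")
    return key
```

Passing the environment as a parameter keeps the helper easy to test without modifying the real process environment.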

Usage (Command Line)

Run the tool with the llm-min-generator command (available once the package is installed) or with python -m llm_min_generator.main.

Command Structure:

llm-min-generator [OPTIONS]

Input Options (Choose ONE):

  • --requirements-file PATH or -f PATH: Path to a requirements.txt file.

    llm-min-generator -f sample_requirements.txt
    
  • --input-folder PATH or -d PATH: Path to a folder containing a requirements.txt file.

    llm-min-generator -d /path/to/your/project/
    
  • --packages "PKG1\nPKG2" or -pkg "PKG1\nPKG2": A string containing package names, separated by newlines (\n).

    llm-min-generator --packages "requests\npydantic>=2.0"
    
  • --doc-url URL or -u URL: Specify the documentation URL for a single package directly, bypassing the automatic search. Use this when the search fails or when you want to target a specific version's documentation. This option is combined with one of the input options above, which must then supply exactly one package: pass a single package via --packages, or make sure your --requirements-file/--input-folder input contains only one package.

    llm-min-generator --packages "requests" --doc-url "https://requests.readthedocs.io/en/latest/"
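
All three input forms end up as a list of package names. A minimal sketch of that step is shown below, assuming the names are taken from the start of each requirement line and version specifiers are dropped. parse_package_names is a hypothetical helper written for illustration; how the tool itself parses its input is not documented here.

```python
import re
from typing import List

# Leading run of characters allowed in a distribution name (letters, digits, ., _, -).
_NAME_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*")


def parse_package_names(spec_text: str) -> List[str]:
    """Extract bare package names from newline-separated requirement lines.

    Hypothetical helper: skips blank lines, comments, and pip options, and
    drops version specifiers and extras.
    """
    # Accept real newlines as well as the two-character sequence backslash + n,
    # which POSIX shells pass through unchanged inside double quotes.
    lines = spec_text.replace("\\n", "\n").splitlines()
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue  # blank line, comment, or an option such as "-r other.txt"
        match = _NAME_RE.match(line)
        if match:
            names.append(match.group(0))
    return names
```

The same function works for the contents of a requirements.txt file and for the string passed to --packages.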
    

Common Options:

  • --output-dir PATH or -o PATH: Directory to save the generated documentation. (Default: my_docs)
  • --max-crawl-pages N or -p N: Maximum number of pages to crawl per package. Set to 0 for unlimited. (Default: 200)
  • --max-crawl-depth N or -D N: Maximum depth to crawl from the starting URL. (Default: 2)
  • --chunk-size N or -c N: Chunk size (in characters) for LLM compaction. (Default: 1000000)
  • --gemini-api-key KEY or -k KEY: Your Google Gemini API Key (overrides the .env file).
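
To show how the page and depth limits interact, here is a small sketch of a breadth-first crawl over an in-memory link graph. It makes several assumptions that this README does not confirm: that the crawl is breadth-first, that the starting URL counts as depth 0, and that the starting page counts toward the page limit. crawl_order and the links mapping are hypothetical stand-ins; the real crawling is done by crawl4ai.

```python
from collections import deque
from typing import Dict, List


def crawl_order(
    links: Dict[str, List[str]],
    start_url: str,
    max_pages: int = 200,
    max_depth: int = 2,
) -> List[str]:
    """Return the pages a breadth-first crawl would visit under both limits.

    `links` maps each URL to the URLs it links to (a stand-in for fetching).
    max_pages=0 means no page limit; the start URL is at depth 0.
    """
    visited = [start_url]
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if max_pages and len(visited) >= max_pages:
                return visited  # page budget exhausted
            if target not in seen:
                seen.add(target)
                visited.append(target)
                queue.append((target, depth + 1))
    return visited
```

With a depth of 2, pages linked from the start page and pages linked from those are visited, but nothing further; the page limit can stop the crawl earlier.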

Example:

Generate documentation for packages in sample_requirements.txt, saving to output_docs, crawling up to 100 pages:

llm-min-generator -f sample_requirements.txt -o output_docs -p 100
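
When the command is launched from a script, it can help to assemble the argument list in Python first. The sketch below builds the same invocation using the long option names documented above; build_command is a hypothetical helper, not something the package provides.

```python
from typing import List, Optional


def build_command(
    requirements_file: Optional[str] = None,
    output_dir: Optional[str] = None,
    max_crawl_pages: Optional[int] = None,
    max_crawl_depth: Optional[int] = None,
) -> List[str]:
    """Build an argv list for llm-min-generator from the documented options."""
    command = ["llm-min-generator"]
    if requirements_file is not None:
        command += ["--requirements-file", requirements_file]
    if output_dir is not None:
        command += ["--output-dir", output_dir]
    if max_crawl_pages is not None:
        command += ["--max-crawl-pages", str(max_crawl_pages)]
    if max_crawl_depth is not None:
        command += ["--max-crawl-depth", str(max_crawl_depth)]
    return command


command = build_command("sample_requirements.txt", "output_docs", max_crawl_pages=100)
print(" ".join(command))
# To execute it: subprocess.run(command, check=True)  (requires the tool to be installed)
```

Options left as None are omitted, so the tool's own defaults apply.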

Programmatic Usage (Python)

Beyond the command-line interface, you can use llm-min-generator programmatically in your Python projects via the LLMMinClient.

Initialization

First, import the client:

from llm_min.client import LLMMinClient

To initialize the client, you need to provide your Google Gemini API key. You can do this either by setting the GEMINI_API_KEY environment variable or by passing the key directly to the constructor. The client also requires a pcs-guide.md file, either in the project root directory or at a custom path that you provide.

import os

# Option 1: Using environment variable (Recommended)
# Ensure 'GEMINI_API_KEY' is set in your environment
# export GEMINI_API_KEY='YOUR_API_KEY_HERE'
try:
    # Assumes pcs-guide.md is in the project root
    client = LLMMinClient()
except ValueError as e:
    print(f"Error initializing client (API Key?): {e}")
    # Handle missing API key
except FileNotFoundError as e:
    print(f"Error initializing client (PCS Guide?): {e}")
    # Handle missing pcs-guide.md

# Option 2: Passing API key directly
api_key = os.environ.get("GEMINI_API_KEY", "YOUR_FALLBACK_API_KEY_HERE") # Get from env or use placeholder
custom_guide_path = "/path/to/your/custom/pcs-guide.md" # Optional

try:
    client_direct_key = LLMMinClient(
        api_key=api_key
        # Optionally specify model, chunk size, or PCS guide path:
        # model="gemini-pro",
        # max_tokens_per_chunk=5000,
        # pcs_guide_path=custom_guide_path
    )
except ValueError as e:
    print(f"Error initializing client (API Key?): {e}")
except FileNotFoundError as e:
    print(f"Error initializing client (PCS Guide?): {e}")

Compacting Content

Once initialized, use the compact method to process your text content:

# Assuming 'client' is an initialized LLMMinClient instance from Option 1 above
long_text_content = """
# Your extensive documentation or text content goes here...
# For example, the raw content scraped from a website or a large text file.
# This content is automatically chunked based on the client's configuration
# and then compacted using the LLM according to the PCS guide.
# ... (potentially thousands of lines) ...
"""

subject_of_content = "My Library Documentation" # Optional, but helpful context for the LLM

if 'client' in locals(): # Check if client was initialized successfully
    try:
        compacted_pcs_output = client.compact(
            content=long_text_content,
            subject=subject_of_content
        )
        print("Compacted Output (PCS Format):")
        print(compacted_pcs_output)

        # You can save this output to a file, e.g., llm-min.txt
        # output_filename = f"{subject_of_content.lower().replace(' ', '_')}-llm-min.txt"
        # with open(output_filename, "w", encoding="utf-8") as f:
        #     f.write(compacted_pcs_output)
        # print(f"Saved compacted output to {output_filename}")

    except Exception as e:
        print(f"An error occurred during compaction: {e}")
else:
    print("LLMMinClient was not initialized successfully.")

This allows you to integrate the documentation compaction process directly into your Python workflows.
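
The chunking step mentioned above can be pictured with a short sketch. It assumes character-based chunks, matching the --chunk-size description; how LLMMinClient actually splits content (for example by tokens, as the max_tokens_per_chunk parameter suggests) is not specified in this README. chunk_text is a hypothetical helper.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1_000_000) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Prefers to break at the last newline inside each window so that lines
    stay intact; falls back to a hard cut when a window has no newline.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            newline = text.rfind("\n", start, end)
            if newline > start:
                end = newline + 1  # keep the newline with the preceding chunk
        chunks.append(text[start:end])
        start = end
    return chunks
```

Joining the chunks reproduces the original text, so each chunk can be compacted separately without losing content.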

Output Structure

The tool generates the following structure in the specified output directory:

output_dir/
├── package_name_1/
│   ├── llm-full.txt  # Raw crawled content
│   └── llm-min.txt   # Compacted PCS content
├── package_name_2/
│   ├── llm-full.txt
│   └── llm-min.txt
└── ...
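
If you produce the compacted text yourself through LLMMinClient, you may want to store it in the same layout. A minimal sketch, assuming you already hold both texts as strings; save_package_docs is a hypothetical helper, and only the directory and file names come from the structure above.

```python
from pathlib import Path


def save_package_docs(output_dir: str, package_name: str, full_text: str, min_text: str) -> Path:
    """Write llm-full.txt and llm-min.txt under output_dir/package_name/."""
    package_dir = Path(output_dir) / package_name
    package_dir.mkdir(parents=True, exist_ok=True)
    (package_dir / "llm-full.txt").write_text(full_text, encoding="utf-8")
    (package_dir / "llm-min.txt").write_text(min_text, encoding="utf-8")
    return package_dir
```

The function returns the package directory so callers can log or inspect where the files were written.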

Contributing

Contributions are welcome! Please refer to the CONTRIBUTING.md file (if available) for guidelines.

Key areas for contribution:

  • Improving documentation discovery logic.
  • Enhancing the compaction prompts/strategy (PCS guide).
  • Adding support for more LLM providers.
  • Improving error handling and reporting.
  • Writing tests.

License

This project is licensed under the MIT License. See the LICENSE file, if present, for details.
