Generates LLM context by scraping and summarizing documentation for Python libraries listed in a requirements.txt file.

Project description

LLM Minimal Documentation Generator

Overview

LLM Minimal Documentation Generator is a tool designed to automatically scrape and process technical documentation for Python libraries. It generates two key outputs for each library:

- llm-full.txt: The complete, raw text content crawled from the documentation website.
- llm-min.txt: A compact, structured summary of the documentation, optimized for Large Language Models (LLMs). Google Gemini generates it by following the PCS (Progressive Compaction Strategy) guide.

This tool facilitates the creation of focused context files, enabling LLMs to provide more accurate and relevant information about specific libraries.

Features

- Automatic Documentation Discovery: Finds official documentation URLs for specified Python packages.
- Web Crawling: Efficiently scrapes documentation websites (powered by crawl4ai).
- LLM-Powered Compaction: Uses Google Gemini to condense crawled documentation into a structured, minimal format (PCS).
- Flexible Input: Accepts package lists from:
  - requirements.txt files.
  - Folders containing a requirements.txt file.
  - Direct string input.
- Programmatic Usage: Provides a Python client (LLMMinClient) for integration into other workflows.
- Configurable Crawling: Control maximum pages and depth for the web crawler.
- Organized Output: Saves results in a structured directory format (output_dir/package_name/).

Installation

Clone the repository:

```
git clone <repository_url> # Replace with actual URL
cd llm-min-generator       # Or your project directory name
```

Set up the environment and install dependencies using uv:

```
# Ensure you have uv installed (https://github.com/astral-sh/uv)
python -m venv .venv
source .venv/bin/activate # or .venv\Scripts\activate on Windows
uv pip install -r requirements.txt # Or use the appropriate requirements file
uv pip install -e . # Install the package in editable mode
```

Install Playwright Browsers: The documentation crawler uses Playwright. After installing the package, you need to download the necessary browser binaries:
```
playwright install
```
Note: Depending on your environment (e.g., containers), you might need to install system dependencies for Playwright. See the Playwright documentation for details.

Install and set up pre-commit hooks: Pre-commit hooks help maintain code quality by running checks before you commit.

```
pip install pre-commit
pre-commit install
```

After installation, the hooks will run automatically on git commit. You can also run them manually on all files with pre-commit run --all-files.

Configure API Key:
- Copy the .env.example file to .env:
```
cp .env.example .env
```
- Edit the .env file and add your Google Gemini API key:
```
GEMINI_API_KEY=YOUR_API_KEY_HERE
```
- Alternatively, you can provide the key directly via the --gemini-api-key command-line option or when initializing LLMMinClient.
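
The key lookup described above can be pictured with a small standard-library sketch. This is not the tool's actual implementation: the helper names `read_env_file` and `resolve_gemini_api_key` are hypothetical, and the precedence between a shell environment variable and the .env file is an assumption. The README only states that the command-line option overrides the .env file.

```python
import os
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_gemini_api_key(cli_key=None, env_file=".env", environ=None):
    """Return the key from the CLI value, then the environment, then .env.

    Hypothetical helper: the env-before-.env ordering is an assumption.
    """
    environ = os.environ if environ is None else environ
    if cli_key:
        return cli_key
    if environ.get("GEMINI_API_KEY"):
        return environ["GEMINI_API_KEY"]
    key = read_env_file(env_file).get("GEMINI_API_KEY")
    if not key:
        raise ValueError("GEMINI_API_KEY is not set")
    return key
```

Raising ValueError for a missing key mirrors the exception the client examples in this README catch, but whether the real client raises it from the same place is not confirmed here.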

Usage (Command Line)

The tool is run via the llm-min-generator command (if installed correctly) or python -m llm_min_generator.main.

Command Structure:

```
llm-min-generator [OPTIONS]
```

Input Options (Choose ONE):

--requirements-file PATH or -f PATH: Path to a requirements.txt file.
```
llm-min-generator -f sample_requirements.txt
```
--input-folder PATH or -d PATH: Path to a folder containing a requirements.txt file.
```
llm-min-generator -d /path/to/your/project/
```
--packages "PKG1\nPKG2" or -pkg "PKG1\nPKG2": A string containing package names, separated by newlines (\n).
```
llm-min-generator --packages "requests\npydantic>=2.0"
```
--doc-url URL or -u URL: Directly specify the documentation URL for a single package, bypassing the automatic search. This is useful if the search fails or if you want to target a specific version's documentation. With this option, provide exactly one package via --packages, or make sure your --requirements-file or --input-folder input contains only one package.
```
llm-min-generator --packages "requests" --doc-url "https://requests.readthedocs.io/en/latest/"
```
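
Because the --packages string uses \n as a separator, a shell passes the two characters backslash and n rather than a real newline. The sketch below shows one way such a string could be turned into package names. The function `parse_package_string` is hypothetical, and stripping version specifiers before the documentation search is an assumption about what the tool needs, not a description of its code.

```python
import re


def parse_package_string(packages):
    """Split a --packages style string into bare package names."""
    # Accept real newlines as well as the literal two-character "\n"
    # sequence that arrives from a double-quoted shell argument.
    normalized = packages.replace("\\n", "\n")
    names = []
    for line in normalized.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        # Keep only the leading name; specifiers such as ">=2.0" are dropped.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


print(parse_package_string(r"requests\npydantic>=2.0"))  # ['requests', 'pydantic']
```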

Common Options:

--output-dir PATH or -o PATH: Directory to save the generated documentation. (Default: my_docs)
--max-crawl-pages N or -p N: Maximum number of pages to crawl per package. Set to 0 for unlimited. (Default: 200)
--max-crawl-depth N or -D N: Maximum depth to crawl from the starting URL. (Default: 2)
--chunk-size N or -c N: Chunk size (in characters) for LLM compaction. (Default: 1000000)
--gemini-api-key KEY or -k KEY: Your Google Gemini API Key (overrides the .env file).
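
To illustrate how a page budget and a depth limit interact, here is a toy breadth-first traversal over an in-memory link graph. It is only a sketch: the real crawling is done by crawl4ai, the function `bounded_crawl` is hypothetical, and counting the starting URL as depth 0 is an assumption.

```python
from collections import deque


def bounded_crawl(start, links, max_pages=200, max_depth=2):
    """Breadth-first traversal with a page budget and a depth limit.

    `links` maps a URL to the URLs it links to. A max_pages of 0 means
    unlimited, matching the documented meaning of --max-crawl-pages.
    """
    visited = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        if max_pages and len(visited) >= max_pages:
            break  # page budget exhausted
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


site = {
    "/": ["/guide", "/api"],
    "/guide": ["/guide/install"],
    "/api": ["/api/client"],
    "/guide/install": ["/guide/install/windows"],
}
print(bounded_crawl("/", site, max_depth=1))               # ['/', '/guide', '/api']
print(bounded_crawl("/", site, max_pages=4, max_depth=2))  # ['/', '/guide', '/api', '/guide/install']
```

In this model, whichever limit is reached first ends the crawl, so a low page budget can cut off a crawl well before the depth limit matters.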

Example:

Generate documentation for packages in sample_requirements.txt, saving to output_docs, crawling up to 100 pages:

```
llm-min-generator -f sample_requirements.txt -o output_docs -p 100
```
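
If you want to drive the same command from a script, one option is to assemble the argument list in Python. The helper `build_command` below is hypothetical and uses only the flags documented above; it prints the command instead of running it, so it works even where the tool is not installed.

```python
import shlex


def build_command(requirements_file, output_dir="my_docs", max_pages=200, max_depth=2):
    """Assemble an llm-min-generator argument list from the documented flags."""
    return [
        "llm-min-generator",
        "-f", str(requirements_file),
        "-o", str(output_dir),
        "-p", str(max_pages),
        "-D", str(max_depth),
    ]


command = build_command("sample_requirements.txt", output_dir="output_docs", max_pages=100)
print(shlex.join(command))
# llm-min-generator -f sample_requirements.txt -o output_docs -p 100 -D 2
```

Passing the resulting list to `subprocess.run(command, check=True)` would execute it without any shell quoting concerns.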

Programmatic Usage (Python)

Beyond the command-line interface, you can use llm-min-generator programmatically in your Python projects via the LLMMinClient.

Initialization

First, import the client:

```
from llm_min.client import LLMMinClient
```

To initialize the client, provide your Google Gemini API key, either by setting the GEMINI_API_KEY environment variable or by passing the key directly to the constructor. The client also requires the pcs-guide.md file, which it looks for in the project root directory unless you supply a custom path (pcs_guide_path).

```
import os

from llm_min.client import LLMMinClient

# Option 1: Using environment variable (Recommended)
# Ensure 'GEMINI_API_KEY' is set in your environment
# export GEMINI_API_KEY='YOUR_API_KEY_HERE'
try:
    # Assumes pcs-guide.md is in the project root
    client = LLMMinClient()
except ValueError as e:
    print(f"Error initializing client (API Key?): {e}")
    # Handle missing API key
except FileNotFoundError as e:
    print(f"Error initializing client (PCS Guide?): {e}")
    # Handle missing pcs-guide.md

# Option 2: Passing API key directly
api_key = os.environ.get("GEMINI_API_KEY", "YOUR_FALLBACK_API_KEY_HERE") # Get from env or use placeholder
custom_guide_path = "/path/to/your/custom/pcs-guide.md" # Optional

try:
    client_direct_key = LLMMinClient(
        api_key=api_key,
        # Optionally specify model, chunk size, or PCS guide path:
        # model="gemini-pro",
        # max_tokens_per_chunk=5000,
        # pcs_guide_path=custom_guide_path,
    )
except ValueError as e:
    print(f"Error initializing client (API Key?): {e}")
except FileNotFoundError as e:
    print(f"Error initializing client (PCS Guide?): {e}")
```

Compacting Content

Once initialized, use the compact method to process your text content:

```
# Assuming 'client' is an initialized LLMMinClient instance from Option 1 above
long_text_content = """
# Your extensive documentation or text content goes here...
# For example, the raw content scraped from a website or a large text file.
# This content will be automatically chunked based on the client's configuration
# and then compacted using the LLM according to the PCS guide.
# ... (potentially thousands of lines) ...
"""

subject_of_content = "My Library Documentation" # Optional, but helpful context for the LLM

if 'client' in locals(): # Check if client was initialized successfully
    try:
        compacted_pcs_output = client.compact(
            content=long_text_content,
            subject=subject_of_content
        )
        print("Compacted Output (PCS Format):")
        print(compacted_pcs_output)

        # You can save this output to a file, e.g., llm-min.txt
        # output_filename = f"{subject_of_content.lower().replace(' ', '_')}-llm-min.txt"
        # with open(output_filename, "w", encoding="utf-8") as f:
        #     f.write(compacted_pcs_output)
        # print(f"Saved compacted output to {output_filename}")

    except Exception as e:
        print(f"An error occurred during compaction: {e}")
else:
    print("LLMMinClient was not initialized successfully.")
```

This allows you to integrate the documentation compaction process directly into your Python workflows.
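
The automatic chunking mentioned above can be approximated with a character-based splitter. This is a minimal sketch under stated assumptions, not the client's actual algorithm: `chunk_text` is a hypothetical helper, it counts characters as the --chunk-size option describes (the client's own parameter is named max_tokens_per_chunk), and preferring newline boundaries is a design choice made here for illustration.

```python
def chunk_text(text, chunk_size=1_000_000):
    """Split text into pieces of at most chunk_size characters.

    Where possible, a chunk ends just after a newline so that lines of
    documentation are not cut in half.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Look for the last newline inside the window and cut after it.
            newline = text.rfind("\n", start, end)
            if newline > start:
                end = newline + 1
        chunks.append(text[start:end])
        start = end
    return chunks


print(chunk_text("aaa\nbbb\nccc", chunk_size=5))  # ['aaa\n', 'bbb\n', 'ccc']
```

Joining the chunks reproduces the original text exactly, which makes it easy to check that no content is lost before each piece is sent for compaction.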

Output Structure

The tool generates the following structure in the specified output directory:

```
output_dir/
├── package_name_1/
│   ├── llm-full.txt  # Raw crawled content
│   └── llm-min.txt   # Compacted PCS content
├── package_name_2/
│   ├── llm-full.txt
│   └── llm-min.txt
└── ...
```
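
Given that layout, downstream code can locate the generated files with pathlib. The sketch below assumes only the directory structure shown above; `collect_outputs` is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def collect_outputs(output_dir):
    """Map each package directory to its llm-min.txt and llm-full.txt paths."""
    results = {}
    for package_dir in sorted(Path(output_dir).iterdir()):
        if not package_dir.is_dir():
            continue
        min_file = package_dir / "llm-min.txt"
        full_file = package_dir / "llm-full.txt"
        # Only report packages for which both outputs were written.
        if min_file.is_file() and full_file.is_file():
            results[package_dir.name] = {"min": min_file, "full": full_file}
    return results
```

For example, `collect_outputs("my_docs")["requests"]["min"].read_text(encoding="utf-8")` would load the compacted summary for requests, provided that package was processed into the default output directory.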

Contributing

Contributions are welcome! Please refer to the CONTRIBUTING.md file (if available) for guidelines.

Key areas for contribution:

- Improving documentation discovery logic.
- Enhancing the compaction prompts/strategy (PCS guide).
- Adding support for more LLM providers.
- Improving error handling and reporting.
- Writing tests.

License

This project is licensed under the MIT License. See the LICENSE file for details; if the file is not available, assume MIT.

Project details

Release history

- 0.3.1: Jun 18, 2025
- 0.3.0: Jun 5, 2025
- 0.2.4: Jun 1, 2025
- 0.2.3: May 16, 2025
- 0.2.1: May 16, 2025
- 0.2.0: May 15, 2025
- 0.1.5: May 11, 2025
- 0.1.4: May 11, 2025
- 0.1.3: Apr 30, 2025
- 0.1.2: Apr 30, 2025
- 0.1.1: Apr 29, 2025 (this version)
- 0.1.0: Apr 29, 2025

Download files

Source Distribution

llm_min-0.1.1.tar.gz (41.7 kB), uploaded Apr 29, 2025
SHA256: `52ce661825d2516a2e0d9773cfa0c30e58e1ba7d817b5ce5464ab8a7544dca08`

Built Distribution

llm_min-0.1.1-py3-none-any.whl (29.9 kB, Python 3), uploaded Apr 29, 2025
SHA256: `a42ffc9cabcc9c0febc062e3459d692d35036b3d8a2182dca98d4ad186d330c6`

Provenance

Both files were uploaded using Trusted Publishing (twine/6.1.0 CPython/3.12.9) by the publish.yml workflow on marv1nnnnn/llm-min.txt (owner: https://github.com/marv1nnnnn, public), triggered by a push of tag refs/tags/v0.1.1 at commit d0cc3aeb3df6eca1075c71f6a9ffadb75253a53a. Sigstore transparency entries: 204263638 (source distribution) and 204263645 (wheel), both integrated Apr 29, 2025.