Generates LLM context by scraping and summarizing documentation for Python libraries listed in a requirements.txt file.
LLM Minimal Documentation Generator
Overview
LLM Minimal Documentation Generator is a tool designed to automatically scrape and process technical documentation for Python libraries. It generates two key outputs for each library:
- llm-full.txt: The complete, raw text content crawled from the documentation website.
- llm-min.txt: A compact, structured summary of the documentation, optimized for consumption by Large Language Models (LLMs), generated using Google Gemini according to the PCS (Progressive Compaction Strategy) guide.
This tool facilitates the creation of focused context files, enabling LLMs to provide more accurate and relevant information about specific libraries.
Features
- Automatic Documentation Discovery: Finds official documentation URLs for specified Python packages.
- Web Crawling: Efficiently scrapes documentation websites (powered by crawl4ai).
- LLM-Powered Compaction: Uses Google Gemini to condense crawled documentation into a structured, minimal format (PCS).
- Flexible Input: Accepts package lists from:
  - requirements.txt files.
  - Folders containing a requirements.txt file.
  - Direct string input.
- Programmatic Usage: Provides a Python client (LLMMinClient) for integration into other workflows.
- Configurable Crawling: Control maximum pages and depth for the web crawler.
- Organized Output: Saves results in a structured directory format (output_dir/package_name/).
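To illustrate the "Flexible Input" feature, here is a minimal sketch of how requirements.txt-style text could be reduced to bare package names before documentation discovery. The helper `parse_package_names` is hypothetical and is not part of the tool's API; the tool's own parsing may differ.

```python
import re


def parse_package_names(requirements_text: str) -> list[str]:
    """Extract bare package names from requirements.txt-style text (hypothetical helper)."""
    names = []
    for raw_line in requirements_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        # Skip blank lines and pip options such as "-r other.txt" or "-e .".
        if not line or line.startswith("-"):
            continue
        # The name is the leading run of characters before any extras,
        # version specifier, or environment marker.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0).lower())
    return names


print(parse_package_names("requests\npydantic>=2.0\n# a comment\n"))
```

The same function works for direct string input, since the --packages option also uses newline-separated entries.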
Installation
- Clone the repository:
git clone <repository_url>  # Replace with actual URL
cd llm-min-generator  # Or your project directory name
- Set up the environment and install dependencies using uv:
# Ensure you have uv installed (https://github.com/astral-sh/uv)
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows
uv pip install -r requirements.txt  # Or use the appropriate requirements file
uv pip install -e .  # Install the package in editable mode
- Install Playwright browsers: The documentation crawler uses Playwright. After installing the package, download the necessary browser binaries:
playwright install
Note: Depending on your environment (e.g., containers), you might need to install system dependencies for Playwright. See the Playwright documentation for details.
- Install and set up pre-commit hooks: Pre-commit hooks help maintain code quality by running checks before you commit.
pip install pre-commit
pre-commit install
After installation, the hooks will run automatically on git commit. You can also run them manually on all files with pre-commit run --all-files.
- Configure API Key:
  - Copy the .env.example file to .env:
cp .env.example .env
  - Edit the .env file and add your Google Gemini API key:
GEMINI_API_KEY=YOUR_API_KEY_HERE
  - Alternatively, you can provide the key directly via the --gemini-api-key command-line option or when initializing LLMMinClient.
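The key lookup described in the last step can be sketched as follows. This is an illustration only: `read_env_file` and `resolve_gemini_api_key` are hypothetical helpers, and the precedence between a real environment variable and the .env file is an assumption (the source only states that an explicitly passed key overrides the .env file).

```python
import os
from pathlib import Path
from typing import Optional


def read_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines from a .env file; return {} if the file is missing."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Ignore blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_gemini_api_key(explicit_key: Optional[str] = None, env_file: str = ".env") -> str:
    """Assumed precedence: explicit key, then environment variable, then .env file."""
    key = (
        explicit_key
        or os.environ.get("GEMINI_API_KEY")
        or read_env_file(env_file).get("GEMINI_API_KEY")
    )
    if not key:
        raise ValueError("GEMINI_API_KEY is not set")
    return key
```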
Usage (Command Line)
The tool is run via the llm-min-generator command (if installed correctly) or python -m llm_min_generator.main.
Command Structure:
llm-min-generator [OPTIONS]
Input Options (Choose ONE):
- --requirements-file PATH or -f PATH: Path to a requirements.txt file.
llm-min-generator -f sample_requirements.txt
- --input-folder PATH or -d PATH: Path to a folder containing a requirements.txt file.
llm-min-generator -d /path/to/your/project/
- --packages "PKG1\nPKG2" or -pkg "PKG1\nPKG2": A string containing package names, separated by newlines (\n).
llm-min-generator --packages "requests\npydantic>=2.0"
- --doc-url URL or -u URL: Directly specify the documentation URL for a single package, bypassing the automatic search. This is useful if the search fails or if you want to target a specific version's documentation. When using this option, provide only one package via --packages, or ensure your --requirements-file/--input-folder contains only one package.
llm-min-generator --packages "requests" --doc-url "https://requests.readthedocs.io/en/latest/"
Common Options:
- --output-dir PATH or -o PATH: Directory to save the generated documentation. (Default: my_docs)
- --max-crawl-pages N or -p N: Maximum number of pages to crawl per package. Set to 0 for unlimited. (Default: 200)
- --max-crawl-depth N or -D N: Maximum depth to crawl from the starting URL. (Default: 2)
- --chunk-size N or -c N: Chunk size (in characters) for LLM compaction. (Default: 1000000)
- --gemini-api-key KEY or -k KEY: Your Google Gemini API Key (overrides the .env file).
Example:
Generate documentation for packages in sample_requirements.txt, saving to output_docs, crawling up to 100 pages:
llm-min-generator -f sample_requirements.txt -o output_docs -p 100
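If you want to drive the command from a script, the example above can be assembled as an argument list. The sketch below uses only the flags and defaults listed in this section; `build_command` is a hypothetical helper, and the command is printed rather than executed.

```python
import shlex


def build_command(requirements_file, output_dir="my_docs", max_pages=200, max_depth=2):
    """Assemble an llm-min-generator invocation as an argv list (hypothetical helper)."""
    return [
        "llm-min-generator",
        "--requirements-file", str(requirements_file),
        "--output-dir", str(output_dir),
        "--max-crawl-pages", str(max_pages),
        "--max-crawl-depth", str(max_depth),
    ]


cmd = build_command("sample_requirements.txt", output_dir="output_docs", max_pages=100)
print(shlex.join(cmd))
# To run it for real, pass the list to subprocess.run(cmd, check=True).
```

Passing a list to subprocess.run avoids shell quoting issues with paths that contain spaces.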
Programmatic Usage (Python)
Beyond the command-line interface, you can use llm-min-generator programmatically in your Python projects via the LLMMinClient.
Initialization
First, import the client:
from llm_min.client import LLMMinClient
To initialize the client, you need to provide your Google Gemini API key. You can do this either by setting the GEMINI_API_KEY environment variable or by passing the key directly to the constructor. The client also requires the pcs-guide.md file to be present in the project root directory (or provide a custom path).
import os

# Option 1: Using environment variable (Recommended)
# Ensure 'GEMINI_API_KEY' is set in your environment
# export GEMINI_API_KEY='YOUR_API_KEY_HERE'
try:
    # Assumes pcs-guide.md is in the project root
    client = LLMMinClient()
except ValueError as e:
    print(f"Error initializing client (API Key?): {e}")
    # Handle missing API key
except FileNotFoundError as e:
    print(f"Error initializing client (PCS Guide?): {e}")
    # Handle missing pcs-guide.md

# Option 2: Passing API key directly
api_key = os.environ.get("GEMINI_API_KEY", "YOUR_FALLBACK_API_KEY_HERE")  # Get from env or use placeholder
custom_guide_path = "/path/to/your/custom/pcs-guide.md"  # Optional
try:
    client_direct_key = LLMMinClient(
        api_key=api_key
        # Optionally specify model, chunk size, or PCS guide path:
        # model="gemini-pro",
        # max_tokens_per_chunk=5000,
        # pcs_guide_path=custom_guide_path
    )
except ValueError as e:
    print(f"Error initializing client (API Key?): {e}")
except FileNotFoundError as e:
    print(f"Error initializing client (PCS Guide?): {e}")
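Both failure modes above (missing API key, missing pcs-guide.md) can be checked before constructing the client. The following is a minimal sketch under that assumption; `preflight_problems` is a hypothetical helper and does not call into the library.

```python
import os
from pathlib import Path


def preflight_problems(api_key=None, pcs_guide_path="pcs-guide.md"):
    """Return a list of human-readable problems; an empty list means both checks passed."""
    problems = []
    # The client needs a key from the constructor argument or the environment.
    if not (api_key or os.environ.get("GEMINI_API_KEY")):
        problems.append("Gemini API key missing: set GEMINI_API_KEY or pass api_key")
    # The client also needs the PCS guide file to exist.
    if not Path(pcs_guide_path).is_file():
        problems.append(f"PCS guide not found at {pcs_guide_path}")
    return problems
```

Reporting every problem at once can be friendlier than stopping at the first exception.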
Compacting Content
Once initialized, use the compact method to process your text content:
# Assuming 'client' is an initialized LLMMinClient instance from Option 1 above
long_text_content = """
# Your extensive documentation or text content goes here...
# For example, the raw content scraped from a website or a large text file.
# This content will be automatically chunked based on the client's configuration
# and then compacted using the LLM according to the PCS guide.
# ... (potentially thousands of lines) ...
# It will be automatically chunked and compacted.
"""
subject_of_content = "My Library Documentation"  # Optional, but helpful context for the LLM

if 'client' in locals():  # Check if client was initialized successfully
    try:
        compacted_pcs_output = client.compact(
            content=long_text_content,
            subject=subject_of_content
        )
        print("Compacted Output (PCS Format):")
        print(compacted_pcs_output)
        # You can save this output to a file, e.g., llm-min.txt
        # output_filename = f"{subject_of_content.lower().replace(' ', '_')}-llm-min.txt"
        # with open(output_filename, "w", encoding="utf-8") as f:
        #     f.write(compacted_pcs_output)
        # print(f"Saved compacted output to {output_filename}")
    except Exception as e:
        print(f"An error occurred during compaction: {e}")
else:
    print("LLMMinClient was not initialized successfully.")
This allows you to integrate the documentation compaction process directly into your Python workflows.
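To make the "automatically chunked" step concrete, here is a minimal sketch of character-based chunking followed by per-chunk compaction. It is an illustration, not the library's implementation: `chunk_text` and `compact_in_chunks` are hypothetical names, the fixed-width split is an assumption, and `compact_chunk` stands in for the LLM call.

```python
def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def compact_in_chunks(text: str, chunk_size: int, compact_chunk) -> str:
    """Apply compact_chunk (a stand-in for the LLM call) to each chunk and join the results."""
    return "\n".join(compact_chunk(chunk) for chunk in chunk_text(text, chunk_size))


# Stand-in "compaction" that just upper-cases each chunk.
print(compact_in_chunks("abcdefg", 3, str.upper))
```

A real splitter would likely prefer paragraph or heading boundaries so that a chunk does not end mid-sentence.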
Output Structure
The tool generates the following structure in the specified output directory:
output_dir/
├── package_name_1/
│ ├── llm-full.txt # Raw crawled content
│ └── llm-min.txt # Compacted PCS content
├── package_name_2/
│ ├── llm-full.txt
│ └── llm-min.txt
└── ...
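Given this layout, the compacted files can be gathered for downstream use with a short script. The sketch below assumes only the directory structure shown above; `collect_min_docs` is a hypothetical helper.

```python
from pathlib import Path


def collect_min_docs(output_dir):
    """Map each package directory name to the text of its llm-min.txt file."""
    docs = {}
    for package_dir in sorted(Path(output_dir).iterdir()):
        min_file = package_dir / "llm-min.txt"
        # Skip stray files and packages whose compaction did not produce output.
        if package_dir.is_dir() and min_file.is_file():
            docs[package_dir.name] = min_file.read_text(encoding="utf-8")
    return docs
```

For example, `collect_min_docs("my_docs")` would return one entry per package directory that contains an llm-min.txt file.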
Contributing
Contributions are welcome! Please refer to the CONTRIBUTING.md file (if available) for guidelines.
Key areas for contribution:
- Improving documentation discovery logic.
- Enhancing the compaction prompts/strategy (PCS guide).
- Adding support for more LLM providers.
- Improving error handling and reporting.
- Writing tests.
License
This project is licensed under the MIT License - see the LICENSE file for details (if available, otherwise assume MIT).