Corte
A tool to cut PDFs into pieces and process them with LLMs in batches.
Features
- Split PDFs into manageable chunks
- Process chunks with various LLM providers (OpenAI, Anthropic, LangChain models)
- Support for local models via Ollama and cloud providers via LangChain
- Batch processing with configurable concurrency
- Rate limiting for API calls
- Command-line interface
- Combine results from multiple chunks
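Conceptually, splitting a PDF into chunks comes down to computing page ranges. A minimal sketch of that idea in plain Python (illustrative only, not corte's actual implementation):

```python
from typing import List, Tuple


def chunk_ranges(total_pages: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return inclusive 1-based (start, end) page ranges of at most chunk_size pages."""
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


# A 12-page PDF with chunk_size=5 yields three chunks:
# chunk_ranges(12, 5) -> [(1, 5), (6, 10), (11, 12)]
```

Each range can then be extracted into its own file with a PDF library such as PyPDF2 and sent to the LLM independently.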
Installation
# Basic installation
pip install corte
# With LangChain support for additional providers
# (quotes keep shells like zsh from interpreting the brackets)
pip install "corte[langchain]"
# Full installation with all optional dependencies
pip install "corte[all]"
Usage
Command Line Interface
Split a PDF into chunks
corte split document.pdf --chunk-size 5 --output-dir ./chunks
Process a PDF with an LLM
# Using OpenAI directly
corte process document.pdf --provider openai --prompt "Summarize this content" --output results.json
# Using Anthropic via LangChain
corte process document.pdf --provider langchain_anthropic --model claude-3-opus-20240229 --prompt "Summarize this content" --output results.json
# Using local Ollama model
corte process document.pdf --provider ollama --model llama3 --prompt "Summarize this content" --output results.json
# Using Google Gemini via LangChain
corte process document.pdf --provider langchain_google --model gemini-1.5-pro --prompt "Summarize this content" --output results.json
Get PDF information
corte info document.pdf
List available providers
corte providers
Check LangChain integration status
corte langchain
Python API
from corte import PDFProcessor, LLMClientFactory, BatchProcessor
# Create processors
pdf_processor = PDFProcessor(chunk_size=5)
# Use direct OpenAI integration
llm_client = LLMClientFactory.create_client("openai")
# Or use LangChain providers
llm_client = LLMClientFactory.create_client("langchain_anthropic", model="claude-3-opus-20240229")
# Or use local Ollama model
llm_client = LLMClientFactory.create_client("ollama", model="llama3")
batch_processor = BatchProcessor(llm_client, pdf_processor)
# Process PDF
results = batch_processor.process_pdf_batch(
    pdf_path="document.pdf",
    prompt="Summarize this content"
)
# Combine results
combined_text = batch_processor.combine_results(results)
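Batch processing with bounded concurrency, as described above, is commonly built on a semaphore that caps in-flight requests. A hedged sketch of the pattern (the names here, such as process_chunks and fake_llm, are illustrative stand-ins, not corte's API):

```python
import asyncio


async def process_chunks(chunks, worker, max_concurrency=3):
    """Run worker(chunk) over all chunks, at most max_concurrency at a time."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(chunk):
        async with sem:  # blocks while max_concurrency workers are active
            return await worker(chunk)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(guarded(c) for c in chunks))


# Stand-in worker instead of a real LLM call:
async def fake_llm(chunk):
    await asyncio.sleep(0.01)
    return f"summary of {chunk}"


results = asyncio.run(process_chunks(["p1-5", "p6-10"], fake_llm, max_concurrency=2))
# results == ["summary of p1-5", "summary of p6-10"]
```

The same semaphore also acts as a crude rate limiter, since no more than max_concurrency API calls are ever outstanding at once.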
Configuration
Set your API keys as environment variables:
# For direct integrations
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
# For LangChain integrations (same keys work)
export GOOGLE_API_KEY="your-google-key" # For Google Gemini
# Ollama runs locally, no API key needed
Or use a .env file in your project directory.
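If you prefer not to add a dependency for this, a .env file of KEY=value lines can be loaded with a few lines of Python. A simplified sketch (libraries such as python-dotenv handle quoting, multiline values, and other edge cases):

```python
import os


def load_env(path):
    """Load KEY=value lines from a .env file into os.environ.

    Skips blank lines and # comments and strips optional double quotes.
    """
    loaded = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            loaded[key.strip()] = value.strip().strip('"')
    os.environ.update(loaded)
    return loaded
```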
Requirements
- Python 3.8+
- PyPDF2 for PDF processing
- OpenAI Python client
- Anthropic Python client
- Click for CLI interface
- LangChain (optional, for additional provider support)
LangChain Providers
When you install corte[langchain], you get access to:
- langchain_openai: OpenAI models via LangChain (alternative to direct integration)
- langchain_anthropic: Anthropic Claude models via LangChain
- langchain_google: Google Gemini models
- ollama: Local models via Ollama (requires Ollama to be running)
License
MIT License
Project details
Download files
Source Distribution
corte-0.1.0.tar.gz (12.1 kB)
Built Distribution
corte-0.1.0-py3-none-any.whl (11.7 kB)
File details
Details for the file corte-0.1.0.tar.gz.
File metadata
- Download URL: corte-0.1.0.tar.gz
- Size: 12.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2a32c9e27a181b922889263887b7826ededc3d4a4aa091e9b9475eea629ba005 |
| MD5 | 02c23b642f5291af11cc23bbf9d9effd |
| BLAKE2b-256 | eb829d4119ec7086ca459114b969093bd823ac240c3b5bd016ecc056d7628a96 |
File details
Details for the file corte-0.1.0-py3-none-any.whl.
File metadata
- Download URL: corte-0.1.0-py3-none-any.whl
- Size: 11.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 522ac26b950c84fb16db0ee9986a5fd76cf2af2c5e9bdd9e2aeebc26c593edad |
| MD5 | 5dcb86d73126e972c2fc8481a65666c3 |
| BLAKE2b-256 | b063f5f7b9fff9451d93ac516282ae00b3f8e62ed70e71e0178320e89259035e |