
Corte

A tool to cut PDFs into pieces and process them with LLMs in batches.

Features

  • Split PDFs into manageable chunks
  • Process chunks with various LLM providers (OpenAI, Anthropic, LangChain models)
  • Support for local models via Ollama and cloud providers via LangChain
  • Batch processing with configurable concurrency
  • Rate limiting for API calls
  • Command-line interface
  • Combine results from multiple chunks

Installation

# Basic installation
pip install corte

# With LangChain support for additional providers
# (extras are quoted so shells like zsh don't expand the brackets)
pip install "corte[langchain]"

# Full installation with all optional dependencies
pip install "corte[all]"

Usage

Command Line Interface

Split a PDF into chunks

corte split document.pdf --chunk-size 5 --output-dir ./chunks
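The --chunk-size flag controls how many pages go into each piece. As a sketch of the page arithmetic only (chunk_ranges is a hypothetical helper for illustration, not part of corte's API):

```python
def chunk_ranges(total_pages, chunk_size):
    """Return (start, end) page ranges, end-exclusive, covering total_pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]

# A 12-page document with --chunk-size 5 yields two full chunks and one remainder:
print(chunk_ranges(12, 5))  # [(0, 5), (5, 10), (10, 12)]
```

The last chunk simply holds whatever pages remain, so no page is dropped when the page count is not a multiple of the chunk size.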

Process a PDF with an LLM

# Using OpenAI directly
corte process document.pdf --provider openai --prompt "Summarize this content" --output results.json

# Using Anthropic via LangChain
corte process document.pdf --provider langchain_anthropic --model claude-3-opus-20240229 --prompt "Summarize this content" --output results.json

# Using local Ollama model
corte process document.pdf --provider ollama --model llama3 --prompt "Summarize this content" --output results.json

# Using Google Gemini via LangChain
corte process document.pdf --provider langchain_google --model gemini-1.5-pro --prompt "Summarize this content" --output results.json
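The schema of results.json is not documented above, so the reader below is a sketch that assumes a JSON array with one object per chunk carrying a "response" field; that field name is an assumption, not corte's confirmed format:

```python
import json

def load_chunk_responses(path):
    """Load per-chunk responses from a corte results file.

    Assumes a JSON array with one object per chunk, each carrying a
    "response" field (an assumption); falls back to the raw entry otherwise.
    """
    with open(path) as f:
        results = json.load(f)
    return [c.get("response", c) if isinstance(c, dict) else c for c in results]
```

Check the actual file produced by your corte version and adjust the field name if it differs.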

Get PDF information

corte info document.pdf

List available providers

corte providers

Check LangChain integration status

corte langchain

Python API

from corte import PDFProcessor, LLMClientFactory, BatchProcessor

# Create processors
pdf_processor = PDFProcessor(chunk_size=5)

# Use direct OpenAI integration
llm_client = LLMClientFactory.create_client("openai")

# Or use LangChain providers
llm_client = LLMClientFactory.create_client("langchain_anthropic", model="claude-3-opus-20240229")

# Or use local Ollama model
llm_client = LLMClientFactory.create_client("ollama", model="llama3")

batch_processor = BatchProcessor(llm_client, pdf_processor)

# Process PDF
results = batch_processor.process_pdf_batch(
    pdf_path="document.pdf",
    prompt="Summarize this content"
)

# Combine results
combined_text = batch_processor.combine_results(results)
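The features list mentions configurable concurrency and rate limiting for API calls. The general pattern behind that combination can be sketched as follows; every name here is hypothetical and stands in for corte's internals, not a copy of them:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class RateLimiter:
    """Allow at most `calls_per_second` calls across all threads."""

    def __init__(self, calls_per_second):
        self.interval = 1.0 / calls_per_second
        self.lock = threading.Lock()
        self.next_slot = 0.0

    def wait(self):
        # Reserve the next available time slot, then sleep until it arrives.
        with self.lock:
            now = time.monotonic()
            self.next_slot = max(self.next_slot, now) + self.interval
            delay = self.next_slot - self.interval - now
        if delay > 0:
            time.sleep(delay)

def process_chunks(chunks, call_llm, max_workers=4, calls_per_second=2.0):
    """Process chunks concurrently while respecting a global rate limit."""
    limiter = RateLimiter(calls_per_second)

    def worker(chunk):
        limiter.wait()
        return call_llm(chunk)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves chunk order, so results line up with the input.
        return list(pool.map(worker, chunks))
```

A thread pool bounded by max_workers caps concurrency, while the shared limiter spaces out the actual API calls regardless of how many workers are waiting.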

Configuration

Set your API keys as environment variables:

# For direct integrations
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"

# LangChain integrations reuse the same OpenAI/Anthropic keys;
# Google Gemini needs its own key
export GOOGLE_API_KEY="your-google-key"

# Ollama runs locally, no API key needed

Or use a .env file in your project directory.
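For the .env route, python-dotenv is the usual loader; as a minimal illustration of what such a loader does (a sketch of the pattern, not corte's actual mechanism):

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: KEY=VALUE lines, '#' comments, simple quoting.

    python-dotenv handles full quoting and interpolation rules; this sketch
    only covers the simple KEY="value" case shown above.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault: never overwrite variables already set in the shell
            os.environ.setdefault(key.strip(), value.strip().strip('"'))
```

Real environment variables take precedence here, which matches the common convention of letting the shell override the file.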

Requirements

  • Python 3.8+
  • PyPDF2 for PDF processing
  • OpenAI Python client
  • Anthropic Python client
  • Click for the command-line interface
  • LangChain (optional, for additional provider support)

LangChain Providers

When you install corte[langchain], you get access to:

  • langchain_openai: OpenAI models via LangChain (an alternative to the direct integration)
  • langchain_anthropic: Anthropic Claude models via LangChain
  • langchain_google: Google Gemini models
  • ollama: Local models via Ollama (requires Ollama to be running)

License

MIT License
