
pdf2llm - PDF to LLM Context Extractor

Extract content from PDFs in a format optimized for Large Language Model (LLM) consumption.

Features

  • Multiple formats: Extract as Markdown (preserves structure) or plain text
  • Image extraction: Automatically extracts and saves images with configurable DPI
  • Table preservation: Maintains table structure in Markdown format
  • Page boundaries: Optional page markers for maintaining document structure
  • Batch processing: Process multiple PDFs at once
  • Organized output: Clean directory structure for each PDF
  • Structure analysis: Analyze PDFs before extraction
  • Token estimation: Get token counts for LLM context planning
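pdf2llm reports a `token_estimate` for each document (see the Python API below). As an illustration of how such an estimate can be used for context planning, the sketch below applies the common rough heuristic of about 4 characters per token; this is an approximation for illustration only, not necessarily the method pdf2llm uses internally.

```python
# Rough token estimate for LLM context planning.
# The 4-chars-per-token ratio is a widely used heuristic for English
# text, not an exact tokenizer count.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

doc = "Setback requirements for R-1 zones are listed in Table 4." * 100
print(estimate_tokens(doc))
```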

Installation

# Install from PyPI
pip install pdf2llm
# or
uv pip install pdf2llm

# Or install from source
git clone https://github.com/yourusername/pdf2llm.git
cd pdf2llm
uv sync

Usage

Command Line Interface

# Basic extraction
uv run ./pdf2llm document.pdf

# Extract to specific directory
uv run ./pdf2llm document.pdf -o extracted_docs/

# Batch process multiple PDFs
uv run ./pdf2llm *.pdf -o zoning_docs/

# Extract as plain text without images
uv run ./pdf2llm document.pdf --format text --no-images

# Analyze PDF structure only
uv run ./pdf2llm document.pdf --analyze-only

# High quality image extraction
uv run ./pdf2llm document.pdf --dpi 300

# Get JSON output for integration
uv run ./pdf2llm document.pdf --json

# Set token limit warning
uv run ./pdf2llm document.pdf --token-limit 4000

Python API

from pathlib import Path

from pdf_utils import PDFExtractor

# Create extractor
extractor = PDFExtractor(
    output_dir=Path("extracted"),
    image_format="png",
    dpi=150
)

# Extract single PDF
result = extractor.extract(
    Path("document.pdf"),
    output_format="markdown"
)

print(f"Tokens: {result.token_estimate}")
print(f"Pages: {result.page_count}")
print(f"Has images: {result.has_images}")
print(f"Has tables: {result.has_tables}")

# Save to file
output_path = extractor.save_extraction(result, Path("document.pdf"))

# Batch extraction
pdf_files = list(Path("pdfs/").glob("*.pdf"))
results = extractor.batch_extract(pdf_files)
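Batch results can then be screened before sending anything to an LLM. The sketch below assumes `batch_extract` returns result objects with the same attributes shown for a single `extract()` call; the `Result` stand-in and `over_budget` helper are hypothetical names for illustration.

```python
from dataclasses import dataclass


# Hypothetical stand-in for pdf2llm's extraction result; the real
# object comes from pdf_utils and also carries page_count, etc.
@dataclass
class Result:
    name: str
    token_estimate: int


def over_budget(results, budget=8000):
    """Return names of documents whose token estimate exceeds the budget."""
    return [r.name for r in results if r.token_estimate > budget]


results = [Result("zoning_code", 12000), Result("summary", 900)]
print(over_budget(results))  # ['zoning_code']
```

Documents flagged this way can be split or summarized before they are placed in an LLM context window.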

Output Structure

extracted/
├── document_name/
│   ├── content.md        # Extracted content
│   └── images/           # Extracted images (if any)
│       ├── page-1-0.png
│       └── page-2-0.png
└── another_document/
    ├── content.md
    └── images/
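Because the layout is predictable (one directory per PDF, a `content.md`, and an optional `images/` subdirectory), downstream code can discover extractions with a glob. The `collect_extractions` helper below is an illustrative sketch, not part of the pdf2llm API.

```python
from pathlib import Path


def collect_extractions(root: Path) -> dict:
    """Map each document name to its content.md path and image paths,
    following the output layout produced by pdf2llm."""
    docs = {}
    for content in sorted(root.glob("*/content.md")):
        images_dir = content.parent / "images"
        images = sorted(images_dir.iterdir()) if images_dir.exists() else []
        docs[content.parent.name] = {"content": content, "images": images}
    return docs
```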

Use Cases

Zoning Documents Analysis

# Extract all zoning PDFs with high-quality images
uv run ./pdf2llm zoning_*.pdf -o zoning_analysis/ --dpi 300

# Then in your Python code:
with open("zoning_analysis/zoning_code_2024/content.md", "r") as f:
    content = f.read()

# Use with your LLM
response = llm.chat(
    messages=[{
        "role": "system",
        "content": f"You are analyzing zoning documents. Document: {content}"
    }, {
        "role": "user", 
        "content": "What are the setback requirements for R-1 zones?"
    }]
)

Document Q&A System

# Process all documents
uv run ./pdf2llm documents/*.pdf -o knowledge_base/

# Check token counts
uv run ./pdf2llm documents/*.pdf --json | jq '.token_estimate'
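The same accounting can be done in Python. The sketch below assumes the `--json` output contains one JSON object per document with a top-level `token_estimate` field, consistent with the jq example above; `total_tokens` is an illustrative helper name.

```python
import json


def total_tokens(json_lines) -> int:
    """Sum token estimates across per-document JSON records."""
    return sum(
        json.loads(line)["token_estimate"]
        for line in json_lines
        if line.strip()
    )


lines = ['{"token_estimate": 1200}', '{"token_estimate": 800}']
print(total_tokens(lines))  # 2000
```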

Research Paper Analysis

# Extract with tables and figures
uv run ./pdf2llm research_paper.pdf --dpi 200

# Extract text only for quick analysis
uv run ./pdf2llm research_paper.pdf --format text --no-images

CLI Options

Option              Description                               Default
------------------  ----------------------------------------  ----------
-o, --output-dir    Output directory                          extracted/
--format            Output format (markdown, text, both)      markdown
--no-images         Skip image extraction                     False
--image-format      Image format (png, jpg, jpeg)             png
--dpi               DPI for image extraction                  150
--no-page-chunks    Disable page boundary markers             False
--analyze-only      Only analyze structure                    False
--quiet             Minimal output                            False
--json              JSON output                               False
--token-limit       Warn if output exceeds the token limit    None

Package Structure

pdf_utils/
├── core/
│   └── extractor.py      # Core extraction logic
├── cli/
│   └── main.py           # CLI interface
└── __init__.py           # Package exports

Requirements

  • Python 3.12+
  • uv (for dependency management)
  • Dependencies managed in pyproject.toml

License

MIT License
