Extract PDF content optimized for Large Language Model (LLM) consumption

pdf2llm - PDF to LLM Context Extractor

Extract content from PDFs in a format optimized for Large Language Model (LLM) consumption.

Features

  • Multiple formats: Extract as Markdown (preserves structure) or plain text
  • Image extraction: Automatically extracts and saves images with configurable DPI
  • Table preservation: Maintains table structure in Markdown format
  • Page boundaries: Optional page markers for maintaining document structure
  • Batch processing: Process multiple PDFs at once
  • Organized output: Clean directory structure for each PDF
  • Structure analysis: Analyze PDFs before extraction
  • Token estimation: Get token counts for LLM context planning
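This README does not say how token counts are estimated. As a rough illustration of the idea only, the sketch below uses the common rule of thumb of about four characters per token for English text. The names `estimate_tokens` and `exceeds_limit` and the ratio itself are assumptions for illustration, not pdf2llm's actual method.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate: character count divided by an assumed ratio.

    The 4-characters-per-token default is a rule of thumb for English text,
    not a property of any particular tokenizer.
    """
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(len(text) / chars_per_token)


def exceeds_limit(text: str, token_limit: int) -> bool:
    """Return True when the estimate is above the given limit."""
    return estimate_tokens(text) > token_limit
```

For exact counts, use the tokenizer that belongs to the model you plan to call.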

Installation

# Install from PyPI
pip install pdf2llm
# or
uv pip install pdf2llm

# Or install from source
git clone https://github.com/yourusername/pdf2llm.git
cd pdf2llm
uv sync

Usage

Command Line Interface

# Basic extraction
uv run ./pdf2llm document.pdf

# Extract to specific directory
uv run ./pdf2llm document.pdf -o extracted_docs/

# Batch process multiple PDFs
uv run ./pdf2llm *.pdf -o zoning_docs/

# Extract as plain text without images
uv run ./pdf2llm document.pdf --format text --no-images

# Analyze PDF structure only
uv run ./pdf2llm document.pdf --analyze-only

# High quality image extraction
uv run ./pdf2llm document.pdf --dpi 300

# Get JSON output for integration
uv run ./pdf2llm document.pdf --json

# Set token limit warning
uv run ./pdf2llm document.pdf --token-limit 4000

Python API

from pathlib import Path

from pdf_utils import PDFExtractor

# Create extractor
extractor = PDFExtractor(
    output_dir=Path("extracted"),
    image_format="png",
    dpi=150
)

# Extract single PDF
result = extractor.extract(
    Path("document.pdf"),
    output_format="markdown"
)

print(f"Tokens: {result.token_estimate}")
print(f"Pages: {result.page_count}")
print(f"Has images: {result.has_images}")
print(f"Has tables: {result.has_tables}")

# Save to file
output_path = extractor.save_extraction(result, Path("document.pdf"))

# Batch extraction
pdf_files = list(Path("pdfs/").glob("*.pdf"))
results = extractor.batch_extract(pdf_files)

Output Structure

extracted/
├── document_name/
│   ├── content.md         # Extracted content
│   └── images/           # Extracted images (if any)
│       ├── page-1-0.png
│       └── page-2-0.png
└── another_document/
    ├── content.md
    └── images/
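Given the layout above, extracted documents can be discovered with the standard library alone. The following is a minimal sketch that assumes every document directory contains a `content.md` and, optionally, an `images/` subdirectory, as shown in the tree. The helper name `list_extractions` is hypothetical and not part of the package.

```python
from pathlib import Path


def list_extractions(root: Path) -> dict[str, dict]:
    """Map each document directory name to its content file and image files."""
    found = {}
    for doc_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        content = doc_dir / "content.md"
        if not content.is_file():
            continue  # skip directories that do not look like extractions
        images_dir = doc_dir / "images"
        images = sorted(images_dir.iterdir()) if images_dir.is_dir() else []
        found[doc_dir.name] = {"content": content, "images": images}
    return found
```

Calling `list_extractions(Path("extracted"))` returns one entry per document, which is a convenient starting point for feeding content files into later processing.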

Use Cases

Zoning Documents Analysis

# Extract all zoning PDFs with high-quality images
uv run ./pdf2llm zoning_*.pdf -o zoning_analysis/ --dpi 300

# Then in your Python code:
with open("zoning_analysis/zoning_code_2024/content.md", "r") as f:
    content = f.read()

# Use with your LLM (`llm` is a placeholder for your own client object)
response = llm.chat(
    messages=[{
        "role": "system",
        "content": f"You are analyzing zoning documents. Document: {content}"
    }, {
        "role": "user",
        "content": "What are the setback requirements for R-1 zones?"
    }]
)
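A full zoning code may not fit in a single prompt. One possible way to handle that is to split the extracted text on blank lines and group paragraphs into chunks under a size budget. This is a sketch under stated assumptions: the helper `split_into_chunks` is hypothetical, and the budget is measured in characters as a stand-in for tokens.

```python
def split_into_chunks(text: str, max_chars: int = 16000) -> list[str]:
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Account for the blank-line separator when the chunk is not empty.
        added = len(para) + (2 if current else 0)
        if current and size + added > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
            added = len(para)
        current.append(para)
        size += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk can then be sent in its own request, or only the chunks relevant to the question can be included.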

Document Q&A System

# Process all documents
uv run ./pdf2llm documents/*.pdf -o knowledge_base/

# Check token counts
uv run ./pdf2llm documents/*.pdf --json | jq '.token_estimate'
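Once the knowledge base directory exists, a Q&A system needs some way to decide which documents to show the model. The sketch below loads every `content.md` and ranks documents by naive keyword overlap with the question. Both helper names are hypothetical, and the scoring is a deliberately simple placeholder for real retrieval such as embeddings or BM25.

```python
from pathlib import Path


def load_knowledge_base(root: Path) -> dict[str, str]:
    """Read every <root>/<document>/content.md into a name -> text mapping."""
    return {
        path.parent.name: path.read_text(encoding="utf-8")
        for path in sorted(root.glob("*/content.md"))
    }


def rank_documents(docs: dict[str, str], question: str) -> list[str]:
    """Order document names by how many question words appear in each text."""
    # Keep words longer than three characters, with edge punctuation removed.
    words = {w.strip(".,?!") for w in question.lower().split()}
    words = {w for w in words if len(w) > 3}
    scores = {
        name: sum(1 for w in words if w in text.lower())
        for name, text in docs.items()
    }
    # Highest score first; ties are broken alphabetically by name.
    return sorted(scores, key=lambda name: (-scores[name], name))
```

The top-ranked names can be used to pick which extracted documents go into the prompt.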

Research Paper Analysis

# Extract with tables and figures
uv run ./pdf2llm research_paper.pdf --dpi 200

# Extract text only for quick analysis
uv run ./pdf2llm research_paper.pdf --format text --no-images

CLI Options

Option            Description                           Default
-o, --output-dir  Output directory                      extracted/
--format          Output format (markdown, text, both)  markdown
--no-images       Skip image extraction                 False
--image-format    Image format (png, jpg, jpeg)         png
--dpi             DPI for image extraction              150
--no-page-chunks  Disable page boundary markers         False
--analyze-only    Only analyze structure                False
--quiet           Minimal output                        False
--json            JSON output                           False
--token-limit     Warn if exceeds limit                 None
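When driving the CLI from a script, it can help to assemble the argument list in one place. The sketch below builds an argv list from a subset of the options listed above, using the `uv run ./pdf2llm` invocation shown in the usage examples. The helper `build_command` is hypothetical and not part of the package.

```python
def build_command(pdfs, output_dir=None, fmt=None, no_images=False,
                  dpi=None, as_json=False, token_limit=None):
    """Return an argv list for the pdf2llm CLI using the documented options."""
    cmd = ["uv", "run", "./pdf2llm", *map(str, pdfs)]
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if fmt is not None:
        if fmt not in ("markdown", "text", "both"):
            raise ValueError(f"unsupported format: {fmt}")
        cmd += ["--format", fmt]
    if no_images:
        cmd.append("--no-images")
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]
    if as_json:
        cmd.append("--json")
    if token_limit is not None:
        cmd += ["--token-limit", str(token_limit)]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a shell string avoids quoting problems with file names that contain spaces.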

Package Structure

pdf_utils/
├── core/
│   └── extractor.py      # Core extraction logic
├── cli/
│   └── main.py          # CLI interface
└── __init__.py         # Package exports

Requirements

  • Python 3.12+
  • uv (for dependency management)
  • Dependencies managed in pyproject.toml

License

MIT License
