Extract PDF content optimized for Large Language Model (LLM) consumption

pdf2llm - PDF to LLM Context Extractor

Extract content from PDFs in a format optimized for Large Language Model (LLM) consumption.

Features

Multiple formats: Extract as Markdown (preserves structure) or plain text
Image extraction: Automatically extracts and saves images with configurable DPI
Table preservation: Maintains table structure in Markdown format
Page boundaries: Optional page markers for maintaining document structure
Batch processing: Process multiple PDFs at once
Organized output: Clean directory structure for each PDF
Structure analysis: Analyze PDFs before extraction
Token estimation: Get token counts for LLM context planning

Installation

# Install from PyPI
pip install pdf2llm
# or
uv pip install pdf2llm

# Or install from source
git clone https://github.com/yourusername/pdf2llm.git
cd pdf2llm
uv sync

Usage

Command Line Interface

# Basic extraction
uv run ./pdf2llm document.pdf

# Extract to specific directory
uv run ./pdf2llm document.pdf -o extracted_docs/

# Batch process multiple PDFs
uv run ./pdf2llm *.pdf -o zoning_docs/

# Extract as plain text without images
uv run ./pdf2llm document.pdf --format text --no-images

# Analyze PDF structure only
uv run ./pdf2llm document.pdf --analyze-only

# High quality image extraction
uv run ./pdf2llm document.pdf --dpi 300

# Get JSON output for integration
uv run ./pdf2llm document.pdf --json

# Set token limit warning
uv run ./pdf2llm document.pdf --token-limit 4000
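The README does not say how pdf2llm computes its token estimate. As a rough illustration of what a token-limit check involves, the sketch below uses the common rule of thumb of about four characters per token for English text. The function names and the ratio are assumptions made for this example, not the package's implementation.

```python
import math
from typing import Optional

CHARS_PER_TOKEN = 4  # assumed rule of thumb, not pdf2llm's actual estimator


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def exceeds_limit(text: str, token_limit: Optional[int]) -> bool:
    """Return True when the estimate is above the limit (None disables the check)."""
    if token_limit is None:
        return False
    return estimate_tokens(text) > token_limit


sample = "word " * 4000  # 20,000 characters
print(estimate_tokens(sample), exceeds_limit(sample, 4000))  # prints "5000 True"
```

A real tokenizer will give different counts, so treat a heuristic like this as a planning aid rather than an exact budget.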

Python API

from pathlib import Path

from pdf_utils import PDFExtractor

# Create extractor
extractor = PDFExtractor(
    output_dir=Path("extracted"),
    image_format="png",
    dpi=150
)

# Extract single PDF
result = extractor.extract(
    Path("document.pdf"),
    output_format="markdown"
)

print(f"Tokens: {result.token_estimate}")
print(f"Pages: {result.page_count}")
print(f"Has images: {result.has_images}")
print(f"Has tables: {result.has_tables}")

# Save to file
output_path = extractor.save_extraction(result, Path("document.pdf"))

# Batch extraction
pdf_files = list(Path("pdfs/").glob("*.pdf"))
results = extractor.batch_extract(pdf_files)

Output Structure

extracted/
├── document_name/
│   ├── content.md         # Extracted content
│   └── images/           # Extracted images (if any)
│       ├── page-1-0.png
│       └── page-2-0.png
└── another_document/
    ├── content.md
    └── images/
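Given the layout above, downstream code can collect every extracted document with a short directory walk. This is a minimal sketch that assumes each document directory holds a `content.md` and an optional `images/` folder, as shown in the tree; `load_extractions` is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def load_extractions(root: Path) -> dict[str, dict]:
    """Map each document directory name to its Markdown text and image paths."""
    documents = {}
    for doc_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        content_file = doc_dir / "content.md"
        if not content_file.is_file():
            continue  # skip directories that do not follow the layout
        images_dir = doc_dir / "images"
        images = sorted(images_dir.iterdir()) if images_dir.is_dir() else []
        documents[doc_dir.name] = {
            "content": content_file.read_text(encoding="utf-8"),
            "images": images,
        }
    return documents


if __name__ == "__main__":
    root = Path("extracted")
    if root.is_dir():
        for name, doc in load_extractions(root).items():
            print(name, len(doc["content"]), "chars,", len(doc["images"]), "images")
```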

Use Cases

Zoning Documents Analysis

# Extract all zoning PDFs with high-quality images
uv run ./pdf2llm zoning_*.pdf -o zoning_analysis/ --dpi 300

# Then in your Python code:
with open("zoning_analysis/zoning_code_2024/content.md", "r", encoding="utf-8") as f:
    content = f.read()

# Use with your LLM (`llm` stands in for whichever chat client you use)
response = llm.chat(
    messages=[{
        "role": "system",
        "content": f"You are analyzing zoning documents. Document: {content}"
    }, {
        "role": "user", 
        "content": "What are the setback requirements for R-1 zones?"
    }]
)
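A long zoning code may not fit in a single prompt. One option is to split `content.md` into pieces before sending it. The sketch below packs blank-line-separated paragraphs into chunks under a character budget; `chunk_markdown` and the default budget are illustrative assumptions, and nothing in the README says pdf2llm does this itself.

```python
def chunk_markdown(text: str, max_chars: int = 12000) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut mid-sentence.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate  # still fits, or first paragraph of a new chunk
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


print(chunk_markdown("aaaa\n\nbbbb\n\ncccc", max_chars=10))  # prints ['aaaa\n\nbbbb', 'cccc']
```

Each chunk can then be sent in its own request, or filtered for relevance before being placed in the system message.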

Document Q&A System

# Process all documents
uv run ./pdf2llm documents/*.pdf -o knowledge_base/

# Check token counts
uv run ./pdf2llm documents/*.pdf --json | jq '.token_estimate'

Research Paper Analysis

# Extract with tables and figures
uv run ./pdf2llm research_paper.pdf --dpi 200

# Extract text only for quick analysis
uv run ./pdf2llm research_paper.pdf --format text --no-images

CLI Options

Option	Description	Default
`-o, --output-dir`	Output directory	`extracted/`
`--format`	Output format (markdown, text, both)	`markdown`
`--no-images`	Skip image extraction	False
`--image-format`	Image format (png, jpg, jpeg)	`png`
`--dpi`	DPI for image extraction	`150`
`--no-page-chunks`	Disable page boundary markers	False
`--analyze-only`	Only analyze structure	False
`--quiet`	Minimal output	False
`--json`	JSON output	False
`--token-limit`	Warn if exceeds limit	None
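When wiring the tool into scripts, it can help to read the options table as a parser definition. The `argparse` sketch below mirrors the flags and defaults listed above. It is a reconstruction from the table, not the code in `cli/main.py`; the `pdfs` positional name, the `build_parser` function, and the argument types are assumptions.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose flags and defaults follow the CLI options table."""
    parser = argparse.ArgumentParser(prog="pdf2llm")
    parser.add_argument("pdfs", nargs="+", type=Path, help="PDF files to process")
    parser.add_argument("-o", "--output-dir", type=Path, default=Path("extracted"))
    parser.add_argument("--format", choices=["markdown", "text", "both"], default="markdown")
    parser.add_argument("--no-images", action="store_true")
    parser.add_argument("--image-format", choices=["png", "jpg", "jpeg"], default="png")
    parser.add_argument("--dpi", type=int, default=150)
    parser.add_argument("--no-page-chunks", action="store_true")
    parser.add_argument("--analyze-only", action="store_true")
    parser.add_argument("--quiet", action="store_true")
    parser.add_argument("--json", action="store_true")
    parser.add_argument("--token-limit", type=int, default=None)
    return parser


args = build_parser().parse_args(["document.pdf", "--format", "text", "--no-images"])
print(args.format, args.no_images, args.dpi)  # prints "text True 150"
```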

Package Structure

pdf_utils/
├── core/
│   └── extractor.py      # Core extraction logic
├── cli/
│   └── main.py          # CLI interface
└── __init__.py         # Package exports

Requirements

Python 3.12+
uv (for dependency management)
Dependencies managed in pyproject.toml

License

MIT License

Release history

0.1.1 (Aug 1, 2025)
0.1.0 (Aug 1, 2025, this version)

Download files

Source Distribution

pdf2llm-0.1.0.tar.gz (7.5 kB)

Uploaded Aug 1, 2025 Source

Built Distribution

pdf2llm-0.1.0-py3-none-any.whl (11.8 kB)

Uploaded Aug 1, 2025 Python 3

File details

Details for the file pdf2llm-0.1.0.tar.gz.

File metadata

Download URL: pdf2llm-0.1.0.tar.gz
Upload date: Aug 1, 2025
Size: 7.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.6

File hashes

Hashes for pdf2llm-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`84fb30447b37a799b6e2f7a68698fc85ad6a8808ac2e05097ccb326b85a44253`
MD5	`9596d4a12c98c8c3882f407febd5af66`
BLAKE2b-256	`60499b56a574b880169398a2d12af6518de7b7a2f5c77d39ee2c9c6ed6e44946`

File details

Details for the file pdf2llm-0.1.0-py3-none-any.whl.

File metadata

Download URL: pdf2llm-0.1.0-py3-none-any.whl
Upload date: Aug 1, 2025
Size: 11.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.6

File hashes

Hashes for pdf2llm-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`299e02e7f8d6ed5f66e800d474f3c9e7ba2baa7d1d3b97d17fc03d505c99308a`
MD5	`26e0d3deef9cfdae617e2f843e80b947`
BLAKE2b-256	`af81015ad74cd705db42b7085e0520550755662f658a8ef96f738fdfec1325e7`
