pdf2llm - PDF to LLM Context Extractor
Extract content from PDFs in a format optimized for Large Language Model (LLM) consumption.
Features
- Multiple formats: Extract as Markdown (preserves structure) or plain text
- Image extraction: Automatically extracts and saves images with configurable DPI
- Table preservation: Maintains table structure in Markdown format
- Page boundaries: Optional page markers for maintaining document structure
- Batch processing: Process multiple PDFs at once
- Organized output: Clean directory structure for each PDF
- Structure analysis: Analyze PDFs before extraction
- Token estimation: Get token counts for LLM context planning
Installation
# Install from PyPI
pip install pdf2llm
# or
uv pip install pdf2llm
# Or install from source
git clone https://github.com/yourusername/pdf2llm.git
cd pdf2llm
uv sync
Usage
Command Line Interface
# Basic extraction
uv run ./pdf2llm document.pdf
# Extract to specific directory
uv run ./pdf2llm document.pdf -o extracted_docs/
# Batch process multiple PDFs
uv run ./pdf2llm *.pdf -o zoning_docs/
# Extract as plain text without images
uv run ./pdf2llm document.pdf --format text --no-images
# Analyze PDF structure only
uv run ./pdf2llm document.pdf --analyze-only
# High quality image extraction
uv run ./pdf2llm document.pdf --dpi 300
# Get JSON output for integration
uv run ./pdf2llm document.pdf --json
# Set token limit warning
uv run ./pdf2llm document.pdf --token-limit 4000
Python API
from pathlib import Path

from pdf_utils import PDFExtractor

# Create extractor
extractor = PDFExtractor(
    output_dir=Path("extracted"),
    image_format="png",
    dpi=150,
)
# Extract single PDF
result = extractor.extract(
    Path("document.pdf"),
    output_format="markdown",
)
print(f"Tokens: {result.token_estimate}")
print(f"Pages: {result.page_count}")
print(f"Has images: {result.has_images}")
print(f"Has tables: {result.has_tables}")
# Save to file
output_path = extractor.save_extraction(result, Path("document.pdf"))
# Batch extraction
pdf_files = list(Path("pdfs/").glob("*.pdf"))
results = extractor.batch_extract(pdf_files)
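When a batch feeds an LLM with a fixed context window, it can help to separate the documents that fit from those that need splitting. The sketch below assumes only that each result exposes the `token_estimate` attribute shown above. `StubResult` and `split_by_budget` are hypothetical names used for illustration so the snippet runs without the package installed, and because the return type of `batch_extract` is not documented here, the example works on a plain mapping of names to result-like objects.

```python
from dataclasses import dataclass


@dataclass
class StubResult:
    """Hypothetical stand-in exposing only the attribute this sketch needs."""
    token_estimate: int


def split_by_budget(results, token_limit):
    """Partition named results into (fits, too_large) by token estimate."""
    fits, too_large = [], []
    for name, result in results.items():
        target = fits if result.token_estimate <= token_limit else too_large
        target.append(name)
    return fits, too_large


# Illustrative data only; real values would come from the extractor.
results = {
    "short.pdf": StubResult(token_estimate=1200),
    "long.pdf": StubResult(token_estimate=9500),
}
fits, too_large = split_by_budget(results, token_limit=4000)
print("fits:", fits)
print("too large:", too_large)
```

With real results, the same function could be applied to whatever mapping you build from the extractor's output.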
Output Structure
extracted/
├── document_name/
│   ├── content.md          # Extracted content
│   └── images/             # Extracted images (if any)
│       ├── page-1-0.png
│       └── page-2-0.png
└── another_document/
    ├── content.md
    └── images/
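The layout above can be read back into Python with the standard library alone. This is a minimal sketch that assumes every document directory contains a `content.md` file and an optional `images/` folder, as shown in the tree. `load_extracted` is a hypothetical helper and not part of the package.

```python
from pathlib import Path


def load_extracted(root):
    """Return {document_name: (markdown_text, [image paths])} for the tree above."""
    root = Path(root)
    if not root.is_dir():
        return {}
    documents = {}
    for doc_dir in sorted(root.iterdir()):
        content_file = doc_dir / "content.md"
        if not content_file.is_file():
            continue  # skip entries that are not extracted documents
        images_dir = doc_dir / "images"
        images = sorted(images_dir.glob("*")) if images_dir.is_dir() else []
        documents[doc_dir.name] = (content_file.read_text(encoding="utf-8"), images)
    return documents


if __name__ == "__main__":
    for name, (text, images) in load_extracted("extracted").items():
        print(f"{name}: {len(text)} characters, {len(images)} images")
```

Other output formats may use a different file name, so adjust `content.md` if your output differs.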
Use Cases
Zoning Documents Analysis
# Extract all zoning PDFs with high-quality images
uv run ./pdf2llm zoning_*.pdf -o zoning_analysis/ --dpi 300
# Then in your Python code:
with open("zoning_analysis/zoning_code_2024/content.md", "r", encoding="utf-8") as f:
    content = f.read()
# Use with your LLM (`llm` stands for whichever chat client you use)
response = llm.chat(
    messages=[{
        "role": "system",
        "content": f"You are analyzing zoning documents. Document: {content}"
    }, {
        "role": "user",
        "content": "What are the setback requirements for R-1 zones?"
    }]
)
Document Q&A System
# Process all documents
uv run ./pdf2llm documents/*.pdf -o knowledge_base/
# Check token counts
uv run ./pdf2llm documents/*.pdf --json | jq '.token_estimate'
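If a document's token count exceeds the model's context window, a Q&A pipeline usually needs to split the extracted Markdown before sending it. The following is a rough sketch and not something the package provides: `chunk_markdown` is a hypothetical helper, and the four-characters-per-token ratio is a common rule of thumb for English text, not the estimator pdf2llm uses.

```python
def chunk_markdown(text, max_tokens=4000, chars_per_token=4):
    """Greedily pack blank-line-separated paragraphs into chunks.

    The budget is approximate: tokens are estimated from character counts.
    A single paragraph longer than the budget becomes its own oversized chunk.
    """
    max_chars = max_tokens * chars_per_token
    chunks, current, current_len = [], [], 0
    for paragraph in text.split("\n\n"):
        if not paragraph.strip():
            continue  # ignore empty paragraphs
        added = len(paragraph) + (2 if current else 0)  # 2 for the separator
        if current and current_len + added > max_chars:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
            added = len(paragraph)
        current.append(paragraph)
        current_len += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Illustrative input: ten short paragraphs and a deliberately small budget.
sample = "\n\n".join(f"Paragraph {i} " + "x" * 50 for i in range(10))
chunks = chunk_markdown(sample, max_tokens=40)
print(len(chunks), [len(c) for c in chunks])
```

Splitting on paragraph boundaries keeps tables and list items intact more often than cutting at a fixed character offset, at the cost of uneven chunk sizes.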
Research Paper Analysis
# Extract with tables and figures
uv run ./pdf2llm research_paper.pdf --dpi 200
# Extract text only for quick analysis
uv run ./pdf2llm research_paper.pdf --format text --no-images
CLI Options
| Option | Description | Default |
|---|---|---|
| `-o`, `--output-dir` | Output directory | extracted/ |
| `--format` | Output format (markdown, text, both) | markdown |
| `--no-images` | Skip image extraction | False |
| `--image-format` | Image format (png, jpg, jpeg) | png |
| `--dpi` | DPI for image extraction | 150 |
| `--no-page-chunks` | Disable page boundary markers | False |
| `--analyze-only` | Only analyze structure | False |
| `--quiet` | Minimal output | False |
| `--json` | JSON output | False |
| `--token-limit` | Warn if exceeds limit | None |
Package Structure
pdf_utils/
├── core/
│   └── extractor.py    # Core extraction logic
├── cli/
│   └── main.py         # CLI interface
└── __init__.py         # Package exports
Requirements
- Python 3.12+
- uv (for dependency management)
- Dependencies managed in pyproject.toml
License
MIT License