Minimal document-to-jsonl serializer with coordinates for AI
docpipe-mini
docpipe-mini converts documents into JSONL (JSON Lines) with coordinate information, ready for AI consumption. It focuses on speed, minimal dependencies, and clean output.
🚀 Quick Start
# Install (5 MB core, zero dependencies)
pip install docpipe-mini
# Install PDF support (+11 MB, PyMuPDF, AGPL license)
pip install docpipe-mini[pdf]
# Convert document to JSONL
python -m docpipe serialize-cmd document.pdf > document.jsonl
📖 Usage
Python API
import docpipe_mini as dp
# Simple serialization
for chunk in dp.serialize("paper.pdf"):
    print(chunk.to_jsonl())
# {"doc_id":"uuid","page":1,"x":0.1,"y":0.2,"w":0.8,"h":0.1,"type":"text","text":"...","tokens":42}
# Direct JSONL output
for line in dp.serialize_to_jsonl("paper.pdf"):
    print(line)
# List supported formats
print(dp.list_formats())
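Downstream code usually reads the JSONL back one line at a time. The sketch below uses only the standard library and a hand-written sample in the record shape shown above. The helper name `group_text_by_page` and the sample records are illustrative assumptions, not part of docpipe-mini.

```python
import json
from collections import defaultdict

# Hand-written sample lines in the documented record shape (illustrative data).
SAMPLE_JSONL = """\
{"doc_id":"demo","page":1,"x":0.1,"y":0.2,"w":0.8,"h":0.1,"type":"text","text":"Intro","tokens":1}
{"doc_id":"demo","page":1,"x":0.1,"y":0.4,"w":0.8,"h":0.2,"type":"image","text":null,"tokens":0}
{"doc_id":"demo","page":2,"x":0.1,"y":0.1,"w":0.8,"h":0.1,"type":"text","text":"Methods","tokens":1}
"""


def group_text_by_page(lines):
    """Collect the text of every "text" record, keyed by page number."""
    pages = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if record["type"] == "text":
            pages[record["page"]].append(record["text"])
    return dict(pages)


pages = group_text_by_page(SAMPLE_JSONL.splitlines())
print(pages)
```

Because each line is an independent JSON object, the same loop works on an open file handle without loading the whole output into memory.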
Command Line
# Basic usage
python -m docpipe serialize-cmd document.pdf > output.jsonl
# Save to file
python -m docpipe serialize-cmd document.pdf -o output.jsonl
# Include images and export them
python -m docpipe serialize-cmd document.docx --include-binary --export-images ./images
# Filter content types
python -m docpipe serialize-cmd document.pdf --types text,table
# Show processing statistics
python -m docpipe serialize-cmd document.pdf --stats
# List supported formats
python -m docpipe formats
# Show system information
python -m docpipe info
# Validate document without full processing
python -m docpipe validate document.pdf
✨ Key Features
🖼️ Image Extraction
- PDF Images: Accurate extraction with coordinates using PyMuPDF
- Word Images: Standard library extraction from DOCX files
- Multiple Formats: PNG, JPEG, GIF, BMP, TIFF, WebP support
- Export Options: Base64 encoding in JSON or save to separate files
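When images are embedded as base64 in the JSON, a consumer has to decode them before use. The following is a minimal sketch of that step using the standard library. The `save_image` helper, the file naming scheme, and the `"bin"` fallback extension are assumptions for illustration; the field names `binary_data` and `image_format` are the ones that appear elsewhere in this README.

```python
import base64
import tempfile
from pathlib import Path


def save_image(record, out_dir):
    """Decode a base64 image record and write it under out_dir.

    Returns the written path, or None if the record carries no image data.
    """
    if record.get("type") != "image" or not record.get("binary_data"):
        return None
    data = base64.b64decode(record["binary_data"])
    # Falling back to "bin" when no format is recorded is an assumption.
    ext = record.get("metadata", {}).get("image_format", "bin").lower()
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: <doc_id>_p<page>.<ext>
    target = out_dir / f"{record['doc_id']}_p{record['page']}.{ext}"
    target.write_bytes(data)
    return target


# The PNG signature alone stands in for real image bytes in this demo.
payload = b"\x89PNG\r\n\x1a\n"
record = {
    "doc_id": "demo",
    "page": 1,
    "type": "image",
    "binary_data": base64.b64encode(payload).decode("ascii"),
    "binary_encoding": "base64",
    "metadata": {"image_format": "png"},
}

with tempfile.TemporaryDirectory() as tmp:
    path = save_image(record, tmp)
    print(path.name, path.read_bytes() == payload)
```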
🎛️ Rich CLI Interface
- Progress Bars: Real-time processing progress with Rich
- Statistics: Detailed processing metrics and content breakdown
- Content Filtering: Filter by content type (text, table, image)
- Memory Management: Built-in memory limits and monitoring
📍 Coordinate-Based Ordering
- Reading Order: Content appears in document reading order
- Accurate Positioning: Normalized coordinates (0-1 range)
- Multi-Content Support: Text, tables, and images positioned correctly
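If records from several sources are merged or shuffled, a simple reading order can be restored from the coordinates alone. The sketch below sorts by page, then top-to-bottom, then left-to-right. It is a naive illustration, not necessarily how docpipe-mini orders content internally, and it does not handle multi-column layouts; `reading_order_key` is a hypothetical helper.

```python
def reading_order_key(record):
    """Sort key: page first, then top-to-bottom (y), then left-to-right (x)."""
    return (record["page"], record["y"], record["x"])


# Illustrative records with normalized coordinates.
records = [
    {"page": 1, "x": 0.5, "y": 0.1, "text": "right column"},
    {"page": 2, "x": 0.1, "y": 0.1, "text": "next page"},
    {"page": 1, "x": 0.1, "y": 0.1, "text": "left column"},
    {"page": 1, "x": 0.1, "y": 0.6, "text": "lower block"},
]

ordered = [r["text"] for r in sorted(records, key=reading_order_key)]
print(ordered)
# → ['left column', 'right column', 'lower block', 'next page']
```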
📊 Output Format
Each line is a JSON object with:
{
  "doc_id": "uuid",              # Document identifier
  "page": 1,                     # Page number (1-based)
  "x": 0.123,                    # Normalized X coordinate (0-1)
  "y": 0.456,                    # Normalized Y coordinate (0-1)
  "w": 0.7,                      # Normalized width (0-1)
  "h": 0.08,                     # Normalized height (0-1)
  "type": "text",                # Content type: "text" | "table" | "image"
  "text": "...",                 # Text content (null for images)
  "tokens": 42,                  # Estimated token count
  "binary_data": "base64...",    # Binary data for images (base64 encoded, optional)
  "binary_encoding": "base64",   # Binary encoding format
  "metadata": {                  # Additional metadata
    "source_file": "doc.pdf",
    "file_name": "doc.pdf",
    "file_extension": ".pdf",
    "file_size": 1048576,
    "extraction_method": "pymupdf"
  }
}
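Normalized coordinates need to be scaled back to page dimensions before they can be drawn on a rendered page or used to crop an image. The sketch below shows that arithmetic. It assumes that (x, y) is the top-left corner of the box and that the origin is the top-left of the page, which this README does not state explicitly; `to_pixel_box` is a hypothetical helper.

```python
def to_pixel_box(record, page_width, page_height):
    """Scale a normalized (x, y, w, h) box to integer pixel corners.

    Assumes (x, y) is the top-left corner with a top-left page origin.
    """
    left = round(record["x"] * page_width)
    top = round(record["y"] * page_height)
    right = round((record["x"] + record["w"]) * page_width)
    bottom = round((record["y"] + record["h"]) * page_height)
    return left, top, right, bottom


record = {"x": 0.25, "y": 0.5, "w": 0.5, "h": 0.125}
box = to_pixel_box(record, page_width=800, page_height=1000)
print(box)
# → (200, 500, 600, 625)
```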
📦 Installation
Core Installation (5 MB)
pip install docpipe-mini
Zero third-party dependencies. Add format support as needed.
Optional Formats
# PDF support with PyMuPDF (AGPL, recommended, +11 MB)
pip install docpipe-mini[pdf]
# CLI with Rich interface (typer, +2 MB)
pip install docpipe-mini[cli]
# All optional dependencies
pip install docpipe-mini[all]
Development
git clone https://github.com/docpipe/docpipe-mini
cd docpipe-mini
uv sync --extra dev
pytest
🎯 Design Goals
- Minimal Dependencies: Core uses only Python standard library
- Fast Processing: ~300ms/MB on typical hardware
- AI-Ready Output: Clean JSONL with coordinates for LLM consumption
- Type Safety: Full type hints and mypy strict compliance
- Memory Safe: Built-in memory limits and lazy processing
- Rich CLI: Beautiful command-line interface with progress bars and statistics
- Image Support: Automatic image extraction with base64 encoding and file export
- Coordinate Ordering: Content is output in document reading order (top-to-bottom, left-to-right)
🏗️ Architecture
Document → Serializer → DocumentChunk → JSONL
- Loaders: Zero-dependency document parsers
- Processors: Coordinate extraction and text chunking
- Output: Standardized JSONL format
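To make the last two stages of the pipeline concrete, here is a rough sketch of a chunk object that serializes itself to a single JSON line. `ChunkSketch` is an illustrative stand-in built from the documented output fields; it is not docpipe-mini's own `DocumentChunk` class, whose internals are not shown in this README.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ChunkSketch:
    """Illustrative stand-in for a chunk; not the library's own class."""

    doc_id: str
    page: int
    x: float
    y: float
    w: float
    h: float
    type: str
    text: Optional[str] = None
    tokens: int = 0
    metadata: Dict[str, Any] = field(default_factory=dict)

    def to_jsonl(self) -> str:
        """Serialize to one compact JSON line with no embedded newlines."""
        return json.dumps(asdict(self), ensure_ascii=False, separators=(",", ":"))


chunk = ChunkSketch("demo", 1, 0.1, 0.2, 0.8, 0.1, "text", text="Hello", tokens=1)
line = chunk.to_jsonl()
print(line)
```

`json.dumps` escapes newlines inside strings, so each chunk always occupies exactly one line of output, which is what makes the format safe to stream.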
📋 Supported Formats
| Format | Status | Library | License | Features |
|---|---|---|---|---|
| PDF | ✅ | PyMuPDF | AGPL | Text, images, tables with accurate coordinates |
| DOCX | ✅ | Standard Library | MIT | Text, images with coordinate estimation |
| XLSX | 🚧 | Planned | - | Coming soon |
| Images | 🚧 | Planned | - | Coming soon |
🔧 Configuration
import docpipe_mini as dp
# Memory limit
for chunk in dp.serialize("large.pdf", max_mem_mb=256):
    # Process with 256MB memory limit
    pass
# Custom document ID
for chunk in dp.serialize("paper.pdf", doc_id="my-paper"):
    # Use custom ID instead of UUID
    pass
# Process with image extraction
for chunk in dp.serialize("document.docx"):
    # Images are automatically extracted and base64 encoded
    if chunk.type == "image":
        image_format = chunk.metadata['image_format']
        image_size = chunk.metadata['image_size_bytes']
        print(f"Found image: {image_format}, size: {image_size} bytes")
CLI Options
# Memory management
python -m docpipe serialize-cmd large.pdf --max-mem 256
# Content filtering
python -m docpipe serialize-cmd document.pdf --types text,table # Only text and tables
python -m docpipe serialize-cmd document.docx --types image # Only images
# Image handling
python -m docpipe serialize-cmd document.docx --include-binary # Include base64 image data
python -m docpipe serialize-cmd document.docx --export-images ./img # Export images to files
# Output control
python -m docpipe serialize-cmd document.pdf --no-jsonl # Plain text output
python -m docpipe serialize-cmd document.pdf --stats # Show processing statistics
🧪 Testing
# Run tests
pytest
# Run benchmarks
pytest -m benchmark
# Type checking
mypy --strict
📈 Performance
- Installation: 5 MB core, zero dependencies
- Processing: ~300ms/MB for PDF documents
- Memory: ~3x document size peak usage
- Output: 1-2x input size (JSON overhead)
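The figures above can be turned into a back-of-the-envelope estimate for a given file. The sketch below only applies the stated ratios; `estimate_cost` is a hypothetical helper, and real numbers will vary with hardware and document content.

```python
def estimate_cost(size_mb):
    """Apply the rough ratios listed above to a document of size_mb megabytes."""
    return {
        "seconds": size_mb * 300 / 1000,          # ~300 ms per MB
        "peak_memory_mb": size_mb * 3,            # ~3x document size
        "output_mb_range": (size_mb, size_mb * 2),  # 1-2x input size
    }


est = estimate_cost(20)
print(est)
# → {'seconds': 6.0, 'peak_memory_mb': 60, 'output_mb_range': (20, 40)}
```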
🤝 Contributing
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure mypy --strict passes
- Submit a pull request
📄 License
MIT License - see LICENSE file for details.
docpipe-mini - Fast, minimal document serialization for AI applications.
File details
Details for the file docpipe_mini-0.1.0a1.tar.gz.
File metadata
- Download URL: docpipe_mini-0.1.0a1.tar.gz
- Upload date:
- Size: 9.2 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 213c75f0a0fc0e207cde62bcf4be629e28ecba6bd4c2c6317e36832b698124b3 |
| MD5 | 5683b5fcfc511d420e6d1775342dff19 |
| BLAKE2b-256 | c5a916dd49fa3a23e0413645d3cd84f0bb0dffa20f7351e40192992d1e2f45fe |
File details
Details for the file docpipe_mini-0.1.0a1-py3-none-any.whl.
File metadata
- Download URL: docpipe_mini-0.1.0a1-py3-none-any.whl
- Upload date:
- Size: 36.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 535b146f11ce87cbcdcb170b00206f485fbc26898c8b14dcf7e4bdd83defcc7b |
| MD5 | f7b65cd19ae7d22242d2431417fcc7d1 |
| BLAKE2b-256 | f59e08180b9d89f29b5243b1ac7171f76ba059d727e75e8a0e5cf7cf81b9ebbb |