Minimal document-to-jsonl serializer with coordinates for AI

docpipe-mini

docpipe-mini converts documents into JSONL (JSON Lines) with coordinate information, ready for AI consumption. It focuses on speed, minimal dependencies, and clean output.

🚀 Quick Start

# Install (5 MB core, zero dependencies)
pip install docpipe-mini

# Install PDF support (+11 MB, PyMuPDF, AGPL license)
pip install docpipe-mini[pdf]

# Convert document to JSONL
python -m docpipe serialize-cmd document.pdf > document.jsonl

📖 Usage

Python API

import docpipe_mini as dp

# Simple serialization
for chunk in dp.serialize("paper.pdf"):
    print(chunk.to_jsonl())
    # {"doc_id":"uuid","page":1,"x":0.1,"y":0.2,"w":0.8,"h":0.1,"type":"text","text":"...","tokens":42}

# Direct JSONL output
for line in dp.serialize_to_jsonl("paper.pdf"):
    print(line)

# List supported formats
print(dp.list_formats())

Command Line

# Basic usage
python -m docpipe serialize-cmd document.pdf > output.jsonl

# Save to file
python -m docpipe serialize-cmd document.pdf -o output.jsonl

# Include images and export them
python -m docpipe serialize-cmd document.docx --include-binary --export-images ./images

# Filter content types
python -m docpipe serialize-cmd document.pdf --types text,table

# Show processing statistics
python -m docpipe serialize-cmd document.pdf --stats

# List supported formats
python -m docpipe formats

# Show system information
python -m docpipe info

# Validate document without full processing
python -m docpipe validate document.pdf

✨ Key Features

🖼️ Image Extraction

PDF Images: Accurate extraction with coordinates using PyMuPDF
Word Images: Standard library extraction from DOCX files
Multiple Formats: PNG, JPEG, GIF, BMP, TIFF, WebP support
Export Options: Base64 encoding in JSON or save to separate files
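
As a rough illustration of the base64 export option, the sketch below turns an image record back into raw bytes. It assumes the `type`, `binary_data`, and `binary_encoding` fields described under Output Format; the helper name `decode_image_record` and the sample record are hypothetical and not part of docpipe-mini.

```python
import base64
import json
from typing import Optional


def decode_image_record(line: str) -> Optional[bytes]:
    """Return raw image bytes for an image record, or None for anything else."""
    record = json.loads(line)
    if record.get("type") != "image":
        return None
    data = record.get("binary_data")
    # Only base64 is handled here; other encodings are treated as "no data".
    if not data or record.get("binary_encoding") != "base64":
        return None
    return base64.b64decode(data)


# Hand-built record for illustration (assumption: real records carry more fields).
sample = json.dumps({
    "type": "image",
    "binary_data": base64.b64encode(b"\x89PNG\r\n").decode("ascii"),
    "binary_encoding": "base64",
})
print(decode_image_record(sample))
```

The returned bytes can be written to disk with `pathlib.Path(...).write_bytes(...)` when the records were produced with binary data included.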

🎛️ Rich CLI Interface

Progress Bars: Real-time processing progress with Rich
Statistics: Detailed processing metrics and content breakdown
Content Filtering: Filter by content type (text, table, image)
Memory Management: Built-in memory limits and monitoring
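
The same kind of content breakdown and type filtering can be approximated downstream on an existing JSONL file. This is a minimal sketch using only the standard library; it assumes each line carries a `type` field, and the function names `content_breakdown` and `keep_types` are made up for this example.

```python
import json
from collections import Counter
from typing import Iterable, Iterator, Set


def content_breakdown(lines: Iterable[str]) -> Counter:
    """Count records per content type in a JSONL stream."""
    counts: Counter = Counter()
    for line in lines:
        if line.strip():  # skip blank lines
            counts[json.loads(line).get("type", "unknown")] += 1
    return counts


def keep_types(lines: Iterable[str], wanted: Set[str]) -> Iterator[str]:
    """Lazily yield only the lines whose "type" is in `wanted`."""
    for line in lines:
        if line.strip() and json.loads(line).get("type") in wanted:
            yield line


# Hand-written sample records, for illustration only.
sample = [
    '{"type": "text", "text": "Intro"}',
    '{"type": "table", "text": "a|b"}',
    '{"type": "image", "text": null}',
    '{"type": "text", "text": "Body"}',
]
print(dict(content_breakdown(sample)))
print(len(list(keep_types(sample, {"text", "table"}))))
```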

📍 Coordinate-Based Ordering

Reading Order: Content appears in document reading order
Accurate Positioning: Normalized coordinates (0-1 range)
Multi-Content Support: Text, tables, and images positioned correctly
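
One simple way to approximate reading order from normalized coordinates is to sort by page, then by vertical position, then by horizontal position. The sketch below is not necessarily how docpipe-mini orders content; the function `reading_order_key` and the `line_tolerance` parameter are assumptions for illustration, and this naive approach does not handle multi-column layouts.

```python
from typing import Dict, List, Tuple


def reading_order_key(record: Dict, line_tolerance: float = 0.01) -> Tuple[int, int, float]:
    """Sort key: page first, then row (quantized y), then x."""
    # Quantize y so that boxes on roughly the same line are compared by x.
    # Boxes that straddle a bucket boundary can still end up in different rows.
    row = round(record["y"] / line_tolerance)
    return (record["page"], row, record["x"])


# Hand-written records in scrambled order.
records: List[Dict] = [
    {"id": "c", "page": 1, "x": 0.1, "y": 0.500},
    {"id": "a", "page": 1, "x": 0.5, "y": 0.102},
    {"id": "d", "page": 2, "x": 0.1, "y": 0.050},
    {"id": "b", "page": 1, "x": 0.1, "y": 0.100},
]
ordered = sorted(records, key=reading_order_key)
print([r["id"] for r in ordered])
```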

📊 Output Format

Each line is a JSON object with:

{
  "doc_id": "uuid",           # Document identifier
  "page": 1,                  # Page number (1-based)
  "x": 0.123,                 # Normalized X coordinate (0-1)
  "y": 0.456,                 # Normalized Y coordinate (0-1)
  "w": 0.7,                   # Normalized width (0-1)
  "h": 0.08,                  # Normalized height (0-1)
  "type": "text",             # Content type: "text" | "table" | "image"
  "text": "...",              # Text content (null for images)
  "tokens": 42,               # Estimated token count
  "binary_data": "base64...", # Binary data for images (base64 encoded, optional)
  "binary_encoding": "base64",# Binary encoding format
  "metadata": {               # Additional metadata
    "source_file": "doc.pdf",
    "file_name": "doc.pdf",
    "file_extension": ".pdf",
    "file_size": 1048576,
    "extraction_method": "pymupdf"
  }
}
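
Because the coordinates are normalized, a consumer has to scale them by the page size to recover absolute positions. The sketch below assumes that `x` and `y` mark the top-left corner of the box and that the origin is the top-left of the page; the source does not state this, so check it against real output. The helper `to_pixel_box` is hypothetical.

```python
import json
from typing import Dict, Tuple


def to_pixel_box(record: Dict, page_width: float, page_height: float) -> Tuple[float, float, float, float]:
    """Convert a normalized (x, y, w, h) box to (left, top, right, bottom)."""
    left = record["x"] * page_width
    top = record["y"] * page_height
    right = (record["x"] + record["w"]) * page_width
    bottom = (record["y"] + record["h"]) * page_height
    return left, top, right, bottom


# A hand-written line in the documented shape (without the explanatory comments).
line = '{"doc_id": "demo", "page": 1, "x": 0.25, "y": 0.5, "w": 0.5, "h": 0.25, "type": "text", "text": "Hello"}'
record = json.loads(line)
print(to_pixel_box(record, page_width=800, page_height=1000))
# (200.0, 500.0, 600.0, 750.0)
```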

📦 Installation

Core Installation (5 MB)

pip install docpipe-mini

Zero third-party dependencies. Add format support as needed.

Optional Formats

# PDF support with PyMuPDF (AGPL, recommended, +11 MB)
pip install docpipe-mini[pdf]

# CLI with Rich interface (typer, +2 MB)
pip install docpipe-mini[cli]

# All optional dependencies
pip install docpipe-mini[all]

Development

git clone https://github.com/docpipe/docpipe-mini
cd docpipe-mini
uv sync --extra dev
pytest

🎯 Design Goals

Minimal Dependencies: Core uses only Python standard library
Fast Processing: ~300ms/MB on typical hardware
AI-Ready Output: Clean JSONL with coordinates for LLM consumption
Type Safety: Full type hints and mypy strict compliance
Memory Safe: Built-in memory limits and lazy processing
Rich CLI: Beautiful command-line interface with progress bars and statistics
Image Support: Automatic image extraction with base64 encoding and file export
Coordinate Ordering: Content is output in document reading order (top-to-bottom, left-to-right)

🏗️ Architecture

Document → Serializer → DocumentChunk → JSONL

Loaders: Zero-dependency document parsers
Processors: Coordinate extraction and text chunking
Output: Standardized JSONL format
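
To make the pipeline shape concrete, here is a toy version of the Document → Serializer → DocumentChunk → JSONL flow. It is a sketch under stated assumptions, not the library's implementation: the `Chunk` dataclass is a hypothetical stand-in for `DocumentChunk`, and `serialize_paragraphs` is an invented loader that simply stacks paragraphs vertically on one page.

```python
import json
import uuid
from dataclasses import asdict, dataclass
from typing import Iterator, List, Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for DocumentChunk, limited to the core fields."""
    doc_id: str
    page: int
    x: float
    y: float
    w: float
    h: float
    type: str
    text: Optional[str]

    def to_jsonl(self) -> str:
        # One compact JSON object per line, with no embedded newlines.
        return json.dumps(asdict(self), separators=(",", ":"))


def serialize_paragraphs(paragraphs: List[str], doc_id: Optional[str] = None) -> Iterator[Chunk]:
    """Toy serializer: lazily yield one full-width chunk per paragraph."""
    doc_id = doc_id or str(uuid.uuid4())
    n = len(paragraphs)
    for i, text in enumerate(paragraphs):
        # Each paragraph gets an equal vertical slice of page 1.
        yield Chunk(doc_id, 1, 0.0, i / n, 1.0, 1 / n, "text", text)


for chunk in serialize_paragraphs(["First paragraph.", "Second paragraph."], doc_id="demo"):
    print(chunk.to_jsonl())
```

Yielding chunks from a generator keeps only one chunk in memory at a time, which matches the lazy-processing goal described in this document.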

📋 Supported Formats

Format	Status	Library	License	Features
PDF	✅	PyMuPDF	AGPL	Text, images, tables with accurate coordinates
DOCX	✅	Standard Library	MIT	Text, images with coordinate estimation
XLSX	🚧	Planned	-	Coming soon
Images	🚧	Planned	-	Coming soon

🔧 Configuration

import docpipe_mini as dp

# Memory limit
for chunk in dp.serialize("large.pdf", max_mem_mb=256):
    # Process with 256MB memory limit
    pass

# Custom document ID
for chunk in dp.serialize("paper.pdf", doc_id="my-paper"):
    # Use custom ID instead of UUID
    pass

# Process with image extraction
for chunk in dp.serialize("document.docx"):
    # Images are automatically extracted and base64 encoded
    if chunk.type == "image":
        print(f"Found image: {chunk.metadata['image_format']}, size: {chunk.metadata['image_size_bytes']} bytes")

CLI Options

# Memory management
python -m docpipe serialize-cmd large.pdf --max-mem 256

# Content filtering
python -m docpipe serialize-cmd document.pdf --types text,table  # Only text and tables
python -m docpipe serialize-cmd document.docx --types image      # Only images

# Image handling
python -m docpipe serialize-cmd document.docx --include-binary   # Include base64 image data
python -m docpipe serialize-cmd document.docx --export-images ./img  # Export images to files

# Output control
python -m docpipe serialize-cmd document.pdf --no-jsonl  # Plain text output
python -m docpipe serialize-cmd document.pdf --stats     # Show processing statistics

🧪 Testing

# Run tests
pytest

# Run benchmarks
pytest -m benchmark

# Type checking
mypy --strict

📈 Performance

Installation: 5 MB core, zero dependencies
Processing: ~300ms/MB for PDF documents
Memory: ~3x document size peak usage
Output: 1-2x input size (JSON overhead)
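
These figures can be turned into a back-of-the-envelope estimate for a given file. The sketch below assumes the numbers scale linearly with file size, which the source does not claim; the function name `estimate_cost` and its return keys are invented for this example.

```python
from typing import Dict, Tuple, Union

MS_PER_MB = 300          # "~300ms/MB for PDF documents"
PEAK_MEMORY_FACTOR = 3   # "~3x document size peak usage"
OUTPUT_FACTORS = (1, 2)  # "1-2x input size"


def estimate_cost(file_size_mb: float) -> Dict[str, Union[float, Tuple[float, float]]]:
    """Rough processing estimate for a PDF of the given size (linear model)."""
    return {
        "seconds": file_size_mb * MS_PER_MB / 1000,
        "peak_memory_mb": file_size_mb * PEAK_MEMORY_FACTOR,
        "output_mb": (file_size_mb * OUTPUT_FACTORS[0], file_size_mb * OUTPUT_FACTORS[1]),
    }


print(estimate_cost(10))
```

Under this model, a 10 MB PDF would take about 3 seconds, peak at roughly 30 MB of memory, and produce 10 to 20 MB of JSONL.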

🤝 Contributing

Fork the repository
Create a feature branch
Add tests for new functionality
Ensure mypy --strict passes
Submit a pull request

📄 License

MIT License - see LICENSE file for details.

🔗 Links

docpipe-mini - Fast, minimal document serialization for AI applications.

Release history

0.2.4: Nov 3, 2025
0.2.3: Oct 20, 2025
0.1.0a1 (pre-release, this version): Oct 16, 2025

Download files

Source Distribution: docpipe_mini-0.1.0a1.tar.gz (9.2 MB), uploaded Oct 16, 2025
Built Distribution: docpipe_mini-0.1.0a1-py3-none-any.whl (36.1 kB), uploaded Oct 16, 2025, Python 3

File details

Details for the file docpipe_mini-0.1.0a1.tar.gz.

File metadata

Download URL: docpipe_mini-0.1.0a1.tar.gz
Upload date: Oct 16, 2025
Size: 9.2 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for docpipe_mini-0.1.0a1.tar.gz
Algorithm	Hash digest
SHA256	`213c75f0a0fc0e207cde62bcf4be629e28ecba6bd4c2c6317e36832b698124b3`
MD5	`5683b5fcfc511d420e6d1775342dff19`
BLAKE2b-256	`c5a916dd49fa3a23e0413645d3cd84f0bb0dffa20f7351e40192992d1e2f45fe`

File details

Details for the file docpipe_mini-0.1.0a1-py3-none-any.whl.

File metadata

Download URL: docpipe_mini-0.1.0a1-py3-none-any.whl
Upload date: Oct 16, 2025
Size: 36.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for docpipe_mini-0.1.0a1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`535b146f11ce87cbcdcb170b00206f485fbc26898c8b14dcf7e4bdd83defcc7b`
MD5	`f7b65cd19ae7d22242d2431417fcc7d1`
BLAKE2b-256	`f59e08180b9d89f29b5243b1ac7171f76ba059d727e75e8a0e5cf7cf81b9ebbb`
