Minimal document-to-jsonl serializer with coordinates for AI

docpipe-mini

docpipe-mini converts documents into JSONL (JSON Lines), attaching coordinate information to each chunk so the output is ready for AI consumption. It focuses on speed, minimal dependencies, and clean output.

🚀 Quick Start

# Install (5 MB core, zero dependencies)
pip install docpipe-mini

# Install PDF support (+11 MB, PyMuPDF, AGPL license)
pip install docpipe-mini[pdf]

# Convert document to JSONL
python -m docpipe serialize-cmd document.pdf > document.jsonl

📖 Usage

Python API

import docpipe_mini as dp

# Simple serialization
for chunk in dp.serialize("paper.pdf"):
    print(chunk.to_jsonl())
    # {"doc_id":"uuid","page":1,"x":0.1,"y":0.2,"w":0.8,"h":0.1,"type":"text","text":"...","tokens":42}

# Direct JSONL output
for line in dp.serialize_to_jsonl("paper.pdf"):
    print(line)

# List supported formats
print(dp.list_formats())
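
As a minimal sketch of what downstream consumption might look like, the snippet below parses JSONL lines shaped like the example record above and groups the text by page. It uses only the standard library and does not call docpipe-mini; the helper name `text_by_page` and the sample lines are assumptions made for illustration.

```python
import json
from collections import defaultdict


def text_by_page(jsonl_lines):
    """Group the text of "text" records by page number."""
    pages = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if record.get("type") == "text" and record.get("text"):
            pages[record["page"]].append(record["text"])
    return dict(pages)


# Hypothetical sample lines, written by hand to match the documented fields.
sample = [
    '{"doc_id":"demo","page":1,"x":0.1,"y":0.1,"w":0.8,"h":0.1,"type":"text","text":"Title","tokens":1}',
    '{"doc_id":"demo","page":1,"x":0.1,"y":0.3,"w":0.8,"h":0.2,"type":"text","text":"Body","tokens":1}',
    '{"doc_id":"demo","page":2,"x":0.1,"y":0.1,"w":0.5,"h":0.4,"type":"image","text":null,"tokens":0}',
    '{"doc_id":"demo","page":2,"x":0.1,"y":0.6,"w":0.8,"h":0.1,"type":"text","text":"Next","tokens":1}',
]

grouped = text_by_page(sample)
```

In practice the same function could be fed the lines of a `.jsonl` file opened with `open(path, encoding="utf-8")`.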

Command Line

# Basic usage
python -m docpipe serialize-cmd document.pdf > output.jsonl

# Save to file
python -m docpipe serialize-cmd document.pdf -o output.jsonl

# Include images and export them
python -m docpipe serialize-cmd document.docx --include-binary --export-images ./images

# Filter content types
python -m docpipe serialize-cmd document.pdf --types text,table

# Show processing statistics
python -m docpipe serialize-cmd document.pdf --stats

# List supported formats
python -m docpipe formats

# Show system information
python -m docpipe info

# Validate document without full processing
python -m docpipe validate document.pdf

✨ Key Features

🖼️ Image Extraction

  • PDF Images: Accurate extraction with coordinates using PyMuPDF
  • Word Images: Standard library extraction from DOCX files
  • Multiple Formats: PNG, JPEG, GIF, BMP, TIFF, WebP support
  • Export Options: Base64 encoding in JSON or save to separate files

🎛️ Rich CLI Interface

  • Progress Bars: Real-time processing progress with Rich
  • Statistics: Detailed processing metrics and content breakdown
  • Content Filtering: Filter by content type (text, table, image)
  • Memory Management: Built-in memory limits and monitoring

📍 Coordinate-Based Ordering

  • Reading Order: Content appears in document reading order
  • Accurate Positioning: Normalized coordinates (0-1 range)
  • Multi-Content Support: Text, tables, and images positioned correctly
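
The two ideas above, reading order and normalized coordinates, can be sketched in a few lines. This is an illustration under stated assumptions, not docpipe-mini's own algorithm: it assumes the origin is the top-left corner with `y` growing downward, and the helper names `reading_order` and `to_pixels` are hypothetical.

```python
def reading_order(records):
    """Sort records by page, then top-to-bottom, then left-to-right."""
    return sorted(records, key=lambda r: (r["page"], r["y"], r["x"]))


def to_pixels(record, page_width, page_height):
    """Scale a normalized (0-1) box to pixel (left, top, right, bottom)."""
    left = record["x"] * page_width
    top = record["y"] * page_height
    right = (record["x"] + record["w"]) * page_width
    bottom = (record["y"] + record["h"]) * page_height
    return tuple(round(v) for v in (left, top, right, bottom))


# Hypothetical records carrying only the fields the helpers need.
records = [
    {"page": 1, "x": 0.1, "y": 0.5, "w": 0.8, "h": 0.1, "text": "third"},
    {"page": 1, "x": 0.6, "y": 0.1, "w": 0.3, "h": 0.1, "text": "second"},
    {"page": 1, "x": 0.1, "y": 0.1, "w": 0.3, "h": 0.1, "text": "first"},
]

ordered = [r["text"] for r in reading_order(records)]
box = to_pixels({"x": 0.1, "y": 0.2, "w": 0.8, "h": 0.1}, 1000, 2000)
```

A plain sort on exact `y` values is deliberately naive: blocks on the same visual line with slightly different `y` values would need tolerance-based grouping, which is left out here.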

📊 Output Format

Each line is a JSON object with:

{
  "doc_id": "uuid",           # Document identifier
  "page": 1,                  # Page number (1-based)
  "x": 0.123,                 # Normalized X coordinate (0-1)
  "y": 0.456,                 # Normalized Y coordinate (0-1)
  "w": 0.7,                   # Normalized width (0-1)
  "h": 0.08,                  # Normalized height (0-1)
  "type": "text",             # Content type: "text" | "table" | "image"
  "text": "...",              # Text content (null for images)
  "tokens": 42,               # Estimated token count
  "binary_data": "base64...", # Binary data for images (base64 encoded, optional)
  "binary_encoding": "base64",# Binary encoding format
  "metadata": {               # Additional metadata
    "source_file": "doc.pdf",
    "file_name": "doc.pdf",
    "file_extension": ".pdf",
    "file_size": 1048576,
    "extraction_method": "pymupdf"
  }
}
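
To show how the `binary_data` and `binary_encoding` fields might be used, here is a small sketch that recovers raw image bytes from a record. It relies only on the standard library; `decode_image` is a hypothetical helper, and the sample payload is a placeholder rather than a real image.

```python
import base64


def decode_image(record):
    """Return the raw bytes of an image record, or None if unavailable."""
    if record.get("type") != "image":
        return None
    data = record.get("binary_data")
    if not data:
        return None  # binary data is optional and may be absent
    if record.get("binary_encoding", "base64") != "base64":
        raise ValueError("unsupported binary encoding")
    return base64.b64decode(data)


# Hypothetical record for illustration; the payload is not a real image.
payload = b"\x89PNG-demo"
record = {
    "type": "image",
    "text": None,
    "binary_data": base64.b64encode(payload).decode("ascii"),
    "binary_encoding": "base64",
}

image_bytes = decode_image(record)
```

The returned bytes could then be written to disk with `pathlib.Path(...).write_bytes(image_bytes)`.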

📦 Installation

Core Installation (5 MB)

pip install docpipe-mini

Zero third-party dependencies. Add format support as needed.

Optional Formats

# PDF support with PyMuPDF (AGPL, recommended, +11 MB)
pip install docpipe-mini[pdf]

# CLI with Rich interface (typer, +2 MB)
pip install docpipe-mini[cli]

# All optional dependencies
pip install docpipe-mini[all]

Development

git clone https://github.com/docpipe/docpipe-mini
cd docpipe-mini
uv sync --extra dev
pytest

🎯 Design Goals

  • Minimal Dependencies: Core uses only Python standard library
  • Fast Processing: ~300ms/MB on typical hardware
  • AI-Ready Output: Clean JSONL with coordinates for LLM consumption
  • Type Safety: Full type hints and mypy strict compliance
  • Memory Safe: Built-in memory limits and lazy processing
  • Rich CLI: Beautiful command-line interface with progress bars and statistics
  • Image Support: Automatic image extraction with base64 encoding and file export
  • Coordinate Ordering: Content is output in document reading order (top-to-bottom, left-to-right)

🏗️ Architecture

Document → Serializer → DocumentChunk → JSONL
  1. Loaders: Zero-dependency document parsers
  2. Processors: Coordinate extraction and text chunking
  3. Output: Standardized JSONL format
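
The flow above can be pictured as a chain of generators. The sketch below is not the library's implementation: `Chunk`, `load`, and `serialize` are hypothetical stand-ins, and the geometry is a placeholder that stacks paragraphs in equal-height bands on a single page.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, Iterator


@dataclass
class Chunk:
    """Hypothetical stand-in for a document chunk."""
    doc_id: str
    page: int
    x: float
    y: float
    w: float
    h: float
    type: str
    text: str

    def to_jsonl(self) -> str:
        # Compact separators keep each record on one short line.
        return json.dumps(asdict(self), separators=(",", ":"))


def load(paragraphs: Iterable[str], doc_id: str) -> Iterator[Chunk]:
    """Loader and processor stage: lazily yield one chunk per paragraph."""
    for i, text in enumerate(paragraphs):
        # Placeholder geometry, not real layout analysis.
        yield Chunk(doc_id, 1, 0.0, i * 0.1, 1.0, 0.1, "text", text)


def serialize(paragraphs: Iterable[str], doc_id: str = "demo") -> Iterator[str]:
    """Output stage: turn each chunk into one JSON line."""
    for chunk in load(paragraphs, doc_id):
        yield chunk.to_jsonl()


lines = list(serialize(["Hello", "World"]))
```

Because every stage is a generator, only one chunk needs to be held in memory at a time, which is one plausible way to get the lazy processing the design goals mention.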

📋 Supported Formats

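| Format | Status     | Library          | License | Features                                       |
|--------|------------|------------------|---------|------------------------------------------------|
| PDF    | Supported  | PyMuPDF          | AGPL    | Text, images, tables with accurate coordinates |
| DOCX   | Supported  | Standard Library | MIT     | Text, images with coordinate estimation        |
| XLSX   | 🚧 Planned | -                | -       | Coming soon                                    |
| Images | 🚧 Planned | -                | -       | Coming soon                                    |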

🔧 Configuration

import docpipe_mini as dp

# Memory limit
for chunk in dp.serialize("large.pdf", max_mem_mb=256):
    # Process with 256MB memory limit
    pass

# Custom document ID
for chunk in dp.serialize("paper.pdf", doc_id="my-paper"):
    # Use custom ID instead of UUID
    pass

# Process with image extraction
for chunk in dp.serialize("document.docx"):
    # Images are automatically extracted and base64 encoded
    if chunk.type == "image":
        print(f"Found image: {chunk.metadata['image_format']}, size: {chunk.metadata['image_size_bytes']} bytes")

CLI Options

# Memory management
python -m docpipe serialize-cmd large.pdf --max-mem 256

# Content filtering
python -m docpipe serialize-cmd document.pdf --types text,table  # Only text and tables
python -m docpipe serialize-cmd document.docx --types image      # Only images

# Image handling
python -m docpipe serialize-cmd document.docx --include-binary   # Include base64 image data
python -m docpipe serialize-cmd document.docx --export-images ./img  # Export images to files

# Output control
python -m docpipe serialize-cmd document.pdf --no-jsonl  # Plain text output
python -m docpipe serialize-cmd document.pdf --stats     # Show processing statistics

🧪 Testing

# Run tests
pytest

# Run benchmarks
pytest -m benchmark

# Type checking
mypy --strict

📈 Performance

  • Installation: 5 MB core, zero dependencies
  • Processing: ~300ms/MB for PDF documents
  • Memory: ~3x document size peak usage
  • Output: 1-2x input size (JSON overhead)
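
As a rough worked example, the figures above can be turned into a back-of-the-envelope estimate for a given input size. The helper `estimate` is hypothetical, and the defaults simply restate the approximate numbers listed here; real results will vary with hardware and document content.

```python
def estimate(size_mb, ms_per_mb=300.0, mem_factor=3.0, out_factor=(1.0, 2.0)):
    """Estimate processing time, peak memory, and output size for a PDF."""
    return {
        "seconds": size_mb * ms_per_mb / 1000.0,
        "peak_mem_mb": size_mb * mem_factor,
        "output_mb": (size_mb * out_factor[0], size_mb * out_factor[1]),
    }


# A 20 MB PDF: about 6 s, about 60 MB peak memory, 20-40 MB of JSONL.
est = estimate(20)
```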

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure mypy --strict passes
  5. Submit a pull request

📄 License

MIT License - see LICENSE file for details.

docpipe-mini - Fast, minimal document serialization for AI applications.
