Minimal document-to-jsonl serializer with coordinates for AI

docpipe

Protocol-oriented document serialization with coordinate-aware chunks for AI

docpipe converts documents into coordinate-aware chunks suited to AI consumption. It is built on a protocol-oriented mixin design for extensibility, with a zero-dependency core and structured logging.

🚀 Quick Start

# Install (5 MB core, zero dependencies)
pip install docpipe

# Install PDF support (+11 MB, BSD license)
pip install docpipe[pdf]

# Convert document to JSONL
python -m docpipe serialize document.pdf > document.jsonl

📖 Usage

Python API

from docpipe import DocxSerializer, XlsxSerializer, PdfiumSerializer

# Word documents with advanced features
with DocxSerializer() as serializer:
    # Configure logging and serialization
    serializer.configure_logging(enable_performance_logging=True, log_level="DEBUG")
    serializer.configure_memory_limit(max_mem_mb=512)

    # Stream chunks for memory efficiency
    for chunk in serializer.iterate_chunks("report.docx"):
        print(f"Type: {chunk.type}, Position: ({chunk.x:.2f}, {chunk.y:.2f})")
        # chunk.text is Optional[str]; fall back to "" so slicing never fails on None
        print(f"Content: {(chunk.text or '')[:100]}...")

# Excel files with header injection
excel = XlsxSerializer()
excel.configure_header_injection(header_row=1)  # Use first row as headers
for chunk in excel.iterate_chunks("data.xlsx"):
    headers = chunk.metadata.get('headers', [])
    print(f"Headers: {headers}")
    print(f"Data: {chunk.text}")

# PDF processing
pdf = PdfiumSerializer()
for chunk in pdf.iterate_chunks("document.pdf"):
    if chunk.type == "table":
        print(f"Table with {chunk.metadata.get('row_count', 0)} rows")
    elif chunk.type == "text":
        print(f"Text: {chunk.text[:100]}...")

Context Manager Pattern

# All serializers support context managers for resource management
with XlsxSerializer() as serializer:
    serializer.configure_memory_limit(max_mem_mb=256)
    serializer.configure_logging(enable_performance_logging=True)

    # Process multiple files with consistent configuration
    for file_path in ["data1.xlsx", "data2.xlsx"]:
        for chunk in serializer.iterate_chunks(file_path):
            # process_chunk() is a placeholder for your own handler
            process_chunk(chunk)

# Automatic cleanup on context exit
# Logs performance statistics
# Resets configuration to defaults

Memory-Efficient Iterator Pattern

# For large documents, use iterator pattern
serializer = DocxSerializer()
chunk_count = 0
for chunk in serializer.iterate_chunks("large_document.docx"):
    chunk_count += 1
    # Process chunk immediately without loading all into memory
    process_chunk(chunk)

    # Optional: limit processing
    if chunk_count >= 1000:
        break

print(f"Processed {chunk_count} chunks efficiently")
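
The early-exit loop above can also be written with `itertools.islice`, which stops pulling from an iterator after a fixed number of items. The sketch below uses a stand-in generator, `fake_chunks` (hypothetical, not part of docpipe), so that it runs without the library. With docpipe installed, `serializer.iterate_chunks(path)` would take its place, assuming it returns a lazy iterator as this section describes.

```python
from itertools import islice
from typing import Dict, Iterator


def fake_chunks(total: int) -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for serializer.iterate_chunks(); yields lazily."""
    for i in range(total):
        yield {"type": "text", "text": f"paragraph {i}"}


def take_first(chunks: Iterator[Dict[str, str]], limit: int) -> int:
    """Handle at most `limit` chunks and return how many were handled."""
    handled = 0
    for _chunk in islice(chunks, limit):
        handled += 1  # real code would process the chunk here
    return handled


processed = take_first(fake_chunks(5000), limit=1000)
print(f"Processed {processed} chunks")
```

Because `islice` never requests item 1001, the remaining chunks are never produced, which is the point of the iterator pattern.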

Command Line

# Basic usage
python -m docpipe serialize document.pdf > output.jsonl

# Advanced options
python -m docpipe serialize document.docx \
    --memory-limit 512 \
    --enable-logging \
    --log-level DEBUG \
    --output formatted.jsonl

# Excel with header injection
python -m docpipe serialize data.xlsx \
    --header-row 1 \
    --rag-format

# List supported formats
python -m docpipe formats

# Show system information
python -m docpipe info

✨ Key Features

🔧 Protocol-Oriented Architecture

  • Mixin Design: LoggingMixin, SerializerMixin for composable functionality
  • Type Safety: Runtime checkable protocols with mypy strict compliance
  • Extensibility: Easy to add new serializers via protocol implementation
  • Zero Dependencies: Core functionality uses only Python standard library

📝 Advanced Excel Processing

  • Header Injection: Automatic or custom header support
    # Use first row as headers
    serializer.configure_header_injection(header_row=1)
    
    # Or provide custom headers
    custom_headers = ["Name", "Age", "Department"]
    serializer.configure_header_injection(custom_headers=custom_headers)
    
  • Cell-Level Processing: Individual cell extraction with coordinates
  • Table Structure: Maintain spreadsheet structure in output
  • Embedded Images: Extract images from worksheets
  • Chart Detection: Identify and describe Excel charts
  • RAG Format: Optimized output for Retrieval-Augmented Generation
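
To make the idea of header injection concrete, the sketch below pairs each column header with the cell in the same column, so every row reads on its own without the header row. It illustrates the concept only. `inject_headers` is a hypothetical helper and not docpipe's implementation, and the output layout is an assumption.

```python
from typing import List, Sequence


def inject_headers(headers: Sequence[str], row: Sequence[object]) -> str:
    """Pair each header with the cell value in the same column."""
    pairs = [f"{header}: {value}" for header, value in zip(headers, row)]
    return " | ".join(pairs)


headers = ["Name", "Age", "Department"]  # e.g. taken from row 1 of the sheet
data_rows: List[List[object]] = [
    ["Alice", 30, "Engineering"],
    ["Bob", 41, "Finance"],
]

lines = [inject_headers(headers, row) for row in data_rows]
print(lines[0])
```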

📄 Word Document Processing

  • Correct Content Ordering: Images appear in document reading order (not at end)
  • Mixed Content: Handle text and images in their natural sequence
  • Coordinate Estimation: Smart positioning based on document structure
  • Format Preservation: Detect bold, italic, and other formatting
  • Image Extraction: Base64 encoding with format detection
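
On the consuming side, a Base64-encoded image can be decoded with the standard library and its format guessed from the leading bytes. This is a minimal sketch under stated assumptions: `decode_image` and `sniff_format` are hypothetical helpers, and signature sniffing is one common approach, not necessarily how docpipe detects formats.

```python
import base64
from typing import Optional

# Leading "magic" bytes of a few common image formats.
_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def decode_image(binary_data: str) -> bytes:
    """Turn a Base64 string (such as a chunk's binary_data) back into bytes."""
    return base64.b64decode(binary_data)


def sniff_format(raw: bytes) -> Optional[str]:
    """Return a format name if the bytes start with a known signature."""
    for signature, name in _SIGNATURES.items():
        if raw.startswith(signature):
            return name
    return None


# Build a fake payload that only carries the PNG signature.
encoded = base64.b64encode(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8).decode("ascii")
raw = decode_image(encoded)
print(sniff_format(raw))
```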

📊 PDF Processing

  • Text Extraction: Accurate text with coordinates
  • Table Recognition: Automatic table detection and extraction
  • Image Support: Extract images with position data
  • Memory Safe: Proper resource management for large files

🗂️ Enterprise Logging

  • Structured Logging: Comprehensive logging with performance metrics
  • Timing Information: Operation timing with context data
  • Progress Tracking: Real-time processing progress
  • Error Handling: Detailed error reporting with context
  • Performance Analytics: Built-in performance monitoring

🎛️ Rich Configuration

serializer = XlsxSerializer()

# Configure multiple aspects with method chaining
(
    serializer.configure_memory_limit(max_mem_mb=512)
    .configure_logging(enable_performance_logging=True, log_level="INFO")
    .configure_header_injection(header_row=1)
    .configure_rag_format(enable_backward_compatible=True)
)

# Use with context manager for automatic cleanup
with serializer:
    for chunk in serializer.iterate_chunks("data.xlsx"):
        print(chunk.to_dict())
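
Method chaining like this only works when every `configure_*` call returns the serializer itself. The sketch below shows that pattern together with a context manager that resets configuration on exit. `ToySerializer` and its defaults are hypothetical and written for illustration; this is not docpipe's code.

```python
from typing import Any, Dict


class ToySerializer:
    """Hypothetical class showing chainable configuration plus cleanup."""

    _DEFAULTS: Dict[str, Any] = {"max_mem_mb": None, "log_level": "WARNING"}

    def __init__(self) -> None:
        self.config: Dict[str, Any] = dict(self._DEFAULTS)

    def configure_memory_limit(self, max_mem_mb: int) -> "ToySerializer":
        self.config["max_mem_mb"] = max_mem_mb
        return self  # returning self is what makes chaining possible

    def configure_logging(self, log_level: str = "INFO") -> "ToySerializer":
        self.config["log_level"] = log_level
        return self

    def __enter__(self) -> "ToySerializer":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Restore defaults on exit; returning None lets exceptions propagate.
        self.config = dict(self._DEFAULTS)


serializer = ToySerializer().configure_memory_limit(512).configure_logging("DEBUG")
with serializer as s:
    inside = dict(s.config)       # configuration while the block is active
after = dict(serializer.config)   # configuration after cleanup
print(inside, after)
```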

📊 Output Format

Each chunk is a DocumentChunk object:

from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class DocumentChunk:
    doc_id: str                    # Document identifier
    page: int                      # Page number (1-based)
    x: float                       # Normalized X coordinate (0-1)
    y: float                       # Normalized Y coordinate (0-1)
    w: float                       # Normalized width (0-1)
    h: float                       # Normalized height (0-1)
    type: str                      # Content type: "text" | "table" | "image"
    text: Optional[str]            # Text content
    tokens: Optional[int]          # Estimated token count
    binary_data: Optional[str]     # Base64 encoded image data
    metadata: Dict[str, Any]       # Additional metadata
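
Since coordinates are normalized to the 0-1 range, a consumer has to scale them by the page size to get pixel positions. The sketch below shows that arithmetic. It assumes `x` and `y` mark the top-left corner with the origin at the top-left of the page, which this README does not state. `Box` and `to_pixels` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Box:
    """Hypothetical stand-in for the x, y, w, h fields of a chunk."""
    x: float
    y: float
    w: float
    h: float


def to_pixels(box: Box, page_width: int, page_height: int) -> Tuple[int, int, int, int]:
    """Scale a normalized box to (left, top, right, bottom) in pixels."""
    left = round(box.x * page_width)
    top = round(box.y * page_height)
    right = round((box.x + box.w) * page_width)
    bottom = round((box.y + box.h) * page_height)
    return left, top, right, bottom


print(to_pixels(Box(x=0.25, y=0.5, w=0.5, h=0.125), 800, 1000))
# → (200, 500, 600, 625)
```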

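JSONL Output

In the JSONL file each chunk occupies one line as a single JSON object. The record below is pretty-printed for readability.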

{
  "doc_id": "uuid",
  "page": 1,
  "x": 0.123,
  "y": 0.456,
  "w": 0.7,
  "h": 0.08,
  "type": "text",
  "text": "Sample content...",
  "tokens": 42,
  "binary_data": null,
  "metadata": {
    "source_file": "document.docx",
    "serializer": "DocxSerializer",
    "extraction_method": "docx_stdlib_ordered",
    "paragraph_index": 15,
    "has_formatting": true,
    "font_sizes": [12, 14],
    "processing_time": 0.045
  }
}
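
A JSONL file can be read back with the standard library alone by parsing one line at a time. The sketch below does that on an in-memory sample and sums the token estimates of text chunks. The sample records are made up for illustration and carry only a subset of the fields shown above. `read_jsonl` is a hypothetical helper.

```python
import io
import json
from typing import Any, Dict, Iterator, TextIO


def read_jsonl(stream: TextIO) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines
            yield json.loads(line)


# Made-up sample data standing in for a file such as document.jsonl.
sample = io.StringIO(
    '{"page": 1, "type": "text", "text": "Intro", "tokens": 3}\n'
    '{"page": 1, "type": "image", "text": null, "tokens": null}\n'
    '{"page": 2, "type": "text", "text": "Body", "tokens": 5}\n'
)

records = list(read_jsonl(sample))
text_tokens = sum(r["tokens"] for r in records if r["type"] == "text")
print(len(records), text_tokens)
```

For a real file, `open("document.jsonl", encoding="utf-8")` can be passed in place of the `StringIO` object.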

📦 Installation

Core Installation (5 MB)

pip install docpipe

Zero third-party dependencies for core functionality.

Optional Formats

# PDF support with pypdfium2 (BSD, recommended, +11 MB)
pip install docpipe[pdf]

# Development tools
pip install docpipe[dev]

Development

git clone https://github.com/docpipe/docpipe
cd docpipe
uv sync --extra dev
pytest
mypy --strict

🏗️ Architecture

Protocol-Oriented Design

Protocols (Interfaces) ← Mixins (Implementations) ← Serializers (Concrete Classes)

  1. Protocols (_protocols.py):
    • DocumentSerializer: Core serialization interface
    • LoggingMixinProto: Structured logging interface
    • SerializerMixinProto: Configuration and context management
  2. Mixins (default implementations):
    • LoggingMixin: Performance logging, timing, error tracking
    • SerializerMixin: Memory limits, context management, configuration
  3. Serializers (concrete implementations):
    • DocxSerializer: Word document processing
    • XlsxSerializer: Excel spreadsheet processing
    • PdfiumSerializer: PDF document processing
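
The three layers above can be sketched with `typing.Protocol`. The example is a simplified illustration, not the contents of `_protocols.py`: the method signature of `DocumentSerializer`, the body of `LoggingMixin`, and the `TxtSerializer` class are all assumptions.

```python
from typing import Dict, Iterator, Protocol, runtime_checkable


@runtime_checkable
class DocumentSerializer(Protocol):
    """Interface layer: anything with iterate_chunks() qualifies."""

    def iterate_chunks(self, path: str) -> Iterator[Dict[str, str]]: ...


class LoggingMixin:
    """Mixin layer: reusable behavior that is independent of the format."""

    def log(self, message: str) -> str:
        return f"[{type(self).__name__}] {message}"


class TxtSerializer(LoggingMixin):
    """Concrete layer: a hypothetical serializer that satisfies the protocol."""

    def iterate_chunks(self, path: str) -> Iterator[Dict[str, str]]:
        yield {"type": "text", "text": f"contents of {path}"}


serializer = TxtSerializer()
print(isinstance(serializer, DocumentSerializer))  # structural check at runtime
print(serializer.log("ready"))
```

Note that `TxtSerializer` never inherits from `DocumentSerializer`. With `runtime_checkable`, `isinstance` only checks that the required method exists, which is what lets new serializers plug in without touching a base class.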

Data Flow

Document File → Serializer → DocumentChunk(s) → JSONL/Objects
     ↓              ↓              ↓
   File I/O    Protocol API   Structured Output

📋 Supported Formats

Format | Library          | License | Features
PDF    | pypdfium2        | BSD     | Text, images, tables with coordinates
DOCX   | Standard library | MIT     | Text, images, formatting, correct ordering
XLSX   | Standard library | MIT     | Cells, tables, headers, charts, images

🔧 Advanced Configuration

Excel Header Injection

# Method 1: Use first row as headers
excel = XlsxSerializer()
excel.configure_header_injection(header_row=1)

# Method 2: Custom headers
custom_headers = ["Product", "Price", "Quantity", "Category"]
excel.configure_header_injection(custom_headers=custom_headers)

# Method 3: Per-file configuration
with excel.configure_header_injection(header_row=1) as configured:
    for chunk in configured.iterate_chunks("sales_data.xlsx"):
        # Headers are automatically injected into metadata
        print(f"Headers: {chunk.metadata.get('headers', [])}")
        print(f"Data: {chunk.text}")

Memory Management

# Set memory limits
serializer = DocxSerializer()
serializer.configure_memory_limit(max_mem_mb=256)

# Iterator pattern for large files
for chunk in serializer.iterate_chunks("large_file.docx"):
    # Process chunk immediately
    process_chunk(chunk)
    # Memory usage stays low

Logging Configuration

# Enable detailed logging
serializer = XlsxSerializer()
serializer.configure_logging(
    enable_performance_logging=True,
    log_level="DEBUG"
)

# Logs include:
# - Operation timing
# - Memory usage
# - Processing progress
# - Error context
# - Performance metrics
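
For readers who want to see what operation timing with context data can look like, the sketch below builds it from the standard `logging` and `time` modules. It is an illustration under stated assumptions: the `timed` helper and the logger name `docpipe.example` are hypothetical, and docpipe's own log format may differ.

```python
import logging
import time
from contextlib import contextmanager
from typing import Dict, Iterator

logger = logging.getLogger("docpipe.example")  # hypothetical logger name


@contextmanager
def timed(operation: str, **context: object) -> Iterator[Dict[str, float]]:
    """Time the enclosed block and log the duration with its context."""
    result: Dict[str, float] = {}
    start = time.perf_counter()
    try:
        yield result
    finally:
        result["elapsed_s"] = time.perf_counter() - start
        logger.debug(
            "%s finished in %.4fs %s", operation, result["elapsed_s"], context
        )


logging.basicConfig(level=logging.DEBUG)
with timed("serialize", file="data.xlsx") as stats:
    total = sum(range(10_000))  # placeholder for real work

print(f"elapsed: {stats['elapsed_s']:.4f}s")
```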

Context Manager Usage

# Automatic resource management
with XlsxSerializer() as serializer:
    serializer.configure_memory_limit(max_mem_mb=512)
    serializer.configure_logging(enable_performance_logging=True)

    # Process multiple files
    for file_path in ["file1.xlsx", "file2.xlsx"]:
        for chunk in serializer.iterate_chunks(file_path):
            process_chunk(chunk)

# Automatic cleanup on exit:
# - Reset configuration
# - Close file handles
# - Log performance summary
# - Clean up resources

🧪 Testing

# Run all tests
pytest

# Run specific serializer tests
pytest tests/test_docx.py
pytest tests/test_xlsx.py
pytest tests/test_pdf.py

# Type checking
mypy --strict

# Performance benchmarks
pytest -m benchmark

📈 Performance

  • Installation: 5 MB core, zero dependencies
  • Processing: ~300ms/MB for typical documents
  • Memory: Configurable limits, iterator pattern for large files
  • Output: Clean, coordinate-aware chunks optimized for AI

🎯 Design Goals

  • Protocol-First: Composable architecture via protocols and mixins
  • Zero Dependencies: Core functionality uses only Python standard library
  • Memory Safe: Built-in memory limits and iterator pattern
  • Enterprise Ready: Comprehensive logging and error handling
  • AI-Optimized: Coordinate-aware output for LLM consumption
  • Correct Ordering: Content appears in natural reading order
  • Type Safe: Full type hints and mypy strict compliance

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure mypy --strict passes
  5. Submit a pull request

📄 License

MIT License - see LICENSE file for details.

docpipe - Protocol-oriented document serialization for AI applications.
