docpipe
Protocol-oriented document serialization with coordinate-aware chunks for AI
docpipe converts documents into coordinate-aware chunks perfect for AI consumption. Built with a protocol-oriented mixin design for extensibility, zero-dependency core, and enterprise-grade logging.
🚀 Quick Start
```bash
# Install (5 MB core, zero dependencies)
pip install docpipe

# Install PDF support (+11 MB, BSD license)
pip install docpipe[pdf]

# Convert document to JSONL
python -m docpipe serialize document.pdf > document.jsonl
```
📖 Usage
Python API
```python
from docpipe import DocxSerializer, XlsxSerializer, PdfiumSerializer

# Word documents with advanced features
with DocxSerializer() as serializer:
    # Configure logging and serialization
    serializer.configure_logging(enable_performance_logging=True, log_level="DEBUG")
    serializer.configure_memory_limit(max_mem_mb=512)

    # Stream chunks for memory efficiency
    for chunk in serializer.iterate_chunks("report.docx"):
        print(f"Type: {chunk.type}, Position: ({chunk.x:.2f}, {chunk.y:.2f})")
        print(f"Content: {chunk.text[:100]}...")

# Excel files with header injection
excel = XlsxSerializer()
excel.configure_header_injection(header_row=1)  # Use first row as headers
for chunk in excel.iterate_chunks("data.xlsx"):
    headers = chunk.metadata.get('headers', [])
    print(f"Headers: {headers}")
    print(f"Data: {chunk.text}")

# PDF processing
pdf = PdfiumSerializer()
for chunk in pdf.iterate_chunks("document.pdf"):
    if chunk.type == "table":
        print(f"Table with {chunk.metadata.get('row_count', 0)} rows")
    elif chunk.type == "text":
        print(f"Text: {chunk.text[:100]}...")
```
Context Manager Pattern
```python
# All serializers support context managers for resource management
with XlsxSerializer() as serializer:
    serializer.configure_memory_limit(max_mem_mb=256)
    serializer.configure_logging(enable_performance_logging=True)

    # Process multiple files with consistent configuration
    for file_path in ["data1.xlsx", "data2.xlsx"]:
        for chunk in serializer.iterate_chunks(file_path):
            # Process chunk
            process_chunk(chunk)

# Automatic cleanup on context exit:
# - Logs performance statistics
# - Resets configuration to defaults
```
Memory-Efficient Iterator Pattern
```python
# For large documents, use the iterator pattern
serializer = DocxSerializer()
chunk_count = 0

for chunk in serializer.iterate_chunks("large_document.docx"):
    chunk_count += 1
    # Process each chunk immediately without loading all into memory
    process_chunk(chunk)

    # Optional: limit processing
    if chunk_count >= 1000:
        break

print(f"Processed {chunk_count} chunks efficiently")
```
Command Line
```bash
# Basic usage
python -m docpipe serialize document.pdf > output.jsonl

# Advanced options
python -m docpipe serialize document.docx \
    --memory-limit 512 \
    --enable-logging \
    --log-level DEBUG \
    --output formatted.jsonl

# Excel with header injection
python -m docpipe serialize data.xlsx \
    --header-row 1 \
    --rag-format

# List supported formats
python -m docpipe formats

# Show system information
python -m docpipe info
```
✨ Key Features
🔧 Protocol-Oriented Architecture
- Mixin Design: LoggingMixin, SerializerMixin for composable functionality
- Type Safety: Runtime checkable protocols with mypy strict compliance
- Extensibility: Easy to add new serializers via protocol implementation (see the sketch below)
- Zero Dependencies: Core functionality uses only Python standard library
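As a sketch of the extensibility noted above, a new format can be added by composing the same mixins and yielding DocumentChunk objects. The TxtSerializer below is hypothetical, and the assumption that DocumentChunk and the mixins are importable from the top-level docpipe package (and can be inherited directly) should be checked against the actual package:

```python
from typing import Iterator

# Assumed import path; mixin names follow the Architecture section,
# chunk fields follow the Output Format section of this README.
from docpipe import DocumentChunk, LoggingMixin, SerializerMixin

class TxtSerializer(SerializerMixin, LoggingMixin):
    """Hypothetical plain-text serializer composed from the library's mixins."""

    def iterate_chunks(self, path: str) -> Iterator[DocumentChunk]:
        with open(path, encoding="utf-8") as handle:
            for index, line in enumerate(handle):
                yield DocumentChunk(
                    doc_id=path,
                    page=1,
                    x=0.0, y=0.0, w=1.0, h=0.0,  # plain text carries no real layout
                    type="text",
                    text=line.rstrip("\n"),
                    tokens=None,
                    binary_data=None,
                    metadata={"line_index": index},
                )
```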
📝 Advanced Excel Processing
- Header Injection: Automatic or custom header support
```python
# Use first row as headers
serializer.configure_header_injection(header_row=1)

# Or provide custom headers
custom_headers = ["Name", "Age", "Department"]
serializer.configure_header_injection(custom_headers=custom_headers)
```
- Cell-Level Processing: Individual cell extraction with coordinates
- Table Structure: Maintain spreadsheet structure in output
- Embedded Images: Extract images from worksheets
- Chart Detection: Identify and describe Excel charts
- RAG Format: Optimized output for Retrieval-Augmented Generation (see the sketch below)
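A sketch combining header injection with the RAG-oriented output; configure_rag_format is the same call shown under Rich Configuration below, and the exact chunk text layout it produces is not specified here:

```python
from docpipe import XlsxSerializer

excel = XlsxSerializer()
excel.configure_header_injection(header_row=1)
excel.configure_rag_format(enable_backward_compatible=True)

with excel:
    for chunk in excel.iterate_chunks("data.xlsx"):
        # Injected headers travel in chunk.metadata; the row/cell text in chunk.text.
        print(chunk.metadata.get("headers", []), chunk.text)
```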
📄 Word Document Processing
- Correct Content Ordering: Images appear in document reading order (not at end)
- Mixed Content: Handle text and images in their natural sequence
- Coordinate Estimation: Smart positioning based on document structure
- Format Preservation: Detect bold, italic, and other formatting
- Image Extraction: Base64 encoding with format detection (see the sketch below)
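For example, image chunks can be filtered out of the stream and their base64 payload decoded (a sketch; the field names follow the DocumentChunk definition under Output Format):

```python
import base64

from docpipe import DocxSerializer

with DocxSerializer() as serializer:
    for chunk in serializer.iterate_chunks("report.docx"):
        if chunk.type == "image" and chunk.binary_data:
            raw = base64.b64decode(chunk.binary_data)
            # Coordinates are normalized to the page (0-1).
            print(f"Image at ({chunk.x:.2f}, {chunk.y:.2f}): {len(raw)} bytes")
```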
📊 PDF Processing
- Text Extraction: Accurate text with coordinates
- Table Recognition: Automatic table detection and extraction
- Image Support: Extract images with position data
- Memory Safe: Proper resource management for large files
🗂️ Enterprise Logging
- Structured Logging: Comprehensive logging with performance metrics
- Timing Information: Operation timing with context data
- Progress Tracking: Real-time processing progress
- Error Handling: Detailed error reporting with context
- Performance Analytics: Built-in performance monitoring
🎛️ Rich Configuration
```python
serializer = XlsxSerializer()

# Configure multiple aspects with method chaining
serializer.configure_memory_limit(max_mem_mb=512)\
    .configure_logging(enable_performance_logging=True, log_level="INFO")\
    .configure_header_injection(header_row=1)\
    .configure_rag_format(enable_backward_compatible=True)

# Use with context manager for automatic cleanup
with serializer:
    for chunk in serializer.iterate_chunks("data.xlsx"):
        print(chunk.to_dict())
```
📊 Output Format
Each chunk is a DocumentChunk object:
```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class DocumentChunk:
    doc_id: str                 # Document identifier
    page: int                   # Page number (1-based)
    x: float                    # Normalized X coordinate (0-1)
    y: float                    # Normalized Y coordinate (0-1)
    w: float                    # Normalized width (0-1)
    h: float                    # Normalized height (0-1)
    type: str                   # Content type: "text" | "table" | "image"
    text: Optional[str]         # Text content
    tokens: Optional[int]       # Estimated token count
    binary_data: Optional[str]  # Base64-encoded image data
    metadata: Dict[str, Any]    # Additional metadata
```
JSONL Output
```json
{
  "doc_id": "uuid",
  "page": 1,
  "x": 0.123,
  "y": 0.456,
  "w": 0.7,
  "h": 0.08,
  "type": "text",
  "text": "Sample content...",
  "tokens": 42,
  "binary_data": null,
  "metadata": {
    "source_file": "document.docx",
    "serializer": "DocxSerializer",
    "extraction_method": "docx_stdlib_ordered",
    "paragraph_index": 15,
    "has_formatting": true,
    "font_sizes": [12, 14],
    "processing_time": 0.045
  }
}
```
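To produce the same JSONL from the Python API rather than the CLI, one record per chunk can be written with to_dict() (a minimal sketch, assuming to_dict() returns the structure above):

```python
import json

from docpipe import PdfiumSerializer

with PdfiumSerializer() as serializer, open("document.jsonl", "w", encoding="utf-8") as out:
    for chunk in serializer.iterate_chunks("document.pdf"):
        out.write(json.dumps(chunk.to_dict(), ensure_ascii=False) + "\n")
```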
📦 Installation
Core Installation (5 MB)
```bash
pip install docpipe
```
Zero third-party dependencies for core functionality.
Optional Formats
```bash
# PDF support via pypdfium2 (BSD license, +11 MB)
pip install docpipe[pdf]

# Development tools
pip install docpipe[dev]
```
Development
```bash
git clone https://github.com/docpipe/docpipe
cd docpipe
uv sync --extra dev
pytest
mypy --strict
```
🏗️ Architecture
Protocol-Oriented Design
Protocols (Interfaces) ← Mixins (Implementations) ← Serializers (Concrete Classes)
- Protocols (_protocols.py):
  - DocumentSerializer: Core serialization interface
  - LoggingMixinProto: Structured logging interface
  - SerializerMixinProto: Configuration and context management
- Mixins (Default implementations):
  - LoggingMixin: Performance logging, timing, error tracking
  - SerializerMixin: Memory limits, context management, configuration
- Serializers (Concrete implementations):
  - DocxSerializer: Word document processing
  - XlsxSerializer: Excel spreadsheet processing
  - PdfiumSerializer: PDF document processing
Data Flow
```
Document File  →  Serializer  →  DocumentChunk(s)  →  JSONL/Objects
      ↓                ↓                  ↓
   File I/O       Protocol API     Structured Output
```
📋 Supported Formats
| Format | Status | Library | License | Features |
|---|---|---|---|---|
| PDF | ✅ | pypdfium2 | BSD | Text, images, tables with coordinates |
| DOCX | ✅ | Standard Library | MIT | Text, images, formatting, correct ordering |
| XLSX | ✅ | Standard Library | MIT | Cells, tables, headers, charts, images |
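A small dispatch helper can route files from the table above to the matching serializer by extension. The helper itself is not part of docpipe; only the three serializer classes are:

```python
from pathlib import Path

from docpipe import DocxSerializer, PdfiumSerializer, XlsxSerializer

# Hypothetical helper: map supported extensions to serializer classes.
SERIALIZERS = {
    ".docx": DocxSerializer,
    ".xlsx": XlsxSerializer,
    ".pdf": PdfiumSerializer,
}

def serializer_for(path: str):
    suffix = Path(path).suffix.lower()
    if suffix not in SERIALIZERS:
        raise ValueError(f"Unsupported format: {suffix}")
    return SERIALIZERS[suffix]()
```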
🔧 Advanced Configuration
Excel Header Injection
```python
# Method 1: Use first row as headers
excel = XlsxSerializer()
excel.configure_header_injection(header_row=1)

# Method 2: Custom headers
custom_headers = ["Product", "Price", "Quantity", "Category"]
excel.configure_header_injection(custom_headers=custom_headers)

# Method 3: Per-file configuration
with excel.configure_header_injection(header_row=1) as configured:
    for chunk in configured.iterate_chunks("sales_data.xlsx"):
        # Headers are automatically injected into metadata
        print(f"Headers: {chunk.metadata.get('headers', [])}")
        print(f"Data: {chunk.text}")
```
Memory Management
```python
# Set memory limits
serializer = DocxSerializer()
serializer.configure_memory_limit(max_mem_mb=256)

# Iterator pattern for large files
for chunk in serializer.iterate_chunks("large_file.docx"):
    # Process each chunk immediately; memory usage stays low
    process_chunk(chunk)
```
Logging Configuration
```python
# Enable detailed logging
serializer = XlsxSerializer()
serializer.configure_logging(
    enable_performance_logging=True,
    log_level="DEBUG",
)

# Logs include:
# - Operation timing
# - Memory usage
# - Processing progress
# - Error context
# - Performance metrics
```
Context Manager Usage
```python
# Automatic resource management
with XlsxSerializer() as serializer:
    serializer.configure_memory_limit(max_mem_mb=512)
    serializer.configure_logging(enable_performance_logging=True)

    # Process multiple files
    for file_path in ["file1.xlsx", "file2.xlsx"]:
        for chunk in serializer.iterate_chunks(file_path):
            process_chunk(chunk)

# Automatic cleanup on exit:
# - Reset configuration
# - Close file handles
# - Log performance summary
# - Clean up resources
```
🧪 Testing
```bash
# Run all tests
pytest

# Run specific serializer tests
pytest tests/test_docx.py
pytest tests/test_xlsx.py
pytest tests/test_pdf.py

# Type checking
mypy --strict

# Performance benchmarks
pytest -m benchmark
```
📈 Performance
- Installation: 5 MB core, zero dependencies
- Processing: ~300ms/MB for typical documents (see the measurement sketch below)
- Memory: Configurable limits, iterator pattern for large files
- Output: Clean, coordinate-aware chunks optimized for AI
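Throughput varies with document content; a rough way to measure it on your own files, using only the public iterate_chunks API (a sketch):

```python
import os
import time

from docpipe import PdfiumSerializer

path = "document.pdf"
size_mb = os.path.getsize(path) / 1_000_000

start = time.perf_counter()
with PdfiumSerializer() as serializer:
    chunk_count = sum(1 for _ in serializer.iterate_chunks(path))
elapsed = time.perf_counter() - start

print(f"{chunk_count} chunks in {elapsed:.2f} s ({elapsed / size_mb * 1000:.0f} ms/MB)")
```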
🎯 Design Goals
- Protocol-First: Composable architecture via protocols and mixins
- Zero Dependencies: Core functionality uses only Python standard library
- Memory Safe: Built-in memory limits and iterator pattern
- Enterprise Ready: Comprehensive logging and error handling
- AI-Optimized: Coordinate-aware output for LLM consumption
- Correct Ordering: Content appears in natural reading order
- Type Safe: Full type hints and mypy strict compliance
🤝 Contributing
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure mypy --strict passes
- Submit a pull request
📄 License
MIT License - see LICENSE file for details.
🔗 Links
docpipe - Protocol-oriented document serialization for AI applications.