Recovery orchestration framework for targeted extraction failure recovery
document-recovery
A recovery orchestration framework for targeted extraction failure recovery. This library provides a production-ready architecture for routing deficiencies to specialized recovery strategies, merging recovered data, and tracking costs.
Overview
document-recovery implements a recovery orchestration framework that receives deficiencies from document-confidence, routes them to specialized recovery strategies (OCR, Table, Spatial, Cross-reference), merges recovered data, resolves conflicts, and tracks costs.
Architecture
The library follows a modular architecture with:
- Strategy Pattern - Specialized recovery strategies for different failure types
- Router - Deficiency-to-strategy mapping with budget enforcement
- Merger - Recursive merging of primary and recovery output
- Reconciler - Deterministic conflict resolution
- Telemetry - Cost tracking and performance metrics
- Protocol-based Interfaces - Type-safe contracts for extensibility
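To illustrate what a protocol-based contract can look like, here is a minimal sketch using `typing.Protocol`. The names `RecoveryStrategyProtocol` and `EchoStrategy` are hypothetical and are not part of the library; only the `recover(pages, deficiency)` call shape is taken from the examples in this README.

```python
import asyncio
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class RecoveryStrategyProtocol(Protocol):
    """Hypothetical contract: any object with an async recover() qualifies."""

    async def recover(
        self, pages: list[Any], deficiency: dict[str, Any]
    ) -> dict[str, Any]: ...


class EchoStrategy:
    """Toy strategy that satisfies the protocol without inheriting from it."""

    async def recover(self, pages, deficiency):
        return {"strategy": "echo", "type": deficiency["type"], "pages": len(pages)}


strategy = EchoStrategy()
# runtime_checkable only verifies that the method exists, not its signature.
is_strategy = isinstance(strategy, RecoveryStrategyProtocol)
result = asyncio.run(strategy.recover(["page-1", "page-2"], {"type": "ocr_gap"}))
print(is_strategy, result)
```

Structural typing of this kind lets a custom strategy plug in without importing a base class, while a static type checker can still verify the call shape.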
Installation
pip install document-recovery
Optional Dependencies
# For development
pip install "document-recovery[dev]"
Quick Start
from document_recovery import (
    RecoveryConfig,
    RecoveryRouter,
    RecoveryPipeline,
)
from document_recovery.strategies import OcrRecoveryStrategy, TableRecoveryStrategy

# Configure recovery
config = RecoveryConfig(
    max_recovery_attempts=2,
    budget_limit_usd=5.0,
    enable_ocr_recovery=True,
)

# Initialize router
router = RecoveryRouter(config)

# Initialize strategies.
# azure_client is your IAzureDocumentIntelligenceClient implementation.
strategies = {
    "ocr_recovery": OcrRecoveryStrategy(config, azure_client),
    "table_recovery": TableRecoveryStrategy(config, azure_client),
}

# Initialize pipeline
pipeline = RecoveryPipeline(config, router, strategies)

# Execute recovery (await requires an async function or event loop).
# deficiencies come from document-confidence; extraction and pages come
# from the primary extraction step.
result = await pipeline.execute(
    primary_result=extraction,
    deficiencies=deficiencies,
    pages=pages,
)
Configuration
Recovery Configuration
from document_recovery import RecoveryConfig

config = RecoveryConfig(
    max_recovery_attempts=2,
    max_vision_calls_per_document=10,
    budget_limit_usd=5.0,
    enable_ocr_recovery=True,
    enable_table_recovery=True,
    enable_spatial_recovery=True,
    enable_crossref_recovery=True,
    vision_model="gpt-4.1",
    vision_timeout_seconds=60.0,
    parallel_recovery_limit=5,
)
Recovery Strategies
OCR Recovery Strategy
Recovers text using Azure Document Intelligence.
from document_recovery.strategies import OcrRecoveryStrategy
strategy = OcrRecoveryStrategy(config, azure_client)
result = await strategy.recover(pages, deficiency)
Table Recovery Strategy
Recovers tables using Azure Document Intelligence.
from document_recovery.strategies import TableRecoveryStrategy
strategy = TableRecoveryStrategy(config, azure_client)
result = await strategy.recover(pages, deficiency)
Spatial Recovery Strategy
Recovers spatial/layout information using Vision LLMs.
from document_recovery.strategies import SpatialRecoveryStrategy

with open("prompts/spatial_recovery.txt") as f:
    prompt = f.read()

strategy = SpatialRecoveryStrategy(config, vision_model, prompt)
result = await strategy.recover(pages, deficiency)
Cross-Reference Recovery Strategy
Recovers cross-references using document-agents.
from document_recovery.strategies import CrossrefRecoveryStrategy
strategy = CrossrefRecoveryStrategy(config, crossref_agent)
result = await strategy.recover(pages, deficiency)
Deficiency Routing
The router maps deficiencies to strategies:
STRATEGY_MAP = {
    "ocr_gap": "ocr_recovery",
    "table_missing": "table_recovery",
    "spatial_failure": "spatial_recovery",
    "crossref_broken": "crossref_recovery",
}
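As a rough illustration of how a deficiency-to-strategy lookup can be combined with budget enforcement, the sketch below reuses the mapping above. `BudgetedRouter`, its `route` method, and the per-call cost estimate are assumptions made for this example and do not describe the actual `RecoveryRouter` API.

```python
from dataclasses import dataclass
from typing import Optional

STRATEGY_MAP = {
    "ocr_gap": "ocr_recovery",
    "table_missing": "table_recovery",
    "spatial_failure": "spatial_recovery",
    "crossref_broken": "crossref_recovery",
}


@dataclass
class BudgetedRouter:
    """Hypothetical stand-in for a router; not the library's class."""

    budget_limit_usd: float
    spent_usd: float = 0.0

    def route(self, deficiency_type: str, estimated_cost_usd: float) -> Optional[str]:
        """Return a strategy name, or None if unmapped or over budget."""
        strategy = STRATEGY_MAP.get(deficiency_type)
        if strategy is None:
            return None  # no strategy handles this deficiency type
        if self.spent_usd + estimated_cost_usd > self.budget_limit_usd:
            return None  # routing this call would exceed the budget
        self.spent_usd += estimated_cost_usd
        return strategy


router = BudgetedRouter(budget_limit_usd=1.0)
first = router.route("ocr_gap", 0.75)        # fits within the budget
second = router.route("table_missing", 0.5)  # would push spend to 1.25
third = router.route("unknown_type", 0.1)    # not in STRATEGY_MAP
print(first, second, third, router.spent_usd)
```

Reserving the estimated cost at routing time keeps the check simple; a real router might instead reconcile against actual costs reported after each recovery.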
Result Merging
The merger combines primary extraction with recovery results:
from document_recovery import ResultMerger
merger = ResultMerger()
result = merger.merge(primary_result, recovery_results)
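The recursive merge idea can be sketched on plain dictionaries. This is not the `ResultMerger` implementation: the function name `merge_recursive` and the policy shown here (recovery only fills gaps, and populated primary values are kept for a later conflict-resolution step) are assumptions made for illustration.

```python
from typing import Any


def merge_recursive(primary: Any, recovery: Any) -> Any:
    """Merge recovery output into primary output, descending into dicts."""
    if isinstance(primary, dict) and isinstance(recovery, dict):
        merged = dict(primary)  # copy so the primary input is not mutated
        for key, value in recovery.items():
            if key in primary:
                merged[key] = merge_recursive(primary[key], value)
            else:
                merged[key] = value  # field only present in recovery output
        return merged
    # Leaf values: recovery fills the gap only when primary is missing or empty.
    if primary in (None, "", [], {}):
        return recovery
    return primary


primary = {"name": "Cola", "upc": None, "table": {"rows": [], "caption": "Prices"}}
recovery = {
    "upc": "012345",
    "table": {"rows": [[1, 2]], "caption": "Other"},
    "notes": "recovered",
}
merged = merge_recursive(primary, recovery)
print(merged)
```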
Conflict Resolution
The reconciler resolves conflicts using predefined rules:
RESOLUTION_RULES = {
    "text": "prefer_longer",
    "upc": "prefer_primary",
    "name": "prefer_primary",
    "default": "prefer_recovery",
}
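One plausible reading of these rules is a per-field lookup that falls back to the `default` entry. The `resolve` function below is a hypothetical sketch of that reading. The tie-breaking behavior for `prefer_longer` is an assumption, and the reconciler's real logic may differ.

```python
RESOLUTION_RULES = {
    "text": "prefer_longer",
    "upc": "prefer_primary",
    "name": "prefer_primary",
    "default": "prefer_recovery",
}


def resolve(field, primary_value, recovery_value):
    """Pick one value deterministically based on the rule for this field."""
    rule = RESOLUTION_RULES.get(field, RESOLUTION_RULES["default"])
    if rule == "prefer_primary":
        return primary_value
    if rule == "prefer_longer":
        # Assumption: ties go to the primary value.
        if len(recovery_value) > len(primary_value):
            return recovery_value
        return primary_value
    return recovery_value  # "prefer_recovery"


text_winner = resolve("text", "abc", "abcdef")
upc_winner = resolve("upc", "111", "222")
price_winner = resolve("price", 1.0, 2.0)  # no rule for "price", uses default
print(text_winner, upc_winner, price_winner)
```

Because the outcome depends only on the field name and the two values, the same inputs always resolve the same way, which is what makes the resolution deterministic.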
Telemetry
Track recovery metrics and costs:
from document_recovery import RecoveryTelemetry

telemetry = RecoveryTelemetry()
telemetry.record_recovery(
    strategy_name="ocr_recovery",
    success=True,
    cost_usd=0.05,
    latency_ms=100,
    pages_count=2,
)
summary = telemetry.get_summary()
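The shape of the summary is not documented here. As a hedged sketch of what aggregating such records could involve, the stand-in class below mirrors the `record_recovery` arguments shown above, but `TelemetrySketch` and its summary keys are invented for this example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TelemetrySketch:
    """Hypothetical stand-in; the summary keys below are assumptions."""

    records: list = field(default_factory=list)

    def record_recovery(self, strategy_name, success, cost_usd, latency_ms, pages_count):
        self.records.append(
            {
                "strategy_name": strategy_name,
                "success": success,
                "cost_usd": cost_usd,
                "latency_ms": latency_ms,
                "pages_count": pages_count,
            }
        )

    def get_summary(self):
        attempts = len(self.records)
        successes = sum(1 for r in self.records if r["success"])
        cost_by_strategy = defaultdict(float)
        for r in self.records:
            cost_by_strategy[r["strategy_name"]] += r["cost_usd"]
        return {
            "attempts": attempts,
            "success_rate": successes / attempts if attempts else 0.0,
            "total_cost_usd": sum(r["cost_usd"] for r in self.records),
            "cost_by_strategy": dict(cost_by_strategy),
        }


sketch = TelemetrySketch()
sketch.record_recovery("ocr_recovery", True, 0.25, 100, 2)
sketch.record_recovery("table_recovery", False, 0.5, 300, 1)
sketch_summary = sketch.get_summary()
print(sketch_summary)
```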
Custom Recovery Strategies
Create custom strategies by extending BaseRecoveryStrategy:
from document_recovery.strategies import BaseRecoveryStrategy

# RecoveryResult must also be imported; its module path is not shown here.
class CustomRecoveryStrategy(BaseRecoveryStrategy):
    async def _execute(self, pages, deficiency):
        # Custom recovery logic
        return RecoveryResult(...)
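Since callers invoke `recover` while subclasses override `_execute`, the base class presumably wraps the hook with shared behavior. The sketch below shows one way such a template-method arrangement could work. `BaseStrategySketch` and the dictionary it returns are assumptions for illustration and do not reflect the real `BaseRecoveryStrategy` internals.

```python
import asyncio
import time
from abc import ABC, abstractmethod


class BaseStrategySketch(ABC):
    """Hypothetical base class: recover() wraps the subclass hook."""

    async def recover(self, pages, deficiency):
        start = time.perf_counter()
        try:
            data = await self._execute(pages, deficiency)
            success = True
        except Exception:
            # Assumption: failures are reported to the caller, not raised.
            data, success = None, False
        latency_ms = (time.perf_counter() - start) * 1000
        return {"success": success, "data": data, "latency_ms": latency_ms}

    @abstractmethod
    async def _execute(self, pages, deficiency):
        """Subclasses put their recovery logic here."""


class UppercaseStrategy(BaseStrategySketch):
    async def _execute(self, pages, deficiency):
        return [page.upper() for page in pages]


class FailingStrategy(BaseStrategySketch):
    async def _execute(self, pages, deficiency):
        raise RuntimeError("simulated provider failure")


ok = asyncio.run(UppercaseStrategy().recover(["a", "b"], {"type": "ocr_gap"}))
failed = asyncio.run(FailingStrategy().recover(["a"], {"type": "ocr_gap"}))
print(ok["success"], ok["data"], failed["success"])
```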
Development
Running Tests
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run with coverage
pytest --cov=document_recovery
Code Style
# Format code
black document_recovery
# Lint code
ruff check document_recovery
# Type check
mypy document_recovery
Design Principles
- Async-first - All operations are asynchronous
- Provider Agnostic - Interfaces for external dependencies
- Extensible - Plugin architecture for custom strategies
- Type-safe - Full type hints with Pydantic validation
- Production-ready - Enterprise-scale performance
Dependencies
- pydantic>=2.0 - Data validation
- typing_extensions>=4.0 - Type extensions
External Dependencies
The library depends on external interfaces that must be implemented:
- IAzureDocumentIntelligenceClient - Azure Document Intelligence
- IVisionModel - Vision LLM interface
- IExtractionAgent - document-agents interface
- IDocumentParser - Document parser interface
Performance
The library is designed for:
- Parallel recovery execution
- Budget enforcement
- Cost tracking
- Conflict resolution
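Parallel execution with a cap, such as the `parallel_recovery_limit` setting, is commonly built on `asyncio.Semaphore`. The sketch below shows that general technique and does not claim to match the library's scheduler. `run_bounded` and `fake_recovery` are hypothetical names.

```python
import asyncio

stats = {"active": 0, "peak": 0}  # tracks how many jobs run at once


async def fake_recovery(i):
    """Stand-in for a recovery call; records peak concurrency."""
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)  # simulate I/O latency
    stats["active"] -= 1
    return i * 2


async def run_bounded(jobs, limit):
    """Run zero-argument coroutine factories with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        async with semaphore:
            return await job()

    # gather() returns results in the order the jobs were submitted.
    return await asyncio.gather(*(guarded(job) for job in jobs))


jobs = [lambda i=i: fake_recovery(i) for i in range(6)]
results = asyncio.run(run_bounded(jobs, limit=2))
print(results, stats["peak"])
```

Passing factories rather than already-created coroutines means no recovery call starts before it has acquired a slot.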
License
MIT
Support
For issues, questions, or contributions, please visit the project repository.