document-recovery

An orchestration framework for recovering from targeted extraction failures. The library routes deficiencies to specialized recovery strategies, merges the recovered data, and tracks costs.

Overview

document-recovery receives deficiencies from document-confidence, routes each one to a specialized recovery strategy (OCR, Table, Spatial, or Cross-reference), merges the recovered data into the primary extraction, resolves conflicts, and tracks costs.

Architecture

The library follows a modular architecture with:

  • Strategy Pattern - Specialized recovery strategies for different failure types
  • Router - Deficiency-to-strategy mapping with budget enforcement
  • Merger - Recursive merging of primary and recovery output
  • Reconciler - Deterministic conflict resolution
  • Telemetry - Cost tracking and performance metrics
  • Protocol-based Interfaces - Type-safe contracts for extensibility

Installation

pip install pepsico-document-recovery

Optional Dependencies

# For development (quoted so the shell does not expand the brackets)
pip install "pepsico-document-recovery[dev]"

Quick Start

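import asyncio

from document_recovery import (
    RecoveryConfig,
    RecoveryRouter,
    RecoveryPipeline,
)
from document_recovery.strategies import OcrRecoveryStrategy, TableRecoveryStrategy


async def run_recovery(azure_client, extraction, deficiencies, pages):
    # azure_client, extraction, deficiencies, and pages come from your own
    # application: a Document Intelligence client, the primary extraction
    # result, the deficiencies reported for it, and the document pages.

    # Configure recovery
    config = RecoveryConfig(
        max_recovery_attempts=2,
        budget_limit_usd=5.0,
        enable_ocr_recovery=True,
    )

    # Initialize router
    router = RecoveryRouter(config)

    # Initialize strategies, keyed by strategy name
    strategies = {
        "ocr_recovery": OcrRecoveryStrategy(config, azure_client),
        "table_recovery": TableRecoveryStrategy(config, azure_client),
    }

    # Initialize pipeline
    pipeline = RecoveryPipeline(config, router, strategies)

    # Execute recovery (execute is awaited, so this must run in a coroutine)
    return await pipeline.execute(
        primary_result=extraction,
        deficiencies=deficiencies,
        pages=pages,
    )


# From synchronous code:
# result = asyncio.run(run_recovery(azure_client, extraction, deficiencies, pages))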

Configuration

Recovery Configuration

from document_recovery import RecoveryConfig

config = RecoveryConfig(
    max_recovery_attempts=2,
    max_vision_calls_per_document=10,
    budget_limit_usd=5.0,
    enable_ocr_recovery=True,
    enable_table_recovery=True,
    enable_spatial_recovery=True,
    enable_crossref_recovery=True,
    vision_model="gpt-4.1",
    vision_timeout_seconds=60.0,
    parallel_recovery_limit=5,
)

Recovery Strategies

OCR Recovery Strategy

Recovers text using Azure Document Intelligence.

from document_recovery.strategies import OcrRecoveryStrategy

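# azure_client is your IAzureDocumentIntelligenceClient implementation
strategy = OcrRecoveryStrategy(config, azure_client)
# recover() is awaited, so call it from inside an async function
result = await strategy.recover(pages, deficiency)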

Table Recovery Strategy

Recovers tables using Azure Document Intelligence.

from document_recovery.strategies import TableRecoveryStrategy

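# azure_client is your IAzureDocumentIntelligenceClient implementation
strategy = TableRecoveryStrategy(config, azure_client)
# recover() is awaited, so call it from inside an async function
result = await strategy.recover(pages, deficiency)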

Spatial Recovery Strategy

Recovers spatial/layout information using Vision LLMs.

from document_recovery.strategies import SpatialRecoveryStrategy

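# Load the prompt text that is sent to the vision model
with open("prompts/spatial_recovery.txt", encoding="utf-8") as f:
    prompt = f.read()

# vision_model is your IVisionModel implementation
strategy = SpatialRecoveryStrategy(config, vision_model, prompt)
# recover() is awaited, so call it from inside an async function
result = await strategy.recover(pages, deficiency)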

Cross-Reference Recovery Strategy

Recovers cross-references using document-agents.

from document_recovery.strategies import CrossrefRecoveryStrategy

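# crossref_agent is your IExtractionAgent implementation
strategy = CrossrefRecoveryStrategy(config, crossref_agent)
# recover() is awaited, so call it from inside an async function
result = await strategy.recover(pages, deficiency)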

Deficiency Routing

The router maps each deficiency type to the name of the strategy that handles it:

STRATEGY_MAP = {
    "ocr_gap": "ocr_recovery",
    "table_missing": "table_recovery",
    "spatial_failure": "spatial_recovery",
    "crossref_broken": "crossref_recovery",
}
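
The lookup above, combined with the budget enforcement mentioned under Architecture, can be sketched roughly as follows. The `route` function, its `estimated_cost_usd` and `spent_usd` parameters, and the choice to return `None` for skipped deficiencies are assumptions made for illustration; they are not the actual `RecoveryRouter` API.

```python
from typing import Optional

# Mapping copied from the section above.
STRATEGY_MAP = {
    "ocr_gap": "ocr_recovery",
    "table_missing": "table_recovery",
    "spatial_failure": "spatial_recovery",
    "crossref_broken": "crossref_recovery",
}


def route(
    deficiency_type: str,
    estimated_cost_usd: float,
    spent_usd: float,
    budget_limit_usd: float,
) -> Optional[str]:
    """Return the strategy name for a deficiency, or None if it is skipped.

    Hypothetical helper: a deficiency is skipped when its type is unknown
    or when running it would push spending past the budget limit.
    """
    strategy_name = STRATEGY_MAP.get(deficiency_type)
    if strategy_name is None:
        return None  # no strategy handles this deficiency type
    if spent_usd + estimated_cost_usd > budget_limit_usd:
        return None  # budget would be exceeded
    return strategy_name
```

With a 5.0 USD limit, `route("ocr_gap", 0.05, 0.0, 5.0)` selects `"ocr_recovery"`, while a 1.0 USD recovery requested after 4.5 USD has been spent is skipped.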

Result Merging

The merger combines primary extraction with recovery results:

from document_recovery import ResultMerger

merger = ResultMerger()
result = merger.merge(primary_result, recovery_results)
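
As an illustration of what a recursive merge could look like, here is a minimal sketch over plain dictionaries. The `merge_recursive` function and its rules (fill gaps from recovery, descend into nested dicts, keep the primary value on a clash) are assumptions; the real `ResultMerger` may behave differently.

```python
from typing import Any


def merge_recursive(primary: dict[str, Any], recovery: dict[str, Any]) -> dict[str, Any]:
    """Merge recovery output into a copy of the primary output.

    Assumed rules (illustrative only): nested dicts are merged key by key,
    keys that are missing or None in primary are filled from recovery, and
    any other clash keeps the primary value for a later reconciliation step.
    """
    merged = dict(primary)  # shallow copy so the input is not mutated
    for key, recovered in recovery.items():
        current = merged.get(key)
        if isinstance(current, dict) and isinstance(recovered, dict):
            merged[key] = merge_recursive(current, recovered)
        elif current is None:
            merged[key] = recovered
        # Otherwise both sides have a value: keep primary here.
    return merged
```

In this sketch a clash such as two different `name` values is left for the conflict-resolution step rather than decided during merging.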

Conflict Resolution

The reconciler resolves conflicts deterministically with per-field rules; a field with no rule of its own uses the default rule:

RESOLUTION_RULES = {
    "text": "prefer_longer",
    "upc": "prefer_primary",
    "name": "prefer_primary",
    "default": "prefer_recovery",
}
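
A minimal sketch of how these rules might be applied to a single conflicting field. The `resolve` function and the tie-breaking choice for `prefer_longer` (ties keep the primary value) are assumptions for illustration, not the library's reconciler.

```python
# Rules copied from the section above.
RESOLUTION_RULES = {
    "text": "prefer_longer",
    "upc": "prefer_primary",
    "name": "prefer_primary",
    "default": "prefer_recovery",
}


def resolve(field: str, primary_value, recovery_value):
    """Pick one value for a conflicting field (hypothetical helper)."""
    rule = RESOLUTION_RULES.get(field, RESOLUTION_RULES["default"])
    if rule == "prefer_primary":
        return primary_value
    if rule == "prefer_recovery":
        return recovery_value
    if rule == "prefer_longer":
        # Compare string lengths; a tie keeps the primary value so the
        # outcome does not depend on argument order or timing.
        if len(str(recovery_value)) > len(str(primary_value)):
            return recovery_value
        return primary_value
    raise ValueError(f"unknown resolution rule: {rule}")
```

Because every branch depends only on the field name and the two values, the same inputs always produce the same output.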

Telemetry

Track recovery metrics and costs:

from document_recovery import RecoveryTelemetry

telemetry = RecoveryTelemetry()
telemetry.record_recovery(
    strategy_name="ocr_recovery",
    success=True,
    cost_usd=0.05,
    latency_ms=100,
    pages_count=2,
)

summary = telemetry.get_summary()
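
To show the kind of aggregation a summary might contain, here is a stdlib-only sketch. `SketchTelemetry` is a hypothetical stand-in that mirrors the `record_recovery` arguments above; the summary keys are assumptions, not the documented output of `RecoveryTelemetry.get_summary()`.

```python
from dataclasses import dataclass, field


@dataclass
class SketchTelemetry:
    """Hypothetical stand-in for RecoveryTelemetry; not the library's class."""

    records: list = field(default_factory=list)

    def record_recovery(self, strategy_name: str, success: bool,
                        cost_usd: float, latency_ms: float, pages_count: int) -> None:
        # Store one record per recovery attempt.
        self.records.append({
            "strategy_name": strategy_name,
            "success": success,
            "cost_usd": cost_usd,
            "latency_ms": latency_ms,
            "pages_count": pages_count,
        })

    def get_summary(self) -> dict:
        # Aggregate totals; the summary keys are assumptions for illustration.
        total = len(self.records)
        successes = sum(1 for r in self.records if r["success"])
        return {
            "total_recoveries": total,
            "success_rate": successes / total if total else 0.0,
            "total_cost_usd": sum(r["cost_usd"] for r in self.records),
            "avg_latency_ms": (
                sum(r["latency_ms"] for r in self.records) / total if total else 0.0
            ),
            "total_pages": sum(r["pages_count"] for r in self.records),
        }
```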

Custom Recovery Strategies

Create custom strategies by extending BaseRecoveryStrategy:

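from document_recovery.strategies import BaseRecoveryStrategy
# Assumption: RecoveryResult is exported from the package root;
# adjust this import if it lives in another module.
from document_recovery import RecoveryResult

class CustomRecoveryStrategy(BaseRecoveryStrategy):
    async def _execute(self, pages, deficiency):
        # Custom recovery logic goes here; build the result from its output
        return RecoveryResult(...)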
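
The split between a public `recover` method and an overridable `_execute` hook suggests a template-method design. The sketch below shows one way such a base class could work; `SketchBaseStrategy`, `SketchResult`, and the error-handling behavior are hypothetical and may not match the real `BaseRecoveryStrategy`.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchResult:
    """Hypothetical result type standing in for RecoveryResult."""
    success: bool
    data: Optional[dict] = None
    error: Optional[str] = None


class SketchBaseStrategy(ABC):
    """Hypothetical base class; the real BaseRecoveryStrategy may differ."""

    async def recover(self, pages: list, deficiency: dict) -> SketchResult:
        # Public entry point: delegate to the subclass hook and convert
        # any exception into a failed result instead of propagating it.
        try:
            return await self._execute(pages, deficiency)
        except Exception as exc:
            return SketchResult(success=False, error=str(exc))

    @abstractmethod
    async def _execute(self, pages: list, deficiency: dict) -> SketchResult:
        """Subclass hook containing the recovery logic."""


class UppercaseStrategy(SketchBaseStrategy):
    async def _execute(self, pages, deficiency):
        # Toy recovery logic: join the page texts and uppercase them.
        return SketchResult(success=True, data={"text": " ".join(pages).upper()})


class FailingStrategy(SketchBaseStrategy):
    async def _execute(self, pages, deficiency):
        raise RuntimeError("service unavailable")
```

Under this design a subclass only writes `_execute`, and callers always get a result object back from `recover`, even when the underlying service fails.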

Development

Running Tests

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=document_recovery

Code Style

# Format code
black document_recovery

# Lint code
ruff check document_recovery

# Type check
mypy document_recovery

Design Principles

  1. Async-first - Recovery strategies and the pipeline run asynchronously
  2. Provider Agnostic - Interfaces for external dependencies
  3. Extensible - Plugin architecture for custom strategies
  4. Type-safe - Full type hints with Pydantic validation
  5. Production-ready - Enterprise-scale performance

Dependencies

  • pydantic>=2.0 - Data validation
  • typing_extensions>=4.0 - Type extensions

External Dependencies

The library depends on the following external interfaces, which the calling application must implement:

  • IAzureDocumentIntelligenceClient - Azure Document Intelligence
  • IVisionModel - Vision LLM interface
  • IExtractionAgent - document-agents interface
  • IDocumentParser - Document parser interface
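
Since the architecture is described as protocol-based, an implementation of one of these interfaces might look like the sketch below. The `VisionModelSketch` protocol and its `analyze` method are invented for illustration; the actual method names and signatures of `IVisionModel` are not given here and should be taken from the library itself.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class VisionModelSketch(Protocol):
    """Hypothetical shape of a vision-model interface (not IVisionModel itself)."""

    async def analyze(self, image: bytes, prompt: str) -> str:
        ...


class FakeVisionModel:
    """Test double that satisfies the protocol structurally, without inheritance."""

    async def analyze(self, image: bytes, prompt: str) -> str:
        return f"{len(image)} bytes analysed for: {prompt}"
```

Structural typing means any object with a matching `analyze` method is accepted, which makes it straightforward to swap a real provider for a fake in tests.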

Performance

The library is designed for:

  • Parallel recovery execution
  • Budget enforcement
  • Cost tracking
  • Conflict resolution
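
Parallel execution with an upper bound, as suggested by the `parallel_recovery_limit` setting, is commonly built on `asyncio.Semaphore`. The `run_limited` helper below is a sketch under that assumption and is not taken from the library.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def run_limited(
    factories: Sequence[Callable[[], Awaitable]],
    limit: int,
) -> list:
    """Run coroutine factories concurrently, at most `limit` at a time.

    Hypothetical helper: results are returned in the same order as
    `factories`, regardless of completion order.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(factory):
        # Each job waits for a free slot before it starts.
        async with semaphore:
            return await factory()

    return await asyncio.gather(*(guarded(f) for f in factories))
```

Each factory is a zero-argument callable that returns a coroutine, for example `lambda: strategy.recover(pages, deficiency)`, so no recovery starts until a slot is free.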

License

MIT

Support

For issues, questions, or contributions, please visit the project repository.
