document-core

A foundational shared library for the document-processing and planogram-extraction platform. This package provides domain models, interfaces, enums, schemas, exceptions, hashing utilities, and configuration classes for building document processing applications.

Purpose

document-core serves as the contract and domain layer for a larger document processing ecosystem. It defines the shared data structures, protocols, and utilities used across OCR engines, vision models, storage backends, and orchestration services.

Key Design Principles:

  • Domain Driven Design (DDD)
  • SOLID principles
  • Strong typing with complete type hints
  • Async-first interfaces
  • Immutable value objects where appropriate
  • Pydantic v2 models for validation
  • Production-grade validation
  • JSON serialization support
  • Auditability and forward compatibility

Architecture

The library is organized into several key modules:

Core Modules

  • enums.py - Enumeration types for page types, processing modes, field sources, job status, review decisions, and deficiency types
  • errors.py - Exception hierarchy with error codes, messages, and details
  • hashing.py - SHA256 hashing utilities for files, bytes, and text
  • config.py - Configuration management with environment variable support

Domain Models (models/)

  • document.py - Document, Page, and PageMetadata models
  • planogram.py - Product, Shelf, Section, ExtractionMetadata, and PlanogramResult models
  • extraction.py - FieldConflict and ExtractionResult models
  • confidence.py - ConfidenceScore and ConfidenceReport models
  • job.py - JobConfig and JobInfo models
  • review.py - ReviewTask and ReviewResult models

Interfaces (interfaces/)

Protocol definitions for implementing:

  • parser.py - IDocumentParser for document parsing
  • ocr.py - IOcrEngine for OCR text and table extraction
  • vision.py - IVisionModel for image analysis
  • agent.py - IExtractionAgent for AI-based extraction
  • storage.py - IFileStorage for file operations
  • cache.py - IResultCache for caching
  • queue.py - IJobQueue for job management

Schemas (schemas/)

  • api_schemas.py - API request/response models
  • output_schema.json - JSON Schema for PlanogramResult (Draft 2020-12)

Installation

Requirements

  • Python >= 3.11
  • pydantic >= 2.0

Install from Source

cd document-core
pip install -e .

Install Dependencies

pip install "pydantic>=2.0"

Usage

Basic Model Usage

from document_core import PageMetadata, PageType

# Create page metadata
metadata = PageMetadata(
    page_number=1,
    page_type=PageType.PLANOGRAM,
    width_px=1920,
    height_px=1080,
    image_area_ratio=0.95,
    small_text_ratio=0.1,
    detected_table_regions=2,
    detected_shelf_regions=5,
    raw_char_count=1000,
    has_rotated_text=False,
    content_hash="a" * 64,
)

# Access computed properties
print(f"Aspect ratio: {metadata.aspect_ratio}")

Planogram Models

from document_core import Product, Shelf, Section, PlanogramResult, ExtractionMetadata
from document_core.enums import FieldSource
from datetime import datetime

# Create a product
product = Product(
    name="Coca-Cola 12oz",
    upc="04963406",
    facings=3,
    source=FieldSource.PRIMARY,
)

# Create a shelf with products
shelf = Shelf(
    shelf_number=1,
    products=[product],
)

# Create a section with shelves
section = Section(
    section_name="Beverages",
    shelves=[shelf],
)

# Create extraction metadata
metadata = ExtractionMetadata(
    processing_time_ms=1500.0,
    model_name="planogram-extractor-v1",
    ocr_engine="tesseract",
    confidence_score=0.92,
    created_at=datetime.now(),
)

# Create complete planogram result
planogram = PlanogramResult(
    store_name="Store #123",
    category="Beverages",
    sections=[section],
    metadata=metadata,
)

# Access computed properties
print(f"Total products: {planogram.total_products}")
print(f"Total shelves: {planogram.total_shelves}")

Confidence Reports

from document_core import ConfidenceScore, ConfidenceReport
from datetime import datetime

# Create field scores
field_scores = [
    ConfidenceScore(
        field_name="product_name",
        score=0.95,
        source="ocr",
        reason="Clear text",
    ),
    ConfidenceScore(
        field_name="upc",
        score=0.88,
        source="ocr",
        reason="Slightly blurry",
    ),
]

# Create confidence report
report = ConfidenceReport(
    overall_score=0.91,
    field_scores=field_scores,
    deficiencies=[],
    generated_at=datetime.now(),
)

# Check if review is required
if report.is_review_required():
    print("Manual review required")
else:
    print("Confidence is acceptable")

Hashing Utilities

from document_core import compute_sha256_file, compute_sha256_text

# Hash text
text_hash = compute_sha256_text("Hello, World!")
print(f"Text hash: {text_hash}")

# Hash file
file_hash = compute_sha256_file("/path/to/document.pdf")
print(f"File hash: {file_hash}")

Configuration

from document_core import BaseConfig

# Load from environment variables
config = BaseConfig.from_env()

# Or create directly
config = BaseConfig(
    environment="production",
    log_level="INFO",
    cache_ttl_seconds=3600,
)

Error Handling

from document_core import ValidationError, DocumentParseError

try:
    # Your validation logic
    pass
except ValidationError as e:
    print(f"Validation failed: {e.message}")
    print(f"Field: {e.details.get('field')}")
except DocumentParseError as e:
    print(f"Parse failed: {e.message}")
    print(f"Document ID: {e.details.get('document_id')}")

Extending Interfaces

The library provides Protocol-based interfaces that you can implement to create custom components:

Implementing a Custom OCR Engine

from document_core.interfaces import IOcrEngine, OcrResult, TableResult
from document_core.errors import OcrError

class CustomOcrEngine(IOcrEngine):
    async def extract_text(self, image_path: str) -> OcrResult:
        try:
            # Your OCR implementation
            text = "Extracted text..."
            confidence = 0.95

            return OcrResult(
                text=text,
                confidence=confidence,
                processing_time_ms=500.0,
                success=True,
            )
        except Exception as e:
            # Chain the original exception so its traceback is preserved
            raise OcrError(
                message=f"OCR failed: {e}",
                ocr_engine="custom",
            ) from e

    async def extract_tables(self, image_path: str) -> TableResult:
        # Your table extraction implementation
        raise NotImplementedError

Implementing a Custom Storage Backend

from document_core.interfaces import IFileStorage
from document_core.errors import StorageError

class S3Storage(IFileStorage):
    async def upload(self, file_path: str, storage_key: str) -> str:
        # Your S3 upload implementation
        return f"s3://bucket/{storage_key}"

    async def download(self, storage_key: str, local_path: str) -> None:
        # Your S3 download implementation
        raise NotImplementedError

    async def exists(self, storage_key: str) -> bool:
        # Your existence check implementation (must return a bool)
        raise NotImplementedError

    async def delete(self, storage_key: str) -> None:
        # Your delete implementation
        raise NotImplementedError
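
For unit tests, an in-memory stand-in with the same four methods can be handy. The sketch below is self-contained and does not import document_core: `FileStorageLike` is a locally defined Protocol that copies the method signatures shown above, and `InMemoryStorage`, `round_trip`, and the `memory://` URI scheme are hypothetical names.

```python
import asyncio
from typing import Protocol


class FileStorageLike(Protocol):
    """Local mirror of the storage method signatures shown above (assumption)."""

    async def upload(self, file_path: str, storage_key: str) -> str: ...
    async def download(self, storage_key: str, local_path: str) -> None: ...
    async def exists(self, storage_key: str) -> bool: ...
    async def delete(self, storage_key: str) -> None: ...


class InMemoryStorage:
    """Test double that keeps uploaded bytes in a dict; no inheritance needed."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    async def upload(self, file_path: str, storage_key: str) -> str:
        # Blocking file I/O is acceptable here because this is only a test double.
        with open(file_path, "rb") as handle:
            self._blobs[storage_key] = handle.read()
        return f"memory://{storage_key}"

    async def download(self, storage_key: str, local_path: str) -> None:
        with open(local_path, "wb") as handle:
            handle.write(self._blobs[storage_key])

    async def exists(self, storage_key: str) -> bool:
        return storage_key in self._blobs

    async def delete(self, storage_key: str) -> None:
        self._blobs.pop(storage_key, None)


async def round_trip(storage: FileStorageLike, src: str, dst: str, key: str) -> bool:
    """Upload, download, and delete a file; True if the key existed, then vanished."""
    await storage.upload(src, key)
    await storage.download(key, dst)
    found = await storage.exists(key)
    await storage.delete(key)
    return found and not await storage.exists(key)
```

Because Protocols are structural, `round_trip` accepts any object with matching methods, so the same test can run against a real backend or this in-memory one.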

JSON Schema

A JSON Schema for the PlanogramResult model is provided in schemas/output_schema.json. This schema follows JSON Schema Draft 2020-12 and can be used for validation in systems that don't use Python/Pydantic.

# Validate a JSON file against the schema (requires ajv-cli; --spec=draft2020 selects Draft 2020-12)
ajv validate --spec=draft2020 -s document_core/schemas/output_schema.json -d data.json

Testing

Run the test suite:

pytest tests/

Run tests with coverage:

pytest tests/ --cov=document_core --cov-report=html

Versioning Strategy

This project follows Semantic Versioning:

  • MAJOR: Incompatible API changes
  • MINOR: New functionality in a backwards compatible manner
  • PATCH: Backwards compatible bug fixes

Given the library is in early development (v0.1.0), minor versions may include breaking changes until v1.0 is released.

Project Structure

document-core/
├── pyproject.toml
├── README.md
├── document_core/
│   ├── __init__.py
│   ├── enums.py
│   ├── errors.py
│   ├── hashing.py
│   ├── config.py
│   ├── models/
│   │   ├── __init__.py
│   │   ├── document.py
│   │   ├── planogram.py
│   │   ├── extraction.py
│   │   ├── confidence.py
│   │   ├── job.py
│   │   └── review.py
│   ├── interfaces/
│   │   ├── __init__.py
│   │   ├── parser.py
│   │   ├── ocr.py
│   │   ├── vision.py
│   │   ├── agent.py
│   │   ├── storage.py
│   │   ├── cache.py
│   │   └── queue.py
│   └── schemas/
│       ├── __init__.py
│       ├── api_schemas.py
│       └── output_schema.json
└── tests/
    ├── test_hashing.py
    ├── test_enums.py
    ├── test_models.py
    ├── test_document_validation.py
    └── test_confidence.py

Design Decisions

Pure Contract Library

This package contains no implementations of:

  • OCR engines
  • AI models
  • Storage backends
  • Business logic
  • Orchestration code

It is intentionally a pure shared contract/domain package. Implementations should be provided by separate service packages that depend on document-core.

Pydantic v2

All models use Pydantic v2 with:

  • extra="forbid" - Prevents unexpected fields
  • validate_assignment=True - Validates on assignment
  • Comprehensive validators for data integrity
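
The effect of the two settings listed above can be seen in a small Pydantic v2 sketch. `ExampleProduct` and its field constraints are hypothetical and do not reproduce the library's Product model; only `extra="forbid"` and `validate_assignment=True` come from the list above.

```python
from pydantic import BaseModel, ConfigDict, Field, ValidationError


class ExampleProduct(BaseModel):
    """Hypothetical model; the field constraints are assumptions."""

    model_config = ConfigDict(extra="forbid", validate_assignment=True)

    name: str = Field(min_length=1)
    facings: int = Field(ge=1)


def rejects(action) -> bool:
    """Return True if calling `action` raises a Pydantic ValidationError."""
    try:
        action()
    except ValidationError:
        return True
    return False


product = ExampleProduct(name="Cola 12oz", facings=3)

# extra="forbid": an unknown keyword is an error instead of being ignored.
unknown_field_rejected = rejects(
    lambda: ExampleProduct(name="Cola 12oz", facings=1, colour="red")
)

# validate_assignment=True: constraints are re-checked when a field is set.
bad_assignment_rejected = rejects(lambda: setattr(product, "facings", 0))
```

With Pydantic's defaults, the extra field would be ignored and the assignment would bypass validation, so both settings close gaps that matter for a shared contract layer.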

Async-First Interfaces

All protocol interfaces are async to support high-throughput, non-blocking operations in production environments.
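
As an illustration of why async interfaces help throughput, the sketch below runs several extraction calls concurrently with `asyncio.gather`. It is deliberately simplified: `OcrEngineLike` and `FakeOcrEngine` are hypothetical, and `extract_text` returns a plain string here rather than the OcrResult object used by the real interface.

```python
import asyncio
from typing import Protocol


class OcrEngineLike(Protocol):
    """Simplified, hypothetical mirror of an async OCR interface."""

    async def extract_text(self, image_path: str) -> str: ...


class FakeOcrEngine:
    async def extract_text(self, image_path: str) -> str:
        # The sleep stands in for network or GPU latency.
        await asyncio.sleep(0.01)
        return f"text from {image_path}"


async def extract_all(engine: OcrEngineLike, image_paths: list[str]) -> list[str]:
    # gather() runs the coroutines concurrently and preserves input order.
    return await asyncio.gather(*(engine.extract_text(p) for p in image_paths))


results = asyncio.run(extract_all(FakeOcrEngine(), ["page1.png", "page2.png"]))
```

While one call waits on I/O, the event loop advances the others, so total latency approaches that of the slowest call instead of the sum of all calls.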

Enum Serialization

All enums inherit from both str and Enum for seamless JSON serialization/deserialization.
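
A short standard-library sketch shows what the `str` mixin buys. `ExamplePageType` and its values are hypothetical; the library's own enum members may use different names and values.

```python
import json
from enum import Enum


class ExamplePageType(str, Enum):
    """Hypothetical enum; member values are assumptions."""

    PLANOGRAM = "planogram"
    TEXT = "text"


# Members are real str instances, so json.dumps needs no custom encoder.
payload = json.dumps({"page_type": ExamplePageType.PLANOGRAM})

# Deserialization is a lookup by value, which returns the singleton member.
restored = ExamplePageType(json.loads(payload)["page_type"])
```

Note that `str()` on such a member yields the qualified name (for example `ExamplePageType.PLANOGRAM`), so use `.value` when the raw string is needed outside of JSON serialization.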

License

Proprietary - PepsiCo AI Team

Support

For issues, questions, or contributions, please contact the PepsiCo AI Team.
