Foundational shared library for a document-processing and planogram-extraction platform

document-core

A foundational shared library for the document-processing and planogram-extraction platform. This package provides domain models, interfaces, enums, schemas, exceptions, hashing utilities, and configuration classes for building document processing applications.

Purpose

document-core serves as the contract and domain layer for a larger document processing ecosystem. It defines the shared data structures, protocols, and utilities used across OCR engines, vision models, storage backends, and orchestration services.

Key Design Principles:

Domain-Driven Design (DDD)
SOLID principles
Strong typing with complete type hints
Async-first interfaces
Immutable value objects where appropriate
Pydantic v2 models for validation
Production-grade validation
JSON serialization support
Auditability and forward compatibility

Architecture

The library is organized into several key modules:

Core Modules

enums.py - Enumeration types for page types, processing modes, field sources, job status, review decisions, and deficiency types
errors.py - Exception hierarchy with error codes, messages, and details
hashing.py - SHA-256 hashing utilities for files, bytes, and text
config.py - Configuration management with environment variable support

Domain Models (`models/`)

document.py - Document, Page, and PageMetadata models
planogram.py - Product, Shelf, Section, ExtractionMetadata, and PlanogramResult models
extraction.py - FieldConflict and ExtractionResult models
confidence.py - ConfidenceScore and ConfidenceReport models
job.py - JobConfig and JobInfo models
review.py - ReviewTask and ReviewResult models

Interfaces (`interfaces/`)

Protocol definitions for implementing:

parser.py - IDocumentParser for document parsing
ocr.py - IOcrEngine for OCR text and table extraction
vision.py - IVisionModel for image analysis
agent.py - IExtractionAgent for AI-based extraction
storage.py - IFileStorage for file operations
cache.py - IResultCache for caching
queue.py - IJobQueue for job management

Schemas (`schemas/`)

api_schemas.py - API request/response models
output_schema.json - JSON Schema for PlanogramResult (Draft 2020-12)

Installation

Requirements

Python >= 3.11
pydantic >= 2.0

Install from Source

cd document-core
pip install -e .

Install Dependencies

# Quote the requirement so the shell does not treat ">" as output redirection
pip install "pydantic>=2.0"

Usage

Basic Model Usage

from document_core import PageMetadata, PageType

# Create page metadata
metadata = PageMetadata(
    page_number=1,
    page_type=PageType.PLANOGRAM,
    width_px=1920,
    height_px=1080,
    image_area_ratio=0.95,
    small_text_ratio=0.1,
    detected_table_regions=2,
    detected_shelf_regions=5,
    raw_char_count=1000,
    has_rotated_text=False,
    content_hash="a" * 64,
)

# Access computed properties
print(f"Aspect ratio: {metadata.aspect_ratio}")
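The aspect_ratio property above is derived on access rather than stored. As a rough illustration of that pattern on an immutable value object, here is a standard-library sketch. PageSize is a hypothetical class, and defining the ratio as width divided by height is an assumption, not a statement about how document-core computes it.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class PageSize:
    """Hypothetical immutable value object (not part of document-core)."""

    width_px: int
    height_px: int

    def __post_init__(self) -> None:
        # Validate once at construction; the object cannot change afterwards.
        if self.width_px <= 0 or self.height_px <= 0:
            raise ValueError("page dimensions must be positive")

    @property
    def aspect_ratio(self) -> float:
        """Width divided by height (assumed definition), computed on access."""
        return self.width_px / self.height_px


size = PageSize(width_px=1920, height_px=1080)
ratio = size.aspect_ratio

try:
    size.width_px = 1280  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    mutation_rejected = True
else:
    mutation_rejected = False
```

Because the ratio is never stored, it cannot drift out of sync with the dimensions it is derived from.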

Planogram Models

from document_core import Product, Shelf, Section, PlanogramResult, ExtractionMetadata
from document_core.enums import FieldSource
from datetime import datetime

# Create a product
product = Product(
    name="Coca-Cola 12oz",
    upc="04963406",
    facings=3,
    source=FieldSource.PRIMARY,
)

# Create a shelf with products
shelf = Shelf(
    shelf_number=1,
    products=[product],
)

# Create a section with shelves
section = Section(
    section_name="Beverages",
    shelves=[shelf],
)

# Create extraction metadata
metadata = ExtractionMetadata(
    processing_time_ms=1500.0,
    model_name="planogram-extractor-v1",
    ocr_engine="tesseract",
    confidence_score=0.92,
    created_at=datetime.now(),
)

# Create complete planogram result
planogram = PlanogramResult(
    store_name="Store #123",
    category="Beverages",
    sections=[section],
    metadata=metadata,
)

# Access computed properties
print(f"Total products: {planogram.total_products}")
print(f"Total shelves: {planogram.total_shelves}")

Confidence Reports

from document_core import ConfidenceScore, ConfidenceReport
from datetime import datetime

# Create field scores
field_scores = [
    ConfidenceScore(
        field_name="product_name",
        score=0.95,
        source="ocr",
        reason="Clear text",
    ),
    ConfidenceScore(
        field_name="upc",
        score=0.88,
        source="ocr",
        reason="Slightly blurry",
    ),
]

# Create confidence report
report = ConfidenceReport(
    overall_score=0.91,
    field_scores=field_scores,
    deficiencies=[],
    generated_at=datetime.now(),
)

# Check if review is required
if report.is_review_required():
    print("Manual review required")
else:
    print("Confidence is acceptable")
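The README does not state which rule is_review_required() applies. One plausible rule, sketched below with the standard library only, flags a report when the overall score or any field score falls below a threshold, or when any deficiency is recorded. SimpleReport and the 0.85 threshold are illustrative assumptions, not document-core's actual logic.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SimpleReport:
    """Hypothetical report; the 0.85 threshold is an illustrative assumption."""

    overall_score: float
    field_scores: dict[str, float] = field(default_factory=dict)
    deficiencies: tuple[str, ...] = ()
    threshold: float = 0.85

    def low_confidence_fields(self) -> list[str]:
        """Names of fields scoring below the threshold, sorted for stable output."""
        return sorted(
            name for name, score in self.field_scores.items() if score < self.threshold
        )

    def is_review_required(self) -> bool:
        """Review if the overall score is low, a field is low, or a deficiency exists."""
        return (
            self.overall_score < self.threshold
            or bool(self.deficiencies)
            or bool(self.low_confidence_fields())
        )


report = SimpleReport(0.91, {"product_name": 0.95, "upc": 0.88})
needs_review = report.is_review_required()
```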

Hashing Utilities

from document_core import compute_sha256_file, compute_sha256_text

# Hash text
text_hash = compute_sha256_text("Hello, World!")
print(f"Text hash: {text_hash}")

# Hash file
file_hash = compute_sha256_file("/path/to/document.pdf")
print(f"File hash: {file_hash}")
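For readers who want to see what helpers of this kind typically do, here is a minimal sketch built on the standard library's hashlib. The names sha256_text and sha256_file and the 1 MiB chunk size are illustrative assumptions; this is not the document-core implementation.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_text(text: str, encoding: str = "utf-8") -> str:
    """Return the hex SHA-256 digest of a string (hypothetical helper)."""
    return hashlib.sha256(text.encode(encoding)).hexdigest()


def sha256_file(path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so large files are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


# Usage: hashing the same bytes as text and as a file yields the same digest.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes(b"abc")
    file_digest = sha256_file(sample)
text_digest = sha256_text("abc")
```

A hex SHA-256 digest is always 64 characters long, which is why the PageMetadata example earlier uses "a" * 64 as a placeholder content_hash.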

Configuration

from document_core import BaseConfig

# Load from environment variables
config = BaseConfig.from_env()

# Or create directly
config = BaseConfig(
    environment="production",
    log_level="INFO",
    cache_ttl_seconds=3600,
)
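The README does not list the environment variable names that from_env() reads. The sketch below shows one common way to structure such a loader using only the standard library. EnvConfig, the DOCCORE_ prefix, and the default values are hypothetical; only the three field names are taken from the example above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EnvConfig:
    """Hypothetical stand-in for a config class (not document-core's BaseConfig)."""

    environment: str = "development"
    log_level: str = "INFO"
    cache_ttl_seconds: int = 3600

    @classmethod
    def from_env(
        cls, env: Optional[Mapping[str, str]] = None, prefix: str = "DOCCORE_"
    ) -> "EnvConfig":
        """Build a config from environment variables, falling back to defaults."""
        source = os.environ if env is None else env
        # Environment values are strings, so numeric fields need explicit parsing.
        ttl = int(source.get(f"{prefix}CACHE_TTL_SECONDS", cls.cache_ttl_seconds))
        if ttl < 0:
            raise ValueError("cache_ttl_seconds must be non-negative")
        return cls(
            environment=source.get(f"{prefix}ENVIRONMENT", cls.environment),
            log_level=source.get(f"{prefix}LOG_LEVEL", cls.log_level).upper(),
            cache_ttl_seconds=ttl,
        )


# Passing a mapping instead of relying on os.environ keeps the loader testable.
config = EnvConfig.from_env(
    {"DOCCORE_ENVIRONMENT": "production", "DOCCORE_LOG_LEVEL": "debug"}
)
```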

Error Handling

from document_core import ValidationError, DocumentParseError

try:
    # Your validation logic
    pass
except ValidationError as e:
    print(f"Validation failed: {e.message}")
    print(f"Field: {e.details.get('field')}")
except DocumentParseError as e:
    print(f"Parse failed: {e.message}")
    print(f"Document ID: {e.details.get('document_id')}")
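The handlers above rely on each exception exposing message and details attributes. A hierarchy with that shape could be sketched as follows. DocumentCoreError, ParseError, the code strings, and to_dict() are hypothetical names chosen for illustration, not the classes exported by document-core.

```python
from typing import Any, Optional


class DocumentCoreError(Exception):
    """Hypothetical base class carrying a code, a message, and structured details."""

    code = "DOCUMENT_CORE_ERROR"

    def __init__(self, message: str, details: Optional[dict[str, Any]] = None) -> None:
        super().__init__(message)
        self.message = message
        # Copy the mapping so later changes by the caller do not alter the error.
        self.details = dict(details or {})

    def to_dict(self) -> dict[str, Any]:
        """Serialize the error for logs or API responses."""
        return {"code": self.code, "message": self.message, "details": self.details}


class ParseError(DocumentCoreError):
    """Subclasses only need to override the error code."""

    code = "PARSE_ERROR"


try:
    raise ParseError("Unsupported PDF version", details={"document_id": "doc-42"})
except DocumentCoreError as exc:  # catching the base class also catches subclasses
    payload = exc.to_dict()
```

Keeping the code on the class and the context in details lets callers branch on a stable identifier while still logging request-specific data.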

Extending Interfaces

The library provides Protocol-based interfaces that you can implement to create custom components:

Implementing a Custom OCR Engine

from document_core.interfaces import IOcrEngine, OcrResult, TableResult
from document_core.errors import OcrError

class CustomOcrEngine(IOcrEngine):
    async def extract_text(self, image_path: str) -> OcrResult:
        try:
            # Your OCR implementation
            text = "Extracted text..."
            confidence = 0.95

            return OcrResult(
                text=text,
                confidence=confidence,
                processing_time_ms=500.0,
                success=True,
            )
        except Exception as e:
            # Chain the original exception so the traceback keeps the root cause
            raise OcrError(
                message=f"OCR failed: {e}",
                ocr_engine="custom",
            ) from e

    async def extract_tables(self, image_path: str) -> TableResult:
        # Your table extraction implementation; it must return a TableResult
        raise NotImplementedError

Implementing a Custom Storage Backend

from document_core.interfaces import IFileStorage
from document_core.errors import StorageError

class S3Storage(IFileStorage):
    async def upload(self, file_path: str, storage_key: str) -> str:
        # Your S3 upload implementation
        return f"s3://bucket/{storage_key}"
    
    async def download(self, storage_key: str, local_path: str) -> None:
        # Your S3 download implementation
        pass
    
    async def exists(self, storage_key: str) -> bool:
        # Your existence check implementation; it must return a bool
        raise NotImplementedError
    
    async def delete(self, storage_key: str) -> None:
        # Your delete implementation
        pass
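For unit tests it is often convenient to swap a real backend for an in-memory double with the same four async methods. The sketch below mirrors the method signatures of the S3 example but deliberately does not import document_core, so it runs on its own. InMemoryStorage and the memory:// URI scheme are made up for illustration.

```python
import asyncio
import tempfile
from pathlib import Path


class InMemoryStorage:
    """Hypothetical test double exposing upload, download, exists, and delete."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    async def upload(self, file_path: str, storage_key: str) -> str:
        self._blobs[storage_key] = Path(file_path).read_bytes()
        return f"memory://{storage_key}"

    async def download(self, storage_key: str, local_path: str) -> None:
        Path(local_path).write_bytes(self._blobs[storage_key])

    async def exists(self, storage_key: str) -> bool:
        return storage_key in self._blobs

    async def delete(self, storage_key: str) -> None:
        # Deleting a missing key is treated as a no-op.
        self._blobs.pop(storage_key, None)


async def round_trip() -> tuple[str, bool, bytes, bool]:
    """Upload a file, read it back, delete it, and report what happened."""
    storage = InMemoryStorage()
    with tempfile.TemporaryDirectory() as tmp:
        src, dst = Path(tmp) / "in.bin", Path(tmp) / "out.bin"
        src.write_bytes(b"planogram")
        uri = await storage.upload(str(src), "docs/in.bin")
        found = await storage.exists("docs/in.bin")
        await storage.download("docs/in.bin", str(dst))
        copied = dst.read_bytes()
        await storage.delete("docs/in.bin")
        gone = not await storage.exists("docs/in.bin")
    return uri, found, copied, gone


outcome = asyncio.run(round_trip())
```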

JSON Schema

A JSON Schema for the PlanogramResult model is provided in schemas/output_schema.json. This schema follows JSON Schema Draft 2020-12 and can be used for validation in systems that don't use Python/Pydantic.

# Validate a JSON file against the schema (ajv-cli; --spec selects Draft 2020-12)
ajv validate --spec=draft2020 -s document_core/schemas/output_schema.json -d data.json

Testing

Run the test suite:

pytest tests/

Run tests with coverage:

pytest tests/ --cov=document_core --cov-report=html

Versioning Strategy

This project follows Semantic Versioning:

MAJOR: Incompatible API changes
MINOR: New functionality added in a backwards-compatible manner
PATCH: Backwards-compatible bug fixes

Because the library is in early development (v0.1.0), minor versions may include breaking changes until v1.0 is released.
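The pre-1.0 caveat changes how a consumer should judge compatibility. The sketch below encodes that rule for plain MAJOR.MINOR.PATCH strings: within 0.x, the minor version must match; from 1.0 onward, only the major version must match. parse_version and is_compatible are hypothetical helpers written for this illustration.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse a plain MAJOR.MINOR.PATCH string (no pre-release or build tags)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_compatible(installed: str, required: str) -> bool:
    """Return True if `installed` should satisfy code written against `required`.

    For 0.x releases a minor bump is treated as potentially breaking;
    from 1.0 onward only a major bump is.
    """
    have, want = parse_version(installed), parse_version(required)
    if have < want:
        return False  # older than what the caller was written against
    if want[0] == 0:
        return have[:2] == want[:2]
    return have[0] == want[0]
```

In practice this corresponds to pinning a dependency range such as "document-core>=0.1.0,<0.2.0" while the library remains below v1.0.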

Project Structure

document-core/
├── pyproject.toml
├── README.md
├── document_core/
│   ├── __init__.py
│   ├── enums.py
│   ├── errors.py
│   ├── hashing.py
│   ├── config.py
│   ├── models/
│   │   ├── __init__.py
│   │   ├── document.py
│   │   ├── planogram.py
│   │   ├── extraction.py
│   │   ├── confidence.py
│   │   ├── job.py
│   │   └── review.py
│   ├── interfaces/
│   │   ├── __init__.py
│   │   ├── parser.py
│   │   ├── ocr.py
│   │   ├── vision.py
│   │   ├── agent.py
│   │   ├── storage.py
│   │   ├── cache.py
│   │   └── queue.py
│   └── schemas/
│       ├── __init__.py
│       ├── api_schemas.py
│       └── output_schema.json
└── tests/
    ├── test_hashing.py
    ├── test_enums.py
    ├── test_models.py
    ├── test_document_validation.py
    └── test_confidence.py

Design Decisions

Pure Contract Library

This package contains no implementations of:

OCR engines
AI models
Storage backends
Business logic
Orchestration code

It is intentionally a pure shared contract/domain package. Implementations should be provided by separate service packages that depend on document-core.

Pydantic v2

All models use Pydantic v2 with:

extra="forbid" - Prevents unexpected fields
validate_assignment=True - Validates on assignment
Comprehensive validators for data integrity

Async-First Interfaces

All protocol interfaces are async to support high-throughput, non-blocking operations in production environments.
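The practical benefit of async interfaces is that a caller can have many I/O-bound operations in flight at once. Here is a self-contained sketch of that idea. TextExtractor and FakeExtractor are hypothetical stand-ins loosely modeled on an OCR-style interface, and the sleep stands in for real latency.

```python
import asyncio
from typing import Protocol


class TextExtractor(Protocol):
    """Hypothetical async protocol (not document-core's IOcrEngine)."""

    async def extract_text(self, image_path: str) -> str: ...


class FakeExtractor:
    """Satisfies TextExtractor structurally; no inheritance is required."""

    async def extract_text(self, image_path: str) -> str:
        await asyncio.sleep(0.01)  # stands in for network or GPU latency
        return f"text from {image_path}"


async def extract_all(engine: TextExtractor, paths: list[str]) -> list[str]:
    # gather() runs the coroutines concurrently and preserves input order.
    return await asyncio.gather(*(engine.extract_text(p) for p in paths))


results = asyncio.run(extract_all(FakeExtractor(), ["page1.png", "page2.png"]))
```

With a blocking interface the two pages would be processed one after the other; here the total wait is roughly that of a single call.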

Enum Serialization

All enums inherit from both str and Enum for seamless JSON serialization/deserialization.
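A short sketch of why the str mixin helps: members are real strings, so the standard json module can encode them without a custom encoder, and decoding is a lookup by value. JobStatus and its members here are illustrative assumptions, not the actual document-core enum definitions.

```python
import json
from enum import Enum


class JobStatus(str, Enum):
    """Illustrative enum; the member names and values are assumptions."""

    PENDING = "pending"
    COMPLETED = "completed"


# Because members are also str instances, json.dumps needs no custom encoder.
encoded = json.dumps({"status": JobStatus.COMPLETED})

# Deserialization is a lookup by value on the enum class.
decoded = JobStatus(json.loads(encoded)["status"])
```

A plain Enum without the str mixin would make json.dumps raise a TypeError unless a custom encoder or default hook were supplied.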

License

Proprietary - PepsiCo AI Team

Support

For issues, questions, or contributions, please contact the PepsiCo AI Team.

This version

0.1.0

Jun 15, 2026

Download files

Source Distribution

document_core-0.1.0.tar.gz (22.2 kB view details)

Uploaded Jun 15, 2026 Source

Built Distribution

document_core-0.1.0-py3-none-any.whl (24.1 kB view details)

Uploaded Jun 15, 2026 Python 3

File details

Details for the file document_core-0.1.0.tar.gz.

File metadata

Download URL: document_core-0.1.0.tar.gz
Upload date: Jun 15, 2026
Size: 22.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.6

File hashes

Hashes for document_core-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`5537e0b28482b543ef3c6c72c35019848f4858b8464893fd6e6c56518ad09cba`
MD5	`221a35427d3e17366a7f2e4e852b837f`
BLAKE2b-256	`51896705c9d1b3ffcc869cc859ee9d7a141aaa1cb11e26b9398cd2b18da1431a`

File details

Details for the file document_core-0.1.0-py3-none-any.whl.

File metadata

Download URL: document_core-0.1.0-py3-none-any.whl
Upload date: Jun 15, 2026
Size: 24.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.6

File hashes

Hashes for document_core-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3370ba0bdf0d9ce402238b073c38a3db34f5d9180c2d1bc0d37c2fea05037f2f`
MD5	`2870d2b061ae47765bbb8f0b3c111d48`
BLAKE2b-256	`7f4fde9095cfa7294ef081354e867fdaad252ae8571c955b0a524b2402059f98`
