Foundational shared library for document-processing and planogram-extraction platform
document-core
A foundational shared library for the document-processing and planogram-extraction platform. This package provides domain models, interfaces, enums, schemas, exceptions, hashing utilities, and configuration classes for building document processing applications.
Purpose
document-core serves as the contract and domain layer for a larger document processing ecosystem. It defines the shared data structures, protocols, and utilities used across OCR engines, vision models, storage backends, and orchestration services.
Key Design Principles:
- Domain-Driven Design (DDD)
- SOLID principles
- Strong typing with complete type hints
- Async-first interfaces
- Immutable value objects where appropriate
- Pydantic v2 models for validation
- Production-grade validation
- JSON serialization support
- Auditability and forward compatibility
Architecture
The library is organized into several key modules:
Core Modules
- enums.py - Enumeration types for page types, processing modes, field sources, job status, review decisions, and deficiency types
- errors.py - Exception hierarchy with error codes, messages, and details
- hashing.py - SHA256 hashing utilities for files, bytes, and text
- config.py - Configuration management with environment variable support
Domain Models (models/)
- document.py - Document, Page, and PageMetadata models
- planogram.py - Product, Shelf, Section, ExtractionMetadata, and PlanogramResult models
- extraction.py - FieldConflict and ExtractionResult models
- confidence.py - ConfidenceScore and ConfidenceReport models
- job.py - JobConfig and JobInfo models
- review.py - ReviewTask and ReviewResult models
Interfaces (interfaces/)
Protocol definitions for implementing:
- parser.py - IDocumentParser for document parsing
- ocr.py - IOcrEngine for OCR text and table extraction
- vision.py - IVisionModel for image analysis
- agent.py - IExtractionAgent for AI-based extraction
- storage.py - IFileStorage for file operations
- cache.py - IResultCache for caching
- queue.py - IJobQueue for job management
Schemas (schemas/)
- api_schemas.py - API request/response models
- output_schema.json - JSON Schema for PlanogramResult (Draft 2020-12)
Installation
Requirements
- Python >= 3.11
- pydantic >= 2.0
Install from Source
cd document-core
pip install -e .
Install Dependencies
# Quote the requirement so the shell does not treat ">" as output redirection
pip install "pydantic>=2.0"
Usage
Basic Model Usage
from document_core import PageMetadata, PageType

# Create page metadata
metadata = PageMetadata(
    page_number=1,
    page_type=PageType.PLANOGRAM,
    width_px=1920,
    height_px=1080,
    image_area_ratio=0.95,
    small_text_ratio=0.1,
    detected_table_regions=2,
    detected_shelf_regions=5,
    raw_char_count=1000,
    has_rotated_text=False,
    content_hash="a" * 64,  # placeholder 64-character hex digest
)

# Access computed properties
print(f"Aspect ratio: {metadata.aspect_ratio}")
Planogram Models
from datetime import datetime

from document_core import Product, Shelf, Section, PlanogramResult, ExtractionMetadata
from document_core.enums import FieldSource

# Create a product
product = Product(
    name="Coca-Cola 12oz",
    upc="04963406",
    facings=3,
    source=FieldSource.PRIMARY,
)

# Create a shelf with products
shelf = Shelf(
    shelf_number=1,
    products=[product],
)

# Create a section with shelves
section = Section(
    section_name="Beverages",
    shelves=[shelf],
)

# Create extraction metadata
metadata = ExtractionMetadata(
    processing_time_ms=1500.0,
    model_name="planogram-extractor-v1",
    ocr_engine="tesseract",
    confidence_score=0.92,
    created_at=datetime.now(),
)

# Create complete planogram result
planogram = PlanogramResult(
    store_name="Store #123",
    category="Beverages",
    sections=[section],
    metadata=metadata,
)

# Access computed properties
print(f"Total products: {planogram.total_products}")
print(f"Total shelves: {planogram.total_shelves}")
Confidence Reports
from datetime import datetime

from document_core import ConfidenceScore, ConfidenceReport

# Create field scores
field_scores = [
    ConfidenceScore(
        field_name="product_name",
        score=0.95,
        source="ocr",
        reason="Clear text",
    ),
    ConfidenceScore(
        field_name="upc",
        score=0.88,
        source="ocr",
        reason="Slightly blurry",
    ),
]

# Create confidence report
report = ConfidenceReport(
    overall_score=0.91,
    field_scores=field_scores,
    deficiencies=[],
    generated_at=datetime.now(),
)

# Check if review is required
if report.is_review_required():
    print("Manual review required")
else:
    print("Confidence is acceptable")
Hashing Utilities
from document_core import compute_sha256_file, compute_sha256_text
# Hash text
text_hash = compute_sha256_text("Hello, World!")
print(f"Text hash: {text_hash}")
# Hash file
file_hash = compute_sha256_file("/path/to/document.pdf")
print(f"File hash: {file_hash}")
Configuration
from document_core import BaseConfig
# Load from environment variables
config = BaseConfig.from_env()
# Or create directly
config = BaseConfig(
    environment="production",
    log_level="INFO",
    cache_ttl_seconds=3600,
)
Error Handling
from document_core import ValidationError, DocumentParseError

try:
    # Your validation logic
    pass
except ValidationError as e:
    print(f"Validation failed: {e.message}")
    print(f"Field: {e.details.get('field')}")
except DocumentParseError as e:
    print(f"Parse failed: {e.message}")
    print(f"Document ID: {e.details.get('document_id')}")
Extending Interfaces
The library provides Protocol-based interfaces that you can implement to create custom components:
Implementing a Custom OCR Engine
from document_core.interfaces import IOcrEngine, OcrResult, TableResult
from document_core.errors import OcrError


class CustomOcrEngine(IOcrEngine):
    async def extract_text(self, image_path: str) -> OcrResult:
        try:
            # Your OCR implementation
            text = "Extracted text..."
            confidence = 0.95
            return OcrResult(
                text=text,
                confidence=confidence,
                processing_time_ms=500.0,
                success=True,
            )
        except Exception as e:
            # Chain the original exception so the root cause stays in the traceback
            raise OcrError(
                message=f"OCR failed: {e}",
                ocr_engine="custom",
            ) from e

    async def extract_tables(self, image_path: str) -> TableResult:
        # Your table extraction implementation
        raise NotImplementedError
Implementing a Custom Storage Backend
from document_core.interfaces import IFileStorage
from document_core.errors import StorageError


class S3Storage(IFileStorage):
    async def upload(self, file_path: str, storage_key: str) -> str:
        # Your S3 upload implementation
        return f"s3://bucket/{storage_key}"

    async def download(self, storage_key: str, local_path: str) -> None:
        # Your S3 download implementation
        raise NotImplementedError

    async def exists(self, storage_key: str) -> bool:
        # Your existence check implementation
        raise NotImplementedError

    async def delete(self, storage_key: str) -> None:
        # Your delete implementation
        raise NotImplementedError
JSON Schema
A JSON Schema for the PlanogramResult model is provided in schemas/output_schema.json. This schema follows JSON Schema Draft 2020-12 and can be used for validation in systems that don't use Python/Pydantic.
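# Validate a JSON file against the schema with ajv-cli
# (--spec=draft2020 selects JSON Schema Draft 2020-12; -s is the schema, -d the data file)
ajv validate --spec=draft2020 -s document_core/schemas/output_schema.json -d data.json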
Testing
Run the test suite:
pytest tests/
Run tests with coverage:
pytest tests/ --cov=document_core --cov-report=html
Versioning Strategy
This project follows Semantic Versioning:
- MAJOR: Incompatible API changes
- MINOR: New functionality in a backwards compatible manner
- PATCH: Backwards compatible bug fixes
Given the library is in early development (v0.1.0), minor versions may include breaking changes until v1.0 is released.
Project Structure
document-core/
├── pyproject.toml
├── README.md
├── document_core/
│   ├── __init__.py
│   ├── enums.py
│   ├── errors.py
│   ├── hashing.py
│   ├── config.py
│   ├── models/
│   │   ├── __init__.py
│   │   ├── document.py
│   │   ├── planogram.py
│   │   ├── extraction.py
│   │   ├── confidence.py
│   │   ├── job.py
│   │   └── review.py
│   ├── interfaces/
│   │   ├── __init__.py
│   │   ├── parser.py
│   │   ├── ocr.py
│   │   ├── vision.py
│   │   ├── agent.py
│   │   ├── storage.py
│   │   ├── cache.py
│   │   └── queue.py
│   └── schemas/
│       ├── __init__.py
│       ├── api_schemas.py
│       └── output_schema.json
└── tests/
    ├── test_hashing.py
    ├── test_enums.py
    ├── test_models.py
    ├── test_document_validation.py
    └── test_confidence.py
Design Decisions
Pure Contract Library
This package contains no implementations of:
- OCR engines
- AI models
- Storage backends
- Business logic
- Orchestration code
It is intentionally a pure shared contract/domain package. Implementations should be provided by separate service packages that depend on document-core.
Pydantic v2
All models use Pydantic v2 with:
- extra="forbid" - Prevents unexpected fields
- validate_assignment=True - Validates on assignment
- Comprehensive validators for data integrity
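To make the effect of these two settings concrete, here is a minimal sketch using a stand-in model. ExampleProduct and its fields are hypothetical and are not taken from document-core; only the Pydantic v2 calls (ConfigDict, Field, ValidationError) are real.

```python
from pydantic import BaseModel, ConfigDict, Field, ValidationError


class ExampleProduct(BaseModel):
    """Hypothetical model showing the configuration described above."""

    model_config = ConfigDict(extra="forbid", validate_assignment=True)

    name: str
    facings: int = Field(ge=1)


product = ExampleProduct(name="Cola", facings=3)

extra_error = None
try:
    # extra="forbid": an undeclared field is rejected at construction time
    ExampleProduct(name="Cola", facings=3, colour="red")
except ValidationError as exc:
    extra_error = exc.errors()[0]["type"]

assign_error = None
try:
    # validate_assignment=True: constraints are re-checked on attribute writes
    product.facings = 0
except ValidationError as exc:
    assign_error = exc.errors()[0]["type"]
```

After the failed assignment, product.facings still holds its previous valid value, so an invalid write cannot leave the model in an inconsistent state.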
Async-First Interfaces
All protocol interfaces are async to support high-throughput, non-blocking operations in production environments.
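As an illustration of what an async, Protocol-based contract allows, the sketch below defines a small protocol and an in-memory class that satisfies it structurally, then runs several lookups concurrently. ExampleStorage, InMemoryStorage, and check_all are hypothetical names for this example; the real IFileStorage definition may differ.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class ExampleStorage(Protocol):
    """Hypothetical async protocol, loosely modelled on IFileStorage."""

    async def exists(self, storage_key: str) -> bool: ...


class InMemoryStorage:
    """Satisfies ExampleStorage structurally; no inheritance is needed."""

    def __init__(self, keys: set[str]) -> None:
        self._keys = keys

    async def exists(self, storage_key: str) -> bool:
        await asyncio.sleep(0)  # yield to the event loop, standing in for network I/O
        return storage_key in self._keys


async def check_all(storage: ExampleStorage, keys: list[str]) -> list[bool]:
    # Schedule all lookups at once; gather returns results in input order
    return await asyncio.gather(*(storage.exists(key) for key in keys))


results = asyncio.run(check_all(InMemoryStorage({"a.pdf"}), ["a.pdf", "b.pdf"]))
```

Because each call is awaited rather than blocking, an orchestration service can overlap many storage, OCR, or queue operations on a single event loop.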
Enum Serialization
All enums inherit from both str and Enum for seamless JSON serialization/deserialization.
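A short sketch of why the str mixin matters: members are real strings, so the standard json module serializes them without a custom encoder, and the enum constructor restores them from the raw value. ExamplePageType and its values are hypothetical stand-ins, not the library's PageType.

```python
import json
from enum import Enum


class ExamplePageType(str, Enum):
    """Hypothetical stand-in for an enum such as PageType."""

    PLANOGRAM = "planogram"
    TEXT = "text"


# Serialize: the member is emitted as its string value
payload = json.dumps({"page_type": ExamplePageType.PLANOGRAM})

# Deserialize: look the member up again by value
restored = ExamplePageType(json.loads(payload)["page_type"])
```

A plain Enum without the str mixin would make json.dumps raise a TypeError here, which is the friction this design decision avoids.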
License
Proprietary - PepsiCo AI Team
Support
For issues, questions, or contributions, please contact the PepsiCo AI Team.