Unified parsing abstraction layer for document extraction providers
document-parser
A unified parsing abstraction layer for document extraction providers. This library provides a consistent interface for multiple document parsing engines including LlamaParse Premium and Azure Document Intelligence.
Overview
document-parser wraps multiple parsing engines behind the shared IDocumentParser interface from document-core, enabling:
- Provider abstraction and easy switching
- Response normalization across providers
- Retry handling with exponential backoff
- Batch parsing with concurrency control
- Cache integration
- Provider failover support
- Enterprise-scale document workloads (10,000+ pages)
Architecture
The library follows SOLID principles and combines dependency injection with the adapter and factory patterns:
- Adapter Pattern - Normalizes provider responses to ParseResult
- Factory Pattern - Creates parser instances by name
- Dependency Injection - Parser instances injected into BatchParser
- Async-first - Non-blocking operations for high throughput
- Connection Pooling - httpx.AsyncClient for efficient HTTP communication
Supported Parsers
LlamaParse Premium
- Primary provider for high-quality parsing
- Supports tables, images, charts
- Premium mode for best results
- Async job-based processing with polling
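The job-based flow can be pictured as a submit-then-poll loop. The sketch below is only an illustration under stated assumptions, not the library's implementation: `poll_until_done`, `fetch_status`, and the status strings are hypothetical, and only the `poll_interval_seconds` and `timeout_seconds` names mirror the LlamaParseConfig fields in the Configuration section.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def poll_until_done(
    fetch_status: Callable[[], Awaitable[str]],
    poll_interval_seconds: float = 2.0,
    timeout_seconds: float = 300.0,
) -> str:
    """Poll a job until it reports a terminal status or the deadline passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = await fetch_status()
        # "SUCCESS" and "ERROR" are hypothetical terminal states.
        if status in ("SUCCESS", "ERROR"):
            return status
        # Give up if the next poll would land after the deadline.
        if time.monotonic() + poll_interval_seconds > deadline:
            raise TimeoutError("job did not finish before the timeout")
        await asyncio.sleep(poll_interval_seconds)
```

A fixed interval keeps the example short; a real client might lengthen the interval for long-running jobs.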
Azure Document Intelligence
- Recovery parser and alternative provider
- OCR recovery when needed
- Table extraction with prebuilt-layout
- Text extraction with prebuilt-read
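One way to picture the primary/recovery relationship between the two providers is an ordered failover loop. This is a minimal sketch under assumptions: `parse_with_failover` and the `PageParser` protocol are hypothetical names, and the README does not show how the library itself wires failover.

```python
import asyncio
from typing import Any, Optional, Protocol, Sequence


class PageParser(Protocol):
    """Hypothetical minimal shape of a parser: one async parse method."""

    async def parse(self, page: Any) -> Any: ...


async def parse_with_failover(page: Any, parsers: Sequence[PageParser]) -> Any:
    """Try each parser in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for parser in parsers:
        try:
            return await parser.parse(page)
        except Exception as exc:  # a real caller would catch narrower errors
            last_error = exc
    raise RuntimeError("all parsers failed") from last_error
```

Passing the primary parser first and the recovery parser second gives the ordering described above.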
Installation
Requirements
- Python >= 3.11
- document-core >= 0.1.0
- httpx >= 0.27
- tenacity >= 8.0
- pydantic >= 2.0
Optional Dependencies
- azure-ai-documentintelligence - For Azure Document Intelligence support
Install from Source
cd document-parser
pip install -e .
Install with Azure Support
pip install -e ".[azure]"
Configuration
LlamaParse Configuration
from document_parser import LlamaParseConfig
config = LlamaParseConfig(
    api_key="your-api-key",
    base_url="https://api.cloud.llamaindex.ai",
    premium_mode=True,
    extract_tables=True,
    extract_images=True,
    extract_charts=True,
    preserve_layout=True,
    language="en",
    max_concurrent_jobs=10,
    poll_interval_seconds=2.0,
    timeout_seconds=300.0,
    max_retries=3,
)
Azure Document Intelligence Configuration
from document_parser import AzureDIConfig
config = AzureDIConfig(
    endpoint="https://your-endpoint.cognitiveservices.azure.com",
    api_key="your-api-key",
    api_version="2024-11-30",
    prebuilt_model="prebuilt-layout",
    timeout_seconds=120.0,
    max_retries=3,
)
Usage Examples
Using the Factory
from document_parser import ParserFactory, LlamaParseConfig, AzureDIConfig
# Create LlamaParse parser
llama_config = LlamaParseConfig(api_key="your-key")
llama_parser = ParserFactory.create("llamaparse", llama_config)
# Create Azure DI parser
azure_config = AzureDIConfig(endpoint="https://...", api_key="your-key")
azure_parser = ParserFactory.create("azure_di", azure_config)
Parsing a Single Page
import asyncio
from document_parser import LlamaParseClient, LlamaParseConfig
async def parse_page(page):
    config = LlamaParseConfig(api_key="your-key")
    parser = LlamaParseClient(config)
    try:
        result = await parser.parse(page)
        print(f"Markdown: {result.markdown}")
        print(f"Tables: {len(result.tables)}")
        print(f"Images: {len(result.images)}")
        print(f"Duration: {result.parse_duration_ms}ms")
    finally:
        # Release the HTTP client even if parsing fails
        await parser.close()
# Run (`page` is a page object supplied by the caller)
asyncio.run(parse_page(page))
Batch Parsing
import asyncio
from document_parser import BatchParser, LlamaParseClient, LlamaParseConfig
async def parse_document(document):
    config = LlamaParseConfig(api_key="your-key")
    parser = LlamaParseClient(config)
    batch_parser = BatchParser(
        parser=parser,
        cache=None,  # Optional cache implementation
        max_concurrency=10,
    )
    try:
        results = await batch_parser.parse_document(document)
        print(f"Parsed {len(results)} pages")
    finally:
        # Release the HTTP client even if parsing fails
        await parser.close()
# Run (`document` is a document object supplied by the caller)
asyncio.run(parse_document(document))
Batch Parsing with Cache
from document_parser import BatchParser
from document_core.interfaces import IResultCache
class MyCache(IResultCache):
    async def get(self, key):
        # Retrieve from cache
        pass
    async def put(self, key, value, ttl_seconds):
        # Store in cache
        pass
cache = MyCache()
# `parser` is a parser instance created as in the examples above
batch_parser = BatchParser(parser=parser, cache=cache)
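For local experiments, the two cache methods could be backed by a plain dictionary with expiry timestamps. The class below is a hedged sketch: `InMemoryCache` is a hypothetical name, it does not subclass `IResultCache` so that it runs without document-core installed, and it assumes `get` returns `None` on a miss, which the README does not state.

```python
import asyncio
import time
from typing import Any, Optional


class InMemoryCache:
    """Toy cache with the same get/put shape as the example above."""

    def __init__(self) -> None:
        # Maps key -> (expiry timestamp on the monotonic clock, value)
        self._store: dict[str, tuple[float, Any]] = {}

    async def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # drop the expired entry
            return None
        return value

    async def put(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._store[key] = (time.monotonic() + ttl_seconds, value)
```

This keeps everything in process memory, so it suits tests rather than the enterprise-scale workloads mentioned in the overview.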
Retry Strategy
The library uses tenacity for retry logic with exponential backoff:
Retry Conditions
- HTTP status codes: 429, 500, 502, 503, 504
- Connection errors
- Timeout errors
- Transport errors
Backoff Schedule
- 1s, 2s, 4s, 8s, 16s (exponential)
- Configurable max retries (default: 3)
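The schedule and retry conditions above can be reproduced with a few lines of plain Python. This sketch assumes the delay starts at 1 second, doubles per attempt, and is capped at 16 seconds; `backoff_delays` and `is_retryable_status` are hypothetical helpers, and the library's actual tenacity configuration is not shown in this README.

```python
# Status codes listed under "Retry Conditions".
RETRYABLE_STATUS_CODES = frozenset({429, 500, 502, 503, 504})


def is_retryable_status(status_code: int) -> bool:
    """Return True if a response with this status should be retried."""
    return status_code in RETRYABLE_STATUS_CODES


def backoff_delays(
    max_retries: int,
    base_seconds: float = 1.0,
    cap_seconds: float = 16.0,
) -> list[float]:
    """Delay before each retry: base * 2**attempt, capped at cap_seconds."""
    return [min(base_seconds * 2**attempt, cap_seconds) for attempt in range(max_retries)]


print(backoff_delays(5))  # [1.0, 2.0, 4.0, 8.0, 16.0]
print(backoff_delays(3))  # [1.0, 2.0, 4.0]
```

Under these assumptions, the default of 3 retries would use only the first three delays of the listed schedule.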
Custom Retry Configuration
from document_parser.llamaparse import LlamaParseConfig
config = LlamaParseConfig(
    api_key="your-key",
    max_retries=5,  # Increase retries
)
Error Handling
Exception Hierarchy
DocumentParserError (base)
├── LlamaParseError
├── AzureDIError
├── ParserTimeoutError
├── ParserAuthenticationError
├── ParserRateLimitError
├── AdapterError
├── BatchParserError
└── UnsupportedParserError
Error Handling Example
from document_parser import (
    LlamaParseError,
    ParserAuthenticationError,
    ParserTimeoutError,
)
async def safe_parse(parser, page):
    try:
        return await parser.parse(page)
    except ParserTimeoutError as e:
        print(f"Parsing timed out: {e.message}")
    except ParserAuthenticationError as e:
        print(f"Authentication failed: {e.message}")
    except LlamaParseError as e:
        print(f"Parsing failed: {e.message}")
        print(f"Job ID: {e.details.get('job_id')}")
Performance Tuning
Concurrency Control
# LlamaParse
config = LlamaParseConfig(
    api_key="your-key",
    max_concurrent_jobs=20,  # Increase for faster processing
)
# BatchParser
batch_parser = BatchParser(
    parser=parser,
    max_concurrency=20,
)
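The effect of a concurrency limit such as `max_concurrency` can be illustrated with an `asyncio.Semaphore`. This is a sketch of the general technique only; `parse_all` and `parse_one` are hypothetical names, and how BatchParser applies its limit internally is an assumption based on the "Semaphore throttling" design principle.

```python
import asyncio
from typing import Any, Awaitable, Callable, Sequence


async def parse_all(
    pages: Sequence[Any],
    parse_one: Callable[[Any], Awaitable[Any]],
    max_concurrency: int = 10,
) -> list[Any]:
    """Parse every page while keeping at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(page: Any) -> Any:
        # Wait for a free slot before starting the provider call.
        async with semaphore:
            return await parse_one(page)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(page) for page in pages))
```

Raising the limit speeds up large documents only until the provider's rate limit is reached, at which point 429 responses trigger the retry logic instead.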
Memory Efficiency
- Use streaming uploads for large files
- Process in batches for very large documents
- Enable caching to avoid reprocessing
- Configure appropriate timeout values
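Processing in batches can be as simple as slicing the page sequence into fixed-size chunks and parsing one chunk at a time. The helper below is a generic sketch; `chunked` is a hypothetical name and is not part of document-parser.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls the next `size` items; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch


print(list(chunked(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```

Because the helper is a generator, only one chunk of pages needs to be held in memory at a time.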
Connection Pooling
The library uses httpx.AsyncClient with connection pooling:
- Default: 10 connections
- Configurable via max_concurrent_jobs
- Automatic keep-alive for reuse
Adding New Parsers
1. Create Configuration
from pydantic import BaseModel
class MyParserConfig(BaseModel):
    api_key: str
    base_url: str
    timeout_seconds: float = 120.0
2. Create Adapter
from document_parser.models import ParseResult
class MyParserAdapter:
    def adapt(self, raw_response, page_number, parse_duration_ms) -> ParseResult:
        # Convert provider response to ParseResult
        pass
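To make the adapter step concrete, the sketch below maps a provider-specific dictionary onto a normalized result. Everything here is illustrative: `SimpleParseResult` is a stand-in dataclass limited to fields that appear in this README (plus `page_number`, assumed from the `adapt` signature), and the `"text"`, `"tables"`, and `"figures"` keys are hypothetical provider fields.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SimpleParseResult:
    """Stand-in for ParseResult; the real model lives in document_parser.models."""

    page_number: int
    markdown: str
    tables: list[Any] = field(default_factory=list)
    images: list[Any] = field(default_factory=list)
    parse_duration_ms: float = 0.0


class ExampleAdapter:
    def adapt(
        self,
        raw_response: dict[str, Any],
        page_number: int,
        parse_duration_ms: float,
    ) -> SimpleParseResult:
        # Missing keys fall back to empty values so callers always get
        # the same shape, whichever provider produced the response.
        return SimpleParseResult(
            page_number=page_number,
            markdown=raw_response.get("text", ""),
            tables=list(raw_response.get("tables", [])),
            images=list(raw_response.get("figures", [])),
            parse_duration_ms=parse_duration_ms,
        )
```

Keeping the mapping in one class means a change in a provider's response format touches only its adapter.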
3. Implement IDocumentParser
from document_core.interfaces import IDocumentParser
from document_parser.models import ParseResult
class MyParserClient(IDocumentParser):
    async def parse(self, page, config) -> ParseResult:
        # Parse page
        pass
    async def parse_batch(self, pages, config) -> list[ParseResult]:
        # Parse batch
        pass
4. Register in Factory
# In document_parser/factory.py, add a branch to the existing ParserFactory
class ParserFactory:
    @staticmethod
    def create(parser_name, config):
        if parser_name == "my_parser":
            return MyParserClient(config)
        # ... existing parsers
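An alternative that avoids editing the factory for every new provider is a name-to-builder registry. This is a sketch of that general pattern, not the library's API: `RegistryParserFactory` and `register` are hypothetical, and the `UnsupportedParserError` defined here is a local stand-in for the exception of the same name in the hierarchy above.

```python
from typing import Any, Callable


class UnsupportedParserError(Exception):
    """Local stand-in for the library exception of the same name."""


class RegistryParserFactory:
    # Maps a parser name to a callable that builds a client from a config.
    _builders: dict[str, Callable[[Any], Any]] = {}

    @classmethod
    def register(cls, name: str, builder: Callable[[Any], Any]) -> None:
        cls._builders[name] = builder

    @classmethod
    def create(cls, name: str, config: Any) -> Any:
        try:
            builder = cls._builders[name]
        except KeyError:
            raise UnsupportedParserError(f"unknown parser: {name!r}") from None
        return builder(config)
```

With a registry, a new provider calls `register` once from its own module, which keeps the factory itself closed to modification.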
Development Guide
Running Tests
# Run all tests
pytest tests/
# Run with coverage
pytest tests/ --cov=document_parser --cov-report=html
# Run specific test file
pytest tests/test_factory.py
Code Style
# Format code with black
black document_parser/
# Lint with ruff
ruff check document_parser/
# Type check with mypy
mypy document_parser/
Project Structure
document-parser/
├── pyproject.toml
├── README.md
├── document_parser/
│   ├── __init__.py
│   ├── factory.py
│   ├── batch.py
│   ├── models.py
│   ├── exceptions.py
│   ├── llamaparse/
│   │   ├── __init__.py
│   │   ├── client.py
│   │   ├── config.py
│   │   ├── adapter.py
│   │   └── retry.py
│   └── azure_di/
│       ├── __init__.py
│       ├── client.py
│       ├── config.py
│       └── adapter.py
├── tests/
│   ├── test_factory.py
│   ├── test_batch.py
│   ├── test_llamaparse.py
│   ├── test_azure_di.py
│   ├── test_adapters.py
│   └── test_retry.py
└── docs/
Design Principles
- SOLID - Single responsibility, open/closed, Liskov substitution, interface segregation, dependency inversion
- Dependency Injection - Parser instances injected into batch parser
- Adapter Pattern - Normalize provider responses
- Factory Pattern - Create parsers by name
- Async-first - Non-blocking operations
- High Concurrency - Semaphore throttling, connection pooling
- Provider Abstraction - Switch providers without code changes
- Open/Closed Principle - Add new parsers without modifying existing code
- Structured Logging - Log all major operations
- Testability - Mock external APIs, unit tests for all components
- Type Safety - Complete type hints, mypy validation
- Pydantic Validation - Strict data validation
Dependencies
Internal
- document-core - Shared models, enums, interfaces, and utilities
External
- httpx - Async HTTP client with connection pooling
- tenacity - Retry logic with exponential backoff
- pydantic - Data validation
Optional
- azure-ai-documentintelligence - Azure Document Intelligence SDK
License
MIT License - PepsiCo
Support
For issues, questions, or contributions, please contact the PepsiCo AI Team.
File details
Details for the file pepsico_document_parser-0.1.0.tar.gz.
File metadata
- Download URL: pepsico_document_parser-0.1.0.tar.gz
- Upload date:
- Size: 23.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 37543f5af129b56fc9b958a139c8a3805f6fef828d01397f40381c44acc26206 |
| MD5 | ce2f3839f49ffd489045f0bfad866438 |
| BLAKE2b-256 | ab92c7a326af32855afa13cc5ec080ae9db022d111cc2adc2e839bf41f9f9f07 |
File details
Details for the file pepsico_document_parser-0.1.0-py3-none-any.whl.
File metadata
- Download URL: pepsico_document_parser-0.1.0-py3-none-any.whl
- Upload date:
- Size: 23.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0109a7de5cf51fb3b0c5cacdbc83a4c74330c3adba8f258a56424a45d87dc3b5 |
| MD5 | ca08cc5eb29178d5dbde8d12b05817e0 |
| BLAKE2b-256 | 64901a69859decbecc3faa5a9963baea3699dec708353869e9bfda7152278243 |