Unified parsing abstraction layer for document extraction providers

document-parser

A unified parsing abstraction layer for document extraction providers. This library provides a consistent interface for multiple document parsing engines including LlamaParse Premium and Azure Document Intelligence.

Overview

document-parser wraps multiple parsing engines behind the shared IDocumentParser interface from document-core, enabling:

Provider abstraction and easy switching
Response normalization across providers
Retry handling with exponential backoff
Batch parsing with concurrency control
Cache integration
Provider failover support
Enterprise-scale document workloads (10,000+ pages)

Architecture

The library follows SOLID principles with dependency injection, adapter pattern, and factory pattern:

Adapter Pattern - Normalizes provider responses to ParseResult
Factory Pattern - Creates parser instances by name
Dependency Injection - Parser instances injected into BatchParser
Async-first - Non-blocking operations for high throughput
Connection Pooling - httpx.AsyncClient for efficient HTTP communication

Supported Parsers

LlamaParse Premium

Primary provider for high-quality parsing
Supports tables, images, charts
Premium mode for best results
Async job-based processing with polling
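
Job-based processing with polling generally means submitting a job and then checking its status at a fixed interval until it finishes or a deadline passes. The sketch below illustrates that loop using only the standard library. The names `poll_until_done`, `make_fake_job`, and the status strings are hypothetical stand-ins, not the library's or LlamaParse's actual API; the two timing parameters mirror `poll_interval_seconds` and `timeout_seconds` from LlamaParseConfig.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def poll_until_done(
    fetch_status: Callable[[], Awaitable[str]],
    poll_interval_seconds: float = 2.0,
    timeout_seconds: float = 300.0,
) -> str:
    """Call fetch_status repeatedly until it reports a terminal state.

    fetch_status is a hypothetical coroutine returning "pending",
    "success", or "error"; real providers use their own status values.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = await fetch_status()
        if status in ("success", "error"):
            return status
        # Stop if the next poll would land past the deadline.
        if time.monotonic() + poll_interval_seconds > deadline:
            raise TimeoutError("job did not finish before the timeout")
        await asyncio.sleep(poll_interval_seconds)


def make_fake_job(polls_until_ready: int) -> Callable[[], Awaitable[str]]:
    """Build a fake status endpoint that succeeds after a few polls."""
    calls = {"count": 0}

    async def fetch_status() -> str:
        calls["count"] += 1
        return "success" if calls["count"] >= polls_until_ready else "pending"

    return fetch_status
```

A shorter interval lowers latency at the cost of more status requests, so the interval is usually tuned together with the provider's rate limits.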

Azure Document Intelligence

Fallback (recovery) parser and alternative provider
OCR recovery when needed
Table extraction with prebuilt-layout
Text extraction with prebuilt-read
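
With one provider acting as primary and another as recovery, failover can be as simple as trying the primary and falling back when it raises. The following is a minimal sketch of that idea, not the library's failover implementation: `PageParser`, `FailingParser`, `EchoParser`, and `parse_with_failover` are hypothetical names, and the parsers return plain strings instead of ParseResult to keep the example self-contained.

```python
import asyncio
from typing import Any, Protocol


class PageParser(Protocol):
    """Hypothetical minimal parser shape: one async parse method."""

    async def parse(self, page: Any) -> str: ...


class FailingParser:
    """Simulates a provider that is currently unavailable."""

    async def parse(self, page: Any) -> str:
        raise RuntimeError("primary provider unavailable")


class EchoParser:
    """Simulates a healthy provider."""

    async def parse(self, page: Any) -> str:
        return f"parsed:{page}"


async def parse_with_failover(page: Any, primary: PageParser, fallback: PageParser) -> str:
    """Try the primary parser; on any error, try the fallback once."""
    try:
        return await primary.parse(page)
    except Exception:
        # A real implementation would likely catch only provider errors
        # and log the failure before switching providers.
        return await fallback.parse(page)
```

If the fallback also fails, its exception propagates to the caller, which keeps the failure visible.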

Installation

Requirements

Python >= 3.11
document-core >= 0.1.0
httpx >= 0.27
tenacity >= 8.0
pydantic >= 2.0

Optional Dependencies

azure-ai-documentintelligence - For Azure Document Intelligence support

Install from Source

cd document-parser
pip install -e .

Install with Azure Support

pip install -e ".[azure]"

Configuration

LlamaParse Configuration

from document_parser import LlamaParseConfig

config = LlamaParseConfig(
    api_key="your-api-key",
    base_url="https://api.cloud.llamaindex.ai",
    premium_mode=True,
    extract_tables=True,
    extract_images=True,
    extract_charts=True,
    preserve_layout=True,
    language="en",
    max_concurrent_jobs=10,
    poll_interval_seconds=2.0,
    timeout_seconds=300.0,
    max_retries=3,
)

Azure Document Intelligence Configuration

from document_parser import AzureDIConfig

config = AzureDIConfig(
    endpoint="https://your-endpoint.cognitiveservices.azure.com",
    api_key="your-api-key",
    api_version="2024-11-30",
    prebuilt_model="prebuilt-layout",
    timeout_seconds=120.0,
    max_retries=3,
)

Usage Examples

Using the Factory

from document_parser import ParserFactory, LlamaParseConfig, AzureDIConfig

# Create LlamaParse parser
llama_config = LlamaParseConfig(api_key="your-key")
llama_parser = ParserFactory.create("llamaparse", llama_config)

# Create Azure DI parser
azure_config = AzureDIConfig(endpoint="https://...", api_key="your-key")
azure_parser = ParserFactory.create("azure_di", azure_config)

Parsing a Single Page

import asyncio
from document_parser import LlamaParseClient, LlamaParseConfig

async def parse_page(page):
    config = LlamaParseConfig(api_key="your-key")
    parser = LlamaParseClient(config)

    try:
        result = await parser.parse(page)
        print(f"Markdown: {result.markdown}")
        print(f"Tables: {len(result.tables)}")
        print(f"Images: {len(result.images)}")
        print(f"Duration: {result.parse_duration_ms}ms")
    finally:
        # Always close the parser, even if parsing fails
        await parser.close()

# Run (page is a page object you have already loaded)
asyncio.run(parse_page(page))

Batch Parsing

import asyncio
from document_parser import BatchParser, LlamaParseClient, LlamaParseConfig

async def parse_document(document):
    config = LlamaParseConfig(api_key="your-key")
    parser = LlamaParseClient(config)

    batch_parser = BatchParser(
        parser=parser,
        cache=None,  # Optional cache implementation
        max_concurrency=10,
    )

    try:
        results = await batch_parser.parse_document(document)
        print(f"Parsed {len(results)} pages")
    finally:
        # Always close the parser, even if parsing fails
        await parser.close()

# Run (document is a document object you have already loaded)
asyncio.run(parse_document(document))
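
To illustrate what a `max_concurrency` limit typically does, here is a minimal sketch of semaphore-based throttling with asyncio. It is an assumption about the general technique, not BatchParser's actual code; `parse_all` and `parse_one` are hypothetical names.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def parse_all(
    pages: list[Any],
    parse_one: Callable[[Any], Awaitable[Any]],
    max_concurrency: int = 10,
) -> list[Any]:
    """Run parse_one over pages with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(page: Any) -> Any:
        # Each task waits here until one of the slots is free.
        async with semaphore:
            return await parse_one(page)

    # gather returns results in the same order as the input pages.
    return await asyncio.gather(*(guarded(page) for page in pages))
```

All tasks are created up front, but only `max_concurrency` of them can be inside `parse_one` at any moment, which bounds the load placed on the provider.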

Batch Parsing with Cache

from document_parser import BatchParser
from document_core.interfaces import IResultCache

class MyCache(IResultCache):
    async def get(self, key):
        # Retrieve from cache
        pass

    async def put(self, key, value, ttl_seconds):
        # Store in cache
        pass

cache = MyCache()
# parser is a parser instance created as in the examples above
batch_parser = BatchParser(parser=parser, cache=cache)
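
For local experiments, the two cache methods could be backed by a dictionary with per-entry expiry. The sketch below is standalone so it runs without document-core; to plug it into BatchParser it would presumably need to subclass IResultCache. `InMemoryCache` is a hypothetical name, and returning `None` on a miss is an assumption about what the interface expects.

```python
import asyncio
import time
from typing import Any, Optional


class InMemoryCache:
    """Dict-backed cache with per-entry expiry (illustrative only)."""

    def __init__(self) -> None:
        # key -> (expiry timestamp on the monotonic clock, cached value)
        self._entries: dict[str, tuple[float, Any]] = {}

    async def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._entries[key]
            return None
        return value

    async def put(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._entries[key] = (time.monotonic() + ttl_seconds, value)
```

This keeps everything in process memory, so it suits tests and single-process runs rather than shared or persistent caching.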

Retry Strategy

The library uses tenacity for retry logic with exponential backoff:

Retry Conditions

HTTP status codes: 429, 500, 502, 503, 504
Connection errors
Timeout errors
Transport errors

Backoff Schedule

1s, 2s, 4s, 8s, 16s (exponential: the delay doubles after each failed attempt)
Configurable max retries (default: 3)
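
The schedule above can be reproduced with a few lines of arithmetic. The library itself relies on tenacity; the sketch below uses only the standard library to show the idea, and it assumes that `max_retries` counts retries after the first attempt, which the source does not state. `backoff_delays` and `call_with_retries` are hypothetical helper names.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


def backoff_delays(max_retries: int, base_seconds: float = 1.0, cap_seconds: float = 16.0) -> list[float]:
    """Delay before each retry: base * 2**n, capped at cap_seconds."""
    return [min(base_seconds * 2**n, cap_seconds) for n in range(max_retries)]


async def call_with_retries(
    operation: Callable[[], Awaitable[T]],
    max_retries: int = 3,
    base_seconds: float = 1.0,
    retry_on: tuple[type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> T:
    """Run operation, retrying on the listed exceptions with backoff."""
    delays = backoff_delays(max_retries, base_seconds)
    for attempt in range(max_retries + 1):
        try:
            return await operation()
        except retry_on:
            if attempt == max_retries:
                raise  # Retries exhausted: surface the last error.
            await asyncio.sleep(delays[attempt])
    raise AssertionError("unreachable")
```

Under this assumption, the default of 3 retries would use only the first three delays (1s, 2s, 4s); the full five-step schedule applies when `max_retries` is raised to 5.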

Custom Retry Configuration

from document_parser.llamaparse import LlamaParseConfig

config = LlamaParseConfig(
    api_key="your-key",
    max_retries=5,  # Increase retries
)

Error Handling

Exception Hierarchy

DocumentParserError (base)
├── LlamaParseError
├── AzureDIError
├── ParserTimeoutError
├── ParserAuthenticationError
├── ParserRateLimitError
├── AdapterError
├── BatchParserError
└── UnsupportedParserError

Error Handling Example

from document_parser import (
    LlamaParseError,
    ParserAuthenticationError,
    ParserTimeoutError,
)

# Run inside an async function; parser and page are created as shown earlier
try:
    result = await parser.parse(page)
except ParserTimeoutError as e:
    print(f"Parsing timed out: {e.message}")
except ParserAuthenticationError as e:
    print(f"Authentication failed: {e.message}")
except LlamaParseError as e:
    print(f"Parsing failed: {e.message}")
    print(f"Job ID: {e.details.get('job_id')}")

Performance Tuning

Concurrency Control

from document_parser import BatchParser, LlamaParseConfig

# LlamaParse
config = LlamaParseConfig(
    api_key="your-key",
    max_concurrent_jobs=20,  # Increase for faster processing
)

# BatchParser (parser is an existing parser instance)
batch_parser = BatchParser(
    parser=parser,
    max_concurrency=20,
)

Memory Efficiency

Use streaming uploads for large files
Process in batches for very large documents
Enable caching to avoid reprocessing
Configure appropriate timeout values
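
One way to process a very large document in batches is to split its pages into fixed-size chunks and handle one chunk at a time, so that only one chunk's results are held in memory. The helper below is a generic sketch, not part of the library; `chunked` is a hypothetical name.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of at most size items from items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls the next size items; an empty list ends the loop.
    while chunk := list(islice(iterator, size)):
        yield chunk
```

Each chunk could then be parsed and its results persisted before the next chunk is requested.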

Connection Pooling

The library uses httpx.AsyncClient with connection pooling:

Default: 10 connections
Configurable via max_concurrent_jobs
Automatic keep-alive for reuse

Adding New Parsers

1. Create Configuration

from pydantic import BaseModel

class MyParserConfig(BaseModel):
    api_key: str
    base_url: str
    timeout_seconds: float = 120.0

2. Create Adapter

from document_parser.models import ParseResult

class MyParserAdapter:
    def adapt(self, raw_response, page_number, parse_duration_ms) -> ParseResult:
        # Convert provider response to ParseResult
        pass
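
As an illustration of what an `adapt` body might look like, the sketch below maps a made-up provider payload onto a small result type. `SimpleParseResult` is a stand-in for ParseResult (whose real fields may differ), `ExampleAdapter` is a hypothetical class, and the payload keys `"text"` and `"tables"` are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SimpleParseResult:
    """Stand-in for ParseResult; the real model's fields may differ."""

    page_number: int
    markdown: str
    tables: list[Any] = field(default_factory=list)
    parse_duration_ms: float = 0.0


class ExampleAdapter:
    """Maps a hypothetical provider payload onto the shared result type."""

    def adapt(
        self,
        raw_response: dict[str, Any],
        page_number: int,
        parse_duration_ms: float,
    ) -> SimpleParseResult:
        # "text" and "tables" are made-up payload keys for illustration.
        return SimpleParseResult(
            page_number=page_number,
            markdown=raw_response.get("text", ""),
            tables=list(raw_response.get("tables", [])),
            parse_duration_ms=parse_duration_ms,
        )
```

Keeping the mapping in one class means provider-specific field names never leak into code that consumes the normalized result.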

3. Implement IDocumentParser

from document_core.interfaces import IDocumentParser
from document_parser.models import ParseResult

class MyParserClient(IDocumentParser):
    async def parse(self, page, config) -> ParseResult:
        # Parse page
        pass

    async def parse_batch(self, pages, config) -> list[ParseResult]:
        # Parse batch
        pass

4. Register in Factory

# In document_parser/factory.py, add a branch for the new parser.
# MyParserClient is the client class from step 3.

class ParserFactory:
    @staticmethod
    def create(parser_name, config):
        if parser_name == "my_parser":
            return MyParserClient(config)
        # ... existing parsers
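
An alternative to editing the factory for every new parser is a name-to-builder registry, which lets new parsers register themselves. This is a sketch of that general pattern, not how ParserFactory is implemented; `register_parser`, `create_parser`, and `_REGISTRY` are hypothetical names, and the example raises `ValueError` where the library would presumably raise UnsupportedParserError.

```python
from typing import Any, Callable

# Maps a parser name to a callable that builds a parser from its config.
_REGISTRY: dict[str, Callable[[Any], Any]] = {}


def register_parser(name: str) -> Callable[[Callable[[Any], Any]], Callable[[Any], Any]]:
    """Decorator that records a parser class (or builder) under name."""

    def decorator(builder: Callable[[Any], Any]) -> Callable[[Any], Any]:
        _REGISTRY[name] = builder
        return builder

    return decorator


def create_parser(name: str, config: Any) -> Any:
    """Look up name in the registry and build the parser."""
    try:
        builder = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"unsupported parser: {name}") from None
    return builder(config)


@register_parser("my_parser")
class MyParserClient:
    """Placeholder client; a real one would implement IDocumentParser."""

    def __init__(self, config: Any) -> None:
        self.config = config
```

With this layout, adding a provider touches only the new provider's module, which is closer to the open/closed goal listed under Design Principles.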

Development Guide

Running Tests

# Run all tests
pytest tests/

# Run with coverage
pytest tests/ --cov=document_parser --cov-report=html

# Run specific test file
pytest tests/test_factory.py

Code Style

# Format code with black
black document_parser/

# Lint with ruff
ruff check document_parser/

# Type check with mypy
mypy document_parser/

Project Structure

document-parser/
├── pyproject.toml
├── README.md
├── document_parser/
│   ├── __init__.py
│   ├── factory.py
│   ├── batch.py
│   ├── models.py
│   ├── exceptions.py
│   ├── llamaparse/
│   │   ├── __init__.py
│   │   ├── client.py
│   │   ├── config.py
│   │   ├── adapter.py
│   │   └── retry.py
│   └── azure_di/
│       ├── __init__.py
│       ├── client.py
│       ├── config.py
│       └── adapter.py
├── tests/
│   ├── test_factory.py
│   ├── test_batch.py
│   ├── test_llamaparse.py
│   ├── test_azure_di.py
│   ├── test_adapters.py
│   └── test_retry.py
└── docs/

Design Principles

SOLID - Single responsibility, open/closed, Liskov substitution, interface segregation, dependency inversion
Dependency Injection - Parser instances injected into batch parser
Adapter Pattern - Normalize provider responses
Factory Pattern - Create parsers by name
Async-first - Non-blocking operations
High Concurrency - Semaphore throttling, connection pooling
Provider Abstraction - Switch providers without code changes
Open/Closed Principle - Add new parsers without modifying existing code
Structured Logging - Log all major operations
Testability - Mock external APIs, unit tests for all components
Type Safety - Complete type hints, mypy validation
Pydantic Validation - Strict data validation

Dependencies

Internal

document-core - Shared models, enums, interfaces, and utilities

External

httpx - Async HTTP client with connection pooling
tenacity - Retry logic with exponential backoff
pydantic - Data validation

Optional

azure-ai-documentintelligence - Azure Document Intelligence SDK

License

MIT License - PepsiCo

Support

For issues, questions, or contributions, please contact the PepsiCo AI Team.

Project details

This version

0.1.0

Jun 18, 2026

Download files

Source Distribution

pepsico_document_parser-0.1.0.tar.gz (23.7 kB view details)

Uploaded Jun 18, 2026 Source

Built Distribution

pepsico_document_parser-0.1.0-py3-none-any.whl (23.0 kB view details)

Uploaded Jun 18, 2026 Python 3

File details

Details for the file pepsico_document_parser-0.1.0.tar.gz.

File metadata

Download URL: pepsico_document_parser-0.1.0.tar.gz
Upload date: Jun 18, 2026
Size: 23.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.6

File hashes

Hashes for pepsico_document_parser-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`37543f5af129b56fc9b958a139c8a3805f6fef828d01397f40381c44acc26206`
MD5	`ce2f3839f49ffd489045f0bfad866438`
BLAKE2b-256	`ab92c7a326af32855afa13cc5ec080ae9db022d111cc2adc2e839bf41f9f9f07`

File details

Details for the file pepsico_document_parser-0.1.0-py3-none-any.whl.

File metadata

Download URL: pepsico_document_parser-0.1.0-py3-none-any.whl
Upload date: Jun 18, 2026
Size: 23.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.6

File hashes

Hashes for pepsico_document_parser-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0109a7de5cf51fb3b0c5cacdbc83a4c74330c3adba8f258a56424a45d87dc3b5`
MD5	`ca08cc5eb29178d5dbde8d12b05817e0`
BLAKE2b-256	`64901a69859decbecc3faa5a9963baea3699dec708353869e9bfda7152278243`
