Unified parsing abstraction layer for document extraction providers

document-parser

A unified parsing abstraction layer for document extraction providers. This library provides a consistent interface for multiple document parsing engines including LlamaParse Premium and Azure Document Intelligence.

Overview

document-parser wraps multiple parsing engines behind the shared IDocumentParser interface from document-core, enabling:

  • Provider abstraction and easy switching
  • Response normalization across providers
  • Retry handling with exponential backoff
  • Batch parsing with concurrency control
  • Cache integration
  • Provider failover support
  • Enterprise-scale document workloads (10,000+ pages)

Architecture

The library follows SOLID principles and relies on dependency injection, the adapter pattern, and the factory pattern:

  • Adapter Pattern - Normalizes provider responses to ParseResult
  • Factory Pattern - Creates parser instances by name
  • Dependency Injection - Parser instances injected into BatchParser
  • Async-first - Non-blocking operations for high throughput
  • Connection Pooling - httpx.AsyncClient for efficient HTTP communication

Supported Parsers

LlamaParse Premium

  • Primary provider for high-quality parsing
  • Supports tables, images, charts
  • Premium mode for best results
  • Async job-based processing with polling
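
The job-based flow above amounts to submitting a job and then polling until it finishes or a timeout elapses. The following is a minimal sketch of such a loop, not the library's implementation: `wait_for_job`, the `fetch_status` callable, and the "PENDING" status name are all hypothetical, while `poll_interval_seconds` and `timeout_seconds` mirror the configuration fields of the same name.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def wait_for_job(
    fetch_status: Callable[[], Awaitable[str]],  # hypothetical status getter
    poll_interval_seconds: float = 2.0,
    timeout_seconds: float = 300.0,
) -> str:
    """Poll until the job leaves the 'PENDING' state or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = await fetch_status()
        if status != "PENDING":  # assumed status name, for illustration only
            return status
        # Give up if the next poll would land after the deadline.
        if time.monotonic() + poll_interval_seconds > deadline:
            raise TimeoutError("job did not finish before the timeout")
        await asyncio.sleep(poll_interval_seconds)
```

A shorter poll interval lowers latency for small pages at the cost of more status requests per job.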

Azure Document Intelligence

  • Recovery parser and alternative provider
  • OCR recovery when needed
  • Table extraction with prebuilt-layout
  • Text extraction with prebuilt-read
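
Using one provider as primary and another as a recovery parser can be pictured as an ordered list of parsers that is tried in turn. The sketch below is an assumption about how such failover might look; `parse_with_failover` and the `PageParser` protocol are hypothetical names, and the only thing taken from this README is that a parser exposes an async `parse(page)` method.

```python
import asyncio
from typing import Any, Optional, Protocol, Sequence


class PageParser(Protocol):
    """Stand-in for the parser interface; only `parse` is assumed here."""

    async def parse(self, page: Any) -> Any: ...


async def parse_with_failover(page: Any, parsers: Sequence[PageParser]) -> Any:
    """Try each parser in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for parser in parsers:
        try:
            return await parser.parse(page)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
    # Every parser failed; surface the last underlying error as the cause.
    raise RuntimeError("all parsers failed") from last_error
```

With this shape, the primary parser goes first in the list and the recovery parser second, so the fallback is only called when the primary raises.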

Installation

Requirements

  • Python >= 3.11
  • document-core >= 0.1.0
  • httpx >= 0.27
  • tenacity >= 8.0
  • pydantic >= 2.0

Optional Dependencies

  • azure-ai-documentintelligence - For Azure Document Intelligence support

Install from Source

cd document-parser
pip install -e .

Install with Azure Support

pip install -e ".[azure]"

Configuration

LlamaParse Configuration

from document_parser import LlamaParseConfig

config = LlamaParseConfig(
    api_key="your-api-key",
    base_url="https://api.cloud.llamaindex.ai",
    premium_mode=True,
    extract_tables=True,
    extract_images=True,
    extract_charts=True,
    preserve_layout=True,
    language="en",
    max_concurrent_jobs=10,
    poll_interval_seconds=2.0,
    timeout_seconds=300.0,
    max_retries=3,
)

Azure Document Intelligence Configuration

from document_parser import AzureDIConfig

config = AzureDIConfig(
    endpoint="https://your-endpoint.cognitiveservices.azure.com",
    api_key="your-api-key",
    api_version="2024-11-30",
    prebuilt_model="prebuilt-layout",
    timeout_seconds=120.0,
    max_retries=3,
)

Usage Examples

Using the Factory

from document_parser import ParserFactory, LlamaParseConfig, AzureDIConfig

# Create LlamaParse parser
llama_config = LlamaParseConfig(api_key="your-key")
llama_parser = ParserFactory.create("llamaparse", llama_config)

# Create Azure DI parser
azure_config = AzureDIConfig(endpoint="https://...", api_key="your-key")
azure_parser = ParserFactory.create("azure_di", azure_config)

Parsing a Single Page

import asyncio
from document_parser import LlamaParseClient, LlamaParseConfig

async def parse_page(page):
    config = LlamaParseConfig(api_key="your-key")
    parser = LlamaParseClient(config)

    try:
        result = await parser.parse(page)
        print(f"Markdown: {result.markdown}")
        print(f"Tables: {len(result.tables)}")
        print(f"Images: {len(result.images)}")
        print(f"Duration: {result.parse_duration_ms}ms")
    finally:
        # Close the parser even if parsing fails
        await parser.close()

# Run; `page` is a page object loaded elsewhere
asyncio.run(parse_page(page))

Batch Parsing

import asyncio
from document_parser import BatchParser, LlamaParseClient, LlamaParseConfig

async def parse_document(document):
    config = LlamaParseConfig(api_key="your-key")
    parser = LlamaParseClient(config)

    batch_parser = BatchParser(
        parser=parser,
        cache=None,  # Optional cache implementation
        max_concurrency=10,
    )

    try:
        results = await batch_parser.parse_document(document)
        print(f"Parsed {len(results)} pages")
    finally:
        # Close the parser even if a page fails
        await parser.close()

# Run; `document` is a document object loaded elsewhere
asyncio.run(parse_document(document))

Batch Parsing with Cache

from document_parser import BatchParser
from document_core.interfaces import IResultCache

class MyCache(IResultCache):
    async def get(self, key):
        # Retrieve from cache
        pass
    
    async def put(self, key, value, ttl_seconds):
        # Store in cache
        pass

cache = MyCache()
# `parser` is an existing parser instance, e.g. a LlamaParseClient
batch_parser = BatchParser(parser=parser, cache=cache)
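
To make the `get`/`put` contract concrete, here is a minimal in-memory sketch with per-entry expiry. It is illustrative only: `InMemoryCache` is a hypothetical name, it does not subclass `IResultCache` so that it runs without `document-core` installed, and returning `None` on a miss is an assumption about what the interface expects.

```python
import asyncio
import time
from typing import Any, Optional


class InMemoryCache:
    """Dictionary-backed cache with per-entry expiry (illustrative only)."""

    def __init__(self) -> None:
        # Maps key -> (expiry timestamp on the monotonic clock, cached value)
        self._entries: dict = {}

    async def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._entries[key]  # drop the stale entry
            return None
        return value

    async def put(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._entries[key] = (time.monotonic() + ttl_seconds, value)
```

A process-local dictionary like this is only useful for tests or single-process runs; a shared store would be needed to reuse results across workers.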

Retry Strategy

The library uses tenacity for retry logic with exponential backoff:

Retry Conditions

  • HTTP status codes: 429, 500, 502, 503, 504
  • Connection errors
  • Timeout errors
  • Transport errors
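
The conditions above can be read as a single predicate that decides whether a failed request is worth retrying. This is a sketch under stated assumptions: `is_retryable` is a hypothetical helper, and the built-in `ConnectionError` and `TimeoutError` stand in for whatever HTTP client exceptions the library actually inspects.

```python
from typing import Optional

# Status codes listed under "Retry Conditions"
RETRYABLE_STATUS_CODES = frozenset({429, 500, 502, 503, 504})


def is_retryable(
    status_code: Optional[int] = None,
    error: Optional[BaseException] = None,
) -> bool:
    """Return True when a failed request is worth retrying."""
    if error is not None:
        # Built-in stand-ins for connection, timeout, and transport failures.
        return isinstance(error, (ConnectionError, TimeoutError))
    return status_code in RETRYABLE_STATUS_CODES
```

Client errors such as 400 or 404 are deliberately excluded, since repeating the same request would produce the same response.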

Backoff Schedule

  • 1s, 2s, 4s, 8s, 16s (exponential)
  • Configurable max retries (default: 3)
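
The schedule doubles the wait after each failed attempt. The short sketch below reproduces the listed delays; `backoff_delays` is a hypothetical helper, and how the library maps `max_retries` onto the schedule is an assumption here (one delay per retry).

```python
def backoff_delays(
    max_retries: int,
    base_seconds: float = 1.0,
    factor: float = 2.0,
) -> list:
    """Return the wait before each retry: base, base*factor, base*factor^2, ..."""
    return [base_seconds * factor**attempt for attempt in range(max_retries)]


# Five retries reproduce the schedule listed above.
five = backoff_delays(5)   # [1.0, 2.0, 4.0, 8.0, 16.0]
# Under this assumption, the default of 3 retries uses only the first three.
three = backoff_delays(3)  # [1.0, 2.0, 4.0]
```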

Custom Retry Configuration

from document_parser.llamaparse import LlamaParseConfig

config = LlamaParseConfig(
    api_key="your-key",
    max_retries=5,  # Increase retries
)

Error Handling

Exception Hierarchy

DocumentParserError (base)
├── LlamaParseError
├── AzureDIError
├── ParserTimeoutError
├── ParserAuthenticationError
├── ParserRateLimitError
├── AdapterError
├── BatchParserError
└── UnsupportedParserError
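
A hierarchy like this is usually built from one base class that the provider-specific errors inherit from. The sketch below shows the idea for part of the tree; the `message` and `details` attributes are inferred from the error handling example in this README, and the constructor signature is an assumption rather than the library's definition.

```python
from typing import Any, Optional


class DocumentParserError(Exception):
    """Base error carrying a message and optional structured details."""

    def __init__(self, message: str, details: Optional[dict] = None) -> None:
        super().__init__(message)
        self.message = message
        self.details: dict = details or {}


class LlamaParseError(DocumentParserError):
    """Raised for failures reported by the LlamaParse provider."""


class ParserTimeoutError(DocumentParserError):
    """Raised when parsing exceeds the configured timeout."""


def describe(error: DocumentParserError) -> str:
    # Catching the base class handles every subclass uniformly.
    job_id: Any = error.details.get("job_id", "n/a")
    return f"{type(error).__name__}: {error.message} (job {job_id})"
```

Because every error derives from the base class, a caller can handle specific subclasses first and fall back to `DocumentParserError` for anything else.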

Error Handling Example

from document_parser import LlamaParseError, ParserAuthenticationError, ParserTimeoutError

try:
    result = await parser.parse(page)
except ParserTimeoutError as e:
    print(f"Parsing timed out: {e.message}")
except ParserAuthenticationError as e:
    print(f"Authentication failed: {e.message}")
except LlamaParseError as e:
    print(f"Parsing failed: {e.message}")
    print(f"Job ID: {e.details.get('job_id')}")

Performance Tuning

Concurrency Control

from document_parser import BatchParser, LlamaParseConfig

# LlamaParse
config = LlamaParseConfig(
    api_key="your-key",
    max_concurrent_jobs=20,  # Increase for faster processing
)

# BatchParser
batch_parser = BatchParser(
    parser=parser,
    max_concurrency=20,
)
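
A concurrency limit of this kind is commonly enforced with a semaphore around each parse call. The following sketch shows that technique in isolation; `parse_all` and `parse_one` are hypothetical names, and whether `BatchParser` is implemented this way internally is an assumption.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable


async def parse_all(
    pages: Iterable[Any],
    parse_one: Callable[[Any], Awaitable[Any]],  # hypothetical per-page parser
    max_concurrency: int = 10,
) -> list:
    """Parse pages concurrently, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(page: Any) -> Any:
        # Only `max_concurrency` coroutines can hold the semaphore at once.
        async with semaphore:
            return await parse_one(page)

    # gather returns results in the same order as the input pages.
    return await asyncio.gather(*(guarded(page) for page in pages))
```

Raising the limit increases throughput only until the provider starts returning rate-limit responses, at which point retries take over.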

Memory Efficiency

  • Use streaming uploads for large files
  • Process in batches for very large documents
  • Enable caching to avoid reprocessing
  • Configure appropriate timeout values
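
Processing in batches can be as simple as slicing the page list into fixed-size chunks and handing one chunk at a time to the parser, so that only one chunk of results is held in memory at once. `chunked` below is a hypothetical helper written for illustration, not part of the library.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start : start + size]


# Example: five pages in chunks of two; the last chunk is shorter.
batches = list(chunked(["p1", "p2", "p3", "p4", "p5"], 2))
```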

Connection Pooling

The library uses httpx.AsyncClient with connection pooling:

  • Default: 10 connections
  • Configurable via max_concurrent_jobs
  • Automatic keep-alive for reuse

Adding New Parsers

1. Create Configuration

from pydantic import BaseModel

class MyParserConfig(BaseModel):
    api_key: str
    base_url: str
    timeout_seconds: float = 120.0

2. Create Adapter

from document_parser.models import ParseResult

class MyParserAdapter:
    def adapt(self, raw_response, page_number, parse_duration_ms) -> ParseResult:
        # Convert provider response to ParseResult
        pass
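
As a rough illustration of what the body of `adapt` might do, the sketch below maps a provider response onto a result object. Everything here is an assumption: `ParseResultSketch` is a stand-in dataclass whose field names follow the usage examples in this README (the real `ParseResult` may differ), and the `"text"`, `"tables"`, and `"images"` response keys are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ParseResultSketch:
    """Stand-in for ParseResult; field names follow the usage examples."""

    page_number: int
    markdown: str
    tables: list = field(default_factory=list)
    images: list = field(default_factory=list)
    parse_duration_ms: float = 0.0


class ExampleAdapter:
    def adapt(
        self,
        raw_response: dict,
        page_number: int,
        parse_duration_ms: float,
    ) -> ParseResultSketch:
        # The response keys below are hypothetical; missing keys fall back
        # to empty values so callers always get a fully populated result.
        markdown: Any = raw_response.get("text", "")
        return ParseResultSketch(
            page_number=page_number,
            markdown=str(markdown),
            tables=list(raw_response.get("tables", [])),
            images=list(raw_response.get("images", [])),
            parse_duration_ms=parse_duration_ms,
        )
```

Keeping all provider-specific key names inside the adapter is what lets the rest of the code depend only on the normalized result.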

3. Implement IDocumentParser

from document_core.interfaces import IDocumentParser
from document_parser.models import ParseResult

class MyParserClient(IDocumentParser):
    async def parse(self, page, config) -> ParseResult:
        # Parse page
        pass
    
    async def parse_batch(self, pages, config) -> list[ParseResult]:
        # Parse batch
        pass

4. Register in Factory

# In document_parser/factory.py, add a branch for the new parser to create()

class ParserFactory:
    @staticmethod
    def create(parser_name, config):
        if parser_name == "my_parser":
            return MyParserClient(config)
        # ... existing parsers
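
An alternative to editing `create` for every new parser is a name-to-constructor registry, which lets new parsers be added without touching existing branches. This is a sketch of that general technique and not how `ParserFactory` is implemented; `register_parser`, `create_parser`, and the stub `MyParserClient` are hypothetical, and `ValueError` stands in for the library's `UnsupportedParserError`.

```python
from typing import Any, Callable

# Maps a parser name to a callable that builds a parser from a config.
_REGISTRY: dict = {}


def register_parser(name: str) -> Callable[[Callable[[Any], Any]], Callable[[Any], Any]]:
    """Class decorator that records a parser constructor under `name`."""

    def decorator(builder: Callable[[Any], Any]) -> Callable[[Any], Any]:
        _REGISTRY[name] = builder
        return builder

    return decorator


def create_parser(name: str, config: Any) -> Any:
    """Build the parser registered under `name`, or fail for unknown names."""
    try:
        builder = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"unsupported parser: {name}") from None
    return builder(config)


@register_parser("my_parser")
class MyParserClient:
    """Stub client used only to demonstrate registration."""

    def __init__(self, config: Any) -> None:
        self.config = config
```

With a registry, adding a provider means importing a module that registers itself, and the lookup code stays unchanged.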

Development Guide

Running Tests

# Run all tests
pytest tests/

# Run with coverage
pytest tests/ --cov=document_parser --cov-report=html

# Run specific test file
pytest tests/test_factory.py

Code Style

# Format code with black
black document_parser/

# Lint with ruff
ruff check document_parser/

# Type check with mypy
mypy document_parser/

Project Structure

document-parser/
├── pyproject.toml
├── README.md
├── document_parser/
│   ├── __init__.py
│   ├── factory.py
│   ├── batch.py
│   ├── models.py
│   ├── exceptions.py
│   ├── llamaparse/
│   │   ├── __init__.py
│   │   ├── client.py
│   │   ├── config.py
│   │   ├── adapter.py
│   │   └── retry.py
│   └── azure_di/
│       ├── __init__.py
│       ├── client.py
│       ├── config.py
│       └── adapter.py
├── tests/
│   ├── test_factory.py
│   ├── test_batch.py
│   ├── test_llamaparse.py
│   ├── test_azure_di.py
│   ├── test_adapters.py
│   └── test_retry.py
└── docs/

Design Principles

  • SOLID - Single responsibility, open/closed, Liskov substitution, interface segregation, dependency inversion
  • Dependency Injection - Parser instances injected into batch parser
  • Adapter Pattern - Normalize provider responses
  • Factory Pattern - Create parsers by name
  • Async-first - Non-blocking operations
  • High Concurrency - Semaphore throttling, connection pooling
  • Provider Abstraction - Switch providers without code changes
  • Open/Closed Principle - Add new parsers without modifying existing code
  • Structured Logging - Log all major operations
  • Testability - Mock external APIs, unit tests for all components
  • Type Safety - Complete type hints, mypy validation
  • Pydantic Validation - Strict data validation

Dependencies

Internal

  • document-core - Shared models, enums, interfaces, and utilities

External

  • httpx - Async HTTP client with connection pooling
  • tenacity - Retry logic with exponential backoff
  • pydantic - Data validation

Optional

  • azure-ai-documentintelligence - Azure Document Intelligence SDK

License

MIT License - PepsiCo

Support

For issues, questions, or contributions, please contact the PepsiCo AI Team.
