Production-grade document extraction with intelligent fallback chain: Docling -> PyMuPDF -> pdfplumber -> Tesseract

Docling Extractor

Production-grade document extraction library with intelligent fallback chain for robust PDF processing.

Overview

Docling Extractor provides a reliable pipeline for extracting text, tables, images, and structured content from PDF documents. The library automatically detects document type (scanned vs. digital) and applies the appropriate extraction method with fallback support.

Key Features

Intelligent PDF Detection

Automatically identifies scanned vs. digital PDFs
Routes documents to optimal extraction method

Robust Fallback Chain

For Digital PDFs:

Docling - Advanced layout analysis with table/image extraction
PyMuPDF - Fast text extraction with basic structure
pdfplumber - Table-focused extraction
Raw text - Binary extraction as last resort

For Scanned PDFs:

Tesseract OCR - Text recognition
PyMuPDF - Image extraction with minimal text
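
The ordering above can be sketched as a generic "first tool that produces text wins" loop. This is a minimal illustration, not the library's implementation: `run_fallback_chain`, the `Extractor` signature, and the three stand-in tools are hypothetical names introduced here, and treating empty output as a failure is an assumption.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical extractor signature: takes a PDF path, returns extracted text.
Extractor = Callable[[str], str]


def run_fallback_chain(
    path: str, chain: List[Tuple[str, Extractor]]
) -> Dict[str, object]:
    """Try each extractor in order and return the first non-empty result."""
    errors: List[str] = []
    for name, extractor in chain:
        try:
            text = extractor(path)
        except Exception as exc:  # any failure moves on to the next tool
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return {"status": "success", "tool": name, "text": text, "errors": errors}
        errors.append(f"{name}: empty output")
    return {"status": "failed", "tool": None, "text": "", "errors": errors}


# Stand-in extractors for demonstration only; they do not call the real tools.
def failing_tool(path: str) -> str:
    raise RuntimeError("layout model unavailable")


def empty_tool(path: str) -> str:
    return "   "


def working_tool(path: str) -> str:
    return f"text from {path}"


demo = run_fallback_chain(
    "report.pdf",
    [("docling", failing_tool), ("pymupdf", empty_tool), ("pdfplumber", working_tool)],
)
print(demo["tool"], demo["errors"])
```

Recording why each earlier tool was skipped keeps the failure history available for logging even when a later tool succeeds.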

Production Ready

90-second timeout protection with process termination
Databricks optimized for distributed processing
Structured output (pages, sections, tables, images, formulas)
Comprehensive error handling and logging
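
One way to get a hard timeout with process termination is to run the risky step in a child process. The sketch below uses only the standard library and is an assumption about the general technique, not a description of how this package implements it; `run_with_timeout` is a hypothetical helper.

```python
import subprocess
import sys
from typing import Optional


def run_with_timeout(code: str, timeout_s: float = 90.0) -> Optional[str]:
    """Run Python source in a child process; return stdout, or None on timeout or error."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # subprocess.run kills the child when this expires
        )
    except subprocess.TimeoutExpired:
        return None
    if completed.returncode != 0:
        return None
    return completed.stdout


# A fast job finishes normally; a hung job is terminated after one second.
fast = run_with_timeout("print('extracted')", timeout_s=30.0)
slow = run_with_timeout("import time; time.sleep(30)", timeout_s=1.0)
print(repr(fast), slow)
```

Because the work runs in a separate process, a hang inside native code cannot block the caller, and a `None` result is the signal to move on to the next extractor.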

Installation

Basic installation (PyMuPDF + pdfplumber):

pip install docling-extractor

Full installation (includes Docling and Tesseract):

pip install "docling-extractor[full]"

Note: Tesseract is a system-level program and must be installed separately:

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# macOS
brew install tesseract

# Windows
# Download from: https://github.com/UB-Mannheim/tesseract/wiki

Quick Start

Basic Usage

from docling_extractor import extract_single_document

# Extract a document
result = extract_single_document(
    input_path="/path/to/document.pdf",
    output_dir="/path/to/output",
    document_id="doc_001"
)

# Access extracted content
print(f"Status: {result['registry']['processing_status']}")
print(f"Pages extracted: {len(result['pages'])}")
print(f"Tables found: {len(result['tables'])}")
print(f"Tools used: {result['registry']['tools_used']}")

# Get page text
for page in result['pages']:
    print(f"Page {page['page_number']}: {page['text'][:100]}...")

Working with Different Document Types

# Digital PDF (uses Docling -> PyMuPDF -> pdfplumber chain)
result = extract_single_document(
    input_path="digital_document.pdf",
    output_dir="./output"
)

# Scanned PDF (uses Tesseract -> PyMuPDF chain)
result = extract_single_document(
    input_path="scanned_document.pdf",
    output_dir="./output"
)

Accessing Structured Data

result = extract_single_document(
    input_path="document.pdf",
    output_dir="./output",
    document_id="my_doc"
)

# Pages
for page in result['pages']:
    print(f"Page {page['page_number']}")
    print(page['text'])

# Tables
for table in result['tables']:
    print(f"Table on page {table['page_number']}")
    print(f"Dimensions: {table['row_count']}x{table['column_count']}")
    
# Images
for img in result['images']:
    print(f"Image saved to: {img['image_path']}")
    print(f"Size: {img['width']}x{img['height']}")

# Sections (from Docling)
for section in result['sections']:
    print(f"Section type: {section['section_type']}")
    print(f"Content: {section['content_text'][:100]}...")

Output Structure

The extraction result contains:

{
    'registry': {
        'protocol_id': 'document_id',
        'document_id': 'document_id',
        'processing_status': 'success',  # or 'failed'
        'page_count': 10,
        'tools_used': ['docling'],  # or ['pymupdf'], ['pdfplumber'], etc.
        'error_message': '',
        'processed_at': '2024-12-31T12:00:00'
    },
    'pages': [
        {
            'document_id': 'doc_id',
            'protocol_id': 'doc_id',
            'page_number': 1,
            'text': 'extracted text...',
            'source_path': '/path/to/file.pdf'
        }
    ],
    'tables': [...],
    'images': [...],
    'sections': [...],
    'formulas': [...],
    'errors': [...]
}
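
As an illustration of consuming this structure, the sketch below reduces a result dictionary to a few counts. `summarize_result` is a hypothetical helper, and the sample dictionary is hand-written to match the documented shape rather than produced by a real extraction.

```python
from typing import Any, Dict


def summarize_result(result: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce an extraction result to a small summary suitable for logging."""
    registry = result.get("registry", {})
    pages = result.get("pages", [])
    return {
        "document_id": registry.get("document_id"),
        "ok": registry.get("processing_status") == "success",
        "tools_used": list(registry.get("tools_used", [])),
        "page_count": len(pages),
        "table_count": len(result.get("tables", [])),
        "total_chars": sum(len(page.get("text", "")) for page in pages),
        "error_message": registry.get("error_message", ""),
    }


# Hand-built sample following the documented layout (not real output).
sample = {
    "registry": {
        "document_id": "doc_001",
        "processing_status": "success",
        "tools_used": ["pymupdf"],
        "error_message": "",
    },
    "pages": [
        {"page_number": 1, "text": "hello"},
        {"page_number": 2, "text": "world!"},
    ],
    "tables": [],
}
summary = summarize_result(sample)
print(summary)
```

Using `.get` with defaults keeps the helper usable on failed extractions, where some keys may be empty or missing.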

Use Cases

Clinical Trials & Regulatory Documents

Extract data from clinical trial protocols
Process regulatory submission documents
Handle ICH-GCP compliant document structures

Research & Academic Papers

Extract structured content from research papers
Preserve table and figure information
Handle multi-column layouts

Financial Documents

Process annual reports and filings
Extract tables from financial statements
Handle scanned historical documents

Enterprise Document Processing

Batch processing with Databricks integration
Reliable extraction with fallback support
Structured output for downstream NLP/ML

Advanced Configuration

Using Docling Directly

from docling_extractor import DoclingExtractor

extractor = DoclingExtractor(output_dir="./output")
result = extractor.extract(
    path="document.pdf",
    doc_id="my_document"
)

Checking PDF Type

from docling_extractor import is_scanned_pdf

if is_scanned_pdf("document.pdf"):
    print("Scanned PDF - will use OCR")
else:
    print("Digital PDF - will use text extraction")
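
The detection logic behind `is_scanned_pdf` is not documented here. A common heuristic, shown below purely as an assumption, is to count how many pages carry a meaningful amount of embedded text. `looks_scanned` and both thresholds are hypothetical; the per-page strings would come from whichever text extractor is available.

```python
from typing import Sequence


def looks_scanned(
    page_texts: Sequence[str],
    min_chars_per_page: int = 50,
    max_text_page_ratio: float = 0.2,
) -> bool:
    """Guess whether a PDF is scanned from the embedded text of each page."""
    if not page_texts:
        return True  # nothing readable at all, so treat it as needing OCR
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars_per_page
    )
    # Scanned if only a small share of pages have a real text layer.
    return pages_with_text / len(page_texts) <= max_text_page_ratio


print(looks_scanned(["", "  ", ""]))
print(looks_scanned(["A digital page with plenty of embedded text. " * 5] * 3))
```

Thresholds like these usually need tuning per corpus, since scanned documents with a thin OCR text layer can look digital.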

Performance

Typical processing times (single-threaded):

Digital PDF (10 pages): 5-15 seconds
Scanned PDF (10 pages): 30-60 seconds
Timeout: 90 seconds (configurable in code)

For Databricks distributed processing, use extract_single_document with Spark UDFs.
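
A rough sketch of that pattern follows. It assumes `pyspark` is available on the cluster and that returning a JSON summary string per file is acceptable; `extract_to_json` and `make_extract_udf` are hypothetical helpers, not part of this package. Only `extract_single_document` and the result keys come from the examples above.

```python
import json
from typing import Any, Callable, Dict, Optional


def extract_to_json(
    path: str,
    output_dir: str,
    extract_fn: Optional[Callable[..., Dict[str, Any]]] = None,
) -> str:
    """Run one extraction and return a compact JSON summary string."""
    try:
        if extract_fn is None:
            # Imported lazily so the import happens on the worker running the task.
            from docling_extractor import extract_single_document as extract_fn
        result = extract_fn(input_path=path, output_dir=output_dir)
        registry = result["registry"]
        summary = {
            "path": path,
            "status": registry["processing_status"],
            "tools_used": registry["tools_used"],
            "page_count": len(result["pages"]),
        }
    except Exception as exc:  # keep one bad file from failing the whole job
        summary = {"path": path, "status": "failed", "error": str(exc)}
    return json.dumps(summary)


def make_extract_udf(output_dir: str):
    """Wrap extract_to_json as a Spark UDF (requires pyspark at call time)."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    return udf(lambda path: extract_to_json(path, output_dir), StringType())
```

With a DataFrame `df` that has a string column named `path` (an assumed schema), usage might look like `df.withColumn("extraction", make_extract_udf("/tmp/out")(df["path"]))`.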

Requirements

Core Dependencies:

Python >= 3.8
PyMuPDF >= 1.23.0
pdfplumber >= 0.10.0

Optional Dependencies:

docling >= 2.0.0 (for advanced extraction)
docling-core >= 2.0.0
pytesseract >= 0.3.10 (for OCR)

Troubleshooting

Docling timeout issues: The library includes a 90-second hard timeout with process termination. If Docling hangs, it will automatically fall back to PyMuPDF.

OCR not working: Ensure Tesseract is installed system-wide. Check with:

tesseract --version
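
The same check can be made from Python before starting a batch, so that OCR problems surface early. This is a small standard-library sketch; `tesseract_version` is a hypothetical helper and not part of this package.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version() -> Optional[str]:
    """Return the first line of `tesseract --version`, or None if unavailable."""
    executable = shutil.which("tesseract")
    if executable is None:
        return None  # binary is not on PATH
    try:
        completed = subprocess.run(
            [executable, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # Some Tesseract builds print the version banner to stderr instead of stdout.
    output = (completed.stdout or completed.stderr).strip()
    return output.splitlines()[0] if output else None


version = tesseract_version()
print(version if version else "Tesseract not found on PATH")
```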

Read-only filesystem (Databricks): The library is designed to run from Databricks' read-only site-packages directory. Docling OCR is disabled by default to avoid RapidOCR filesystem issues.

License

MIT License - see LICENSE file for details

Author

Nalini Panwar

Contributing

Contributions welcome. Please open an issue or submit a pull request at: https://github.com/panwarnalini-hub/clinical-doc-pipelines

Acknowledgments

Built with:

Docling - IBM Research document understanding
PyMuPDF - Fast PDF processing
pdfplumber - Table extraction
Tesseract - OCR engine

Project details

Version 1.0.0, released Dec 31, 2025

Download files

Source Distribution

docling_extractor-1.0.0.tar.gz (13.6 kB)

Uploaded Dec 31, 2025 Source

Built Distribution

docling_extractor-1.0.0-py3-none-any.whl (11.1 kB)

Uploaded Dec 31, 2025 Python 3

File details

Details for the file docling_extractor-1.0.0.tar.gz.

File metadata

Download URL: docling_extractor-1.0.0.tar.gz
Upload date: Dec 31, 2025
Size: 13.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.5

File hashes

Hashes for docling_extractor-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`8b92668efd5927452a0d4e62c7bbb6acc7b74504b1a53765eb715120be02c598`
MD5	`929121f2b2919bdc2b18051f024f3a6b`
BLAKE2b-256	`a52613e22c20d229da5656a4855ce45ea98ec7e54ea64d1077992db94d1cdc71`

File details

Details for the file docling_extractor-1.0.0-py3-none-any.whl.

File metadata

Download URL: docling_extractor-1.0.0-py3-none-any.whl
Upload date: Dec 31, 2025
Size: 11.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.5

File hashes

Hashes for docling_extractor-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ccd6c5a4901b4a863a98c6759f8528b3b3d8c5b8480379c3073bcef68992db81`
MD5	`2083115e4999abcf4b7a58b83de4e1ac`
BLAKE2b-256	`dd7dd065ed333d2cafd7c10ae92166d02741c219f55bb67a242888db47d66df9`
