Docfold

Turn any document into structured data. Unified Python toolkit for document structuring — one interface, 16 engines, built-in benchmarks.

Docfold is the open-source extraction engine from Datatera.ai. It was split out of our commercial enterprise AI data platform and has been battle-tested in production with enterprise clients across finance, insurance, and mining.

Read the announcement: Docfold - open-source document processing toolkit

Engine Comparison

Research-based estimates from public benchmarks, documentation, and community reports. See detailed methodology. Run your own: docfold compare your_doc.pdf

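| Engine | docfold | Type | License | Text PDF | Scan/OCR | Tables | BBox | Conf | Speed | Cost |
|---|---|---|---|---|---|---|---|---|---|---|
| Docling | | Local | MIT | ★★★ | ★★☆ | ★★★ | | | Medium | Free |
| MinerU | | Local | AGPL | ★★★ | ★★★ | ★★★ | | | Slow | Free |
| Marker | | SaaS | Paid | ★★★ | ★★★ | ★★★ | | | Fast | $$ |
| PyMuPDF | | Local | AGPL | ★★★ | ☆☆☆ | ★☆☆ | | | Ultra | Free |
| PaddleOCR | | Local | Apache | ★☆☆ | ★★★ | ★★☆ | | | Medium | Free |
| Tesseract | | Local | Apache | ★☆☆ | ★★☆ | ★☆☆ | | | Medium | Free |
| EasyOCR | | Local | Apache | ★☆☆ | ★★★ | ☆☆☆ | | | Medium | Free |
| Unstructured | | Local | Apache | ★★☆ | ★★☆ | ★★☆ | | | Medium | Free |
| LlamaParse | | SaaS | Paid | ★★★ | ★★★ | ★★★ | | | Fast | $$ |
| Mistral OCR | | SaaS | Paid | ★★★ | ★★★ | ★★★ | | | Fast | $$ |
| Zerox | | VLM | MIT | ★★★ | ★★★ | ★★☆ | | | Slow | $$$ |
| AWS Textract | | SaaS | Paid | ★★★ | ★★★ | ★★★ | | | Fast | $$ |
| Google Doc AI | | SaaS | Paid | ★★★ | ★★★ | ★★★ | | | Fast | $$ |
| Azure Doc Intel | | SaaS | Paid | ★★★ | ★★★ | ★★★ | | | Fast | $$ |
| Nougat | | Local | MIT | ★★★ | ★★☆ | ★★☆ | | | Slow | Free |
| Surya | | Local | GPL | ★★☆ | ★★★ | ★★☆ | | | Medium | Free |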

Ratings: ★★★ Excellent, ★★☆ Good, ★☆☆ Basic, ☆☆☆ Not supported
Cost: $$ ~$1-3 per 1K pages, $$$ ~$5-15 per 1K pages
Columns: BBox = bounding boxes, Conf = confidence scores

Full engine profiles, format matrix, hardware requirements, and cost breakdown →

How to Choose

| Your situation | Recommended engine |
|---|---|
| Digital PDF, speed is critical | PyMuPDF (zero deps, ~1000 pages/sec) |
| Scanned documents, need OCR | PaddleOCR (80+ langs), EasyOCR (PyTorch), or Tesseract (100+ langs) |
| Complex layouts + tables | Docling or MinerU (free), LlamaParse (paid) |
| Academic papers + math formulas | MinerU or Nougat (free), Mistral OCR (paid) |
| Best quality, budget available | Mistral OCR or LlamaParse |
| Use any Vision LLM (GPT-4o, Claude, etc.) | Zerox (model-agnostic) |
| Self-hosted, all-in-one ETL | Unstructured with hi_res strategy |
| Diverse file types (not just PDF) | Docling or Unstructured |
| Need bounding boxes + confidence | Textract, Google DocAI, or Azure DocInt |
| Office files (DOCX/PPTX/XLSX) | Docling, Marker, Unstructured, or Azure DocInt |
| AWS/GCP/Azure native pipeline | Textract / Google DocAI / Azure DocInt |

Why Docfold?

Every engine has trade-offs. Docfold lets you switch between them with one line:

| Challenge | Without Docfold | With Docfold |
|---|---|---|
| Try a new engine | Rewrite your pipeline | Change one string: engine_hint="docling" |
| Compare quality | Manual side-by-side | One line: router.compare("doc.pdf") |
| Batch 1000 files | Build your own concurrency | router.process_batch(files, concurrency=5) |
| Measure accuracy | Write custom metrics | Built-in CER, WER, Table F1, Reading Order |
| Switch engines later | Major refactor | Zero code changes: same EngineResult |

import asyncio

from docfold import EngineRouter
from docfold.engines.docling_engine import DoclingEngine
from docfold.engines.pymupdf_engine import PyMuPDFEngine


async def main():
    router = EngineRouter([DoclingEngine(), PyMuPDFEngine()])

    # Auto-select the best available engine
    result = await router.process("invoice.pdf")
    print(result.content)             # Markdown output
    print(result.engine_name)         # Which engine was used
    print(result.processing_time_ms)  # Wall-clock time in ms

    # Compare all engines on the same document
    results = await router.compare("invoice.pdf")
    for name, res in results.items():
        print(f"{name}: {len(res.content)} chars in {res.processing_time_ms}ms")


# process() and compare() are awaited, so they must run inside an event loop
asyncio.run(main())

Supported Engines

| Engine | Type | License | Formats | GPU | Install |
|---|---|---|---|---|---|
| Docling | Local | MIT | PDF, DOCX, PPTX, XLSX, HTML, images | No | pip install docfold[docling] |
| MinerU | Local | AGPL-3.0 | PDF | Recommended | pip install docfold[mineru] |
| Marker API | SaaS | Paid | PDF, Office, images | N/A | pip install docfold[marker] |
| PyMuPDF | Local | AGPL-3.0 | PDF | No | pip install docfold[pymupdf] |
| PaddleOCR | Local | Apache-2.0 | Images, scanned PDFs | Optional | pip install docfold[paddleocr] |
| Tesseract | Local | Apache-2.0 | Images, scanned PDFs | No | pip install docfold[tesseract] |
| EasyOCR | Local | Apache-2.0 | Images, scanned PDFs | Optional | pip install docfold[easyocr] |
| Unstructured | Local | Apache-2.0 | PDF, Office, HTML, email, ePub | Optional | pip install docfold[unstructured] |
| LlamaParse | SaaS | Paid | PDF, Office, images | N/A | pip install docfold[llamaparse] |
| Mistral OCR | SaaS | Paid | PDF, images | N/A | pip install docfold[mistral-ocr] |
| Zerox | VLM | MIT | PDF, images | Depends | pip install docfold[zerox] |
| AWS Textract | SaaS | Paid | PDF, images | N/A | pip install docfold[textract] |
| Google Doc AI | SaaS | Paid | PDF, images | N/A | pip install docfold[google-docai] |
| Azure Doc Intel | SaaS | Paid | PDF, Office, HTML, images | N/A | pip install docfold[azure-docint] |
| Nougat | Local | MIT (code) | PDF | Recommended | pip install docfold[nougat] |
| Surya | Local | GPL-3.0 | PDF, images | Optional | pip install docfold[surya] |

Adding your own engine? Implement the DocumentEngine interface — see Adding a Custom Engine below.

Installation

# Core only (no engines — useful for writing custom adapters)
pip install docfold

# With specific engines
pip install docfold[docling]
pip install docfold[docling,pymupdf,tesseract]

# Everything
pip install docfold[all]

Requires Python 3.10+.

CLI

# Convert a document
docfold convert invoice.pdf
docfold convert report.pdf --engine docling --format html --output report.html

# List available engines
docfold engines

# Compare engines on a document
docfold compare invoice.pdf

# Run evaluation benchmark
docfold evaluate tests/evaluation/dataset/ --output report.json

Batch Processing

Process hundreds of documents with bounded concurrency and progress tracking:

import asyncio

from docfold import EngineRouter
from docfold.engines.docling_engine import DoclingEngine


# Progress callback: receives keyword arguments for each processed file
def on_progress(*, current, total, file_path, engine_name, status, **_):
    print(f"[{current}/{total}] {status}: {file_path} ({engine_name})")


async def main():
    router = EngineRouter([DoclingEngine()])

    # Simple batch
    batch = await router.process_batch(
        ["invoice1.pdf", "invoice2.pdf", "report.docx"],
        concurrency=3,
    )
    print(f"{batch.succeeded}/{batch.total} succeeded in {batch.total_time_ms}ms")

    # With progress callback
    file_paths = ["invoice1.pdf", "invoice2.pdf", "report.docx"]  # placeholder list
    batch = await router.process_batch(
        file_paths,
        concurrency=5,
        on_progress=on_progress,
    )

    # Access results
    for path, result in batch.results.items():
        print(f"{path}: {len(result.content)} chars")

    # Check errors
    for path, error in batch.errors.items():
        print(f"FAILED {path}: {error}")


# process_batch() is awaited, so it must run inside an event loop
asyncio.run(main())
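
For readers curious what "bounded concurrency" means in practice, here is a minimal sketch of the general technique using asyncio.Semaphore. It does not show Docfold's actual implementation: BatchOutcome, fake_process, and run_batch are hypothetical stand-ins that only mimic the succeeded/total/results/errors shape used above.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class BatchOutcome:
    """Hypothetical stand-in for the object returned by process_batch()."""
    results: dict[str, str] = field(default_factory=dict)
    errors: dict[str, str] = field(default_factory=dict)

    @property
    def total(self) -> int:
        return len(self.results) + len(self.errors)

    @property
    def succeeded(self) -> int:
        return len(self.results)


async def fake_process(path: str) -> str:
    """Stand-in for a real engine call; fails on paths ending in '.bad'."""
    await asyncio.sleep(0.01)
    if path.endswith(".bad"):
        raise ValueError(f"unsupported file: {path}")
    return f"content of {path}"


async def run_batch(paths: list[str], concurrency: int = 3) -> BatchOutcome:
    """Process all paths with at most `concurrency` calls in flight at once."""
    semaphore = asyncio.Semaphore(concurrency)
    outcome = BatchOutcome()

    async def worker(path: str) -> None:
        async with semaphore:  # blocks while `concurrency` workers are busy
            try:
                outcome.results[path] = await fake_process(path)
            except Exception as exc:  # record the failure, keep the batch going
                outcome.errors[path] = str(exc)

    await asyncio.gather(*(worker(p) for p in paths))
    return outcome


if __name__ == "__main__":
    batch = asyncio.run(run_batch(["a.pdf", "b.pdf", "c.bad"], concurrency=2))
    print(f"{batch.succeeded}/{batch.total} succeeded")
```

Collecting failures per file instead of raising lets one bad document fail without aborting the rest of the batch, which matches the results/errors split shown above.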

Unified Result Format

Every engine returns the same EngineResult dataclass:

@dataclass
class EngineResult:
    content: str              # The extracted text (markdown/html/json/text)
    format: OutputFormat      # markdown | html | json | text
    engine_name: str          # Which engine produced this
    metadata: dict            # Engine-specific metadata
    pages: int | None         # Number of pages processed
    images: dict | None       # Extracted images {filename: base64}
    tables: list | None       # Extracted tables
    bounding_boxes: list | None  # Layout element positions
    confidence: float | None  # Overall confidence [0-1]
    processing_time_ms: int   # Wall-clock time

Evaluation Framework

Docfold includes a built-in evaluation harness to objectively compare engines:

pip install docfold[evaluation]
docfold evaluate path/to/dataset/ --engines docling,pymupdf,marker

Metrics measured:

| Metric | What it measures | Target |
|---|---|---|
| CER (Character Error Rate) | Character-level text accuracy | < 0.05 |
| WER (Word Error Rate) | Word-level text accuracy | < 0.10 |
| Table F1 | Table detection and cell content accuracy | > 0.85 |
| Heading F1 | Heading detection precision/recall | > 0.90 |
| Reading Order Score | Correctness of reading order (Kendall's tau) | > 0.90 |

See docs/evaluation.md for the ground truth JSON schema and detailed usage.
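
To make three of these metrics concrete, the sketch below computes CER and WER as Levenshtein edit distance divided by reference length, and Kendall's tau for a predicted block order. These are the textbook definitions; Docfold's own implementation may normalize text or scale scores differently, and the function names here are illustrative only. Table F1 and Heading F1 are not covered.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits divided by reference word count."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def kendall_tau(predicted_order: list[int]) -> float:
    """Kendall's tau between a predicted order and the true order 0..n-1.

    predicted_order[i] is the ground-truth index of the block the engine
    emitted at position i. Returns 1.0 for a perfect order and -1.0 for a
    fully reversed one.
    """
    n = len(predicted_order)
    if n < 2:
        return 1.0
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if predicted_order[i] < predicted_order[j]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


if __name__ == "__main__":
    print(cer("invoice", "invoce"))
    print(wer("total amount due", "total amount dew"))
    print(kendall_tau([0, 2, 1, 3]))
```

The pairwise loop in kendall_tau is quadratic in the number of blocks, which is usually acceptable for per-page layouts.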

Architecture

                        ┌─────────────────────────────┐
                        │       Your Application      │
                        └──────────┬──────────────────┘
                                   │
                        ┌──────────▼──────────────────┐
                        │       EngineRouter          │
                        │  select() / process()       │
                        │  process_batch() / compare() │
                        └──────────┬──────────────────┘
                                   │
     ┌──────────┬───────┬──────────┴──────┬──────────┬──────────┐
     ▼          ▼       ▼                 ▼          ▼          ▼
┌────────┐ ┌────────┐ ┌──────────┐  ┌────────┐ ┌────────┐ ┌──────┐
│Docling │ │ MinerU │ │Unstructd │  │ Marker │ │PyMuPDF │ │ OCR  │
│(local) │ │(local) │ │ (local)  │  │ (SaaS) │ │(local) │ │Paddle│
└────────┘ └────────┘ └──────────┘  └────────┘ └────────┘ │Tess. │
     │          │           │            │          │      └──────┘
     │     ┌────────┐ ┌──────────┐ ┌────────┐      │          │
     │     │Llama   │ │ Mistral  │ │ Zerox  │      │          │
     │     │Parse   │ │  OCR     │ │ (VLM)  │      │          │
     │     │(SaaS)  │ │ (SaaS)  │ │        │      │          │
     │     └────────┘ └──────────┘ └────────┘      │          │
     │          │           │            │          │          │
     │     ┌────────┐ ┌──────────┐ ┌────────┐      │          │
     │     │Textract│ │Google    │ │ Azure  │      │          │
     │     │ (AWS)  │ │DocAI     │ │DocInt  │      │          │
     │     │        │ │ (GCP)    │ │        │      │          │
     │     └────────┘ └──────────┘ └────────┘      │          │
     └──────────┴───────┴─────────────┴─────────────┴──────────┘
                                   │
                          ┌────────▼───────┐
                          │  EngineResult  │
                          │  (unified)     │
                          └────────────────┘

Engine Selection Logic

The router picks an engine by checking the following sources in priority order:

  1. Explicit hint: engine_hint="docling" in the call
  2. Environment default: ENGINE_DEFAULT=docling env var
  3. Extension-aware priority: each file type has its own engine priority chain (e.g., .png prefers PaddleOCR, .pdf prefers Docling, .docx skips PDF-only engines)
  4. User-configurable: override with fallback_order or restrict with allowed_engines

# Restrict to specific engines
router = EngineRouter(engines, allowed_engines={"docling", "pymupdf"})

# Custom fallback order
router = EngineRouter(engines, fallback_order=["pymupdf", "docling", "marker"])

# CLI: --engines flag
# docfold convert invoice.pdf --engines docling,pymupdf

Adding a Custom Engine

Implement the DocumentEngine interface:

from docfold.engines.base import DocumentEngine, EngineResult, OutputFormat

class MyEngine(DocumentEngine):
    @property
    def name(self) -> str:
        return "my_engine"

    @property
    def supported_extensions(self) -> set[str]:
        return {"pdf", "docx"}

    def is_available(self) -> bool:
        try:
            import my_library
            return True
        except ImportError:
            return False

    async def process(self, file_path, output_format=OutputFormat.MARKDOWN, **kwargs):
        # Your extraction logic here
        content = extract(file_path)
        return EngineResult(
            content=content,
            format=output_format,
            engine_name=self.name,
        )

# Register it
router.register(MyEngine())

Related Projects

Docfold builds on and integrates with these excellent projects:

| Project | Description |
|---|---|
| Docling | IBM's document conversion toolkit for PDF, DOCX, PPTX, and more |
| MinerU / PDF-Extract-Kit | End-to-end PDF structuring with layout analysis and formula recognition |
| Marker | High-quality PDF to Markdown converter |
| PyMuPDF | Fast PDF/XPS/EPUB processing library |
| PaddleOCR | Multilingual OCR toolkit (80+ languages) |
| Tesseract | Open-source OCR engine (100+ languages) |
| Unstructured | ETL toolkit for diverse document types |
| LlamaParse | LLM-powered document parsing |
| Mistral OCR | Vision LLM document understanding |
| Zerox | Model-agnostic Vision LLM OCR |
| Nougat | Meta's academic PDF to Markdown model |
| Surya | Multilingual OCR + layout analysis |

Built by

| Project | Description |
|---|---|
| Datatera.ai | AI-powered data transformation and document processing platform |
| Orquesta AI | AI orchestration and agent management platform |
| AI Agent Labs | AI agent services and location-based intelligence |

Development

git clone https://github.com/mihailorama/docfold.git
cd docfold
pip install -e ".[dev]"

# Run tests
pytest

# Run linting
ruff check src/ tests/
mypy src/

See CONTRIBUTING.md for detailed guidelines.

License

MIT. See LICENSE.

Note: Some engine backends have their own licenses (AGPL-3.0 for PyMuPDF and MinerU, GPL-3.0 for Surya, SaaS terms for Marker/LlamaParse/Mistral). Docfold itself is MIT — the engine adapters are optional extras that you install separately.
