Turn any document into structured data. Unified interface for document structuring engines with built-in evaluation.

Docfold

Turn any document into structured data. Unified Python toolkit for document structuring — one interface, 16 engines, built-in benchmarks.

Docfold is the open-source extraction engine from Datatera.ai. It was spun out of our commercial enterprise AI data platform and has been battle-tested in production with enterprise clients across finance, insurance, and mining.

Read the announcement: Docfold - open-source document processing toolkit

Engine Comparison

Research-based estimates from public benchmarks, documentation, and community reports. See detailed methodology. Run your own: docfold compare your_doc.pdf

Engine	docfold	Type	License	Text PDF	Scan/OCR	Tables	BBox	Conf	Speed	Cost
Docling	✅	Local	MIT	★★★	★★☆	★★★	✅	—	Medium	Free
MinerU	✅	Local	AGPL	★★★	★★★	★★★	—	—	Slow	Free
Marker	✅	SaaS	Paid	★★★	★★★	★★★	✅	—	Fast	$$
PyMuPDF	✅	Local	AGPL	★★★	☆☆☆	★☆☆	—	—	Ultra	Free
PaddleOCR	✅	Local	Apache	★☆☆	★★★	★★☆	—	✅	Medium	Free
Tesseract	✅	Local	Apache	★☆☆	★★☆	★☆☆	—	—	Medium	Free
EasyOCR	✅	Local	Apache	★☆☆	★★★	☆☆☆	—	✅	Medium	Free
Unstructured	✅	Local	Apache	★★☆	★★☆	★★☆	—	—	Medium	Free
LlamaParse	✅	SaaS	Paid	★★★	★★★	★★★	—	—	Fast	$$
Mistral OCR	✅	SaaS	Paid	★★★	★★★	★★★	—	—	Fast	$$
Zerox	✅	VLM	MIT	★★★	★★★	★★☆	—	—	Slow	$$$
AWS Textract	✅	SaaS	Paid	★★★	★★★	★★★	✅	✅	Fast	$$
Google Doc AI	✅	SaaS	Paid	★★★	★★★	★★★	✅	✅	Fast	$$
Azure Doc Intel	✅	SaaS	Paid	★★★	★★★	★★★	✅	✅	Fast	$$
Nougat	✅	Local	MIT	★★★	★★☆	★★☆	—	—	Slow	Free
Surya	✅	Local	GPL	★★☆	★★★	★★☆	✅	✅	Medium	Free

Ratings: ★★★ Excellent, ★★☆ Good, ★☆☆ Basic, ☆☆☆ Not supported. Cost: $$ is roughly $1-3 per 1K pages, $$$ roughly $5-15 per 1K pages. BBox: bounding boxes. Conf: confidence scores.
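To make the cost tiers concrete, the sketch below turns the per-1K-page ranges from the legend into a rough budget for a given page count. The `COST_PER_1K_PAGES` table and the `estimate_cost` helper are illustrative names, not part of Docfold, and actual vendor pricing will differ.

```python
# Rough cost ranges in USD per 1,000 pages, copied from the legend above.
COST_PER_1K_PAGES = {
    "Free": (0.0, 0.0),
    "$$": (1.0, 3.0),
    "$$$": (5.0, 15.0),
}


def estimate_cost(pages, tier):
    """Return a (low, high) USD estimate for processing `pages` pages."""
    low, high = COST_PER_1K_PAGES[tier]
    return (pages / 1000 * low, pages / 1000 * high)


if __name__ == "__main__":
    low, high = estimate_cost(25_000, "$$")
    print(f"25,000 pages at $$: ${low:.0f} to ${high:.0f}")
```

Treat the result as an order-of-magnitude figure only; the table itself describes these numbers as research-based estimates.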

Full engine profiles, format matrix, hardware requirements, and cost breakdown →

How to Choose

Your situation	Recommended engine
Digital PDF, speed is critical	PyMuPDF — zero deps, ~1000 pages/sec
Scanned documents, need OCR	PaddleOCR (80+ langs), EasyOCR (PyTorch), or Tesseract (100+ langs)
Complex layouts + tables	Docling or MinerU (free), LlamaParse (paid)
Academic papers + math formulas	MinerU or Nougat (free), Mistral OCR (paid)
Best quality, budget available	Mistral OCR or LlamaParse
Use any Vision LLM (GPT-4o, Claude, etc.)	Zerox — model-agnostic
Self-hosted, all-in-one ETL	Unstructured with hi_res strategy
Diverse file types (not just PDF)	Docling or Unstructured
Need bounding boxes + confidence	Textract, Google DocAI, or Azure DocInt
Office files (DOCX/PPTX/XLSX)	Docling, Marker, Unstructured, or Azure DocInt
AWS/GCP/Azure native pipeline	Textract / Google DocAI / Azure DocInt

Why Docfold?

Every engine has trade-offs. Docfold lets you switch between them with one line:

Challenge	Without Docfold	With Docfold
Try a new engine	Rewrite your pipeline	Change one string: `engine_hint="docling"`
Compare quality	Manual side-by-side	`router.compare("doc.pdf")` — one line
Batch 1000 files	Build your own concurrency	`router.process_batch(files, concurrency=5)`
Measure accuracy	Write custom metrics	Built-in CER, WER, Table F1, Reading Order
Switch engines later	Major refactor	Zero code changes — same `EngineResult`

import asyncio

from docfold import EngineRouter
from docfold.engines.docling_engine import DoclingEngine
from docfold.engines.pymupdf_engine import PyMuPDFEngine


async def main():
    router = EngineRouter([DoclingEngine(), PyMuPDFEngine()])

    # Auto-select the best available engine
    result = await router.process("invoice.pdf")
    print(result.content)             # Markdown output
    print(result.engine_name)         # Which engine was used
    print(result.processing_time_ms)  # Wall-clock time in ms

    # Compare all engines on the same document
    results = await router.compare("invoice.pdf")
    for name, res in results.items():
        print(f"{name}: {len(res.content)} chars in {res.processing_time_ms}ms")


# process() and compare() are awaited, so they must run inside an event loop
asyncio.run(main())

Supported Engines

Engine	Type	License	Formats	GPU	Install
Docling	Local	MIT	PDF, DOCX, PPTX, XLSX, HTML, images	No	`pip install docfold[docling]`
MinerU	Local	AGPL-3.0	PDF	Recommended	`pip install docfold[mineru]`
Marker API	SaaS	Paid	PDF, Office, images	N/A	`pip install docfold[marker]`
PyMuPDF	Local	AGPL-3.0	PDF	No	`pip install docfold[pymupdf]`
PaddleOCR	Local	Apache-2.0	Images, scanned PDFs	Optional	`pip install docfold[paddleocr]`
Tesseract	Local	Apache-2.0	Images, scanned PDFs	No	`pip install docfold[tesseract]`
EasyOCR	Local	Apache-2.0	Images, scanned PDFs	Optional	`pip install docfold[easyocr]`
Unstructured	Local	Apache-2.0	PDF, Office, HTML, email, ePub	Optional	`pip install docfold[unstructured]`
LlamaParse	SaaS	Paid	PDF, Office, images	N/A	`pip install docfold[llamaparse]`
Mistral OCR	SaaS	Paid	PDF, images	N/A	`pip install docfold[mistral-ocr]`
Zerox	VLM	MIT	PDF, images	Depends	`pip install docfold[zerox]`
AWS Textract	SaaS	Paid	PDF, images	N/A	`pip install docfold[textract]`
Google Doc AI	SaaS	Paid	PDF, images	N/A	`pip install docfold[google-docai]`
Azure Doc Intel	SaaS	Paid	PDF, Office, HTML, images	N/A	`pip install docfold[azure-docint]`
Nougat	Local	MIT (code)	PDF	Recommended	`pip install docfold[nougat]`
Surya	Local	GPL-3.0	PDF, images	Optional	`pip install docfold[surya]`

Adding your own engine? Implement the DocumentEngine interface — see Adding a Custom Engine below.

Installation

# Core only (no engines — useful for writing custom adapters)
pip install docfold

# With specific engines
pip install docfold[docling]
pip install docfold[docling,pymupdf,tesseract]

# Everything
pip install docfold[all]

Requires Python 3.10+.

CLI

# Convert a document
docfold convert invoice.pdf
docfold convert report.pdf --engine docling --format html --output report.html

# List available engines
docfold engines

# Compare engines on a document
docfold compare invoice.pdf

# Run evaluation benchmark
docfold evaluate tests/evaluation/dataset/ --output report.json

Batch Processing

Process hundreds of documents with bounded concurrency and progress tracking:

import asyncio

from docfold import EngineRouter
from docfold.engines.docling_engine import DoclingEngine


# Progress callback: receives keyword arguments only
def on_progress(*, current, total, file_path, engine_name, status, **_):
    print(f"[{current}/{total}] {status}: {file_path} ({engine_name})")


async def main():
    router = EngineRouter([DoclingEngine()])
    file_paths = ["invoice1.pdf", "invoice2.pdf", "report.docx"]

    # Simple batch
    batch = await router.process_batch(file_paths, concurrency=3)
    print(f"{batch.succeeded}/{batch.total} succeeded in {batch.total_time_ms}ms")

    # With progress callback
    batch = await router.process_batch(
        file_paths,
        concurrency=5,
        on_progress=on_progress,
    )

    # Access results
    for path, result in batch.results.items():
        print(f"{path}: {len(result.content)} chars")

    # Check errors
    for path, error in batch.errors.items():
        print(f"FAILED {path}: {error}")


asyncio.run(main())
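For readers curious what "bounded concurrency" means in practice, here is a minimal sketch of the general pattern using `asyncio.Semaphore`. It is not Docfold's implementation: `run_bounded` and `fake_process` are hypothetical names, and the stand-in worker only simulates an engine call.

```python
import asyncio


async def run_bounded(items, worker, concurrency=5):
    """Run `worker(item)` for every item with at most `concurrency` in flight.

    Returns (results, errors): two dicts keyed by item.
    """
    semaphore = asyncio.Semaphore(concurrency)
    results, errors = {}, {}

    async def guarded(item):
        async with semaphore:  # waits while `concurrency` workers are running
            try:
                results[item] = await worker(item)
            except Exception as exc:  # record the failure, keep the batch going
                errors[item] = exc

    await asyncio.gather(*(guarded(item) for item in items))
    return results, errors


async def fake_process(path):
    """Stand-in for a real engine call; fails on one hypothetical file type."""
    await asyncio.sleep(0.01)
    if path.endswith(".bad"):
        raise ValueError(f"cannot parse {path}")
    return f"content of {path}"


if __name__ == "__main__":
    files = ["a.pdf", "b.pdf", "c.bad"]
    ok, failed = asyncio.run(run_bounded(files, fake_process, concurrency=2))
    print(f"{len(ok)}/{len(files)} succeeded")
    for path, error in failed.items():
        print(f"FAILED {path}: {error}")
```

Collecting failures in a separate dict, rather than letting the first exception cancel the batch, mirrors the split between `batch.results` and `batch.errors` shown above.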

Unified Result Format

Every engine returns the same EngineResult dataclass:

@dataclass
class EngineResult:
    content: str              # The extracted text (markdown/html/json/text)
    format: OutputFormat      # markdown | html | json | text
    engine_name: str          # Which engine produced this
    metadata: dict            # Engine-specific metadata
    pages: int | None         # Number of pages processed
    images: dict | None       # Extracted images {filename: base64}
    tables: list | None       # Extracted tables
    bounding_boxes: list | None  # Layout element positions
    confidence: float | None  # Overall confidence [0-1]
    processing_time_ms: int   # Wall-clock time
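Because several fields are optional, downstream code should handle `None` explicitly. The sketch below shows one way to do that; `ResultStub` is a stand-in carrying a subset of the fields listed above (its defaults are an assumption), and `needs_review` and `table_count` are hypothetical helpers, not part of Docfold.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ResultStub:
    """Stand-in with a subset of the EngineResult fields listed above."""
    content: str
    engine_name: str
    confidence: float | None = None  # not every engine reports a score
    tables: list | None = None       # not every engine extracts tables


def needs_review(result: ResultStub, min_confidence: float = 0.8) -> bool:
    """Flag results that are empty or whose reported confidence is too low.

    Engines that report no confidence (None) are not flagged on that basis.
    """
    if not result.content.strip():
        return True
    if result.confidence is not None and result.confidence < min_confidence:
        return True
    return False


def table_count(result: ResultStub) -> int:
    """Treat a missing tables list the same as an empty one."""
    return len(result.tables or [])
```

The 0.8 threshold is arbitrary; pick one that matches the error tolerance of your own pipeline.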

Evaluation Framework

Docfold includes a built-in evaluation harness to objectively compare engines:

pip install docfold[evaluation]
docfold evaluate path/to/dataset/ --engines docling,pymupdf,marker

Metrics measured:

Metric	What it measures	Target
CER (Character Error Rate)	Character-level text accuracy	< 0.05
WER (Word Error Rate)	Word-level text accuracy	< 0.10
Table F1	Table detection and cell content accuracy	> 0.85
Heading F1	Heading detection precision/recall	> 0.90
Reading Order Score	Correctness of reading order (Kendall's tau)	> 0.90

See docs/evaluation.md for the ground truth JSON schema and detailed usage.
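As a rough guide to what these metrics compute, the sketch below gives textbook definitions of CER, WER, and Kendall's tau. Docfold's own implementation may normalize text or scale scores differently; the function names here are illustrative, and `kendall_tau` assumes the predicted order is a permutation with no ties.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character edits / reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference, hypothesis):
    """Word Error Rate: word edits / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


def kendall_tau(predicted_order):
    """Kendall's tau of a predicted block order against the true order 0..n-1.

    predicted_order[k] is the true index of the block the engine put at
    position k.
    """
    n = len(predicted_order)
    if n < 2:
        return 1.0
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if predicted_order[i] < predicted_order[j]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

For example, a four-character reference with one wrong character gives a CER of 0.25, well above the 0.05 target in the table.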

Architecture

                        ┌─────────────────────────────┐
                        │       Your Application      │
                        └──────────┬──────────────────┘
                                   │
                        ┌──────────▼──────────────────┐
                        │       EngineRouter          │
                        │  select() / process()       │
                        │ process_batch() / compare() │
                        └──────────┬──────────────────┘
                                   │
     ┌──────────┬───────┬──────────┴──────┬──────────┬──────────┐
     ▼          ▼       ▼                 ▼          ▼          ▼
┌────────┐ ┌────────┐ ┌──────────┐  ┌────────┐ ┌────────┐ ┌──────┐
│Docling │ │ MinerU │ │Unstructd │  │ Marker │ │PyMuPDF │ │ OCR  │
│(local) │ │(local) │ │ (local)  │  │ (SaaS) │ │(local) │ │Paddle│
└────────┘ └────────┘ └──────────┘  └────────┘ └────────┘ │Tess. │
     │          │           │            │          │      └──────┘
     │     ┌────────┐ ┌──────────┐ ┌────────┐      │          │
     │     │Llama   │ │ Mistral  │ │ Zerox  │      │          │
     │     │Parse   │ │  OCR     │ │ (VLM)  │      │          │
     │     │(SaaS)  │ │ (SaaS)   │ │        │      │          │
     │     └────────┘ └──────────┘ └────────┘      │          │
     │          │           │            │          │          │
     │     ┌────────┐ ┌──────────┐ ┌────────┐      │          │
     │     │Textract│ │Google    │ │ Azure  │      │          │
     │     │ (AWS)  │ │DocAI     │ │DocInt  │      │          │
     │     │        │ │ (GCP)    │ │        │      │          │
     │     └────────┘ └──────────┘ └────────┘      │          │
     └──────────┴───────┴─────────────┴─────────────┴──────────┘
                                   │
                          ┌────────▼───────┐
                          │  EngineResult  │
                          │  (unified)     │
                          └────────────────┘

Engine Selection Logic

The router picks an engine from the following sources; when no engine is named explicitly, selection is automatic:

Explicit hint — engine_hint="docling" in the call
Environment default — ENGINE_DEFAULT=docling env var
Extension-aware priority — each file type has its own engine priority chain (e.g., .png prefers PaddleOCR, .pdf prefers Docling, .docx skips PDF-only engines)
User-configurable — override with fallback_order or restrict with allowed_engines

# engines is a list of engine instances, e.g. [DoclingEngine(), PyMuPDFEngine()]

# Restrict to specific engines
router = EngineRouter(engines, allowed_engines={"docling", "pymupdf"})

# Custom fallback order
router = EngineRouter(engines, fallback_order=["pymupdf", "docling", "marker"])

# CLI: --engines flag
# docfold convert invoice.pdf --engines docling,pymupdf
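The selection rules above can be pictured with a small self-contained sketch. This is not Docfold's code: `select_engine` and the `PRIORITY` chains are hypothetical, and treating `fallback_order` as a replacement for the per-extension chain is an assumption made for illustration.

```python
import os
from pathlib import Path

# Hypothetical priority chains; the real per-extension tables live in Docfold.
PRIORITY = {
    ".pdf": ["docling", "pymupdf", "paddleocr"],
    ".png": ["paddleocr", "tesseract"],
    ".docx": ["docling", "unstructured"],
}


def select_engine(file_path, available, engine_hint=None,
                  allowed_engines=None, fallback_order=None):
    """Pick an engine name: hint, then ENGINE_DEFAULT, then a priority chain."""
    candidates = set(available)
    if allowed_engines is not None:
        candidates &= set(allowed_engines)  # restrict to the allowed subset

    # 1. Explicit hint
    if engine_hint is not None:
        if engine_hint not in candidates:
            raise ValueError(f"engine {engine_hint!r} is not available")
        return engine_hint

    # 2. Environment default
    env_default = os.environ.get("ENGINE_DEFAULT")
    if env_default in candidates:
        return env_default

    # 3. Caller's fallback_order if given, else the extension-aware chain
    chain = fallback_order or PRIORITY.get(Path(file_path).suffix.lower(), [])
    for name in chain:
        if name in candidates:
            return name
    raise LookupError(f"no engine available for {file_path}")
```

With Docling, PyMuPDF, and PaddleOCR available and no hint or environment variable set, this sketch returns `"paddleocr"` for `scan.png` and `"docling"` for `report.pdf`.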

Adding a Custom Engine

Implement the DocumentEngine interface:

from docfold.engines.base import DocumentEngine, EngineResult, OutputFormat

class MyEngine(DocumentEngine):
    @property
    def name(self) -> str:
        return "my_engine"

    @property
    def supported_extensions(self) -> set[str]:
        return {"pdf", "docx"}

    def is_available(self) -> bool:
        try:
            import my_library  # noqa: F401  (placeholder for your backend's package)
            return True
        except ImportError:
            return False

    async def process(self, file_path, output_format=OutputFormat.MARKDOWN, **kwargs):
        # Your extraction logic here; extract() is a placeholder for your own function
        content = extract(file_path)
        return EngineResult(
            content=content,
            format=output_format,
            engine_name=self.name,
        )

# Register it with an existing EngineRouter instance
router.register(MyEngine())

Related Projects

Docfold builds on and integrates with these excellent projects:

Project	Description
Docling	IBM's document conversion toolkit — PDF, DOCX, PPTX, and more
MinerU / PDF-Extract-Kit	End-to-end PDF structuring with layout analysis and formula recognition
Marker	High-quality PDF to Markdown converter
PyMuPDF	Fast PDF/XPS/EPUB processing library
PaddleOCR	Multilingual OCR toolkit (80+ languages)
Tesseract	Open-source OCR engine (100+ languages)
Unstructured	ETL toolkit for diverse document types
LlamaParse	LLM-powered document parsing
Mistral OCR	Vision LLM document understanding
Zerox	Model-agnostic Vision LLM OCR
Nougat	Meta's academic PDF to Markdown model
Surya	Multilingual OCR + layout analysis

Built by

Project	Description
Datatera.ai	AI-powered data transformation and document processing platform
Orquesta AI	AI orchestration and agent management platform
AI Agent Labs	AI agent services and location-based intelligence

Development

git clone https://github.com/mihailorama/docfold.git
cd docfold
pip install -e ".[dev]"

# Run tests
pytest

# Run linting
ruff check src/ tests/
mypy src/

See CONTRIBUTING.md for detailed guidelines.

License

MIT. See LICENSE.

Note: Some engine backends have their own licenses (AGPL-3.0 for PyMuPDF and MinerU, GPL-3.0 for Surya, SaaS terms for Marker/LlamaParse/Mistral). Docfold itself is MIT — the engine adapters are optional extras that you install separately.

Release history

Version	Date
0.6.13 (this version)	Mar 23, 2026
0.6.12	Mar 4, 2026
0.6.11	Mar 4, 2026
0.6.10	Mar 4, 2026
0.6.9	Mar 4, 2026
0.6.8	Mar 2, 2026
0.6.7	Mar 2, 2026
0.6.6	Mar 1, 2026
0.6.5	Mar 1, 2026
0.6.4	Mar 1, 2026
0.6.3	Feb 28, 2026
0.6.2	Feb 28, 2026
0.6.1	Feb 24, 2026
0.6.0	Feb 20, 2026
0.5.1	Feb 12, 2026
0.5.0	Feb 12, 2026
0.4.0	Feb 12, 2026
0.3.0	Feb 11, 2026

Download files

Source Distribution

docfold-0.6.13.tar.gz (100.2 kB view details)

Uploaded Mar 23, 2026 Source

Built Distribution

docfold-0.6.13-py3-none-any.whl (66.5 kB view details)

Uploaded Mar 23, 2026 Python 3

File details

Details for the file docfold-0.6.13.tar.gz.

File metadata

Download URL: docfold-0.6.13.tar.gz
Upload date: Mar 23, 2026
Size: 100.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for docfold-0.6.13.tar.gz
Algorithm	Hash digest
SHA256	`b370701ac44c89f62a2dd4c598f5372b09974ab44b72062aa341edf1ef6c1dc0`
MD5	`82c03b44edecd51aca5bad649eddd273`
BLAKE2b-256	`f471ce2c680016d4058c03d92eb6d92afedfa80890c039c995f43dba52746a5d`

See more details on using hashes here.
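One way to use the published digests is to hash a downloaded file locally and compare the result. The sketch below does this with the standard library's `hashlib`; the helper names are illustrative, and the local file path is an assumption about where you saved the download.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True if the file's SHA256 matches the published digest."""
    return sha256_of(path) == expected_sha256.strip().lower()


if __name__ == "__main__":
    # Digest published above for docfold-0.6.13.tar.gz.
    expected = "b370701ac44c89f62a2dd4c598f5372b09974ab44b72062aa341edf1ef6c1dc0"
    sdist = Path("docfold-0.6.13.tar.gz")  # assumed to be in the current directory
    if sdist.exists():
        print("OK" if verify_download(sdist, expected) else "MISMATCH")
```

Prefer SHA256 over MD5 for this check, since MD5 is no longer considered collision-resistant.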

Provenance

The following attestation bundles were made for docfold-0.6.13.tar.gz:

Publisher: ci.yml on Mihailorama/docfold

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docfold-0.6.13.tar.gz
- Subject digest: b370701ac44c89f62a2dd4c598f5372b09974ab44b72062aa341edf1ef6c1dc0
- Sigstore transparency entry: 1165539765
- Sigstore integration time: Mar 23, 2026
Source repository:
- Permalink: Mihailorama/docfold@930bba52a90f53de635ca1db5f649efcef6b092b
- Branch / Tag: refs/tags/v0.6.13
- Owner: https://github.com/Mihailorama
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@930bba52a90f53de635ca1db5f649efcef6b092b
- Trigger Event: push

File details

Details for the file docfold-0.6.13-py3-none-any.whl.

File metadata

Download URL: docfold-0.6.13-py3-none-any.whl
Upload date: Mar 23, 2026
Size: 66.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for docfold-0.6.13-py3-none-any.whl
Algorithm	Hash digest
SHA256	`442f14209e48180ea922ea87c4ecb57f673df566df4820f4d59cd86f07a78175`
MD5	`4ef3536b4f7a470dc4767d8f356d9cfb`
BLAKE2b-256	`a173299615ee11f4d77ffef5208c412065ed4c0198cd42c559dd40bf5b2310f0`

See more details on using hashes here.

Provenance

The following attestation bundles were made for docfold-0.6.13-py3-none-any.whl:

Publisher: ci.yml on Mihailorama/docfold

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docfold-0.6.13-py3-none-any.whl
- Subject digest: 442f14209e48180ea922ea87c4ecb57f673df566df4820f4d59cd86f07a78175
- Sigstore transparency entry: 1165539877
- Sigstore integration time: Mar 23, 2026
Source repository:
- Permalink: Mihailorama/docfold@930bba52a90f53de635ca1db5f649efcef6b092b
- Branch / Tag: refs/tags/v0.6.13
- Owner: https://github.com/Mihailorama
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@930bba52a90f53de635ca1db5f649efcef6b092b
- Trigger Event: push
