Turn any document into structured data. Unified interface for document structuring engines with built-in evaluation.
Docfold
Turn any document into structured data. Unified Python toolkit for document structuring — one interface, 16 engines, built-in benchmarks.
Docfold is the open-source extraction engine from Datatera.ai. It was spun out of our commercial enterprise AI data platform and is battle-tested in production with enterprise clients across finance, insurance, and mining.
Read the announcement: Docfold - open-source document processing toolkit
Engine Comparison
Research-based estimates from public benchmarks, documentation, and community reports. See detailed methodology. Run your own:
docfold compare your_doc.pdf
| Engine | docfold | Type | License | Text PDF | Scan/OCR | Tables | BBox | Conf | Speed | Cost |
|---|---|---|---|---|---|---|---|---|---|---|
| Docling | ✅ | Local | MIT | ★★★ | ★★☆ | ★★★ | ✅ | — | Medium | Free |
| MinerU | ✅ | Local | AGPL | ★★★ | ★★★ | ★★★ | — | — | Slow | Free |
| Marker | ✅ | SaaS | Paid | ★★★ | ★★★ | ★★★ | ✅ | — | Fast | $$ |
| PyMuPDF | ✅ | Local | AGPL | ★★★ | ☆☆☆ | ★☆☆ | — | — | Ultra | Free |
| PaddleOCR | ✅ | Local | Apache | ★☆☆ | ★★★ | ★★☆ | — | ✅ | Medium | Free |
| Tesseract | ✅ | Local | Apache | ★☆☆ | ★★☆ | ★☆☆ | — | — | Medium | Free |
| EasyOCR | ✅ | Local | Apache | ★☆☆ | ★★★ | ☆☆☆ | — | ✅ | Medium | Free |
| Unstructured | ✅ | Local | Apache | ★★☆ | ★★☆ | ★★☆ | — | — | Medium | Free |
| LlamaParse | ✅ | SaaS | Paid | ★★★ | ★★★ | ★★★ | — | — | Fast | $$ |
| Mistral OCR | ✅ | SaaS | Paid | ★★★ | ★★★ | ★★★ | — | — | Fast | $$ |
| Zerox | ✅ | VLM | MIT | ★★★ | ★★★ | ★★☆ | — | — | Slow | $$$ |
| AWS Textract | ✅ | SaaS | Paid | ★★★ | ★★★ | ★★★ | ✅ | ✅ | Fast | $$ |
| Google Doc AI | ✅ | SaaS | Paid | ★★★ | ★★★ | ★★★ | ✅ | ✅ | Fast | $$ |
| Azure Doc Intel | ✅ | SaaS | Paid | ★★★ | ★★★ | ★★★ | ✅ | ✅ | Fast | $$ |
| Nougat | ✅ | Local | MIT | ★★★ | ★★☆ | ★★☆ | — | — | Slow | Free |
| Surya | ✅ | Local | GPL | ★★☆ | ★★★ | ★★☆ | ✅ | ✅ | Medium | Free |
Ratings: ★★★ Excellent · ★★☆ Good · ★☆☆ Basic · ☆☆☆ Not supported
Cost: $$ ≈ $1-3 per 1K pages · $$$ ≈ $5-15 per 1K pages
Columns: BBox = bounding boxes · Conf = confidence scores
Full engine profiles, format matrix, hardware requirements, and cost breakdown →
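As a rough illustration of the cost legend, the sketch below turns the per-1K-page ranges into a budget estimate for a given page count. The tier table and the `estimate_cost` helper are hypothetical names used only for this example; they are not part of Docfold, and the figures are just the approximate ranges quoted in the legend.

```python
# Approximate cost ranges in USD per 1,000 pages, taken from the legend above.
# "Free" covers local engines with no per-page fee (hardware cost not included).
COST_PER_1K_PAGES = {
    "Free": (0.0, 0.0),
    "$$": (1.0, 3.0),
    "$$$": (5.0, 15.0),
}


def estimate_cost(pages: int, tier: str) -> tuple[float, float]:
    """Return a (low, high) USD estimate for processing `pages` pages."""
    low, high = COST_PER_1K_PAGES[tier]
    return pages / 1000 * low, pages / 1000 * high


low, high = estimate_cost(250_000, "$$")
print(f"250K pages on a $$ engine: ${low:,.0f} to ${high:,.0f}")
```

Actual pricing depends on the provider and plan, so treat the result as an order-of-magnitude check rather than a quote.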
How to Choose
| Your situation | Recommended engine |
|---|---|
| Digital PDF, speed is critical | PyMuPDF — zero deps, ~1000 pages/sec |
| Scanned documents, need OCR | PaddleOCR (80+ langs), EasyOCR (PyTorch), or Tesseract (100+ langs) |
| Complex layouts + tables | Docling or MinerU (free), LlamaParse (paid) |
| Academic papers + math formulas | MinerU or Nougat (free), Mistral OCR (paid) |
| Best quality, budget available | Mistral OCR or LlamaParse |
| Use any Vision LLM (GPT-4o, Claude, etc.) | Zerox — model-agnostic |
| Self-hosted, all-in-one ETL | Unstructured with hi_res strategy |
| Diverse file types (not just PDF) | Docling or Unstructured |
| Need bounding boxes + confidence | Textract, Google DocAI, or Azure DocInt |
| Office files (DOCX/PPTX/XLSX) | Docling, Marker, Unstructured, or Azure DocInt |
| AWS/GCP/Azure native pipeline | Textract / Google DocAI / Azure DocInt |
Why Docfold?
Every engine has trade-offs. Docfold lets you switch between them with one line:
| Challenge | Without Docfold | With Docfold |
|---|---|---|
| Try a new engine | Rewrite your pipeline | Change one string: engine_hint="docling" |
| Compare quality | Manual side-by-side | router.compare("doc.pdf") — one line |
| Batch 1000 files | Build your own concurrency | router.process_batch(files, concurrency=5) |
| Measure accuracy | Write custom metrics | Built-in CER, WER, Table F1, Reading Order |
| Switch engines later | Major refactor | Zero code changes — same EngineResult |
import asyncio

from docfold import EngineRouter
from docfold.engines.docling_engine import DoclingEngine
from docfold.engines.pymupdf_engine import PyMuPDFEngine

async def main():
    router = EngineRouter([DoclingEngine(), PyMuPDFEngine()])

    # Auto-select the best available engine
    result = await router.process("invoice.pdf")
    print(result.content)             # Markdown output
    print(result.engine_name)         # Which engine was used
    print(result.processing_time_ms)  # Wall-clock time in ms

    # Compare all engines on the same document
    results = await router.compare("invoice.pdf")
    for name, res in results.items():
        print(f"{name}: {len(res.content)} chars in {res.processing_time_ms}ms")

# process() and compare() are coroutines, so run them inside an event loop
asyncio.run(main())
Supported Engines
| Engine | Type | License | Formats | GPU | Install |
|---|---|---|---|---|---|
| Docling | Local | MIT | PDF, DOCX, PPTX, XLSX, HTML, images | No | pip install docfold[docling] |
| MinerU | Local | AGPL-3.0 | | Recommended | pip install docfold[mineru] |
| Marker API | SaaS | Paid | PDF, Office, images | N/A | pip install docfold[marker] |
| PyMuPDF | Local | AGPL-3.0 | | No | pip install docfold[pymupdf] |
| PaddleOCR | Local | Apache-2.0 | Images, scanned PDFs | Optional | pip install docfold[paddleocr] |
| Tesseract | Local | Apache-2.0 | Images, scanned PDFs | No | pip install docfold[tesseract] |
| EasyOCR | Local | Apache-2.0 | Images, scanned PDFs | Optional | pip install docfold[easyocr] |
| Unstructured | Local | Apache-2.0 | PDF, Office, HTML, email, ePub | Optional | pip install docfold[unstructured] |
| LlamaParse | SaaS | Paid | PDF, Office, images | N/A | pip install docfold[llamaparse] |
| Mistral OCR | SaaS | Paid | PDF, images | N/A | pip install docfold[mistral-ocr] |
| Zerox | VLM | MIT | PDF, images | Depends | pip install docfold[zerox] |
| AWS Textract | SaaS | Paid | PDF, images | N/A | pip install docfold[textract] |
| Google Doc AI | SaaS | Paid | PDF, images | N/A | pip install docfold[google-docai] |
| Azure Doc Intel | SaaS | Paid | PDF, Office, HTML, images | N/A | pip install docfold[azure-docint] |
| Nougat | Local | MIT (code) | | Recommended | pip install docfold[nougat] |
| Surya | Local | GPL-3.0 | PDF, images | Optional | pip install docfold[surya] |
Adding your own engine? Implement the DocumentEngine interface; see Adding a Custom Engine below.
Installation
# Core only (no engines — useful for writing custom adapters)
pip install docfold
# With specific engines
pip install docfold[docling]
pip install docfold[docling,pymupdf,tesseract]
# Everything
pip install docfold[all]
Requires Python 3.10+.
CLI
# Convert a document
docfold convert invoice.pdf
docfold convert report.pdf --engine docling --format html --output report.html
# List available engines
docfold engines
# Compare engines on a document
docfold compare invoice.pdf
# Run evaluation benchmark
docfold evaluate tests/evaluation/dataset/ --output report.json
Batch Processing
Process hundreds of documents with bounded concurrency and progress tracking:
import asyncio

from docfold import EngineRouter
from docfold.engines.docling_engine import DoclingEngine

# Progress callback; it receives keyword arguments only
def on_progress(*, current, total, file_path, engine_name, status, **_):
    print(f"[{current}/{total}] {status}: {file_path} ({engine_name})")

async def main():
    router = EngineRouter([DoclingEngine()])
    file_paths = ["invoice1.pdf", "invoice2.pdf", "report.docx"]

    # Simple batch
    batch = await router.process_batch(file_paths, concurrency=3)
    print(f"{batch.succeeded}/{batch.total} succeeded in {batch.total_time_ms}ms")

    # With progress callback
    batch = await router.process_batch(
        file_paths,
        concurrency=5,
        on_progress=on_progress,
    )

    # Access results
    for path, result in batch.results.items():
        print(f"{path}: {len(result.content)} chars")

    # Check errors
    for path, error in batch.errors.items():
        print(f"FAILED {path}: {error}")

asyncio.run(main())
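For readers curious how bounded concurrency with progress tracking can work in general, here is a minimal, self-contained sketch built on `asyncio.Semaphore`. It is not Docfold's implementation: `run_batch`, `fake_worker`, and the callback arguments are hypothetical stand-ins chosen for illustration.

```python
import asyncio


async def run_batch(paths, worker, concurrency=3, on_progress=None):
    """Run `worker(path)` for every path, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)
    results, errors = {}, {}
    done = 0

    async def run_one(path):
        nonlocal done
        async with semaphore:  # blocks while `concurrency` workers are active
            try:
                results[path] = await worker(path)
                status = "ok"
            except Exception as exc:  # collect the failure, keep the batch going
                errors[path] = str(exc)
                status = "error"
        done += 1
        if on_progress:
            on_progress(current=done, total=len(paths), file_path=path, status=status)

    await asyncio.gather(*(run_one(p) for p in paths))
    return results, errors


async def fake_worker(path):
    """Stand-in for a real engine call (hypothetical)."""
    await asyncio.sleep(0.01)
    if path.endswith(".bad"):
        raise ValueError("unsupported file")
    return f"content of {path}"


results, errors = asyncio.run(
    run_batch(["a.pdf", "b.pdf", "c.bad"], fake_worker, concurrency=2)
)
print(sorted(results), errors)
```

The semaphore caps how many files are in flight, while per-file exceptions are recorded instead of aborting the whole batch, which matches the results/errors split shown above.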
Unified Result Format
Every engine returns the same EngineResult dataclass:
@dataclass
class EngineResult:
    content: str                 # The extracted text (markdown/html/json/text)
    format: OutputFormat         # markdown | html | json | text
    engine_name: str             # Which engine produced this
    metadata: dict               # Engine-specific metadata
    pages: int | None            # Number of pages processed
    images: dict | None          # Extracted images {filename: base64}
    tables: list | None          # Extracted tables
    bounding_boxes: list | None  # Layout element positions
    confidence: float | None     # Overall confidence [0-1]
    processing_time_ms: int      # Wall-clock time
Evaluation Framework
Docfold includes a built-in evaluation harness to objectively compare engines:
pip install docfold[evaluation]
docfold evaluate path/to/dataset/ --engines docling,pymupdf,marker
Metrics measured:
| Metric | What it measures | Target |
|---|---|---|
| CER (Character Error Rate) | Character-level text accuracy | < 0.05 |
| WER (Word Error Rate) | Word-level text accuracy | < 0.10 |
| Table F1 | Table detection and cell content accuracy | > 0.85 |
| Heading F1 | Heading detection precision/recall | > 0.90 |
| Reading Order Score | Correctness of reading order (Kendall's tau) | > 0.90 |
See docs/evaluation.md for the ground truth JSON schema and detailed usage.
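To make the text metrics concrete, here is a small sketch of how CER, WER, and a Kendall's tau reading-order score are commonly computed. It uses the standard textbook definitions (Levenshtein distance divided by reference length, and concordant minus discordant pairs over all pairs). It is an assumption that Docfold's evaluation harness follows the same formulas, and all function names here are illustrative.

```python
from itertools import combinations


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits / reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits / number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def kendall_tau(predicted_positions) -> float:
    """Kendall's tau for a predicted order.

    predicted_positions[i] is the predicted position of the i-th element
    in ground-truth reading order; values are assumed to be distinct.
    """
    n = len(predicted_positions)
    if n < 2:
        return 1.0
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        if predicted_positions[i] < predicted_positions[j]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


print(cer("invoice total", "invoice tota1"))          # one wrong character out of 13
print(wer("invoice total due", "invoice tota1 due"))  # one wrong word out of 3
print(kendall_tau([0, 2, 1, 3]))                      # one swapped pair out of 6
```

Lower is better for CER and WER, while the reading-order score ranges from -1 (fully reversed) to 1 (identical order).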
Architecture
                       ┌─────────────────────────────┐
                       │      Your Application       │
                       └──────────┬──────────────────┘
                                  │
                       ┌──────────▼──────────────────┐
                       │        EngineRouter         │
                       │    select() / process()     │
                       │ process_batch() / compare() │
                       └──────────┬──────────────────┘
                                  │
    ┌──────────┬───────┬──────────┴──────┬──────────┬──────────┐
    ▼          ▼       ▼                 ▼          ▼          ▼
┌────────┐ ┌────────┐ ┌──────────┐ ┌────────┐ ┌────────┐ ┌──────┐
│Docling │ │ MinerU │ │Unstructd │ │ Marker │ │PyMuPDF │ │ OCR  │
│(local) │ │(local) │ │ (local)  │ │ (SaaS) │ │(local) │ │Paddle│
└────────┘ └────────┘ └──────────┘ └────────┘ └────────┘ │Tess. │
    │          │       │             │             │     └──────┘
    │      ┌────────┐ ┌──────────┐ ┌────────┐      │          │
    │      │Llama   │ │ Mistral  │ │ Zerox  │      │          │
    │      │Parse   │ │   OCR    │ │ (VLM)  │      │          │
    │      │(SaaS)  │ │  (SaaS)  │ │        │      │          │
    │      └────────┘ └──────────┘ └────────┘      │          │
    │          │       │             │             │          │
    │      ┌────────┐ ┌──────────┐ ┌────────┐      │          │
    │      │Textract│ │Google    │ │ Azure  │      │          │
    │      │ (AWS)  │ │DocAI     │ │DocInt  │      │          │
    │      │        │ │ (GCP)    │ │        │      │          │
    │      └────────┘ └──────────┘ └────────┘      │          │
    └──────────┴───────┴─────────────┴─────────────┴──────────┘
                                 │
                        ┌────────▼───────┐
                        │  EngineResult  │
                        │   (unified)    │
                        └────────────────┘
Engine Selection Logic
When no engine is explicitly specified, the router selects one automatically:
- Explicit hint: engine_hint="docling" in the call
- Environment default: ENGINE_DEFAULT=docling env var
- Extension-aware priority: each file type has its own engine priority chain (e.g., .png prefers PaddleOCR, .pdf prefers Docling, .docx skips PDF-only engines)
- User-configurable: override with fallback_order or restrict with allowed_engines
# Restrict to specific engines
router = EngineRouter(engines, allowed_engines={"docling", "pymupdf"})
# Custom fallback order
router = EngineRouter(engines, fallback_order=["pymupdf", "docling", "marker"])
# CLI: --engines flag
# docfold convert invoice.pdf --engines docling,pymupdf
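The selection rules above can be sketched as a small standalone function. This is only an illustration of the described priority (hint, then environment default, then a per-extension chain that a custom fallback order replaces). The `select_engine` function, the `EXTENSION_PRIORITY` table, and its contents are hypothetical and do not reproduce Docfold's internal code or its real priority chains.

```python
import os
from pathlib import Path

# Hypothetical per-extension priority chains, for illustration only.
EXTENSION_PRIORITY = {
    "png": ["paddleocr", "tesseract"],
    "pdf": ["docling", "pymupdf"],
    "docx": ["docling"],
}


def select_engine(file_path, available, engine_hint=None,
                  allowed_engines=None, fallback_order=None):
    """Pick an engine name following the priority described above."""

    def usable(name):
        # An engine must be installed and, if a restriction is set, allowed.
        return name in available and (allowed_engines is None or name in allowed_engines)

    # 1. Explicit hint from the caller
    if engine_hint and usable(engine_hint):
        return engine_hint

    # 2. Environment default
    env_default = os.environ.get("ENGINE_DEFAULT")
    if env_default and usable(env_default):
        return env_default

    # 3. Extension-aware chain, replaced by a user-supplied fallback order
    ext = Path(file_path).suffix.lstrip(".").lower()
    chain = fallback_order or EXTENSION_PRIORITY.get(ext, [])
    for name in chain:
        if usable(name):
            return name

    raise LookupError(f"no usable engine for {file_path}")


installed = {"docling", "pymupdf", "paddleocr"}
print(select_engine("scan.png", installed))
print(select_engine("invoice.pdf", installed, allowed_engines={"pymupdf"}))
```

Treating `allowed_engines` as a filter applied at every step keeps a restricted router from ever returning an engine outside the allowed set, even when a hint or environment default names one.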
Adding a Custom Engine
Implement the DocumentEngine interface:
from docfold.engines.base import DocumentEngine, EngineResult, OutputFormat

class MyEngine(DocumentEngine):
    @property
    def name(self) -> str:
        return "my_engine"

    @property
    def supported_extensions(self) -> set[str]:
        return {"pdf", "docx"}

    def is_available(self) -> bool:
        try:
            import my_library
            return True
        except ImportError:
            return False

    async def process(self, file_path, output_format=OutputFormat.MARKDOWN, **kwargs):
        # Your extraction logic here
        content = extract(file_path)
        return EngineResult(
            content=content,
            format=output_format,
            engine_name=self.name,
        )

# Register it
router.register(MyEngine())
Related Projects
Docfold builds on and integrates with these excellent projects:
| Project | Description |
|---|---|
| Docling | IBM's document conversion toolkit — PDF, DOCX, PPTX, and more |
| MinerU / PDF-Extract-Kit | End-to-end PDF structuring with layout analysis and formula recognition |
| Marker | High-quality PDF to Markdown converter |
| PyMuPDF | Fast PDF/XPS/EPUB processing library |
| PaddleOCR | Multilingual OCR toolkit (80+ languages) |
| Tesseract | Open-source OCR engine (100+ languages) |
| Unstructured | ETL toolkit for diverse document types |
| LlamaParse | LLM-powered document parsing |
| Mistral OCR | Vision LLM document understanding |
| Zerox | Model-agnostic Vision LLM OCR |
| Nougat | Meta's academic PDF to Markdown model |
| Surya | Multilingual OCR + layout analysis |
Built by
| Project | Description |
|---|---|
| Datatera.ai | AI-powered data transformation and document processing platform |
| Orquesta AI | AI orchestration and agent management platform |
| AI Agent Labs | AI agent services and location-based intelligence |
Development
git clone https://github.com/mihailorama/docfold.git
cd docfold
pip install -e ".[dev]"
# Run tests
pytest
# Run linting
ruff check src/ tests/
mypy src/
See CONTRIBUTING.md for detailed guidelines.
License
MIT. See LICENSE.
Note: Some engine backends have their own licenses (AGPL-3.0 for PyMuPDF and MinerU, GPL-3.0 for Surya, SaaS terms for Marker/LlamaParse/Mistral). Docfold itself is MIT — the engine adapters are optional extras that you install separately.