Universal document parser: PDF / Office / email / images / HTML — tiered routing for cost & accuracy
mag-file-handler
Universal document parser for any file format your product receives.
from file_handler import parse
result = parse("inbound/email.eml")
print(result.text) # extracted plain text
print(result.format) # "eml"
print(result.engine) # "email"
print(result.extra) # engine-specific provenance
Why this exists
Different document types need different tools — and getting it wrong is expensive. Born-digital PDFs should hit a free text-layer extractor; scanned newspapers should hit Tika; complex slides and textbooks need a vision LLM. This library routes each document to the right engine, automatically.
System requirements
- Python 3.10, 3.11, 3.12, or 3.13 (CPython).
- OS / arch: Linux x86-64 (glibc ≥ 2.28), macOS x86-64 (≥ 10.12), macOS ARM64 (≥ 11), Windows x86-64. Linux ARM64 (Graviton, RPi) and Windows ARM64 are not currently supported because the extractous dependency does not ship wheels for those targets.
- No JVM is required. extractous ships native binaries built with GraalVM AOT compilation, so installation and runtime are pure-native even though Apache Tika is used internally for the long tail of formats.
- Optional: a VisionClient implementation (yours or your platform's LLM gateway's) for image OCR and scanned-PDF Tier 2 routing. See Vision is pluggable.
Install
pip install mag-file-handler # core (PDF router, Office, HTML, txt, EML)
pip install mag-file-handler[email] # + Outlook .msg support
pip install mag-file-handler[all] # everything optional
[vision] is a documentation-only marker extra: the library ships zero LLM SDK dependencies by design. To use vision OCR, pass a VisionClient you implement (or one your platform provides) into parse(path, vision_client=...). See below for the protocol.
Usage
Library
import logging

from file_handler import parse

log = logging.getLogger(__name__)

result = parse("/path/to/file")
if result.ok:
    process(result.text)  # process() stands in for your own handler
else:
    log.warning("parse failed: %s", result.error)
# Engine-specific provenance lives in result.extra
if result.engine == "pdf_router":
    print(result.extra["tier_counts"])   # {"tier0_pdfium": 5, "tier1_extractous": 0, "tier2_claude": 2}
    print(result.extra["vision_usage"])  # {"input_tokens": ..., "output_tokens": ...}
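The tier counts can be turned into a rough cost figure. The sketch below is illustrative only: the helper name estimate_cost is hypothetical, the unit prices come from the routing diagram in this README, and reading the Tier 2 figure as a per-page cost is an assumption.

```python
# Rough unit costs in USD, taken from the routing diagram in this README.
# Treating the Tier 2 figure as a per-page cost is an assumption.
ASSUMED_COST_PER_PAGE = {
    "tier0_pdfium": 0.0,
    "tier1_extractous": 0.0,
    "tier2_claude": 0.008,
}


def estimate_cost(tier_counts: dict) -> float:
    """Return an approximate USD cost for one parsed PDF (hypothetical helper)."""
    total = 0.0
    for tier, pages in tier_counts.items():
        # Unknown tier names are counted as free rather than raising.
        total += ASSUMED_COST_PER_PAGE.get(tier, 0.0) * pages
    return round(total, 4)


example = {"tier0_pdfium": 5, "tier1_extractous": 0, "tier2_claude": 2}
print(estimate_cost(example))
```

Real spend depends on the model and gateway behind your VisionClient, so treat the number as an order-of-magnitude check.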
CLI
file-handler parse document.pdf # prints extracted text
file-handler parse document.pdf --json # full result as JSON
file-handler detect document.pdf # format detection only
file-handler info # version + which engines are available
How routing works
                FORMAT DETECTION
               (magic bytes + ext)
                        │
    ┌───────────────────┼───────────────────────┐
    ▼                   ▼                       ▼
   PDF          Image (jpg/png/…)        Email (eml/msg)
    │                   │                       │
    ▼                   ▼                       ▼
PDF Router        Claude Vision        parse + recurse on
(per-page tiers)                       each attachment
┌──────────────┐
│ Tier 0   pdfium native text-layer     free, ms-fast
│ Tier 1   Extractous (Tika)            free, ~2 s
│ Tier 2   Claude Haiku 4.5 vision      ~$0.008, ~15 s
└──────────────┘
Office, HTML, MD, TXT, … → Extractous (Tika handles them natively)
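To make the "magic bytes + ext" step concrete, here is a minimal sketch of that style of detection. It is not the library's detect(): the function name sniff_format, the signature table, and the "none" confidence value are assumptions made for illustration.

```python
from pathlib import PurePath

# Leading byte signatures for a few common formats.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
]

# Extension fallback for formats with no reliable signature (illustrative subset).
EXTENSION_FORMATS = {".eml": "eml", ".html": "html", ".md": "md", ".txt": "txt"}


def sniff_format(head: bytes, filename: str) -> tuple:
    """Return (format, confidence); content signatures win over the extension."""
    for signature, fmt in MAGIC_SIGNATURES:
        if head.startswith(signature):
            return fmt, "magic"
    ext = PurePath(filename).suffix.lower()
    if ext in EXTENSION_FORMATS:
        return EXTENSION_FORMATS[ext], "ext"
    return "unknown", "none"


print(sniff_format(b"%PDF-1.7\n", "scan.bin"))
print(sniff_format(b"Subject: hi\n", "inbound.eml"))
```

Checking content before the extension means a PDF saved with the wrong suffix still reaches the PDF router.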
For PDFs, the per-page decision tree picks:
| Signal | Decision |
|---|---|
| text_layer_chars >= 100 | Tier 0: free, instant |
| is_broadsheet (long edge ≥ 1500 pt) | Tier 1: Extractous wins on dense newspapers |
| clean_columned (2–3 uniform cols) | Tier 1: Extractous wins on structured columns |
| else | Tier 2: LLM vision (slides, mixed cols, textbooks) |
If Tier 0 / Tier 1 returns near-empty text on a page that visibly has content, the engine falls back to Tier 2 automatically (conservative — only on empty output, not on questionable quality).
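The decision order and the fallback rule can be sketched as follows. This is an illustration of the rules as written in this README, not the router's actual code: PageSignals, decide_tier, needs_vision_fallback, and the min_chars threshold are all hypothetical, and the check that the page "visibly has content" is left out.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    """Hypothetical per-page features; names mirror the signals listed above."""
    text_layer_chars: int
    long_edge_pt: float
    clean_columned: bool


def decide_tier(page: PageSignals) -> tuple:
    """Apply the rules top to bottom and return (tier, reason)."""
    if page.text_layer_chars >= 100:
        return "tier0_pdfium", "usable text layer"
    if page.long_edge_pt >= 1500:
        return "tier1_extractous", "broadsheet"
    if page.clean_columned:
        return "tier1_extractous", "clean columns"
    return "tier2_claude", "needs vision"


def needs_vision_fallback(tier: str, text: str, min_chars: int = 20) -> bool:
    """Fall back only on near-empty output; min_chars is an assumed threshold."""
    return tier != "tier2_claude" and len(text.strip()) < min_chars


print(decide_tier(PageSignals(text_layer_chars=2400, long_edge_pt=792, clean_columned=False)))
print(decide_tier(PageSignals(text_layer_chars=0, long_edge_pt=792, clean_columned=False)))
```

Because the text-layer check comes first, a born-digital broadsheet still takes the free Tier 0 path.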
Returned types
from dataclasses import dataclass


@dataclass
class ParseResult:
    text: str                   # extracted plain text
    format: str                 # "pdf", "docx", "eml", …
    engine: str                 # which engine handled it
    mime: str
    detection_confidence: str   # "magic" / "ext" / "content"
    path: str
    error: str | None           # set if parsing failed
    page_count: int | None
    extra: dict                 # engine-specific provenance

    @property
    def ok(self) -> bool:
        """True if error is None."""
        return self.error is None
extra always carries enough to audit a parse:
| Engine | Notable extra keys |
|---|---|
| pdf_router | tier_counts, vision_usage, routing (per-page) |
| vision | vision_usage, vision_client |
| email | attachments (list of sub-ParseResults) |
| extractous | metadata_keys_count |
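Since email results nest their attachments as sub-results, auditing a parse usually means walking that tree. The sketch below shows one way to do it; StubResult is a stand-in that carries only the ParseResult fields used here, and walk_results and failed_paths are hypothetical helpers, not library functions.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class StubResult:
    """Stand-in with only the ParseResult fields this sketch touches."""
    path: str
    engine: str
    text: str = ""
    error: Optional[str] = None
    extra: dict = field(default_factory=dict)


def walk_results(result: StubResult) -> Iterator[StubResult]:
    """Yield the result, then every nested attachment result, depth-first."""
    yield result
    for child in result.extra.get("attachments", []):
        yield from walk_results(child)


def failed_paths(result: StubResult) -> list:
    """Collect the paths of every result in the tree that carries an error."""
    return [r.path for r in walk_results(result) if r.error is not None]


invoice = StubResult(path="invoice.pdf", engine="pdf_router", text="Total: 42")
photo = StubResult(path="photo.png", engine="vision", error="no vision client")
mail = StubResult(path="inbound.eml", engine="email", extra={"attachments": [invoice, photo]})
print(failed_paths(mail))
```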
Vision is pluggable — and there is no default
The library ships ZERO LLM SDK dependencies. Vision (image OCR + Tier 2 of
the PDF router) requires the caller to inject a VisionClient. If none is
provided, vision-needing operations return an error in result.error
instead of falling back to a default provider.
from typing import Any, Protocol


class VisionClient(Protocol):
    def ocr_image(
        self,
        image_bytes: bytes,
        media_type: str,
        *,
        prompt: str | None = None,
        max_tokens: int = 16384,
    ) -> tuple[str, dict[str, Any]]:
        """Return (extracted_text, usage_metadata)."""
This is intentional — production deployments use a per-org LLM gateway
(model selection, access control, cost tracking, secrets management) that
the library has no business knowing about. The parse(path, vision_client=...)
parameter (and recursively for email attachments) is the integration point.
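For tests and local experiments, any object with a matching ocr_image method will do. The sketch below defines a canned client; it redeclares the protocol locally so it runs without the library installed, and CannedVisionClient and its usage keys are hypothetical.

```python
from typing import Any, Optional, Protocol, runtime_checkable


@runtime_checkable
class VisionClient(Protocol):
    """Local copy of the protocol shown above, so the sketch runs standalone."""

    def ocr_image(
        self,
        image_bytes: bytes,
        media_type: str,
        *,
        prompt: Optional[str] = None,
        max_tokens: int = 16384,
    ) -> tuple: ...


class CannedVisionClient:
    """Hypothetical test double: returns fixed text and records each call."""

    def __init__(self, canned_text: str) -> None:
        self.canned_text = canned_text
        self.calls: list = []

    def ocr_image(
        self,
        image_bytes: bytes,
        media_type: str,
        *,
        prompt: Optional[str] = None,
        max_tokens: int = 16384,
    ) -> tuple:
        # Record the request size and type instead of calling a real model.
        self.calls.append((len(image_bytes), media_type))
        usage: dict = {"input_tokens": 0, "output_tokens": 0}
        return self.canned_text, usage


client = CannedVisionClient("hello from a fake OCR")
text, usage = client.ocr_image(b"\x89PNG...", "image/png")
print(text, usage, isinstance(client, VisionClient))
```

With the library installed, an object like this would be passed as parse(path, vision_client=client); a production client would forward the bytes to your gateway instead.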
Temporal integration — the canonical production pattern
In production, file_handler.parse() is not called from inside a single
Temporal Activity. Instead, magoneai's workflow composes file_handler's
per-tier engines as separate activities, with vision as its own activity
backed by LLMGateway. This gives Temporal-native retries, observability,
and rate-limiting per tier — without any asyncio.run bridging.
# magoneai/temporal/file_handler/activities.py
from pathlib import Path

from temporalio import activity

from be.core.database import get_async_session
from be.llm.gateway import LLMGateway, LoadedImage, build_vision_message

# file_handler exposes building blocks; activities orchestrate them.
import file_handler
from file_handler.engines import extractous_engine, pdfium_engine, pdf_router
from file_handler.engines._ocr_helpers import (
    render_image_file_for_ocr,
    render_pdf_page_for_ocr,
)
from file_handler.engines.page_features import extract_features


@activity.defn
async def detect_format_activity(file_uri: str) -> dict:
    fmt = file_handler.detect(file_uri)
    return {"format_id": fmt.format_id.value, "mime": fmt.mime}


@activity.defn
async def extract_via_extractous_activity(file_uri: str) -> dict:
    """Tika-based extraction for DOCX/PPTX/XLSX/HTML/MD/TXT and PDF Tier 1."""
    return extractous_engine.parse_file(Path(file_uri))


@activity.defn
async def plan_pdf_route_activity(file_uri: str) -> list[dict]:
    """Per-page tier plan, made by the workflow before dispatch."""
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(file_uri)
    plan = []
    for i, page in enumerate(pdf):
        f = extract_features(page, i)
        tier, reason = pdf_router._decide_tier(f)
        plan.append({"page": i, "tier": tier, "reason": reason})
    return plan


@activity.defn
async def pdfium_text_layer_activity(file_uri: str, page_idx: int) -> str:
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(file_uri)
    return pdfium_engine.extract_text(pdf[page_idx])


@activity.defn
async def extractous_page_activity(file_uri: str, page_idx: int) -> dict:
    return extractous_engine.parse_pdf_page(Path(file_uri), page_idx)


@activity.defn
async def vision_ocr_activity(
    file_uri: str,
    page_idx: int | None,  # None = whole-file image; int = PDF page
    project_id: str,
    llm_config_id: str,
) -> dict:
    """Vision OCR via the magoneai LLM gateway. Native async, no asyncio.run."""
    import base64

    if page_idx is None:
        image_bytes, media_type, _ = render_image_file_for_ocr(Path(file_uri))
    else:
        import pypdfium2 as pdfium

        pdf = pdfium.PdfDocument(file_uri)
        image_bytes, media_type, _ = render_pdf_page_for_ocr(pdf[page_idx])

    async with get_async_session() as session:
        gateway = LLMGateway(session)
        loaded = LoadedImage(
            base64_data=base64.standard_b64encode(image_bytes).decode("ascii"),
            mime_type=media_type,
            file_id=f"file_handler-{file_uri}:{page_idx}",
            size_bytes=len(image_bytes),
        )
        response = await gateway.complete(
            project_id=project_id,
            llm_config_id=llm_config_id,
            messages=[build_vision_message(text="Transcribe this page.", images=[loaded])],
            parameters={"max_tokens": 16384},
            images=[loaded],
            source_type="file_handler",
        )
        return {
            "text": response.content,
            "input_tokens": response.usage.tokens_in,
            "output_tokens": response.usage.tokens_out,
            "request_id": response.usage.request_id,
        }

# magoneai/temporal/file_handler/workflows.py
import asyncio
from datetime import timedelta

from temporalio import workflow

from .activities import (
    detect_format_activity,
    extract_via_extractous_activity,
    plan_pdf_route_activity,
    pdfium_text_layer_activity,
    extractous_page_activity,
    vision_ocr_activity,
)

IMAGE_FORMATS = {"jpeg", "png", "tiff", "gif", "webp", "bmp"}


@workflow.defn
class ParseDocumentWorkflow:
    @workflow.run
    async def run(self, file_uri: str, project_id: str, llm_config_id: str) -> dict:
        fmt = await workflow.execute_activity(
            detect_format_activity, file_uri,
            start_to_close_timeout=timedelta(seconds=10),
        )
        if fmt["format_id"] == "pdf":
            # Multiple arguments are passed through args=[...].
            return await workflow.execute_child_workflow(
                ParsePdfWorkflow.run,
                args=[file_uri, project_id, llm_config_id],
            )
        if fmt["format_id"] in IMAGE_FORMATS:
            res = await workflow.execute_activity(
                vision_ocr_activity,
                args=[file_uri, None, project_id, llm_config_id],
                start_to_close_timeout=timedelta(minutes=3),
            )
            return {"text": res["text"], "engine": "vision", "format": fmt["format_id"]}
        # Office, HTML, MD, TXT, EML, etc. → Tika handles them, no vision.
        res = await workflow.execute_activity(
            extract_via_extractous_activity, file_uri,
            start_to_close_timeout=timedelta(minutes=2),
        )
        return {**res, "format": fmt["format_id"]}


@workflow.defn
class ParsePdfWorkflow:
    @workflow.run
    async def run(self, file_uri: str, project_id: str, llm_config_id: str) -> dict:
        plan = await workflow.execute_activity(
            plan_pdf_route_activity, file_uri,
            start_to_close_timeout=timedelta(seconds=30),
        )

        # Fan-out: each page runs its chosen tier as an independent activity.
        async def run_tier(page_decision: dict) -> str:
            i = page_decision["page"]
            tier = page_decision["tier"]
            if tier == "tier0_pdfium":
                return await workflow.execute_activity(
                    pdfium_text_layer_activity,
                    args=[file_uri, i],
                    start_to_close_timeout=timedelta(seconds=30),
                )
            if tier == "tier1_extractous":
                res = await workflow.execute_activity(
                    extractous_page_activity,
                    args=[file_uri, i],
                    start_to_close_timeout=timedelta(minutes=2),
                )
                return res.get("text", "")
            res = await workflow.execute_activity(
                vision_ocr_activity,
                args=[file_uri, i, project_id, llm_config_id],
                start_to_close_timeout=timedelta(minutes=3),
            )
            return res["text"]

        # All pages are scheduled at once; worker concurrency caps how many
        # run in flight, which keeps gateway rate limits in check.
        page_texts = await asyncio.gather(*(run_tier(p) for p in plan))
        return {
            "text": "\n\n".join(t for t in page_texts if t),
            "engine": "pdf_router",
            "format": "pdf",
            "page_count": len(plan),
            "per_page_decisions": plan,
        }

Why this shape:
- Each tier is its own Temporal Activity. Vision-rate-limit retries don't block fast pdfium calls. Extractous subprocess crashes retry independently. Per-tier timeouts and policies live where they belong.
- The vision activity is async-native. It calls LLMGateway.complete() directly with await. No asyncio.run, no thread bridge, no nested loops.
- Routing decisions are visible in the workflow's event history. You can query "what tier did page 7 use?" directly from Temporal.
- Per-page fan-out is parallel. A 50-page PDF runs N pages concurrently (worker concurrency caps total in-flight). file_handler's sync parse() would have processed them serially.
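If worker-level limits are not enough to protect a rate-limited gateway, the fan-out itself can be capped. The following is a minimal sketch in plain asyncio, assuming a semaphore is an acceptable way to bound in-flight pages; gather_bounded and the demo coroutine are hypothetical names, and nothing here is part of file_handler or the Temporal SDK.

```python
import asyncio


async def gather_bounded(items, worker, limit: int) -> list:
    """Run worker(item) for every item, with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        async with semaphore:
            return await worker(item)

    # gather preserves input order, so page texts come back in page order.
    return await asyncio.gather(*(run_one(item) for item in items))


async def demo() -> tuple:
    in_flight = 0
    peak = 0

    async def fake_page(i: int) -> str:
        # Stand-in for a per-page activity call; tracks peak concurrency.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return f"page {i}"

    texts = await gather_bounded(range(10), fake_page, limit=3)
    return texts, peak


texts, peak = asyncio.run(demo())
print(len(texts), peak)
```

Results keep their page order regardless of completion order, so the joined text stays stable.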
For local dev, scripts, and the benchmark, file_handler.parse(path, vision_client=YourClient()) is still the simplest path — pass any object
implementing VisionClient.
Limitations / known issues
- Legacy Office (.doc / .ppt / .xls): falls back to Tika best-effort. A libreoffice-based pre-converter is on the roadmap.
- Tables: neither Extractous nor Claude returns structured tables. pdfplumber-based table extraction is on the roadmap.
- No CJK OCR: Tika's default Tesseract is English-only. We have not invested in Chinese / Japanese / Korean OCR (out of scope).
- Dense tabloid pages: the router's column detector occasionally mis-routes a tabloid (single dominant photo + scattered text) to Tier 2. Watch this if your traffic is heavy on tabloid layouts.
Development
pip install -e ".[all,test]"
pytest