Universal document parser: PDF / Office / email / images / HTML — tiered routing for cost & accuracy

mag-file-handler

Universal document parser for any file format your product receives.

from file_handler import parse

result = parse("inbound/email.eml")
print(result.text)            # extracted plain text
print(result.format)          # "eml"
print(result.engine)          # "email"
print(result.extra)           # engine-specific provenance

Why this exists

Different document types need different tools — and getting it wrong is expensive. Born-digital PDFs should hit a free text-layer extractor; scanned newspapers should hit Tika; complex slides and textbooks need a vision LLM. This library routes each document to the right engine, automatically.

System requirements

  • Python 3.10, 3.11, 3.12, or 3.13 (CPython).
  • OS / arch: Linux x86-64 (glibc ≥ 2.28), macOS x86-64 (≥ 10.12), macOS ARM64 (≥ 11), Windows x86-64. Linux ARM64 (Graviton, RPi) and Windows ARM64 are not currently supported because the extractous dependency does not ship wheels for those targets.
  • No JVM is required. extractous ships native binaries built with GraalVM AOT compilation, so installation and runtime are pure-native even though Apache Tika is used internally for the long tail of formats.
  • Optional: a VisionClient implementation (yours or your platform's LLM gateway's) for image OCR and scanned-PDF Tier 2 routing. See Vision is pluggable.

Install

pip install mag-file-handler              # core (PDF router, Office, HTML, txt, EML)
pip install "mag-file-handler[email]"     # + Outlook .msg support
pip install "mag-file-handler[all]"       # everything optional (quotes keep zsh from globbing the brackets)

[vision] is a documentation-only marker extra — the library ships zero LLM SDK dependencies by design. To use vision OCR, pass a VisionClient you implement (or one your platform provides) into parse(path, vision_client=...). See below for the protocol.

Usage

Library

import logging

from file_handler import parse

log = logging.getLogger(__name__)

result = parse("/path/to/file")
if result.ok:
    process(result.text)          # process() stands in for your own handler
else:
    log.warning("parse failed: %s", result.error)

# Engine-specific provenance lives in result.extra
if result.engine == "pdf_router":
    print(result.extra["tier_counts"])    # {"tier0_pdfium": 5, "tier1_extractous": 0, "tier2_claude": 2}
    print(result.extra["vision_usage"])   # {"input_tokens": ..., "output_tokens": ...}
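
As a rough illustration of what tier_counts lets you audit, the sketch below estimates vision spend and serial latency for one PDF. It assumes the figures quoted in the routing diagram (about $0.008 and 15 s for Tier 2, about 2 s for Tier 1) are per page, which this README does not state explicitly. estimate_cost and the constant names are hypothetical and not part of the library.

```python
# Assumed per-page figures (from the routing diagram); replace with your own measurements.
TIER2_USD_PER_PAGE = 0.008
TIER_SECONDS_PER_PAGE = {
    "tier0_pdfium": 0.0,       # "ms-fast", treated as zero here
    "tier1_extractous": 2.0,
    "tier2_claude": 15.0,
}


def estimate_cost(tier_counts: dict[str, int]) -> dict[str, float]:
    """Estimate vision spend (USD) and serial latency (seconds) for one PDF."""
    vision_pages = tier_counts.get("tier2_claude", 0)
    seconds = sum(
        TIER_SECONDS_PER_PAGE.get(tier, 0.0) * pages
        for tier, pages in tier_counts.items()
    )
    return {"usd": vision_pages * TIER2_USD_PER_PAGE, "serial_seconds": seconds}


estimate = estimate_cost({"tier0_pdfium": 5, "tier1_extractous": 0, "tier2_claude": 2})
```

The latency figure is a serial upper bound; pages processed in parallel would finish sooner.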

CLI

file-handler parse  document.pdf            # prints extracted text
file-handler parse  document.pdf --json     # full result as JSON
file-handler detect document.pdf            # format detection only
file-handler info                           # version + which engines are available

How routing works

                     FORMAT DETECTION
                     (magic bytes + ext)
                            │
        ┌───────────────────┼───────────────────────┐
        ▼                   ▼                       ▼
       PDF              Image (jpg/png/…)       Email (eml/msg)
        │                   │                       │
        ▼                   ▼                       ▼
   PDF Router          Claude Vision         parse + recurse on
   (per-page tiers)                          each attachment
   ┌──────────────┐
   │ Tier 0  pdfium native text-layer       free, ms-fast
   │ Tier 1  Extractous (Tika)              free, ~2 s
   │ Tier 2  Claude Haiku 4.5 vision        ~$0.008, ~15 s
   └──────────────┘

   Office, HTML, MD, TXT, …  →  Extractous (Tika handles them natively)
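
The detection step at the top of the diagram (magic bytes first, extension second) can be sketched as follows. This is not the library's detector: sniff_format and the signature table are hypothetical, and only a handful of well-known signatures are listed. The "magic" and "ext" labels mirror the detection_confidence values documented under Returned types.

```python
from pathlib import Path

# A few well-known file signatures (magic bytes); a real detector knows many more.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF8", "gif"),
]


def sniff_format(head: bytes, filename: str) -> tuple[str, str]:
    """Return (format, confidence), where confidence is "magic" or "ext"."""
    for signature, fmt in MAGIC_SIGNATURES:
        if head.startswith(signature):
            return fmt, "magic"
    # No signature matched (plain-text formats such as .eml have none),
    # so fall back to the file extension.
    ext = Path(filename).suffix.lower().lstrip(".")
    return (ext or "unknown"), "ext"
```

Checking content before the extension means a PDF uploaded as report.bin is still routed as a PDF.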

For PDFs, the per-page decision tree picks:

| Signal | Decision |
|---|---|
| text_layer_chars >= 100 | Tier 0: free, instant |
| is_broadsheet (long edge ≥ 1500 pt) | Tier 1: Extractous wins on dense newspapers |
| clean_columned (2–3 uniform cols) | Tier 1: Extractous wins on structured columns |
| else | Tier 2: LLM vision (slides, mixed cols, textbooks) |

If Tier 0 / Tier 1 returns near-empty text on a page that visibly has content, the engine falls back to Tier 2 automatically (conservative — only on empty output, not on questionable quality).
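
The decision table and the empty-output fallback could be expressed roughly as below. This is a sketch under stated assumptions, not the router's actual code: PageFeatures, its field names, decide_tier, needs_vision_fallback, and the min_chars cutoff for "near-empty" are all hypothetical. The check that a page "visibly has content" is left out.

```python
from dataclasses import dataclass


@dataclass
class PageFeatures:
    """Hypothetical per-page signals; field names are illustrative only."""
    text_layer_chars: int
    long_edge_pt: float
    uniform_columns: int  # number of uniform text columns detected, 0 if none


def decide_tier(f: PageFeatures) -> tuple[str, str]:
    """Return (tier, reason), evaluating the table rows top to bottom."""
    if f.text_layer_chars >= 100:
        return "tier0", "usable text layer"
    if f.long_edge_pt >= 1500:
        return "tier1", "broadsheet"
    if 2 <= f.uniform_columns <= 3:
        return "tier1", "clean columns"
    return "tier2", "needs vision"


def needs_vision_fallback(tier: str, text: str, min_chars: int = 1) -> bool:
    """True when a Tier 0/1 page came back (near-)empty and should be re-run on Tier 2."""
    return tier in ("tier0", "tier1") and len(text.strip()) < min_chars
```

Because the rows are checked in order, a broadsheet page with a healthy text layer still goes to Tier 0, the cheapest option.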

Returned types

from dataclasses import dataclass


@dataclass
class ParseResult:
    text: str                      # extracted plain text
    format: str                    # "pdf", "docx", "eml", …
    engine: str                    # which engine handled it
    mime: str
    detection_confidence: str      # "magic" / "ext" / "content"
    path: str
    error: str | None              # set if parsing failed
    page_count: int | None
    extra: dict                    # engine-specific provenance

    @property
    def ok(self) -> bool:
        """True if error is None."""
        return self.error is None

extra always carries enough to audit a parse:

| Engine | Notable extra keys |
|---|---|
| pdf_router | tier_counts, vision_usage, routing (per-page) |
| vision | vision_usage, vision_client |
| email | attachments (list of sub-ParseResults) |
| extractous | metadata_keys_count |

Vision is pluggable — and there is no default

The library ships ZERO LLM SDK dependencies. Vision (image OCR + Tier 2 of the PDF router) requires the caller to inject a VisionClient. If none is provided, vision-needing operations return an error in result.error instead of falling back to a default provider.

from typing import Any, Protocol


class VisionClient(Protocol):
    def ocr_image(
        self,
        image_bytes: bytes,
        media_type: str,
        *,
        prompt: str | None = None,
        max_tokens: int = 16384,
    ) -> tuple[str, dict[str, Any]]:
        """Return (extracted_text, usage_metadata)."""
        ...

This is intentional — production deployments use a per-org LLM gateway (model selection, access control, cost tracking, secrets management) that the library has no business knowing about. The parse(path, vision_client=...) parameter (and recursively for email attachments) is the integration point.
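
For tests and local development, a client that never touches the network is often enough. The sketch below restates the protocol so it runs on its own and adds a hypothetical EchoVisionClient test double. The input_tokens / output_tokens usage keys follow the usage example earlier in this README; whether the library requires them is an assumption.

```python
from typing import Any, Protocol


class VisionClient(Protocol):
    def ocr_image(
        self,
        image_bytes: bytes,
        media_type: str,
        *,
        prompt: str | None = None,
        max_tokens: int = 16384,
    ) -> tuple[str, dict[str, Any]]: ...


class EchoVisionClient:
    """Test double: makes no network calls and returns a canned transcription."""

    def __init__(self, canned_text: str = "") -> None:
        self.canned_text = canned_text
        self.calls: list[tuple[int, str]] = []  # (byte length, media type) per call

    def ocr_image(self, image_bytes, media_type, *, prompt=None, max_tokens=16384):
        self.calls.append((len(image_bytes), media_type))
        usage = {"input_tokens": 0, "output_tokens": 0}
        return self.canned_text, usage


# Structural typing: no inheritance needed to satisfy the protocol.
client: VisionClient = EchoVisionClient("hello")
text, usage = client.ocr_image(b"\x89PNG", "image/png")
```

Such a client would then be passed as parse(path, vision_client=client).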

Temporal integration — the canonical production pattern

In production, file_handler.parse() is not called from inside a single Temporal Activity. Instead, magoneai's workflow composes file_handler's per-tier engines as separate activities, with vision as its own activity backed by LLMGateway. This gives Temporal-native retries, observability, and rate-limiting per tier — without any asyncio.run bridging.

# magoneai/temporal/file_handler/activities.py

from pathlib import Path

from temporalio import activity

from be.core.database import get_async_session
from be.llm.gateway import LLMGateway, LoadedImage, build_vision_message

# file_handler exposes building blocks; activities orchestrate them.
import file_handler
from file_handler.engines import extractous_engine, pdfium_engine, pdf_router
from file_handler.engines._ocr_helpers import (
    render_image_file_for_ocr,
    render_pdf_page_for_ocr,
)
from file_handler.engines.page_features import extract_features


@activity.defn
async def detect_format_activity(file_uri: str) -> dict:
    fmt = file_handler.detect(file_uri)
    return {"format_id": fmt.format_id.value, "mime": fmt.mime}


@activity.defn
async def extract_via_extractous_activity(file_uri: str) -> dict:
    """Tika-based extraction for DOCX/PPTX/XLSX/HTML/MD/TXT and PDF Tier 1."""
    return extractous_engine.parse_file(Path(file_uri))


@activity.defn
async def plan_pdf_route_activity(file_uri: str) -> list[dict]:
    """Per-page tier plan, made by the workflow before dispatch."""
    import pypdfium2 as pdfium
    pdf = pdfium.PdfDocument(file_uri)
    plan = []
    for i, page in enumerate(pdf):
        f = extract_features(page, i)
        tier, reason = pdf_router._decide_tier(f)
        plan.append({"page": i, "tier": tier, "reason": reason})
    return plan


@activity.defn
async def pdfium_text_layer_activity(file_uri: str, page_idx: int) -> str:
    import pypdfium2 as pdfium
    pdf = pdfium.PdfDocument(file_uri)
    return pdfium_engine.extract_text(pdf[page_idx])


@activity.defn
async def extractous_page_activity(file_uri: str, page_idx: int) -> dict:
    return extractous_engine.parse_pdf_page(Path(file_uri), page_idx)


@activity.defn
async def vision_ocr_activity(
    file_uri: str,
    page_idx: int | None,        # None = whole-file image; int = PDF page
    project_id: str,
    llm_config_id: str,
) -> dict:
    """Vision OCR via the magoneai LLM gateway. Native async — no asyncio.run."""
    if page_idx is None:
        image_bytes, media_type, _ = render_image_file_for_ocr(Path(file_uri))
    else:
        import pypdfium2 as pdfium
        pdf = pdfium.PdfDocument(file_uri)
        image_bytes, media_type, _ = render_pdf_page_for_ocr(pdf[page_idx])

    import base64  # local import, in the same style as the pypdfium2 imports

    async with get_async_session() as session:
        gateway = LLMGateway(session)
        loaded = LoadedImage(
            base64_data=base64.standard_b64encode(image_bytes).decode("ascii"),
            mime_type=media_type,
            file_id=f"file_handler-{file_uri}:{page_idx}",
            size_bytes=len(image_bytes),
        )
        response = await gateway.complete(
            project_id=project_id,
            llm_config_id=llm_config_id,
            messages=[build_vision_message(text="Transcribe this page.", images=[loaded])],
            parameters={"max_tokens": 16384},
            images=[loaded],
            source_type="file_handler",
        )
    return {
        "text": response.content,
        "input_tokens": response.usage.tokens_in,
        "output_tokens": response.usage.tokens_out,
        "request_id": response.usage.request_id,
    }

# magoneai/temporal/file_handler/workflows.py

import asyncio
from datetime import timedelta

from temporalio import workflow

from .activities import (
    detect_format_activity,
    extract_via_extractous_activity,
    plan_pdf_route_activity,
    pdfium_text_layer_activity,
    extractous_page_activity,
    vision_ocr_activity,
)


IMAGE_FORMATS = {"jpeg", "png", "tiff", "gif", "webp", "bmp"}


@workflow.defn
class ParseDocumentWorkflow:
    @workflow.run
    async def run(self, file_uri: str, project_id: str, llm_config_id: str) -> dict:
        fmt = await workflow.execute_activity(
            detect_format_activity, file_uri,
            start_to_close_timeout=timedelta(seconds=10),
        )

        if fmt["format_id"] == "pdf":
            # Temporal accepts one positional argument; several go through args=[...].
            return await workflow.execute_child_workflow(
                ParsePdfWorkflow.run,
                args=[file_uri, project_id, llm_config_id],
            )

        if fmt["format_id"] in IMAGE_FORMATS:
            res = await workflow.execute_activity(
                vision_ocr_activity,
                args=[file_uri, None, project_id, llm_config_id],
                start_to_close_timeout=timedelta(minutes=3),
            )
            return {"text": res["text"], "engine": "vision", "format": fmt["format_id"]}

        # Office, HTML, MD, TXT, EML, etc. → Tika handles them, no vision.
        res = await workflow.execute_activity(
            extract_via_extractous_activity, file_uri,
            start_to_close_timeout=timedelta(minutes=2),
        )
        return {**res, "format": fmt["format_id"]}


@workflow.defn
class ParsePdfWorkflow:
    @workflow.run
    async def run(self, file_uri: str, project_id: str, llm_config_id: str) -> dict:
        plan = await workflow.execute_activity(
            plan_pdf_route_activity, file_uri,
            start_to_close_timeout=timedelta(seconds=30),
        )

        # Fan-out: each page runs its chosen tier as an independent activity.
        async def run_tier(page_decision: dict) -> str:
            i = page_decision["page"]
            tier = page_decision["tier"]
            if tier == "tier0_pdfium":
                return await workflow.execute_activity(
                    pdfium_text_layer_activity,
                    args=[file_uri, i],
                    start_to_close_timeout=timedelta(seconds=30),
                )
            if tier == "tier1_extractous":
                res = await workflow.execute_activity(
                    extractous_page_activity,
                    args=[file_uri, i],
                    start_to_close_timeout=timedelta(minutes=2),
                )
                return res.get("text", "")
            res = await workflow.execute_activity(
                vision_ocr_activity,
                args=[file_uri, i, project_id, llm_config_id],
                start_to_close_timeout=timedelta(minutes=3),
            )
            return res["text"]

        # All pages are scheduled at once; worker concurrency limits how many
        # activities actually run in parallel. gather preserves page order.
        page_texts = await asyncio.gather(*(run_tier(p) for p in plan))
        return {
            "text": "\n\n".join(t for t in page_texts if t),
            "engine": "pdf_router",
            "format": "pdf",
            "page_count": len(plan),
            "per_page_decisions": plan,
        }

Why this shape:

  • Each tier is its own Temporal Activity. Vision-rate-limit retries don't block fast pdfium calls. Extractous subprocess crashes retry independently. Per-tier timeouts and policies live where they belong.
  • The vision activity is async-native. It calls LLMGateway.complete() directly with await. No asyncio.run, no thread bridge, no nested loops.
  • Routing decisions are visible in the workflow's event history. You can query "what tier did page 7 use?" directly from Temporal.
  • Per-page fan-out is parallel. A 50-page PDF schedules all 50 page activities at once, and worker concurrency caps how many are in flight at any moment. file_handler's sync parse() would have processed them serially.
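
The fan-out in ParsePdfWorkflow starts one activity per page and leaves throttling to the workers. If you would rather cap fan-out in the caller, a semaphore around each coroutine is one option. The sketch below is plain asyncio with a fake page task so that it runs standalone; gather_bounded is a hypothetical helper, and using it inside a Temporal workflow around execute_activity calls is an untested assumption.

```python
import asyncio


async def gather_bounded(coros, limit: int) -> list:
    """Run coroutines concurrently, at most `limit` at a time, preserving input order."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(coro):
        async with semaphore:
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))


async def _demo() -> tuple[list[int], int]:
    in_flight = 0
    peak = 0

    async def fake_page(i: int) -> int:
        # Stands in for one per-page activity; tracks how many run at once.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return i

    results = await gather_bounded([fake_page(i) for i in range(10)], limit=3)
    return results, peak


results, peak = asyncio.run(_demo())
```

A limit close to your gateway's rate limit keeps vision retries from piling up, at the cost of slower completion on long documents.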

For local dev, scripts, and the benchmark, file_handler.parse(path, vision_client=YourClient()) is still the simplest path — pass any object implementing VisionClient.

Limitations / known issues

  • Legacy Office (.doc / .ppt / .xls): falls back to Tika best-effort. A libreoffice-based pre-converter is on the roadmap.
  • Tables: neither Extractous nor Claude returns structured tables. pdfplumber-based table extraction is on the roadmap.
  • No CJK OCR: Tika's default Tesseract is English-only. We have not invested in Chinese / Japanese / Korean OCR (out of scope).
  • Dense tabloid pages: the router's column detector occasionally mis-routes a tabloid (single dominant photo + scattered text) to Tier 2. Watch this if your traffic is heavy on tabloid layouts.

Development

pip install -e ".[all,test]"
pytest

Download files

Source Distribution

mag_file_handler-0.1.0.tar.gz (34.2 kB view details)

Uploaded Source

Built Distribution

mag_file_handler-0.1.0-py3-none-any.whl (32.8 kB view details)

Uploaded Python 3

File details

Details for the file mag_file_handler-0.1.0.tar.gz.

File metadata

  • Download URL: mag_file_handler-0.1.0.tar.gz
  • Upload date:
  • Size: 34.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mag_file_handler-0.1.0.tar.gz
Algorithm Hash digest
SHA256 f605b1dd94c2d50bc6d24438f63829b971b142a510f30b024954c8f1324d503b
MD5 bf23c293104d6ea1b3f93289054886b5
BLAKE2b-256 8e5d19d043c1ce6e6e2846d5c5adf90cadc696090deff3c7c812d51fcb1434cd

Provenance

The following attestation bundles were made for mag_file_handler-0.1.0.tar.gz:

Publisher: release.yml on magurelabs/magoneai-file-handler

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file mag_file_handler-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for mag_file_handler-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 3cce40dd802e41da6ee261bfef120d3b26a55440396c09364419021f76daa034
MD5 7f182927245b6cd0a4ec21fc6a86a711
BLAKE2b-256 eb8af11bb0d44063a466b9e935b7c1314cdc593fc64378faa382fe7a6a76d78c

Provenance

The following attestation bundles were made for mag_file_handler-0.1.0-py3-none-any.whl:

Publisher: release.yml on magurelabs/magoneai-file-handler

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
