Universal document parser: PDF / Office / email / images / HTML — tiered routing for cost & accuracy

mag-file-handler

Universal document parser for any file format your product receives.

from file_handler import parse

result = parse("inbound/email.eml")
print(result.text)            # extracted plain text
print(result.format)          # "eml"
print(result.engine)          # "email"
print(result.extra)           # engine-specific provenance

Why this exists

Different document types need different tools — and getting it wrong is expensive. Born-digital PDFs should hit a free text-layer extractor; scanned newspapers should hit Tika; complex slides and textbooks need a vision LLM. This library routes each document to the right engine, automatically.

System requirements

Python 3.10, 3.11, 3.12, or 3.13 (CPython).
OS / arch: Linux x86-64 (glibc ≥ 2.28), macOS x86-64 (≥ 10.12), macOS ARM64 (≥ 11), Windows x86-64. Linux ARM64 (Graviton, RPi) and Windows ARM64 are not currently supported because the extractous dependency does not ship wheels for those targets.
No JVM is required. extractous ships native binaries built with GraalVM AOT compilation, so installation and runtime are pure-native even though Apache Tika is used internally for the long tail of formats.
Optional: a VisionClient implementation (yours or your platform's LLM gateway's) for image OCR and scanned-PDF Tier 2 routing. See the "Vision is pluggable" section below.
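
As a rough pre-install check, the support matrix above can be written down as data. This is only a sketch: `is_supported` is a hypothetical helper, not part of the library, and the `(system, machine)` spellings are the values `platform.system()` and `platform.machine()` usually report, which is an assumption. It does not check the glibc or macOS minimum versions.

```python
import platform
import sys

# (system, machine) pairs taken from the requirements list; the spellings
# are assumed to match what the platform module reports on each target.
SUPPORTED_TARGETS = {
    ("Linux", "x86_64"),
    ("Darwin", "x86_64"),
    ("Darwin", "arm64"),
    ("Windows", "AMD64"),
}
SUPPORTED_PYTHONS = {(3, 10), (3, 11), (3, 12), (3, 13)}


def is_supported(system: str, machine: str, python: tuple[int, int]) -> bool:
    """Return True if both the OS/arch pair and the Python minor version are listed."""
    return (system, machine) in SUPPORTED_TARGETS and python in SUPPORTED_PYTHONS


def current_target() -> tuple[str, str, tuple[int, int]]:
    """Describe the running interpreter in the shape is_supported() expects."""
    return platform.system(), platform.machine(), tuple(sys.version_info[:2])
```

Calling `is_supported(*current_target())` before `pip install` would flag, for example, a Graviton host early instead of failing on a missing extractous wheel.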

Install

pip install mag-file-handler            # core (PDF router, Office, HTML, txt, EML)
pip install mag-file-handler[email]     # + Outlook .msg support
pip install mag-file-handler[all]       # everything optional

[vision] is a documentation-only marker extra — the library ships zero LLM SDK dependencies by design. To use vision OCR, pass a VisionClient you implement (or one your platform provides) into parse(path, vision_client=...). See below for the protocol.

Usage

Library

import logging

from file_handler import parse

log = logging.getLogger(__name__)

result = parse("/path/to/file")
if result.ok:
    process(result.text)                  # process() stands for your own downstream handler
else:
    log.warning("parse failed: %s", result.error)

# Engine-specific provenance lives in result.extra
if result.engine == "pdf_router":
    print(result.extra["tier_counts"])    # {"tier0_pdfium": 5, "tier1_extractous": 0, "tier2_claude": 2}
    print(result.extra["vision_usage"])   # {"input_tokens": ..., "output_tokens": ...}

CLI

file-handler parse  document.pdf            # prints extracted text
file-handler parse  document.pdf --json     # full result as JSON
file-handler detect document.pdf            # format detection only
file-handler info                           # version + which engines are available

How routing works

                     FORMAT DETECTION
                     (magic bytes + ext)
                            │
        ┌───────────────────┼───────────────────────┐
        ▼                   ▼                       ▼
       PDF              Image (jpg/png/…)       Email (eml/msg)
        │                   │                       │
        ▼                   ▼                       ▼
   PDF Router          Claude Vision         parse + recurse on
   (per-page tiers)                          each attachment
   ┌──────────────┐
   │ Tier 0  pdfium native text-layer       free, ms-fast
   │ Tier 1  Extractous (Tika)              free, ~2 s
   │ Tier 2  Claude Haiku 4.5 vision        ~$0.008, ~15 s
   └──────────────┘

   Office, HTML, MD, TXT, …  →  Extractous (Tika handles them natively)
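
The "magic bytes + ext" step at the top of the diagram might look roughly like the sketch below. It is an illustration only, not the library's detector: `detect_format`, both lookup tables, and the `"unknown"` / `"none"` return values are hypothetical, and only a handful of well-known signatures are included.

```python
from pathlib import PurePath

# A few well-known file signatures; a real detector would cover many more.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
]

# Extension fallback for formats without a distinctive signature (hypothetical table).
EXTENSION_MAP = {".eml": "eml", ".html": "html", ".md": "md", ".txt": "txt", ".pdf": "pdf"}


def detect_format(head: bytes, filename: str) -> tuple[str, str]:
    """Return (format, confidence); confidence is "magic", "ext", or "none"."""
    # Content wins over the file name: a PDF renamed to .txt is still a PDF.
    for signature, fmt in MAGIC_SIGNATURES:
        if head.startswith(signature):
            return fmt, "magic"
    suffix = PurePath(filename).suffix.lower()
    if suffix in EXTENSION_MAP:
        return EXTENSION_MAP[suffix], "ext"
    return "unknown", "none"
```

Checking the signature first means a mislabeled upload is still routed by what it actually contains, and the second element of the tuple records how much to trust the answer.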

For PDFs, the per-page decision tree checks these signals in order and takes the first match:

Signal	Decision
`text_layer_chars >= 100`	Tier 0 — free, instant
`is_broadsheet` (long edge ≥ 1500 pt)	Tier 1 — Extractous wins on dense newspapers
`clean_columned` (2–3 uniform cols)	Tier 1 — Extractous wins on structured columns
else	Tier 2 — LLM vision (slides, mixed cols, textbooks)

If Tier 0 / Tier 1 returns near-empty text on a page that visibly has content, the engine falls back to Tier 2 automatically (conservative — only on empty output, not on questionable quality).
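
The table and the fallback rule can be sketched as two small functions. This is a minimal illustration under stated assumptions, not the router's code: `PageFeatures`, `decide_tier`, `needs_fallback`, the `"tier2_vision"` label, and the 10-character "near-empty" cut-off are all hypothetical, and the rows are assumed to be evaluated top to bottom. The check that a page "visibly has content" is left out.

```python
from dataclasses import dataclass

TEXT_LAYER_MIN_CHARS = 100  # threshold from the table
NEAR_EMPTY_CHARS = 10       # hypothetical cut-off for "near-empty" output


@dataclass(frozen=True)
class PageFeatures:
    """Hypothetical container for the signals named in the table."""
    text_layer_chars: int
    is_broadsheet: bool
    clean_columned: bool


def decide_tier(f: PageFeatures) -> str:
    """Apply the table's rows in order; the first matching row wins."""
    if f.text_layer_chars >= TEXT_LAYER_MIN_CHARS:
        return "tier0_pdfium"
    if f.is_broadsheet or f.clean_columned:
        return "tier1_extractous"
    return "tier2_vision"


def needs_fallback(tier: str, extracted_text: str) -> bool:
    """Escalate to Tier 2 only when a cheaper tier returned near-empty text."""
    return tier != "tier2_vision" and len(extracted_text.strip()) < NEAR_EMPTY_CHARS
```

Keeping the fallback separate from the routing decision mirrors the conservative rule above: escalation depends only on how much text came back, never on a judgment about its quality.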

Returned types

from dataclasses import dataclass


@dataclass
class ParseResult:
    text: str                      # extracted plain text
    format: str                    # "pdf", "docx", "eml", …
    engine: str                    # which engine handled it
    mime: str
    detection_confidence: str      # "magic" / "ext" / "content"
    path: str
    error: str | None              # set if parsing failed
    page_count: int | None
    extra: dict                    # engine-specific provenance

    @property
    def ok(self) -> bool:          # True if error is None
        return self.error is None

extra always carries enough to audit a parse:

Engine	Notable `extra` keys
`pdf_router`	`tier_counts`, `vision_usage`, `routing` (per-page)
`vision`	`vision_usage`, `vision_client`
`email`	`attachments` (list of sub-ParseResults)
`extractous`	`metadata_keys_count`

Vision is pluggable — and there is no default

The library ships ZERO LLM SDK dependencies. Vision (image OCR + Tier 2 of the PDF router) requires the caller to inject a VisionClient. If none is provided, vision-needing operations return an error in result.error instead of falling back to a default provider.

from typing import Any, Protocol


class VisionClient(Protocol):
    def ocr_image(
        self,
        image_bytes: bytes,
        media_type: str,
        *,
        prompt: str | None = None,
        max_tokens: int = 16384,
    ) -> tuple[str, dict[str, Any]]:
        """Return (extracted_text, usage_metadata)."""

This is intentional: production deployments use a per-org LLM gateway (model selection, access control, cost tracking, secrets management) that the library has no business knowing about. The vision_client parameter of parse(path, vision_client=...) is the integration point, and the same client is passed down recursively to email attachments.

Temporal integration — the canonical production pattern

In production, file_handler.parse() is not called from inside a single Temporal Activity. Instead, magoneai's workflow composes file_handler's per-tier engines as separate activities, with vision as its own activity backed by LLMGateway. This gives Temporal-native retries, observability, and rate-limiting per tier — without any asyncio.run bridging.

# magoneai/temporal/file_handler/activities.py

from pathlib import Path

from temporalio import activity

from be.core.database import get_async_session
from be.llm.gateway import LLMGateway, LoadedImage, build_vision_message

# file_handler exposes building blocks; activities orchestrate them.
import file_handler
from file_handler.engines import extractous_engine, pdfium_engine, pdf_router
from file_handler.engines._ocr_helpers import (
    render_image_file_for_ocr,
    render_pdf_page_for_ocr,
)
from file_handler.engines.page_features import extract_features


@activity.defn
async def detect_format_activity(file_uri: str) -> dict:
    fmt = file_handler.detect(file_uri)
    return {"format_id": fmt.format_id.value, "mime": fmt.mime}


@activity.defn
async def extract_via_extractous_activity(file_uri: str) -> dict:
    """Tika-based extraction for DOCX/PPTX/XLSX/HTML/MD/TXT and PDF Tier 1."""
    return extractous_engine.parse_file(Path(file_uri))


@activity.defn
async def plan_pdf_route_activity(file_uri: str) -> list[dict]:
    """Per-page tier plan, made by the workflow before dispatch."""
    import pypdfium2 as pdfium
    pdf = pdfium.PdfDocument(file_uri)
    plan = []
    for i, page in enumerate(pdf):
        f = extract_features(page, i)
        tier, reason = pdf_router._decide_tier(f)
        plan.append({"page": i, "tier": tier, "reason": reason})
    return plan


@activity.defn
async def pdfium_text_layer_activity(file_uri: str, page_idx: int) -> str:
    import pypdfium2 as pdfium
    pdf = pdfium.PdfDocument(file_uri)
    return pdfium_engine.extract_text(pdf[page_idx])


@activity.defn
async def extractous_page_activity(file_uri: str, page_idx: int) -> dict:
    return extractous_engine.parse_pdf_page(Path(file_uri), page_idx)


@activity.defn
async def vision_ocr_activity(
    file_uri: str,
    page_idx: int | None,        # None = whole-file image; int = PDF page
    project_id: str,
    llm_config_id: str,
) -> dict:
    """Vision OCR via the magoneai LLM gateway. Native async — no asyncio.run."""
    import base64

    if page_idx is None:
        image_bytes, media_type, _ = render_image_file_for_ocr(Path(file_uri))
    else:
        import pypdfium2 as pdfium
        pdf = pdfium.PdfDocument(file_uri)
        image_bytes, media_type, _ = render_pdf_page_for_ocr(pdf[page_idx])

    async with get_async_session() as session:
        gateway = LLMGateway(session)
        loaded = LoadedImage(
            base64_data=base64.standard_b64encode(image_bytes).decode("ascii"),
            mime_type=media_type,
            file_id=f"file_handler-{file_uri}:{page_idx}",
            size_bytes=len(image_bytes),
        )
        response = await gateway.complete(
            project_id=project_id,
            llm_config_id=llm_config_id,
            messages=[build_vision_message(text="Transcribe this page.", images=[loaded])],
            parameters={"max_tokens": 16384},
            images=[loaded],
            source_type="file_handler",
        )
    return {
        "text": response.content,
        "input_tokens": response.usage.tokens_in,
        "output_tokens": response.usage.tokens_out,
        "request_id": response.usage.request_id,
    }

# magoneai/temporal/file_handler/workflows.py

import asyncio
from datetime import timedelta

from temporalio import workflow

# The activities module imports pypdfium2, the DB session, and the gateway;
# pass those imports through the workflow sandbox rather than re-importing them.
with workflow.unsafe.imports_passed_through():
    from .activities import (
        detect_format_activity,
        extract_via_extractous_activity,
        plan_pdf_route_activity,
        pdfium_text_layer_activity,
        extractous_page_activity,
        vision_ocr_activity,
    )


IMAGE_FORMATS = {"jpeg", "png", "tiff", "gif", "webp", "bmp"}


@workflow.defn
class ParseDocumentWorkflow:
    @workflow.run
    async def run(self, file_uri: str, project_id: str, llm_config_id: str) -> dict:
        fmt = await workflow.execute_activity(
            detect_format_activity, file_uri,
            start_to_close_timeout=timedelta(seconds=10),
        )

        if fmt["format_id"] == "pdf":
            # Multiple arguments must be passed through args=[...].
            return await workflow.execute_child_workflow(
                ParsePdfWorkflow.run,
                args=[file_uri, project_id, llm_config_id],
            )

        if fmt["format_id"] in IMAGE_FORMATS:
            res = await workflow.execute_activity(
                vision_ocr_activity,
                args=[file_uri, None, project_id, llm_config_id],
                start_to_close_timeout=timedelta(minutes=3),
            )
            return {"text": res["text"], "engine": "vision", "format": fmt["format_id"]}

        # Office, HTML, MD, TXT, EML, etc. → Tika handles them, no vision.
        res = await workflow.execute_activity(
            extract_via_extractous_activity, file_uri,
            start_to_close_timeout=timedelta(minutes=2),
        )
        return {**res, "format": fmt["format_id"]}


@workflow.defn
class ParsePdfWorkflow:
    @workflow.run
    async def run(self, file_uri: str, project_id: str, llm_config_id: str) -> dict:
        plan = await workflow.execute_activity(
            plan_pdf_route_activity, file_uri,
            start_to_close_timeout=timedelta(seconds=30),
        )

        # Fan-out: each page runs its chosen tier as an independent activity.
        async def run_tier(page_decision: dict) -> str:
            i = page_decision["page"]
            tier = page_decision["tier"]
            if tier == "tier0_pdfium":
                return await workflow.execute_activity(
                    pdfium_text_layer_activity,
                    args=[file_uri, i],
                    start_to_close_timeout=timedelta(seconds=30),
                )
            if tier == "tier1_extractous":
                res = await workflow.execute_activity(
                    extractous_page_activity,
                    args=[file_uri, i],
                    start_to_close_timeout=timedelta(minutes=2),
                )
                return res.get("text", "")
            res = await workflow.execute_activity(
                vision_ocr_activity,
                args=[file_uri, i, project_id, llm_config_id],
                start_to_close_timeout=timedelta(minutes=3),
            )
            return res["text"]

        # All pages are dispatched at once; worker concurrency settings cap
        # how many activities are actually in flight.
        page_texts = await asyncio.gather(*(run_tier(p) for p in plan))
        return {
            "text": "\n\n".join(t for t in page_texts if t),
            "engine": "pdf_router",
            "format": "pdf",
            "page_count": len(plan),
            "per_page_decisions": plan,
        }

Why this shape:

Each tier is its own Temporal Activity. Vision-rate-limit retries don't block fast pdfium calls. Extractous subprocess crashes retry independently. Per-tier timeouts and policies live where they belong.
The vision activity is async-native. It calls LLMGateway.complete() directly with await. No asyncio.run, no thread bridge, no nested loops.
Routing decisions are visible in the workflow's event history. You can query "what tier did page 7 use?" directly from Temporal.
Per-page fan-out is parallel. A 50-page PDF runs N pages concurrently (worker concurrency caps total in-flight). file_handler's sync parse() would have processed them serially.
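
If worker-level limits are not enough and the per-page fan-out needs a hard cap of its own, one option is a semaphore around each page task. The sketch below is plain asyncio and only illustrates the idea: `gather_bounded` is a hypothetical helper, and whether `asyncio.Semaphore` behaves the same way on Temporal's workflow event loop is an assumption to verify before relying on it.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def gather_bounded(
    factories: list[Callable[[], Awaitable[T]]], limit: int
) -> list[T]:
    """Run the awaitables produced by `factories`, at most `limit` at a time.

    Results come back in input order, like asyncio.gather.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run_one(factory: Callable[[], Awaitable[T]]) -> T:
        # The awaitable is created only after a slot is free, so no more
        # than `limit` pieces of work are ever started concurrently.
        async with semaphore:
            return await factory()

    return await asyncio.gather(*(run_one(f) for f in factories))
```

Taking zero-argument factories instead of ready-made coroutines is deliberate: it delays the start of each piece of work until its turn, which is what keeps a rate-limited gateway from seeing every page at once.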

For local dev, scripts, and the benchmark, file_handler.parse(path, vision_client=YourClient()) is still the simplest path — pass any object implementing VisionClient.

Limitations / known issues

Legacy Office (.doc / .ppt / .xls): falls back to Tika best-effort. A libreoffice-based pre-converter is on the roadmap.
Tables: neither Extractous nor Claude returns structured tables. pdfplumber-based table extraction is on the roadmap.
No CJK OCR: Tika's default Tesseract is English-only. We have not invested in Chinese / Japanese / Korean OCR (out of scope).
Dense tabloid pages: the router's column detector occasionally mis-routes a tabloid (single dominant photo + scattered text) to Tier 2. Watch this if your traffic is heavy on tabloid layouts.

Development

pip install -e ".[all,test]"
pytest

Release history

0.2.0

May 13, 2026

This version

0.1.0

May 13, 2026


