Python bindings for Olga. PDF, DOCX, XLSX, HTML → Markdown and typed JSON, 15–40× faster than equivalent-quality OSS. Strictly-typed surface, no Any, one abi3 wheel for CPython 3.8+.

Project description

olgadoc

Four formats. One engine. 15–40× faster.

Spatial fidelity at native speed, across PDF, DOCX, XLSX, and HTML. One Document API. mypy --strict clean. No LLM in the loop.

Python bindings for Olga — a Rust document-processing engine. Built on PyO3 and maturin; one abi3 wheel covers CPython 3.8+.

Install

pip install olgadoc

Ten-second tour

import olgadoc

doc = olgadoc.Document.open("report.pdf")
print(doc.format, doc.page_count)           # e.g. PDF 12

# Will this document produce text, or does it need OCR first?
report = doc.processability()
if report.is_blocked():
    raise SystemExit([b["kind"] for b in report.blockers])

# Full-text search
for hit in doc.search("quarterly revenue"):
    print(hit["page"], hit["snippet"])

# Structured JSON tree — discriminated on ``type``
for element in doc.to_json()["elements"]:
    if element["type"] == "heading":
        print(f"h{element['level']}: {element['text']}")

Why olgadoc

  • Four formats, one API. PDF, DOCX, XLSX, and HTML all expose the same Document / Page surface. Stop juggling pdfplumber + python-docx + openpyxl + BeautifulSoup.
  • Native speed. PDF 4–8 ms · DOCX 2 ms · XLSX 1–12 ms · HTML 1–5 ms. 15–40× faster than the quality-equivalent tool on every format (benchmarks). A post-release independent reproducible audit on a 50-file mixed corpus finds olgadoc 1.62× faster and 2.62× richer in extracted content than a hand-routed best-of-breed pipeline (report).
  • Spatial fidelity, intact. Tables stay tables. Columns stay columns. Figure captions stay next to their figures. Layout carries meaning, and Olga preserves it across the round-trip to Markdown or to the typed JSON tree.
  • OCR pre-flight. doc.processability() tells you — before the pipeline starts — whether a document actually carries native text, or whether it's a scanned image that needs OCR first. Fail fast, save money.
  • Actually typed. Zero Any on the public surface. Every returned dict is a real TypedDict, Document.to_json() returns a discriminated union over 16 element variants, and mypy --strict narrows each branch.
  • No LLM in the loop. Reads the native content stream directly. Validated with an anti-LLM adversarial test — invisible canaries preserved byte-exact, deliberate typos intact, no hallucinations.
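The OCR pre-flight bullet above can be turned into a small routing step. The sketch below is illustrative and not part of olgadoc: `triage` and its `open_document` parameter are hypothetical names, and the code relies only on the calls shown in the ten-second tour (`processability()`, `is_blocked()`, and `blockers` entries that carry a `kind` key).

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def triage(
    paths: Iterable[str],
    open_document: Callable[[str], Any],
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Split paths into extractable documents and blocked ones.

    `open_document` is injected so the sketch can run without olgadoc;
    in real use it would presumably be `olgadoc.Document.open` (assumption).
    """
    ready: List[str] = []
    blocked: Dict[str, List[str]] = {}
    for path in paths:
        report = open_document(path).processability()
        if report.is_blocked():
            # Keep the blocker kinds so the caller can decide what to do next,
            # for example send scanned files to an OCR step.
            blocked[path] = [str(b["kind"]) for b in report.blockers]
        else:
            ready.append(path)
    return ready, blocked
```

Called as `triage(paths, olgadoc.Document.open)`, this would return the files that can go straight to extraction and, for the rest, the reasons they were held back.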

Typed surface, no Any

Every returned dict is a runtime TypedDict — introspectable at runtime and narrowed at type-check time.

from olgadoc import SearchHit

def show(hit: SearchHit) -> None:
    print(hit["page"], hit["snippet"])  # ok
    print(hit["nope"])                  # mypy: "SearchHit" has no key "nope"

Document.to_json() returns a DocumentJson tree whose elements are a discriminated JsonElement union over 16 variants (heading, paragraph, table, list, image, code_block, …). Mypy narrows each branch to exactly one.
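To make the narrowing concrete, here is a minimal, self-contained sketch of a discriminated union built from TypedDicts. The `Heading`, `Paragraph`, and `Element` types are stand-ins written for this example, not olgadoc's real definitions; only the heading's `level` and `text` keys appear in the tour above, and the paragraph's `text` key is an assumption.

```python
from typing import Literal, TypedDict, Union


class Heading(TypedDict):
    type: Literal["heading"]
    level: int
    text: str


class Paragraph(TypedDict):
    type: Literal["paragraph"]
    text: str  # assumed key, for illustration only


# Two variants stand in for the 16 that the real union is said to have.
Element = Union[Heading, Paragraph]


def render(element: Element) -> str:
    """Render one element as a line of Markdown."""
    if element["type"] == "heading":
        # Inside this branch a type checker treats `element` as Heading,
        # so reading "level" is accepted.
        return "#" * element["level"] + " " + element["text"]
    # Only Paragraph remains here; reading "level" would be flagged.
    return element["text"]
```

The `type` key is the discriminant: comparing it against a literal string is what lets the checker pick a single variant in each branch.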

vs alternatives

olgadoc pdfplumber unstructured docling
PDF
DOCX
XLSX partial partial
HTML partial
mypy --strict clean (no Any)
OCR pre-flight
Provenance per element
No ML model / no GPU required optional optional

What you get

  • Four formats, one API: PDF, DOCX, XLSX, HTML through Document.
  • Processability report: Document.processability() → blockers (including EmptyContent for scanned PDFs) and degradations.
  • Cross-page tables: anchored on the first page with is_cross_page.
  • Hyperlinks, images, outline, RAG chunks, case-insensitive search.
  • Structured JSON tree: Document.to_json(), a discriminated union over 16 element variants.
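As an illustration of working with the structured JSON tree, the sketch below builds an indented outline from heading elements. It operates on a plain dict shaped like the `to_json()` result in the tour (`elements`, each with `type`, and headings with `level` and `text`). The function name `heading_outline` is hypothetical, and the code assumes a flat `elements` list with heading levels that start at 1.

```python
from typing import Any, List, Mapping


def heading_outline(tree: Mapping[str, Any]) -> List[str]:
    """Return one indented line per heading, in document order."""
    lines: List[str] = []
    for element in tree["elements"]:
        if element["type"] == "heading":
            # Two spaces of indent per level below the top (assumes level >= 1).
            indent = "  " * (element["level"] - 1)
            lines.append(indent + element["text"])
    return lines
```

With olgadoc installed, the input would presumably come from `doc.to_json()`; nested containers, if the real tree has them, would need a recursive walk instead.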

Examples

Five runnable scripts live in examples/:

  • quickstart.py — open a document, print a per-page preview.
  • extract_tables.py — pull every reconstructed table as TSV.
  • batch_processability.py — recursively health-check a directory.
  • search_and_extract.py — search + print surrounding page text.
  • json_walk.py — walk the typed JSON tree and narrow by type.
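The bundled scripts are not reproduced here. As a rough idea of the kind of post-processing a search script might do, the sketch below groups search hits by page. It is not the contents of search_and_extract.py: `snippets_by_page` is a hypothetical helper, it uses only the `page` and `snippet` keys shown in the tour, and it assumes page values are sortable (for example integers).

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping


def snippets_by_page(hits: Iterable[Mapping[str, Any]]) -> Dict[Any, List[str]]:
    """Group hit snippets under their page, with pages in ascending order."""
    grouped: Dict[Any, List[str]] = defaultdict(list)
    for hit in hits:
        grouped[hit["page"]].append(hit["snippet"])
    # Return a plain dict whose keys are sorted by page.
    return dict(sorted(grouped.items()))
```

In real use the input would presumably be `doc.search("quarterly revenue")`.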

Building from source

pip install maturin
cd olgadoc
maturin develop --release
pytest tests/ -q

License

Apache License 2.0.

Download files

Download the file for your platform.

Source Distribution

  • olgadoc-0.1.2.tar.gz (6.0 MB): Source

Built Distributions

  • olgadoc-0.1.2-cp38-abi3-win_amd64.whl (5.2 MB): CPython 3.8+, Windows x86-64
  • olgadoc-0.1.2-cp38-abi3-musllinux_1_2_x86_64.whl (6.1 MB): CPython 3.8+, musllinux: musl 1.2+ x86-64
  • olgadoc-0.1.2-cp38-abi3-musllinux_1_2_aarch64.whl (5.9 MB): CPython 3.8+, musllinux: musl 1.2+ ARM64
  • olgadoc-0.1.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB): CPython 3.8+, manylinux: glibc 2.17+ x86-64
  • olgadoc-0.1.2-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (5.1 MB): CPython 3.8+, manylinux: glibc 2.17+ ARM64
  • olgadoc-0.1.2-cp38-abi3-macosx_11_0_arm64.whl (7.5 MB): CPython 3.8+, macOS 11.0+ ARM64
  • olgadoc-0.1.2-cp38-abi3-macosx_10_12_x86_64.whl (7.8 MB): CPython 3.8+, macOS 10.12+ x86-64
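Each wheel name ends with three compatibility tags (Python tag, ABI tag, platform tag), which is how `cp38-abi3` signals a single stable-ABI build for CPython 3.8 and later. The following sketch pulls those tags out of a filename with plain string handling; `wheel_tags` and `WheelTags` are hypothetical names for this example, and the code does not validate the tags themselves.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    python: str
    abi: str
    platform: str


def wheel_tags(filename: str) -> WheelTags:
    """Extract the trailing python/abi/platform tags from a wheel filename."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename}")
    stem = filename[: -len(".whl")]
    # The last three dash-separated fields are the compatibility tags;
    # everything before them is the name, version, and optional build tag.
    _, python, abi, platform = stem.rsplit("-", 3)
    return WheelTags(python, abi, platform)
```

For example, `wheel_tags("olgadoc-0.1.2-cp38-abi3-win_amd64.whl")` yields the tags `cp38`, `abi3`, and `win_amd64`.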

File details

Details for the file olgadoc-0.1.2.tar.gz.

File metadata

  • Download URL: olgadoc-0.1.2.tar.gz
  • Size: 6.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for olgadoc-0.1.2.tar.gz
Algorithm Hash digest
SHA256 908a0869169453badb0ac9cc1e22cc37d06378b201bb4ade2a72f68e157b52fb
MD5 bc7a81f25ac8a92399f2942b5d96b793
BLAKE2b-256 6e03c31d2ce2c8811c3d74e1910dd5ab2829dc2b91c98356fce1888ffb9ce67c
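A published digest is only useful if the downloaded file is checked against it. The sketch below shows one way to do that with the standard library; `sha256_matches` is a hypothetical helper name, and the expected value would be the SHA256 digest listed for the file in question.

```python
import hashlib
from pathlib import Path


def sha256_matches(path: str, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's SHA-256 digest equals `expected_hex`."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # Read in chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()
```

For installs, pip can perform the same check itself when a requirements file pins `--hash=sha256:...` values and is installed with `--require-hashes`.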

Provenance

The following attestation bundles were made for olgadoc-0.1.2.tar.gz:

Publisher: release.yml on Hugues-DTANKOUO/olga

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file olgadoc-0.1.2-cp38-abi3-win_amd64.whl.

File metadata

  • Download URL: olgadoc-0.1.2-cp38-abi3-win_amd64.whl
  • Size: 5.2 MB
  • Tags: CPython 3.8+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for olgadoc-0.1.2-cp38-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 592d9ce4bdd8b0c1eed4850aa7c67f2a1e9a5902ce552821e20041e25ccbfa92
MD5 897dd604a751bb570ae8c0a4befef3bb
BLAKE2b-256 f5f36714a2b4b96b7a0b4d57789b7445d14f06f77ddb46753df3372035dcc02c

Provenance

The following attestation bundles were made for olgadoc-0.1.2-cp38-abi3-win_amd64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga

File details

Details for the file olgadoc-0.1.2-cp38-abi3-musllinux_1_2_x86_64.whl.

File hashes

Hashes for olgadoc-0.1.2-cp38-abi3-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 ba6a1a962098660063acc23f55dbec301b6a7325ac1e8ff9764466d597841f7e
MD5 32238488257dc862068764fedb596128
BLAKE2b-256 e81a83760ce65106b7d36f6f6e472942bff1ccc15ba2885fe58cc3e448e382f5

Provenance

The following attestation bundles were made for olgadoc-0.1.2-cp38-abi3-musllinux_1_2_x86_64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga

File details

Details for the file olgadoc-0.1.2-cp38-abi3-musllinux_1_2_aarch64.whl.

File hashes

Hashes for olgadoc-0.1.2-cp38-abi3-musllinux_1_2_aarch64.whl
Algorithm Hash digest
SHA256 baa7b39fcdbc6dd46f2ba8ecf3abdf42d5282ccd30832aa644148fc16552599c
MD5 b09315bfea68aee26683b503a7b03554
BLAKE2b-256 3a7ab2a76310b0cdfcd2a8de8ad7ca81c406546a428c8b6f7a7953f0314b967c

Provenance

The following attestation bundles were made for olgadoc-0.1.2-cp38-abi3-musllinux_1_2_aarch64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga

File details

Details for the file olgadoc-0.1.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for olgadoc-0.1.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 d346e3447d8d017c91f87f21bb0ab2fe49540083ba9745b081df921c465ff046
MD5 ac897c24062494948dd81c1c2660d417
BLAKE2b-256 db56025f3b025743e1e516b045c6838421455bbd77c48843815f7b66a2083cb4

Provenance

The following attestation bundles were made for olgadoc-0.1.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga

File details

Details for the file olgadoc-0.1.2-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for olgadoc-0.1.2-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 b3d4b8a9e008011b9dc08b8cf19491486e828d0a91c9a789523664edfed40a3d
MD5 6e1fef51d432a0005294ea4ae0a3a7ec
BLAKE2b-256 10c2d85b82493c9a86ec5a686711213916ff22e8fa371895ff94a655b9700ce9

Provenance

The following attestation bundles were made for olgadoc-0.1.2-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga

File details

Details for the file olgadoc-0.1.2-cp38-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for olgadoc-0.1.2-cp38-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 0f004473e623822b5c28bd4075dc46417732c1c07a71dabc5cd57e7114090805
MD5 6bc6d7c2b37685ca7798e76842c46f01
BLAKE2b-256 5e306ae113a8d83ce63e8be40e4ed2ea133c2b853b2a75cf15a31e463f28d337

Provenance

The following attestation bundles were made for olgadoc-0.1.2-cp38-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga

File details

Details for the file olgadoc-0.1.2-cp38-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for olgadoc-0.1.2-cp38-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 6e99aba23b8673a731a14e975df33c779891bd90e11a344b30fb04a7045b043e
MD5 36648554180fdc3069037db1588f54c4
BLAKE2b-256 ea6402d963e2cc2c8241195dbc6c412fa81a86d65f83ad666b6078d3a0a8f63b

Provenance

The following attestation bundles were made for olgadoc-0.1.2-cp38-abi3-macosx_10_12_x86_64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga
