Python bindings for Olga. PDF, DOCX, XLSX, HTML → Markdown and typed JSON, 15–40× faster than equivalent-quality OSS. Strictly-typed surface, no Any, one abi3 wheel for CPython 3.8+.

olgadoc

Four formats. One engine. 15–40× faster.

Spatial fidelity at native speed, across PDF, DOCX, XLSX, and HTML. One Document API. mypy --strict clean. No LLM in the loop.

Python bindings for Olga — a Rust document-processing engine. Built on PyO3 and maturin; one abi3 wheel covers CPython 3.8+.

Install

pip install olgadoc

Ten-second tour

import olgadoc

doc = olgadoc.Document.open("report.pdf")
print(doc.format, doc.page_count)           # e.g. PDF 12

# Will this document produce text, or does it need OCR first?
report = doc.processability()
if report.is_blocked():
    raise SystemExit([b["kind"] for b in report.blockers])

# Full-text search
for hit in doc.search("quarterly revenue"):
    print(hit["page"], hit["snippet"])

# Structured JSON tree — discriminated on ``type``
for element in doc.to_json()["elements"]:
    if element["type"] == "heading":
        print(f"h{element['level']}: {element['text']}")
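
The search loop in the tour prints hits one by one; grouping them by page is a common next step. The sketch below assumes only the two keys the tour uses (page and snippet). The Hit type is a local, hypothetical stand-in for olgadoc's own SearchHit, and treating page as an int is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, TypedDict


class Hit(TypedDict):
    """Hypothetical stand-in for a search hit; only the keys used in the tour."""

    page: int  # assumption: page numbers are integers
    snippet: str


def snippets_by_page(hits: Iterable[Hit]) -> Dict[int, List[str]]:
    """Group snippets under the page they were found on, keeping hit order."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for hit in hits:
        grouped[hit["page"]].append(hit["snippet"])
    return dict(grouped)


# Stand-in data; real hits would come from doc.search("quarterly revenue").
sample_hits: List[Hit] = [
    {"page": 3, "snippet": "quarterly revenue rose"},
    {"page": 1, "snippet": "Quarterly revenue summary"},
    {"page": 3, "snippet": "quarterly revenue by region"},
]

for page, snippets in sorted(snippets_by_page(sample_hits).items()):
    print(page, snippets)
```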

Why olgadoc

  • Four formats, one API. PDF, DOCX, XLSX, and HTML all expose the same Document / Page surface. Stop juggling pdfplumber + python-docx + openpyxl + BeautifulSoup.
  • Native speed. PDF 4–8 ms · DOCX 2 ms · XLSX 1–12 ms · HTML 1–5 ms. 15–40× faster than the quality-equivalent tool on every format (benchmarks). A post-release independent reproducible audit on a 50-file mixed corpus finds olgadoc 1.62× faster and 2.62× richer in extracted content than a hand-routed best-of-breed pipeline (report).
  • Spatial fidelity, intact. Tables stay tables. Columns stay columns. Figure captions stay next to their figures. Layout carries meaning, and Olga preserves it across the round-trip to Markdown or to the typed JSON tree.
  • OCR pre-flight. doc.processability() tells you — before the pipeline starts — whether a document actually carries native text, or whether it's a scanned image that needs OCR first. Fail fast, save money.
  • Actually typed. Zero Any on the public surface. Every returned dict is a real TypedDict, Document.to_json() returns a discriminated union over 16 element variants, and mypy --strict narrows each branch.
  • No LLM in the loop. Reads the native content stream directly. Validated with an anti-LLM adversarial test — invisible canaries preserved byte-exact, deliberate typos intact, no hallucinations.
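
The OCR pre-flight idea can be sketched as a small triage helper that separates files needing OCR from files that can be processed directly. It relies only on the calls shown in the tour (Document.open, processability(), is_blocked(), blockers with a "kind" key). The triage function, the stub classes, and the file names are hypothetical, and using "EmptyContent" as a kind value is an assumption based on the blocker name mentioned in this README.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional


def triage(
    paths: Iterable[str],
    open_document: Optional[Callable[[str], Any]] = None,
) -> Dict[str, List[str]]:
    """Return {path: [blocker kinds]} for every document that is blocked."""
    if open_document is None:
        import olgadoc  # imported lazily so the helper also works with a stub

        open_document = olgadoc.Document.open
    blocked: Dict[str, List[str]] = {}
    for path in paths:
        report = open_document(path).processability()
        if report.is_blocked():
            blocked[path] = [b["kind"] for b in report.blockers]
    return blocked


# Minimal stand-ins so the sketch runs without real files or olgadoc installed.
class _StubReport:
    def __init__(self, kinds: List[str]) -> None:
        self.blockers = [{"kind": kind} for kind in kinds]

    def is_blocked(self) -> bool:
        return bool(self.blockers)


class _StubDocument:
    def __init__(self, kinds: List[str]) -> None:
        self._kinds = kinds

    def processability(self) -> _StubReport:
        return _StubReport(self._kinds)


_fake_corpus = {"native.pdf": [], "scan.pdf": ["EmptyContent"]}
result = triage(_fake_corpus, lambda path: _StubDocument(_fake_corpus[path]))
print(result)
```

With the default opener, triage(["a.pdf", "b.docx"]) would call olgadoc directly; anything it returns can be routed to an OCR step before the main pipeline runs.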

Typed surface, no Any

Every returned dict is a runtime TypedDict — introspectable at runtime and narrowed at type-check time.

from olgadoc import SearchHit

def show(hit: SearchHit) -> None:
    print(hit["page"], hit["snippet"])  # ok
    print(hit["nope"])                  # mypy: "SearchHit" has no key "nope"

Document.to_json() returns a DocumentJson tree whose elements are a discriminated JsonElement union over 16 variants (heading, paragraph, table, list, image, code_block, …). Mypy narrows each branch to exactly one.
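
To illustrate what "discriminated on type" means for a type checker, here is a self-contained sketch with two locally defined variants. Heading, Paragraph, and Element are hypothetical stand-ins, not olgadoc's real definitions. The heading fields (type, level, text) follow the tour above; the paragraph fields are an assumption.

```python
from typing import List, Literal, TypedDict, Union


class Heading(TypedDict):
    type: Literal["heading"]
    level: int
    text: str


class Paragraph(TypedDict):
    type: Literal["paragraph"]
    text: str  # assumption: paragraphs carry a plain "text" field


# A two-variant stand-in for the 16-variant union described above.
Element = Union[Heading, Paragraph]


def markdown_outline(elements: List[Element]) -> List[str]:
    """Render headings as Markdown lines and skip every other variant."""
    lines: List[str] = []
    for element in elements:
        if element["type"] == "heading":
            # The Literal tag lets a checker narrow to Heading here,
            # so reading "level" is accepted.
            lines.append("#" * element["level"] + " " + element["text"])
    return lines


sample: List[Element] = [
    {"type": "heading", "level": 1, "text": "Report"},
    {"type": "paragraph", "text": "Opening remarks."},
    {"type": "heading", "level": 2, "text": "Scope"},
]
print(markdown_outline(sample))
```

Reading element["level"] outside the heading branch would be flagged by a strict type checker, because Paragraph has no such key.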

vs alternatives

olgadoc pdfplumber unstructured docling
PDF
DOCX
XLSX partial partial
HTML partial
mypy --strict clean (no Any)
OCR pre-flight
Provenance per element
No ML model / no GPU required optional optional

What you get

  • Four formats, one API — PDF, DOCX, XLSX, HTML through Document.
  • Processability report: Document.processability() → blockers (including EmptyContent for scanned PDFs) and degradations.
  • Cross-page tables — anchored on the first page with is_cross_page.
  • Hyperlinks, images, outline, RAG chunks, case-insensitive search.
  • Structured JSON tree: Document.to_json(), discriminated union over 16 element variants.

Examples

Five runnable scripts live in examples/:

  • quickstart.py — open a document, print a per-page preview.
  • extract_tables.py — pull every reconstructed table as TSV.
  • batch_processability.py — recursively health-check a directory.
  • search_and_extract.py — search + print surrounding page text.
  • json_walk.py — walk the typed JSON tree and narrow by type.

Building from source

pip install maturin
cd olgadoc
maturin develop --release
pytest tests/ -q

License

Apache License 2.0.

Download files

Source Distribution

  • olgadoc-0.1.3.tar.gz (6.0 MB): Source

Built Distributions

  • olgadoc-0.1.3-cp38-abi3-win_amd64.whl (5.2 MB): CPython 3.8+, Windows x86-64
  • olgadoc-0.1.3-cp38-abi3-musllinux_1_2_x86_64.whl (6.1 MB): CPython 3.8+, musllinux: musl 1.2+ x86-64
  • olgadoc-0.1.3-cp38-abi3-musllinux_1_2_aarch64.whl (5.9 MB): CPython 3.8+, musllinux: musl 1.2+ ARM64
  • olgadoc-0.1.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB): CPython 3.8+, manylinux: glibc 2.17+ x86-64
  • olgadoc-0.1.3-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (5.1 MB): CPython 3.8+, manylinux: glibc 2.17+ ARM64
  • olgadoc-0.1.3-cp38-abi3-macosx_11_0_arm64.whl (7.5 MB): CPython 3.8+, macOS 11.0+ ARM64
  • olgadoc-0.1.3-cp38-abi3-macosx_10_12_x86_64.whl (7.8 MB): CPython 3.8+, macOS 10.12+ x86-64

File details

Details for the file olgadoc-0.1.3.tar.gz.

File metadata

  • Download URL: olgadoc-0.1.3.tar.gz
  • Size: 6.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for olgadoc-0.1.3.tar.gz
Algorithm Hash digest
SHA256 b5c71b2bb5e1e9c4fc4b954b8f6184da6d7d738b63f10a00f861c410ea64bf35
MD5 b8146c4613444e362bbe4b578870bbfc
BLAKE2b-256 26f44976548d44d973fb2c3ffb41789842ab0995765542329208fee313c24aff
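
A downloaded archive can be checked against the SHA256 digest listed above with the standard library alone. This is a minimal sketch: the helper names sha256_of_stream and verify_file are hypothetical, and the local file path passed to verify_file is whatever location the file was saved to.

```python
import hashlib
import io
from pathlib import Path
from typing import BinaryIO


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large files are never fully in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: Path, expected_hex: str) -> bool:
    """Return True when the file's SHA256 matches the published hex digest."""
    with path.open("rb") as handle:
        return sha256_of_stream(handle) == expected_hex.strip().lower()


# In-memory demonstration using a well-known test input.
demo_digest = sha256_of_stream(io.BytesIO(b"abc"))
print(demo_digest)
```

For the source distribution, the expected value would be the SHA256 line listed above, for example verify_file(Path("olgadoc-0.1.3.tar.gz"), "b5c71b2bb5e1e9c4fc4b954b8f6184da6d7d738b63f10a00f861c410ea64bf35").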

Provenance

The following attestation bundles were made for olgadoc-0.1.3.tar.gz:

Publisher: release.yml on Hugues-DTANKOUO/olga

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file olgadoc-0.1.3-cp38-abi3-win_amd64.whl.

File metadata

  • Download URL: olgadoc-0.1.3-cp38-abi3-win_amd64.whl
  • Size: 5.2 MB
  • Tags: CPython 3.8+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for olgadoc-0.1.3-cp38-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 7d3e3c537119052757b333f757a9228405964389b2301b00244b1530da72cef1
MD5 312ce91e88223f19c95058e095d25997
BLAKE2b-256 83bae1c2511560d0179b107cf5d26e24ab19fa6c5ce8fa3f10f8fd54bd05a9c2

Provenance

The following attestation bundles were made for olgadoc-0.1.3-cp38-abi3-win_amd64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga

File details

Details for the file olgadoc-0.1.3-cp38-abi3-musllinux_1_2_x86_64.whl.

File hashes

Hashes for olgadoc-0.1.3-cp38-abi3-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 cb88b0c48349ffc7e490cce1f7fc6c96520bacc851cac15126347d59bfa1feca
MD5 5c341e8d297de9f8d240882fea32077f
BLAKE2b-256 3cc0f924d9224ebfd5ec6c6ab41bc3126727f807741d100a4db5b41b157a8585

Provenance

The following attestation bundles were made for olgadoc-0.1.3-cp38-abi3-musllinux_1_2_x86_64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga

File details

Details for the file olgadoc-0.1.3-cp38-abi3-musllinux_1_2_aarch64.whl.

File hashes

Hashes for olgadoc-0.1.3-cp38-abi3-musllinux_1_2_aarch64.whl
Algorithm Hash digest
SHA256 9a9f5e170921cde1834619baaee4e02622139f8456876505ebef4a7f711f6a16
MD5 e90d5ec9fc5e97c38521f3a193b24f5b
BLAKE2b-256 7b8d2f7cc455476ecf5a04ee6420d4e5e9eb9f69b6aa61ac624420b9806e092f

Provenance

The following attestation bundles were made for olgadoc-0.1.3-cp38-abi3-musllinux_1_2_aarch64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga

File details

Details for the file olgadoc-0.1.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for olgadoc-0.1.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 134b2bf37c0ecfc3ca14810e994fac6d181d4876ca407656a1de816aabf3356c
MD5 b8ee3a765e1c9c7e2cf4f179d018a428
BLAKE2b-256 38f17b61c89401551a91611bac9fc8bc44178943352dbdeb75f976611715cfa4

Provenance

The following attestation bundles were made for olgadoc-0.1.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga

File details

Details for the file olgadoc-0.1.3-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for olgadoc-0.1.3-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 e10da7a9e928ffa65478afec7b6d4b247563a7b03d8c25ca9f18dce65797899e
MD5 02b84037bbf0a221defa11bcbfcd6ad4
BLAKE2b-256 ad86f78f558b23646d487bc9672caba745ac1feaaf03c8c52608e88796c67d2d

Provenance

The following attestation bundles were made for olgadoc-0.1.3-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga

File details

Details for the file olgadoc-0.1.3-cp38-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for olgadoc-0.1.3-cp38-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 3c2e37d05e83ce91d12f15bd4585eaf4b6abbe6dbe44511555236d543f158863
MD5 608fbb9d7cc7ef0c315aaf77cda9d872
BLAKE2b-256 6e4f1904614bcf8254e58cedcd8107ff269a282854baa664e7f147172167100e

Provenance

The following attestation bundles were made for olgadoc-0.1.3-cp38-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga

File details

Details for the file olgadoc-0.1.3-cp38-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for olgadoc-0.1.3-cp38-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 35b379853b5114d8058b715f0068fc346e10188f4e15f81c183baf9c9fff00e0
MD5 92ce90ae2d9c50e7dc9395875c6e8807
BLAKE2b-256 4481e1d7c30efaec050804ee295d85f57f1ebb17235c4642bf4c5bd6f7c78b00

Provenance

The following attestation bundles were made for olgadoc-0.1.3-cp38-abi3-macosx_10_12_x86_64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga
