Python bindings for Olga. PDF, DOCX, XLSX, HTML → Markdown and typed JSON, 15–40× faster than equivalent-quality OSS. Strictly-typed surface, no Any, one abi3 wheel for CPython 3.8+.

olgadoc

Four formats. One engine. 15–40× faster.

Spatial fidelity at native speed, across PDF, DOCX, XLSX, and HTML. One Document API. mypy --strict clean. No LLM in the loop.

Python bindings for Olga — a Rust document-processing engine. Built on PyO3 and maturin; one abi3 wheel covers CPython 3.8+.

Install

pip install olgadoc

Ten-second tour

import olgadoc

doc = olgadoc.Document.open("report.pdf")
print(doc.format, doc.page_count)           # e.g. PDF 12

# Will this document produce text, or does it need OCR first?
report = doc.processability()
if report.is_blocked():
    raise SystemExit([b["kind"] for b in report.blockers])

# Full-text search
for hit in doc.search("quarterly revenue"):
    print(hit["page"], hit["snippet"])

# Structured JSON tree — discriminated on ``type``
for element in doc.to_json()["elements"]:
    if element["type"] == "heading":
        print(f"h{element['level']}: {element['text']}")

Why olgadoc

  • Four formats, one API. PDF, DOCX, XLSX, and HTML all expose the same Document / Page surface. Stop juggling pdfplumber + python-docx + openpyxl + BeautifulSoup.
  • Native speed. PDF 4–8 ms · DOCX 2 ms · XLSX 1–12 ms · HTML 1–5 ms. 15–40× faster than the quality-equivalent tool on every format. (benchmarks)
  • Spatial fidelity, intact. Tables stay tables. Columns stay columns. Figure captions stay next to their figures. Layout carries meaning, and Olga preserves it across the round-trip to Markdown or to the typed JSON tree.
  • OCR pre-flight. doc.processability() tells you — before the pipeline starts — whether a document actually carries native text, or whether it's a scanned image that needs OCR first. Fail fast, save money.
  • Actually typed. Zero Any on the public surface. Every returned dict is a real TypedDict, Document.to_json() returns a discriminated union over 16 element variants, and mypy --strict narrows each branch.
  • No LLM in the loop. Reads the native content stream directly. Validated with an anti-LLM adversarial test — invisible canaries preserved byte-exact, deliberate typos intact, no hallucinations.
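The OCR pre-flight idea in the list above can be sketched as a small routing step. This is a minimal sketch and not olgadoc's own code: `ReportLike`, `FakeReport`, and `route` are hypothetical names. The only details taken from this page are that a report exposes `is_blocked()` and a `blockers` list of dicts with a `kind` key, and that scanned PDFs surface an `EmptyContent` blocker. Treating `EmptyContent` alone as "send to OCR" is an assumption.

```python
from typing import Dict, List, Protocol


class ReportLike(Protocol):
    """Hypothetical structural type: anything shaped like a processability report."""

    blockers: List[Dict[str, str]]

    def is_blocked(self) -> bool: ...


def route(report: ReportLike) -> str:
    """Return 'extract', 'ocr', or 'reject' for one document."""
    if not report.is_blocked():
        return "extract"
    kinds = {blocker["kind"] for blocker in report.blockers}
    # Assumption: EmptyContent on its own means "no native text", so OCR may help.
    if kinds == {"EmptyContent"}:
        return "ocr"
    # Any other blocker is treated as fatal in this sketch.
    return "reject"


class FakeReport:
    """Stand-in used only to exercise route() without opening a real file."""

    def __init__(self, kinds: List[str]) -> None:
        self.blockers = [{"kind": kind} for kind in kinds]

    def is_blocked(self) -> bool:
        return bool(self.blockers)
```

With a real document, the object returned by `doc.processability()` would presumably be passed to `route` in place of `FakeReport`.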

Typed surface, no Any

Every returned dict is a runtime TypedDict — introspectable at runtime and narrowed at type-check time.

from olgadoc import SearchHit

def show(hit: SearchHit) -> None:
    print(hit["page"], hit["snippet"])  # ok
    print(hit["nope"])                  # mypy: TypedDict "SearchHit" has no key "nope"
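To illustrate what "introspectable at runtime" can mean for a TypedDict, here is a minimal sketch using only the standard library. The `SearchHit` class below is a hypothetical stand-in that models just the two keys shown on this page; the field types (`int`, `str`) and the `missing_keys` helper are assumptions, not olgadoc's definitions.

```python
from typing import Mapping, Set, TypedDict, get_type_hints


class SearchHit(TypedDict):
    """Hypothetical stand-in; the real olgadoc.SearchHit may carry more keys."""

    page: int
    snippet: str


def missing_keys(value: Mapping[str, object], schema: type) -> Set[str]:
    """Return the keys declared on a TypedDict class that are absent from value."""
    return set(get_type_hints(schema)) - set(value)


# At runtime a TypedDict value is a plain dict; the class only carries the schema.
hit: SearchHit = {"page": 3, "snippet": "quarterly revenue rose"}
print(get_type_hints(SearchHit))
print(missing_keys({"page": 3}, SearchHit))
```

The same `get_type_hints` call should work on any TypedDict class, which is what makes schema checks like `missing_keys` possible without a type checker.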

Document.to_json() returns a DocumentJson tree whose elements are a discriminated JsonElement union over 16 variants (heading, paragraph, table, list, image, code_block, …). Mypy narrows each branch to exactly one.
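The narrowing described above follows the general tagged-union pattern for TypedDicts. The sketch below shows that pattern with two hypothetical variants, `HeadingJson` and `ParagraphJson`; their names and the `text` field on paragraphs are assumptions for illustration, and the real union covers 16 variants whose shapes are not listed on this page.

```python
from typing import Literal, TypedDict, Union


class HeadingJson(TypedDict):
    """Hypothetical variant, using the heading keys shown in the tour."""

    type: Literal["heading"]
    level: int
    text: str


class ParagraphJson(TypedDict):
    """Hypothetical variant; the 'text' key is an assumption."""

    type: Literal["paragraph"]
    text: str


# A two-member stand-in for the 16-variant JsonElement union.
JsonElement = Union[HeadingJson, ParagraphJson]


def render(element: JsonElement) -> str:
    """Render one element as a line of Markdown."""
    if element["type"] == "heading":
        # A type checker narrows to HeadingJson here, so "level" is known to exist.
        return "#" * element["level"] + " " + element["text"]
    # Only ParagraphJson remains in this branch.
    return element["text"]
```

Because each variant pins `type` to a distinct `Literal`, a comparison on that key is enough for a checker such as mypy to pick exactly one member of the union.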

vs alternatives

olgadoc pdfplumber unstructured docling
PDF
DOCX
XLSX partial partial
HTML partial
mypy --strict clean (no Any)
OCR pre-flight
Provenance per element
No ML model / no GPU required optional optional

What you get

  • Four formats, one API: PDF, DOCX, XLSX, HTML through Document.
  • Processability report: Document.processability() → blockers (including EmptyContent for scanned PDFs) and degradations.
  • Cross-page tables: anchored on the first page with is_cross_page.
  • Hyperlinks, images, outline, RAG chunks, case-insensitive search.
  • Structured JSON tree: Document.to_json(), discriminated union over 16 element variants.

Examples

Five runnable scripts live in examples/:

  • quickstart.py — open a document, print a per-page preview.
  • extract_tables.py — pull every reconstructed table as TSV.
  • batch_processability.py — recursively health-check a directory.
  • search_and_extract.py — search + print surrounding page text.
  • json_walk.py — walk the typed JSON tree and narrow by type.
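As a rough idea of what a recursive health check like the one listed above might involve, here is a minimal sketch. It is not the contents of batch_processability.py: `find_documents`, `health_check`, the `SUFFIXES` set, and the pluggable `check` callable are all hypothetical, chosen so the walk can run without olgadoc installed.

```python
from pathlib import Path
from typing import Callable, Dict, List

# Assumption: file extensions for the four formats named on this page.
SUFFIXES = {".pdf", ".docx", ".xlsx", ".html"}


def find_documents(root: Path) -> List[Path]:
    """Recursively collect files whose suffix matches a supported format."""
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUFFIXES
    )


def health_check(
    root: Path, check: Callable[[Path], List[str]]
) -> Dict[str, List[str]]:
    """Map each blocked file (path relative to root) to its blocker kinds."""
    blocked: Dict[str, List[str]] = {}
    for path in find_documents(root):
        kinds = check(path)  # an empty list means "not blocked"
        if kinds:
            blocked[path.relative_to(root).as_posix()] = kinds
    return blocked
```

In real use, `check` might wrap the calls shown in the ten-second tour, for example `lambda p: [b["kind"] for b in olgadoc.Document.open(str(p)).processability().blockers]`. Whether `Document.open` accepts a `Path` directly is not stated on this page, hence the `str(p)`.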

Building from source

pip install maturin
cd olgadoc
maturin develop --release
pytest tests/ -q

License

Apache License 2.0.

Download files

Source Distribution

  • olgadoc-0.1.0.tar.gz (733.9 kB): source

Built Distributions

  • olgadoc-0.1.0-cp38-abi3-win_amd64.whl (5.2 MB): CPython 3.8+, Windows x86-64
  • olgadoc-0.1.0-cp38-abi3-musllinux_1_2_x86_64.whl (6.1 MB): CPython 3.8+, musllinux (musl 1.2+), x86-64
  • olgadoc-0.1.0-cp38-abi3-musllinux_1_2_aarch64.whl (5.9 MB): CPython 3.8+, musllinux (musl 1.2+), ARM64
  • olgadoc-0.1.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB): CPython 3.8+, manylinux (glibc 2.17+), x86-64
  • olgadoc-0.1.0-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (5.1 MB): CPython 3.8+, manylinux (glibc 2.17+), ARM64
  • olgadoc-0.1.0-cp38-abi3-macosx_11_0_arm64.whl (7.5 MB): CPython 3.8+, macOS 11.0+, ARM64
  • olgadoc-0.1.0-cp38-abi3-macosx_10_12_x86_64.whl (7.8 MB): CPython 3.8+, macOS 10.12+, x86-64

File details

Details for the file olgadoc-0.1.0.tar.gz.

File metadata

  • Download URL: olgadoc-0.1.0.tar.gz
  • Size: 733.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for olgadoc-0.1.0.tar.gz
Algorithm Hash digest
SHA256 cb14f919ac1370e265977d9cd3f6684cdf4de14028c8337cd62c579bd2fdf5c0
MD5 b466aa43ce2c1e380037c88c88047c51
BLAKE2b-256 16cb49f758ef53253c98617bcf4d1039af8713c0b5c926cd04c0b839c7e1f091
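A downloaded file can be checked against a published SHA256 digest with the standard library alone. The sketch below is one way to do it; `sha256_of` and `matches` are hypothetical helper names, and the chunk size is an arbitrary choice.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large wheels never need to fit in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest to an expected hex string, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches(Path("olgadoc-0.1.0.tar.gz"), "<SHA256 from the table above>")` would be expected to return `True` for an intact download.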

Provenance

The following attestation bundles were made for olgadoc-0.1.0.tar.gz:

Publisher: release.yml on Hugues-DTANKOUO/olga

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file olgadoc-0.1.0-cp38-abi3-win_amd64.whl.

File metadata

  • Download URL: olgadoc-0.1.0-cp38-abi3-win_amd64.whl
  • Size: 5.2 MB
  • Tags: CPython 3.8+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for olgadoc-0.1.0-cp38-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 efb54ce10d8110f189f930522aba857dc2ce895b320a24c524ded52746f1a56c
MD5 a003e0f7160309d0d5d9a98e876fabcb
BLAKE2b-256 b6aebabad3f0e7e561038247fa676f550f9b2c66d93455e6f170f049b8ed6479

Provenance

The following attestation bundles were made for olgadoc-0.1.0-cp38-abi3-win_amd64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga

File details

Details for the file olgadoc-0.1.0-cp38-abi3-musllinux_1_2_x86_64.whl.

File hashes

Hashes for olgadoc-0.1.0-cp38-abi3-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 0508983a6eddc83a5e5b071d57a935dfbcda26628e1ad9092b57be1849903bb2
MD5 b76b60f9f63f7ec0ae92c475a50a4696
BLAKE2b-256 895249b3950ff8edaa7138ce6908776275e039ad83c834625f20fb8c3749ce07

Provenance

The following attestation bundles were made for olgadoc-0.1.0-cp38-abi3-musllinux_1_2_x86_64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga

File details

Details for the file olgadoc-0.1.0-cp38-abi3-musllinux_1_2_aarch64.whl.

File hashes

Hashes for olgadoc-0.1.0-cp38-abi3-musllinux_1_2_aarch64.whl
Algorithm Hash digest
SHA256 b29b53e361dd615a5018267e2f1b404391421839cdc229689baff52e4a1856c4
MD5 f9d1450783a9312a71a85d8c11c94812
BLAKE2b-256 33e61a01d67c669d5e11aed036757cc0408abe227332451e67b638cd0c1f8c18

Provenance

The following attestation bundles were made for olgadoc-0.1.0-cp38-abi3-musllinux_1_2_aarch64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga

File details

Details for the file olgadoc-0.1.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for olgadoc-0.1.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 731d995836f35643fef918f734fdc7ea0609c427ecc656dfb394197ee0189387
MD5 6d099901fca1ebb72209ad342f3152c3
BLAKE2b-256 5d43ff9b659226e582e2ea77def3ecd656725217c0323a0af8ad31884666992b

Provenance

The following attestation bundles were made for olgadoc-0.1.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga

File details

Details for the file olgadoc-0.1.0-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for olgadoc-0.1.0-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 8578b6837dd80c27e732ddf3a580c8b120a7baa38b49be481ceccd5c139465b1
MD5 f8d3305310e5e61bed534bb3089cd834
BLAKE2b-256 621dc62a5a04c3cf26cd1ccad1da8892d00431451c00cf6b68fd4902dcd1ae46

Provenance

The following attestation bundles were made for olgadoc-0.1.0-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga

File details

Details for the file olgadoc-0.1.0-cp38-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for olgadoc-0.1.0-cp38-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 684bdd1e4d74780217a9a1f424a7a31ac357be805e2ad6d3fa1b6878fb12e295
MD5 790512d58f53db17702508a59e91e39c
BLAKE2b-256 2b144ac832a6946e884fc2c0ce02086fa56a45ac5b54ea348595aed82c4deeb1

Provenance

The following attestation bundles were made for olgadoc-0.1.0-cp38-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga

File details

Details for the file olgadoc-0.1.0-cp38-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for olgadoc-0.1.0-cp38-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 f63e6987c6f46b7cab25aeb310731e3162835fb8be610b00604b0c9db48ba938
MD5 cb97b48fef5ec4268fa7c03833e8c47e
BLAKE2b-256 75341ee8791921f4886b5c20f92ced105e647f649769de054bd8aee883061b72

Provenance

The following attestation bundles were made for olgadoc-0.1.0-cp38-abi3-macosx_10_12_x86_64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga
