Python bindings for Olga. PDF, DOCX, XLSX, HTML → Markdown and typed JSON, 15–40× faster than equivalent-quality OSS. Strictly-typed surface, no Any, one abi3 wheel for CPython 3.8+.

Project description

olgadoc

Four formats. One engine. 15–40× faster.

Spatial fidelity at native speed, across PDF, DOCX, XLSX, and HTML. One Document API. mypy --strict clean. No LLM in the loop.

Python bindings for Olga — a Rust document-processing engine. Built on PyO3 and maturin; one abi3 wheel covers CPython 3.8+.

Install

pip install olgadoc

Ten-second tour

import olgadoc

doc = olgadoc.Document.open("report.pdf")
print(doc.format, doc.page_count)           # e.g. PDF 12

# Will this document produce text, or does it need OCR first?
report = doc.processability()
if report.is_blocked():
    raise SystemExit([b["kind"] for b in report.blockers])

# Full-text search
for hit in doc.search("quarterly revenue"):
    print(hit["page"], hit["snippet"])

# Structured JSON tree — discriminated on ``type``
for element in doc.to_json()["elements"]:
    if element["type"] == "heading":
        print(f"h{element['level']}: {element['text']}")

Why olgadoc

  • Four formats, one API. PDF, DOCX, XLSX, and HTML all expose the same Document / Page surface. Stop juggling pdfplumber + python-docx + openpyxl + BeautifulSoup.
  • Native speed. PDF 4–8 ms · DOCX 2 ms · XLSX 1–12 ms · HTML 1–5 ms. 15–40× faster than the quality-equivalent tool on every format (benchmarks). A post-release independent reproducible audit on a 50-file mixed corpus finds olgadoc 1.62× faster and 2.62× richer in extracted content than a hand-routed best-of-breed pipeline (report).
  • Spatial fidelity, intact. Tables stay tables. Columns stay columns. Figure captions stay next to their figures. Layout carries meaning, and Olga preserves it across the round-trip to Markdown or to the typed JSON tree.
  • OCR pre-flight. doc.processability() tells you — before the pipeline starts — whether a document actually carries native text, or whether it's a scanned image that needs OCR first. Fail fast, save money.
  • Actually typed. Zero Any on the public surface. Every returned dict is a real TypedDict, Document.to_json() returns a discriminated union over 16 element variants, and mypy --strict narrows each branch.
  • No LLM in the loop. Reads the native content stream directly. Validated with an anti-LLM adversarial test — invisible canaries preserved byte-exact, deliberate typos intact, no hallucinations.
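The OCR pre-flight idea can be sketched as a small triage step that sorts files into "ready to extract" and "blocked". This is a minimal sketch, not part of olgadoc: the `triage` function and its `open_document` parameter are hypothetical names introduced here, and the only library behaviour assumed is what the tour above shows (`processability()`, `is_blocked()`, and `blockers` entries carrying a `"kind"` key).

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def triage(
    paths: Iterable[str],
    open_document: Callable[[str], Any],
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Split paths into ready files and blocked files (with blocker kinds).

    `open_document` is injected (hypothetical parameter) so the routing
    logic can be exercised without the real engine installed.
    """
    ready: List[str] = []
    blocked: Dict[str, List[str]] = {}
    for path in paths:
        report = open_document(path).processability()
        if report.is_blocked():
            # Keep the blocker kinds so the caller can route, e.g. to OCR.
            blocked[path] = [b["kind"] for b in report.blockers]
        else:
            ready.append(path)
    return ready, blocked
```

With the package installed, the opener would presumably be `olgadoc.Document.open`, as in `triage(paths, olgadoc.Document.open)`.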

Typed surface, no Any

Every returned dict is a runtime TypedDict — introspectable at runtime and narrowed at type-check time.

from olgadoc import SearchHit

def show(hit: SearchHit) -> None:
    print(hit["page"], hit["snippet"])  # ok
    print(hit["nope"])                  # mypy: TypedDict "SearchHit" has no key "nope"

Document.to_json() returns a DocumentJson tree whose elements are a discriminated JsonElement union over 16 variants (heading, paragraph, table, list, image, code_block, …). Mypy narrows each branch to exactly one.
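To illustrate what narrowing on a discriminated union looks like, here is a self-contained sketch using only the standard `typing` module. The `HeadingJson`, `ParagraphJson`, and `Element` names are hypothetical two-variant stand-ins, not the library's actual 16-variant definitions; only the `type`, `level`, and `text` keys are taken from the tour above.

```python
from typing import Literal, TypedDict, Union


class HeadingJson(TypedDict):
    type: Literal["heading"]
    level: int
    text: str


class ParagraphJson(TypedDict):
    type: Literal["paragraph"]
    text: str


# Hypothetical stand-in for the library's JsonElement union.
Element = Union[HeadingJson, ParagraphJson]


def render(element: Element) -> str:
    """Render one element as a Markdown-style line."""
    if element["type"] == "heading":
        # The checker narrows to HeadingJson here, so "level" is known.
        return "#" * element["level"] + " " + element["text"]
    # Only ParagraphJson remains on this branch.
    return element["text"]
```

Comparing the `type` key against a literal is what lets a type checker pick a single variant per branch; accessing `element["level"]` outside the heading branch would be reported as an error.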

vs alternatives

olgadoc pdfplumber unstructured docling
PDF
DOCX
XLSX partial partial
HTML partial
mypy --strict clean (no Any)
OCR pre-flight
Provenance per element
No ML model / no GPU required optional optional

What you get

  • Four formats, one API — PDF, DOCX, XLSX, HTML through Document.
  • Processability report: Document.processability() → blockers (including EmptyContent for scanned PDFs) and degradations.
  • Cross-page tables — anchored on the first page with is_cross_page.
  • Hyperlinks, images, outline, RAG chunks, case-insensitive search.
  • Structured JSON tree: Document.to_json(), discriminated union over 16 element variants.

Examples

Five runnable scripts live in examples/:

  • quickstart.py — open a document, print a per-page preview.
  • extract_tables.py — pull every reconstructed table as TSV.
  • batch_processability.py — recursively health-check a directory.
  • search_and_extract.py — search + print surrounding page text.
  • json_walk.py — walk the typed JSON tree and narrow by type.
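As a rough idea of what walking the JSON tree involves, the sketch below works on a plain dict shaped like the tour's `doc.to_json()` output. It is not a copy of json_walk.py: the helper names and the `sample` data are made up for illustration, and it assumes a flat top-level `"elements"` list whose items carry a `"type"` key (plus `"level"` and `"text"` on headings).

```python
from collections import Counter
from typing import Any, Dict, List, Mapping


def count_by_type(tree: Mapping[str, Any]) -> Dict[str, int]:
    """Count top-level elements per ``type`` tag."""
    return dict(Counter(element["type"] for element in tree["elements"]))


def heading_outline(tree: Mapping[str, Any]) -> List[str]:
    """Return heading texts, indented two spaces per level below 1."""
    return [
        "  " * (element["level"] - 1) + element["text"]
        for element in tree["elements"]
        if element["type"] == "heading"
    ]


# Hand-written stand-in for doc.to_json(); not real library output.
sample = {
    "elements": [
        {"type": "heading", "level": 1, "text": "Report"},
        {"type": "paragraph", "text": "Summary of the quarter."},
        {"type": "heading", "level": 2, "text": "Revenue"},
    ]
}

print(count_by_type(sample))
print("\n".join(heading_outline(sample)))
```

Any nesting of elements inside other elements is not handled here, since the source does not say whether the tree nests.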

Building from source

pip install maturin
cd olgadoc
maturin develop --release
pytest tests/ -q

License

Apache License 2.0.

Download files

Source Distribution

olgadoc-0.1.1.tar.gz (6.0 MB): Source

Built Distributions

olgadoc-0.1.1-cp38-abi3-win_amd64.whl (5.2 MB): CPython 3.8+, Windows x86-64

olgadoc-0.1.1-cp38-abi3-musllinux_1_2_x86_64.whl (6.1 MB): CPython 3.8+, musllinux (musl 1.2+) x86-64

olgadoc-0.1.1-cp38-abi3-musllinux_1_2_aarch64.whl (5.9 MB): CPython 3.8+, musllinux (musl 1.2+) ARM64

olgadoc-0.1.1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB): CPython 3.8+, manylinux (glibc 2.17+) x86-64

olgadoc-0.1.1-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (5.1 MB): CPython 3.8+, manylinux (glibc 2.17+) ARM64

olgadoc-0.1.1-cp38-abi3-macosx_11_0_arm64.whl (7.5 MB): CPython 3.8+, macOS 11.0+ ARM64

olgadoc-0.1.1-cp38-abi3-macosx_10_12_x86_64.whl (7.8 MB): CPython 3.8+, macOS 10.12+ x86-64
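The wheel names above follow the standard `{distribution}-{version}-{python tag}-{abi tag}-{platform tag}.whl` layout, with an optional build tag after the version. The sketch below splits a name into those parts; `WheelTags` and `parse_wheel_name` are hypothetical helpers written for this page, and it assumes the distribution name and version contain no extra hyphens. For real tooling, `packaging.utils.parse_wheel_filename` is the more robust choice.

```python
from typing import List, NamedTuple


class WheelTags(NamedTuple):
    distribution: str
    version: str
    python_tags: List[str]
    abi_tags: List[str]
    platform_tags: List[str]


def parse_wheel_name(filename: str) -> WheelTags:
    """Split a wheel filename into its name, version, and tag sets."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    # Five parts normally, six when an optional build tag is present.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename layout: {filename!r}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    # Each tag may be a dot-separated "compressed" set of alternatives.
    return WheelTags(
        distribution=parts[0],
        version=parts[1],
        python_tags=python_tag.split("."),
        abi_tags=abi_tag.split("."),
        platform_tags=platform_tag.split("."),
    )


tags = parse_wheel_name(
    "olgadoc-0.1.1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
)
print(tags.python_tags, tags.abi_tags, tags.platform_tags)
```

The `cp38` Python tag combined with the `abi3` ABI tag is what lets a single wheel per platform serve CPython 3.8 and later.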

File details

Details for the file olgadoc-0.1.1.tar.gz.

File metadata

  • Download URL: olgadoc-0.1.1.tar.gz
  • Upload date:
  • Size: 6.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for olgadoc-0.1.1.tar.gz
Algorithm Hash digest
SHA256 b9362abf65237953058642487b6c60d49a89000fb0efc386158563da7e275a24
MD5 b2ef51d4ef0880d58167c142c2faef7f
BLAKE2b-256 c783c106ef6eb097ea591702d231dec7d98a362567b8fddb09a44fbd05fea37e
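A downloaded file can be checked against a published digest with the standard library alone. The following is a minimal sketch: `sha256_of` and `matches_published` are hypothetical helper names, and the file path in the usage note is an assumption about where the download was saved.

```python
import hashlib
import hmac


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 with a published hex digest."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

For the source distribution, the call would look like `matches_published("olgadoc-0.1.1.tar.gz", "b9362abf65237953058642487b6c60d49a89000fb0efc386158563da7e275a24")`, assuming the archive sits in the current directory. pip can perform the same check itself when a requirements file pins hashes and `--require-hashes` is used.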

Provenance

The following attestation bundles were made for olgadoc-0.1.1.tar.gz:

Publisher: release.yml on Hugues-DTANKOUO/olga

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file olgadoc-0.1.1-cp38-abi3-win_amd64.whl.

File metadata

  • Download URL: olgadoc-0.1.1-cp38-abi3-win_amd64.whl
  • Upload date:
  • Size: 5.2 MB
  • Tags: CPython 3.8+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for olgadoc-0.1.1-cp38-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 357f00a0e1ec9bcab37c217115e9d21cc38a11f10508f938c622f622abd186ba
MD5 78e5ff20beae280da4e3c2f175d9b83e
BLAKE2b-256 5737aa8328263f84b54a1fee4bdff915108001f8490f2a7a0ae5735d5f3eecd4

Provenance

The following attestation bundles were made for olgadoc-0.1.1-cp38-abi3-win_amd64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga

File details

Details for the file olgadoc-0.1.1-cp38-abi3-musllinux_1_2_x86_64.whl.

File hashes

Hashes for olgadoc-0.1.1-cp38-abi3-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 91bb96f5144ce5c93b4a9475b60e478acc0716429ac836bb966bde20c37594f5
MD5 7bab2359196a343c58f4f940f3c5d13c
BLAKE2b-256 9b167c9fd60b7db037b172edda1d7b1b76d1c73f7b474b72b85dcef18abd4d30

Provenance

The following attestation bundles were made for olgadoc-0.1.1-cp38-abi3-musllinux_1_2_x86_64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga

File details

Details for the file olgadoc-0.1.1-cp38-abi3-musllinux_1_2_aarch64.whl.

File hashes

Hashes for olgadoc-0.1.1-cp38-abi3-musllinux_1_2_aarch64.whl
Algorithm Hash digest
SHA256 d7266437fb9bd73fb065de30a3a629d76b900f11c5e8b1ba2328f95a3ac5447b
MD5 330163a203cb6f8a5cee60577a8050d1
BLAKE2b-256 5adb03ad88426f674264d765e4f5b510221a63d0af496ff1bcde6b8a7f3b74a3

Provenance

The following attestation bundles were made for olgadoc-0.1.1-cp38-abi3-musllinux_1_2_aarch64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga

File details

Details for the file olgadoc-0.1.1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for olgadoc-0.1.1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 efc71d8410b570dcc8f461df4b4833d8920a21a00dda8f5ad2476b8a79e527a0
MD5 99b73e1f23c25d9d08f545933d96e71c
BLAKE2b-256 75e0eaeabb0a910bfe3b8ef9bf450f10050482cd6a2c54aa63f7231e282978d3

Provenance

The following attestation bundles were made for olgadoc-0.1.1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga

File details

Details for the file olgadoc-0.1.1-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for olgadoc-0.1.1-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 7d93df305b4e6b88bf2851a24f7cdd07bb46c828e7d8191b72d529f1715f661b
MD5 aa1c1caa991ca5982cd5a9ba12bd79ca
BLAKE2b-256 8ccfff4e34f4aeaa7ad1e9d1d5d1be8ccd2a2bf90a267564e4dc677e2dd6d084

Provenance

The following attestation bundles were made for olgadoc-0.1.1-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga

File details

Details for the file olgadoc-0.1.1-cp38-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for olgadoc-0.1.1-cp38-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 741fe2967e94c961ef2e6c4286ee3fe56803a62728759c694a79af7e2895da14
MD5 545eed2b7cce3a557c0ea97ec82b1e7b
BLAKE2b-256 c7bc9373ee1d8a623f1c0c211487d37b432d17c7d6e7512d73ae78a6524c52c2

Provenance

The following attestation bundles were made for olgadoc-0.1.1-cp38-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga

File details

Details for the file olgadoc-0.1.1-cp38-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for olgadoc-0.1.1-cp38-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 32cf9ab8801784333834ff4aabc7bb29902c618cc6ce47cc136999c1cfb7655a
MD5 8e302c82afda9318021baf8ffc28f966
BLAKE2b-256 0f2ad80a835a78d0b7bfe7284eb98102e57f61ffd9fe5463b9f017db2554163f

Provenance

The following attestation bundles were made for olgadoc-0.1.1-cp38-abi3-macosx_10_12_x86_64.whl:

Publisher: release.yml on Hugues-DTANKOUO/olga
