Python bindings for Olga. PDF, DOCX, XLSX, HTML → Markdown and typed JSON, 15–40× faster than equivalent-quality OSS. Strictly-typed surface, no Any, one abi3 wheel for CPython 3.8+.
olgadoc
Four formats. One engine. 15–40× faster.
Spatial fidelity at native speed, across PDF, DOCX, XLSX, and HTML. One
Document API. mypy --strict clean. No LLM in the loop.
Python bindings for Olga — a Rust document-processing engine. Built on PyO3 and maturin; one abi3 wheel covers CPython 3.8+.
Install
pip install olgadoc
Ten-second tour
import olgadoc

doc = olgadoc.Document.open("report.pdf")
print(doc.format, doc.page_count)  # e.g. PDF 12

# Will this document produce text, or does it need OCR first?
report = doc.processability()
if report.is_blocked():
    raise SystemExit([b["kind"] for b in report.blockers])

# Full-text search
for hit in doc.search("quarterly revenue"):
    print(hit["page"], hit["snippet"])

# Structured JSON tree, discriminated on ``type``
for element in doc.to_json()["elements"]:
    if element["type"] == "heading":
        print(f"h{element['level']}: {element['text']}")
Why olgadoc
- Four formats, one API. PDF, DOCX, XLSX, and HTML all expose the same Document/Page surface. Stop juggling pdfplumber + python-docx + openpyxl + BeautifulSoup.
- Native speed. PDF 4–8 ms · DOCX 2 ms · XLSX 1–12 ms · HTML 1–5 ms. 15–40× faster than the quality-equivalent tool on every format (benchmarks). A post-release independent reproducible audit on a 50-file mixed corpus finds olgadoc 1.62× faster and 2.62× richer in extracted content than a hand-routed best-of-breed pipeline (report).
- Spatial fidelity, intact. Tables stay tables. Columns stay columns. Figure captions stay next to their figures. Layout carries meaning, and Olga preserves it across the round-trip to Markdown or to the typed JSON tree.
- OCR pre-flight. doc.processability() tells you, before the pipeline starts, whether a document actually carries native text or whether it's a scanned image that needs OCR first. Fail fast, save money.
- Actually typed. Zero Any on the public surface. Every returned dict is a real TypedDict, Document.to_json() returns a discriminated union over 16 element variants, and mypy --strict narrows each branch.
- No LLM in the loop. Reads the native content stream directly. Validated with an anti-LLM adversarial test: invisible canaries preserved byte-exact, deliberate typos intact, no hallucinations.
Typed surface, no Any
Every returned dict is a runtime TypedDict — introspectable at
runtime and narrowed at type-check time.
from olgadoc import SearchHit

def show(hit: SearchHit) -> None:
    print(hit["page"], hit["snippet"])  # ok
    print(hit["nope"])  # mypy: "SearchHit" has no key "nope"
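To illustrate what "introspectable at runtime" can look like, here is a minimal sketch that uses a hand-written stand-in rather than olgadoc's own class. `SearchHitSketch` and its field types (`int`, `str`) are assumptions for illustration; only the key names `page` and `snippet` come from the example above.

```python
from typing import TypedDict, get_type_hints


class SearchHitSketch(TypedDict):
    """Hypothetical stand-in for olgadoc.SearchHit; field types are assumed."""

    page: int
    snippet: str


def describe(td: type) -> dict:
    """Return {key: type name} for a TypedDict class, resolved at runtime."""
    return {key: tp.__name__ for key, tp in get_type_hints(td).items()}


# A TypedDict instance is an ordinary dict at runtime.
hit: SearchHitSketch = {"page": 3, "snippet": "quarterly revenue rose"}
fields = describe(SearchHitSketch)
print(fields)
```

The same `get_type_hints` call should work on any class built with `typing.TypedDict`, which is what makes such dicts usable for runtime validation or schema generation as well as static checking.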
Document.to_json() returns a DocumentJson tree
whose elements are a discriminated JsonElement
union over 16 variants (heading, paragraph, table, list,
image, code_block, …). Mypy narrows each branch to exactly one.
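The narrowing described above can be sketched with two hand-written variants. `HeadingSketch`, `ParagraphSketch`, and `ElementSketch` are hypothetical stand-ins, not olgadoc's real `JsonElement` definitions; only the `type`, `level`, and `text` keys of a heading are taken from the tour, and the paragraph shape is an assumption.

```python
from typing import List, Literal, TypedDict, Union


class HeadingSketch(TypedDict):
    type: Literal["heading"]
    level: int
    text: str


class ParagraphSketch(TypedDict):
    type: Literal["paragraph"]  # assumed shape, for illustration only
    text: str


# A discriminated union: the literal value of "type" selects the variant.
ElementSketch = Union[HeadingSketch, ParagraphSketch]


def outline(elements: List[ElementSketch]) -> List[str]:
    """Collect headings as Markdown-style lines, skipping other variants."""
    lines: List[str] = []
    for element in elements:
        if element["type"] == "heading":
            # A type checker narrows `element` to HeadingSketch here,
            # so reading element["level"] is accepted.
            lines.append("#" * element["level"] + " " + element["text"])
    return lines


sample: List[ElementSketch] = [
    {"type": "heading", "level": 1, "text": "Report"},
    {"type": "paragraph", "text": "Revenue grew."},
    {"type": "heading", "level": 2, "text": "Revenue"},
]
print(outline(sample))
```

Outside the `if` branch, accessing `element["level"]` would be flagged by a strict type checker, because `ParagraphSketch` has no such key.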
vs alternatives
| | olgadoc | pdfplumber | unstructured | docling |
|---|---|---|---|---|
| PDF | ✅ | ✅ | ✅ | ✅ |
| DOCX | ✅ | — | ✅ | ✅ |
| XLSX | ✅ | — | partial | partial |
| HTML | ✅ | — | ✅ | partial |
| mypy --strict clean (no Any) | ✅ | — | — | — |
| OCR pre-flight | ✅ | — | — | — |
| Provenance per element | ✅ | — | — | — |
| No ML model / no GPU required | ✅ | ✅ | optional | optional |
What you get
- Four formats, one API: PDF, DOCX, XLSX, HTML through Document.
- Processability report: Document.processability() → blockers (including EmptyContent for scanned PDFs) and degradations.
- Cross-page tables: anchored on the first page with is_cross_page.
- Hyperlinks, images, outline, RAG chunks, case-insensitive search.
- Structured JSON tree: Document.to_json(), discriminated union over 16 element variants.
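One way to act on a processability report is a small triage step. The sketch below works on plain dicts so it runs without the library; the function name `triage`, the three route labels, and the assumption that a scanned PDF shows up as a blocker whose `kind` is the string `"EmptyContent"` are all illustrative, not confirmed olgadoc behaviour.

```python
from typing import Dict, List


def triage(blockers: List[Dict[str, str]]) -> str:
    """Pick a route from a list of blocker dicts carrying a "kind" key.

    Returns "extract" when nothing blocks, "ocr" when the only blocker
    is the (assumed) "EmptyContent" kind, and "reject" otherwise.
    """
    kinds = {blocker["kind"] for blocker in blockers}
    if not kinds:
        return "extract"
    if kinds == {"EmptyContent"}:
        return "ocr"
    return "reject"


print(triage([]))
print(triage([{"kind": "EmptyContent"}]))
```

With the real library, the list passed in would presumably be `doc.processability().blockers`, as used in the ten-second tour.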
Examples
Five runnable scripts live in examples/:
- quickstart.py: open a document, print a per-page preview.
- extract_tables.py: pull every reconstructed table as TSV.
- batch_processability.py: recursively health-check a directory.
- search_and_extract.py: search + print surrounding page text.
- json_walk.py: walk the typed JSON tree and narrow by type.
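A recursive health check along the lines of batch_processability.py might be structured as below. This is not the shipped script: `iter_documents`, `health_check`, and the suffix set are hypothetical, and the per-file check is passed in as a callable so the sketch runs without olgadoc installed.

```python
from pathlib import Path
from typing import Callable, Dict, Iterator

# Assumed mapping from the four documented formats to file suffixes.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".xlsx", ".html"}


def iter_documents(root: Path) -> Iterator[Path]:
    """Yield supported files under root, recursively, in sorted order."""
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES:
            yield path


def health_check(root: Path, is_blocked: Callable[[Path], bool]) -> Dict[str, bool]:
    """Map each document's path (relative to root) to its blocked status."""
    return {
        path.relative_to(root).as_posix(): is_blocked(path)
        for path in iter_documents(root)
    }
```

With the library installed, the callable could be something like `lambda p: olgadoc.Document.open(str(p)).processability().is_blocked()`, built from the calls shown in the ten-second tour.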
Building from source
pip install maturin
cd olgadoc
maturin develop --release
pytest tests/ -q
Links
- Source & docs — github.com/Hugues-DTANKOUO/olga
- Benchmarks — BENCHMARKS.md
- Independent v0.1.0 audit — olga_v0.1.0_benchmark/
- API reference — hugues-dtankouo.github.io/olga
License
Download files
File details
Details for the file olgadoc-0.1.1.tar.gz.
File metadata
- Download URL: olgadoc-0.1.1.tar.gz
- Upload date:
- Size: 6.0 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b9362abf65237953058642487b6c60d49a89000fb0efc386158563da7e275a24 |
| MD5 | b2ef51d4ef0880d58167c142c2faef7f |
| BLAKE2b-256 | c783c106ef6eb097ea591702d231dec7d98a362567b8fddb09a44fbd05fea37e |
Provenance
The following attestation bundles were made for olgadoc-0.1.1.tar.gz:
Publisher: release.yml on Hugues-DTANKOUO/olga
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: olgadoc-0.1.1.tar.gz
  - Subject digest: b9362abf65237953058642487b6c60d49a89000fb0efc386158563da7e275a24
- Sigstore transparency entry: 1350166736
- Sigstore integration time:
- Permalink: Hugues-DTANKOUO/olga@e2ca73add9f70b435297c9a4880e401de86f2da9
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/Hugues-DTANKOUO
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@e2ca73add9f70b435297c9a4880e401de86f2da9
- Trigger Event: push
File details
Details for the file olgadoc-0.1.1-cp38-abi3-win_amd64.whl.
File metadata
- Download URL: olgadoc-0.1.1-cp38-abi3-win_amd64.whl
- Upload date:
- Size: 5.2 MB
- Tags: CPython 3.8+, Windows x86-64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 357f00a0e1ec9bcab37c217115e9d21cc38a11f10508f938c622f622abd186ba |
| MD5 | 78e5ff20beae280da4e3c2f175d9b83e |
| BLAKE2b-256 | 5737aa8328263f84b54a1fee4bdff915108001f8490f2a7a0ae5735d5f3eecd4 |
Provenance
The following attestation bundles were made for olgadoc-0.1.1-cp38-abi3-win_amd64.whl:
Publisher: release.yml on Hugues-DTANKOUO/olga
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: olgadoc-0.1.1-cp38-abi3-win_amd64.whl
  - Subject digest: 357f00a0e1ec9bcab37c217115e9d21cc38a11f10508f938c622f622abd186ba
- Sigstore transparency entry: 1350166818
- Sigstore integration time:
- Permalink: Hugues-DTANKOUO/olga@e2ca73add9f70b435297c9a4880e401de86f2da9
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/Hugues-DTANKOUO
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@e2ca73add9f70b435297c9a4880e401de86f2da9
- Trigger Event: push
File details
Details for the file olgadoc-0.1.1-cp38-abi3-musllinux_1_2_x86_64.whl.
File metadata
- Download URL: olgadoc-0.1.1-cp38-abi3-musllinux_1_2_x86_64.whl
- Upload date:
- Size: 6.1 MB
- Tags: CPython 3.8+, musllinux: musl 1.2+ x86-64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 91bb96f5144ce5c93b4a9475b60e478acc0716429ac836bb966bde20c37594f5 |
| MD5 | 7bab2359196a343c58f4f940f3c5d13c |
| BLAKE2b-256 | 9b167c9fd60b7db037b172edda1d7b1b76d1c73f7b474b72b85dcef18abd4d30 |
Provenance
The following attestation bundles were made for olgadoc-0.1.1-cp38-abi3-musllinux_1_2_x86_64.whl:
Publisher: release.yml on Hugues-DTANKOUO/olga
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: olgadoc-0.1.1-cp38-abi3-musllinux_1_2_x86_64.whl
  - Subject digest: 91bb96f5144ce5c93b4a9475b60e478acc0716429ac836bb966bde20c37594f5
- Sigstore transparency entry: 1350167149
- Sigstore integration time:
- Permalink: Hugues-DTANKOUO/olga@e2ca73add9f70b435297c9a4880e401de86f2da9
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/Hugues-DTANKOUO
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@e2ca73add9f70b435297c9a4880e401de86f2da9
- Trigger Event: push
File details
Details for the file olgadoc-0.1.1-cp38-abi3-musllinux_1_2_aarch64.whl.
File metadata
- Download URL: olgadoc-0.1.1-cp38-abi3-musllinux_1_2_aarch64.whl
- Upload date:
- Size: 5.9 MB
- Tags: CPython 3.8+, musllinux: musl 1.2+ ARM64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d7266437fb9bd73fb065de30a3a629d76b900f11c5e8b1ba2328f95a3ac5447b |
| MD5 | 330163a203cb6f8a5cee60577a8050d1 |
| BLAKE2b-256 | 5adb03ad88426f674264d765e4f5b510221a63d0af496ff1bcde6b8a7f3b74a3 |
Provenance
The following attestation bundles were made for olgadoc-0.1.1-cp38-abi3-musllinux_1_2_aarch64.whl:
Publisher: release.yml on Hugues-DTANKOUO/olga
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: olgadoc-0.1.1-cp38-abi3-musllinux_1_2_aarch64.whl
  - Subject digest: d7266437fb9bd73fb065de30a3a629d76b900f11c5e8b1ba2328f95a3ac5447b
- Sigstore transparency entry: 1350167462
- Sigstore integration time:
- Permalink: Hugues-DTANKOUO/olga@e2ca73add9f70b435297c9a4880e401de86f2da9
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/Hugues-DTANKOUO
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@e2ca73add9f70b435297c9a4880e401de86f2da9
- Trigger Event: push
File details
Details for the file olgadoc-0.1.1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: olgadoc-0.1.1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 5.4 MB
- Tags: CPython 3.8+, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | efc71d8410b570dcc8f461df4b4833d8920a21a00dda8f5ad2476b8a79e527a0 |
| MD5 | 99b73e1f23c25d9d08f545933d96e71c |
| BLAKE2b-256 | 75e0eaeabb0a910bfe3b8ef9bf450f10050482cd6a2c54aa63f7231e282978d3 |
Provenance
The following attestation bundles were made for olgadoc-0.1.1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:
Publisher: release.yml on Hugues-DTANKOUO/olga
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: olgadoc-0.1.1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  - Subject digest: efc71d8410b570dcc8f461df4b4833d8920a21a00dda8f5ad2476b8a79e527a0
- Sigstore transparency entry: 1350167358
- Sigstore integration time:
- Permalink: Hugues-DTANKOUO/olga@e2ca73add9f70b435297c9a4880e401de86f2da9
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/Hugues-DTANKOUO
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@e2ca73add9f70b435297c9a4880e401de86f2da9
- Trigger Event: push
File details
Details for the file olgadoc-0.1.1-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.
File metadata
- Download URL: olgadoc-0.1.1-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
- Upload date:
- Size: 5.1 MB
- Tags: CPython 3.8+, manylinux: glibc 2.17+ ARM64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7d93df305b4e6b88bf2851a24f7cdd07bb46c828e7d8191b72d529f1715f661b |
| MD5 | aa1c1caa991ca5982cd5a9ba12bd79ca |
| BLAKE2b-256 | 8ccfff4e34f4aeaa7ad1e9d1d5d1be8ccd2a2bf90a267564e4dc677e2dd6d084 |
Provenance
The following attestation bundles were made for olgadoc-0.1.1-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:
Publisher: release.yml on Hugues-DTANKOUO/olga
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: olgadoc-0.1.1-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
  - Subject digest: 7d93df305b4e6b88bf2851a24f7cdd07bb46c828e7d8191b72d529f1715f661b
- Sigstore transparency entry: 1350167238
- Sigstore integration time:
- Permalink: Hugues-DTANKOUO/olga@e2ca73add9f70b435297c9a4880e401de86f2da9
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/Hugues-DTANKOUO
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@e2ca73add9f70b435297c9a4880e401de86f2da9
- Trigger Event: push
File details
Details for the file olgadoc-0.1.1-cp38-abi3-macosx_11_0_arm64.whl.
File metadata
- Download URL: olgadoc-0.1.1-cp38-abi3-macosx_11_0_arm64.whl
- Upload date:
- Size: 7.5 MB
- Tags: CPython 3.8+, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 741fe2967e94c961ef2e6c4286ee3fe56803a62728759c694a79af7e2895da14 |
| MD5 | 545eed2b7cce3a557c0ea97ec82b1e7b |
| BLAKE2b-256 | c7bc9373ee1d8a623f1c0c211487d37b432d17c7d6e7512d73ae78a6524c52c2 |
Provenance
The following attestation bundles were made for olgadoc-0.1.1-cp38-abi3-macosx_11_0_arm64.whl:
Publisher: release.yml on Hugues-DTANKOUO/olga
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: olgadoc-0.1.1-cp38-abi3-macosx_11_0_arm64.whl
  - Subject digest: 741fe2967e94c961ef2e6c4286ee3fe56803a62728759c694a79af7e2895da14
- Sigstore transparency entry: 1350166929
- Sigstore integration time:
- Permalink: Hugues-DTANKOUO/olga@e2ca73add9f70b435297c9a4880e401de86f2da9
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/Hugues-DTANKOUO
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@e2ca73add9f70b435297c9a4880e401de86f2da9
- Trigger Event: push
File details
Details for the file olgadoc-0.1.1-cp38-abi3-macosx_10_12_x86_64.whl.
File metadata
- Download URL: olgadoc-0.1.1-cp38-abi3-macosx_10_12_x86_64.whl
- Upload date:
- Size: 7.8 MB
- Tags: CPython 3.8+, macOS 10.12+ x86-64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 32cf9ab8801784333834ff4aabc7bb29902c618cc6ce47cc136999c1cfb7655a |
| MD5 | 8e302c82afda9318021baf8ffc28f966 |
| BLAKE2b-256 | 0f2ad80a835a78d0b7bfe7284eb98102e57f61ffd9fe5463b9f017db2554163f |
Provenance
The following attestation bundles were made for olgadoc-0.1.1-cp38-abi3-macosx_10_12_x86_64.whl:
Publisher: release.yml on Hugues-DTANKOUO/olga
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: olgadoc-0.1.1-cp38-abi3-macosx_10_12_x86_64.whl
  - Subject digest: 32cf9ab8801784333834ff4aabc7bb29902c618cc6ce47cc136999c1cfb7655a
- Sigstore transparency entry: 1350167028
- Sigstore integration time:
- Permalink: Hugues-DTANKOUO/olga@e2ca73add9f70b435297c9a4880e401de86f2da9
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/Hugues-DTANKOUO
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@e2ca73add9f70b435297c9a4880e401de86f2da9
- Trigger Event: push