Open-source document intelligence for extraction, structure preservation, XML export, and RAG indexing.

Maintainers

meetjethwa3

Project description

AXIOMDoc

AXIOM stands for Any-document eXtraction, Indexing, and Ontology Mapping.

AXIOMDoc is an open-source Python library for document intelligence in RAG pipelines. It is being built to ingest heterogeneous documents, preserve structure, export canonical XML and Markdown, and generate retrieval-ready indexing artifacts with provenance.

What AXIOMDoc is for

Converting PDFs, XML, DOCX, DOC, XLSX, HTML, and related formats into one canonical document model.
Preserving headings, reading order, page anchors, metadata, and layout evidence.
Falling back to OCR for image-only PDFs when text extraction is unavailable.
Exporting clean XML and Markdown representations for downstream processing.
Building chunk, section, and field-level artifacts for retrieval and context mapping.
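The OCR fallback listed above can be pictured as a small decision step: try the embedded text layer first, and run OCR only when it yields nothing usable. The following is a minimal sketch under that assumption. The function name, the `min_chars` threshold, and the stand-in callables are hypothetical and are not AXIOMDoc's API.

```python
from typing import Callable


def text_with_ocr_fallback(
    extract_text: Callable[[], str],
    run_ocr: Callable[[], str],
    min_chars: int = 1,
) -> tuple[str, str]:
    """Return (text, source), where source is "text-layer" or "ocr".

    Hypothetical helper: OCR is attempted only when the embedded text layer
    yields fewer than `min_chars` non-whitespace characters.
    """
    text = extract_text()
    if len("".join(text.split())) >= min_chars:
        return text, "text-layer"
    return run_ocr(), "ocr"


# Stand-in callables; a real pipeline would call a PDF parser and an OCR engine.
digital = text_with_ocr_fallback(lambda: "Invoice 42", lambda: "unused")
scanned = text_with_ocr_fallback(lambda: "  \n", lambda: "Scanned invoice 42")
print(digital)
print(scanned)
```

Returning the source label alongside the text keeps the provenance of each page visible to later stages.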

Core requirements

Any-document ingestion across common enterprise and knowledge-document formats.
Structure fidelity so headings are not missed and body text is not promoted into headings.
Canonical export into XML, Markdown, JSON, and retrieval artifacts from one internal schema.
RAG-first indexing with chunk provenance, section paths, and page references.
XML-safe serialization that strips characters invalid under XML 1.0.
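To make "chunk provenance, section paths, and page references" concrete, here is a minimal sketch of a retrieval record that carries all three. The `Chunk` fields, the block dictionary layout, and `build_chunks` are illustrative assumptions, not the schema AXIOMDoc actually emits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical retrieval record; field names are illustrative."""
    chunk_id: str
    text: str
    section_path: tuple[str, ...]  # e.g. ("Terms", "Payment")
    page: int                      # 1-based page the text came from
    source: str                    # path or URI of the original document


def build_chunks(source: str, blocks: list[dict]) -> list[Chunk]:
    """Turn ordered blocks into chunks that remember their section path.

    Each block is a dict with "kind" ("heading" or "paragraph"), "text",
    "page", and, for headings, "level" (1 = top level).
    """
    path: list[str] = []
    chunks: list[Chunk] = []
    for block in blocks:
        if block["kind"] == "heading":
            # Drop headings at the same or deeper level, then descend.
            del path[block["level"] - 1:]
            path.append(block["text"])
        else:
            chunks.append(Chunk(
                chunk_id=f"{source}#c{len(chunks)}",
                text=block["text"],
                section_path=tuple(path),
                page=block["page"],
                source=source,
            ))
    return chunks


blocks = [
    {"kind": "heading", "text": "Terms", "level": 1, "page": 1},
    {"kind": "heading", "text": "Payment", "level": 2, "page": 1},
    {"kind": "paragraph", "text": "Invoices are due in 30 days.", "page": 1},
    {"kind": "heading", "text": "Appendix", "level": 1, "page": 2},
    {"kind": "paragraph", "text": "Definitions follow.", "page": 2},
]
chunks = build_chunks("contract.pdf", blocks)
for c in chunks:
    print(c.chunk_id, " > ".join(c.section_path), f"p.{c.page}")
```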

Architecture

AXIOMDoc follows a canonical-document-model approach:

Parser backends normalize source files into one schema.
Exporters transform that schema into XML, Markdown, and other artifacts.
Index builders create retrieval-ready records with explicit provenance.
Enrichment passes can later add headings, entities, forms, tables, and citation anchors.

This keeps parsing separate from retrieval and avoids binding the project to one vendor model or one OCR stack.
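The separation described above can be sketched in a few lines: parser backends register per file type and all return the same model, while exporters read only that model. Everything here (`Block`, `Document`, `PARSERS`, `register`, the toy `.txt` parser) is a hypothetical illustration of the pattern, not AXIOMDoc's internal code.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable
import xml.etree.ElementTree as ET


@dataclass
class Block:
    kind: str          # "heading" or "paragraph"
    text: str
    level: int = 0     # heading depth; 0 for body text


@dataclass
class Document:
    source: str
    blocks: list[Block] = field(default_factory=list)


# Parser backends: one callable per file extension, all returning Document.
PARSERS: dict[str, Callable[[str, str], Document]] = {}


def register(ext: str):
    def wrap(fn):
        PARSERS[ext] = fn
        return fn
    return wrap


@register(".txt")
def parse_text(source: str, raw: str) -> Document:
    # Toy rule for the sketch: lines starting with "# " are headings.
    doc = Document(source)
    for line in filter(None, (ln.strip() for ln in raw.splitlines())):
        if line.startswith("# "):
            doc.blocks.append(Block("heading", line[2:], level=1))
        else:
            doc.blocks.append(Block("paragraph", line))
    return doc


def parse(source: str, raw: str) -> Document:
    return PARSERS[Path(source).suffix.lower()](source, raw)


# Exporters only see the canonical model, never the source format.
def to_markdown(doc: Document) -> str:
    return "\n\n".join(
        f"{'#' * b.level} {b.text}" if b.kind == "heading" else b.text
        for b in doc.blocks
    )


def to_xml(doc: Document) -> str:
    root = ET.Element("document", source=doc.source)
    for b in doc.blocks:
        el = ET.SubElement(root, b.kind)
        if b.kind == "heading":
            el.set("level", str(b.level))
        el.text = b.text
    return ET.tostring(root, encoding="unicode")


doc = parse("notes.txt", "# Intro\nHello world.\n")
print(to_markdown(doc))
print(to_xml(doc))
```

Adding a new source format in this sketch means registering one more parser; the exporters do not change.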

Install

python3 -m pip install -e .

Full parser and test dependencies:

python3 -m pip install -e ".[full,dev]"

Example

axiomdoc parse ./sample.pdf --xml-out ./sample.xml --markdown-out ./sample.md --index-out ./sample.index.json
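For batch jobs it can be convenient to drive the same command from Python. The sketch below only assembles the flags shown in the command above and calls the CLI through `subprocess`. The helper name `build_parse_command` and the `out` directory are assumptions, and no Python-level AXIOMDoc API is implied.

```python
import shutil
import subprocess
from pathlib import Path


def build_parse_command(pdf: Path, out_dir: Path) -> list[str]:
    """Build the argv for the `axiomdoc parse` call shown above."""
    stem = pdf.stem
    return [
        "axiomdoc", "parse", str(pdf),
        "--xml-out", str(out_dir / f"{stem}.xml"),
        "--markdown-out", str(out_dir / f"{stem}.md"),
        "--index-out", str(out_dir / f"{stem}.index.json"),
    ]


cmd = build_parse_command(Path("sample.pdf"), Path("out"))
print(" ".join(cmd))

# Only run when the CLI is installed and the input file exists.
if shutil.which("axiomdoc") and Path("sample.pdf").exists():
    Path("out").mkdir(exist_ok=True)
    subprocess.run(cmd, check=True)
```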

Evaluation

AXIOMDoc evaluation plan

We are moving the evaluation stack to a manifest-driven, multi-format benchmark so we can compare AXIOMDoc on at least 1000 documents across PDF, DOCX, XLSX, XML, HTML, and text. The core pieces for that pipeline now live in:

Target evaluation size:

1000 total documents
500 PDF
200 DOCX
100 XLSX
100 XML
50 HTML
50 TXT

Target comparison set:

AXIOMDoc
Docling
PyMuPDF raw extraction baseline
pdfplumber
raw text baseline for simple structured files

We now have a completed large-corpus PDF benchmark on 1076 real PDFs for AXIOMDoc, PyMuPDF raw, and pdfplumber. The Docling run is still in progress on this corpus because its runtime on the same dataset is on the order of hours.

1076-PDF corpus benchmark

This is still an operational benchmark, not a full scientific benchmark with human labels, so the metrics are limited to things we can measure honestly and reproduce today.

Local PDF corpus used in this run:

1076 PDFs from the local document store
13,594 total pages
median PDF length: 2 pages
max PDF length: 1178 pages

Measured results from the current run:

Library	Success Rate	Median Sec/Page	XML Well-Formed Rate	Median Heading Count	Median Markdown Chars	Median Chunk Count
AXIOMDoc	0.9991	0.01514	1.0000	5	5009	17
PyMuPDF raw	0.9991	0.00275	0.9600	0	4369	0
pdfplumber	0.9926	0.07410	0.9972	0	4316.5	0
Docling	pending	pending	pending	pending	pending	pending

Interpretation:

AXIOMDoc is about 5.5x slower than raw PyMuPDF per page (0.01514 s vs 0.00275 s median), which is expected because it performs structure recovery and builds XML, Markdown, and chunk manifests.
AXIOMDoc is about 4.9x faster than pdfplumber per page on this corpus (0.01514 s vs 0.07410 s median) while also emitting RAG-ready chunks.
Among the completed large-corpus runs, only AXIOMDoc currently produces a non-empty chunk manifest.
PyMuPDF raw had the fastest median page time, but a lower XML well-formed rate because the wrapper path surfaced malformed outputs on some documents.
pdfplumber had the lowest success rate among the completed large-corpus runs.
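For readers who want to reproduce this kind of table on their own corpus, the following is a minimal sketch of how three of the columns could be aggregated. The per-document record layout is hypothetical, and computing the well-formed rate over successful documents only is an assumption. The actual logic lives in benchmarks/run_benchmarks.py and may differ.

```python
import statistics
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """True when the text parses as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def summarize(records: list[dict]) -> dict:
    """Aggregate hypothetical per-document records.

    Each record has "ok" (bool), "seconds" (float), "pages" (int) and
    "xml" (str, empty when the run failed).
    """
    succeeded = [r for r in records if r["ok"]]
    return {
        "success_rate": len(succeeded) / len(records),
        "median_sec_per_page": statistics.median(
            r["seconds"] / r["pages"] for r in succeeded
        ),
        # Assumption: rate is taken over successful documents only.
        "xml_well_formed_rate": sum(
            is_well_formed(r["xml"]) for r in succeeded
        ) / len(succeeded),
    }


records = [
    {"ok": True, "seconds": 0.03, "pages": 2, "xml": "<doc><p>a</p></doc>"},
    {"ok": True, "seconds": 0.10, "pages": 4, "xml": "<doc><p>b</doc>"},
    {"ok": False, "seconds": 0.0, "pages": 1, "xml": ""},
    {"ok": True, "seconds": 0.05, "pages": 1, "xml": "<doc/>"},
]
summary = summarize(records)
print(summary)
```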

Benchmark files:

Large-corpus benchmark commands:

.venv/bin/python benchmarks/run_benchmarks.py --manifest benchmarks/manifests/pdf-only-1076.json --libraries axiomdoc --output benchmarks/results/pdf-only-1076-axiomdoc.json
.venv/bin/python benchmarks/run_benchmarks.py --manifest benchmarks/manifests/pdf-only-1076.json --libraries pymupdf_raw --output benchmarks/results/pdf-only-1076-pymupdf.json
.venv/bin/python benchmarks/run_benchmarks.py --manifest benchmarks/manifests/pdf-only-1076.json --libraries pdfplumber --output benchmarks/results/pdf-only-1076-pdfplumber.json

Limits of this benchmark:

This run is PDF-only. The 1000-document multi-format plan exists, but only the PDF track is complete so far.
Heading recovery here is measured as Markdown heading count, not labeled precision/recall.
Markdown character count is a yield proxy, not a semantic quality score.
The Docling large-corpus baseline is still pending because of runtime cost on this machine.

Labeled fixture evaluation is now available in benchmarks/labeled_eval.py and exercised in tests/test_hardening.py. That scorer currently measures expected heading recovery and table recovery against explicit JSON labels.
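As an illustration of what scoring "expected heading recovery" against JSON labels can look like, here is a minimal recall computation. The label layout, the normalization rule, and the function names are assumptions made for this sketch. See benchmarks/labeled_eval.py for the scorer the project actually uses.

```python
import json


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def recall(expected: list[str], predicted: list[str]) -> float:
    """Fraction of expected items that appear among the predictions."""
    if not expected:
        return 1.0
    found = {normalize(p) for p in predicted}
    return sum(normalize(e) in found for e in expected) / len(expected)


# Hypothetical label file content for one fixture.
labels = json.loads('{"headings": ["Introduction", "Methods", "Results"]}')
predicted_headings = ["introduction", "Methods ", "Discussion"]

score = recall(labels["headings"], predicted_headings)
print(f"heading recall: {score:.3f}")
```

The same function can be pointed at table identifiers to score table recovery.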

XML safety

XML does not allow certain control and surrogate characters. AXIOMDoc now sanitizes invalid XML 1.0 characters before serialization in src/axiomdoc/exporters/xml.py, so malformed text content does not break XML generation.
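A minimal sketch of this kind of sanitization, based on the `Char` production in the XML 1.0 specification, is shown below. It is an illustration of the technique, not a copy of src/axiomdoc/exporters/xml.py, and the function name is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Complement of the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Remove characters that XML 1.0 does not allow in a document."""
    return _INVALID_XML_CHARS.sub("", text)


# NUL, vertical tab, and a lone surrogate are invalid; the tab is allowed.
dirty = "Total:\x00 42\x0b units\ud800\tok"
clean = strip_invalid_xml_chars(dirty)

el = ET.Element("p")
el.text = clean
ET.fromstring(ET.tostring(el, encoding="unicode"))  # round-trips without error
print(repr(clean))
```

Stripping is the simplest policy. Replacing invalid characters with U+FFFD instead would keep character offsets stable, at the cost of leaving visible markers in the text.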

Release readiness

The repo now includes:

production PDF, DOCX, XML, and XLSX parsers
OCR fallback for image-only PDFs through the local tesseract binary
structured table preservation in XML, Markdown, and chunk manifests
pytest coverage for exporters, parser resolution, and PDF smoke behavior
labeled evaluation fixtures for heading and table recovery
a GitHub Actions test workflow at .github/workflows/tests.yml
an MIT LICENSE

Status

The project is in late release-prep. PDF, DOCX, XLSX, and XML baseline parsing are implemented, OCR fallback exists for image-only PDFs, and labeled evaluation now covers heading/table recovery on fixtures. The remaining gaps before a strict 1.0.0 are broader labeled datasets, richer scanned-document accuracy validation, and more advanced form/table semantics. The roadmap remains in docs/architecture.md.

Release history

1.0.3

Mar 13, 2026

This version

1.0.2

Mar 13, 2026

1.0.1

Mar 12, 2026

1.0.0

Mar 12, 2026

Download files

Source Distribution

axiomdoc-1.0.2.tar.gz (19.0 kB)

Uploaded Mar 13, 2026 Source

Built Distribution

axiomdoc-1.0.2-py3-none-any.whl (18.4 kB)

Uploaded Mar 13, 2026 Python 3

File details

Details for the file axiomdoc-1.0.2.tar.gz.

File metadata

Download URL: axiomdoc-1.0.2.tar.gz
Upload date: Mar 13, 2026
Size: 19.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for axiomdoc-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`0f436c2cc93048f303f00a0b3a2e075c12b51cf001a819c40e70e676768fa13a`
MD5	`a88b364fde4541cf3d374b11b8ce2834`
BLAKE2b-256	`63234a666569840c03d4105cadb0d7ef3cd1424d83c19d505f25c076b40118fe`
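A downloaded file can be checked against the SHA256 digest above with the standard library. This is a minimal sketch; it assumes the sdist has been saved in the current working directory under its published file name.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for axiomdoc-1.0.2.tar.gz.
EXPECTED_SHA256 = "0f436c2cc93048f303f00a0b3a2e075c12b51cf001a819c40e70e676768fa13a"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


sdist = Path("axiomdoc-1.0.2.tar.gz")  # assumed to be in the working directory
if sdist.exists():
    actual = sha256_of(sdist)
    print("match" if actual == EXPECTED_SHA256 else f"MISMATCH: {actual}")
else:
    print(f"{sdist} not found; download it first")
```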

Provenance

The following attestation bundles were made for axiomdoc-1.0.2.tar.gz:

Publisher: publish.yml on Meet2147/axiomdoc

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: axiomdoc-1.0.2.tar.gz
- Subject digest: 0f436c2cc93048f303f00a0b3a2e075c12b51cf001a819c40e70e676768fa13a
- Sigstore transparency entry: 1096400856
- Sigstore integration time: Mar 13, 2026
Source repository:
- Permalink: Meet2147/axiomdoc@6c82a7ac44af79045a407ea72d93a70fc02425fa
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Meet2147
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6c82a7ac44af79045a407ea72d93a70fc02425fa
- Trigger Event: workflow_dispatch

File details

Details for the file axiomdoc-1.0.2-py3-none-any.whl.

File metadata

Download URL: axiomdoc-1.0.2-py3-none-any.whl
Upload date: Mar 13, 2026
Size: 18.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for axiomdoc-1.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`77a5236b9fce1b151d95c42c23b2e8b834658d354c04a74804b81ac3520e6b9e`
MD5	`af31237d7f1f3dc39db0bc23c2ed37a7`
BLAKE2b-256	`2f3e8b5d62e841d0e77c51d0f77c5d6134bdd9fb9604f9fc544cdba7a08bebf9`

Provenance

The following attestation bundles were made for axiomdoc-1.0.2-py3-none-any.whl:

Publisher: publish.yml on Meet2147/axiomdoc

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: axiomdoc-1.0.2-py3-none-any.whl
- Subject digest: 77a5236b9fce1b151d95c42c23b2e8b834658d354c04a74804b81ac3520e6b9e
- Sigstore transparency entry: 1096400868
- Sigstore integration time: Mar 13, 2026
Source repository:
- Permalink: Meet2147/axiomdoc@6c82a7ac44af79045a407ea72d93a70fc02425fa
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Meet2147
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6c82a7ac44af79045a407ea72d93a70fc02425fa
- Trigger Event: workflow_dispatch
