Open-source document intelligence for extraction, structure preservation, XML export, and RAG indexing.

AXIOMDoc

AXIOM stands for Any-document eXtraction, Indexing, and Ontology Mapping.

AXIOMDoc is an open-source Python library for document intelligence in RAG pipelines. It is being built to ingest heterogeneous documents, preserve structure, export canonical XML and Markdown, and generate retrieval-ready indexing artifacts with provenance.

What AXIOMDoc is for

  • Converting PDFs, XML, DOCX, DOC, XLSX, HTML, and related formats into one canonical document model.
  • Preserving headings, reading order, page anchors, metadata, and layout evidence.
  • Falling back to OCR for image-only PDFs when text extraction is unavailable.
  • Exporting clean XML and Markdown representations for downstream processing.
  • Building chunk, section, and field-level artifacts for retrieval and context mapping.
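
The OCR fallback mentioned in the list above can be pictured as a per-page decision. The sketch below is not AXIOMDoc's implementation: `extract_text` and `run_ocr` are hypothetical callables standing in for a text-layer extractor and an OCR engine, and the `min_chars` threshold is an assumption made for illustration.

```python
from typing import Callable, List, Tuple


def pages_with_ocr_fallback(
    page_count: int,
    extract_text: Callable[[int], str],  # hypothetical text-layer extractor
    run_ocr: Callable[[int], str],       # hypothetical OCR engine wrapper
    min_chars: int = 1,                  # assumed threshold for "has a text layer"
) -> List[Tuple[int, str, str]]:
    """Return (page_number, text, source) per page; source is 'text' or 'ocr'."""
    results = []
    for page_no in range(1, page_count + 1):
        text = extract_text(page_no).strip()
        if len(text) >= min_chars:
            results.append((page_no, text, "text"))
        else:
            # No usable text layer on this page, so fall back to OCR.
            results.append((page_no, run_ocr(page_no).strip(), "ocr"))
    return results


# Toy stand-ins: page 2 has only whitespace in its text layer.
text_layer = {1: "Introduction", 2: "   "}
ocr_layer = {1: "unused", 2: "Scanned appendix"}
pages = pages_with_ocr_fallback(2, text_layer.__getitem__, ocr_layer.__getitem__)
```

Recording the source of each page's text alongside the text itself keeps OCR-derived content distinguishable later, which matters when OCR accuracy is still being validated.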

Core requirements

  1. Any-document ingestion across common enterprise and knowledge-document formats.
  2. Structure fidelity so headings are not missed and body text is not promoted into headings.
  3. Canonical export into XML, Markdown, JSON, and retrieval artifacts from one internal schema.
  4. RAG-first indexing with chunk provenance, section paths, and page references.
  5. XML-safe serialization that strips characters invalid under XML 1.0.
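
Requirement 4 can be illustrated with a small sketch of what a chunk record with provenance might look like. The `Block` type, the field names (`chunk_id`, `section_path`, `page`), and the `build_chunks` helper are assumptions for illustration, not AXIOMDoc's actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    kind: str       # "heading" or "paragraph"
    text: str
    page: int
    level: int = 0  # heading depth (1 = top level); unused for paragraphs


def build_chunks(source: str, blocks: List[Block]) -> List[dict]:
    """Emit one record per paragraph, tagged with its section path and page."""
    section_path: List[str] = []
    chunks: List[dict] = []
    for block in blocks:
        if block.kind == "heading":
            # Keep the ancestors above this heading's level, then append it.
            section_path = section_path[: block.level - 1] + [block.text]
            continue
        chunks.append({
            "chunk_id": f"{source}#{len(chunks)}",
            "source": source,
            "page": block.page,
            "section_path": list(section_path),
            "text": block.text,
        })
    return chunks


blocks = [
    Block("heading", "Methods", page=1, level=1),
    Block("heading", "Data", page=1, level=2),
    Block("paragraph", "We used three PDFs.", page=1),
    Block("heading", "Results", page=2, level=1),
    Block("paragraph", "All files parsed.", page=2),
]
chunks = build_chunks("sample.pdf", blocks)
```

With records shaped like this, a retrieved chunk can be traced back to its source file, page, and position in the heading hierarchy.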

Architecture

AXIOMDoc follows a canonical-document-model approach:

  • Parser backends normalize source files into one schema.
  • Exporters transform that schema into XML, Markdown, and other artifacts.
  • Index builders create retrieval-ready records with explicit provenance.
  • Enrichment passes can later add headings, entities, forms, tables, and citation anchors.

This keeps parsing separate from retrieval and avoids binding the project to one vendor model or one OCR stack.
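
The separation described above can be sketched as a parser registry that feeds one shared model, with exporters that only ever see that model. Everything named here (`Document`, `register`, `parse`, `to_markdown`, `to_index_records`, and the toy `.txt` backend) is hypothetical and does not mirror the contents of `parsers/registry.py` or `models.py`.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical canonical model shared by all parsers and exporters."""
    source: str
    headings: List[str] = field(default_factory=list)
    paragraphs: List[str] = field(default_factory=list)


Parser = Callable[[str, str], Document]
PARSERS: Dict[str, Parser] = {}


def register(extension: str):
    """Decorator that maps a file extension to a parser backend."""
    def decorator(func: Parser) -> Parser:
        PARSERS[extension] = func
        return func
    return decorator


@register(".txt")
def parse_plain_text(source: str, raw: str) -> Document:
    # Toy backend: the first non-empty line is a heading, the rest are paragraphs.
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    return Document(source=source, headings=lines[:1], paragraphs=lines[1:])


def parse(source: str, raw: str) -> Document:
    suffix = Path(source).suffix.lower()
    if suffix not in PARSERS:
        raise ValueError(f"no parser registered for {suffix!r}")
    return PARSERS[suffix](source, raw)


def to_markdown(doc: Document) -> str:
    parts = [f"# {h}" for h in doc.headings] + doc.paragraphs
    return "\n\n".join(parts)


def to_index_records(doc: Document) -> List[dict]:
    return [
        {"source": doc.source, "position": i, "text": p}
        for i, p in enumerate(doc.paragraphs)
    ]


doc = parse("notes.txt", "Title\nFirst paragraph.\nSecond paragraph.")
markdown = to_markdown(doc)
records = to_index_records(doc)
```

Because the exporters depend only on `Document`, adding a new input format in this sketch means registering one more parser and touching nothing downstream.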

Current package layout

src/axiomdoc/
  cli.py
  pipeline.py
  models.py
  indexing.py
  exporters/
    xml.py
    markdown.py
  parsers/
    base.py
    docx.py
    registry.py
    pdf.py
    xlsx.py
    xml.py
tests/
  test_exporters.py
  test_parsers.py
.github/
  workflows/tests.yml
docs/
  architecture.md
assets/
  axiomdoc-logo.svg

Install

python3 -m pip install -e .

Full parser and test dependencies:

python3 -m pip install -e ".[full,dev]"

Example

axiomdoc parse ./sample.pdf --xml-out ./sample.xml --markdown-out ./sample.md --index-out ./sample.index.json

Evaluation

AXIOMDoc evaluation plan

The comparison in this section is populated from a runnable benchmark harness in benchmarks/run_benchmarks.py. This is still an operational benchmark, not a full scientific benchmark with human labels, so the metrics are limited to things we can measure honestly and reproduce today.

Comparison set used in this run:

  • AXIOMDoc
  • Docling
  • PyMuPDF raw extraction baseline
  • pdfplumber

Public PDF corpus used in this run:

  • attention-is-all-you-need.pdf
  • orimi-test.pdf
  • w3c-dummy.pdf

Measured results from the current run:

Library       Success Rate  Median Sec/Page  XML Well-Formed Rate  Median Heading Count  Median Markdown Chars  Median Chunk Count
AXIOMDoc      1.00          0.0112           1.00                  1                     386                    3
Docling       1.00          1.6584           1.00                  1                     386                    0
PyMuPDF raw   1.00          0.0015           1.00                  0                     390                    0
pdfplumber    1.00          0.0156           1.00                  0                     365                    0

Interpretation:

  • AXIOMDoc is roughly 7.5 times slower per page than raw PyMuPDF (0.0112 s vs. 0.0015 s median) because it does structural classification and builds XML, Markdown, and chunk manifests.
  • AXIOMDoc is roughly 148 times faster per page than Docling (0.0112 s vs. 1.6584 s median) on this small corpus while still emitting RAG-ready chunks.
  • Docling and AXIOMDoc both recovered markdown headings on the median document in this dataset.
  • All evaluated libraries produced well-formed XML in this benchmark because the wrapper export path enforced XML-safe serialization.
  • AXIOMDoc is the only library in this comparison currently producing a non-zero chunk manifest because the benchmark used each tool's default or near-default extraction path.
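
The speed comparisons above follow directly from the median seconds-per-page column. A short check of that arithmetic, using only the values reported in the results table:

```python
# Median seconds per page, copied from the measured results.
median_sec_per_page = {
    "AXIOMDoc": 0.0112,
    "Docling": 1.6584,
    "PyMuPDF raw": 0.0015,
    "pdfplumber": 0.0156,
}


def ratio(slower: str, faster: str) -> float:
    """How many times more seconds per page `slower` takes than `faster`."""
    return median_sec_per_page[slower] / median_sec_per_page[faster]


axiomdoc_vs_pymupdf = ratio("AXIOMDoc", "PyMuPDF raw")
docling_vs_axiomdoc = ratio("Docling", "AXIOMDoc")

print(f"AXIOMDoc vs PyMuPDF raw: {axiomdoc_vs_pymupdf:.1f}x slower")  # 7.5x
print(f"Docling vs AXIOMDoc: {docling_vs_axiomdoc:.1f}x slower")      # 148.1x
```

These ratios describe a three-document corpus, so they are best read as orders of magnitude rather than precise speedups.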

Benchmark files:

Benchmark command:

.venv/bin/python benchmarks/run_benchmarks.py --dataset-dir benchmarks/datasets/pdfs --libraries axiomdoc pymupdf_raw pdfplumber --output benchmarks/results/latest.json
.venv/bin/python benchmarks/run_benchmarks.py --dataset-dir benchmarks/datasets/pdfs --libraries docling --output benchmarks/results/docling.json

Limits of this benchmark:

  • This is PDF-only right now. DOCX, XLSX, and XML are not included yet.
  • Heading recovery here is markdown heading count, not labeled precision/recall.
  • Markdown character count is a yield proxy, not a semantic quality score.
  • The dataset is small and should be expanded before making stronger claims.

Labeled fixture evaluation is now available in benchmarks/labeled_eval.py and exercised in tests/test_hardening.py. That scorer currently measures expected heading recovery and table recovery against explicit JSON labels.
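
As a rough illustration of scoring heading recovery against JSON labels, the sketch below computes precision and recall over normalized heading strings. It is not the code in benchmarks/labeled_eval.py: the `expected_headings` label key, the normalization rule, and the exact-match criterion are all assumptions.

```python
import json
from typing import Dict, List


def normalize(heading: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count as misses."""
    return " ".join(heading.lower().split())


def score_headings(expected: List[str], predicted: List[str]) -> Dict[str, float]:
    expected_set = {normalize(h) for h in expected}
    predicted_set = {normalize(h) for h in predicted}
    matched = expected_set & predicted_set
    # Convention chosen for this sketch: an empty denominator scores 1.0.
    recall = len(matched) / len(expected_set) if expected_set else 1.0
    precision = len(matched) / len(predicted_set) if predicted_set else 1.0
    return {"precision": precision, "recall": recall}


# Hypothetical label file content and parser output.
label_json = '{"expected_headings": ["1 Introduction", "2 Background", "3 Model Architecture"]}'
labels = json.loads(label_json)
predicted = ["1  Introduction", "3 Model architecture", "Figure 1"]

scores = score_headings(labels["expected_headings"], predicted)
```

Here two of the three expected headings are recovered and one predicted heading is spurious, so both precision and recall come out to 2/3. Unlike a raw heading count, this penalizes body text promoted into headings as well as missed headings.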

XML safety

XML 1.0 does not allow most C0 control characters (everything below U+0020 except tab, line feed, and carriage return), surrogate code points, or U+FFFE and U+FFFF. AXIOMDoc now sanitizes these invalid characters before serialization in src/axiomdoc/exporters/xml.py, so malformed text content does not break XML generation.
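
A minimal sketch of this kind of sanitization, assuming a regex built from the complement of the XML 1.0 `Char` production. The function name `sanitize_xml_text` is hypothetical, and the actual code in src/axiomdoc/exporters/xml.py may differ.

```python
import re
import xml.etree.ElementTree as ET

# XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# The pattern matches every character outside that set.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def sanitize_xml_text(text: str) -> str:
    """Drop characters that XML 1.0 forbids; keep everything else unchanged."""
    return _INVALID_XML_CHARS.sub("", text)


# NUL, vertical tab, unit separator, and a lone surrogate are all invalid;
# the tab is valid and must survive.
dirty = "Title\x00 with\x0b control\x1f chars\ud800 and a tab\tkept"
clean = sanitize_xml_text(dirty)

element = ET.Element("p")
element.text = clean
serialized = ET.tostring(element, encoding="unicode")

# Round-trip: the sanitized output parses and the text is preserved.
reparsed = ET.fromstring(serialized)
```

Stripping is the simplest policy. Replacing invalid characters with U+FFFD instead would preserve the fact that something was removed, at the cost of altering the extracted text.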

Release readiness

The repo now includes:

  • production PDF, DOCX, XML, and XLSX parsers
  • OCR fallback for image-only PDFs through the local tesseract binary
  • structured table preservation in XML, Markdown, and chunk manifests
  • pytest coverage for exporters, parser resolution, and PDF smoke behavior
  • labeled evaluation fixtures for heading and table recovery
  • a GitHub Actions test workflow at .github/workflows/tests.yml
  • an MIT LICENSE

Status

The project is in late release-prep. PDF, DOCX, XLSX, and XML baseline parsing are implemented, OCR fallback exists for image-only PDFs, and labeled evaluation now covers heading/table recovery on fixtures. The remaining gaps before a strict 1.0.0 are broader labeled datasets, richer scanned-document accuracy validation, and more advanced form/table semantics. The roadmap remains in docs/architecture.md.
