Hardened PDF->DOCX converter. Fork of pdf2docx with stability fixes, typed API, plugin architecture, and optional ML layout/OCR/table backends.

pdf2docx-plus

Hardened fork of pdf2docx — a Python PDF → DOCX converter that actually writes editable Word documents (not Markdown, not HTML).

What's different from upstream

                           upstream pdf2docx                               pdf2docx-plus
Python support             3.10+                                           3.11 / 3.12 / 3.13
Hyperlink OOXML            nested inside <w:r> (invalid)                   paragraph-level <w:hyperlink> (valid)
NULL-byte / control chars  sometimes leak into <w:t>, corrupting the DOCX  stripped at run insertion
Errors                     single ConversionException                      InputError / ParseError / MakeDocxError / PasswordRequired / TimeoutExceeded
Typed API                  no                                              py.typed, dataclasses, Protocol-based plugins
Return value               None                                            ConversionResult with per-page accounting
Timeout                    none (can hang forever)                         timeout_s= watchdog
Plugin architecture        no                                              swap table / layout / OCR / formula backends
REST server                no                                              pdf2docx-plus serve (FastAPI, optional)
ML hooks (opt-in)          no                                              Table Transformer, Granite-Docling, PaddleOCR, pix2tex
Tables → CSV               no                                              --tables-csv DIR
Structured logging         hijacks root logger                             scoped pdf2docx_plus logger
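
To illustrate the control-character row: XML 1.0 forbids most C0 control characters, lone surrogates, and U+FFFE / U+FFFF in text content, so a stray NULL byte extracted from a PDF makes the resulting document.xml unparseable. The snippet below is a minimal sketch of that kind of filter, not the code this fork actually runs; the name strip_xml_illegal is hypothetical.

```python
import re

# Characters that XML 1.0 does not allow in text content:
# C0 controls except tab (0x09), LF (0x0A) and CR (0x0D),
# lone surrogates, and the non-characters U+FFFE / U+FFFF.
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]")


def strip_xml_illegal(text: str) -> str:
    """Return text with characters that are invalid in XML 1.0 removed.

    Hypothetical helper for illustration; not part of pdf2docx-plus.
    """
    return _XML_ILLEGAL.sub("", text)


cleaned = strip_xml_illegal("Net\x00 asset\x0b value\t1.23\n")
print(repr(cleaned))
```

Tab, newline, and carriage return are legal in XML and pass through unchanged.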

Install

pip install pdf2docx-plus            # core
pip install 'pdf2docx-plus[rest]'    # + FastAPI server
pip install 'pdf2docx-plus[bench]'   # + evaluation harness
pip install 'pdf2docx-plus[ml-tables]' # + Table Transformer (torch)
pip install 'pdf2docx-plus[ml-ocr]'  # + PaddleOCR

Quick start

from pdf2docx_plus import convert

result = convert("in.pdf", "out.docx", timeout_s=120)
print(result.pages_ok, "/", result.pages_total, "pages in", result.elapsed_s, "s")

Or with more control:

from pdf2docx_plus import Converter, PluginRegistry
from pdf2docx_plus.hooks import TableTransformerDetector

plugins = PluginRegistry()
plugins.add_table_detector(TableTransformerDetector(device="cuda"))

with Converter("in.pdf", password="s3cret") as cv:
    result = cv.convert(
        "out.docx",
        pages=[0, 1, 2],
        profile="fidelity",     # "fast" | "fidelity" | "semantic"
        timeout_s=60,
        continue_on_error=True,
    )
    for p in result.page_results:
        if not p.ok:
            print(f"page {p.page_index}: {p.error}")
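
With continue_on_error=True a conversion can finish with some pages missing, so callers usually want the failures in a form they can retry or report. The sketch below relies only on the attributes that appear in the snippets above (pages_ok, pages_total, page_results, page_index, ok, error). The two dataclasses are stand-ins so the example runs without the package installed; the real ConversionResult may carry more fields, and failed_pages is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PageResult:
    """Stand-in for the library's per-page result (assumed shape)."""
    page_index: int
    ok: bool
    error: Optional[str] = None


@dataclass
class ConversionResult:
    """Stand-in for the library's result object (assumed shape)."""
    pages_total: int
    pages_ok: int
    page_results: list[PageResult] = field(default_factory=list)


def failed_pages(result) -> dict[int, str]:
    """Map each failed page index to its error message."""
    return {p.page_index: str(p.error) for p in result.page_results if not p.ok}


# Example data only; not output from a real conversion.
result = ConversionResult(
    pages_total=3,
    pages_ok=2,
    page_results=[
        PageResult(0, True),
        PageResult(1, True),
        PageResult(2, False, "ParseError: broken xref"),
    ],
)
retry = failed_pages(result)
print(sorted(retry))  # page indices that could be passed back as pages=[...]
```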

CLI

pdf2docx-plus convert in.pdf out.docx --timeout 120 --profile fidelity
pdf2docx-plus convert in.pdf --pages 0,2,5 --tables-csv tables/
pdf2docx-plus extract-tables in.pdf --out tables.json
pdf2docx-plus serve --host 0.0.0.0 --port 8000
pdf2docx-plus version

REST server

pip install 'pdf2docx-plus[rest]'
pdf2docx-plus serve --port 8000
# in another shell:
curl -F file=@in.pdf -F profile=fidelity http://localhost:8000/convert -o out.docx

Endpoints:

Method  Path             Body                                                   Returns
POST    /convert         multipart file, optional password, profile, timeout_s  DOCX bytes + X-Pages-Ok / X-Pages-Failed / X-Elapsed-Seconds headers
POST    /extract-tables  multipart file, optional password                      JSON {"tables": [...]}
GET     /healthz         -                                                      {"status": "ok"}
GET     /version         -                                                      {"version": "..."}
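
The curl call can be reproduced from Python with the standard library alone. This is a sketch under assumptions: the form field names (file, profile) and the response headers come from the endpoint table above, while build_multipart and convert_via_rest are hypothetical helper names and the server's exact validation rules are not documented here.

```python
import urllib.request
import uuid
from typing import Optional


def build_multipart(fields: dict[str, str], filename: str, payload: bytes) -> tuple[bytes, str]:
    """Encode text fields plus one file part named "file" as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        (
            f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
            f'filename="{filename}"\r\nContent-Type: application/pdf\r\n\r\n'
        ).encode()
        + payload
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def convert_via_rest(base_url: str, pdf_path: str, out_path: str,
                     profile: str = "fidelity") -> dict[str, Optional[str]]:
    """POST a PDF to /convert, save the DOCX, and return the accounting headers."""
    with open(pdf_path, "rb") as fh:
        body, content_type = build_multipart({"profile": profile}, "in.pdf", fh.read())
    req = urllib.request.Request(
        f"{base_url}/convert",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        with open(out_path, "wb") as out:
            out.write(resp.read())
        wanted = ("X-Pages-Ok", "X-Pages-Failed", "X-Elapsed-Seconds")
        return {h: resp.headers.get(h) for h in wanted}
```

With the server from the example above running, convert_via_rest("http://localhost:8000", "in.pdf", "out.docx") would write the DOCX and return the three accounting headers as strings (or None for any header the server did not send).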

Plugin architecture

Four extension points, all Protocol-based:

from pdf2docx_plus.plugins import (
    TableDetector, LayoutDetector, OcrEngine, FormulaRecognizer
)

Register any implementation on PluginRegistry and pass it to Converter. Plugins never kill a conversion — exceptions raised inside a plugin are logged and skipped.
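
The two ideas in this section, structural typing via Protocol and log-and-skip fault isolation, can be sketched in a self-contained way. Everything below is illustrative: the detect() signature, the BBox alias, and run_detectors are assumptions made for the example, not the library's actual protocol or registry code. Check pdf2docx_plus.plugins for the real method names.

```python
import logging
from typing import Protocol, runtime_checkable

log = logging.getLogger("pdf2docx_plus.example")  # hypothetical logger name

BBox = tuple[float, float, float, float]  # x0, y0, x1, y1 (assumed layout)


@runtime_checkable
class TableDetector(Protocol):
    """Stand-in protocol; the real signature may differ."""

    def detect(self, page_index: int, width: float, height: float) -> list[BBox]: ...


class WholePageDetector:
    """Treats the full page as one table region. No base class is needed:
    having a matching detect() method is enough to satisfy the protocol."""

    def detect(self, page_index: int, width: float, height: float) -> list[BBox]:
        return [(0.0, 0.0, width, height)]


class BrokenDetector:
    """Always fails, to show that one bad plugin does not abort the run."""

    def detect(self, page_index: int, width: float, height: float) -> list[BBox]:
        raise RuntimeError("model weights missing")


def run_detectors(detectors, page_index: int, width: float, height: float) -> list[BBox]:
    """Call every detector; log and skip any that raise."""
    boxes: list[BBox] = []
    for det in detectors:
        try:
            boxes.extend(det.detect(page_index, width, height))
        except Exception:
            log.exception("plugin %s failed on page %d; skipping",
                          type(det).__name__, page_index)
    return boxes


boxes = run_detectors([BrokenDetector(), WholePageDetector()], 0, 595.0, 842.0)
print(boxes)
```

The failing detector is logged with its traceback and the remaining detector still contributes its result, which is the behaviour described above.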

Built-in ML hooks (opt-in extras):

Hook                          Backend                                    Extra       Weights license
TableTransformerDetector      HuggingFace microsoft/table-transformer-*  ml-tables   MIT
GraniteDoclingLayoutDetector  ibm-granite/granite-docling-258M           ml-layout   Apache-2.0
PaddleOcrEngine               PaddleOCR                                  ml-ocr      Apache-2.0
Pix2TexFormulaRecognizer      pix2tex                                    ml-formula  MIT
UniMERNetFormulaRecognizer    UniMERNet (bring weights)                  manual      Apache-2.0

Benchmark

pip install 'pdf2docx-plus[bench]'
python -m bench.run --corpus bench/corpus --out bench/reports/latest.json

Metrics implemented: text F1, TEDS (apted), reading-order Kendall-tau, rendered SSIM (via LibreOffice + scikit-image), and editability ratio.
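
For readers unfamiliar with two of these metrics, here is a minimal sketch of how a token-level text F1 and a reading-order Kendall tau can be computed. It is an illustration under assumptions (whitespace tokenisation, unique block identifiers, no ties); the bench harness may tokenise, normalise, or handle ties differently.

```python
from collections import Counter
from itertools import combinations


def text_f1(expected: str, actual: str) -> float:
    """Token-level F1 over whitespace-split tokens (bag-of-words overlap)."""
    exp, act = Counter(expected.split()), Counter(actual.split())
    overlap = sum((exp & act).values())  # tokens present in both, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(act.values())
    recall = overlap / sum(exp.values())
    return 2 * precision * recall / (precision + recall)


def kendall_tau(expected_order: list[str], actual_order: list[str]) -> float:
    """Kendall tau between two orderings of the same unique items (no ties)."""
    rank = {item: i for i, item in enumerate(actual_order)}
    pairs = list(combinations(expected_order, 2))
    if not pairs:
        return 1.0
    # A pair is concordant when both orderings place its items the same way round.
    concordant = sum(1 for a, b in pairs if rank[a] < rank[b])
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)


f1 = text_f1("net asset value per unit", "net asset value unit")
tau = kendall_tau(["a", "b", "c", "d"], ["a", "c", "b", "d"])
print(round(f1, 3), round(tau, 3))  # 0.889 0.667
```

In the example, one missing token gives precision 1.0 and recall 0.8, and one swapped pair out of six gives tau = (5 - 1) / 6.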

Seed corpus in this repo: 3 financial fund PDFs (born-digital). Drop more under bench/corpus/<name>/input.pdf and, optionally, expected_text.txt, expected_tables.json, expected_order.json for scoring.

Current baseline on the seed corpus (76 pages, CPU):

awhkef                  9 pages   0 failed    7.1 s   74 KB
first_sentier          58 pages   0 failed   15.8 s  155 KB
kfs_bosera              9 pages   0 failed    4.3 s   87 KB
TOTAL                  76 pages   0 failed   27.7 s  2.75 pg/s

Licensing

pdf2docx-plus itself is MIT-licensed, but it depends on PyMuPDF (AGPL-3.0). The AGPL obligations propagate to you if you redistribute the software or expose it as a network service. See LICENSING.md for the full dependency matrix, AGPL implications, and the future pypdfium2 migration path.

What's NOT done yet (roadmap)

This fork covers Phase 0 (foundation) and most of Phase 1 (stability + typed API) from the original 21-week PDF2DOCX_FORK_PLAN.md. Phases 2–5 are scaffolded via the plugin architecture, but the ML-backed hooks need real integration work to reach the v1.0 success criteria in the plan (TEDS ≥ 0.90, text F1 ≥ 0.98, reading-order Kendall-tau ≥ 0.90).

Specifically, still open:

  • Train / evaluate Table Transformer + Granite-Docling against an annotated corpus (plan §K).
  • Cross-page table stitching heuristic (§B.7).
  • Header/footer → w:hdr / w:ftr emission (§C.13).
  • Math recognition pipeline wiring (§F.24).
  • Scanned-PDF OCR routing + auto-detect (§G.25).
  • styles.xml rewrite (§H.27) — currently we still use python-docx defaults.
  • pypdfium2 backend for permissive licensing (§6).

Credits

Forked from ArtifexSoftware/pdf2docx (originally by @dothinking). MIT.
