Formatting-preserving PDF-to-DOCX converter that fixes bullet lists, hyperlinks, CJK fonts, and scanned PDFs

Project description

pdf2docx-healer

License: MIT

A drop-in replacement for pdf2docx that actually preserves your formatting.

pdf2docx is a great PDF-to-DOCX converter, but it drops bullet lists, loses hyperlinks, mangles CJK fonts, and chokes on scanned PDFs. pdf2docx-healer wraps pdf2docx and heals these issues: scanned PDFs are OCR'd before conversion, and everything else is fixed in a post-processing pass over the generated DOCX, so your Word documents come out looking the way they should.


Why this exists

Problem | pdf2docx alone | With pdf2docx-healer
Bullet lists (Unicode bullets, -, *) | Flattened to plain text, no Word list style | Proper List Bullet style with real Word numbering
Numbered lists (1., a., i.) | Lost or merged into one paragraph | List Number style; lettered/roman via OOXML injection
Nested lists (3+ levels) | Indentation lost | Level detected from indent, applied to Word
Hyperlinks | URL text is plain, not clickable | Wrapped in real <w:hyperlink> elements with blue/underline
CJK fonts (Chinese/Japanese/Korean) | Font names like SimSun may not resolve | Fallback chain maps to system-available CJK fonts
Scanned PDFs (image-only) | "Words count: 0" warning, empty output | OCR via Tesseract, then normal conversion
Section headers styled as lists | Headers like "4. Numbered List" get list style | Detected as headers, kept as Normal paragraphs
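
To make the list rows above concrete, here is a minimal sketch of how a line of extracted text could be classified by its marker and indent. It is not pdf2docx-healer's actual code: the `classify` function, the `MARKER_RE` pattern, and the 4-space `indent_width` are all assumptions made for illustration. The sketch also treats any run of `i`, `v`, `x` as roman, so an item such as "x." is ambiguous, and it makes no attempt at header detection.

```python
import re

# Hypothetical pattern for illustration; pdf2docx-healer's own rules may differ.
MARKER_RE = re.compile(
    r"^(\s*)"                # 1: leading indent
    r"(?:([\u2022*+-])"      # 2: bullet character
    r"|(\d+)[.)]"            # 3: decimal marker, e.g. "1." or "2)"
    r"|([ivx]+)[.)]"         # 4: lowercase roman marker, e.g. "iv."
    r"|([a-z])[.)])"         # 5: single-letter marker, e.g. "a."
    r"\s+(\S.*)$"            # 6: item text
)
KINDS = {2: "bullet", 3: "decimal", 4: "roman", 5: "letter"}


def classify(line, indent_width=4):
    """Return (kind, level, text) for a list-like line, or None otherwise."""
    match = MARKER_RE.match(line.expandtabs(indent_width))
    if match is None:
        return None
    # Exactly one of the marker groups is non-empty when the pattern matches.
    kind = next(name for group, name in KINDS.items() if match.group(group))
    # Assumed rule: one nesting level per indent_width columns of indent.
    level = len(match.group(1)) // indent_width
    return kind, level, match.group(6)
```

A caller could then map `kind` to a Word list style and `level` to the list nesting depth.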

Install

pip install pdf2docx-healer

For OCR support on scanned PDFs, also install Tesseract and the optional extra:

pip install "pdf2docx-healer[ocr]"
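
Before enabling OCR, it can be useful to confirm that Tesseract is installed at all. The sketch below only checks that a `tesseract` executable is on PATH using the standard library. How pdf2docx-healer itself locates Tesseract is not described here, so treat the `tesseract_available` helper as a hypothetical pre-flight check rather than the library's own logic.

```python
import shutil


def tesseract_available(binary="tesseract"):
    """Return True if an executable with the given name is found on PATH."""
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found on PATH")
    else:
        print("Tesseract not found; scanned PDFs may convert without OCR")
```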

Quick start

Python API

from docx_healer import heal

# Simplest usage: output goes to "report.docx"
heal("report.pdf", "report.docx")

from docx_healer import heal, HealerConfig

# Full control via config
config = HealerConfig(
    ocr_enabled=True,          # OCR for scanned/image PDFs
    ocr_lang="eng",            # Tesseract language code
    ocr_dpi=300,               # OCR resolution
    ocr_threshold=0.3,         # Fraction of textless pages to trigger OCR
    fix_lists=True,            # Detect & style bullet/numbered lists
    fix_hyperlinks=True,       # Wrap URL text in clickable hyperlinks
    fix_fonts=True,            # Map CJK/unavailable fonts to system fonts
    aggressive_lists=False,    # More aggressive paragraph splitting
    verbose=True,              # Print progress
)

heal("scanned_report.pdf", "output.docx", config=config)
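
The `ocr_threshold` comment describes a fraction of textless pages that triggers OCR. One plausible reading of that rule is sketched below. The helper names are hypothetical, and two details are assumptions: that a page counts as "textless" when its extracted text is empty or whitespace-only, and that OCR triggers when the fraction is greater than or equal to the threshold.

```python
def textless_fraction(page_texts):
    """Fraction of pages whose extracted text is empty or whitespace-only."""
    if not page_texts:
        return 0.0
    empty = sum(1 for text in page_texts if not text.strip())
    return empty / len(page_texts)


def should_run_ocr(page_texts, threshold=0.3):
    """Assumed rule: run OCR once the textless fraction reaches the threshold."""
    return textless_fraction(page_texts) >= threshold


if __name__ == "__main__":
    pages = ["Introduction", "", "   ", "Conclusion"]
    print(textless_fraction(pages), should_run_ocr(pages))
```

With PyMuPDF, the per-page strings could be gathered by calling `page.get_text()` on each page of an opened document.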

Command line

# Basic conversion
pdf2docx-heal input.pdf -o output.docx

# Scanned PDF with OCR
pdf2docx-heal input.pdf --ocr --ocr-lang eng

# Quiet mode (no progress output)
pdf2docx-heal input.pdf -q

# Skip specific fixes
pdf2docx-heal input.pdf --no-lists --no-hyperlinks

Run pdf2docx-heal --help to see all options.
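
When driving the CLI from a script, the argument list can be assembled programmatically. This sketch uses only the flags shown in the examples above; the `build_command` helper and its parameter names are hypothetical and not part of the package.

```python
def build_command(pdf, output=None, ocr=False, ocr_lang=None,
                  quiet=False, skip_lists=False, skip_hyperlinks=False):
    """Build an argument list for the pdf2docx-heal CLI (flags from the examples above)."""
    cmd = ["pdf2docx-heal", str(pdf)]
    if output is not None:
        cmd += ["-o", str(output)]
    if ocr:
        cmd.append("--ocr")
    if ocr_lang is not None:
        cmd += ["--ocr-lang", ocr_lang]
    if quiet:
        cmd.append("-q")
    if skip_lists:
        cmd.append("--no-lists")
    if skip_hyperlinks:
        cmd.append("--no-hyperlinks")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with file names that contain spaces.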


What it fixes

  • Bullet lists: Detects Unicode bullet characters and ASCII (-, *, +) bullets, then applies Word's List Bullet style. Nested bullets (up to 5 levels) are detected from indentation.
  • Numbered lists: Detects decimal (1.), parenthesized ((1)), lettered (a.), roman (i.), and outline (1.1) numbering. Lettered and roman lists use OOXML injection with the correct numFmt, since Word's built-in styles only support decimal.
  • Hyperlinks: Scans runs for http://, https://, www., mailto:, and ftp://, and wraps each match in a <w:hyperlink> element with an external relationship target. Multiple URLs in one run are all converted.
  • CJK font fallback: Maps embedded font names (SimSun, MS-Mincho, HYGoThic-Medium) to system-available equivalents across Windows/macOS/Linux. Character-range detection maps unknown fonts by script (CJK, Arabic, Hebrew, Thai, Devanagari, Cyrillic).
  • Scanned PDF OCR: Detects image-only PDFs and runs Tesseract OCR via PyMuPDF. Falls back gracefully if Tesseract isn't installed.
  • Smart header detection: Headers like "4. Numbered List" are detected via sequential-reset analysis and kept as Normal paragraphs instead of being styled as list items.
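
As an illustration of the hyperlink step in the list above, the sketch below splits the text of one run into plain and URL segments, so that each URL segment could later be wrapped in its own hyperlink element. The regular expression, the trailing-punctuation trimming, and the choice to prefix `www.` targets with `http://` are assumptions made for this example, not the project's actual rules.

```python
import re

# Hypothetical pattern covering the prefixes named above.
URL_RE = re.compile(r"(?:https?://|ftp://|mailto:|www\.)[^\s<>\"']+", re.IGNORECASE)
TRAILING_PUNCTUATION = ".,;:!?)"


def split_run(text):
    """Split text into (segment, target) pairs; target is None for plain text."""
    parts = []
    pos = 0
    for match in URL_RE.finditer(text):
        # Drop sentence punctuation that happens to follow the URL.
        url = match.group(0).rstrip(TRAILING_PUNCTUATION)
        start = match.start()
        if start > pos:
            parts.append((text[pos:start], None))
        # Assumption: bare "www." links get an explicit http:// scheme.
        target = "http://" + url if url.lower().startswith("www.") else url
        parts.append((url, target))
        pos = start + len(url)
    if pos < len(text):
        parts.append((text[pos:], None))
    return parts


if __name__ == "__main__":
    for segment, target in split_run("See https://example.com/docs, or www.example.org."):
        print(repr(segment), "->", target)
```

Joining the segments back together reproduces the original text, so the split never loses characters from the run.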

Requirements

  • Python 3.8+
  • pdf2docx >= 0.5.0, PyMuPDF >= 1.23.0, python-docx >= 0.8.11, lxml
  • Tesseract (optional, for OCR)

Download files

Source Distribution

pdf2docx_healer-0.1.4.tar.gz (23.0 kB view details)

Uploaded Source

Built Distribution

pdf2docx_healer-0.1.4-py3-none-any.whl (24.5 kB view details)

Uploaded Python 3

File details

Details for the file pdf2docx_healer-0.1.4.tar.gz.

File metadata

  • Download URL: pdf2docx_healer-0.1.4.tar.gz
  • Size: 23.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pdf2docx_healer-0.1.4.tar.gz
Algorithm Hash digest
SHA256 5791f90f572b3ee6ffd0e536c4de00e0783eb6013216586e7080cab962741ff1
MD5 466a28146007359789f80175bd942c37
BLAKE2b-256 9c96ea201b77f938dd375a99ebab380c4e236c85b114c36e90f0cc8cd873d61a

Provenance

The following attestation bundles were made for pdf2docx_healer-0.1.4.tar.gz:

Publisher: publish.yml on krockxz/pdf2docx-healer

File details

Details for the file pdf2docx_healer-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: pdf2docx_healer-0.1.4-py3-none-any.whl
  • Size: 24.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pdf2docx_healer-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 b8ad89baaf9d0e87b7042e1498944d5287a2fdaa540f236bf97b9fafbd098e84
MD5 57623709d719bdec62497cdf914abb75
BLAKE2b-256 569c5570b93283a43b8ec63ab8e5c638e2b8e6e6b09d9f545f3d417590decae8

Provenance

The following attestation bundles were made for pdf2docx_healer-0.1.4-py3-none-any.whl:

Publisher: publish.yml on krockxz/pdf2docx-healer
