Formatting-preserving PDF-to-DOCX converter that fixes bullet lists, hyperlinks, CJK fonts, and scanned PDFs
Project description
pdf2docx-healer
A drop-in replacement for pdf2docx that actually preserves your formatting.
pdf2docx is a great PDF-to-DOCX converter, but it drops bullet lists, loses hyperlinks, mangles CJK fonts, and chokes on scanned PDFs. pdf2docx-healer wraps pdf2docx and heals all of these issues in a post-processing pass — so your Word documents come out looking the way they should.
Why this exists
| Problem | pdf2docx alone | With pdf2docx-healer |
|---|---|---|
| Bullet lists (•, -, *) | Flattened to plain text, no Word list style | Proper List Bullet style with real Word numbering |
| Numbered lists (1., a., i.) | Lost or merged into one paragraph | List Number style; lettered/roman via OOXML injection |
| Nested lists (3+ levels) | Indentation lost | Level detected from indent, applied to Word |
| Hyperlinks | URL text is plain, not clickable | Wrapped in real `<w:hyperlink>` elements with blue/underline |
| CJK fonts (Chinese/Japanese/Korean) | Font names like SimSun may not resolve | Fallback chain maps to system-available CJK fonts |
| Scanned PDFs (image-only) | "Words count: 0" warning, empty output | OCR via Tesseract, then normal conversion |
| Section headers styled as lists | Headers like "4. Numbered List" get list style | Detected as headers, kept as Normal paragraphs |
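To make the hyperlink row above more concrete, here is a minimal sketch of how the text of one run could be split into plain and URL segments before the URL parts are wrapped in link elements. It is an illustration only, not the package's code: `URL_RE`, `split_urls`, and `link_target` are hypothetical names, and the trailing-punctuation rule is an assumption.

```python
import re
from typing import List, Tuple

# Prefixes taken from the feature list; the character class that ends a URL
# is an assumption made for this sketch.
URL_RE = re.compile(r"(?:https?://|ftp://|mailto:|www\.)[^\s<>\"']+", re.IGNORECASE)
TRAILING_PUNCTUATION = ".,;:!?)]}"


def split_urls(text: str) -> List[Tuple[str, bool]]:
    """Split text into (segment, is_url) pairs, preserving every character."""
    segments: List[Tuple[str, bool]] = []
    pos = 0
    for match in URL_RE.finditer(text):
        # Sentence punctuation right after a URL is treated as plain text.
        url = match.group(0).rstrip(TRAILING_PUNCTUATION)
        start, end = match.start(), match.start() + len(url)
        if start > pos:
            segments.append((text[pos:start], False))
        segments.append((url, True))
        pos = end
    if pos < len(text):
        segments.append((text[pos:], False))
    return segments


def link_target(url: str) -> str:
    """Give bare www. addresses a scheme so they can be used as a link target."""
    return "http://" + url if url.lower().startswith("www.") else url


parts = split_urls("See https://example.com, or www.example.org.")
```

Because every character of the input ends up in exactly one segment, a run can be rebuilt from the pairs without losing text, and several URLs in one run are handled in a single pass.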
Install
pip install pdf2docx-healer
For OCR support on scanned PDFs, also install Tesseract and the optional extra:
pip install "pdf2docx-healer[ocr]"
Quick start
Python API
from docx_healer import heal
# Simplest usage — output goes to "report.docx"
heal("report.pdf", "report.docx")
from docx_healer import heal, HealerConfig
# Full control via config
config = HealerConfig(
    ocr_enabled=True,        # OCR for scanned/image PDFs
    ocr_lang="eng",          # Tesseract language code
    ocr_dpi=300,             # OCR resolution
    ocr_threshold=0.3,       # Fraction of textless pages to trigger OCR
    fix_lists=True,          # Detect & style bullet/numbered lists
    fix_hyperlinks=True,     # Wrap URL text in clickable hyperlinks
    fix_fonts=True,          # Map CJK/unavailable fonts to system fonts
    aggressive_lists=False,  # More aggressive paragraph splitting
    verbose=True,            # Print progress
)
heal("scanned_report.pdf", "output.docx", config=config)
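The `ocr_threshold` comment above describes a fraction of textless pages. As a rough sketch of what such a rule could look like, the helper names below (`textless_fraction`, `should_run_ocr`) are hypothetical, and the choice of `>=` rather than `>` is an assumption; the package may decide differently.

```python
from typing import Sequence


def textless_fraction(page_texts: Sequence[str]) -> float:
    """Return the share of pages whose extracted text is empty or whitespace."""
    if not page_texts:
        return 0.0
    textless = sum(1 for text in page_texts if not text.strip())
    return textless / len(page_texts)


def should_run_ocr(page_texts: Sequence[str], threshold: float = 0.3) -> bool:
    """Assumed rule: trigger OCR once the textless share reaches the threshold."""
    return textless_fraction(page_texts) >= threshold


# One textless page out of four is 0.25, which stays below the default 0.3.
mostly_text = should_run_ocr(["", "body", "body", "body"])
# Two textless pages out of three is well above the threshold.
mostly_scans = should_run_ocr(["", "   ", "body"])
```

Under this reading, a lower threshold makes OCR kick in for documents with only a few scanned pages, while a value near 1.0 reserves it for PDFs that are almost entirely images.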
Command line
# Basic conversion
pdf2docx-heal input.pdf -o output.docx
# Scanned PDF with OCR
pdf2docx-heal input.pdf --ocr --ocr-lang eng
# Quiet mode (no progress output)
pdf2docx-heal input.pdf -q
# Skip specific fixes
pdf2docx-heal input.pdf --no-lists --no-hyperlinks
Run pdf2docx-heal --help to see all options.
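For batch jobs, the command line can also be driven from a script. The sketch below only assembles argument lists from the flags shown above; `build_command` and `convert_folder` are hypothetical helpers, and combining `-o`, `--ocr`, and `-q` in a single call is an assumption (check `pdf2docx-heal --help` first).

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_command(
    pdf: Path,
    out: Optional[Path] = None,
    ocr: bool = False,
    ocr_lang: str = "eng",
    quiet: bool = False,
) -> List[str]:
    """Assemble a pdf2docx-heal argument list from the documented flags."""
    cmd = ["pdf2docx-heal", str(pdf)]
    if out is not None:
        cmd += ["-o", str(out)]
    if ocr:
        cmd += ["--ocr", "--ocr-lang", ocr_lang]
    if quiet:
        cmd.append("-q")
    return cmd


def convert_folder(folder: Path, **options) -> None:
    """Convert every PDF in a folder; requires pdf2docx-heal on PATH."""
    for pdf in sorted(folder.glob("*.pdf")):
        out = pdf.with_suffix(".docx")
        subprocess.run(build_command(pdf, out, **options), check=True)


example = build_command(Path("input.pdf"), Path("output.docx"), ocr=True, quiet=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.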
What it fixes
- Bullet lists: Detects Unicode (`•`, `◦`, `▪`, `–`) and ASCII (`-`, `*`, `+`) bullets and applies Word's `List Bullet` style. Nested bullets (up to 5 levels) are detected from indentation.
- Numbered lists: Detects decimal (`1.`), parenthesized (`(1)`), lettered (`a.`), roman (`i.`), and outline (`1.1`) numbering. Lettered/roman use OOXML injection with the correct `numFmt`, since Word's built-in styles only support decimal.
- Hyperlinks: Scans runs for `http://`, `https://`, `www.`, `mailto:`, `ftp://` and wraps them in `<w:hyperlink>` elements with external relationship targets. Multiple URLs in one run all get converted.
- CJK font fallback: Maps embedded font names (`SimSun`, `MS-Mincho`, `HYGoThic-Medium`) to system-available equivalents across Windows/macOS/Linux. Character-range detection maps unknown fonts by script (CJK, Arabic, Hebrew, Thai, Devanagari, Cyrillic).
- Scanned PDF OCR: Detects image-only PDFs and runs Tesseract OCR via PyMuPDF. Falls back gracefully if Tesseract isn't installed.
- Smart header detection: Headers like `"4. Numbered List"` are detected via sequential-reset analysis and kept as Normal paragraphs instead of being styled as list items.
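The list-related items above can be illustrated with a small marker classifier. This is a sketch under stated assumptions, not the package's detection logic: the regular expressions, the names `classify_marker` and `indent_level`, and the 18 pt indent step are all hypothetical. Note that a lone `i.`, `v.`, or `x.` is ambiguous between roman and lettered numbering; this sketch simply prefers roman, whereas a real detector would need surrounding context.

```python
import re
from typing import Optional, Tuple

# Order matters: more specific patterns are tried first.
_PATTERNS = [
    ("bullet", re.compile(r"^[•◦▪–\-*+]\s+")),
    ("outline", re.compile(r"^\d+(?:\.\d+)+\.?\s+")),
    ("parenthesized", re.compile(r"^\(\d+\)\s+")),
    ("decimal", re.compile(r"^\d+\.\s+")),
    ("roman", re.compile(r"^[ivx]+\.\s+")),
    ("lettered", re.compile(r"^[a-z]\.\s+")),
]


def classify_marker(text: str) -> Optional[Tuple[str, str]]:
    """Return (marker kind, remaining text), or None for a plain paragraph."""
    stripped = text.lstrip()
    for kind, pattern in _PATTERNS:
        match = pattern.match(stripped)
        if match:
            return kind, stripped[match.end():]
    return None


def indent_level(left_indent_pt: float, step_pt: float = 18.0, max_level: int = 4) -> int:
    """Map a left indent in points to a zero-based level (five levels: 0 to 4)."""
    return min(max_level, max(0, round(left_indent_pt / step_pt)))


bullet = classify_marker("• First item")
outline = classify_marker("1.1 Scope")
```

Trying the outline pattern before the decimal one keeps `1.1 Scope` from being read as item `1.` followed by stray text.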
Requirements
- Python 3.8+
- `pdf2docx >= 0.5.0`, `PyMuPDF >= 1.23.0`, `python-docx >= 0.8.11`, `lxml`
- Tesseract (optional, for OCR)
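Since Tesseract is optional, a script may want to check for it before asking for OCR. The sketch below looks for an executable on `PATH`, which is only one possible check and an assumption here: the package and PyMuPDF may locate Tesseract or its language data differently. `tesseract_available` and `effective_ocr_setting` are hypothetical helpers.

```python
import shutil


def tesseract_available(executable: str = "tesseract") -> bool:
    """Return True if the named executable can be found on PATH."""
    return shutil.which(executable) is not None


def effective_ocr_setting(requested: bool, executable: str = "tesseract") -> bool:
    """Turn OCR off, with a notice, when it was requested but Tesseract is missing."""
    if requested and not tesseract_available(executable):
        print("Tesseract not found on PATH; continuing without OCR.")
        return False
    return requested


# The resulting flag could then be passed on, for example as ocr_enabled.
use_ocr = effective_ocr_setting(True)
```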
Project details
File details
Details for the file pdf2docx_healer-0.1.4.tar.gz.
File metadata
- Download URL: pdf2docx_healer-0.1.4.tar.gz
- Upload date:
- Size: 23.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5791f90f572b3ee6ffd0e536c4de00e0783eb6013216586e7080cab962741ff1 |
| MD5 | 466a28146007359789f80175bd942c37 |
| BLAKE2b-256 | 9c96ea201b77f938dd375a99ebab380c4e236c85b114c36e90f0cc8cd873d61a |
Provenance
The following attestation bundles were made for pdf2docx_healer-0.1.4.tar.gz:
Publisher: publish.yml on krockxz/pdf2docx-healer
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pdf2docx_healer-0.1.4.tar.gz
- Subject digest: 5791f90f572b3ee6ffd0e536c4de00e0783eb6013216586e7080cab962741ff1
- Sigstore transparency entry: 1855039189
- Permalink: krockxz/pdf2docx-healer@ac1a6e4d64ec5a8d70b47068a92eddf485f43a5d
- Branch / Tag: refs/tags/v0.1.4
- Owner: https://github.com/krockxz
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ac1a6e4d64ec5a8d70b47068a92eddf485f43a5d
- Trigger Event: push
File details
Details for the file pdf2docx_healer-0.1.4-py3-none-any.whl.
File metadata
- Download URL: pdf2docx_healer-0.1.4-py3-none-any.whl
- Upload date:
- Size: 24.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b8ad89baaf9d0e87b7042e1498944d5287a2fdaa540f236bf97b9fafbd098e84 |
| MD5 | 57623709d719bdec62497cdf914abb75 |
| BLAKE2b-256 | 569c5570b93283a43b8ec63ab8e5c638e2b8e6e6b09d9f545f3d417590decae8 |
Provenance
The following attestation bundles were made for pdf2docx_healer-0.1.4-py3-none-any.whl:
Publisher: publish.yml on krockxz/pdf2docx-healer
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pdf2docx_healer-0.1.4-py3-none-any.whl
- Subject digest: b8ad89baaf9d0e87b7042e1498944d5287a2fdaa540f236bf97b9fafbd098e84
- Sigstore transparency entry: 1855039208
- Permalink: krockxz/pdf2docx-healer@ac1a6e4d64ec5a8d70b47068a92eddf485f43a5d
- Branch / Tag: refs/tags/v0.1.4
- Owner: https://github.com/krockxz
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ac1a6e4d64ec5a8d70b47068a92eddf485f43a5d
- Trigger Event: push