Formatting-preserving PDF-to-DOCX converter that fixes bullet lists, hyperlinks, CJK fonts, and scanned PDFs
pdf2docx-healer
A drop-in replacement for pdf2docx that actually preserves your formatting.
pdf2docx is a capable PDF-to-DOCX converter, but it has a frustrating habit of dropping bullet lists, losing hyperlinks, mangling CJK fonts, and choking on scanned PDFs. pdf2docx-healer wraps pdf2docx and repairs these issues: it runs OCR before conversion when needed, patches list detection during conversion, and fixes lists, links, and fonts in a post-processing pass, so your Word documents come out looking the way they should.
Why this exists
| Problem | pdf2docx alone | With pdf2docx-healer |
|---|---|---|
| Bullet lists (•, -, *) | Often flattened to plain text, no Word list style | Proper List Bullet style with real Word numbering |
| Numbered lists (1., a., i.) | Lost or merged into one paragraph | List Number style; lettered/roman via OOXML injection |
| Nested lists (3+ levels) | Indentation lost | Level detected from indent, applied to Word |
| Hyperlinks | URL text is plain, not clickable | Wrapped in real <w:hyperlink> elements with blue/underline |
| CJK fonts (Chinese/Japanese/Korean) | Font names like SimSun may not resolve | Fallback chain maps to system-available CJK fonts |
| Scanned PDFs (image-only) | "Words count: 0" warning, empty output | OCR via Tesseract, then normal conversion |
| Section headers styled as lists | Headers like "4. Numbered List" get list style | Detected as headers, kept as Normal paragraphs |
Install
pip install pdf2docx-healer
For OCR support on scanned PDFs, also install Tesseract and the optional extra:
pip install "pdf2docx-healer[ocr]"
Quick start
Python API
from docx_healer import heal
# Simplest usage — output goes to "report.docx"
heal("report.pdf", "report.docx")
from docx_healer import heal, HealerConfig
# Full control via config
config = HealerConfig(
    ocr_enabled=True,        # OCR for scanned/image PDFs
    ocr_lang="eng",          # Tesseract language code
    ocr_dpi=300,             # OCR resolution
    ocr_threshold=0.3,       # Fraction of textless pages to trigger OCR
    fix_lists=True,          # Detect & style bullet/numbered lists
    fix_hyperlinks=True,     # Wrap URL text in clickable hyperlinks
    fix_fonts=True,          # Map CJK/unavailable fonts to system fonts
    aggressive_lists=False,  # More aggressive paragraph splitting
    verbose=True,            # Print progress
)
heal("scanned_report.pdf", "output.docx", config=config)
Command line
# Basic conversion
pdf2docx-heal input.pdf -o output.docx
# Scanned PDF with OCR
pdf2docx-heal input.pdf --ocr --ocr-lang eng
# Quiet mode (no progress output)
pdf2docx-heal input.pdf -q
# Skip specific fixes
pdf2docx-heal input.pdf --no-lists --no-hyperlinks
Run pdf2docx-heal --help to see all options.
What it fixes
Bullet lists
Detects Unicode bullets (•, ◦, ▪, –, etc.) and ASCII bullets (-, *, +) and applies Word's built-in List Bullet style. Nested bullets (up to 5 levels) are detected from indentation and mapped to the right list level.
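The detection described above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: the function name `detect_bullet`, the 18 pt indent step, and the exact bullet set are assumptions made for the example.

```python
import re

# Bullet characters listed in the text above; the package's real set may differ.
BULLET_CHARS = "•◦▪–-*+"
BULLET_RE = re.compile(
    r"^\s*[" + re.escape(BULLET_CHARS) + r"]\s+(?P<text>\S.*)$"
)


def detect_bullet(line, indent_pt=0.0, step_pt=18.0, max_levels=5):
    """Return (level, text) for a bullet line, or None for ordinary text.

    indent_pt is the paragraph's left indent in points; step_pt (one list
    level per 18 pt) is a hypothetical value chosen for illustration.
    """
    match = BULLET_RE.match(line)
    if match is None:
        return None
    level = int(round(max(indent_pt, 0.0) / step_pt))
    # Clamp to the deepest supported level (levels are zero-based).
    return min(level, max_levels - 1), match.group("text")
```

Requiring whitespace after the bullet character keeps strings such as "-5 degrees" from being treated as list items.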
Numbered lists
Detects decimal (1.), parenthesized ((1)), lettered (a., b.), roman (i., ii.), and outline (1.1, 1.2) numbering. Decimal and parenthesized use Word's List Number style. Lettered and roman formats use OOXML numbering injection with the correct numFmt (lowerLetter, lowerRoman) since Word's built-in styles only support decimal.
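As a rough sketch of how the numbering formats above might be told apart, the snippet below classifies a paragraph's leading marker with regular expressions. The name `classify_marker`, the pattern order, and the restriction of roman numerals to i through x are assumptions for illustration; single letters such as "v." are genuinely ambiguous, and a real implementation would likely need context from neighbouring items.

```python
import re

# Checked in order; the first pattern that matches wins.
MARKER_PATTERNS = [
    ("outline", re.compile(r"^(\d+(?:\.\d+)+)\.?\s+")),          # 1.1, 1.2
    ("decimal", re.compile(r"^(\d+)[.)]\s+")),                    # 1.
    ("parenthesized", re.compile(r"^\((\d+)\)\s+")),              # (1)
    ("lowerRoman", re.compile(r"^(i{1,3}|iv|vi{0,3}|ix|x)[.)]\s+")),  # i., ii.
    ("lowerLetter", re.compile(r"^([a-z])[.)]\s+")),              # a., b.
]

# Formats that the text above says are handled by a built-in Word style.
BUILTIN_STYLE = {"decimal": "List Number", "parenthesized": "List Number"}


def classify_marker(text):
    """Return (kind, marker) for a numbered paragraph, or None."""
    for kind, pattern in MARKER_PATTERNS:
        match = pattern.match(text)
        if match:
            return kind, match.group(1)
    return None
```

In this sketch, kinds missing from `BUILTIN_STYLE` are the ones that would need a custom numbering definition with the matching numFmt value.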
Hyperlinks
Scans all paragraph runs for URL patterns (http://, https://, www., mailto:, ftp://) and wraps them in proper OOXML <w:hyperlink> elements with external relationship targets. Multiple URLs in a single run are all converted. Hyperlink text gets blue color and underline styling.
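The scanning step could look something like the sketch below, which splits a run's text into plain and URL segments so that each URL can later be wrapped on its own. The regex, the trailing-punctuation trimming, and the `http://` prefix for bare `www.` addresses are assumptions made for this example, not confirmed behaviour of the package.

```python
import re

URL_RE = re.compile(r"(?:https?://|ftp://|mailto:|www\.)[^\s<>\"']+", re.IGNORECASE)
TRAILING_PUNCTUATION = ".,;:!?)]}"


def split_urls(text):
    """Split text into (segment, target) pairs; target is None for plain text."""
    segments = []
    pos = 0
    for match in URL_RE.finditer(text):
        # Sentence punctuation right after a URL is rarely part of it.
        url = match.group(0).rstrip(TRAILING_PUNCTUATION)
        if URL_RE.fullmatch(url) is None:
            continue  # nothing usable left after trimming
        start = match.start()
        if start > pos:
            segments.append((text[pos:start], None))
        target = "http://" + url if url.lower().startswith("www.") else url
        segments.append((url, target))
        pos = start + len(url)
    if pos < len(text):
        segments.append((text[pos:], None))
    return segments
```

Each pair with a non-None target corresponds to one hyperlink element; the plain segments stay as ordinary runs.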
CJK font fallback
Maps PDF-embedded font names (like SimSun, MS-Mincho, HYGoThic-Medium) to system-available equivalents across Windows, macOS, and Linux. Falls back through a chain: e.g. SimSun → 宋体 → Microsoft YaHei → 微軟雅黑 → Arial Unicode MS → Noto Sans CJK SC. Character-range detection also maps unknown fonts based on the script being rendered (CJK, Arabic, Hebrew, Thai, Devanagari, Cyrillic).
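A fallback chain of this kind can be sketched in a few lines. The SimSun chain below is copied from the example above; the function names, the idea of passing in a set of installed fonts, and the specific Unicode block boundaries are assumptions for illustration.

```python
# Chain taken from the example in the text; other chains would be added similarly.
FALLBACK_CHAINS = {
    "SimSun": ["SimSun", "宋体", "Microsoft YaHei", "微軟雅黑",
               "Arial Unicode MS", "Noto Sans CJK SC"],
}

# (first code point, last code point, script) for a few Unicode blocks.
SCRIPT_RANGES = [
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0900, 0x097F, "Devanagari"),
    (0x0E00, 0x0E7F, "Thai"),
    (0x3040, 0x30FF, "CJK"),  # Hiragana and Katakana
    (0x4E00, 0x9FFF, "CJK"),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF, "CJK"),  # Hangul syllables
]


def resolve_font(name, available):
    """Return the first font in the chain that is installed, else the name."""
    for candidate in FALLBACK_CHAINS.get(name, [name]):
        if candidate in available:
            return candidate
    return name


def detect_script(text):
    """Return the script of the first non-Latin character found, or None."""
    for ch in text:
        code_point = ord(ch)
        for low, high, script in SCRIPT_RANGES:
            if low <= code_point <= high:
                return script
    return None
```

For an unknown font name, the detected script could then be used to pick a default chain for that script.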
Scanned PDF OCR
Detects image-only PDFs (no text layer) and runs them through PyMuPDF's OCR pipeline (requires Tesseract). If Tesseract isn't installed, it falls back gracefully instead of crashing. The OCR'd PDF is then converted normally.
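The scanned-PDF check can be pictured as a simple ratio test against ocr_threshold. The sketch below works on a list of per-page text strings (with PyMuPDF these could come from each page's get_text() call); the function name `needs_ocr` and the use of >= rather than > at the boundary are assumptions.

```python
def needs_ocr(page_texts, threshold=0.3):
    """Return True when the fraction of textless pages reaches the threshold.

    page_texts holds the extracted text of each page; a page counts as
    textless when it contains nothing but whitespace.
    """
    if not page_texts:
        return False
    textless = sum(1 for text in page_texts if not text.strip())
    return textless / len(page_texts) >= threshold
```

With the default of 0.3, a four-page document with one image-only page (0.25) would be converted directly, while two image-only pages (0.5) would trigger OCR.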
Smart header detection
Section headers that look like list items (e.g. "4. Numbered List") are detected via sequential-reset analysis and title-case heuristics, and kept as Normal paragraphs instead of being styled as list items.
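One plausible reading of these heuristics is sketched below: a numbered paragraph is treated as a header when its number does not continue a run with either neighbour and its text is short and title-cased. The function names, the small-word list, and the eight-word limit are hypothetical; the package's actual rules may differ.

```python
import re

NUMBERED_RE = re.compile(r"^(\d+)\.\s+(.+)$")
SMALL_WORDS = {"a", "an", "and", "for", "in", "of", "on", "or", "the", "to"}


def is_title_case(text, max_words=8):
    """True for short text whose words are capitalised (small words excepted)."""
    words = text.split()
    if not words or len(words) > max_words:
        return False
    return all(w[0].isupper() or w.lower() in SMALL_WORDS for w in words)


def find_headers(paragraphs):
    """Return indexes of numbered paragraphs that look like section headers."""
    parsed = [NUMBERED_RE.match(p) for p in paragraphs]

    def number_at(i):
        if 0 <= i < len(parsed) and parsed[i] is not None:
            return int(parsed[i].group(1))
        return None

    headers = []
    for i, match in enumerate(parsed):
        if match is None:
            continue
        n = int(match.group(1))
        # A list item continues the sequence with an adjacent paragraph.
        in_run = number_at(i - 1) == n - 1 or number_at(i + 1) == n + 1
        if not in_run and is_title_case(match.group(2)):
            headers.append(i)
    return headers
```

For example, in a document where "4. Numbered List" is followed by intro text and then items "1." and "2.", only the "4." paragraph is flagged, because the items form a consecutive run and the header does not.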
How it works
pdf2docx-healer runs a 4-phase pipeline:
┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌─────────────────┐
│ 1. Pre-parse │ -> │ 2. Intercept │ -> │ 3. Convert   │ -> │ 4. Post-process │
│ (OCR if      │    │ (monkey-     │    │ (pdf2docx    │    │ (lists, links,  │
│ needed)      │    │ patch)       │    │ core)        │    │ fonts)          │
└──────────────┘    └──────────────┘    └──────────────┘    └─────────────────┘
- Pre-parse: If OCR is enabled, detects whether the PDF is scanned and runs Tesseract OCR to add a text layer.
- Intercept: Monkey-patches pdf2docx's dead is_list_item() function with real bullet/number detection, and forces list_not_table=True so list blocks aren't parsed as tables.
- Convert: Runs pdf2docx with the patched internals.
- Post-process: Opens the output DOCX with python-docx and fixes lists (splitting, styling, numbering XML), hyperlinks (OOXML injection), and fonts (fallback mapping).
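The Intercept phase relies on monkey-patching, which can be illustrated without touching pdf2docx itself. The sketch below swaps a function on a stand-in object for the duration of a block and restores it afterwards. The `engine` object, the `patched` helper, and the replacement function are all hypothetical; where is_list_item() actually lives inside pdf2docx, and whether the package restores the original, are not stated here.

```python
import contextlib
import types


@contextlib.contextmanager
def patched(target, attr, replacement):
    """Temporarily replace target.attr, restoring the original on exit."""
    original = getattr(target, attr)
    setattr(target, attr, replacement)
    try:
        yield original
    finally:
        setattr(target, attr, original)


# Stand-in for a library module whose list check never fires.
engine = types.SimpleNamespace(is_list_item=lambda text: False)


def real_is_list_item(text):
    """Hypothetical replacement that recognises a few bullet prefixes."""
    return text.lstrip().startswith(("•", "- ", "* "))


with patched(engine, "is_list_item", real_is_list_item):
    inside = engine.is_list_item("• item")  # patched behaviour
after = engine.is_list_item("• item")       # original behaviour restored
```

Restoring the original in a finally block keeps the patch from leaking into other code that uses the same module in the same process.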
Configuration reference
HealerConfig fields
| Field | Type | Default | Description |
|---|---|---|---|
| ocr_enabled | bool | False | Enable OCR for scanned PDFs |
| ocr_lang | str | "eng" | Tesseract language code |
| ocr_dpi | int | 300 | OCR resolution in DPI |
| ocr_threshold | float | 0.3 | Fraction of textless pages to trigger OCR |
| fix_lists | bool | True | Detect & style bullet/numbered lists |
| fix_hyperlinks | bool | True | Wrap URL text in clickable hyperlinks |
| fix_fonts | bool | True | Map CJK/unavailable fonts to system fonts |
| aggressive_lists | bool | False | More aggressive paragraph splitting |
| verbose | bool | True | Print progress output |
CLI flags
| Flag | Description |
|---|---|
| pdf | Input PDF file path (positional) |
| -o, --output | Output DOCX path (default: input with .docx) |
| --ocr | Enable OCR for scanned PDFs |
| --ocr-lang | Tesseract language code (default: eng) |
| --ocr-dpi | OCR resolution in DPI (default: 300) |
| --ocr-threshold | Fraction of textless pages to trigger OCR (default: 0.3) |
| --no-lists | Skip list detection and formatting |
| --no-hyperlinks | Skip hyperlink extraction |
| --no-font-fix | Skip CJK font fallback mapping |
| --aggressive | Use aggressive paragraph splitting |
| -q, --quiet | Suppress progress output |
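To show how the flags line up with the HealerConfig fields, here is a rough argparse sketch. It mirrors the two tables above but is not the package's real CLI code: the helper names and the assumption that each `--no-*` flag simply negates the corresponding `fix_*` field are illustrative.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser with the flags listed in the table above."""
    parser = argparse.ArgumentParser(prog="pdf2docx-heal")
    parser.add_argument("pdf")
    parser.add_argument("-o", "--output")
    parser.add_argument("--ocr", action="store_true")
    parser.add_argument("--ocr-lang", default="eng")
    parser.add_argument("--ocr-dpi", type=int, default=300)
    parser.add_argument("--ocr-threshold", type=float, default=0.3)
    parser.add_argument("--no-lists", action="store_true")
    parser.add_argument("--no-hyperlinks", action="store_true")
    parser.add_argument("--no-font-fix", action="store_true")
    parser.add_argument("--aggressive", action="store_true")
    parser.add_argument("-q", "--quiet", action="store_true")
    return parser


def to_config_kwargs(args):
    """Translate parsed flags into HealerConfig-style keyword arguments."""
    return {
        "ocr_enabled": args.ocr,
        "ocr_lang": args.ocr_lang,
        "ocr_dpi": args.ocr_dpi,
        "ocr_threshold": args.ocr_threshold,
        "fix_lists": not args.no_lists,
        "fix_hyperlinks": not args.no_hyperlinks,
        "fix_fonts": not args.no_font_fix,
        "aggressive_lists": args.aggressive,
        "verbose": not args.quiet,
    }


def output_path(args):
    """Use -o when given, otherwise the input path with a .docx suffix."""
    return args.output or str(Path(args.pdf).with_suffix(".docx"))
```

Under these assumptions, `pdf2docx-heal input.pdf --ocr --no-lists -q` would correspond to ocr_enabled=True, fix_lists=False, verbose=False, with output written to input.docx.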
Requirements
- Python 3.8+
- pdf2docx >= 0.5.0
- PyMuPDF >= 1.23.0
- python-docx >= 0.8.11
- lxml
- Tesseract (optional, for OCR)
Limitations
- pdf2docx (the underlying engine) is no longer actively maintained. This package works around its bugs but can't fix fundamental parsing limitations.
- OCR requires Tesseract installed separately. Without it, scanned PDFs fall back to image-only output.
- Hyperlink text that pdf2docx drops during conversion (due to overlapping link annotations) cannot be recovered; only URL text that survives conversion gets wrapped.
- Some paragraph merge patterns by pdf2docx (without \n separators) may persist.
License
MIT — see LICENSE.
Contributing
Issues and pull requests welcome at github.com/krockxz/pdf2docx-healer.