Convert PDF, DOCX, CSV, and image files to Markdown.
documint2md - Convert PDF, DOCX and CSV to Markdown
documint2md is a small Python CLI and library (package doc2md) that turns PDF, DOCX, CSV, and image files into consistent, deterministic Markdown. It is built for documentation flows where the same source should always produce the same Markdown output, even when run on different machines or in CI.
Highlights
- Text-first conversions for PDF (pdfminer.six), DOCX (Mammoth → BeautifulSoup → markdownify), and CSV (pandas + Markdown table) cover the formats you care about.
- OCR support for images and scanned PDFs (opt-in for PDFs).
- Small CLI plus a library API that can drop right into scripts, CI, or exploratory sessions.
- Deterministic normalization (newline, whitespace, blank lines) and CLI contracts that keep automation predictable.
- Interactive terminal UI with a short / command list plus /more for advanced tools and OCR/session controls.
Quick start
- Create a virtualenv, activate it, and install reproducible dependencies (Python 3.11+):
Set-Location 'C:\path\to\documint2md'
py -m venv .venv
& .\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install --require-hashes -r requirements.txt
- Convert a few sample files to confirm that everything works:
doc2md .\tests\fixtures\in\sample.docx
python -m doc2md.cli .\tests\fixtures\in\sample.pdf
python -m doc2md.cli .\tests\fixtures\in\sample.csv
python -m doc2md.cli .\tests\fixtures\in\sample.png
- Drop into interactive mode (no inputs) to explore /files, /format, and /output.
Reproducible installs (Windows)
- Core runtime:
python -m pip install --require-hashes -r requirements.txt
- Full feature set (PDF engines + OCR):
python -m pip install --require-hashes -r requirements-all.txt
- Dev/test dependencies:
python -m pip install --require-hashes -r requirements-dev.txt
- Regenerate lock files when dependencies change:
.\scripts\lock_requirements.ps1
Installation
From TestPyPI (for testing)
py -m pip install --upgrade pip
py -m pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ --pre documint2md
doc2md --help
From PyPI (production)
py -m pip install --upgrade pip
py -m pip install documint2md
doc2md --help
Optional extras (PDF engines + OCR) when installing from PyPI:
py -m pip install "documint2md[all]"
CLI usage
Run doc2md <file> (or python -m doc2md.cli <file>) to convert a single input. By default the Markdown lands in docs_out/<input filename>.md. Use -o <file> to force a path and -o - to stream to stdout. Omit inputs to open the interactive picker, or pass --interactive for the picker even inside scripts.
python -m doc2md.cli file.docx -o file.md
python -m doc2md.cli file.pdf
python -m doc2md.cli table.csv
python -m doc2md.cli scan.png
doc2md # interactive mode
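The default destination rule can be sketched in a few lines of Python. This is only an illustration, not the package's code: `default_output_path` is a hypothetical helper, and it assumes that `<input filename>` keeps the original extension (so `sample.pdf` becomes `sample.pdf.md`), which matches the smoke-test commands in the Testing section.

```python
from pathlib import Path


def default_output_path(input_path, out_dir="docs_out"):
    """Return the path where the Markdown would land by default.

    Hypothetical helper. Assumption: the input's full file name,
    extension included, is kept and ".md" is appended to it.
    """
    return Path(out_dir) / (Path(input_path).name + ".md")


if __name__ == "__main__":
    print(default_output_path("tests/fixtures/in/sample.pdf"))
```

Scripts that need a different destination can skip this logic and pass -o <file> explicitly.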
CLI contract
- Default output is docs_out/<input filename>.md; -o <file> overrides the destination, -o - writes to stdout.
- Interactive mode (no input) opens a curses-like UI tied to docs_in; /files loads the list and /more exposes advanced commands (history, profiles, UI, session toggles).
- Errors and diagnostics stream to stderr.
- Exit codes: 2 usage/argument error, 3 unsupported format, 4 conversion failure, 5 output write failure.
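Because the contract separates stdout, stderr, and exit codes, a script can wrap the CLI with the standard library alone. The following is a minimal sketch under stated assumptions: `describe_exit` and `convert_to_stdout` are hypothetical helper names, and treating exit code 0 as success is an assumption, since the contract only lists the failure codes.

```python
import subprocess
import sys

# Failure codes as listed in the CLI contract.
# Assumption: 0 means success (not stated explicitly above).
EXIT_MEANINGS = {
    0: "success",
    2: "usage/argument error",
    3: "unsupported format",
    4: "conversion failure",
    5: "output write failure",
}


def describe_exit(code: int) -> str:
    """Translate a doc2md exit code into a readable label."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def convert_to_stdout(path: str) -> str:
    """Run the CLI with `-o -` and return the Markdown it prints."""
    result = subprocess.run(
        [sys.executable, "-m", "doc2md.cli", path, "-o", "-"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Diagnostics arrive on stderr, so include them in the error.
        detail = result.stderr.strip()
        raise RuntimeError(f"{describe_exit(result.returncode)}: {detail}")
    return result.stdout
```

A CI job could call `convert_to_stdout("file.pdf")` and fail the build with a readable message when the conversion does not succeed.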
CLI options
- --format pdf|docx|csv|image forces the parser instead of inferring it from the extension.
- --engine pdfminer|pdftext|marker selects the PDF engine (default pdfminer; marker stays text-only unless assets are enabled explicitly).
- --ocr or --ocr-mode auto enables OCR fallback for PDFs when text extraction is empty.
- --ocr-mode never|auto|always controls OCR behavior for PDFs (default never).
- --ocr-lang es sets the OCR language (default es).
- --ocr-device cpu|gpu:0 overrides OCR device selection.
- --ocr-render-scale 2.0 controls the PDF render scale for OCR.
- --ocr-min-score 0.5 filters low-confidence OCR text.
- --csv-na "" controls how empty values render.
- --csv-float-format "%.6g" stabilizes floating-point output when needed.
- --profile <name> loads defaults from doc2md.toml.
- --stats, --profile-report, --quiet, --debug, --version, --theme, --interactive, --no-input toggle output, logging, and interactivity.
OCR setup (optional)
Recommended (CPU + GPU side-by-side):
.\scripts\setup_ocr_envs.ps1
See docs/OCR Dual Environment Setup.md for GPU verification, fallback index, and usage.
Quick run (GPU):
.\scripts\doc2md-gpu.cmd docs_in\ocr_samples\sample_text.png --ocr-lang en --ocr-device gpu:0 --yes -o docs_out\sample_text.gpu.md
Quick run (CPU):
.\scripts\doc2md-cpu.cmd docs_in\ocr_samples\sample_text.png --ocr-lang en --ocr-device cpu --yes -o docs_out\sample_text.cpu.md
CPU:
python -m pip install paddlepaddle==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
python -m pip install paddleocr==3.4.0
GPU (Windows; choose one CUDA index):
python -m pip install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python -m pip install paddleocr==3.4.0
If model downloads fail, set:
$env:PADDLE_PDX_MODEL_SOURCE = "BOS"
$env:PADDLE_PDX_DISABLE_MODEL_SOURCE_CHECK = "True"
Performance tips:
- Batch multiple files in one command to reuse OCR initialization.
- For scanned PDFs, use --ocr-render-scale 1.0 to trade accuracy for speed.
- Prefer --ocr-mode auto for PDFs so OCR runs only on textless pages.
- The first OCR run is slow due to model downloads; subsequent runs are faster.
Interactive mode
When you run doc2md without inputs, the CLI opens a full-screen picker. Interact with /files (space to select, enter to convert), type / to see the short command list, and use /more for advanced tools (history, profiles, UI theme, session toggles). OCR is configured via /ocr subcommands (e.g. /ocr mode auto, /ocr lang es). The footer keeps the current format/engine/output in view while the header shows version + cwd. Use Ctrl+P/Ctrl+N for command history.
Library API
- doc2md.pdf_to_markdown(path) – extracts text-only Markdown from PDFs (OCR optional via ocr_mode).
- doc2md.docx_to_markdown(path) – converts DOCX → Mammoth HTML → Markdown via markdownify with deterministic heading/list settings.
- doc2md.csv_to_markdown(path) – parses CSV files with pandas and emits clean Markdown tables.
- doc2md.image_to_markdown(path) – runs OCR on image files and returns Markdown text.
- Input types: str | PathLike; return type: str.
- Exceptions: ConversionError for failures, UnsupportedFormatError for unsupported formats/engines.
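The four converters listed above could be wired into a small dispatcher that picks one by file extension. This is a sketch, not part of the package: `CONVERTERS`, `converter_name`, and `convert` are hypothetical names, and mapping only `.png` to the image converter is an assumption, since the full list of supported image extensions is not given here.

```python
from pathlib import Path

# Extension -> name of the doc2md function documented above.
# Assumption: only ".png" is mapped for images, because that is the
# one image extension shown in this README.
CONVERTERS = {
    ".pdf": "pdf_to_markdown",
    ".docx": "docx_to_markdown",
    ".csv": "csv_to_markdown",
    ".png": "image_to_markdown",
}


def converter_name(path) -> str:
    """Pick the converter function name from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in CONVERTERS:
        raise ValueError(f"no converter mapped for {suffix!r}")
    return CONVERTERS[suffix]


def convert(path) -> str:
    """Convert one file and return its Markdown as a string."""
    import doc2md  # imported lazily so the mapping works without the package

    return getattr(doc2md, converter_name(path))(path)
```

With the package installed, `convert("file.docx")` returns a `str`. The converters may raise ConversionError or UnsupportedFormatError as described above; where those classes are importable from is not stated here, so the sketch leaves exception handling to the caller.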
Normalization rules
- Normalize newlines to \n.
- Strip trailing whitespace per line.
- Cap consecutive blank lines at two.
- Remove trailing blank lines and end every non-empty output with a single newline.
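The four rules above can be expressed as one short function. This is a minimal sketch of the rules as written, not the package's own implementation: `normalize_markdown` is a hypothetical name, and treating tabs as trailing whitespace and leaving leading blank lines untouched are assumptions.

```python
def normalize_markdown(text: str) -> str:
    """Apply the four normalization rules to a Markdown string."""
    # Rule 1: normalize Windows and old Mac newlines to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Rule 2: strip trailing whitespace on every line
    # (assumption: tabs count as whitespace too).
    lines = [line.rstrip() for line in text.split("\n")]

    # Rule 3: keep at most two consecutive blank lines.
    kept = []
    blank_run = 0
    for line in lines:
        if line == "":
            blank_run += 1
            if blank_run > 2:
                continue
        else:
            blank_run = 0
        kept.append(line)

    # Rule 4: drop trailing blank lines, then end non-empty output
    # with exactly one newline.
    while kept and kept[-1] == "":
        kept.pop()
    if not kept:
        return ""
    return "\n".join(kept) + "\n"
```

The function is idempotent, so running it twice gives the same result as running it once, which is the property that keeps output identical across machines and CI runs.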
Testing & fixtures
python -m pip install --require-hashes -r requirements-dev.txt
python -m pytest
python -m compileall .
python -m doc2md.cli .\tests\fixtures\in\sample.docx -o .\docs_out\sample.docx.md
python -m doc2md.cli .\tests\fixtures\in\sample.pdf -o .\docs_out\sample.pdf.md
python -m doc2md.cli .\tests\fixtures\in\sample.csv -o .\docs_out\sample.csv.md
Edge-case fixtures live in tests/fixtures/in with golden Markdown in tests/fixtures/out. Use docs_in as your local drop zone.
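When a conversion drifts from its golden Markdown, a unified diff makes the change easy to review. The helper below is a hypothetical sketch built on the standard library's `difflib`; how the project's own tests compare against tests/fixtures/out is not described here, so treat this as one possible approach.

```python
import difflib


def golden_diff(actual: str, golden: str, name: str = "output") -> str:
    """Return a unified diff from golden to actual, or "" if they match."""
    diff = difflib.unified_diff(
        golden.splitlines(keepends=True),
        actual.splitlines(keepends=True),
        fromfile=f"golden/{name}",
        tofile=f"actual/{name}",
    )
    return "".join(diff)
```

In a test, `assert golden_diff(actual, golden) == ""` fails with the diff text available for the assertion message, which helps when explaining fixture changes in a PR.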
Publishing
Releases are tag-driven via GitHub Actions + Trusted Publishing.
- TestPyPI: push a tag like v1.0.1rc1 to trigger release-testpypi.yml.
- PyPI: push a tag like v1.0.1 to trigger release-pypi.yml.
Release checklist
- Update pyproject.toml version.
- Regenerate requirements.txt, requirements-all.txt, and requirements-dev.txt.
- Run tests and CLI smoke conversions.
- Build and check distributions before upload.
Contributing
- Work on dev, open a PR to main, and keep main release-ready. Tags on main trigger publishing workflows.
- Drop samples into docs_in and run the CLI to confirm conversions. Read .github/copilot-instructions.md for repo-specific guidance, keep diffs small, and explain fixture changes when extraction output shifts.
Notes
- The interactive UI pauses ~2 seconds after success so the confirmation stays on screen unless you pass --quiet.
- History helpers: doc2md history, search, rerun, jump, recent, explain, and ui.
- The CLI exposes both quick (/files, /format, /output) and advanced (/more) helpers to explore settings without re-running the command.