Convert PDF, DOCX, CSV, and image files to Markdown.

documint2md - Convert PDF, DOCX, CSV, and Images to Markdown

documint2md is a small Python CLI and library (package doc2md) that turns PDF, DOCX, CSV, and image files into consistent, deterministic Markdown. It is built for documentation flows where the same source should always produce the same Markdown output, even when run on different machines or in CI.

Highlights

Text-first conversions for PDF (pdfminer.six), DOCX (Mammoth → BeautifulSoup → markdownify), and CSV (pandas + Markdown table), so you control the output for the formats you care about.
OCR support for images and scanned PDFs (opt-in for PDFs), including HEIC/HEIF via an optional decoder.
Small CLI plus a library API that can drop right into scripts, CI, or exploratory sessions.
Deterministic normalization (newline, whitespace, blank lines) and CLI contracts that keep automation predictable.
Interactive terminal UI: type / for a short command list, or /more for advanced tools and OCR/session controls.

Quick start

WSL quick start

Use the native WSL checkout for development:

/home/marco/dev/documint2md

From PowerShell:

wsl -d Ubuntu --cd /home/marco/dev/documint2md -- bash -lc "~/.local/bin/uv venv .venv --python 3.12 --seed"
wsl -d Ubuntu --cd /home/marco/dev/documint2md -- bash -lc ".venv/bin/python -m pip install --require-hashes -r requirements-dev.txt"
wsl -d Ubuntu --cd /home/marco/dev/documint2md -- bash -lc ".venv/bin/python -m pip install -e '.[markdown,ocr,universal-lite,pymupdf4llm]'"
wsl -d Ubuntu --cd /home/marco/dev/documint2md -- bash -lc "./scripts/verify_wsl_workflows.sh"

See docs/WSL_DEVELOPMENT.md for clean-environment checks, live OCR testing, and optional engine notes. Avoid using /mnt/c/dev/documint2md for active WSL development.

Windows quick start

Create a virtualenv, activate it, and install reproducible dependencies (Python 3.11+):

Set-Location 'C:\path\to\documint2md'
py -m venv .venv
& .\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install --require-hashes -r requirements.txt

Convert a few sample files to confirm that the install works:

doc2md .\tests\fixtures\in\sample.docx
python -m doc2md.cli .\tests\fixtures\in\sample.pdf
python -m doc2md.cli .\tests\fixtures\in\sample.csv
python -m doc2md.cli .\tests\fixtures\in\sample.png
python -m doc2md.cli .\docs_in\iphone_scan.heic

Drop into interactive mode (no inputs) to explore /files, /format, and /output.

Reproducible installs (Windows)

Core runtime:

python -m pip install --require-hashes -r requirements.txt

Full feature set (PDF engines + OCR):

python -m pip install --require-hashes -r requirements-all.txt

Dev/test dependencies:

python -m pip install --require-hashes -r requirements-dev.txt

Regenerate lock files when dependencies change:
```
.\scripts\lock_requirements.ps1
```

Installation

From TestPyPI (for testing)

py -m pip install --upgrade pip
py -m pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ --pre documint2md
doc2md --help

From PyPI (production)

py -m pip install --upgrade pip
py -m pip install documint2md
doc2md --help

Optional extras when installing from PyPI:

py -m pip install "documint2md[all]"
py -m pip install "documint2md[markdown]"
py -m pip install "documint2md[pymupdf4llm]"
py -m pip install "documint2md[universal-lite]"
py -m pip install "documint2md[docling]"
py -m pip install "documint2md[universal]"

universal-lite installs MarkItDown only. docling installs the heavier structured conversion stack. universal remains a backward-compatible alias that installs both.

CLI usage

Run doc2md <file> (or python -m doc2md.cli <file>) to convert a single input. By default the Markdown lands in docs_out/<input filename>.md. Use -o <file> to force a path and -o - to stream to stdout. Omit inputs to open the interactive picker, or pass --interactive for the picker even inside scripts.

python -m doc2md.cli file.docx -o file.md
python -m doc2md.cli file.pdf
python -m doc2md.cli table.csv
python -m doc2md.cli scan.png
doc2md  # interactive mode

CLI contract

Default output is docs_out/<input filename>.md; -o <file> overrides the destination, -o - writes to stdout.
Interactive mode (no input) opens a curses-like UI tied to docs_in; /files loads the list and /more exposes advanced commands (history, profiles, UI, session toggles).
Errors and diagnostics stream to stderr.
Exit codes: 2 usage/argument error, 3 unsupported format, 4 conversion failure, 5 output write failure.
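
The exit codes above lend themselves to a thin wrapper for scripts and CI. The following is a minimal sketch, not part of doc2md: the names EXIT_MEANINGS, describe_exit_code, and convert_to_stdout are hypothetical, and it assumes that exit code 0 means success (the contract lists only failure codes).

```python
import subprocess

# Failure codes as listed in the CLI contract; treating 0 as success is an assumption.
EXIT_MEANINGS = {
    0: "success (assumed)",
    2: "usage/argument error",
    3: "unsupported format",
    4: "conversion failure",
    5: "output write failure",
}


def describe_exit_code(code: int) -> str:
    """Map a doc2md exit code to a human-readable label."""
    return EXIT_MEANINGS.get(code, f"unknown exit code {code}")


def convert_to_stdout(path: str) -> str:
    """Run `doc2md <path> -o -` and return the Markdown written to stdout."""
    result = subprocess.run(
        ["doc2md", path, "-o", "-"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Per the contract, errors and diagnostics arrive on stderr.
        raise RuntimeError(
            f"{describe_exit_code(result.returncode)}: {result.stderr.strip()}"
        )
    return result.stdout
```

Keeping the code-to-label mapping separate from the subprocess call makes it easy to reuse in CI logs without invoking the CLI.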

CLI options

--format pdf|docx|csv|image|any forces the parser instead of inferring from the extension. Use any with a universal engine for formats such as PPTX, XLSX, HTML, JSON, XML, and EPUB.
--engine pdfminer|pdftext|marker|pymupdf4llm|docling|markitdown selects the conversion engine. pdfminer remains the default; docling and markitdown are universal opt-in engines.
--md-style normalize|gfm|none controls Markdown post-processing (default normalize). gfm requires the optional markdown extra.
--ocr or --ocr-mode auto enables OCR fallback for PDFs when text extraction is empty.
--ocr-mode never|auto|always controls OCR behavior for PDFs (default never).
--ocr-lang es sets OCR language (default es).
--ocr-device cpu|gpu:0 overrides OCR device selection.
--ocr-render-scale 2.0 controls PDF render scale for OCR.
--ocr-min-score 0.5 filters low-confidence OCR text.
--ocr-layout plain|blocks|heuristic controls OCR layout reconstruction. plain preserves OCR lines, blocks merges lines into paragraph blocks, and heuristic also promotes high-confidence headings and simple grid tables.
--ocr-debug-json <path> writes OCR geometry, row/block roles, confidence, and rendered Markdown to JSON for diagnosis.
--ocr-debug-image <path> writes a bbox overlay image for image inputs.
--csv-na "" controls how empty values render.
--csv-float-format "%.6g" stabilizes floating-point output when needed.
--profile <name> loads defaults from doc2md.toml.
--stats, --profile-report, --quiet, --debug, --version, --theme, --interactive, --no-input toggle output, logging, and interactivity.
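
To illustrate what --csv-na and --csv-float-format control, here is a small stdlib-only sketch. It is not doc2md's implementation (which uses pandas); render_cell and to_markdown_table are hypothetical names, and the table layout is an assumption chosen for readability.

```python
def render_cell(value, na="", float_format="%.6g"):
    """Render one cell: missing values use `na`, floats use `float_format`."""
    if value is None:
        return na
    if isinstance(value, float):
        # "%.6g" keeps six significant digits, hiding binary float noise.
        return float_format % value
    return str(value)


def to_markdown_table(header, rows, na="", float_format="%.6g"):
    """Build a simple pipe table from a header and a list of rows."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        cells = (render_cell(value, na, float_format) for value in row)
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines) + "\n"


print(to_markdown_table(["a", "b"], [[1, 0.1 + 0.2], [None, 2.5]]))
```

With the default format, 0.1 + 0.2 renders as 0.3 rather than 0.30000000000000004, which is the kind of platform-sensitive noise a fixed float format removes.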

OCR setup (optional)

Recommended (CPU + GPU side-by-side):

.\scripts\setup_ocr_envs.ps1

See docs/OCR Dual Environment Setup.md for GPU verification, fallback index, and usage.

Quick run (GPU):

.\scripts\doc2md-gpu.cmd docs_in\ocr_samples\sample_text.png --ocr-lang en --ocr-device gpu:0 --yes -o docs_out\sample_text.gpu.md

Quick run (CPU):

.\scripts\doc2md-cpu.cmd docs_in\ocr_samples\sample_text.png --ocr-lang en --ocr-device cpu --yes -o docs_out\sample_text.cpu.md

Project skill for folder-based image OCR:

image-folder-ocr-to-markdown defines the exact repo workflow for converting JPG and HEIC images in one folder into .md files in a chosen output folder.

CPU:

python -m pip install paddlepaddle==3.2.2
python -m pip install paddleocr==3.4.0

GPU (Windows; choose one CUDA index):

python -m pip install paddlepaddle-gpu==3.2.2 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python -m pip install paddleocr==3.4.0

If model downloads fail:

$env:PADDLE_PDX_MODEL_SOURCE = "BOS"
$env:PADDLE_PDX_DISABLE_MODEL_SOURCE_CHECK = "True"

Performance tips:

Batch multiple files in one command to reuse OCR initialization.
For scanned PDFs, use --ocr-render-scale 1.0 to trade accuracy for speed.
Prefer --ocr-mode auto for PDFs so OCR runs only on textless pages.
First OCR run is slow due to model downloads; subsequent runs are faster.
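
The three --ocr-mode values can be read as a simple per-page decision. The sketch below is an interpretation of the documented behavior, not the project's code: needs_ocr is a hypothetical helper, and treating whitespace-only text as "empty" is an assumption.

```python
def needs_ocr(ocr_mode: str, extracted_text: str) -> bool:
    """Decide whether a PDF page should be sent to OCR."""
    if ocr_mode == "never":
        return False
    if ocr_mode == "always":
        return True
    if ocr_mode == "auto":
        # Fall back to OCR only when text extraction produced nothing usable.
        return not extracted_text.strip()
    raise ValueError(f"unknown OCR mode: {ocr_mode!r}")


pages = ["Invoice 42\nTotal: 10.00", "   \n", ""]
print([needs_ocr("auto", text) for text in pages])
```

Under auto, only the second and third pages would be rendered and passed to OCR, which is why it is the cheaper choice for mixed documents.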

Layout tips:

Use --ocr-layout plain when exact OCR line order matters.
Use --ocr-layout blocks for readable prose without heading/table inference.
Use --ocr-layout heuristic --md-style gfm for best-effort formatted Markdown.
Use --ocr-debug-json and --ocr-debug-image when headings, tables, or reading order need inspection.
Keep OCR layout changes covered by tests/fixtures/ocr_layout_benchmarks.json; it provides deterministic receipt, table, prose, and low-confidence-noise cases without requiring live OCR model output.

Interactive mode

When you run doc2md without inputs, the CLI opens a full-screen picker. Interact with /files (space to select, enter to convert), type / to see the short command list, and use /more for advanced tools (history, profiles, UI theme, session toggles). OCR is configured via /ocr subcommands (e.g. /ocr mode auto, /ocr lang es). The footer keeps the current format/engine/output in view while the header shows version + cwd. Use Ctrl+P/Ctrl+N for command history.

Library API

doc2md.pdf_to_markdown(path) – extracts text-only Markdown from PDFs (OCR optional via ocr_mode).
doc2md.docx_to_markdown(path) – converts DOCX → Mammoth HTML → Markdown via markdownify with deterministic heading/list settings.
doc2md.csv_to_markdown(path) – parses CSV files with pandas and emits clean Markdown tables.
doc2md.image_to_markdown(path) – runs OCR on image files and returns Markdown text.
doc2md.any_to_markdown(path, engine) – uses an optional universal engine (docling or markitdown) for additional formats.
Input types: str | PathLike; return type: str.
Exceptions: ConversionError for failures, UnsupportedFormatError for unsupported formats/engines.
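
A script that handles mixed inputs can pick the library function from the file extension. This is a hedged sketch: CONVERTER_BY_SUFFIX, pick_converter_name, and convert are hypothetical helpers, and the suffix list (in particular .jpg and .jpeg) is an assumption rather than the package's documented set of supported extensions.

```python
from pathlib import Path

# Assumed mapping from file suffix to the doc2md function names listed above.
CONVERTER_BY_SUFFIX = {
    ".pdf": "pdf_to_markdown",
    ".docx": "docx_to_markdown",
    ".csv": "csv_to_markdown",
    ".png": "image_to_markdown",
    ".jpg": "image_to_markdown",
    ".jpeg": "image_to_markdown",
    ".heic": "image_to_markdown",
    ".heif": "image_to_markdown",
}


def pick_converter_name(path) -> str:
    """Return the name of the doc2md function to use for `path`."""
    suffix = Path(path).suffix.lower()
    try:
        return CONVERTER_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"no converter known for suffix {suffix!r}") from None


def convert(path) -> str:
    """Convert `path` to Markdown using the matching doc2md function."""
    import doc2md  # imported lazily so the lookup above works without the package

    return getattr(doc2md, pick_converter_name(path))(path)
```

Callers would still need to handle ConversionError and UnsupportedFormatError from the library itself; the ValueError here only covers suffixes missing from this sketch's mapping.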

Normalization rules

Normalize newlines to \n.
Strip trailing whitespace per line.
Cap consecutive blank lines at two.
Remove trailing blank lines and end every non-empty output with a single newline.
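
The four rules above can be sketched in a few lines of Python. This is an illustration of the stated contract, not the package's actual code; normalize_markdown is a hypothetical name, and counting whitespace-only lines as blank is an assumption that follows from stripping trailing whitespace first.

```python
def normalize_markdown(text: str) -> str:
    """Apply the four normalization rules to a Markdown string."""
    # Rule 1: normalize newlines to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    kept = []
    blank_run = 0
    for line in text.split("\n"):
        # Rule 2: strip trailing whitespace per line.
        line = line.rstrip()
        if line == "":
            blank_run += 1
            # Rule 3: cap consecutive blank lines at two.
            if blank_run > 2:
                continue
        else:
            blank_run = 0
        kept.append(line)

    # Rule 4: drop trailing blank lines, then end with a single newline.
    while kept and kept[-1] == "":
        kept.pop()
    return "\n".join(kept) + "\n" if kept else ""


print(repr(normalize_markdown("a  \r\nb\r\r\r\r\rc\n\n\n")))
```

Because every rule is a pure string transformation, the same input yields the same bytes on any platform, which is what makes golden-file comparisons practical.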

Markdown formatting

Every CLI conversion has a post-conversion Markdown style step:

normalize keeps the existing deterministic newline and trailing-whitespace contract.
gfm runs mdformat with the GFM plugin for consistent tables, task lists, strikethrough, and autolinks.
none leaves converter output untouched after the converter itself returns.

Version 2.0.0 keeps normalize as the default. This can change exact Markdown bytes compared with 1.x; use --md-style none for raw converter output in compatibility-sensitive workflows.

Examples:

python -m doc2md.cli .\tests\fixtures\in\sample.csv --preview --md-style normalize
python -m doc2md.cli .\tests\fixtures\in\sample.csv --preview --md-style gfm
python -m doc2md.cli .\docs_in\slides.pptx --format any --engine markitdown --md-style gfm

Testing & fixtures

python -m pip install --require-hashes -r requirements-dev.txt
python -m pytest
python -m compileall .
python -m doc2md.cli .\tests\fixtures\in\sample.docx -o .\docs_out\sample.docx.md
python -m doc2md.cli .\tests\fixtures\in\sample.pdf -o .\docs_out\sample.pdf.md
python -m doc2md.cli .\tests\fixtures\in\sample.csv -o .\docs_out\sample.csv.md

Live OCR tests are opt-in because they can download models and vary by runtime:

$env:DOC2MD_RUN_LIVE_OCR = "1"
$env:PADDLE_PDX_DISABLE_MODEL_SOURCE_CHECK = "True"
python -m pytest tests\test_ocr_integration.py
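
An opt-in gate like this is commonly implemented by skipping tests unless the environment variable is set. The sketch below shows one way to do that with the standard library; it is not taken from tests\test_ocr_integration.py, and live_ocr_enabled and LiveOcrSmokeTest are hypothetical names. Only the variable name and the value "1" come from the commands above.

```python
import os
import unittest


def live_ocr_enabled(environ=os.environ) -> bool:
    """Return True only when live OCR tests were explicitly requested."""
    return environ.get("DOC2MD_RUN_LIVE_OCR") == "1"


@unittest.skipUnless(
    live_ocr_enabled(), "set DOC2MD_RUN_LIVE_OCR=1 to run live OCR tests"
)
class LiveOcrSmokeTest(unittest.TestCase):
    def test_gate_is_open(self):
        # A real test would run OCR on a fixture image here.
        self.assertTrue(live_ocr_enabled())
```

pytest collects unittest.TestCase classes, so a gate written this way reports the tests as skipped, with the reason shown, when the variable is unset.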

Edge-case fixtures live in tests/fixtures/in with golden Markdown in tests/fixtures/out. Use docs_in as your local drop zone.

Publishing

Releases are tag-driven via GitHub Actions + Trusted Publishing.

TestPyPI: push a tag like v1.0.1rc1 to trigger release-testpypi.yml.
PyPI: push a tag like v1.0.1 to trigger release-pypi.yml.

Release checklist

Update pyproject.toml version.
Regenerate requirements.txt, requirements-all.txt, and requirements-dev.txt.
Run tests and CLI smoke conversions.
For WSL, run ./scripts/verify_wsl_workflows.sh.
Build and check distributions before upload.

Contributing

Start feature work from the latest dev on a short-lived branch such as codex/<topic>.
Open feature branch PRs into dev. Do not target main directly for feature work.
When dev is validated and release-ready, open a separate dev -> main PR. Tags on main trigger publishing workflows.
Repo-local PR workflow skill: .opencode/skills/documint-pr-workflow/SKILL.md.
Drop samples into docs_in and run the CLI to confirm conversions. Read .github/copilot-instructions.md for repo-specific guidance, keep diffs small, and explain fixture changes when extraction output shifts.

Notes

The interactive UI pauses ~2 seconds after success so the confirmation stays on screen unless you pass --quiet.
History helpers: doc2md history, search, rerun, jump, recent, explain, and ui.
The CLI exposes both quick (/files, /format, /output) and advanced (/more) helpers to explore settings without re-running the command.

Release history

2.0.0 (May 7, 2026)
1.0.1 (Feb 4, 2026)
1.0.0 (Feb 4, 2026)

Download files

Source Distribution

documint2md-2.0.0.tar.gz (68.3 kB view details)

Uploaded May 7, 2026 Source

Built Distribution

documint2md-2.0.0-py3-none-any.whl (50.7 kB view details)

Uploaded May 7, 2026 Python 3

File details

Details for the file documint2md-2.0.0.tar.gz.

File metadata

Download URL: documint2md-2.0.0.tar.gz
Upload date: May 7, 2026
Size: 68.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for documint2md-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`96b9d54b264252f3c4bde587ad0f7a48b576f601fa843b08f59a4ec6cf20efe8`
MD5	`33dea7c9ba9862f74fbafe1b81bbf86b`
BLAKE2b-256	`a1bff94a86eb6226403202fb65a108a3ae6b77f6b7bd7191990c44334285c382`
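
A downloaded archive can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch; sha256_of_file is a hypothetical helper, and the local file path is an assumption about where the download was saved.

```python
import hashlib

# SHA256 digest published above for documint2md-2.0.0.tar.gz.
EXPECTED_SHA256 = "96b9d54b264252f3c4bde587ad0f7a48b576f601fa843b08f59a4ec6cf20efe8"


def sha256_of_file(path, chunk_size=1 << 20) -> str:
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256) -> bool:
    """Compare the file's digest with the published one."""
    return sha256_of_file(path) == expected.lower()
```

For example, matches_expected("documint2md-2.0.0.tar.gz") should return True for an intact download saved under that name in the current directory.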

Provenance

The following attestation bundles were made for documint2md-2.0.0.tar.gz:

Publisher: release-pypi.yml on myucordero/documint2md

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: documint2md-2.0.0.tar.gz
- Subject digest: 96b9d54b264252f3c4bde587ad0f7a48b576f601fa843b08f59a4ec6cf20efe8
- Sigstore transparency entry: 1462176280
- Sigstore integration time: May 7, 2026
Source repository:
- Permalink: myucordero/documint2md@d101dca8086ba0e1e21394bfbfa782498d0e6bc0
- Branch / Tag: refs/tags/v2.0.0
- Owner: https://github.com/myucordero
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release-pypi.yml@d101dca8086ba0e1e21394bfbfa782498d0e6bc0
- Trigger Event: push

File details

Details for the file documint2md-2.0.0-py3-none-any.whl.

File metadata

Download URL: documint2md-2.0.0-py3-none-any.whl
Upload date: May 7, 2026
Size: 50.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for documint2md-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d07537167eb8c282c9f0faadedd4736ef0b62e15292b1b02770ff88c8930f877`
MD5	`4f822db2186a80e27595a603513f5386`
BLAKE2b-256	`9867ff6939ae6d76ddbc9038454df55c7d982b4b6736458c6374403f9f7d1f94`

Provenance

The following attestation bundles were made for documint2md-2.0.0-py3-none-any.whl:

Publisher: release-pypi.yml on myucordero/documint2md

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: documint2md-2.0.0-py3-none-any.whl
- Subject digest: d07537167eb8c282c9f0faadedd4736ef0b62e15292b1b02770ff88c8930f877
- Sigstore transparency entry: 1462176285
- Sigstore integration time: May 7, 2026
Source repository:
- Permalink: myucordero/documint2md@d101dca8086ba0e1e21394bfbfa782498d0e6bc0
- Branch / Tag: refs/tags/v2.0.0
- Owner: https://github.com/myucordero
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release-pypi.yml@d101dca8086ba0e1e21394bfbfa782498d0e6bc0
- Trigger Event: push
