Convert PDF, DOCX, CSV, and image files to Markdown.

documint2md - Convert PDF, DOCX and CSV to Markdown

documint2md is a small Python CLI and library (package doc2md) that turns PDF, DOCX, CSV, and image files into consistent, deterministic Markdown. It is built for documentation flows where the same source should always produce the same Markdown output, even when run on different machines or in CI.

Highlights

  • Text-first conversions for PDF (pdfminer.six), DOCX (Mammoth → BeautifulSoup → markdownify), and CSV (pandas + Markdown table) cover the formats you care about.
  • OCR support for images and scanned PDFs (opt-in for PDFs).
  • Small CLI plus a library API that can drop right into scripts, CI, or exploratory sessions.
  • Deterministic normalization (newline, whitespace, blank lines) and CLI contracts that keep automation predictable.
  • Interactive terminal UI with a short / command list plus /more for advanced tools and OCR/session controls.

Quick start

  1. Create and activate a virtualenv, then install the hash-pinned dependencies (Python 3.11+):
    Set-Location 'C:\path\to\documint2md'
    py -m venv .venv
    & .\.venv\Scripts\Activate.ps1
    python -m pip install --upgrade pip
    python -m pip install --require-hashes -r requirements.txt
    
  2. Convert a few sample files to confirm the install works:
    doc2md .\tests\fixtures\in\sample.docx
    python -m doc2md.cli .\tests\fixtures\in\sample.pdf
    python -m doc2md.cli .\tests\fixtures\in\sample.csv
    python -m doc2md.cli .\tests\fixtures\in\sample.png
    
  3. Drop into interactive mode (no inputs) to explore /files, /format, and /output.

Reproducible installs (Windows)

  • Core runtime:
    python -m pip install --require-hashes -r requirements.txt
    
  • Full feature set (PDF engines + OCR):
    python -m pip install --require-hashes -r requirements-all.txt
    
  • Dev/test dependencies:
    python -m pip install --require-hashes -r requirements-dev.txt
    
  • Regenerate lock files when dependencies change:
    .\scripts\lock_requirements.ps1
    

Installation

From TestPyPI (for testing)

py -m pip install --upgrade pip
py -m pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ --pre documint2md
doc2md --help

From PyPI (production)

py -m pip install --upgrade pip
py -m pip install documint2md
doc2md --help

Optional extras (PDF engines + OCR) when installing from PyPI:

py -m pip install "documint2md[all]"

CLI usage

Run doc2md <file> (or python -m doc2md.cli <file>) to convert a single input. By default the Markdown lands in docs_out/<input filename>.md. Use -o <file> to force a path and -o - to stream to stdout. Omit inputs to open the interactive picker, or pass --interactive for the picker even inside scripts.

python -m doc2md.cli file.docx -o file.md
python -m doc2md.cli file.pdf
python -m doc2md.cli table.csv
python -m doc2md.cli scan.png
doc2md  # interactive mode

CLI contract

  • Default output is docs_out/<input filename>.md; -o <file> overrides the destination, -o - writes to stdout.
  • Interactive mode (no input) opens a curses-like UI tied to docs_in; /files loads the list and /more exposes advanced commands (history, profiles, UI, session toggles).
  • Errors and diagnostics stream to stderr.
  • Exit codes: 2 usage/argument error, 3 unsupported format, 4 conversion failure, 5 output write failure.
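
For scripts that wrap the CLI, the exit codes above can be turned into readable messages. The following is a minimal sketch, not part of documint2md: `EXIT_MEANINGS`, `describe_exit`, and `run_doc2md` are hypothetical helper names, and the sketch assumes that exit code 0 means success, which the contract does not state explicitly.

```python
import subprocess

# Exit codes as listed in the CLI contract; treating 0 as success is an assumption.
EXIT_MEANINGS = {
    0: "success (assumed)",
    2: "usage/argument error",
    3: "unsupported format",
    4: "conversion failure",
    5: "output write failure",
}


def describe_exit(code: int) -> str:
    """Return a readable label for a doc2md exit code."""
    return EXIT_MEANINGS.get(code, f"unknown exit code {code}")


def run_doc2md(path: str) -> tuple[str, str]:
    """Convert one file, streaming Markdown to stdout with `-o -`.

    Returns (markdown, status label). Diagnostics go to stderr per the
    contract, so stdout should contain only the Markdown.
    """
    result = subprocess.run(
        ["doc2md", path, "-o", "-"],
        capture_output=True,
        text=True,
    )
    return result.stdout, describe_exit(result.returncode)
```

Calling `run_doc2md("file.pdf")` requires `doc2md` on the PATH; `describe_exit` can be used on its own in CI logs.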

CLI options

  • --format pdf|docx|csv|image forces the parser instead of inferring from the extension.
  • --engine pdfminer|pdftext|marker selects the PDF engine (default pdfminer; marker stays text-only unless assets are enabled explicitly).
  • --ocr or --ocr-mode auto enables OCR fallback for PDFs when text extraction is empty.
  • --ocr-mode never|auto|always controls OCR behavior for PDFs (default never).
  • --ocr-lang es sets OCR language (default es).
  • --ocr-device cpu|gpu:0 overrides OCR device selection.
  • --ocr-render-scale 2.0 controls PDF render scale for OCR.
  • --ocr-min-score 0.5 filters low-confidence OCR text.
  • --csv-na "" controls how empty values render.
  • --csv-float-format "%.6g" stabilizes floating-point output when needed.
  • --profile <name> loads defaults from doc2md.toml.
  • --stats, --profile-report, --quiet, --debug, --version, --theme, --interactive, --no-input toggle output, logging, and interactivity.
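
To illustrate why --csv-float-format and --csv-na help keep output stable, the sketch below applies a printf-style pattern such as "%.6g" with Python's built-in `%` formatting. How doc2md applies these options internally is not shown in this README; `format_cell` is a hypothetical helper written only for this illustration.

```python
def format_cell(value, float_format="%.6g", na=""):
    """Render one cell: floats via float_format, missing values as `na`."""
    if value is None:
        return na
    if isinstance(value, float):
        if value != value:  # NaN is the only float not equal to itself
            return na
        return float_format % value
    return str(value)


print(format_cell(0.1 + 0.2))    # 0.3
print(format_cell(3.14159265))   # 3.14159
print(format_cell(1234567.0))    # 1.23457e+06
print(repr(format_cell(None)))   # ''
```

With six significant digits, binary floating-point noise such as 0.30000000000000004 collapses to 0.3, so the same CSV renders identically across machines.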

OCR setup (optional)

Recommended (CPU + GPU side-by-side):

.\scripts\setup_ocr_envs.ps1

See docs/OCR Dual Environment Setup.md for GPU verification, fallback index, and usage.

Quick run (GPU):

.\scripts\doc2md-gpu.cmd docs_in\ocr_samples\sample_text.png --ocr-lang en --ocr-device gpu:0 --yes -o docs_out\sample_text.gpu.md

Quick run (CPU):

.\scripts\doc2md-cpu.cmd docs_in\ocr_samples\sample_text.png --ocr-lang en --ocr-device cpu --yes -o docs_out\sample_text.cpu.md

Manual install (CPU):

python -m pip install paddlepaddle==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
python -m pip install paddleocr==3.4.0

GPU (Windows; choose one CUDA index):

python -m pip install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python -m pip install paddleocr==3.4.0

If model downloads fail, set:

$env:PADDLE_PDX_MODEL_SOURCE = "BOS"
$env:PADDLE_PDX_DISABLE_MODEL_SOURCE_CHECK = "True"

Performance tips:

  • Batch multiple files in one command to reuse OCR initialization.
  • For scanned PDFs, use --ocr-render-scale 1.0 to trade accuracy for speed.
  • Prefer --ocr-mode auto for PDFs so OCR runs only on textless pages.
  • First OCR run is slow due to model downloads; subsequent runs are faster.
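
The never|auto|always behavior described for --ocr-mode can be pictured as a per-page decision. This is a sketch of that logic under stated assumptions, not doc2md's actual code: `pages_needing_ocr` is a hypothetical name, and treating whitespace-only text as "empty" is an assumption.

```python
def pages_needing_ocr(mode: str, page_texts: list[str]) -> list[int]:
    """Return the indices of pages that would be sent to OCR.

    mode: "never", "auto", or "always" (the values accepted by --ocr-mode).
    page_texts: text extracted from each page by the regular PDF engine.
    """
    if mode == "never":
        return []
    if mode == "always":
        return list(range(len(page_texts)))
    if mode == "auto":
        # Assumption: a page counts as textless when extraction yields
        # nothing but whitespace.
        return [i for i, text in enumerate(page_texts) if not text.strip()]
    raise ValueError(f"unknown OCR mode: {mode!r}")


pages = ["Intro text", "", "  \n", "Appendix"]
print(pages_needing_ocr("auto", pages))    # [1, 2]
print(pages_needing_ocr("always", pages))  # [0, 1, 2, 3]
print(pages_needing_ocr("never", pages))   # []
```

Under this model, auto skips OCR on pages that already have extractable text, which is why it is the cheaper choice for mixed documents.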

Interactive mode

When you run doc2md without inputs, the CLI opens a full-screen picker. Interact with /files (space to select, enter to convert), type / to see the short command list, and use /more for advanced tools (history, profiles, UI theme, session toggles). OCR is configured via /ocr subcommands (e.g. /ocr mode auto, /ocr lang es). The footer keeps the current format/engine/output in view while the header shows version + cwd. Use Ctrl+P/Ctrl+N for command history.

Library API

  • doc2md.pdf_to_markdown(path) – extracts text-only Markdown from PDFs (OCR optional via ocr_mode).
  • doc2md.docx_to_markdown(path) – converts DOCX → Mammoth HTML → Markdown via markdownify with deterministic heading/list settings.
  • doc2md.csv_to_markdown(path) – parses CSV files with pandas and emits clean Markdown tables.
  • doc2md.image_to_markdown(path) – runs OCR on image files and returns Markdown text.
  • Input types: str | PathLike; return type: str.
  • Exceptions: ConversionError for failures, UnsupportedFormatError for unsupported formats/engines.
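
One way to use the four functions together is a small extension-based dispatcher. The sketch below is illustrative only: `pick_converter` and `build_doc2md_converters` are hypothetical helpers, the suffix table is an assumption (only .png appears as an image example in this README), and the sketch raises a plain ValueError for unknown suffixes because the import path of the library's own exception classes is not shown here.

```python
from pathlib import Path
from typing import Callable, Mapping

Converter = Callable[[Path], str]


def pick_converter(path, converters: Mapping[str, Converter]) -> Converter:
    """Choose a converter by file suffix (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    try:
        return converters[suffix]
    except KeyError:
        raise ValueError(f"no converter registered for {suffix!r}") from None


def build_doc2md_converters() -> dict:
    """Map suffixes to the doc2md functions listed above.

    Requires documint2md to be installed; the import is deferred so the
    dispatcher itself can be tested without it.
    """
    import doc2md

    return {
        ".pdf": doc2md.pdf_to_markdown,
        ".docx": doc2md.docx_to_markdown,
        ".csv": doc2md.csv_to_markdown,
        ".png": doc2md.image_to_markdown,
    }
```

With the package installed, `pick_converter("report.pdf", build_doc2md_converters())("report.pdf")` would return the Markdown string for that file.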

Normalization rules

  • Normalize newlines to \n.
  • Strip trailing whitespace per line.
  • Cap consecutive blank lines at two.
  • Remove trailing blank lines and end every non-empty output with a single newline.
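
The four rules above can be sketched as a single function. This is a minimal re-implementation for illustration, assuming the rules are applied in the order listed; `normalize_markdown` is a hypothetical name and not necessarily how doc2md implements it.

```python
import re


def normalize_markdown(text: str) -> str:
    """Apply the four normalization rules in order."""
    # 1. Normalize newlines to \n (handles \r\n and bare \r).
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # 2. Strip trailing whitespace on every line.
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    # 3. Cap consecutive blank lines at two: two blank lines between
    #    content lines means at most three newline characters in a row.
    text = re.sub(r"\n{4,}", "\n\n\n", text)
    # 4. Drop trailing blank lines; non-empty output ends with one newline.
    text = text.rstrip("\n")
    return text + "\n" if text else ""


print(repr(normalize_markdown("a  \r\nb\r\r\r\r\rc\n\n\n")))  # 'a\nb\n\n\nc\n'
```

Because the function is idempotent (normalizing already-normalized text changes nothing), it is safe to run on every conversion result.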

Testing & fixtures

python -m pip install --require-hashes -r requirements-dev.txt
python -m pytest
python -m compileall .
python -m doc2md.cli .\tests\fixtures\in\sample.docx -o .\docs_out\sample.docx.md
python -m doc2md.cli .\tests\fixtures\in\sample.pdf -o .\docs_out\sample.pdf.md
python -m doc2md.cli .\tests\fixtures\in\sample.csv -o .\docs_out\sample.csv.md

Edge-case fixtures live in tests/fixtures/in with golden Markdown in tests/fixtures/out. Use docs_in as your local drop zone.
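
A golden-file check compares freshly converted Markdown against the stored expected output. The sketch below shows one way to do that with the standard library; `diff_against_golden` is a hypothetical helper and is not taken from the project's test suite.

```python
import difflib
from pathlib import Path


def diff_against_golden(actual: str, golden_path) -> str:
    """Return a unified diff between the golden file and `actual`.

    An empty string means the output matches the golden file exactly.
    """
    # Read bytes and decode so that newline characters are compared
    # verbatim instead of being translated by text-mode reading.
    expected = Path(golden_path).read_bytes().decode("utf-8")
    diff = difflib.unified_diff(
        expected.splitlines(keepends=True),
        actual.splitlines(keepends=True),
        fromfile=str(golden_path),
        tofile="actual",
    )
    return "".join(diff)
```

In a pytest test, `assert diff_against_golden(markdown, golden) == ""` fails with nothing to show, so printing the returned diff first makes fixture changes easier to explain in a PR.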

Publishing

Releases are tag-driven via GitHub Actions + Trusted Publishing.

  • TestPyPI: push a tag like v1.0.1rc1 to trigger release-testpypi.yml.
  • PyPI: push a tag like v1.0.1 to trigger release-pypi.yml.
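
Before pushing a tag, it can help to check which workflow it is likely to trigger. The sketch below classifies tags by shape, based only on the two examples above; the actual trigger patterns live in release-testpypi.yml and release-pypi.yml and are not shown here, so the regular expressions and the `release_target` name are assumptions.

```python
import re

# Assumed tag shapes: vX.Y.Z for final releases, vX.Y.Z followed by
# a/b/rc and a number for pre-releases (PEP 440 style suffixes).
_FINAL = re.compile(r"^v\d+\.\d+\.\d+$")
_PRE = re.compile(r"^v\d+\.\d+\.\d+(a|b|rc)\d+$")


def release_target(tag: str):
    """Return "pypi", "testpypi", or None for an unrecognized tag."""
    if _FINAL.match(tag):
        return "pypi"
    if _PRE.match(tag):
        return "testpypi"
    return None


print(release_target("v1.0.1rc1"))  # testpypi
print(release_target("v1.0.1"))     # pypi
print(release_target("1.0.1"))      # None
```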

Release checklist

  • Update pyproject.toml version.
  • Regenerate requirements.txt, requirements-all.txt, and requirements-dev.txt.
  • Run tests and CLI smoke conversions.
  • Build and check distributions before upload.

Contributing

  • Work on dev, open a PR to main, and keep main release-ready. Tags on main trigger publishing workflows.
  • Drop samples into docs_in and run the CLI to confirm conversions. Read .github/copilot-instructions.md for repo-specific guidance, keep diffs small, and explain fixture changes when extraction output shifts.

Notes

  • The interactive UI pauses ~2 seconds after success so the confirmation stays on screen unless you pass --quiet.
  • History helpers: doc2md history, search, rerun, jump, recent, explain, and ui.
  • The CLI exposes both quick (/files, /format, /output) and advanced (/more) helpers to explore settings without re-running the command.
