
Convert any directory of docs (DOCX, PPTX, PDF, XLSX, CSV) to clean Markdown, with a watch mode that auto-syncs on change.


mdpack

Convert any directory of docs to clean Markdown, ready for RAG / LLM ingestion.

One CLI. Point it at a folder of DOCX / PPTX / PDF / XLSX / CSV files, get back a mirrored tree of Markdown — frontmatter-tagged with source path and converter used, inline base64 images stripped, no surprises.

Want it to auto-sync on every save instead of running by hand? Use mdpack watch.

Install

pip install mdpack

For DOCX and PPTX, install pandoc:

brew install pandoc          # macOS
apt install pandoc           # Ubuntu / Debian

PDF is optional (Docling pulls ~1GB of torch/transformers) — only install if you need it:

pip install 'mdpack[pdf]'

Check your setup:

mdpack doctor
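For a rough idea of what a setup check involves, here is a minimal Python sketch that looks for the two external dependencies mentioned above. It is not mdpack doctor's actual implementation; the function name check_setup is hypothetical, and the sketch assumes Docling is importable under the module name docling.

```python
import importlib.util
import shutil


def check_setup() -> dict[str, bool]:
    """Report which external converter dependencies are available."""
    return {
        # pandoc is needed for DOCX and PPTX, per the install notes above.
        "pandoc": shutil.which("pandoc") is not None,
        # Docling backs the optional PDF extra (assumed module name: docling).
        # find_spec only locates the package; it does not import it.
        "docling": importlib.util.find_spec("docling") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_setup().items():
        print(f"{name:8} {'ok' if ok else 'missing'}")
```

Run mdpack doctor itself for the authoritative answer; the sketch only shows the kind of checks that are likely relevant.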

Usage

Convert a whole directory

mdpack convert ~/Desktop/reports
# Writes Markdown into ~/Desktop/reports/converted/

Input tree is mirrored: reports/q1/sales.xlsx → reports/converted/q1/sales.md.
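The mirroring rule can be sketched in a few lines of Python. This is an illustration of the mapping described above, not mdpack's own code; the function name mirrored_output is hypothetical.

```python
from pathlib import Path


def mirrored_output(src_file: Path, src_root: Path, out_root: Path) -> Path:
    """Map a source file to its Markdown path in the mirrored output tree."""
    # Keep the sub-directory structure, swap the extension for .md.
    relative = src_file.relative_to(src_root)
    return out_root / relative.with_suffix(".md")


root = Path("reports")
print(mirrored_output(root / "q1" / "sales.xlsx", root, root / "converted"))
# prints reports/converted/q1/sales.md on POSIX systems
```

Under this simple rule, two sources that differ only by extension (say notes.docx and notes.pdf) would map to the same .md file. How mdpack handles that case is not covered here.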

Convert a single file

mdpack convert proposal.docx -o out/

Options

  • -o, --output PATH — output directory (default: <src>/converted for dirs).
  • --force — re-convert even if the output is newer than the source.
  • --quiet — only print errors.

Incremental by default — mdpack skips files whose output is newer than the source.
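The skip rule amounts to a modification-time comparison. A minimal sketch, assuming the check is based on file mtimes (the function name needs_conversion is hypothetical, and mdpack's real check may differ in detail):

```python
from pathlib import Path


def needs_conversion(src: Path, out: Path, force: bool = False) -> bool:
    """Return True when the source should be (re)converted."""
    # --force, or a missing output, always triggers a conversion.
    if force or not out.exists():
        return True
    # Skip only when the output is strictly newer than the source.
    return out.stat().st_mtime <= src.stat().st_mtime
```

One consequence of an mtime-based rule: copying or restoring a source file with an old timestamp will not trigger a re-convert, which is the kind of case --force is for.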

Inspect supported formats

mdpack formats
Supported formats:
  csv    .csv
  xlsx   .xlsx
  docx   .docx
  pptx   .pptx
  pdf    .pdf

Watch mode

The killer feature. Instead of running mdpack convert every time you save a file, point mdpack watch at a directory and it stays running — every create / modify / delete / rename is batched with a 1.5s debounce and applied to the output tree.

mdpack watch ~/Desktop/reports
# Watches ~/Desktop/reports, keeps ~/Desktop/reports/converted/ in sync. Ctrl-C to stop.

Or with a separate output directory:

mdpack watch ~/Desktop/A -o ~/Desktop/B

What it does on each event:

| Source change | What happens in output |
|---|---|
| new .docx added | corresponding .md created |
| .xlsx edited | .md re-generated |
| .csv deleted | .md deleted |
| file renamed | old .md deleted, new one created |
| file inside the output dir touched | ignored (no infinite loops) |
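The table above can be read as a small decision function. The following Python sketch is one way to express it; the event names ("created", "modified", "deleted", "moved") and the function plan_actions are assumptions made for illustration, not mdpack's internal API.

```python
from pathlib import Path
from typing import Optional

SUPPORTED = {".csv", ".xlsx", ".docx", ".pptx", ".pdf"}


def plan_actions(kind: str, src: Path, src_root: Path, out_root: Path,
                 dest: Optional[Path] = None) -> list[tuple[str, Path]]:
    """Return (action, output_path) pairs for one filesystem event."""

    def target(p: Path) -> Path:
        return out_root / p.relative_to(src_root).with_suffix(".md")

    def relevant(p: Path) -> bool:
        inside_source = src_root in p.parents
        # Events inside the output dir are ignored to avoid feedback loops.
        inside_output = p == out_root or out_root in p.parents
        return inside_source and not inside_output and p.suffix.lower() in SUPPORTED

    actions: list[tuple[str, Path]] = []
    if kind in ("created", "modified") and relevant(src):
        actions.append(("convert", target(src)))
    elif kind == "deleted" and relevant(src):
        actions.append(("delete", target(src)))
    elif kind == "moved":
        # A rename is a delete of the old output plus a convert of the new file.
        if relevant(src):
            actions.append(("delete", target(src)))
        if dest is not None and relevant(dest):
            actions.append(("convert", target(dest)))
    return actions
```

For example, plan_actions("moved", root / "a.docx", root, root / "converted", dest=root / "b.docx") yields a delete for converted/a.md followed by a convert for converted/b.md.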

On startup, watch does one incremental sync pass first, so the output is already aligned when event handling begins. Use --no-initial-sync to skip that, or --force-initial-sync to rebuild everything.
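To make the debounce behaviour concrete, here is a minimal Python sketch of the general technique: events are collected, and the batch is flushed only once no new event has arrived for the delay period. The Debouncer class is hypothetical, and the "last event per path wins" coalescing is an assumption; mdpack's real batching logic may differ.

```python
import threading
from typing import Callable, Optional


class Debouncer:
    """Collect events and flush them once things have been quiet for `delay` seconds."""

    def __init__(self, delay: float, flush: Callable[[dict], None]) -> None:
        self._delay = delay
        self._flush = flush
        self._pending: dict[str, str] = {}
        self._timer: Optional[threading.Timer] = None
        self._lock = threading.Lock()

    def add(self, path: str, kind: str) -> None:
        with self._lock:
            # Later events for the same path replace earlier ones.
            self._pending[path] = kind
            # Restart the quiet-period timer on every new event.
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._delay, self._fire)
            self._timer.daemon = True
            self._timer.start()

    def _fire(self) -> None:
        with self._lock:
            # A timer that was superseded while waiting for the lock does nothing.
            if self._timer is not threading.current_thread():
                return
            batch, self._pending = self._pending, {}
            self._timer = None
        self._flush(batch)
```

With Debouncer(1.5, handle_batch), where handle_batch is your own callback, a burst of saves from an editor collapses into a single call made 1.5 seconds after the last event.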

Keeping it running in the background

mdpack watch runs in the foreground. Pick whichever background option suits your setup:

tmux — simplest, survives terminal close:

tmux new -d -s mdpack 'mdpack watch ~/Desktop/reports'
tmux attach -t mdpack          # inspect

nohup — crude but works everywhere:

nohup mdpack watch ~/Desktop/reports > ~/mdpack.log 2>&1 &

launchd (macOS) — start on login, auto-restart on crash. Save as ~/Library/LaunchAgents/com.example.mdpack.plist:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>com.example.mdpack</string>
  <key>ProgramArguments</key>
  <array>
    <!-- Adjust to the path printed by: which mdpack -->
    <string>/usr/local/bin/mdpack</string>
    <string>watch</string>
    <string>/Users/you/Desktop/reports</string>
  </array>
  <key>RunAtLoad</key><true/>
  <key>KeepAlive</key><true/>
  <key>StandardOutPath</key><string>/tmp/mdpack.log</string>
  <key>StandardErrorPath</key><string>/tmp/mdpack.log</string>
</dict>
</plist>

Load it: launchctl load ~/Library/LaunchAgents/com.example.mdpack.plist

systemd user unit (Linux) — similar idea, save as ~/.config/systemd/user/mdpack.service:

[Unit]
Description=mdpack watch

[Service]
# Adjust the binary path to the one printed by: which mdpack
ExecStart=/usr/bin/mdpack watch %h/Desktop/reports
Restart=on-failure

[Install]
WantedBy=default.target

Enable: systemctl --user enable --now mdpack.service


Pair with mdrag

mdrag is a companion project — a local, offline Markdown semantic-search MCP server for Claude Code / Cursor / Cline.

Fully-automatic pipeline (source changes → Markdown updated → index updated):

# Terminal 1: keep Markdown in sync with the source dir
mdpack watch ~/Desktop/reports -o ~/Desktop/reports-md

# Terminal 2 (or launched by Claude Code): serve the vault + auto-reindex
mdrag vault add reports ~/Desktop/reports-md     # one-time registration
mdrag serve                                      # watches the vault, re-indexes on change

Now edit any .docx / .pptx / .xlsx under ~/Desktop/reports/. Within ~3 seconds, the matching .md is rewritten, mdrag notices and re-embeds it, and your next search from Claude Code sees the updated content. No manual steps. The two tools are loosely coupled: neither knows about the other. mdpack writes to the middle directory and mdrag watches it.


What the output looks like

Every converted file gets a YAML frontmatter block so downstream tools know where it came from:

---
title: Q1 Sales Review
source: q1/sales.xlsx
converter: xlsx
converter_version: mdpack 0.2.0
converted_at: 2026-04-16T05:30:00Z
---

# sales

## Summary
| Region | Revenue | YoY |
|---|---|---|
| APAC | 4.2M | +12% |
...
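A downstream tool can read that frontmatter with very little code. The sketch below handles only the flat key: value layout shown in the sample above; the function name split_frontmatter is hypothetical, and a real YAML parser (for example PyYAML's yaml.safe_load) would be the more robust choice if the frontmatter ever grows nested values.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a converted file into (metadata, body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter block at all
    meta: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            body = "\n".join(lines[i + 1:])
            return meta, body.lstrip("\n")
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, text  # no closing delimiter; treat as plain Markdown
```

Calling split_frontmatter on the sample above gives a dict whose source entry is q1/sales.xlsx and a body that starts at the # sales heading.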

Roadmap

Next up (0.3.0): HTML and EPUB (pandoc), and ready-to-use background scripts (maybe an mdpack install-service that writes the plist / systemd unit for you).

Scanned / image-only PDFs (OCR) remain intentionally out of scope — if you need them, run Docling with its OCR pipeline upstream, or use tesseract.

Development

git clone https://github.com/andyleimc-source/mdpack
cd mdpack
python -m venv .venv
.venv/bin/pip install -e ".[dev]"
.venv/bin/pytest

License

MIT

