Convert any directory of docs (DOCX, PPTX, PDF, XLSX, CSV) to clean Markdown, with a watch mode that auto-syncs on change.

mdpack

Convert any directory of docs to clean Markdown, ready for RAG / LLM ingestion.

One CLI. Point it at a folder of DOCX / PPTX / PDF / XLSX / CSV files, get back a mirrored tree of Markdown — frontmatter-tagged with source path and converter used, inline base64 images stripped, no surprises.

Want it to auto-sync on every save instead of running by hand? Use mdpack watch.

Install

pip install mdpack

For DOCX and PPTX, install pandoc:

brew install pandoc          # macOS
apt install pandoc           # Ubuntu / Debian

PDF is optional (Docling pulls ~1GB of torch/transformers) — only install if you need it:

pip install 'mdpack[pdf]'

Check your setup:

mdpack doctor

Usage

Convert a whole directory

mdpack convert ~/Desktop/reports
# Writes Markdown into ~/Desktop/reports/converted/

Input tree is mirrored: reports/q1/sales.xlsx → reports/converted/q1/sales.md.
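The mirroring rule can be sketched in a few lines of Python. This is an illustration of the path mapping only, not mdpack's actual code; the function name `mirrored_output` and its `out_dir_name` parameter are hypothetical.

```python
from pathlib import Path


def mirrored_output(src_root: Path, src_file: Path, out_dir_name: str = "converted") -> Path:
    """Map a source file to its Markdown path under <src_root>/<out_dir_name>/."""
    # Keep the sub-directory structure, swap only the file suffix.
    relative = src_file.relative_to(src_root)
    return src_root / out_dir_name / relative.with_suffix(".md")


result = mirrored_output(Path("reports"), Path("reports/q1/sales.xlsx"))
print(result.as_posix())  # reports/converted/q1/sales.md
```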

Convert a single file

mdpack convert proposal.docx -o out/

Options

  • -o, --output PATH — output directory (default: <src>/converted for dirs).
  • --force — re-convert even if the output is newer than the source.
  • --quiet — only print errors. --no-progress — hide the progress bar.
  • -j, --jobs N — worker threads (default 1). Helps for DOCX/PPTX-heavy trees.
  • --max-size SIZE — skip files larger than this (default 100MB, 0 = unlimited).
  • --pdf-max-size SIZE — separate cap for PDFs (default 50MB).
  • --include-hidden / --follow-symlinks — both off by default.
  • --exclude PATTERN — gitignore-syntax pattern, repeatable.
  • --ignore-file PATH — extra ignore file (default: <src>/.mdpackignore if present).
  • --respect-gitignore — also honour <src>/.gitignore.
  • --max-depth N — cap recursion depth.

Incremental by default — mdpack skips files whose output is newer than the source.
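A minimal sketch of what such a freshness check might look like, assuming it compares file modification times (the README does not say how "newer" is determined). The helper `needs_conversion` is hypothetical; its `force` argument mirrors the --force flag.

```python
import os
import tempfile
from pathlib import Path


def needs_conversion(src: Path, out: Path, force: bool = False) -> bool:
    """Return True when the output is missing, not newer than the source, or force is set."""
    if force or not out.exists():
        return True
    # Skip only when the output is strictly newer than the source.
    return out.stat().st_mtime <= src.stat().st_mtime


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp, "sales.csv")
    out = Path(tmp, "sales.md")
    src.write_text("a,b\n")
    print(needs_conversion(src, out))  # True: no output yet

    out.write_text("| a | b |\n")
    os.utime(src, (1_000, 1_000))      # pin mtimes so the demo is deterministic
    os.utime(out, (2_000, 2_000))
    print(needs_conversion(src, out))  # False: output is newer

    os.utime(src, (3_000, 3_000))
    print(needs_conversion(src, out))  # True: source changed after conversion
```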

Large directories

Pointing mdpack at ~ or ~/Documents will not cripple your machine: by default junk dirs (.git, node_modules, .venv, __pycache__, Library, .Trash, …) are skipped, files over 100 MB are skipped, PDFs over 50 MB are skipped, and symlinks are not followed. Drop a .mdpackignore file (gitignore syntax) at the root for project-specific overrides:

drafts/
*.tmp.csv
archive/2024-old/
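To make the default guards concrete, here is a rough sketch of a directory walk that applies them. It is not mdpack's implementation: `iter_candidates` is a hypothetical name, `JUNK_DIRS` lists only the directories named above (the README elides the rest), 1 MB is assumed to mean 10^6 bytes, treating 0 as unlimited for the PDF cap is also an assumption, and .mdpackignore matching is left out.

```python
import os
from pathlib import Path
from typing import Iterator

# Only the names the README spells out; the real default list is longer.
JUNK_DIRS = {".git", "node_modules", ".venv", "__pycache__", "Library", ".Trash"}
SUPPORTED = {".csv", ".xlsx", ".docx", ".pptx", ".pdf"}
MB = 10**6  # assumption: decimal megabytes


def iter_candidates(root: Path, max_size: int = 100 * MB,
                    pdf_max_size: int = 50 * MB) -> Iterator[Path]:
    """Yield convertible files under root, applying the default safety limits."""
    # followlinks=False keeps os.walk from descending into symlinked directories.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # Prune in place so os.walk never enters junk or hidden directories.
        dirnames[:] = [d for d in dirnames
                       if d not in JUNK_DIRS and not d.startswith(".")]
        for name in filenames:
            path = Path(dirpath, name)
            suffix = path.suffix.lower()
            if name.startswith(".") or path.is_symlink() or suffix not in SUPPORTED:
                continue
            cap = pdf_max_size if suffix == ".pdf" else max_size
            if cap and path.stat().st_size > cap:  # a cap of 0 means unlimited
                continue
            yield path
```

Pruning `dirnames` in place is what keeps a walk over ~ cheap: skipped directories are never listed at all, rather than listed and filtered afterwards.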

Inspect supported formats

mdpack formats
Supported formats:
  csv    .csv
  xlsx   .xlsx
  docx   .docx
  pptx   .pptx
  pdf    .pdf

Watch mode

The killer feature. Instead of running mdpack convert every time you save a file, point mdpack watch at a directory and it stays running — every create / modify / delete / rename is batched with a 1.5s debounce and applied to the output tree.

mdpack watch ~/Desktop/reports
# Watches ~/Desktop/reports, keeps ~/Desktop/reports/converted/ in sync. Ctrl-C to stop.

Or with a separate output directory:

mdpack watch ~/Desktop/A -o ~/Desktop/B

What it does on each event:

| Source change | What happens in output |
|---|---|
| new .docx added | corresponding .md created |
| .xlsx edited | .md re-generated |
| .csv deleted | .md deleted |
| file renamed | old .md deleted, new one created |
| file inside the output dir touched | ignored (no infinite loops) |
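The event handling above can be read as a small mapping from filesystem events to output actions. The sketch below is an illustration under assumptions, not mdpack's code: `plan_actions`, `to_output`, the event kind strings, and the action tuples are all hypothetical.

```python
from pathlib import Path
from typing import List, Optional, Tuple

Action = Tuple[str, Path]  # ("convert", source_path) or ("delete", markdown_path)


def to_output(src_root: Path, out_root: Path, src_file: Path) -> Path:
    """Mirror a source path into the output tree with a .md suffix."""
    return out_root / src_file.relative_to(src_root).with_suffix(".md")


def plan_actions(kind: str, path: Path, src_root: Path, out_root: Path,
                 dest: Optional[Path] = None) -> List[Action]:
    """Translate one filesystem event into actions on the output tree."""
    # Events from inside the output directory are dropped, so the watcher
    # never reacts to its own writes.
    if path.is_relative_to(out_root):  # Path.is_relative_to needs Python 3.9+
        return []
    if kind in ("created", "modified"):
        return [("convert", path)]
    if kind == "deleted":
        return [("delete", to_output(src_root, out_root, path))]
    if kind == "moved" and dest is not None:
        # A rename is a delete of the old output plus a fresh conversion.
        return [("delete", to_output(src_root, out_root, path)), ("convert", dest)]
    return []


src, out = Path("reports"), Path("reports/converted")
print(plan_actions("moved", src / "a.csv", src, out, dest=src / "b.csv"))
print(plan_actions("modified", out / "a.md", src, out))  # []
```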

On startup, watch does one incremental sync pass first, so the output is already aligned when event handling begins. Use --no-initial-sync to skip that, or --force-initial-sync to rebuild everything.
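For readers unfamiliar with debouncing, here is a minimal sketch of how batching with a 1.5s quiet period could work. The README only states the delay; the trailing-edge behaviour, the "latest event per path wins" coalescing, and the `Debouncer` class itself are assumptions made for illustration. Timestamps are passed in explicitly so the example runs without sleeping.

```python
from typing import Dict, List, Optional, Tuple


class Debouncer:
    """Collect events and release them once no new event arrived for `delay` seconds."""

    def __init__(self, delay: float = 1.5) -> None:
        self.delay = delay
        self._pending: Dict[str, str] = {}  # path -> latest event kind
        self._last_event_at: Optional[float] = None

    def add(self, path: str, kind: str, now: float) -> None:
        # A later event for the same path overwrites the earlier one.
        self._pending[path] = kind
        self._last_event_at = now

    def flush_if_quiet(self, now: float) -> List[Tuple[str, str]]:
        """Return the batch if the quiet period has elapsed, else an empty list."""
        if self._last_event_at is None or now - self._last_event_at < self.delay:
            return []
        batch = sorted(self._pending.items())
        self._pending.clear()
        self._last_event_at = None
        return batch


d = Debouncer()
d.add("q1/sales.xlsx", "modified", now=0.0)
d.add("q1/sales.xlsx", "modified", now=0.4)  # editor saved twice
d.add("notes.docx", "created", now=1.0)
print(d.flush_if_quiet(now=2.0))  # []
print(d.flush_if_quiet(now=2.5))  # [('notes.docx', 'created'), ('q1/sales.xlsx', 'modified')]
```

The point of the quiet period is that an editor writing a file in several bursts triggers one conversion, not one per write.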

Keeping it running in the background

mdpack watch runs in the foreground. Pick whichever background option suits your setup:

tmux — simplest, survives terminal close:

tmux new -d -s mdpack 'mdpack watch ~/Desktop/reports'
tmux attach -t mdpack          # inspect

nohup — crude but works everywhere:

nohup mdpack watch ~/Desktop/reports > ~/mdpack.log 2>&1 &

launchd (macOS) — start on login, auto-restart on crash. Save as ~/Library/LaunchAgents/com.example.mdpack.plist:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>com.example.mdpack</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/mdpack</string>
    <string>watch</string>
    <string>/Users/you/Desktop/reports</string>
  </array>
  <key>RunAtLoad</key><true/>
  <key>KeepAlive</key><true/>
  <key>StandardOutPath</key><string>/tmp/mdpack.log</string>
  <key>StandardErrorPath</key><string>/tmp/mdpack.log</string>
</dict>
</plist>

Load it: launchctl load ~/Library/LaunchAgents/com.example.mdpack.plist

systemd user unit (Linux) — similar idea, save as ~/.config/systemd/user/mdpack.service:

[Unit]
Description=mdpack watch

[Service]
ExecStart=/usr/bin/mdpack watch %h/Desktop/reports
Restart=on-failure

[Install]
WantedBy=default.target

Enable: systemctl --user enable --now mdpack.service


Pair with mdrag

mdrag is a companion project — a local, offline Markdown semantic-search MCP server for Claude Code / Cursor / Cline.

Fully-automatic pipeline (source changes → Markdown updated → index updated):

# Terminal 1: keep Markdown in sync with the source dir
mdpack watch ~/Desktop/reports -o ~/Desktop/reports-md

# Terminal 2 (or launched by Claude Code): serve the vault + auto-reindex
mdrag vault add reports ~/Desktop/reports-md     # one-time registration
mdrag serve                                      # watches the vault, re-indexes on change

Now edit any .docx / .pptx / .xlsx under ~/Desktop/reports/. Within ~3 seconds, the matching .md is rewritten, mdrag notices and re-embeds it, and your next search from Claude Code sees the updated content. No manual steps. The two tools are loosely coupled: neither knows about the other, and they share only the intermediate Markdown directory, which mdpack writes to and mdrag watches.


What the output looks like

Every converted file gets a YAML frontmatter block so downstream tools know where it came from:

---
title: Q1 Sales Review
source: q1/sales.xlsx
converter: xlsx
converter_version: mdpack 0.3.0
converted_at: 2026-04-16T05:30:00Z
---

# sales

## Summary
| Region | Revenue | YoY |
|---|---|---|
| APAC | 4.2M | +12% |
...
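A downstream tool can read that block back with very little code. The sketch below handles only flat `key: value` lines like the ones shown above and is not a YAML parser; use a real one (for example PyYAML) for anything richer. The function name `split_frontmatter` is hypothetical.

```python
from typing import Dict, Tuple


def split_frontmatter(text: str) -> Tuple[Dict[str, str], str]:
    """Split a converted file into (frontmatter dict, Markdown body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: Dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":  # closing delimiter found
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return meta, body
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, text  # no closing delimiter: treat everything as body


sample = """---
title: Q1 Sales Review
source: q1/sales.xlsx
converter: xlsx
converted_at: 2026-04-16T05:30:00Z
---

# sales
"""
meta, body = split_frontmatter(sample)
print(meta["source"])        # q1/sales.xlsx
print(meta["converted_at"])  # 2026-04-16T05:30:00Z
print(body)                  # "# sales"
```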

Roadmap

Next up (0.3.1): a separate concurrency lane for PDFs (so -j N parallelises DOCX/PPTX without multiplying Docling's 1 GB model footprint).

Then: HTML and EPUB (pandoc), and ready-to-use background scripts (maybe an mdpack install-service command that writes the plist / systemd unit for you).

Scanned / image-only PDFs (OCR) remain intentionally out of scope — if you need them, run Docling with its OCR pipeline upstream, or use tesseract.

Development

git clone https://github.com/andyleimc-source/mdpack
cd mdpack
python -m venv .venv
.venv/bin/pip install -e ".[dev]"
.venv/bin/pytest

License

MIT
