Convert any directory of docs (DOCX, PPTX, PDF, XLSX, CSV) to clean Markdown, with a watch mode that auto-syncs on change.

mdpack

Convert any directory of docs to clean Markdown, ready for RAG / LLM ingestion.

One CLI. Point it at a folder of DOCX / PPTX / PDF / XLSX / CSV files, get back a mirrored tree of Markdown — frontmatter-tagged with source path and converter used, inline base64 images stripped, no surprises.

Want it to auto-sync on every save instead of running by hand? Use mdpack watch.

Install

pip install mdpack

For DOCX and PPTX, install pandoc:

brew install pandoc          # macOS
apt install pandoc           # Ubuntu / Debian

PDF support is optional because Docling pulls in about 1 GB of torch/transformers dependencies. Install it only if you need it:

pip install 'mdpack[pdf]'

Check your setup:

mdpack doctor

Usage

Convert a whole directory

mdpack convert ~/Desktop/reports
# Writes Markdown into ~/Desktop/reports/converted/

The input tree is mirrored: reports/q1/sales.xlsx → reports/converted/q1/sales.md.
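The mirroring rule can be sketched in a few lines of Python. This only illustrates the path mapping described here and is not taken from mdpack's source; `mirrored_output` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Optional


def mirrored_output(src_root: Path, src_file: Path,
                    out_root: Optional[Path] = None) -> Path:
    """Map a source file to its mirrored .md path (hypothetical helper)."""
    # Default output directory documented for directories: <src>/converted
    if out_root is None:
        out_root = src_root / "converted"
    # Keep the relative sub-path and swap the extension for .md
    relative = src_file.relative_to(src_root)
    return out_root / relative.with_suffix(".md")


print(mirrored_output(Path("reports"), Path("reports/q1/sales.xlsx")))
# reports/converted/q1/sales.md (on POSIX)
```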

Convert a single file

mdpack convert proposal.docx -o out/

Options

-o, --output PATH — output directory (default: <src>/converted for dirs).
--force — re-convert even if the output is newer than the source.
--quiet — only print errors.
--no-progress — hide the progress bar.
-j, --jobs N — worker threads (default 1). Helps for DOCX/PPTX-heavy trees.
--max-size SIZE — skip files larger than this (default 100MB, 0 = unlimited).
--pdf-max-size SIZE — separate cap for PDFs (default 50MB).
--include-hidden / --follow-symlinks — both off by default.
--exclude PATTERN — gitignore-syntax pattern, repeatable.
--ignore-file PATH — extra ignore file (default: <src>/.mdpackignore if present).
--respect-gitignore — also honour <src>/.gitignore.
--max-depth N — cap recursion depth.

Incremental by default — mdpack skips files whose output is newer than the source.
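One plausible way to express that skip rule is a modification-time comparison. The sketch below assumes mtimes are the freshness signal, which the text implies but does not state outright; `needs_conversion` is a hypothetical name, not an mdpack function.

```python
from pathlib import Path


def needs_conversion(src: Path, out: Path, force: bool = False) -> bool:
    """Return True when src should be (re)converted (hypothetical helper)."""
    # --force, or no output yet: always convert.
    if force or not out.exists():
        return True
    # Skip only when the output is strictly newer than the source.
    return out.stat().st_mtime <= src.stat().st_mtime
```

Treating equal timestamps as stale errs on the side of converting once more, which is cheaper than serving outdated Markdown.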

Large directories

Pointing mdpack at ~ or ~/Documents will not cripple your machine: by default junk dirs (.git, node_modules, .venv, __pycache__, Library, .Trash, …) are skipped, files over 100 MB are skipped, PDFs over 50 MB are skipped, and symlinks are not followed. Drop a .mdpackignore file (gitignore syntax) at the root for project-specific overrides:

drafts/
*.tmp.csv
archive/2024-old/
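The default guards described above (junk directories, size cap, no symlinks) could look roughly like the following walk. It is a sketch under assumptions: `JUNK_DIRS` is only the subset named in the text, 100 MB is read as binary megabytes, and `.mdpackignore` pattern matching is deliberately left out. `iter_candidates` is a hypothetical name.

```python
import os
from pathlib import Path
from typing import Iterator

# Assumed subset of the junk directories listed in the text.
JUNK_DIRS = {".git", "node_modules", ".venv", "__pycache__", "Library", ".Trash"}
MAX_SIZE = 100 * 1024 * 1024  # 100 MB, assuming binary megabytes


def iter_candidates(root: Path, max_size: int = MAX_SIZE) -> Iterator[Path]:
    """Yield files worth converting, pruning junk dirs and oversized files."""
    # followlinks=False keeps os.walk from descending into symlinked dirs.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # Prune in place so os.walk never enters junk or hidden directories.
        dirnames[:] = [d for d in dirnames
                       if d not in JUNK_DIRS and not d.startswith(".")]
        for name in sorted(filenames):
            path = Path(dirpath) / name
            # Hidden files and symlinked files are skipped by default.
            if name.startswith(".") or path.is_symlink():
                continue
            # A max_size of 0 means "unlimited".
            if max_size and path.stat().st_size > max_size:
                continue
            yield path
```

Pruning `dirnames` in place matters for large trees: the walk never opens `node_modules` or `.git` at all, rather than listing them and discarding the results.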

Inspect supported formats

mdpack formats

Supported formats:
  csv    .csv
  xlsx   .xlsx
  docx   .docx
  pptx   .pptx
  pdf    .pdf

Watch mode

The killer feature. Instead of running mdpack convert every time you save a file, point mdpack watch at a directory and it stays running — every create / modify / delete / rename is batched with a 1.5s debounce and applied to the output tree.

mdpack watch ~/Desktop/reports
# Watches ~/Desktop/reports, keeps ~/Desktop/reports/converted/ in sync. Ctrl-C to stop.

Or with a separate output directory:

mdpack watch ~/Desktop/A -o ~/Desktop/B
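The 1.5s debounce mentioned above can be pictured as a small batcher that holds events until the directory has been quiet for the full delay. This is an illustrative sketch, not mdpack's implementation: `EventBatcher` is a hypothetical class, timestamps are passed in explicitly to keep it deterministic, and letting the latest event per path win is a simplification.

```python
from typing import Dict, Optional

DEBOUNCE_SECONDS = 1.5  # value quoted in the text


class EventBatcher:
    """Collect file events and release them after a quiet period."""

    def __init__(self, delay: float = DEBOUNCE_SECONDS) -> None:
        self.delay = delay
        self._pending: Dict[str, str] = {}  # path -> latest event kind
        self._last_event: Optional[float] = None

    def add(self, path: str, kind: str, now: float) -> None:
        # A later event for the same path replaces the earlier one.
        self._pending[path] = kind
        self._last_event = now

    def flush_if_quiet(self, now: float) -> Dict[str, str]:
        """Return the batch once `delay` seconds passed since the last event."""
        if self._last_event is None or now - self._last_event < self.delay:
            return {}
        batch, self._pending = self._pending, {}
        self._last_event = None
        return batch


batcher = EventBatcher()
batcher.add("q1/sales.xlsx", "modified", now=0.0)
batcher.add("q1/sales.xlsx", "modified", now=0.4)
print(batcher.flush_if_quiet(now=1.0))  # {} (still inside the quiet window)
print(batcher.flush_if_quiet(now=2.0))  # {'q1/sales.xlsx': 'modified'}
```

Two rapid saves of the same file therefore trigger a single conversion instead of two.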

What it does on each event:

Source change	What happens in output
new `.docx` added	corresponding `.md` created
`.xlsx` edited	`.md` re-generated
`.csv` deleted	`.md` deleted
file renamed	old `.md` deleted, new one created
file inside the output dir touched	ignored (no infinite loops)
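The table can be read as a small dispatch function. The sketch below is an assumption about how such a mapping might be written; `plan_actions`, the event-kind strings, and the action names are all hypothetical and not part of mdpack's interface.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def plan_actions(kind: str, src: Path, src_root: Path, out_root: Path,
                 dest: Optional[Path] = None) -> List[Tuple[str, Path]]:
    """Translate one source event into output-tree actions (hypothetical)."""

    def to_md(p: Path) -> Path:
        # Mirror the relative path and swap the extension for .md
        return out_root / p.relative_to(src_root).with_suffix(".md")

    # Events inside the output directory are ignored to avoid feedback loops.
    if src == out_root or out_root in src.parents:
        return []
    if kind in ("created", "modified"):
        return [("convert", to_md(src))]
    if kind == "deleted":
        return [("delete", to_md(src))]
    if kind == "renamed" and dest is not None:
        return [("delete", to_md(src)), ("convert", to_md(dest))]
    return []
```

The output-directory check comes first because the default output lives inside the watched tree (<src>/converted), so every write the watcher performs would otherwise come back as a new event.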

On startup, watch does one incremental sync pass first, so the output is already aligned when event handling begins. Use --no-initial-sync to skip that, or --force-initial-sync to rebuild everything.

Keeping it running in the background

mdpack watch runs in the foreground. Pick whichever background option suits your setup:

tmux — simplest, survives terminal close:

tmux new -d -s mdpack 'mdpack watch ~/Desktop/reports'
tmux attach -t mdpack          # inspect

nohup — crude but works everywhere:

nohup mdpack watch ~/Desktop/reports > ~/mdpack.log 2>&1 &

launchd (macOS) — start on login, auto-restart on crash. Save as ~/Library/LaunchAgents/com.example.mdpack.plist:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>com.example.mdpack</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/mdpack</string>
    <string>watch</string>
    <string>/Users/you/Desktop/reports</string>
  </array>
  <key>RunAtLoad</key><true/>
  <key>KeepAlive</key><true/>
  <key>StandardOutPath</key><string>/tmp/mdpack.log</string>
  <key>StandardErrorPath</key><string>/tmp/mdpack.log</string>
</dict>
</plist>

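Load it: launchctl load ~/Library/LaunchAgents/com.example.mdpack.plist

If mdpack is installed somewhere other than /usr/local/bin, put the path printed by which mdpack in ProgramArguments.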

systemd user unit (Linux) — similar idea, save as ~/.config/systemd/user/mdpack.service:

[Unit]
Description=mdpack watch

[Service]
ExecStart=/usr/bin/mdpack watch %h/Desktop/reports
Restart=on-failure

[Install]
WantedBy=default.target

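Enable: systemctl --user enable --now mdpack.service

If mdpack is installed elsewhere (for example ~/.local/bin after a pip --user install), update ExecStart to the path printed by which mdpack.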

Pair with mdrag

mdrag is a companion project — a local, offline Markdown semantic-search MCP server for Claude Code / Cursor / Cline.

Fully-automatic pipeline (source changes → Markdown updated → index updated):

# Terminal 1: keep Markdown in sync with the source dir
mdpack watch ~/Desktop/reports -o ~/Desktop/reports-md

# Terminal 2 (or launched by Claude Code): serve the vault + auto-reindex
mdrag vault add reports ~/Desktop/reports-md     # one-time registration
mdrag serve                                      # watches the vault, re-indexes on change

Now edit any .docx / .pptx / .xlsx under ~/Desktop/reports/ — within ~3 seconds, the matching .md is rewritten, mdrag notices and re-embeds it, and your next search from Claude Code sees the updated content. No manual steps. Both tools are loosely coupled — they don't know about each other, they just both watch the middle directory.

What the output looks like

Every converted file gets a YAML frontmatter block so downstream tools know where it came from:

---
title: Q1 Sales Review
source: q1/sales.xlsx
converter: xlsx
converter_version: mdpack 0.3.0
converted_at: 2026-04-16T05:30:00Z
---

# sales

## Summary
| Region | Revenue | YoY |
|---|---|---|
| APAC | 4.2M | +12% |
...
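A downstream tool can read that block without extra dependencies. The sketch below handles only flat `key: value` lines like the sample above; `split_frontmatter` is a hypothetical helper, and a real YAML parser is the safer choice if the frontmatter ever contains nested or quoted values.

```python
from typing import Dict, Tuple


def split_frontmatter(text: str) -> Tuple[Dict[str, str], str]:
    """Split a converted file into (metadata, body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: Dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the Markdown body.
            body = "\n".join(lines[i + 1:])
            return meta, body.lstrip("\n")
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, text  # no closing delimiter: treat as plain Markdown


sample = (
    "---\n"
    "title: Q1 Sales Review\n"
    "source: q1/sales.xlsx\n"
    "converter: xlsx\n"
    "---\n"
    "\n"
    "# sales\n"
)
meta, body = split_frontmatter(sample)
print(meta["source"], "|", body.strip())  # q1/sales.xlsx | # sales
```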

Roadmap

Next up (0.3.1): a separate concurrency lane for PDFs (so -j N parallelises DOCX/PPTX without multiplying Docling's 1 GB model footprint).

Then: HTML and EPUB (pandoc), and ready-to-use background scripts (maybe an mdpack install-service command that writes the plist / systemd unit for you).

Scanned / image-only PDFs (OCR) remain intentionally out of scope — if you need them, run Docling with its OCR pipeline upstream, or use tesseract.

Development

git clone https://github.com/andyleimc-source/mdpack
cd mdpack
python -m venv .venv
.venv/bin/pip install -e ".[dev]"
.venv/bin/pytest

License

MIT

Release history

0.3.0 (this version): Apr 16, 2026
0.2.1: Apr 16, 2026
0.2.0: Apr 16, 2026
0.1.0: Apr 16, 2026
0.0.1: Apr 16, 2026

Download files

Source Distribution

mdpack-0.3.0.tar.gz (27.8 kB)

Uploaded Apr 16, 2026 Source

Built Distribution

mdpack-0.3.0-py3-none-any.whl (24.6 kB)

Uploaded Apr 16, 2026 Python 3

File details

Details for the file mdpack-0.3.0.tar.gz.

File metadata

Download URL: mdpack-0.3.0.tar.gz
Upload date: Apr 16, 2026
Size: 27.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for mdpack-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`0e1db3f6fe36be236384be9c141459abc6d7c095f5ca0bfe8b1ff9ffeb9f5aeb`
MD5	`375e69318b5645380c69e8b4f31ee09c`
BLAKE2b-256	`f2c30488fc44420a2d47794c3a7a4b6c8a8dbae3f46c9cb656f78ad00d9132a5`

File details

Details for the file mdpack-0.3.0-py3-none-any.whl.

File metadata

Download URL: mdpack-0.3.0-py3-none-any.whl
Upload date: Apr 16, 2026
Size: 24.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for mdpack-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`63ec011ebeed18aaf6b8afba41e943e4f8b2508d36f108e0651b65de5d579355`
MD5	`e33900a8dc1b5d013dadb758970c9cd7`
BLAKE2b-256	`da8c0df03ad5495039b244616b717220c15d525bb37fb0f43a83c1ab9d8275ff`
