Batch conversion, asset extraction, and RAG-ready output toolkit for Microsoft MarkItDown.
MarkItDown Plus
Batch conversion, asset extraction, RAG-ready Markdown, JSONL chunks, and cleaner AI document pipelines for Microsoft MarkItDown.
MarkItDown Plus is an enhancement toolkit built on top of Microsoft MarkItDown. It adds folder conversion, recursive processing, optional parallel workers, Markdown cleanup, multiple chunking strategies, lightweight asset extraction, conversion manifests, and JSONL output for RAG workflows.
This project is independent and is not affiliated with Microsoft. It is designed as a companion CLI for the Microsoft MarkItDown ecosystem.
Why MarkItDown Plus?
Microsoft MarkItDown is excellent for converting individual files to Markdown. MarkItDown Plus focuses on the next step: turning many documents into clean, AI-ready project output.
Key features:
- Batch convert files and folders
- Recursive directory conversion
- Parallel conversion with --workers
- Optional tqdm progress with --progress
- RAG-ready JSONL chunk export
- Chunk strategies: heading, fixed, semantic-lite
- Markdown cleanup for common PDF/document artifacts
- Basic asset extraction for DOCX / PPTX / XLSX / HTML
- manifest.json, failed.json, and large-run JSONL manifest streaming
- Unicode-safe output filenames
- PayPal funding link included through GitHub Sponsors/Funding
Installation
pip install markitdown-plus
For progress bars:
pip install "markitdown-plus[progress]"
For development tests and coverage:
pip install -e ".[dev]"
pytest
Quick Start
Convert a folder:
markitdown-plus convert ./docs --output ./out
Convert recursively:
markitdown-plus convert ./docs --output ./out --recursive
Convert only specific file types:
markitdown-plus convert ./docs --output ./out --types pdf,docx,pptx,xlsx,html,csv
Clean Markdown and export RAG chunks:
markitdown-plus convert ./docs --output ./out --clean --rag
Use parallel workers:
markitdown-plus convert ./docs --output ./out --recursive --workers 4 --progress
Use auto worker count:
markitdown-plus convert ./docs --output ./out --workers 0
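The README does not say how the automatic worker count is chosen. As a rough sketch only, one common approach is to fall back to the CPU count when 0 is requested. The names `resolve_workers` and `convert_all` are hypothetical and are not part of the MarkItDown Plus API.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def resolve_workers(requested: int) -> int:
    """Map a requested worker count to a real one; 0 means 'pick for me'."""
    if requested < 0:
        raise ValueError("workers must be >= 0")
    if requested == 0:
        # Assumption: auto mode uses the CPU count; os.cpu_count() may return None.
        return os.cpu_count() or 1
    return requested


def convert_all(items: Iterable[T], convert: Callable[[T], R], workers: int) -> List[R]:
    """Run convert over items in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=resolve_workers(workers)) as pool:
        return list(pool.map(convert, items))
```

Here `convert` stands in for whatever per-file conversion function is used; `pool.map` returns results in input order even when files finish out of order.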
Extract assets when supported:
markitdown-plus convert ./docs --output ./out --extract-assets
Use a specific chunking strategy:
markitdown-plus convert ./docs --output ./out --rag --chunk-strategy semantic-lite
Output Structure
A normal batch run creates:
out/
  markdown/
    report.md
  metadata/
    report.json
  manifest.json
With RAG enabled:
out/
  markdown/
    report.md
  chunks/
    report.jsonl
  metadata/
    report.json
  manifest.json
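Because the chunk files are JSONL (one JSON object per line), they can be read with the standard library alone. The sketch below makes no assumption about the record fields, which are not documented here. The helper name `iter_jsonl` is hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

For example, `for record in iter_jsonl(Path("out/chunks/report.jsonl")): print(sorted(record))` lists the keys present in each chunk record, which is a quick way to inspect the schema before wiring the file into an embedding pipeline.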
With asset extraction enabled:
out/
  markdown/
    report.md
  assets/
    report_img_001.png
    report_img_002.jpg
  metadata/
    report.json
  manifest.json
For very large jobs, MarkItDown Plus avoids huge manifest.json files by streaming records:
out/
  manifest.json
  manifest-records.jsonl
  failed.jsonl
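Streaming helps because each record is written as soon as it is produced, so the full list never has to sit in memory or be serialized as one large JSON array. The following is a minimal sketch of that idea, not the tool's actual writer. The function name `stream_records` and any record fields are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterable


def stream_records(records: Iterable[dict], path: Path) -> int:
    """Write records one per line as they arrive; return how many were written."""
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            # ensure_ascii=False keeps non-ASCII filenames readable in the output
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Passing a generator as `records` keeps memory use roughly constant regardless of how many files the run covers.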
Chunk Strategies
heading
Default. Preserves Markdown heading paths and is best for most structured documents.
markitdown-plus convert ./docs -o ./out --rag --chunk-strategy heading
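To illustrate what "preserving heading paths" can mean, here is a minimal sketch that splits Markdown at ATX headings and tags each chunk with the headings above it. It is an illustration under stated assumptions, not the tool's implementation: the function name `chunk_by_heading` and the `heading_path` and `text` keys are hypothetical, and the sketch does not skip `#` lines inside fenced code blocks.

```python
import re
from typing import Dict, List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_heading(markdown: str) -> List[Dict[str, object]]:
    """Split Markdown at ATX headings, tagging each chunk with its heading path."""
    chunks: List[Dict[str, object]] = []
    stack: List[Tuple[int, str]] = []  # (level, title) of the enclosing headings
    body: List[str] = []               # lines collected since the last heading

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            path = [title for _, title in stack]
            chunks.append({"heading_path": path, "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Leave any section at the same or a deeper level, then enter this one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

Keeping the path with each chunk lets a retriever show or filter on context such as "Report > Methods" even though the chunk text itself no longer contains the parent headings.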
fixed
Creates stable chunk sizes and ignores heading boundaries. Useful for embedding pipelines that prefer consistent lengths.
markitdown-plus convert ./docs -o ./out --rag --chunk-strategy fixed
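As a rough sketch of fixed-size chunking, the function below cuts text into equal-length slices without regard to headings. It assumes the size is measured in characters, which the README does not specify. The name `chunk_fixed` and the `size` parameter are hypothetical.

```python
from typing import List


def chunk_fixed(text: str, size: int = 1000) -> List[str]:
    """Cut text into consecutive slices of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    # Every slice has exactly `size` characters except possibly the last one.
    return [text[start:start + size] for start in range(0, len(text), size)]
```

Because the cut points depend only on character offsets, re-running on unchanged input yields identical chunks, which is what makes the lengths predictable for embedding models with fixed input limits.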
semantic-lite
Dependency-free, rule-based topical splitting. It starts a new chunk at obvious semantic cues: headings, paragraphs marked as a summary, conclusion, or recommendations, and other section-like paragraphs.
markitdown-plus convert ./docs -o ./out --rag --chunk-strategy semantic-lite
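The rule-based idea can be sketched in a few lines: split the text into paragraphs, then open a new chunk whenever a paragraph begins with a cue. The cue list below is an assumption chosen for illustration and covers only headings plus three keywords. The tool's real rules are not documented here, and `chunk_semantic_lite` is a hypothetical name.

```python
import re
from typing import List

# Assumed cues: an ATX heading, or a paragraph opening with one of these words.
CUE = re.compile(r"^(#{1,6}\s|summary\b|conclusions?\b|recommendations?\b)", re.IGNORECASE)


def chunk_semantic_lite(markdown: str) -> List[str]:
    """Group paragraphs into chunks, starting a new chunk at each cue paragraph."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", markdown) if p.strip()]
    groups: List[List[str]] = []
    for paragraph in paragraphs:
        if not groups or CUE.match(paragraph):
            groups.append([paragraph])     # cue (or first paragraph): open a new chunk
        else:
            groups[-1].append(paragraph)   # otherwise extend the current chunk
    return ["\n\n".join(group) for group in groups]
```

This needs no model or external dependency, at the cost of missing topic shifts that are not signaled by one of the listed cues.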
Asset Extraction
--extract-assets currently supports lightweight extraction for:
- .docx
- .pptx
- .xlsx
- .html / .htm local image references
PDF image extraction is intentionally left for a later version because reliable PDF asset extraction requires heavier format-specific dependencies.
When assets are extracted, MarkItDown Plus appends an Extracted Assets section to the generated Markdown and records asset metadata in the file-level metadata JSON.
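Lightweight extraction is feasible for the Office formats because .docx, .pptx, and .xlsx files are ZIP archives that keep embedded media under `word/media/`, `ppt/media/`, and `xl/media/` respectively. The sketch below uses only the standard library and mirrors the `report_img_001.png` naming shown in the output layout. It is an illustration, not the tool's code, and `extract_media` is a hypothetical name.

```python
import zipfile
from pathlib import Path
from typing import List

MEDIA_PREFIXES = ("word/media/", "ppt/media/", "xl/media/")


def extract_media(document: Path, assets_dir: Path) -> List[Path]:
    """Copy embedded media out of an Office Open XML file; return the new paths."""
    assets_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    with zipfile.ZipFile(document) as archive:
        members = sorted(
            name for name in archive.namelist()
            if name.startswith(MEDIA_PREFIXES) and not name.endswith("/")
        )
        for index, name in enumerate(members, start=1):
            suffix = Path(name).suffix.lower()
            # Assumed naming scheme: <document stem>_img_<3-digit index><extension>
            target = assets_dir / f"{document.stem}_img_{index:03d}{suffix}"
            target.write_bytes(archive.read(name))
            written.append(target)
    return written
```

Writing each member under a generated name, instead of calling `extractall`, avoids trusting paths stored inside the archive.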
Single File Commands
Convert one file directly:
markitdown-plus single report.pdf -o report.md
Clean an existing Markdown file:
markitdown-plus clean dirty.md -o clean.md
Chunk an existing Markdown file:
markitdown-plus chunk clean.md -o chunks.jsonl --chunk-strategy fixed
Development
git clone https://github.com/lamguo/markitdown-plus.git
cd markitdown-plus
pip install -e ".[dev]"
pytest
The test configuration includes a coverage gate:
pytest --cov=markitdown_plus --cov-fail-under=85
Optional property and benchmark tests are included. They are skipped automatically if hypothesis or pytest-benchmark is not installed.
GitHub Topics
Suggested topics for the repository:
markitdown
microsoft-markitdown
markdown
rag
llm
document-conversion
pdf-to-markdown
docx-to-markdown
batch-conversion
jsonl
asset-extraction
ai-tools
Support This Project
If MarkItDown Plus helps you save time or build better AI document pipelines, you can support development here:
- Star this repository
- Support via PayPal: https://www.paypal.me/lamguo
Thank you for supporting open-source development.
License
MIT License.