Batch conversion, asset extraction, and RAG-ready output toolkit for Microsoft MarkItDown.

MarkItDown Plus

Batch conversion, asset extraction, RAG-ready Markdown, JSONL chunks, and cleaner AI document pipelines for Microsoft MarkItDown.

MarkItDown Plus is an enhancement toolkit built on top of Microsoft MarkItDown. It adds folder conversion, recursive processing, optional parallel workers, Markdown cleanup, multiple chunking strategies, lightweight asset extraction, conversion manifests, and JSONL output for RAG workflows.

This project is independent and is not affiliated with Microsoft. It is designed as a companion CLI for the Microsoft MarkItDown ecosystem.

Why MarkItDown Plus?

Microsoft MarkItDown is excellent for converting individual files to Markdown. MarkItDown Plus focuses on the next step: turning many documents into clean, AI-ready project output.

Key features:

  • Batch convert files and folders
  • Recursive directory conversion
  • Parallel conversion with --workers
  • Optional tqdm progress with --progress
  • RAG-ready JSONL chunk export
  • Chunk strategies: heading, fixed, semantic-lite
  • Markdown cleanup for common PDF/document artifacts
  • Basic asset extraction for DOCX / PPTX / XLSX / HTML
  • manifest.json, failed.json, and large-run JSONL manifest streaming
  • Unicode-safe output filenames
  • PayPal funding link included through GitHub Sponsors/Funding

Installation

pip install markitdown-plus

For progress bars:

pip install "markitdown-plus[progress]"

For development tests and coverage:

pip install -e ".[dev]"
pytest

Quick Start

Convert a folder:

markitdown-plus convert ./docs --output ./out

Convert recursively:

markitdown-plus convert ./docs --output ./out --recursive

Convert only specific file types:

markitdown-plus convert ./docs --output ./out --types pdf,docx,pptx,xlsx,html,csv

Clean Markdown and export RAG chunks:

markitdown-plus convert ./docs --output ./out --clean --rag

Use parallel workers:

markitdown-plus convert ./docs --output ./out --recursive --workers 4 --progress

Use auto worker count:

markitdown-plus convert ./docs --output ./out --workers 0
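
What `--workers` does can be pictured with a small thread-pool sketch. This is not MarkItDown Plus's own implementation: `resolve_workers` and `convert_many` are hypothetical names, and treating 0 as "use the CPU count" is an assumption about what the auto setting means.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def resolve_workers(requested: int) -> int:
    """Turn a requested worker count into a concrete one (0 = auto, an assumption)."""
    if requested < 0:
        raise ValueError("worker count must be >= 0")
    if requested == 0:
        return os.cpu_count() or 1
    return requested


def convert_many(paths: Iterable[str], convert: Callable[[str], str], workers: int = 0) -> dict:
    """Run convert(path) for every path in a thread pool, keeping failures separate."""
    results, failed = {}, {}
    with ThreadPoolExecutor(max_workers=resolve_workers(workers)) as pool:
        futures = {path: pool.submit(convert, path) for path in paths}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except Exception as exc:  # one bad file should not stop the batch
                failed[path] = str(exc)
    return {"results": results, "failed": failed}
```

With Microsoft MarkItDown installed, `convert` could be something like `lambda p: MarkItDown().convert(p).text_content`; any callable that maps a path to Markdown text works.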

Extract assets when supported:

markitdown-plus convert ./docs --output ./out --extract-assets

Use a specific chunking strategy:

markitdown-plus convert ./docs --output ./out --rag --chunk-strategy semantic-lite

Output Structure

A normal batch run creates:

out/
  markdown/
    report.md
  metadata/
    report.json
  manifest.json

With RAG enabled:

out/
  markdown/
    report.md
  chunks/
    report.jsonl
  metadata/
    report.json
  manifest.json
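
The chunk files are JSONL, so downstream code can read them one line at a time. The sketch below assumes nothing about the record fields, which this README does not list; `load_chunks` is a hypothetical helper, not part of the package.

```python
import json
from pathlib import Path
from typing import Iterator


def load_chunks(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because it is a generator, a large `chunks/report.jsonl` can be fed to an embedding step without loading the whole file into memory.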

With asset extraction enabled:

out/
  markdown/
    report.md
  assets/
    report_img_001.png
    report_img_002.jpg
  metadata/
    report.json
  manifest.json

For very large jobs, MarkItDown Plus avoids a huge manifest.json by streaming per-file records to JSONL files next to it:

out/
  manifest.json
  manifest-records.jsonl
  failed.jsonl
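
The idea behind streaming is to write one JSON object per line as each file finishes, rather than building one large JSON document in memory. A minimal sketch, assuming a hypothetical `stream_records` helper and record fields chosen by the caller:

```python
import json
from typing import Iterable


def stream_records(records: Iterable[dict], path: str) -> int:
    """Write each record as one JSON line and return how many were written."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

`records` can be a generator, so memory use stays flat however many files the run covers.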

Chunk Strategies

heading

Default. Preserves Markdown heading paths and is best for most structured documents.

markitdown-plus convert ./docs -o ./out --rag --chunk-strategy heading
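
To illustrate what preserving heading paths means, here is a rough sketch of heading-based chunking. It is an illustration only, not the package's code: `chunk_by_heading` and the `heading_path` / `text` keys are made-up names, and it does not special-case `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown at ATX headings and tag each chunk with its heading path."""
    chunks = []
    path = []  # stack of (level, title) for the headings currently in scope
    body = []  # lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"heading_path": [title for _, title in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Leave any heading at the same or a deeper level before entering this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

A chunk under `## Setup` inside `# Guide` would carry the path `["Guide", "Setup"]`, which gives a retriever some context about where the text came from.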

fixed

Creates stable chunk sizes and ignores heading boundaries. Useful for embedding pipelines that prefer consistent lengths.

markitdown-plus convert ./docs -o ./out --rag --chunk-strategy fixed
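
A minimal sketch of fixed-size chunking, assuming sizes are counted in characters (the tool may count differently, and its default size is not documented here). `chunk_fixed` is a hypothetical name.

```python
def chunk_fixed(text: str, size: int = 1000) -> list[str]:
    """Cut text into consecutive pieces of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[start:start + size] for start in range(0, len(text), size)]
```

Every chunk except possibly the last has exactly `size` characters, which is the property that makes this strategy predictable for embedding pipelines.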

semantic-lite

Dependency-free, rule-based topical splitting. It starts a new chunk at obvious semantic cues such as headings, summary, conclusion, recommendations, and other section-like paragraphs.

markitdown-plus convert ./docs -o ./out --rag --chunk-strategy semantic-lite
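
The rule-based approach can be sketched in a few lines. The cue list below uses only the examples named above; the tool's actual rules are not documented here, so treat `CUE` and `chunk_semantic_lite` as illustrative assumptions.

```python
import re

# A paragraph opens a new chunk if it starts with a Markdown heading or a cue word.
CUE = re.compile(r"^(#{1,6}\s+\S|(summary|conclusion|recommendations)\b)", re.IGNORECASE)


def chunk_semantic_lite(text: str) -> list[str]:
    """Group paragraphs, starting a new chunk whenever a paragraph opens with a cue."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    groups = []
    for paragraph in paragraphs:
        if groups and not CUE.match(paragraph):
            groups[-1].append(paragraph)
        else:
            groups.append([paragraph])
    return ["\n\n".join(group) for group in groups]
```

No model or external library is involved, which is what keeps a strategy like this dependency-free, at the cost of missing topic shifts that carry no textual cue.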

Asset Extraction

--extract-assets currently supports lightweight extraction for:

  • .docx
  • .pptx
  • .xlsx
  • .html / .htm local image references

PDF image extraction is intentionally left for a later version because reliable PDF asset extraction requires heavier format-specific dependencies.

When assets are extracted, MarkItDown Plus appends an Extracted Assets section to the generated Markdown and records asset metadata in the file-level metadata JSON.
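
DOCX, PPTX, and XLSX files are ZIP archives that keep embedded media under `word/media/`, `ppt/media/`, and `xl/media/`, which is why lightweight extraction is possible with the standard library alone. The sketch below is not the package's implementation; `extract_media` is a hypothetical helper, and the `<stem>_img_NNN` naming simply mirrors the example output tree above.

```python
import zipfile
from pathlib import Path, PurePosixPath

MEDIA_DIRS = ("word/media/", "ppt/media/", "xl/media/")


def extract_media(document: str, assets_dir: str) -> list[str]:
    """Copy embedded media out of a .docx/.pptx/.xlsx file; return the new file names."""
    stem = Path(document).stem
    out_dir = Path(assets_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(document) as archive:
        members = sorted(
            name for name in archive.namelist()
            if name.startswith(MEDIA_DIRS) and not name.endswith("/")
        )
        for index, member in enumerate(members, start=1):
            suffix = PurePosixPath(member).suffix.lower()
            name = f"{stem}_img_{index:03d}{suffix}"
            (out_dir / name).write_bytes(archive.read(member))
            written.append(name)
    return written
```

Reading members with `archive.read` and choosing the output name ourselves avoids writing archive-controlled paths to disk.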

Single File Commands

Convert one file directly:

markitdown-plus single report.pdf -o report.md

Clean an existing Markdown file:

markitdown-plus clean dirty.md -o clean.md
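
The cleanup rules are not spelled out in this README. As a rough idea of what artifact cleanup can involve, the sketch below applies three common fixes for PDF-derived text; `clean_markdown` and the choice of rules are assumptions, not the tool's behavior.

```python
import re


def clean_markdown(text: str) -> str:
    """Apply a few conservative fixes for text extracted from PDFs."""
    text = text.replace("\r\n", "\n")
    # Rejoin words split by a hyphen at a line break, e.g. "conver-\nsion".
    text = re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)
    # Remove trailing spaces and tabs on every line.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # Collapse runs of blank lines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"
```

The hyphen rule can also merge genuine compounds that happen to break across lines, so a real cleaner would need a more careful check.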

Chunk an existing Markdown file:

markitdown-plus chunk clean.md -o chunks.jsonl --chunk-strategy fixed

Development

git clone https://github.com/lamguo/markitdown-plus.git
cd markitdown-plus
pip install -e ".[dev]"
pytest

The test configuration includes a coverage gate:

pytest --cov=markitdown_plus --cov-fail-under=85

Optional property and benchmark tests are included. They are skipped automatically if hypothesis or pytest-benchmark is not installed.

GitHub Topics

Suggested topics for the repository:

markitdown
microsoft-markitdown
markdown
rag
llm
document-conversion
pdf-to-markdown
docx-to-markdown
batch-conversion
jsonl
asset-extraction
ai-tools

Support This Project

If MarkItDown Plus helps you save time or build better AI document pipelines, you can support development through the funding link on the project's GitHub repository.

Thank you for supporting open-source development.

License

MIT License.

