A package that converts almost any file format to Markdown.
MarkItDown-Pro
MarkItDown-Pro is a Python library that converts 50+ document formats into Markdown, built to power RAG (Retrieval-Augmented Generation) pipelines for semantic search. It extends Microsoft MarkItDown with Azure AI services, per-page OCR, and customizable converter pipelines.
Features
- Async-first API -- all public methods are async, designed for concurrent document processing
- Per-page PDF routing -- classifies each page as text or image, extracts text locally, and OCRs only the image pages
- Customizable pipelines -- inject your own converter order per handler to optimize for quality, speed, or cost
- GPT Vision OCR -- concurrent page-by-page OCR via Azure OpenAI (gpt-5.4-mini default)
- Gotenberg integration -- convert Office files to PDF for full OCR via Gotenberg HTTP API
- Azure Document Intelligence -- layout-aware text extraction with the prebuilt-layout model
- Azure Speech-to-Text -- audio transcription with automatic language detection
- Structured logging -- "Component | filename.ext | page N | method | message" format for Log Analytics
- Graceful degradation -- missing API keys or services are handled automatically; converters fall back silently
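The structured log format listed above can be approximated with the standard library alone. This is a minimal sketch, not the library's own logger: the class name `PipeFormatter`, the helper `build_logger`, and the `extra=` keys (`component`, `doc`, `page`, `method`) are all assumptions made for illustration.

```python
import logging


class PipeFormatter(logging.Formatter):
    """Render records as 'Component | filename.ext | page N | method | message'."""

    def format(self, record: logging.LogRecord) -> str:
        # Fields are read from `extra=`; missing ones fall back to "-".
        # The document name uses the key "doc" because "filename" is already
        # a built-in LogRecord attribute and cannot be overwritten via extra.
        component = getattr(record, "component", record.name)
        doc = getattr(record, "doc", "-")
        page = getattr(record, "page", None)
        method = getattr(record, "method", "-")
        page_part = f"page {page}" if page is not None else "-"
        return " | ".join([component, doc, page_part, method, record.getMessage()])


def build_logger(name: str = "markitdown_pro.demo", stream=None) -> logging.Logger:
    """Return a logger whose only handler writes pipe-delimited lines."""
    handler = logging.StreamHandler(stream)  # stream=None means sys.stderr
    handler.setFormatter(PipeFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger().info("OCR complete", extra={"component": "PDFHandler", "doc": "report.pdf", "page": 3, "method": "gpt_vision"})` would emit one pipe-delimited line that is easy to split into columns in a log query.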
Supported Formats
| Category | Formats |
|---|---|
| PDF | .pdf (text, scanned, mixed; per-page routing) |
| Office | .docx, .pptx (via Gotenberg + DocIntelligence + MarkItDown) |
| Spreadsheet | .csv, .tsv, .xls, .xlsx |
| Images | .png, .jpg, .jpeg, .gif, .bmp, .svg, .tiff, .webp, .heic, .heif |
| Audio | .mp3, .wav |
| Email | .eml, .msg, .p7s |
| Archives | .pst (Outlook) |
| E-books | .epub |
| Notebooks | .ipynb |
| Markup | .html, .htm, .xml, .json, .ndjson, .yaml, .yml |
| Text | .txt, .md, .py, .go |
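The table above implies a lookup from file extension to format category. The sketch below shows one way such a lookup could work; the extension sets are copied from the table, while the dictionary keys and the function name `category_for` are illustrative and not part of the library.

```python
from pathlib import Path
from typing import Optional

# Extension groups copied from the Supported Formats table. The category
# keys are labels for this sketch only, not library identifiers.
FORMATS: dict[str, set[str]] = {
    "pdf": {".pdf"},
    "office": {".docx", ".pptx"},
    "spreadsheet": {".csv", ".tsv", ".xls", ".xlsx"},
    "image": {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".svg",
              ".tiff", ".webp", ".heic", ".heif"},
    "audio": {".mp3", ".wav"},
    "email": {".eml", ".msg", ".p7s"},
    "archive": {".pst"},
    "ebook": {".epub"},
    "notebook": {".ipynb"},
    "markup": {".html", ".htm", ".xml", ".json", ".ndjson", ".yaml", ".yml"},
    "text": {".txt", ".md", ".py", ".go"},
}

# Invert the groups into a flat extension -> category mapping.
EXT_TO_CATEGORY = {ext: cat for cat, exts in FORMATS.items() for ext in exts}


def category_for(path: str) -> Optional[str]:
    """Return the category for a path, or None if the extension is unlisted."""
    return EXT_TO_CATEGORY.get(Path(path).suffix.lower())
```

Matching is case-insensitive and uses only the final suffix, so `Report.PDF` maps to `"pdf"` while `backup.tar.gz` is looked up as `.gz` and returns `None`.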
Architecture
ConversionPipeline (async)
|-- detect extension
|-- route to Handler
| |-- try Converter 1 (primary)
| |-- try Converter 2 (fallback)
| |-- try Converter N
|-- validate content
|-- clean markdown
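The try-in-order behavior in the diagram can be sketched as a small async loop. This is an illustration of the fallback idea under stated assumptions, not the library's implementation: `run_chain`, the `Converter` alias, and the `min_chars` check are hypothetical, and `strip()` stands in for the real validation and cleanup steps.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Optional

# A converter here is any async callable taking a path and returning
# Markdown text (or None). This alias is an assumption for the sketch.
Converter = Callable[[str], Awaitable[Optional[str]]]


async def run_chain(
    path: str,
    chain: list[tuple[Converter, str]],
    min_chars: int = 1,
) -> Optional[tuple[str, str]]:
    """Try each (converter, label) in order; return (label, text) or None."""
    for convert, label in chain:
        try:
            text = await convert(path)
        except Exception:
            # A failing converter falls through to the next one.
            continue
        # Stand-in for "validate content": reject empty or too-short output.
        if text and len(text.strip()) >= min_chars:
            # Stand-in for "clean markdown": trim surrounding whitespace.
            return label, text.strip()
    return None
```

The `(converter, label)` tuples mirror the shape used in the custom pipeline examples later in this document; returning the label makes it easy to log which converter produced the result.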
Default Converter Pipelines
| Handler | Pipeline (in order) | What each captures |
|---|---|---|
| PDFHandler | MarkItDown (all-text only) → PagePDFConverter (per-page: PyMuPDF → GPT Vision → DocIntelligence) | Text + images + scanned content |
| OfficeHandler | Gotenberg → DocIntelligence → MarkItDown | Text + images (via Gotenberg PDF conversion) |
| ImageHandler | GPT Vision (primary model) → GPT Vision (fallback model) | OCR on images |
| AudioHandler | Azure Speech | Transcription |
| TabularHandler | openpyxl/pandas | Tables to markdown |
| MarkupHandler | BeautifulSoup/yaml/json | Structured markup |
| TextHandler | chardet encoding detection | Raw text |
| EmailHandler | Python email parser | Email text + attachments |
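The per-page PDF routing in the first table row can be pictured as a classification step followed by a split. The sketch below is a guess at the general idea only: `PageInfo`, `classify_page`, `plan_routes`, and the `min_text_chars` threshold are hypothetical, and the way the library actually applies `MARKITDOWN_MIN_IMAGE_AREA` (default 150000) may differ.

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    number: int
    text: str                # text extracted locally from the page
    largest_image_area: int  # pixel area of the largest embedded image, 0 if none


def classify_page(
    page: PageInfo,
    min_text_chars: int = 20,       # hypothetical threshold
    min_image_area: int = 150_000,  # mirrors the MARKITDOWN_MIN_IMAGE_AREA default
) -> str:
    """Return "text" if local extraction looks sufficient, else "image"."""
    has_text = len(page.text.strip()) >= min_text_chars
    has_large_image = page.largest_image_area >= min_image_area
    # Pages with little text, or with a large image, are sent to OCR.
    return "text" if has_text and not has_large_image else "image"


def plan_routes(pages: list[PageInfo]) -> dict[str, list[int]]:
    """Group page numbers by route so only "image" pages incur OCR calls."""
    routes: dict[str, list[int]] = {"text": [], "image": []}
    for page in pages:
        routes[classify_page(page)].append(page.number)
    return routes
```

In this sketch a page with almost no text is treated as an image page even when it has no large image, on the assumption that a scanned page yields little locally extractable text.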
Installation
Prerequisites
- Python >= 3.13
- uv (package manager)
- System dependencies: ffmpeg (audio)
- Optional: Gotenberg Docker service (for Office → PDF OCR)
Install
git clone https://github.com/your-org/markitdown-pro.git
cd markitdown-pro
# Install all dependencies (creates .venv automatically)
uv sync
# With dev tools (pytest, ruff)
uv sync --dev
Configure Environment
Create a .env file in the project root:
# Azure OpenAI (required for GPT Vision OCR)
AZURE_OPENAI_ENDPOINT="https://<resource>.openai.azure.com"
AZURE_OPENAI_API_KEY="your-key"
AZURE_OPENAI_API_VERSION="2024-12-01-preview"
# Azure Document Intelligence (required for doc intelligence fallback)
AZURE_DOCINTEL_ENDPOINT="https://<resource>.cognitiveservices.azure.com"
AZURE_DOCINTEL_KEY="your-key"
# Azure Speech (required for audio transcription)
AZURE_SPEECH_KEY="your-key"
AZURE_SPEECH_REGION="eastus"
# Gotenberg (optional — for Office → PDF → OCR)
GOTENBERG_URL="http://gotenberg:3000"
# OCR model configuration (optional — defaults shown)
MARKITDOWN_OCR_MODEL="gpt-5.4-mini"
MARKITDOWN_OCR_FALLBACK_MODEL="gpt-5.4"
MARKITDOWN_OCR_TIMEOUT="60.0"
MARKITDOWN_OCR_MAX_RETRIES="6"
MARKITDOWN_MIN_IMAGE_AREA="150000"
# General
LOG_LEVEL=20 # 10=DEBUG, 20=INFO, 30=WARNING
All services are optional -- the library degrades gracefully when credentials are missing.
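To make the defaults and the optional-service behavior concrete, here is a hedged sketch of how such settings could be read. The variable names and default values come from the .env example above; `OcrSettings`, `load_ocr_settings`, `available_services`, and the service labels are hypothetical and do not describe the library's internals.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OcrSettings:
    model: str
    fallback_model: str
    timeout: float
    max_retries: int
    min_image_area: int


def load_ocr_settings(env: Optional[Mapping[str, str]] = None) -> OcrSettings:
    """Read OCR settings, falling back to the defaults shown in the .env example."""
    env = os.environ if env is None else env
    return OcrSettings(
        model=env.get("MARKITDOWN_OCR_MODEL", "gpt-5.4-mini"),
        fallback_model=env.get("MARKITDOWN_OCR_FALLBACK_MODEL", "gpt-5.4"),
        timeout=float(env.get("MARKITDOWN_OCR_TIMEOUT", "60.0")),
        max_retries=int(env.get("MARKITDOWN_OCR_MAX_RETRIES", "6")),
        min_image_area=int(env.get("MARKITDOWN_MIN_IMAGE_AREA", "150000")),
    )


# Variables each optional service needs; the service labels are illustrative.
REQUIRED_VARS: dict[str, tuple[str, ...]] = {
    "gpt_vision": ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"),
    "doc_intelligence": ("AZURE_DOCINTEL_ENDPOINT", "AZURE_DOCINTEL_KEY"),
    "speech": ("AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION"),
    "gotenberg": ("GOTENBERG_URL",),
}


def available_services(env: Optional[Mapping[str, str]] = None) -> set[str]:
    """Return the services whose variables are all set and non-empty."""
    env = os.environ if env is None else env
    return {
        name
        for name, keys in REQUIRED_VARS.items()
        if all(env.get(key) for key in keys)
    }
```

Passing a plain dictionary instead of relying on `os.environ` keeps the functions easy to test, and a check like `"speech" in available_services()` is one way to decide up front whether audio files can be transcribed.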
Usage
Basic
import asyncio

from markitdown_pro.conversion_pipeline import ConversionPipeline


async def main():
    pipeline = ConversionPipeline()
    try:
        md = await pipeline.convert_document_to_md("/path/to/document.pdf")
        print(md)
    finally:
        await pipeline.aclose()


asyncio.run(main())
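Because the API is async, several documents can be converted concurrently with a bounded number of in-flight conversions. The helper below is a sketch, not part of the library: `convert_many` and `max_concurrency` are hypothetical names, and the only thing assumed about the pipeline object is the `convert_document_to_md` coroutine method shown above.

```python
import asyncio


async def convert_many(pipeline, paths, max_concurrency: int = 4) -> dict:
    """Convert paths concurrently; map each path to Markdown or its exception."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def convert_one(path):
        # The semaphore caps how many conversions run at the same time.
        async with semaphore:
            try:
                return path, await pipeline.convert_document_to_md(path)
            except Exception as exc:
                # Keep going: one bad file should not cancel the batch.
                return path, exc

    results = await asyncio.gather(*(convert_one(p) for p in paths))
    return dict(results)
```

Returning exceptions as values rather than raising lets the caller inspect failures per file after the whole batch has finished; the cap matters when conversions call rate-limited remote services.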
Custom Pipeline (speed-first)
Skip Gotenberg and GPT Vision; try the local MarkItDown converter first and fall back to Azure Document Intelligence only if it fails:
from markitdown_pro.conversion_pipeline import ConversionPipeline
from markitdown_pro.converters.markitdown_converter import MarkItDownConverter
from markitdown_pro.converters.doc_intel_converter import DocIntelligenceConverter
from markitdown_pro.handlers.office_handler import OfficeHandler
# Office: MarkItDown first (fast, local), DocIntelligence fallback
office = OfficeHandler(pipeline=[
    (MarkItDownConverter(), "MarkItDown"),
    (DocIntelligenceConverter(), "DocIntelligence"),
])
pipeline = ConversionPipeline(office_handler=office)
Custom Pipeline (quality-first with Gotenberg)
Ensure Office files go through Gotenberg for full OCR:
from markitdown_pro.converters.gotenberg_converter import GotenbergConverter
from markitdown_pro.converters.doc_intel_converter import DocIntelligenceConverter
from markitdown_pro.converters.markitdown_converter import MarkItDownConverter
from markitdown_pro.handlers.office_handler import OfficeHandler
office = OfficeHandler(pipeline=[
    (GotenbergConverter(gotenberg_url="http://localhost:3000"), "Gotenberg"),
    (DocIntelligenceConverter(), "DocIntelligence"),
    (MarkItDownConverter(), "MarkItDown"),
])
pipeline = ConversionPipeline(office_handler=office)
Custom Pipeline (cost-first, no API calls)
from markitdown_pro.converters.markitdown_converter import MarkItDownConverter
from markitdown_pro.handlers.office_handler import OfficeHandler
office = OfficeHandler(pipeline=[
    (MarkItDownConverter(), "MarkItDown"),
])
pipeline = ConversionPipeline(office_handler=office)
Custom OCR Model
from markitdown_pro.converters.gpt_vision_converter import GPTVisionConverter
from markitdown_pro.handlers.image_handler import ImageHandler
image = ImageHandler(pipeline=[
    GPTVisionConverter(model_name="gpt-5.4-nano"),  # cheapest
])
pipeline = ConversionPipeline(image_handler=image)
All Handler Overrides
pipeline = ConversionPipeline(
    pdf_handler=my_pdf_handler,
    office_handler=my_office_handler,
    image_handler=my_image_handler,
    audio_handler=my_audio_handler,
    # text, tabular, markup, email, epub, pst, ipynb also injectable
)
Gotenberg Setup
Gotenberg converts Office files to PDF for per-page OCR. Run it as a Docker container:
docker run -d -p 3000:3000 gotenberg/gotenberg:8
Or in Docker Compose:
services:
  gotenberg:
    image: gotenberg/gotenberg:8
    ports:
      - "3000:3000"
Set the URL in your environment:
GOTENBERG_URL=http://localhost:3000
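Before relying on the Gotenberg-based pipeline, it can help to confirm the container is reachable. The sketch below uses only the standard library and Gotenberg's `/health` route; the helper names `health_url` and `gotenberg_is_up` are hypothetical, and the assumed response shape (a JSON object with `"status": "up"`) should be checked against the Gotenberg documentation for the version you run.

```python
import json
import urllib.error
import urllib.request


def health_url(base_url: str) -> str:
    """Build the health-check URL from a base such as GOTENBERG_URL."""
    return base_url.rstrip("/") + "/health"


def gotenberg_is_up(base_url: str, timeout: float = 3.0) -> bool:
    """Return True only if the health route answers with status "up"."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Unreachable host, refused connection, timeout, or non-JSON body.
        return False
    # Assumed response shape: {"status": "up", ...}
    return isinstance(payload, dict) and payload.get("status") == "up"
```

A startup check such as `gotenberg_is_up("http://localhost:3000")` lets an application decide whether to include a Gotenberg step in its Office pipeline or to start with a different converter.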
Testing
# Unit tests (fast, no credentials, ~0.2s)
uv run pytest tests/unit/ -v
# Integration tests (requires Azure credentials in .env)
uv run pytest -m integration -v
# Local-only integration tests (no Azure needed)
uv run pytest tests/integration/test_text_handler.py tests/integration/test_markup_handler.py tests/integration/test_tabular_handler.py tests/integration/test_email_handler.py -v
# All tests
uv run pytest tests/unit/ tests/integration/ -v
Test expectations are defined per-handler in tests/data/test_expectations.yaml.
Development
# Lint
uv run ruff check markitdown_pro/
# Format
uv run ruff format markitdown_pro/
# Build package
uv build
License
MIT