A package that converts almost any file format to Markdown.

MarkItDown-Pro

MarkItDown-Pro is a Python library that converts 50+ document formats into Markdown, built to power RAG (Retrieval-Augmented Generation) pipelines for semantic search. It extends Microsoft MarkItDown with Azure AI services, per-page OCR, and customizable converter pipelines.

Features

Async-first API -- all public methods are async, designed for concurrent document processing
Per-page PDF routing -- classifies each page as text or image, extracts text locally and OCRs only image pages
Customizable pipelines -- inject your own converter order per handler to optimize for quality, speed, or cost
GPT Vision OCR -- concurrent page-by-page OCR via Azure OpenAI (gpt-5.4-mini default)
Gotenberg integration -- convert Office files to PDF for full OCR via Gotenberg HTTP API
Azure Document Intelligence -- layout-aware text extraction with the prebuilt-layout model
Azure Speech-to-Text -- audio transcription with automatic language detection
Structured logging -- `Component | filename.ext | page N | method | message` format for Log Analytics
Graceful degradation -- missing API keys or services are handled automatically; converters fall back silently
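
The structured log format in the list above can be pictured with a small helper. This is a minimal sketch, not the library's logger: `format_log_line` and its parameters are hypothetical names, and only the field order comes from the feature list. Dropping the page field when no page applies is also an assumption.

```python
from typing import Optional


def format_log_line(component: str, filename: str, method: str,
                    message: str, page: Optional[int] = None) -> str:
    """Join fields as: Component | filename.ext | page N | method | message.

    Hypothetical helper; the page field is left out when page is None.
    """
    fields = [component, filename]
    if page is not None:
        fields.append(f"page {page}")
    fields.extend([method, message])
    return " | ".join(fields)


print(format_log_line("PDFHandler", "report.pdf", "GPTVision", "OCR complete", page=3))
```

A fixed field order like this keeps log lines easy to split on ` | ` when querying them later.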

Supported Formats

Category	Formats
PDF	`.pdf` (text, scanned, mixed — per-page routing)
Office	`.docx`, `.pptx` (via Gotenberg + DocIntelligence + MarkItDown)
Spreadsheet	`.csv`, `.tsv`, `.xls`, `.xlsx`
Images	`.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.svg`, `.tiff`, `.webp`, `.heic`, `.heif`
Audio	`.mp3`, `.wav`
Email	`.eml`, `.msg`, `.p7s`
Archives	`.pst` (Outlook)
E-books	`.epub`
Notebooks	`.ipynb`
Markup	`.html`, `.htm`, `.xml`, `.json`, `.ndjson`, `.yaml`, `.yml`
Text	`.txt`, `.md`, `.py`, `.go`

Architecture

ConversionPipeline (async)
    |-- detect extension
    |-- route to Handler
    |       |-- try Converter 1 (primary)
    |       |-- try Converter 2 (fallback)
    |       |-- try Converter N
    |-- validate content
    |-- clean markdown

Default Converter Pipelines

Handler	Pipeline (in order)	What each captures
PDFHandler	MarkItDown (all-text only) → PagePDFConverter (per-page: PyMuPDF → GPT Vision → DocIntelligence)	Text + images + scanned content
OfficeHandler	Gotenberg → DocIntelligence → MarkItDown	Text + images (via Gotenberg PDF conversion)
ImageHandler	GPT Vision (primary model) → GPT Vision (fallback model)	OCR on images
AudioHandler	Azure Speech	Transcription
TabularHandler	openpyxl/pandas	Tables to markdown
MarkupHandler	BeautifulSoup/yaml/json	Structured markup
TextHandler	chardet encoding detection	Raw text
EmailHandler	Python email parser	Email text + attachments
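
Per-page PDF routing can be pictured as a classification step that decides which pages go to OCR. The sketch below is an assumption about how such a rule might look, not the rule `PagePDFConverter` actually applies: `PageInfo`, `classify_page`, and the 50-character text threshold are hypothetical, and tying the image-area threshold to the `MARKITDOWN_MIN_IMAGE_AREA` default of 150000 is a guess.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageInfo:
    number: int
    text_chars: int          # characters extracted locally from the page
    largest_image_area: int  # width * height of the largest embedded image, in pixels


def classify_page(page: PageInfo, min_text_chars: int = 50,
                  min_image_area: int = 150_000) -> str:
    """Return "image" if the page likely needs OCR, else "text" (assumed rule)."""
    if page.text_chars < min_text_chars:
        return "image"  # almost no extractable text: probably a scan
    if page.largest_image_area >= min_image_area:
        return "image"  # a large embedded image may carry content worth OCRing
    return "text"


def pages_needing_ocr(pages: List[PageInfo]) -> List[int]:
    """Collect the page numbers that would be sent to OCR."""
    return [p.number for p in pages if classify_page(p) == "image"]


pages = [PageInfo(1, 1800, 0), PageInfo(2, 0, 2_000_000), PageInfo(3, 900, 400_000)]
print(pages_needing_ocr(pages))  # [2, 3]
```

Only the pages returned here would incur OCR cost; the remaining pages keep their locally extracted text.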

Installation

Prerequisites

Python >= 3.13
uv (package manager)
System dependencies: ffmpeg (audio)
Optional: Gotenberg Docker service (for Office → PDF OCR)

Install

git clone https://github.com/your-org/markitdown-pro.git
cd markitdown-pro

# Install all dependencies (creates .venv automatically)
uv sync

# With dev tools (pytest, ruff)
uv sync --dev

Configure Environment

Create a .env file in the project root:

# Azure OpenAI (required for GPT Vision OCR)
AZURE_OPENAI_ENDPOINT="https://<resource>.openai.azure.com"
AZURE_OPENAI_API_KEY="your-key"
AZURE_OPENAI_API_VERSION="2024-12-01-preview"

# Azure Document Intelligence (required for doc intelligence fallback)
AZURE_DOCINTEL_ENDPOINT="https://<resource>.cognitiveservices.azure.com"
AZURE_DOCINTEL_KEY="your-key"

# Azure Speech (required for audio transcription)
AZURE_SPEECH_KEY="your-key"
AZURE_SPEECH_REGION="eastus"

# Gotenberg (optional — for Office → PDF → OCR)
GOTENBERG_URL="http://gotenberg:3000"

# OCR model configuration (optional — defaults shown)
MARKITDOWN_OCR_MODEL="gpt-5.4-mini"
MARKITDOWN_OCR_FALLBACK_MODEL="gpt-5.4"
MARKITDOWN_OCR_TIMEOUT="60.0"
MARKITDOWN_OCR_MAX_RETRIES="6"
MARKITDOWN_MIN_IMAGE_AREA="150000"

# General (10=DEBUG, 20=INFO, 30=WARNING)
LOG_LEVEL=20

All services are optional -- the library degrades gracefully when credentials are missing.
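
As a rough illustration of how these variables and their defaults fit together, the sketch below reads the `MARKITDOWN_*` settings from a mapping and checks whether Azure OpenAI credentials are present. `OcrSettings`, `load_ocr_settings`, and `has_azure_openai` are hypothetical names; only the variable names and default values come from the `.env` example above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class OcrSettings:
    model: str
    fallback_model: str
    timeout: float
    max_retries: int
    min_image_area: int


def load_ocr_settings(env: Mapping[str, str] = os.environ) -> OcrSettings:
    """Read the MARKITDOWN_* variables, falling back to the documented defaults."""
    return OcrSettings(
        model=env.get("MARKITDOWN_OCR_MODEL", "gpt-5.4-mini"),
        fallback_model=env.get("MARKITDOWN_OCR_FALLBACK_MODEL", "gpt-5.4"),
        timeout=float(env.get("MARKITDOWN_OCR_TIMEOUT", "60.0")),
        max_retries=int(env.get("MARKITDOWN_OCR_MAX_RETRIES", "6")),
        min_image_area=int(env.get("MARKITDOWN_MIN_IMAGE_AREA", "150000")),
    )


def has_azure_openai(env: Mapping[str, str] = os.environ) -> bool:
    """True only when both the endpoint and the key are set and non-empty."""
    return bool(env.get("AZURE_OPENAI_ENDPOINT")) and bool(env.get("AZURE_OPENAI_API_KEY"))


print(load_ocr_settings({}))
print(has_azure_openai({}))
```

A check like `has_azure_openai` is one way a converter could decide to skip itself when its credentials are missing.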

Usage

Basic

import asyncio
from markitdown_pro.conversion_pipeline import ConversionPipeline

async def main():
    pipeline = ConversionPipeline()
    try:
        md = await pipeline.convert_document_to_md("/path/to/document.pdf")
        print(md)
    finally:
        await pipeline.aclose()

asyncio.run(main())
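
Because the API is async, several documents can be converted concurrently. The sketch below shows one way to bound that concurrency with a semaphore. `convert_many` and `max_concurrency` are hypothetical names, and `fake_convert` is a stand-in with the same call shape as `convert_document_to_md` so the example runs without credentials.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Tuple


async def convert_many(
    convert: Callable[[str], Awaitable[str]],
    paths: Iterable[str],
    max_concurrency: int = 4,
) -> Dict[str, str]:
    """Convert paths concurrently, running at most max_concurrency conversions at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def convert_one(path: str) -> Tuple[str, str]:
        async with semaphore:
            return path, await convert(path)

    results = await asyncio.gather(*(convert_one(p) for p in paths))
    return dict(results)


async def fake_convert(path: str) -> str:
    # Stand-in for pipeline.convert_document_to_md, so the sketch runs offline.
    await asyncio.sleep(0)
    return f"# {path}"


print(asyncio.run(convert_many(fake_convert, ["a.pdf", "b.docx"])))
```

With a real pipeline, pass `pipeline.convert_document_to_md` as `convert` and call `await pipeline.aclose()` when the batch finishes, as in the basic example.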

Custom Pipeline (speed-first)

Skip Gotenberg and GPT Vision; try the local MarkItDown converter first and fall back to Azure Document Intelligence:

from markitdown_pro.conversion_pipeline import ConversionPipeline
from markitdown_pro.converters.markitdown_converter import MarkItDownConverter
from markitdown_pro.converters.doc_intel_converter import DocIntelligenceConverter
from markitdown_pro.handlers.office_handler import OfficeHandler

# Office: MarkItDown first (fast, local), DocIntelligence fallback
office = OfficeHandler(pipeline=[
    (MarkItDownConverter(), "MarkItDown"),
    (DocIntelligenceConverter(), "DocIntelligence"),
])

pipeline = ConversionPipeline(office_handler=office)

Custom Pipeline (quality-first with Gotenberg)

Ensure Office files go through Gotenberg for full OCR:

from markitdown_pro.converters.gotenberg_converter import GotenbergConverter
from markitdown_pro.converters.doc_intel_converter import DocIntelligenceConverter
from markitdown_pro.converters.markitdown_converter import MarkItDownConverter
from markitdown_pro.handlers.office_handler import OfficeHandler

office = OfficeHandler(pipeline=[
    (GotenbergConverter(gotenberg_url="http://localhost:3000"), "Gotenberg"),
    (DocIntelligenceConverter(), "DocIntelligence"),
    (MarkItDownConverter(), "MarkItDown"),
])

pipeline = ConversionPipeline(office_handler=office)

Custom Pipeline (cost-first, no API calls)

from markitdown_pro.converters.markitdown_converter import MarkItDownConverter
from markitdown_pro.handlers.office_handler import OfficeHandler

office = OfficeHandler(pipeline=[
    (MarkItDownConverter(), "MarkItDown"),
])

pipeline = ConversionPipeline(office_handler=office)

Custom OCR Model

from markitdown_pro.converters.gpt_vision_converter import GPTVisionConverter
from markitdown_pro.handlers.image_handler import ImageHandler

image = ImageHandler(pipeline=[
    GPTVisionConverter(model_name="gpt-5.4-nano"),  # cheapest
])

pipeline = ConversionPipeline(image_handler=image)

All Handler Overrides

pipeline = ConversionPipeline(
    pdf_handler=my_pdf_handler,
    office_handler=my_office_handler,
    image_handler=my_image_handler,
    audio_handler=my_audio_handler,
    # text, tabular, markup, email, epub, pst, ipynb also injectable
)

Gotenberg Setup

Gotenberg converts Office files to PDF for per-page OCR. Run it as a Docker container:

docker run -d -p 3000:3000 gotenberg/gotenberg:8

Or in Docker Compose:

services:
  gotenberg:
    image: gotenberg/gotenberg:8
    ports:
      - "3000:3000"

Set the URL in your environment:

GOTENBERG_URL=http://localhost:3000
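
Before relying on the Gotenberg step, it can help to confirm the service is reachable. The sketch below assumes Gotenberg's `/health` route answers with HTTP 200 when the container is ready; `gotenberg_is_up` is a hypothetical helper and not part of MarkItDown-Pro.

```python
import urllib.error
import urllib.request


def gotenberg_is_up(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200 (assumed route)."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, non-2xx status, or a malformed URL.
        return False


print(gotenberg_is_up("http://localhost:3000"))
```

If the check fails, the Office pipeline can still run through its remaining converters, in line with the fallback behaviour described earlier.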

Testing

# Unit tests (fast, no credentials, ~0.2s)
uv run pytest tests/unit/ -v

# Integration tests (requires Azure credentials in .env)
uv run pytest -m integration -v

# Local-only integration tests (no Azure needed)
uv run pytest tests/integration/test_text_handler.py tests/integration/test_markup_handler.py tests/integration/test_tabular_handler.py tests/integration/test_email_handler.py -v

# All tests
uv run pytest tests/unit/ tests/integration/ -v

Test expectations are defined per-handler in tests/data/test_expectations.yaml.

Development

# Lint
uv run ruff check markitdown_pro/

# Format
uv run ruff format markitdown_pro/

# Build package
uv build

License

MIT

Project details

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history

Version	Date
2.1.2	Apr 24, 2026
2.1.1	Apr 24, 2026
2.1.0 (this version)	Apr 19, 2026
2.0.0	Apr 16, 2026
1.3.7	Oct 24, 2025
1.3.6	Oct 20, 2025
1.3.5	Oct 20, 2025
1.3.4	Oct 19, 2025
1.3.3	Oct 17, 2025
1.3.2	Oct 16, 2025
1.3.1	Oct 16, 2025
1.3.0	Oct 15, 2025
1.2.3	Oct 14, 2025
1.2.2	Oct 8, 2025
1.1.2	Sep 21, 2025
1.1.1	Aug 31, 2025
1.1.0	Aug 23, 2025
1.0.4	Aug 19, 2025
1.0.3	Aug 18, 2025
1.0.2	Aug 15, 2025
1.0.1	Aug 14, 2025
1.0.0	Aug 14, 2025
0.1.1	Jul 24, 2025
0.1.0	May 7, 2025

Download files

Source Distribution

markitdown_pro-2.1.0.tar.gz (31.4 MB)

Uploaded Apr 19, 2026 Source

Built Distribution

markitdown_pro-2.1.0-py3-none-any.whl (58.5 kB)

Uploaded Apr 19, 2026 Python 3

File details

Details for the file markitdown_pro-2.1.0.tar.gz.

File metadata

Download URL: markitdown_pro-2.1.0.tar.gz
Upload date: Apr 19, 2026
Size: 31.4 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for markitdown_pro-2.1.0.tar.gz
Algorithm	Hash digest
SHA256	`81234eff6417530dd78da831eac695161d84bdddc2469599482e7994c18b2186`
MD5	`40b1ea535126d369ebe03d9fe5cdce5a`
BLAKE2b-256	`da9bea3d7dd6310db535a7314f2de3c66079292c856db490e8b31c1db158c0a9`

File details

Details for the file markitdown_pro-2.1.0-py3-none-any.whl.

File metadata

Download URL: markitdown_pro-2.1.0-py3-none-any.whl
Upload date: Apr 19, 2026
Size: 58.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for markitdown_pro-2.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`786fd97ed7172d9acd2f92aa2887c969002a0ffdbd1d33d6fece6facc33d7ae2`
MD5	`e709b2a0c011589fff867a0a8276f8f2`
BLAKE2b-256	`24b5279159644a08dc4552db93cd168bf031d513b60dcb368d4245513404e1b4`
