A package that converts almost any file format to Markdown.

MarkItDown-Pro

MarkItDown-Pro builds on the Microsoft MarkItDown repository, filling gaps and extending functionality by leveraging the Azure Document Intelligence SDK, Unstructured.io, and other Azure services and libraries. The result is a comprehensive Python library and command-line tool designed to convert diverse document formats into Markdown with graceful fallbacks, including OCR support via GPT-4o-mini.

Folder Structure
Features & Highlights
How It Works
File-by-File Explanation
Testing
Usage & Examples
Environment Variables
FAQ

Folder Structure

A typical layout for MarkItDown-Pro might look like this:

```
markitdown-pro/
├── .env
├── README.md
├── requirements.txt
├── main.py
├── conversion_pipeline.py
├── common
│   └── utils.py
├── converters
│   ├── markitdown_wrapper.py
│   ├── azure_docint.py
│   ├── unstructured_wrapper.py
│   └── gpt4o_mini_vision.py
├── handlers
│   ├── pst_handler.py
│   ├── email_handler.py
│   ├── zip_handler.py
│   ├── audio_handler.py
│   └── pdf_handler.py
└── tests
    ├── data
    └── test_markitdownpro.py
```

| Folder/File | Description |
| --- | --- |
| main.py | Entry point for CLI usage; uses `argparse` to accept file paths. |
| conversion_pipeline.py | Orchestrates the fallback chain for converting documents to Markdown. |
| common/ | Shared utility functions, e.g. for file detection, text cleanup, etc. |
| converters/ | Contains modules for using various 3rd-party libraries or services to extract text. |
| handlers/ | Specialized handlers for specific file types (PST, EML, ZIP, audio, PDF scanning). |
| .env | Environment variables (e.g., credentials for Azure GPT-4o-mini, Azure Doc Intelligence). |
| requirements.txt | Python dependencies needed to install and run this project. |
| tests/test_markitdownpro.py | Recursively scans /tests/data/ and attempts to convert each file using convert_document_to_md. |
| README.md | This documentation file, explaining usage and details of the project. |

Features & Highlights

MarkItDown with LLM
- Uses MarkItDown to convert documents to Markdown, optionally leveraging an OpenAI LLM to create image captions if you have an OPENAI_API_KEY.
- Auto-checks for exiftool if you want EXIF metadata in your images.
Whisper-Based Audio Transcription
- Converts audio files (.mp3, .wav, .ogg, etc.) into text using OpenAI Whisper.
- Gracefully falls back if Whisper is not installed.
PST Extraction
- Parses Outlook PST files with libratom, extracting emails and attachments recursively.
Scanned PDF Detection & Concurrency
- Identifies PDFs that have no extractable text or that contain embedded images, and automatically performs OCR on each page with GPT-4o-mini.
- Offers concurrent page-by-page OCR for faster performance.
Fallback to Azure Document Intelligence & Unstructured
- If standard MarkItDown or specialized handlers fail or yield insufficient text, it tries Azure’s Document Intelligence to extract textual layout.
- Also uses the Unstructured.io library for broad coverage of file types.
GPT-4 Vision (or GPT-4o-mini) for Images & OCR
- If an image or partially scanned PDF is detected, we can pass it to GPT-4o-mini for OCR.
- Supports local images (base64 encoding) or remote image URLs directly.
Handles ZIP & EML
- ZIP: Unzips and processes each file inside, concatenating the results.
- EML: Extracts email text, attachments, and processes attachments recursively.
Graceful LLM Handling
- If no OPENAI_API_KEY or GPT-4o-mini credentials are provided, it simply skips LLM-based features, logging a warning.
Helper Methods for URL & Stream Conversion
- convert_document_from_url(url, output_md)
- convert_document_from_stream(stream, extension, output_md)
- convert_document_to_md(local_path, output_md)
Easy-to-Extend Architecture
- Each file type has its own handler.
- Each text-extraction library has its own converter.
- The main pipeline provides a centralized fallback sequence.
Environment-Driven Configuration
- Pulls API keys, endpoints, and paths from .env to keep secrets out of source code.

Rich File Type Handling

| Category | File Type(s) |
| --- | --- |
| PDF | .pdf |
| PowerPoint | .pot, .potm, .ppt, .pptm, .pptx |
| Word Processing | .abw, .doc, .docm, .docx, .dot, .dotm, .hwp, .zabw |
| Excel/Spreadsheet | .et, .fods, .uos1, .uos2, .wk2, .xls, .xlsb, .xlsm, .xlsx, .xlw |
| Images | .bmp, .gif, .heic, .jpeg, .jpg, .png, .prn, .svg, .tiff, .webp |
| Audio | .mp3, .wav, .ogg, .flac, .m4a, .aac, .wma, .webm, .opus |
| HTML | .htm, .html |
| Text-Based Formats | .csv, .json, .xml, .txt |
| ZIP Files | (Iterates over contents) |
| Email | .eml, .p7s |
| PST | .pst |
| EPUB | .epub |
| Markdown | .md |
| Org Mode | .org |
| Open Office | .odt, .sgl |
| Other | .eth, .mw, .pbd, .sdp, .uof, .web |
| Plain Text | .txt |
| reStructured Text | .rst |
| Rich Text | .rtf |
| StarOffice | .sxg |
| TSV | .tsv |
| Apple | .cwk, .mcw, .pages |
| Data Interchange | .dif |
| dBase | .dbf |
| Microsoft Office | .docx, .xlsx, .pptx |
| HEIF Image Format | .heif |

How It Works

Detect File Type: The pipeline checks the file extension or general signature (.pdf, .zip, .eml, .docx, .mp3, etc.).
Specialized Handlers: If the file is PST, EML, ZIP, or audio, it’s handed off to a dedicated module that handles that format.
MarkItDown: For most generic document conversions, we first try MarkItDown.
Unstructured: If MarkItDown fails or yields minimal text, we turn to Unstructured.io next.
- Why? It's typically cheaper than Azure Document Intelligence, and can handle partial OCR scenarios (via Tesseract, PaddleOCR, etc., if you configure OCR_AGENT).
Azure Document Intelligence: If Unstructured also fails or yields minimal text, we try Azure Document Intelligence (prebuilt-layout).
GPT-4o-mini: As a final fallback or specifically for OCR on images/scanned pages.
Saves the extracted text to a .md file once any method returns sufficient content.
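
The fallback sequence above can be pictured as a loop over converters that stops at the first one returning enough text. The following is a minimal sketch, not the project's actual code: `run_fallback_chain`, `has_enough_text`, and the `MIN_CHARS` threshold are hypothetical names and values chosen for illustration, and each converter is assumed to be a callable that takes a path and returns Markdown or `None`.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical threshold: what counts as "sufficient content" is an assumption.
MIN_CHARS = 50

# Assumed converter shape: path in, Markdown text (or None) out.
Converter = Callable[[str], Optional[str]]


def has_enough_text(text: Optional[str], min_chars: int = MIN_CHARS) -> bool:
    """Return True when the text is present and long enough to accept."""
    return text is not None and len(text.strip()) >= min_chars


def run_fallback_chain(
    path: str, converters: Iterable[Tuple[str, Converter]]
) -> Optional[Tuple[str, str]]:
    """Try each (name, converter) pair in order; return the first good result."""
    for name, convert in converters:
        try:
            text = convert(path)
        except Exception:
            # A failing converter must not abort the chain; move to the next one.
            continue
        if has_enough_text(text):
            return name, text
    return None
```

With this shape, changing the order of the fallbacks (for example, trying Unstructured before Azure Document Intelligence) only means reordering the list passed to `run_fallback_chain`.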

File-by-File Explanation

Main Files

conversion_pipeline.py
- The core logic that orchestrates the fallback chain.
- Checks each handler or converter in a specific order.
- Once a successful conversion with enough text is found, it writes to .md and stops.

Common Utils

common/utils.py
- File Detection: Contains helper functions like is_pdf, is_audio, detect_extension.
- Markdown Cleaning: Functions like clean_markdown() and ensure_minimum_content() to tidy up text and ensure it’s not empty.
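
To make the role of these helpers concrete, here is a rough sketch of what they might look like. The function names come from the list above, but their signatures, the `min_chars` default, and the cleanup rules are assumptions made for illustration; the real `common/utils.py` may differ.

```python
import re
from pathlib import Path

# Audio extensions taken from the supported file type table.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".ogg", ".flac", ".m4a", ".aac", ".wma", ".webm", ".opus"}


def detect_extension(path: str) -> str:
    """Return the lowercased file extension, including the leading dot."""
    return Path(path).suffix.lower()


def is_pdf(path: str) -> bool:
    return detect_extension(path) == ".pdf"


def is_audio(path: str) -> bool:
    return detect_extension(path) in AUDIO_EXTENSIONS


def clean_markdown(text: str) -> str:
    """Strip trailing whitespace and collapse runs of blank lines (assumed rules)."""
    lines = [line.rstrip() for line in text.splitlines()]
    collapsed = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return collapsed.strip()


def ensure_minimum_content(text: str, min_chars: int = 50) -> bool:
    """Return True when the cleaned text has at least min_chars characters (assumed threshold)."""
    return len(clean_markdown(text)) >= min_chars
```

Extension-based detection is cheap but can be fooled by misnamed files; checking a file signature as well, as the pipeline description mentions, would be more robust.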

Converters

converters/markitdown_wrapper.py
- Wraps the MarkItDown library for docx/image extraction, EXIF reading, and optional LLM-based image captioning.
- If MarkItDown is not installed, or fails, returns None.
converters/azure_docint.py
- Leverages Azure’s Document Intelligence (prebuilt-layout) to extract text from PDFs and other document types in Markdown format.
converters/unstructured_wrapper.py
- Uses the Unstructured.io library to parse documents. Useful for handling broad, less-common file types.
converters/gpt4o_mini_vision.py
- Uses GPT-4o-mini (Azure ChatOpenAI) for OCR tasks on images or scanned PDFs.
- Concurrent or simple page-by-page approaches for PDFs.
- Can pass URL-based images or local images via Base64 encoding.
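
The concurrent page-by-page approach can be sketched with a thread pool, which suits OCR calls that spend most of their time waiting on a remote service. This is an illustrative sketch only: `ocr_pages_concurrently` is a hypothetical name, and `ocr_page` stands in for whatever function sends one rendered page to GPT-4o-mini and returns its text.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def ocr_pages_concurrently(
    pages: Sequence[bytes],
    ocr_page: Callable[[bytes], str],
    max_workers: int = 4,
) -> str:
    """OCR each page image in parallel and join the results in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, even if pages finish out of order.
        texts: List[str] = list(pool.map(ocr_page, pages))
    # Drop pages that produced no text and separate the rest with blank lines.
    return "\n\n".join(t.strip() for t in texts if t.strip())
```

Keeping `max_workers` small is a reasonable default when the OCR backend enforces rate limits.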

Handlers

handlers/pst_handler.py
- Parses PST archives with libratom and extracts emails + attachments. Calls back into the pipeline for each attachment.
handlers/email_handler.py
- Processes .eml files, extracting plain text, attachments, etc. Recursively processes attachments.
handlers/zip_handler.py
- Unzips files, recurses into the pipeline for each contained file, and concatenates all Markdown output.
handlers/audio_handler.py
- Uses OpenAI Whisper to transcribe .mp3, .wav, .ogg, etc.
- Caches the model in memory to speed up repeated use.
handlers/pdf_handler.py
- Utility to detect if a PDF is text-only, text+images, or fully scanned.
- Coordinates with GPT-4o-mini for OCR if needed.
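
As an illustration of how a handler can call back into the pipeline, the ZIP case might look roughly like the sketch below. `convert_zip` and the `convert_file` callback are hypothetical names, and the per-file heading format is an assumption; only the extract, recurse, and concatenate steps come from the description above.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import Callable, List, Optional


def convert_zip(zip_path: str, convert_file: Callable[[str], Optional[str]]) -> str:
    """Extract an archive, convert each contained file, and concatenate the Markdown."""
    sections: List[str] = []
    with tempfile.TemporaryDirectory() as tmp_dir:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(tmp_dir)
        # Walk the extracted tree in a stable order so output is reproducible.
        for member in sorted(Path(tmp_dir).rglob("*")):
            if not member.is_file():
                continue
            # convert_file stands in for the pipeline entry point (assumption).
            text = convert_file(str(member))
            if text:
                rel_name = member.relative_to(tmp_dir).as_posix()
                sections.append(f"## {rel_name}\n\n{text.strip()}")
    return "\n\n".join(sections)
```

Because `convert_file` is passed in, a nested ZIP inside the archive would be handled by the same pipeline without any special casing here.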

Installation

Clone the Repo

```
git clone https://github.com/YourName/markitdown-pro.git
cd markitdown-pro
```

Create a Virtual Environment (recommended)

```
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
```

Install Dependencies

```
pip install --upgrade pip
pip install -r requirements.txt
```

Note: You may also need system dependencies for libraries like PyMuPDF, libratom, etc.

Set Up .env

Copy the sample .env to your root folder, and fill in your Azure or OpenAI API keys, etc. For example:

```
AZURE_DOCINTEL_ENDPOINT="https://<your-region>.api.cognitive.microsoft.com"
AZURE_DOCINTEL_KEY="YOUR_AZURE_KEY"
AZURE_OPENAI_API_KEY="your azure open ai key"
AZURE_OPENAI_API_VERSION="your azure open ai api version"
AZURE_OPENAI_ENDPOINT="your azure open ai endpoint"
AZURE_SPEECH_ENDPOINT="azure speech service endpoint - for audio conversion"
AZURE_SPEECH_KEY="azure speech service key - for audio conversion"
AZURE_SPEECH_REGION="azure speech service region - for audio conversion"
```

Make sure to source it or ensure python-dotenv can read it.
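
A quick way to confirm the variables were picked up is to check them at startup and log which features would be skipped. The sketch below uses only the standard library and assumes the environment has already been populated (for example by python-dotenv). The grouping of variables by feature, the logger name, and the function names are assumptions inferred from the variable names above, not the project's actual logic.

```python
import logging
import os
from typing import Iterable, List, Mapping

logger = logging.getLogger("markitdown_pro.env_check")  # hypothetical logger name

# Assumed grouping of the .env variables by the feature they enable.
FEATURE_VARS = {
    "Azure Document Intelligence": ("AZURE_DOCINTEL_ENDPOINT", "AZURE_DOCINTEL_KEY"),
    "Azure OpenAI (GPT-4o-mini)": (
        "AZURE_OPENAI_API_KEY",
        "AZURE_OPENAI_API_VERSION",
        "AZURE_OPENAI_ENDPOINT",
    ),
    "Azure Speech": ("AZURE_SPEECH_ENDPOINT", "AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION"),
}


def missing_vars(required: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or blank in the given environment."""
    return [name for name in required if not env.get(name, "").strip()]


def available_features(env: Mapping[str, str] = os.environ) -> List[str]:
    """List features whose variables are all set; warn about the rest."""
    enabled: List[str] = []
    for feature, names in FEATURE_VARS.items():
        missing = missing_vars(names, env)
        if missing:
            logger.warning("%s disabled; missing: %s", feature, ", ".join(missing))
        else:
            enabled.append(feature)
    return enabled
```

Passing `env` explicitly keeps the check easy to test without touching the real process environment.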

Testing

We use pytest for running our test suite. The test files and scripts are located in the /tests directory:

```
pytest tests/test_markitdownpro.py
```

Usage

CLI Usage

Basic:
```
python main.py /path/to/document.pdf
```
This will produce /path/to/document.md if successful.

Specify Output Path:

```
python main.py /path/to/document.pst --output my_pst_output.md
```
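
For readers curious how such a command line could be wired up, the sketch below shows one way to parse these arguments with `argparse` and derive the default output path. It is an assumption about the shape of `main.py`, not a copy of it: `build_parser`, `resolve_output_path`, and the `input_path` argument name are hypothetical.

```python
import argparse
from pathlib import Path
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser accepting an input path and an optional --output path."""
    parser = argparse.ArgumentParser(description="Convert a document to Markdown.")
    parser.add_argument("input_path", help="Path to the document to convert")
    parser.add_argument("--output", default=None, help="Where to write the Markdown file")
    return parser


def resolve_output_path(input_path: str, output: Optional[str]) -> str:
    """Use --output when given; otherwise swap the input's extension for .md."""
    if output:
        return output
    return str(Path(input_path).with_suffix(".md"))


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(resolve_output_path(args.input_path, args.output))
```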

Programmatic Usage

You can import and call the pipeline directly from your Python code:

```python
from conversion_pipeline import convert_document_to_md, convert_document_from_url

# 1) Local file example
md_text = convert_document_to_md("/path/to/my_file.pdf")
print("Extracted Markdown:", md_text)

# 2) URL example
md_from_url = convert_document_from_url("https://example.com/my_doc.docx", output_md="output_doc.md")
print("Output saved to output_doc.md")
```

Extra: Vector Database Chunking

After converting a document to Markdown, it’s common to chunk the text before sending it to a vector database. Here’s a minimal example using LangChain:

```python
# Newer LangChain releases ship this class in the separate langchain_text_splitters package
from langchain.text_splitter import MarkdownTextSplitter

# Load your Markdown content
with open("your_markdown_file.md", "r", encoding="utf-8") as file:
    markdown_text = file.read()

# Initialize the MarkdownTextSplitter
markdown_splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=100)

# Split the text into chunks (a list of LangChain Document objects)
chunks = markdown_splitter.create_documents([markdown_text])

# Each chunk is now ready for embedding and ingestion into your vector database
for chunk in chunks:
    # e.g., push chunk.page_content to your VectorDB
    pass
```

FAQ

What if MarkItDown or Whisper is not installed?

The pipeline checks for each library’s availability. If a library is missing or fails, it gracefully moves on to the next fallback.

Do I need Azure/OpenAI credentials?

- Azure: If you want to use Document Intelligence or GPT-4o-mini, yes.
- OpenAI: If you want MarkItDown’s LLM-based image captioning or are using Whisper from openai’s library, you need appropriate credentials or local models.

How do I handle large PST files?

Large PSTs can be slow to process, especially if they contain many attachments. We parse them message-by-message, recursively handling attachments. For extremely large archives, you might want to increase concurrency or filter out attachments you don’t need.

Does GPT-4o-mini require a publicly accessible image URL?

If you provide a local file path, the code base64-encodes it. This is ideal for truly local images. If you have a publicly hosted image, you can pass its URL directly.
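
The two cases can be sketched as a single helper that returns either the original URL or a base64 data URL for a local file. This is a minimal illustration using the standard library; `to_image_reference` is a hypothetical name, and whether the project builds a data URL in exactly this form is an assumption.

```python
import base64
import mimetypes


def to_image_reference(source: str) -> str:
    """Return a remote URL unchanged, or a base64 data URL for a local image file."""
    if source.startswith(("http://", "https://")):
        return source
    # Guess the MIME type from the file extension; fall back to a generic type.
    mime_type, _ = mimetypes.guess_type(source)
    mime_type = mime_type or "application/octet-stream"
    with open(source, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

Base64 encoding inflates the payload by roughly a third, so very large local images may be worth downscaling before they are sent.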

Why is Unstructured tried before Azure Doc Intelligence now?

We observed that Unstructured is typically lower cost to run (especially with Tesseract or local OCR) compared to Azure’s $10 per 1,000 pages. So if MarkItDown fails, we want to try Unstructured next to potentially save cost. If that also fails, we move to Azure.

Project details

These details have not been verified by PyPI

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history

- 1.3.7: Oct 24, 2025
- 1.3.6: Oct 20, 2025
- 1.3.5: Oct 20, 2025
- 1.3.4: Oct 19, 2025
- 1.3.3: Oct 17, 2025
- 1.3.2: Oct 16, 2025
- 1.3.1: Oct 16, 2025
- 1.3.0: Oct 15, 2025
- 1.2.3: Oct 14, 2025
- 1.2.2: Oct 8, 2025
- 1.1.2: Sep 21, 2025
- 1.1.1: Aug 31, 2025
- 1.1.0: Aug 23, 2025
- 1.0.4: Aug 19, 2025
- 1.0.3: Aug 18, 2025
- 1.0.2: Aug 15, 2025
- 1.0.1: Aug 14, 2025
- 1.0.0: Aug 14, 2025 (this version)
- 0.1.1: Jul 24, 2025
- 0.1.0: May 7, 2025

Download files

Source Distribution

markitdown_pro-1.0.0.tar.gz (39.6 kB)

Uploaded Aug 14, 2025 Source

Built Distribution

markitdown_pro-1.0.0-py3-none-any.whl (46.2 kB)

Uploaded Aug 14, 2025 Python 3

File details

Details for the file markitdown_pro-1.0.0.tar.gz.

File metadata

Download URL: markitdown_pro-1.0.0.tar.gz
Upload date: Aug 14, 2025
Size: 39.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.11

File hashes

Hashes for markitdown_pro-1.0.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `10f8d218388c17deb93c5af6934cd51918f2f7775456d8ca7ddf713ffba9141e` |
| MD5 | `a51a73a7c8af5a5ce674462791ed8419` |
| BLAKE2b-256 | `7f3a4cc61aa95e0626e17f1a06fe913a755ecfe7c19d9ec441f0f3f231162bcd` |

File details

Details for the file markitdown_pro-1.0.0-py3-none-any.whl.

File metadata

Download URL: markitdown_pro-1.0.0-py3-none-any.whl
Upload date: Aug 14, 2025
Size: 46.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.11

File hashes

Hashes for markitdown_pro-1.0.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `c1d2047c83c4e49e7383c95dfa794d83ae59530cf674dd226743f87572d954e3` |
| MD5 | `f4c1815772f38cb5f2456260d79dd208` |
| BLAKE2b-256 | `f8a26d820dcd6375256be3958aa13063585b9aedac3839cae0abfe6b278ec1ae` |
