A package that converts almost any file format to Markdown.
MarkItDown-Pro
MarkItDown-Pro builds on the Microsoft MarkItDown repository, closing gaps and extending functionality by leveraging the Azure Document Intelligence SDK, Unstructured.io, and other Azure services and libraries. The result is a comprehensive Python library and command-line tool that converts diverse document formats into Markdown with graceful fallbacks, including OCR support via GPT-4o-mini.
Table of Contents
- Folder Structure
- Features & Highlights
- How It Works
- File-by-File Explanation
- Testing
- Usage & Examples
- Environment Variables
- FAQ
Folder Structure
A typical layout for MarkItDown-Pro might look like this:
markitdown-pro/
├── .env
├── README.md
├── requirements.txt
├── main.py
├── conversion_pipeline.py
├── common
│   └── utils.py
├── converters
│   ├── markitdown_wrapper.py
│   ├── azure_docint.py
│   ├── unstructured_wrapper.py
│   └── gpt4o_mini_vision.py
├── handlers
│   ├── pst_handler.py
│   ├── email_handler.py
│   ├── zip_handler.py
│   ├── audio_handler.py
│   └── pdf_handler.py
└── tests
    ├── data
    └── test.py
| Folder/File | Description |
|---|---|
| main.py | Entry point for CLI usage; uses argparse to accept file paths. |
| conversion_pipeline.py | Orchestrates the fallback chain for converting documents to Markdown. |
| common/ | Shared utility functions, e.g. for file detection, text cleanup, etc. |
| converters/ | Contains modules for using various 3rd-party libraries or services to extract text. |
| handlers/ | Specialized handlers for specific file types (PST, EML, ZIP, audio, PDF scanning). |
| .env | Environment variables (e.g., credentials for Azure GPT-4o-mini, Azure Doc Intelligence). |
| requirements.txt | Python dependencies needed to install and run this project. |
| tests/test_markitdownpro.py | Recursively scans /tests/data/ and attempts to convert each file using convert_document_to_md. |
| README.md | This documentation file, explaining usage and details of the project. |
Features & Highlights
- MarkItDown with LLM
  - Uses MarkItDown to convert documents to Markdown, optionally using an OpenAI LLM to create image captions if you have an `OPENAI_API_KEY`.
  - Auto-checks for `exiftool` if you want EXIF metadata in your images.
- Whisper-Based Audio Transcription
  - Converts audio files (`.mp3`, `.wav`, `.ogg`, etc.) into text using OpenAI Whisper.
  - Gracefully falls back if Whisper is not installed.
- PST Extraction
  - Parses Outlook PST files with `libratom`, extracting emails and attachments recursively.
- Scanned PDF Detection & Concurrency
  - Identifies PDFs with no extractable text or with embedded images, and automatically performs OCR on each page with GPT-4o-mini.
  - Offers concurrent page-by-page OCR for faster performance.
- Fallback to Azure Document Intelligence & Unstructured
  - If standard MarkItDown or specialized handlers fail or yield insufficient text, the pipeline tries Azure’s Document Intelligence to extract textual layout.
  - Uses the Unstructured.io library for broad coverage of file types.
- GPT-4 Vision (or GPT-4o-mini) for Images & OCR
  - If an image or partially scanned PDF is detected, it can be passed to GPT-4o-mini for OCR.
  - Supports local images (base64 encoding) or remote image URLs directly.
- Handles ZIP & EML
  - ZIP: Unzips and processes each file inside, concatenating the results.
  - EML: Extracts email text and attachments, and processes attachments recursively.
- Graceful LLM Handling
  - If no `OPENAI_API_KEY` or GPT-4o-mini credentials are provided, LLM-based features are skipped and a warning is logged.
- Helper Methods for URL & Stream Conversion
  - `convert_document_from_url(url, output_md)`
  - `convert_document_from_stream(stream, extension, output_md)`
  - `convert_document_to_md(local_path, output_md)`
- Easy-to-Extend Architecture
  - Each file type has its own handler.
  - Each text-extraction library has its own converter.
  - The main pipeline provides a centralized fallback sequence.
- Environment-Driven Configuration
  - Pulls API keys, endpoints, and paths from `.env` to keep secrets out of source code.
- Rich File Type Handling
| Category | File Type(s) |
|---|---|
| PowerPoint | .pot, .potm, .ppt, .pptm, .pptx |
| Word Processing | .abw, .doc, .docm, .docx, .dot, .dotm, .hwp, .zabw |
| Excel/Spreadsheet | .et, .fods, .uos1, .uos2, .wk2, .xls, .xlsb, .xlsm, .xlsx, .xlw |
| Images | .bmp, .gif, .heic, .jpeg, .jpg, .png, .prn, .svg, .tiff, .webp |
| Audio | .mp3, .wav, .ogg, .flac, .m4a, .aac, .wma, .webm, .opus |
| HTML | .htm, .html |
| Text-Based Formats | .csv, .json, .xml, .txt |
| ZIP Files | (Iterates over contents) |
| Email | .eml, .p7s |
| PST | .pst |
| EPUB | .epub |
| Markdown | .md |
| Org Mode | .org |
| Open Office | .odt, .sgl |
| Other | .eth, .mw, .pbd, .sdp, .uof, .web |
| Plain Text | .txt |
| reStructured Text | .rst |
| Rich Text | .rtf |
| StarOffice | .sxg |
| TSV | .tsv |
| Apple | .cwk, .mcw, .pages |
| Data Interchange | .dif |
| dBase | .dbf |
| Microsoft Office | .docx, .xlsx, .pptx |
| HEIF Image Format | .heif |
How It Works
- Detect File Type: The pipeline checks the file extension or general signature (`.pdf`, `.zip`, `.eml`, `.docx`, `.mp3`, etc.).
- Specialized Handlers: If the file is PST, EML, ZIP, or audio, it’s handed off to a dedicated module that handles that format.
- MarkItDown: For most generic document conversions, we first try MarkItDown.
- Unstructured: If MarkItDown fails or yields minimal text, we turn to Unstructured.io next.
  - Why? It's typically cheaper than Azure Document Intelligence, and it can handle partial OCR scenarios (via Tesseract, PaddleOCR, etc., if you configure `OCR_AGENT`).
- Azure Document Intelligence: If Unstructured also fails or yields minimal text, we try Azure Document Intelligence (prebuilt-layout).
- GPT-4o-mini: Used as a final fallback, or specifically for OCR on images and scanned pages.
- Save: The extracted text is written to a `.md` file once any method returns sufficient content.
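The ordering above can be sketched as a simple "first sufficient result wins" loop. This is only an illustration, not the project's actual code: `run_fallback_chain`, the `MIN_CHARS` threshold, and the three stand-in converters are hypothetical names introduced here, and the real pipeline may define "sufficient content" differently.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical threshold: the README only says "sufficient content".
MIN_CHARS = 20

Converter = Callable[[str], Optional[str]]


def run_fallback_chain(
    path: str,
    converters: Sequence[Tuple[str, Converter]],
    min_chars: int = MIN_CHARS,
) -> Tuple[Optional[str], Optional[str]]:
    """Try each converter in order and return (name, markdown) for the
    first one whose stripped output has at least `min_chars` characters."""
    for name, convert in converters:
        try:
            text = convert(path)
        except Exception:
            # A failing converter is treated like an empty result.
            text = None
        if text and len(text.strip()) >= min_chars:
            return name, text
    return None, None


# Stand-in converters (NOT the real MarkItDown / Unstructured / Azure calls).
def fake_markitdown(path: str) -> Optional[str]:
    raise RuntimeError("simulated failure")


def fake_unstructured(path: str) -> Optional[str]:
    return "too short"


def fake_azure_docint(path: str) -> Optional[str]:
    return "# Title\n\nEnough extracted text to count as a real result."


chain = [
    ("markitdown", fake_markitdown),
    ("unstructured", fake_unstructured),
    ("azure_docint", fake_azure_docint),
]
winner, markdown = run_fallback_chain("report.pdf", chain)
print(winner)
```

Keeping the chain as an ordered list of `(name, callable)` pairs means the order can be changed (for example, to try the cheaper converter first) without touching the loop itself.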
File-by-File Explanation
Main Files
`conversion_pipeline.py`: The core logic that orchestrates the fallback chain. It checks each handler or converter in a specific order. Once a conversion succeeds with enough text, it writes the result to `.md` and stops.
Common Utils
`common/utils.py`
- File Detection: Contains helper functions like `is_pdf`, `is_audio`, `detect_extension`.
- Markdown Cleaning: Functions like `clean_markdown()` and `ensure_minimum_content()` to tidy up text and ensure it’s not empty.
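As a rough idea of what helpers with these names could look like, here is a minimal sketch. The function names come from the description above, but the signatures, the `min_chars` default, the `AUDIO_EXTENSIONS` set, and the cleanup rules are all assumptions and may not match `common/utils.py`.

```python
import os
import re

# Hypothetical subset of audio extensions, taken from the file type table.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".ogg", ".flac", ".m4a"}


def detect_extension(path: str) -> str:
    """Return the lowercase extension including the dot, or '' if none."""
    return os.path.splitext(path)[1].lower()


def is_pdf(path: str) -> bool:
    return detect_extension(path) == ".pdf"


def is_audio(path: str) -> bool:
    return detect_extension(path) in AUDIO_EXTENSIONS


def clean_markdown(text: str) -> str:
    """Drop trailing spaces and collapse runs of 3+ newlines to one blank line."""
    text = re.sub(r"[ \t]+\n", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def ensure_minimum_content(text, min_chars: int = 10) -> bool:
    """True if `text` has at least `min_chars` characters after stripping."""
    return bool(text) and len(text.strip()) >= min_chars


print(detect_extension("Report.PDF"), is_audio("talk.MP3"))
print(repr(clean_markdown("# Title  \n\n\n\nBody\n")))
```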
Converters
- `converters/markitdown_wrapper.py`
  - Wraps the MarkItDown library for docx/image extraction, EXIF reading, and optional LLM-based image captioning.
  - If MarkItDown is not installed, or fails, returns `None`.
- `converters/azure_docint.py`
  - Leverages Azure’s Document Intelligence (prebuilt-layout) to extract text from PDFs and other document types in Markdown format.
- `converters/unstructured_wrapper.py`
  - Uses the Unstructured.io library to parse documents. Useful for handling broad, less-common file types.
- `converters/gpt4o_mini_vision.py`
  - Uses GPT-4o-mini (Azure ChatOpenAI) for OCR tasks on images or scanned PDFs.
  - Supports concurrent or simple page-by-page approaches for PDFs.
  - Can pass URL-based images or local images via Base64 encoding.
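Two mechanics mentioned for the vision converter, Base64 encoding of local images and concurrent page-by-page OCR, can be sketched with the standard library alone. Everything below is illustrative: `to_data_url`, `ocr_pages_concurrently`, and `fake_ocr` are hypothetical names, and `fake_ocr` merely stands in for a real GPT-4o-mini request.

```python
import base64
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def ocr_pages_concurrently(
    pages: List[bytes],
    ocr_image: Callable[[str], str],
    max_workers: int = 4,
) -> str:
    """OCR each page image in a thread pool and join results in page order."""
    data_urls = [to_data_url(page) for page in pages]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, whichever page finishes first.
        texts = list(pool.map(ocr_image, data_urls))
    return "\n\n".join(t.strip() for t in texts if t and t.strip())


# Stand-in for a vision-model call: decodes the payload and upper-cases it.
def fake_ocr(data_url: str) -> str:
    payload = data_url.split(",", 1)[1]
    return base64.b64decode(payload).decode("utf-8").upper()


print(ocr_pages_concurrently([b"page one", b"page two"], fake_ocr))
```

Threads suit this workload because each page spends most of its time waiting on a network response rather than using the CPU.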
Handlers
- `handlers/pst_handler.py`
  - Parses PST archives with `libratom` and extracts emails + attachments. Calls back into the pipeline for each attachment.
- `handlers/email_handler.py`
  - Processes `.eml` files, extracting plain text, attachments, etc. Recursively processes attachments.
- `handlers/zip_handler.py`
  - Unzips files, recurses into the pipeline for each contained file, and concatenates all Markdown output.
- `handlers/audio_handler.py`
  - Uses OpenAI Whisper to transcribe `.mp3`, `.wav`, `.ogg`, etc.
  - Caches the model in memory to speed up repeated use.
- `handlers/pdf_handler.py`
  - Utility to detect if a PDF is text-only, text+images, or fully scanned.
  - Coordinates with GPT-4o-mini for OCR if needed.
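The "unzip, recurse, concatenate" behaviour described for the ZIP handler might look roughly like the sketch below. `convert_zip` and `read_text` are hypothetical names; in the real project the callback would presumably be the pipeline's own entry point, which is what would make nested archives recurse. The per-file `##` heading is also an assumption about output layout.

```python
import os
import tempfile
import zipfile
from typing import Callable, Optional


def convert_zip(zip_path: str, convert_file: Callable[[str], Optional[str]]) -> str:
    """Extract the archive to a temp dir, convert each file, join the Markdown."""
    sections = []
    with tempfile.TemporaryDirectory() as tmp_dir:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(tmp_dir)
        for root, dirs, files in os.walk(tmp_dir):
            dirs.sort()  # deterministic traversal order
            for name in sorted(files):
                full_path = os.path.join(root, name)
                markdown = convert_file(full_path)
                if markdown:
                    rel = os.path.relpath(full_path, tmp_dir)
                    sections.append(f"## {rel}\n\n{markdown.strip()}")
    return "\n\n".join(sections)


# Stand-in for the pipeline callback: just reads the file as text.
def read_text(path: str) -> Optional[str]:
    with open(path, encoding="utf-8") as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as work_dir:
    demo_zip = os.path.join(work_dir, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("a.txt", "alpha")
        archive.writestr("b.txt", "beta")
    combined = convert_zip(demo_zip, read_text)

print(combined)
```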
Installation
- Clone the Repo
    git clone https://github.com/YourName/markitdown-pro.git
    cd markitdown-pro
- Create a Virtual Environment (recommended)
    python -m venv venv
    source venv/bin/activate  # or venv\Scripts\activate on Windows
- Install Dependencies
    pip install --upgrade pip
    pip install -r requirements.txt
Note: You may also need system dependencies for libraries like PyMuPDF, libratom, etc.
- Set Up .env
- Copy the sample .env to your root folder, and fill in your Azure or OpenAI API keys, etc. For example:
    AZURE_DOCINTEL_ENDPOINT="https://<your-region>.api.cognitive.microsoft.com"
    AZURE_DOCINTEL_KEY="YOUR_AZURE_KEY"
    AZURE_OPENAI_API_KEY="your azure open ai key"
    AZURE_OPENAI_API_VERSION="your azure open ai api version"
    AZURE_OPENAI_ENDPOINT="your azure open ai endpoint"
    AZURE_SPEECH_ENDPOINT="azure speech service endpoint - for audio conversion"
    AZURE_SPEECH_KEY="azure speech service key - for audio conversion"
    AZURE_SPEECH_REGION="azure speech service region - for audio conversion"
Make sure to source it or ensure python-dotenv can read it.
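Once the variables are in the environment (for example after python-dotenv's `load_dotenv()` has run), a small check like the following could decide which optional features to enable and log a warning for the rest. The variable names come from the sample above; the grouping into features, the `available_features` function, and the logger name are assumptions made for this sketch.

```python
import logging
import os
from typing import Dict, List, Mapping

logger = logging.getLogger("markitdown_pro.config")

# Variable names are from the sample .env; the feature grouping is assumed.
FEATURE_VARS: Dict[str, List[str]] = {
    "azure_docint": ["AZURE_DOCINTEL_ENDPOINT", "AZURE_DOCINTEL_KEY"],
    "azure_openai": [
        "AZURE_OPENAI_API_KEY",
        "AZURE_OPENAI_API_VERSION",
        "AZURE_OPENAI_ENDPOINT",
    ],
    "azure_speech": [
        "AZURE_SPEECH_ENDPOINT",
        "AZURE_SPEECH_KEY",
        "AZURE_SPEECH_REGION",
    ],
}


def available_features(env: Mapping[str, str] = os.environ) -> Dict[str, bool]:
    """Map each feature to True only if all of its variables are non-empty."""
    status = {}
    for feature, names in FEATURE_VARS.items():
        missing = [name for name in names if not env.get(name)]
        if missing:
            logger.warning("Skipping %s: missing %s", feature, ", ".join(missing))
        status[feature] = not missing
    return status


demo_env = {
    "AZURE_DOCINTEL_ENDPOINT": "https://example.invalid",
    "AZURE_DOCINTEL_KEY": "placeholder",
}
print(available_features(demo_env))
```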
Testing
We use pytest for running our test suite. The test files and scripts are located in the /tests directory:
pytest tests/test_markitdownpro.py
Usage
CLI Usage
- Basic:
    python main.py /path/to/document.pdf
  This will produce /path/to/document.md if successful.
- Specify Output Path:
    python main.py /path/to/document.pst --output my_pst_output.md
Programmatic Usage
You can import and call the pipeline directly from your Python code:
from conversion_pipeline import convert_document_to_md, convert_document_from_url
# 1) Local file example
md_text = convert_document_to_md("/path/to/my_file.pdf")
print("Extracted Markdown:", md_text)
# 2) URL example
md_from_url = convert_document_from_url("https://example.com/my_doc.docx", output_md="output_doc.md")
print("Output saved to output_doc.md")
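For in-memory data, one plausible way a stream helper such as `convert_document_from_stream` could work is to spill the stream to a temporary file with the right extension and hand that path to the path-based converter. The sketch below is an assumption about the approach, not the library's implementation; `convert_stream_via_tempfile` and `fake_convert` are hypothetical names.

```python
import io
import os
import tempfile
from typing import BinaryIO, Callable, Optional


def convert_stream_via_tempfile(
    stream: BinaryIO,
    extension: str,
    convert_path: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Write the stream to a temp file with the given extension, then convert it."""
    suffix = extension if extension.startswith(".") else f".{extension}"
    # delete=False so the file can be reopened by name on every platform.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(stream.read())
        tmp_path = tmp.name
    try:
        return convert_path(tmp_path)
    finally:
        os.remove(tmp_path)


# Stand-in for a path-based converter: reports the suffix and size it received.
def fake_convert(path: str) -> Optional[str]:
    return f"{os.path.splitext(path)[1]}:{os.path.getsize(path)}"


print(convert_stream_via_tempfile(io.BytesIO(b"hello"), "txt", fake_convert))
```

Preserving the extension matters here because the pipeline is described as choosing handlers based on the file extension.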
FAQ
- What if MarkItDown or Whisper is not installed?
  The pipeline checks for each library’s availability. If a library is missing or fails, it gracefully moves on to the next fallback.
- Do I need Azure/OpenAI credentials?
  - Azure: If you want to use Document Intelligence or GPT-4o-mini, yes.
  - OpenAI: If you want MarkItDown’s LLM-based image captioning or are using Whisper from OpenAI’s library, you need appropriate credentials or local models.
- How do I handle large PST files?
  Large PSTs can be slow to process, especially if they contain many attachments. We parse them message-by-message, recursively handling attachments. For extremely large archives, you might want to increase concurrency or filter out attachments you don’t need.
- Does GPT-4o-mini require a publicly accessible image URL?
  No. If you provide a local file path, the code base64-encodes it, which is ideal for truly local images. If you have a publicly hosted image, you can pass its URL directly.
- Why is Unstructured tried before Azure Doc Intelligence now?
  We observed that Unstructured is typically cheaper to run (especially with Tesseract or local OCR) compared to Azure’s $10 per 1,000 pages. So if MarkItDown fails, we try Unstructured next to potentially save cost. If that also fails, we move to Azure.
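To make the cost argument in the last answer concrete, here is a small back-of-the-envelope estimate. The $10 per 1,000 pages figure is the one quoted above; the function name and the 20% fall-through rate are purely illustrative assumptions, not measured numbers.

```python
AZURE_USD_PER_1000_PAGES = 10.0  # figure quoted in the FAQ answer above


def estimated_azure_cost(total_pages: int, fallthrough_rate: float = 1.0) -> float:
    """Estimated Azure spend in USD when only a fraction of pages reach Azure."""
    if not 0.0 <= fallthrough_rate <= 1.0:
        raise ValueError("fallthrough_rate must be between 0 and 1")
    pages_sent_to_azure = total_pages * fallthrough_rate
    return pages_sent_to_azure * AZURE_USD_PER_1000_PAGES / 1000


# Hypothetical workload: 10,000 pages, 20% of which fall through to Azure.
everything_to_azure = estimated_azure_cost(10_000)
after_cheaper_steps = estimated_azure_cost(10_000, fallthrough_rate=0.2)
print(f"${everything_to_azure:.2f} vs ${after_cheaper_steps:.2f}")
```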
File details
Details for the file markitdown_pro-1.3.6.tar.gz.
File metadata
- Download URL: markitdown_pro-1.3.6.tar.gz
- Size: 49.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 105a7083e69db3aa7d2441db1f52064e6dae3b1c0ffaba5055635bad2ca4fc2a |
| MD5 | 28adfabfdd337be347ca4e9ffa3d87a6 |
| BLAKE2b-256 | 456e025205ff17dc6167614df95c037339be9765a63197424db4b1b9e2356819 |
File details
Details for the file markitdown_pro-1.3.6-py3-none-any.whl.
File metadata
- Download URL: markitdown_pro-1.3.6-py3-none-any.whl
- Size: 55.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1bccc1381c4ae4a7df2093898ba216d00ae9043785f10f39817e66e7d2773870 |
| MD5 | 0b917fac22a5d130f507bbf7405c8de8 |
| BLAKE2b-256 | 59e123f1f1aa30183b7bb91954a16215ba94afb027d9909fd97c0b9f830ad562 |