Skip to main content

A package that converts almost any file format to Markdown.

Project description

MarkItDown-Pro

MarkItDown-Pro is an improvement of the Microsoft MarkItDown repository, enhancing gaps and extending functionality by leveraging Azure Document Intelligence SDK, Unstructured.io, and other Azure services and libraries. The result is a comprehensive Python library and command-line tool designed to convert diverse document formats into Markdown with graceful fallbacks, including OCR support via GPT-4o-mini.


Table of Contents


Folder Structure

A typical layout for MarkItDown-Pro might look like this:

markitdown-pro/
├── .env
├── README.md
├── requirements.txt
├── main.py
├── conversion_pipeline.py
├── common
│   └── utils.py
├── converters
│   ├── markitdown_wrapper.py
│   ├── azure_docint.py
│   ├── unstructured_wrapper.py
│   └── gpt4o_mini_vision.py
├── handlers
│   ├── pst_handler.py
│   ├── email_handler.py
│   ├── zip_handler.py
│   ├── audio_handler.py
│   └── pdf_handler.py
└──  tests
    ├── data
    └── test.py
Folder/File Description
main.py Entry point for CLI usage; uses argparse to accept file paths.
conversion_pipeline.py Orchestrates the fallback chain for converting documents to Markdown.
common/ Shared utility functions, e.g. for file detection, text cleanup, etc.
converters/ Contains modules for using various 3rd-party libraries or services to extract text.
handlers/ Specialized handlers for specific file types (PST, EML, ZIP, audio, PDF scanning).
.env Environment variables (e.g., credentials for Azure GPT-4o-mini, Azure Doc Intelligence).
requirements.txt Python dependencies needed to install and run this project.
tests/test_markitdownpro.py Recursively scans /tests/data/ and attempts to convert each file using convert_document_to_md
README.md This documentation file, explaining usage and details of the project.

Features & Highlights

  1. MarkItDown with LLM

    • Uses MarkItDown to convert documents to Markdown, optionally leveraging an OpenAI LLM to create image captions if you have an OPENAI_API_KEY.
    • Auto-checks for exiftool if you want EXIF metadata in your images.
  2. Whisper-Based Audio Transcription

    • Converts audio files (.mp3, .wav, .ogg, etc.) into text using OpenAI Whisper.
    • Gracefully falls back if Whisper is not installed.
  3. PST Extraction

    • Parses Outlook PST files with libratom, extracting emails and attachments recursively.
  4. Scanned PDF Detection & Concurrency

    • Identifies PDFs with no text or embedded images, and automatically performs OCR on each page with GPT-4o-mini.
    • Offers concurrent page-by-page OCR for faster performance.
  5. Fallback to Azure Document Intelligence & Unstructured

    • If standard MarkItDown or specialized handlers fail or yield insufficient text, it tries Azure’s Document Intelligence to extract textual layout.
    • Unstructured.io library for broad coverage of file types.
  6. GPT-4 Vision (or GPT-4o-mini) for Images & OCR

    • If an image or partially scanned PDF is detected, we can pass it to GPT-4o-mini for OCR.
    • Supports local images (base64 encoding) or remote image URLs directly.
  7. Handles ZIP & EML

    • ZIP: Unzips and processes each file inside, concatenating the results.
    • EML: Extracts email text, attachments, and processes attachments recursively.
  8. Graceful LLM Handling

    • If no OPENAI_API_KEY or GPT-4o-mini credentials are provided, it simply skips LLM-based features, logging a warning.
  9. Helper Methods for URL & Stream Conversion

    • convert_document_from_url(url, output_md)
    • convert_document_from_stream(stream, extension, output_md)
    • convert_document_to_md(local_path, output_md)
  10. Easy-to-Extend Architecture Each file type has its own handler. Each text-extraction library has its own converter. The main pipeline provides a centralized fallback sequence.

  11. Environment-Driven Configuration

  • Pulls API keys, endpoints, and paths from .env to keep secrets out of source code.
  1. Rich File Type Handling
Category File Type(s)
PDF .pdf
PowerPoint .pot, .potm, .ppt, .pptm, .pptx
Word Processing .abw, .doc, .docm, .docx, .dot, .dotm, .hwp, .zabw
Excel/Spreadsheet .et, .fods, .uos1, .uos2, .wk2, .xls, .xlsb, .xlsm, .xlsx, .xlw
Images .bmp, .gif, .heic, .jpeg, .jpg, .png, .prn, .svg, .tiff, .webp
Audio .mp3, .wav, .ogg, .flac, .m4a, .aac, .wma, .webm, .opus
HTML .htm, .html
Text-Based Formats .csv, .json, .xml, .txt
ZIP Files (Iterates over contents)
Email .eml, .p7s
PST .pst
EPUB .epub
Markdown .md
Org Mode .org
Open Office .odt, .sgl
Other .eth, .mw, .pbd, .sdp, .uof, .web
Plain Text .txt
reStructured Text .rst
Rich Text .rtf
StarOffice .sxg
TSV .tsv
Apple .cwk, .mcw, .pages
Data Interchange .dif
dBase .dbf
Microsoft Office .docx, .xlsx, .pptx
HEIF Image Format .heif

How It Works

  1. Detect File Type: The pipeline checks the file extension or general signature (.pdf, .zip, .eml, .docx, .mp3, etc.).
  2. Specialized Handlers: If the file is PST, EML, ZIP, or audio, it’s handed off to a dedicated module that handles that format.
  3. MarkItDown: For most generic document conversions, we first try MarkItDown.
  4. Unstructured: If MarkItDown fails or yields minimal text, we turn to Unstructured.io next.
    • Why? It's typically cheaper than Azure Document Intelligence, and can handle partial OCR scenarios (via Tesseract, PaddleOCR, etc., if you configure OCR_AGENT).
  5. Azure Document Intelligence: If Unstructured also fails or yields minimal text, we try Azure Document Intelligence (prebuilt-layout).
  6. GPT-4o-mini: As a final fallback or specifically for OCR on images/scanned pages.
  7. Saves the extracted text to a .md file once any method returns sufficient content.

File-by-File Explanation

Main Files

  • conversion_pipeline.py The core logic that orchestrates the fallback chain. Checks each handler or converter in a specific order. Once a successful conversion with enough text is found, it writes to .md and stops.

Common Utils

  • common/utils.py
    • File Detection: Contains helper functions like is_pdf, is_audio, detect_extension.
    • Markdown Cleaning: Functions like clean_markdown() and ensure_minimum_content() to tidy up text and ensure it’s not empty.

Converters

  • converters/markitdown_wrapper.py

    • Wraps the MarkItDown library for docx/image extraction, EXIF reading, and optional LLM-based image captioning.
    • If MarkItDown is not installed, or fails, returns None.
  • converters/azure_docint.py

    • Leverages Azure’s Document Intelligence (prebuilt-layout) to extract text from PDFs and other document types in Markdown format.
  • converters/unstructured_wrapper.py

    • Uses the Unstructured.io library to parse documents. Useful for handling broad, less-common file types.
  • converters/gpt4o_mini_vision.py

    • Uses GPT-4o-mini (Azure ChatOpenAI) for OCR tasks on images or scanned PDFs.
    • Concurrent or simple page-by-page approaches for PDFs.
    • Can pass URL-based images or local images via Base64 encoding.

Handlers

  • handlers/pst_handler.py

    • Parses PST archives with libratom and extracts emails + attachments. Calls back into the pipeline for each attachment.
  • handlers/email_handler.py

    • Processes .eml files, extracting plain text, attachments, etc. Recursively processes attachments.
  • handlers/zip_handler.py

    • Unzips files, recurses into the pipeline for each contained file, and concatenates all Markdown output.
  • handlers/audio_handler.py

    • Uses OpenAI Whisper to transcribe .mp3, .wav, .ogg, etc.
    • Caches the model in memory to speed up repeated use.
  • handlers/pdf_handler.py

    • Utility to detect if a PDF is text-only, text+images, or fully scanned.
    • Coordinates with GPT-4o-mini for OCR if needed.

Installation

  1. Clone the Repo
    git clone https://github.com/YourName/markitdown-pro.git
    cd markitdown-pro
    
  2. Create a Virtual Environment (recommended)
    python -m venv venv
    source venv/bin/activate  # or venv\Scripts\activate on Windows
    
  3. Create a Virtual Environment (recommended)
    python -m venv venv
    source venv/bin/activate  # or venv\Scripts\activate on Windows
    
  4. Install Dependencies
    pip install --upgrade pip
    pip install -r requirements.txt
    

Note: You may also need system dependencies for libraries like PyMuPDF, libratom, etc.

  1. Set Up .env
  • Copy the sample .env to your root folder, and fill in your Azure or OpenAI API keys, etc. For example:
    AZURE_DOCINTEL_ENDPOINT="https://<your-region>.api.cognitive.microsoft.com"
    AZURE_DOCINTEL_KEY="YOUR_AZURE_KEY"
    AZURE_OPENAI_API_KEY="your azure open ai key"
    AZURE_OPENAI_API_VERSION="your azure open ai api version"
    AZURE_OPENAI_ENDPOINT="your azure open ai endpoint"
    AZURE_SPEECH_ENDPOINT="azure speech service endpoint - for audio conversion"
    AZURE_SPEECH_KEY="azure speech service key - for audio conversion"
    AZURE_SPEECH_REGION="azure speech service region - for audio conversion"
    
    Make sure to source it or ensure python-dotenv can read it.

Testing

We use pytest for running our test suite. The test files and scripts are located in the /tests directory:

pytest tests/test_markitdownpro.py

Usage

CLI Usage

  1. Basic:

    python main.py /path/to/document.pdf
    

    This will produce /path/to/document.md if successful.

  2. Specify Output Path:

    python main.py /path/to/document.pst --output my_pst_output.md
    

Programmatic Usage

You can import and call the pipeline directly from your Python code:

from conversion_pipeline import convert_document_to_md, convert_document_from_url

# 1) Local file example
md_text = convert_document_to_md("/path/to/my_file.pdf")
print("Extracted Markdown:", md_text)

# 2) URL example
md_from_url = convert_document_from_url("https://example.com/my_doc.docx", output_md="output_doc.md")
print("Output saved to output_doc.md")

FAQ

  1. What if MarkItDown or Whisper is not installed? The pipeline checks for each library’s availability. If a library is missing or fails, it gracefully moves on to the next fallback.

  2. Do I need Azure/OpenAI credentials?

Azure: If you want to use Document Intelligence or GPT-4o-mini, yes. OpenAI: If you want MarkItDown’s LLM-based image captioning or are using Whisper from openai’s library, you need appropriate credentials or local models. How do I handle large PST files? Large PSTs can be slow to process, especially if they contain many attachments. We parse them message-by-message, recursively handling attachments. For extremely large archives, you might want to increase concurrency or filter out attachments you don’t need.

  1. Does GPT-4o-mini require a publicly accessible image URL?

If you provide a local file path, the code base64-encodes it. This is ideal for truly local images. If you have a publicly hosted image, you can pass its URL directly.

  1. Why is Unstructured tried before Azure Doc Intelligence now? We observed that Unstructured is typically lower cost to run (especially with Tesseract or local OCR) compared to Azure’s $10 per 1,000 pages. So if MarkItDown fails, we want to try Unstructured next to potentially save cost. If that also fails, we move to Azure.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

markitdown_pro-1.3.7.tar.gz (49.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

markitdown_pro-1.3.7-py3-none-any.whl (56.2 kB view details)

Uploaded Python 3

File details

Details for the file markitdown_pro-1.3.7.tar.gz.

File metadata

  • Download URL: markitdown_pro-1.3.7.tar.gz
  • Upload date:
  • Size: 49.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for markitdown_pro-1.3.7.tar.gz
Algorithm Hash digest
SHA256 fc2a9c7de853f957d494c43bd87a7f9f418626aa6e5395ab73440ba37a90a53e
MD5 979f1f59b5b6e0bba14f4b8e312e9a4b
BLAKE2b-256 0e7231025ecb8aff07446e4574950b0ad366c6624e5b0fcf5bafe90d395430c6

See more details on using hashes here.

File details

Details for the file markitdown_pro-1.3.7-py3-none-any.whl.

File metadata

  • Download URL: markitdown_pro-1.3.7-py3-none-any.whl
  • Upload date:
  • Size: 56.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for markitdown_pro-1.3.7-py3-none-any.whl
Algorithm Hash digest
SHA256 631cf858fcb8e78d521bb77b653c357269e89eb6e1de3bea9c1ba96033f069bd
MD5 30c68688f7ee81b78dbb7996abd67426
BLAKE2b-256 eb240862925210d8aca5960320729532dde1a5a9db490e2cd98bcff503906128

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page