Skip to main content

A package that converts almost any file format to Markdown.

Project description

MarkItDown-Pro

MarkItDown-Pro is an improvement of the Microsoft MarkItDown repository, enhancing gaps and extending functionality by leveraging Azure Document Intelligence SDK, Unstructured.io, and other Azure services and libraries. The result is a comprehensive Python library and command-line tool designed to convert diverse document formats into Markdown with graceful fallbacks, including OCR support via GPT-4o-mini.


Table of Contents


Folder Structure

A typical layout for MarkItDown-Pro might look like this:

markitdown-pro/
├── .env
├── README.md
├── requirements.txt
├── main.py
├── conversion_pipeline.py
├── common
│   └── utils.py
├── converters
│   ├── markitdown_wrapper.py
│   ├── azure_docint.py
│   ├── unstructured_wrapper.py
│   └── gpt4o_mini_vision.py
├── handlers
│   ├── pst_handler.py
│   ├── email_handler.py
│   ├── zip_handler.py
│   ├── audio_handler.py
│   └── pdf_handler.py
└──  tests
    ├── data
    └── test.py
Folder/File Description
main.py Entry point for CLI usage; uses argparse to accept file paths.
conversion_pipeline.py Orchestrates the fallback chain for converting documents to Markdown.
common/ Shared utility functions, e.g. for file detection, text cleanup, etc.
converters/ Contains modules for using various 3rd-party libraries or services to extract text.
handlers/ Specialized handlers for specific file types (PST, EML, ZIP, audio, PDF scanning).
.env Environment variables (e.g., credentials for Azure GPT-4o-mini, Azure Doc Intelligence).
requirements.txt Python dependencies needed to install and run this project.
tests/test_markitdownpro.py Recursively scans /tests/data/ and attempts to convert each file using convert_document_to_md
README.md This documentation file, explaining usage and details of the project.

Features & Highlights

  1. MarkItDown with LLM

    • Uses MarkItDown to convert documents to Markdown, optionally leveraging an OpenAI LLM to create image captions if you have an OPENAI_API_KEY.
    • Auto-checks for exiftool if you want EXIF metadata in your images.
  2. Whisper-Based Audio Transcription

    • Converts audio files (.mp3, .wav, .ogg, etc.) into text using OpenAI Whisper.
    • Gracefully falls back if Whisper is not installed.
  3. PST Extraction

    • Parses Outlook PST files with libratom, extracting emails and attachments recursively.
  4. Scanned PDF Detection & Concurrency

    • Identifies PDFs with no text or embedded images, and automatically performs OCR on each page with GPT-4o-mini.
    • Offers concurrent page-by-page OCR for faster performance.
  5. Fallback to Azure Document Intelligence & Unstructured

    • If standard MarkItDown or specialized handlers fail or yield insufficient text, it tries Azure’s Document Intelligence to extract textual layout.
    • Unstructured.io library for broad coverage of file types.
  6. GPT-4 Vision (or GPT-4o-mini) for Images & OCR

    • If an image or partially scanned PDF is detected, we can pass it to GPT-4o-mini for OCR.
    • Supports local images (base64 encoding) or remote image URLs directly.
  7. Handles ZIP & EML

    • ZIP: Unzips and processes each file inside, concatenating the results.
    • EML: Extracts email text, attachments, and processes attachments recursively.
  8. Graceful LLM Handling

    • If no OPENAI_API_KEY or GPT-4o-mini credentials are provided, it simply skips LLM-based features, logging a warning.
  9. Helper Methods for URL & Stream Conversion

    • convert_document_from_url(url, output_md)
    • convert_document_from_stream(stream, extension, output_md)
    • convert_document_to_md(local_path, output_md)
  10. Easy-to-Extend Architecture Each file type has its own handler. Each text-extraction library has its own converter. The main pipeline provides a centralized fallback sequence.

  11. Environment-Driven Configuration

  • Pulls API keys, endpoints, and paths from .env to keep secrets out of source code.
  1. Rich File Type Handling
Category File Type(s)
PDF .pdf
PowerPoint .pot, .potm, .ppt, .pptm, .pptx
Word Processing .abw, .doc, .docm, .docx, .dot, .dotm, .hwp, .zabw
Excel/Spreadsheet .et, .fods, .uos1, .uos2, .wk2, .xls, .xlsb, .xlsm, .xlsx, .xlw
Images .bmp, .gif, .heic, .jpeg, .jpg, .png, .prn, .svg, .tiff, .webp
Audio .mp3, .wav, .ogg, .flac, .m4a, .aac, .wma, .webm, .opus
HTML .htm, .html
Text-Based Formats .csv, .json, .xml, .txt
ZIP Files (Iterates over contents)
Email .eml, .p7s
PST .pst
EPUB .epub
Markdown .md
Org Mode .org
Open Office .odt, .sgl
Other .eth, .mw, .pbd, .sdp, .uof, .web
Plain Text .txt
reStructured Text .rst
Rich Text .rtf
StarOffice .sxg
TSV .tsv
Apple .cwk, .mcw, .pages
Data Interchange .dif
dBase .dbf
Microsoft Office .docx, .xlsx, .pptx
HEIF Image Format .heif

How It Works

  1. Detect File Type: The pipeline checks the file extension or general signature (.pdf, .zip, .eml, .docx, .mp3, etc.).
  2. Specialized Handlers: If the file is PST, EML, ZIP, or audio, it’s handed off to a dedicated module that handles that format.
  3. MarkItDown: For most generic document conversions, we first try MarkItDown.
  4. Unstructured: If MarkItDown fails or yields minimal text, we turn to Unstructured.io next.
    • Why? It's typically cheaper than Azure Document Intelligence, and can handle partial OCR scenarios (via Tesseract, PaddleOCR, etc., if you configure OCR_AGENT).
  5. Azure Document Intelligence: If Unstructured also fails or yields minimal text, we try Azure Document Intelligence (prebuilt-layout).
  6. GPT-4o-mini: As a final fallback or specifically for OCR on images/scanned pages.
  7. Saves the extracted text to a .md file once any method returns sufficient content.

File-by-File Explanation

Main Files

  • conversion_pipeline.py The core logic that orchestrates the fallback chain. Checks each handler or converter in a specific order. Once a successful conversion with enough text is found, it writes to .md and stops.

Common Utils

  • common/utils.py
    • File Detection: Contains helper functions like is_pdf, is_audio, detect_extension.
    • Markdown Cleaning: Functions like clean_markdown() and ensure_minimum_content() to tidy up text and ensure it’s not empty.

Converters

  • converters/markitdown_wrapper.py

    • Wraps the MarkItDown library for docx/image extraction, EXIF reading, and optional LLM-based image captioning.
    • If MarkItDown is not installed, or fails, returns None.
  • converters/azure_docint.py

    • Leverages Azure’s Document Intelligence (prebuilt-layout) to extract text from PDFs and other document types in Markdown format.
  • converters/unstructured_wrapper.py

    • Uses the Unstructured.io library to parse documents. Useful for handling broad, less-common file types.
  • converters/gpt4o_mini_vision.py

    • Uses GPT-4o-mini (Azure ChatOpenAI) for OCR tasks on images or scanned PDFs.
    • Concurrent or simple page-by-page approaches for PDFs.
    • Can pass URL-based images or local images via Base64 encoding.

Handlers

  • handlers/pst_handler.py

    • Parses PST archives with libratom and extracts emails + attachments. Calls back into the pipeline for each attachment.
  • handlers/email_handler.py

    • Processes .eml files, extracting plain text, attachments, etc. Recursively processes attachments.
  • handlers/zip_handler.py

    • Unzips files, recurses into the pipeline for each contained file, and concatenates all Markdown output.
  • handlers/audio_handler.py

    • Uses OpenAI Whisper to transcribe .mp3, .wav, .ogg, etc.
    • Caches the model in memory to speed up repeated use.
  • handlers/pdf_handler.py

    • Utility to detect if a PDF is text-only, text+images, or fully scanned.
    • Coordinates with GPT-4o-mini for OCR if needed.

Installation

  1. Clone the Repo
    git clone https://github.com/YourName/markitdown-pro.git
    cd markitdown-pro
    
  2. Create a Virtual Environment (recommended)
    python -m venv venv
    source venv/bin/activate  # or venv\Scripts\activate on Windows
    
  3. Create a Virtual Environment (recommended)
    python -m venv venv
    source venv/bin/activate  # or venv\Scripts\activate on Windows
    
  4. Install Dependencies
    pip install --upgrade pip
    pip install -r requirements.txt
    

Note: You may also need system dependencies for libraries like PyMuPDF, libratom, etc.

  1. Set Up .env
  • Copy the sample .env to your root folder, and fill in your Azure or OpenAI API keys, etc. For example:
    AZURE_DOCINTEL_ENDPOINT="https://<your-region>.api.cognitive.microsoft.com"
    AZURE_DOCINTEL_KEY="YOUR_AZURE_KEY"
    AZURE_OPENAI_API_KEY="your azure open ai key"
    AZURE_OPENAI_API_VERSION="your azure open ai api version"
    AZURE_OPENAI_ENDPOINT="your azure open ai endpoint"
    AZURE_SPEECH_ENDPOINT="azure speech service endpoint - for audio conversion"
    AZURE_SPEECH_KEY="azure speech service key - for audio conversion"
    AZURE_SPEECH_REGION="azure speech service region - for audio conversion"
    
    Make sure to source it or ensure python-dotenv can read it.

Testing

We use pytest for running our test suite. The test files and scripts are located in the /tests directory:

pytest tests/test_markitdownpro.py

Usage

CLI Usage

  1. Basic:

    python main.py /path/to/document.pdf
    

    This will produce /path/to/document.md if successful.

  2. Specify Output Path:

    python main.py /path/to/document.pst --output my_pst_output.md
    

Programmatic Usage

You can import and call the pipeline directly from your Python code:

from conversion_pipeline import convert_document_to_md, convert_document_from_url

# 1) Local file example
md_text = convert_document_to_md("/path/to/my_file.pdf")
print("Extracted Markdown:", md_text)

# 2) URL example
md_from_url = convert_document_from_url("https://example.com/my_doc.docx", output_md="output_doc.md")
print("Output saved to output_doc.md")

Extra: Vector Database Chunking

After converting a document to Markdown, it’s common to chunk the text before sending it to a vector database. Here’s a minimal example using LangChain:

from langchain.text_splitter import MarkdownTextSplitter

# Load your Markdown content
with open('your_markdown_file.md', 'r') as file:
    markdown_text = file.read()

# Initialize the MarkdownTextSplitter
markdown_splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=100)

# Split the text into chunks
chunks = markdown_splitter.create_documents([markdown_text])

# Each chunk is now ready for embedding and ingestion into your vector database
for chunk in chunks:
    # e.g., push chunk.text to your VectorDB
    pass

FAQ

  1. What if MarkItDown or Whisper is not installed? The pipeline checks for each library’s availability. If a library is missing or fails, it gracefully moves on to the next fallback.

  2. Do I need Azure/OpenAI credentials?

Azure: If you want to use Document Intelligence or GPT-4o-mini, yes. OpenAI: If you want MarkItDown’s LLM-based image captioning or are using Whisper from openai’s library, you need appropriate credentials or local models. How do I handle large PST files? Large PSTs can be slow to process, especially if they contain many attachments. We parse them message-by-message, recursively handling attachments. For extremely large archives, you might want to increase concurrency or filter out attachments you don’t need.

  1. Does GPT-4o-mini require a publicly accessible image URL?

If you provide a local file path, the code base64-encodes it. This is ideal for truly local images. If you have a publicly hosted image, you can pass its URL directly.

  1. Why is Unstructured tried before Azure Doc Intelligence now? We observed that Unstructured is typically lower cost to run (especially with Tesseract or local OCR) compared to Azure’s $10 per 1,000 pages. So if MarkItDown fails, we want to try Unstructured next to potentially save cost. If that also fails, we move to Azure.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

markitdown_pro-1.0.0.tar.gz (39.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

markitdown_pro-1.0.0-py3-none-any.whl (46.2 kB view details)

Uploaded Python 3

File details

Details for the file markitdown_pro-1.0.0.tar.gz.

File metadata

  • Download URL: markitdown_pro-1.0.0.tar.gz
  • Upload date:
  • Size: 39.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.11

File hashes

Hashes for markitdown_pro-1.0.0.tar.gz
Algorithm Hash digest
SHA256 10f8d218388c17deb93c5af6934cd51918f2f7775456d8ca7ddf713ffba9141e
MD5 a51a73a7c8af5a5ce674462791ed8419
BLAKE2b-256 7f3a4cc61aa95e0626e17f1a06fe913a755ecfe7c19d9ec441f0f3f231162bcd

See more details on using hashes here.

File details

Details for the file markitdown_pro-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: markitdown_pro-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 46.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.11

File hashes

Hashes for markitdown_pro-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c1d2047c83c4e49e7383c95dfa794d83ae59530cf674dd226743f87572d954e3
MD5 f4c1815772f38cb5f2456260d79dd208
BLAKE2b-256 f8a26d820dcd6375256be3958aa13063585b9aedac3839cae0abfe6b278ec1ae

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page