Convert documents (Office, PDF) to Markdown — optimized for Persian/Farsi and multilingual content

libmumd

Convert documents to clean, LLM-ready Markdown.

Why libmumd?

Most document converters produce messy output — broken tables, lost formatting, garbled non-English text. libmumd is different:

  • Persian & Arabic first-class support — Correctly handles RTL text, Persian typography, and Arabic script
  • Better than markitdown alone — Uses PyMuPDF's layout engine for cleaner, more accurate conversion
  • Multi-language support — Handles Persian, Arabic, Chinese, Japanese, Korean, and European languages
  • Table detection — Automatically converts complex tables to Markdown format
  • Figure & image handling — Extracts and references images properly
  • Layout-aware — Preserves reading order, headers, and document structure
  • No GPU required — Runs on any machine with Python

Features

| Feature | Description |
| --- | --- |
| PDF → Markdown | High-quality extraction with layout preservation |
| Office → Markdown | Convert .docx, .pptx, .xlsx, and more via LibreOffice |
| Smart table parsing | Complex tables become clean Markdown tables |
| Image extraction | Embedded images are saved and referenced |
| Header detection | Font sizes map to # heading levels automatically |
| Inline formatting | Preserves bold, italic, and code |
| Multi-column layouts | Reconstructs natural reading order |
| OCR fallback | Handles scanned documents when text layer is missing |
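To make the header-detection row more concrete, here is a minimal sketch of one way font sizes could be mapped to heading levels. It is an illustrative heuristic and not libmumd's or PyMuPDF's actual algorithm: it assumes the most frequent font size is body text and that every larger size is a heading. The function names `heading_levels` and `to_markdown_line` are hypothetical.

```python
from collections import Counter


def heading_levels(font_sizes, max_levels=3):
    """Map each distinct font size to a heading level (1 = largest, 0 = body text).

    Assumption: the most frequent size is body text; larger sizes are headings.
    """
    if not font_sizes:
        return {}
    body_size = Counter(font_sizes).most_common(1)[0][0]
    # Distinct sizes larger than body text, largest first.
    larger = sorted({s for s in font_sizes if s > body_size}, reverse=True)
    levels = {size: 0 for size in set(font_sizes)}
    for rank, size in enumerate(larger, start=1):
        # Sizes beyond max_levels share the deepest heading level.
        levels[size] = min(rank, max_levels)
    return levels


def to_markdown_line(text, size, levels):
    """Prefix a line with '#' marks according to its font size."""
    level = levels.get(size, 0)
    return f"{'#' * level} {text}" if level else text
```

A real layout engine would also weigh boldness, position, and spacing; font size alone is only the simplest signal.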

Installation

pip install git+https://github.com/erfan-ashtari/libmumd.git

Or install from source:

git clone https://github.com/erfan-ashtari/libmumd.git
cd libmumd
pip install .

Quick Start

Command Line

# Convert PDF to Markdown
libmumd document.pdf

# Convert Office document
libmumd report.docx output.md

Python

from libmumd import convert_file

# Basic usage
result = convert_file("document.pdf")
print(result)
# {'status': 'ok', 'chars': 4523, 'output': 'document.md'}

# Custom output path
result = convert_file("presentation.pptx", "slides.md")
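When converting several files, it can help to separate successes from failures. The sketch below is one way to do that, assuming the result dictionary has the `status` and `output` keys shown in the example above. The helper `convert_many` is hypothetical and not part of libmumd; it takes the converter as an argument, and whether `convert_file` raises on bad input is also an assumption.

```python
from pathlib import Path


def convert_many(paths, convert):
    """Run `convert` on each path and split results into successes and failures."""
    succeeded, failed = {}, {}
    for path in map(Path, paths):
        try:
            result = convert(str(path))
        except Exception as exc:  # assumption: the converter may raise on bad input
            failed[path.name] = str(exc)
            continue
        # Assumes a result dict with a 'status' key, as in the example above.
        if result.get("status") == "ok":
            succeeded[path.name] = result.get("output")
        else:
            failed[path.name] = result.get("status", "unknown")
    return succeeded, failed
```

With libmumd installed, `convert_many(["a.pdf", "b.docx"], convert_file)` would return two dictionaries keyed by file name.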

Supported Formats

| Format | Extensions | Conversion Method |
| --- | --- | --- |
| PDF | .pdf | PyMuPDF (native) |
| Word | .docx, .doc | LibreOffice |
| PowerPoint | .pptx, .ppt | LibreOffice |
| Excel | .xlsx, .xls | LibreOffice |
| OpenDocument | .odt, .odp, .ods | LibreOffice |
| Rich Text | .rtf | LibreOffice |
| Other | Any | markitdown fallback |
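The routing in the table above can be sketched as a lookup on the file extension. This is an illustration of the table only, not libmumd's internal code; the function name `conversion_method` and the returned labels are hypothetical.

```python
from pathlib import Path

# Extensions the table above routes through LibreOffice.
LIBREOFFICE_EXTENSIONS = {
    ".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls",
    ".odt", ".odp", ".ods", ".rtf",
}


def conversion_method(filename):
    """Return the conversion method the table assigns to this file's extension."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".pdf":
        return "pymupdf"
    if suffix in LIBREOFFICE_EXTENSIONS:
        return "libreoffice"
    # Anything else falls through to the markitdown fallback.
    return "markitdown"
```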

Requirements

Python Packages (Auto-installed)

  • pymupdf4llm — PDF extraction engine
  • markitdown — Fallback converter

LibreOffice (Required for Office Files)

LibreOffice is needed to convert Word, PowerPoint, Excel, OpenDocument, and RTF files.

| OS | Installation |
| --- | --- |
| Windows | winget install --id TheDocumentFoundation.LibreOffice |
| macOS | brew install --cask libreoffice |
| Linux (Debian/Ubuntu) | sudo apt-get install libreoffice |

Or download from libreoffice.org.

Note: PDF conversion works without LibreOffice. Only Office document conversion requires it.
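Before converting Office files, a script can check whether a LibreOffice executable is reachable. The sketch below uses the standard library's `shutil.which`; the helper name `find_libreoffice` is hypothetical, and the executable names `soffice` and `libreoffice` are the common ones but are an assumption here. How libmumd itself locates LibreOffice is not documented above.

```python
import shutil


def find_libreoffice(candidates=("soffice", "libreoffice")):
    """Return the path of the first LibreOffice executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None
```

A `None` result does not prove LibreOffice is missing: on Windows and macOS the installer may not add the executable to `PATH`.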

Output Quality Comparison

| Aspect | markitdown only | libmumd |
| --- | --- | --- |
| Table formatting | Inconsistent | Clean Markdown tables |
| Multi-language | Basic | Full Unicode support |
| Layout preservation | None | Reading order preserved |
| Image handling | Limited | Extracted and referenced |
| Header detection | None | Automatic heading levels |

Persian (Farsi) & Arabic Support

libmumd is built with Persian and Arabic documents in mind:

  • RTL text handling — Correctly processes right-to-left text
  • Persian typography — Preserves proper character connections and diacritics
  • Mixed content — Handles documents with both Persian/Arabic and English text
  • PDF extraction — Extracts Persian text without garbling or losing characters
  • Font support — Works with Persian fonts like IRANSans, Vazirmatn, and more

from libmumd import convert_file

# Convert a Persian PDF document
result = convert_file("persian-document.pdf")
# Output preserves RTL text and Persian characters correctly
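One rough way to sanity-check the converted Markdown is to measure how many of its letters are right-to-left. The sketch below uses the standard library's `unicodedata.bidirectional`, which reports `R` for Hebrew-style letters and `AL` for Arabic-script letters (including Persian). The helper `rtl_share` is hypothetical and not part of libmumd.

```python
import unicodedata


def rtl_share(text):
    """Return the fraction of letters in `text` whose bidi class is right-to-left."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    rtl = sum(1 for ch in letters if unicodedata.bidirectional(ch) in ("R", "AL"))
    return rtl / len(letters)
```

A share near zero for a document known to be Persian would suggest the text layer was lost or garbled during extraction.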

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License — see the LICENSE file for details.

Dependency Licenses

  • pymupdf4llm: AGPL-3.0 (required for PDF conversion)
  • markitdown: MIT
  • LibreOffice: MPL-2.0

Users of this package must comply with the AGPL-3.0 license for pymupdf4llm.

Acknowledgments
