Convert documents (Office, PDF) to Markdown — optimized for Persian/Farsi and multilingual content

libmumd

Convert documents to clean, LLM-ready Markdown.

Why libmumd?

Most document converters produce messy output — broken tables, lost formatting, garbled non-English text. libmumd is different:

  • Persian & Arabic first-class support — Correctly handles RTL text, Persian typography, and Arabic script
  • Better than markitdown alone — Uses PyMuPDF's layout engine for cleaner, more accurate conversion
  • Multi-language support — Handles Persian, Arabic, Chinese, Japanese, Korean, and European languages
  • Table detection — Automatically converts complex tables to Markdown format
  • Figure & image handling — Extracts and references images properly
  • Layout-aware — Preserves reading order, headers, and document structure
  • No GPU required — Runs on any machine with Python

Features

| Feature | Description |
|---|---|
| PDF → Markdown | High-quality extraction with layout preservation |
| Office → Markdown | Convert .docx, .pptx, .xlsx, and more via LibreOffice |
| Smart table parsing | Complex tables become clean Markdown tables |
| Image extraction | Embedded images are saved and referenced |
| Header detection | Font sizes map to # heading levels automatically |
| Inline formatting | Preserves bold, italic, and code |
| Multi-column layouts | Reconstructs natural reading order |
| OCR fallback | Handles scanned documents when text layer is missing |
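
To make the header detection row concrete, here is a minimal sketch of how font sizes could be mapped to heading levels. It is not libmumd's or PyMuPDF's actual algorithm: the function name `heading_levels` and the rule it applies (the most common size is body text, larger sizes become headings from largest to smallest) are assumptions made for illustration only.

```python
from collections import Counter

def heading_levels(font_sizes, max_levels=3):
    """Map each distinct font size to a Markdown heading prefix ('' for body text).

    Hypothetical helper; expects a non-empty list of font sizes.
    """
    # Assumption: the most frequent size in the document is body text.
    body_size = Counter(font_sizes).most_common(1)[0][0]
    # Sizes larger than body text become headings, largest first.
    larger = sorted({s for s in font_sizes if s > body_size}, reverse=True)
    levels = {}
    for rank, size in enumerate(larger[:max_levels], start=1):
        levels[size] = "#" * rank
    # Body text, smaller sizes, and any overflow sizes get no prefix.
    for size in set(font_sizes):
        levels.setdefault(size, "")
    return levels

sizes = [24, 11, 11, 11, 16, 11, 11, 16, 11, 9]
levels = heading_levels(sizes)
```

With this sample input, size 24 maps to `#`, size 16 to `##`, and sizes 11 and 9 to plain text.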

Installation

pip install libmumd

Or install from GitHub:

pip install git+https://github.com/erfan-ashtari/libmumd.git

Quick Start

Command Line

# Convert PDF to Markdown
libmumd document.pdf

# Convert Office document
libmumd report.docx output.md

Python

from libmumd import convert_file

# Basic usage
result = convert_file("document.pdf")
print(result)
# {'status': 'ok', 'chars': 4523, 'output': 'document.md'}

# Custom output path
result = convert_file("presentation.pptx", "slides.md")
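
For converting a whole folder, the two calls above can be wrapped in a small loop. The sketch below is an illustration rather than part of libmumd: `convert_folder` is a hypothetical helper, and it assumes only what the examples above show, namely that the converter accepts an input path and an output path and returns a dict with a `status` key. The converter is passed in as an argument so the helper does not depend on how libmumd is installed.

```python
from pathlib import Path

def convert_folder(folder, convert, pattern="*.pdf"):
    """Run `convert` on every matching file; return (converted, failed) lists."""
    converted, failed = [], []
    for path in sorted(Path(folder).glob(pattern)):
        target = path.with_suffix(".md")
        try:
            result = convert(str(path), str(target))
        except Exception as exc:
            # Keep going if a single document cannot be converted.
            failed.append((path.name, str(exc)))
            continue
        # Assumption: the result dict carries the 'status' key shown above.
        if result.get("status") == "ok":
            converted.append(path.name)
        else:
            failed.append((path.name, str(result)))
    return converted, failed

# Usage (requires libmumd to be installed):
#   from libmumd import convert_file
#   converted, failed = convert_folder("reports", convert_file)
```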

Supported Formats

| Format | Extensions | Conversion Method |
|---|---|---|
| PDF | .pdf | PyMuPDF (native) |
| Word | .docx, .doc | LibreOffice |
| PowerPoint | .pptx, .ppt | LibreOffice |
| Excel | .xlsx, .xls | LibreOffice |
| OpenDocument | .odt, .odp, .ods | LibreOffice |
| Rich Text | .rtf | LibreOffice |
| Other | Any | markitdown fallback |
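
The routing in this table can be expressed as a lookup on the file extension. The following is a sketch of that idea, not libmumd's internal code: the function name `conversion_method` and the returned labels are assumptions, while the extension groups are copied from the table.

```python
from pathlib import Path

# Extension groups taken from the Supported Formats table.
PYMUPDF_EXTS = {".pdf"}
LIBREOFFICE_EXTS = {
    ".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls",
    ".odt", ".odp", ".ods", ".rtf",
}

def conversion_method(filename):
    """Return which backend the table assigns to this file (hypothetical labels)."""
    ext = Path(filename).suffix.lower()  # case-insensitive match on the extension
    if ext in PYMUPDF_EXTS:
        return "pymupdf"
    if ext in LIBREOFFICE_EXTS:
        return "libreoffice"
    # Anything else falls through to the markitdown fallback.
    return "markitdown"
```

A helper like this is handy for telling users up front whether a given file will need LibreOffice.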

Requirements

Python Packages (Auto-installed)

  • pymupdf4llm — PDF extraction engine
  • markitdown — Fallback converter

LibreOffice (Required for Office Files)

LibreOffice is needed to convert Word, PowerPoint, and Excel files.

| OS | Installation |
|---|---|
| Windows | winget install --id TheDocumentFoundation.LibreOffice |
| macOS | brew install --cask libreoffice |
| Linux (Debian/Ubuntu) | sudo apt-get install libreoffice |

Or download from libreoffice.org.

Note: PDF conversion works without LibreOffice. Only Office document conversion requires it.
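
Before converting Office files, it can help to check whether a LibreOffice executable is reachable. The sketch below uses only the standard library; `find_libreoffice` is a hypothetical helper, and the README does not state how libmumd itself locates LibreOffice. The names `soffice` and `libreoffice` are the usual command names, but on Windows and macOS the executable is often not on PATH, so a `None` result does not prove LibreOffice is missing.

```python
import shutil

def find_libreoffice(candidates=("soffice", "libreoffice")):
    """Return the path of the first LibreOffice executable found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None

location = find_libreoffice()
if location is None:
    print("LibreOffice not found on PATH; Office conversion may not work.")
else:
    print(f"LibreOffice found at {location}")
```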

Output Quality Comparison

| Aspect | markitdown only | libmumd |
|---|---|---|
| Table formatting | Inconsistent | Clean Markdown tables |
| Multi-language | Basic | Full Unicode support |
| Layout preservation | None | Reading order preserved |
| Image handling | Limited | Extracted and referenced |
| Header detection | None | Automatic heading levels |

Persian (Farsi) & Arabic Support

libmumd is built with Persian and Arabic documents in mind:

  • RTL text handling — Correctly processes right-to-left text
  • Persian typography — Preserves proper character connections and diacritics
  • Mixed content — Handles documents with both Persian/Arabic and English text
  • PDF extraction — Extracts Persian text without garbling or losing characters
  • Font support — Works with Persian fonts like IRANSans, Vazirmatn, and more

from libmumd import convert_file

# Convert a Persian PDF document
result = convert_file("persian-document.pdf")
# Output preserves RTL text and Persian characters correctly
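
One quick way to sanity-check converted Persian or Arabic output is to measure how much of the text is right-to-left script. The sketch below relies only on the standard `unicodedata` module; `rtl_ratio` is a hypothetical helper and not part of libmumd. A ratio near zero for a document known to be Persian would suggest that the text was garbled or lost during extraction.

```python
import unicodedata

def rtl_ratio(text):
    """Fraction of letters whose Unicode bidi class is right-to-left (R or AL)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # 'AL' covers Arabic-script letters (including Persian); 'R' covers Hebrew and others.
    rtl = sum(1 for ch in letters if unicodedata.bidirectional(ch) in ("R", "AL"))
    return rtl / len(letters)

# Usage on a converted file (the path is illustrative):
#   with open("persian-document.md", encoding="utf-8") as f:
#       print(rtl_ratio(f.read()))
```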

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License — see the LICENSE file for details.

Dependency Licenses

  • pymupdf4llm: AGPL-3.0 (required for PDF conversion)
  • markitdown: MIT
  • LibreOffice: MPL-2.0

Users of this package must comply with the AGPL-3.0 license for pymupdf4llm.
