Convert documents (Office, PDF) to Markdown — optimized for Persian/Farsi and multilingual content

These details have not been verified by PyPI

libmumd

Convert documents to clean, LLM-ready Markdown.

Why libmumd?

Most document converters produce messy output — broken tables, lost formatting, garbled non-English text. libmumd is different:

Persian & Arabic first-class support — Correctly handles RTL text, Persian typography, and Arabic script
Better than markitdown alone — Uses PyMuPDF's layout engine for cleaner, more accurate conversion
Multi-language support — Handles Persian, Arabic, Chinese, Japanese, Korean, and European languages
Table detection — Automatically converts complex tables to Markdown format
Figure & image handling — Extracts and references images properly
Layout-aware — Preserves reading order, headers, and document structure
No GPU required — Runs on any machine with Python

Features

| Feature | Description |
| --- | --- |
| PDF → Markdown | High-quality extraction with layout preservation |
| Office → Markdown | Convert `.docx`, `.pptx`, `.xlsx`, and more via LibreOffice |
| Smart table parsing | Complex tables become clean Markdown tables |
| Image extraction | Embedded images are saved and referenced |
| Header detection | Font sizes map to `#` heading levels automatically |
| Inline formatting | Preserves bold, italic, and `code` |
| Multi-column layouts | Reconstructs natural reading order |
| OCR fallback | Handles scanned documents when text layer is missing |
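
To make the header detection row concrete, here is a minimal sketch of how font sizes could be mapped to `#` heading levels. It is an illustration only, not the libmumd or pymupdf4llm implementation: the function name `heading_prefixes` and the rule "most common size is body text" are assumptions introduced for this example.

```python
from collections import Counter


def heading_prefixes(font_sizes, max_level=6):
    """Map each font size to a Markdown heading prefix (hypothetical helper).

    Assumption: the most frequent size is body text and gets an empty
    prefix. Each distinct larger size gets a heading level, largest first.
    """
    if not font_sizes:
        return {}
    body_size = Counter(font_sizes).most_common(1)[0][0]
    # Start with "no heading" for every size seen in the document.
    prefixes = {size: "" for size in set(font_sizes)}
    larger = sorted((s for s in prefixes if s > body_size), reverse=True)
    for rank, size in enumerate(larger, start=1):
        # Markdown only defines six heading levels, so clamp the rank.
        prefixes[size] = "#" * min(rank, max_level) + " "
    return prefixes


sizes = [24, 18, 11, 11, 11, 9]  # one size per text span, in points
print(heading_prefixes(sizes))
```

Sizes smaller than the body size (footnotes, captions) keep an empty prefix in this sketch.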

Installation

pip install git+https://github.com/erfan-ashtari/libmumd.git

Or install from source:

git clone https://github.com/erfan-ashtari/libmumd.git
cd libmumd
pip install .

Quick Start

Command Line

# Convert PDF to Markdown
libmumd document.pdf

# Convert Office document
libmumd report.docx output.md

Python

from libmumd import convert_file

# Basic usage
result = convert_file("document.pdf")
print(result)
# {'status': 'ok', 'chars': 4523, 'output': 'document.md'}

# Custom output path
result = convert_file("presentation.pptx", "slides.md")
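
For more than one file, a small wrapper can collect successes and failures. The sketch below assumes the result is a dict shaped like the commented output above (`status`, `chars`, `output`); status values other than `'ok'` and the exceptions `convert_file` may raise are not documented here, so both are handled generically. `convert_many` is a hypothetical helper, and the converter is passed in as an argument so the code runs without libmumd installed.

```python
def convert_many(paths, convert):
    """Run `convert` on each path and split results into successes and failures.

    `convert` is any callable with the same shape as `convert_file`:
    it takes a path string and returns a dict with a 'status' key.
    """
    succeeded, failed = [], []
    for path in paths:
        try:
            result = convert(str(path))
        except Exception as exc:  # assumption: a converter may raise on bad input
            failed.append((str(path), repr(exc)))
            continue
        if isinstance(result, dict) and result.get("status") == "ok":
            succeeded.append((str(path), result.get("output")))
        else:
            failed.append((str(path), repr(result)))
    return succeeded, failed
```

With libmumd installed, this could be called as `convert_many(Path("docs").glob("*.pdf"), convert_file)`, using `pathlib.Path`.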

Supported Formats

| Format | Extensions | Conversion Method |
| --- | --- | --- |
| PDF | `.pdf` | PyMuPDF (native) |
| Word | `.docx`, `.doc` | LibreOffice |
| PowerPoint | `.pptx`, `.ppt` | LibreOffice |
| Excel | `.xlsx`, `.xls` | LibreOffice |
| OpenDocument | `.odt`, `.odp`, `.ods` | LibreOffice |
| Rich Text | `.rtf` | LibreOffice |
| Other | Any | markitdown fallback |
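
The routing in the table can be restated as a lookup by file extension. This is a sketch of the table only, not libmumd's internal dispatch code; the name `conversion_method` and the case-insensitive matching are assumptions made for the example.

```python
from pathlib import Path

# Extensions the table lists as going through LibreOffice.
LIBREOFFICE_EXTENSIONS = {
    ".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls",
    ".odt", ".odp", ".ods", ".rtf",
}


def conversion_method(filename):
    """Return the conversion method the table lists for this file type."""
    ext = Path(filename).suffix.lower()  # assumption: matching ignores case
    if ext == ".pdf":
        return "PyMuPDF"
    if ext in LIBREOFFICE_EXTENSIONS:
        return "LibreOffice"
    return "markitdown"  # everything else uses the fallback converter


for name in ("paper.pdf", "Report.DOCX", "notes.html"):
    print(name, "->", conversion_method(name))
```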

Requirements

Python Packages (Auto-installed)

pymupdf4llm — PDF extraction engine
markitdown — Fallback converter

LibreOffice (Required for Office Files)

LibreOffice is needed to convert Word, PowerPoint, and Excel files.

| OS | Installation |
| --- | --- |
| Windows | `winget install --id TheDocumentFoundation.LibreOffice` |
| macOS | `brew install --cask libreoffice` |
| Linux | `sudo apt-get install libreoffice` |

Or download from libreoffice.org.

Note: PDF conversion works without LibreOffice. Only Office document conversion requires it.
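
Before converting Office files, it can help to check whether a LibreOffice executable is reachable. The sketch below looks for the common executable names `soffice` and `libreoffice` on `PATH`. How libmumd itself locates LibreOffice is not stated here, and installs on Windows and macOS are often outside `PATH`, so a `None` result does not prove LibreOffice is missing. `find_libreoffice` is a hypothetical helper.

```python
import shutil


def find_libreoffice(which=shutil.which):
    """Return the path of a LibreOffice executable found on PATH, or None.

    `which` is injectable so the lookup can be replaced in tests.
    """
    for name in ("soffice", "libreoffice"):
        path = which(name)
        if path:
            return path
    return None


if find_libreoffice() is None:
    print("LibreOffice not found on PATH; Office conversion may fail.")
```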

Output Quality Comparison

| Aspect | markitdown only | libmumd |
| --- | --- | --- |
| Table formatting | Inconsistent | Clean Markdown tables |
| Multi-language | Basic | Full Unicode support |
| Layout preservation | None | Reading order preserved |
| Image handling | Limited | Extracted and referenced |
| Header detection | None | Automatic heading levels |

Persian (Farsi) & Arabic Support

libmumd is built with Persian and Arabic documents in mind:

RTL text handling — Correctly processes right-to-left text
Persian typography — Preserves proper character connections and diacritics
Mixed content — Handles documents with both Persian/Arabic and English text
PDF extraction — Extracts Persian text without garbling or losing characters
Font support — Works with Persian fonts like IRANSans, Vazirmatn, and more

from libmumd import convert_file

# Convert a Persian PDF document
result = convert_file("persian-document.pdf")
# Output preserves RTL text and Persian characters correctly
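
One rough way to sanity-check converted output is to measure how much of its text is right-to-left script. The sketch below uses the standard library's `unicodedata.bidirectional`, which reports `AL` for Arabic-script letters (including Persian) and `R` for other right-to-left letters. `rtl_letter_ratio` is a hypothetical helper, not part of libmumd, and a high ratio only shows that the characters survived, not that their order is correct.

```python
import unicodedata


def rtl_letter_ratio(text):
    """Return the fraction of alphabetic characters that are right-to-left."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    rtl = sum(
        1 for ch in letters
        if unicodedata.bidirectional(ch) in ("R", "AL")
    )
    return rtl / len(letters)


print(rtl_letter_ratio("سلام"))       # all letters are Arabic script
print(rtl_letter_ratio("سلام abcd"))  # mixed Persian and English
```

A Persian document whose Markdown output scores near zero was most likely garbled during extraction.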

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

License

This project is licensed under the MIT License — see the LICENSE file for details.

Dependency Licenses

pymupdf4llm — AGPL-3.0 (required for PDF conversion)
markitdown — MIT
LibreOffice — MPL-2.0

Users of this package must comply with the AGPL-3.0 license for pymupdf4llm.

Acknowledgments

PyMuPDF4LLM — PDF extraction engine
markitdown — Fallback converter
LibreOffice — Office document handling

Project details

Release history

0.1.1

Jun 28, 2026

This version

0.1.0

Jun 28, 2026

Download files

Source Distribution

libmumd-0.1.0.tar.gz (6.8 kB)

Uploaded Jun 28, 2026 Source

Built Distribution

libmumd-0.1.0-py3-none-any.whl (7.0 kB)

Uploaded Jun 28, 2026 Python 3

File details

Details for the file libmumd-0.1.0.tar.gz.

File metadata

Download URL: libmumd-0.1.0.tar.gz
Upload date: Jun 28, 2026
Size: 6.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for libmumd-0.1.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `d7b65fb949ddfa274907d7f913b826f751807d7c137bfebdc034a76bec2884dc` |
| MD5 | `33ab343c0448f24981f50c98f9a7d502` |
| BLAKE2b-256 | `6e2bd0f4a07f23512eb9fad55ef98d4c0ec2247c575302067255de05f1e17f1e` |
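
A downloaded file can be checked against the published SHA256 digest with the standard library. This is a minimal sketch; `sha256_of` is a hypothetical helper name, and the local file path is whatever location the archive was saved to.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Compare the return value of `sha256_of("libmumd-0.1.0.tar.gz")` with the SHA256 digest listed for that file; a mismatch means the download is corrupt or is a different file.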

File details

Details for the file libmumd-0.1.0-py3-none-any.whl.

File metadata

Download URL: libmumd-0.1.0-py3-none-any.whl
Upload date: Jun 28, 2026
Size: 7.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for libmumd-0.1.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `88f97bbbe6015ec8fd3b9c8125282e7bf65965a342a3d1a22e762dc7ab0bac3e` |
| MD5 | `ff899550f76f752572e4e69ece9627da` |
| BLAKE2b-256 | `0e32dfef4ec4c025bf856ba468ac77a7df42ac7398aea2d619784655415aed74` |
