Convert documents (Office, PDF) to Markdown — optimized for Persian/Farsi and multilingual content
Why libmumd?
Most document converters produce messy output — broken tables, lost formatting, garbled non-English text. libmumd is different:
- Persian & Arabic first-class support — Correctly handles RTL text, Persian typography, and Arabic script
- Better than markitdown alone — Uses PyMuPDF's layout engine for cleaner, more accurate conversion
- Multi-language support — Handles Persian, Arabic, Chinese, Japanese, Korean, and European languages
- Table detection — Automatically converts complex tables to Markdown format
- Figure & image handling — Extracts and references images properly
- Layout-aware — Preserves reading order, headers, and document structure
- No GPU required — Runs on any machine with Python
Features
| Feature | Description |
|---|---|
| PDF → Markdown | High-quality extraction with layout preservation |
| Office → Markdown | Convert .docx, .pptx, .xlsx, and more via LibreOffice |
| Smart table parsing | Complex tables become clean Markdown tables |
| Image extraction | Embedded images are saved and referenced |
| Header detection | Font sizes map to # heading levels automatically |
| Inline formatting | Preserves bold, italic, and code |
| Multi-column layouts | Reconstructs natural reading order |
| OCR fallback | Handles scanned documents when text layer is missing |
Installation
pip install libmumd
Or install from GitHub:
pip install git+https://github.com/erfan-ashtari/libmumd.git
Quick Start
Command Line
# Convert PDF to Markdown
libmumd document.pdf
# Convert Office document
libmumd report.docx output.md
Python
from libmumd import convert_file
# Basic usage
result = convert_file("document.pdf")
print(result)
# {'status': 'ok', 'chars': 4523, 'output': 'document.md'}
# Custom output path
result = convert_file("presentation.pptx", "slides.md")
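To convert many files in one go, the `convert_file` call shown above can be wrapped in a small loop. The following is a minimal sketch and not part of libmumd: `convert_folder` and its `convert` parameter are hypothetical names, and the error dictionary for failed files is an assumption, since this page does not say how `convert_file` reports failures.

```python
from pathlib import Path


def convert_folder(folder, convert=None, extensions=(".pdf", ".docx", ".pptx")):
    """Convert every matching file in `folder`, writing a .md file next to it.

    `convert` defaults to libmumd's convert_file. Any callable that accepts
    (input_path, output_path) can be passed instead, which keeps the loop
    easy to test without the real converter.
    """
    if convert is None:
        from libmumd import convert_file as convert  # imported lazily

    results = {}
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.suffix.lower() not in extensions:
            continue
        target = path.with_suffix(".md")
        try:
            results[path.name] = convert(str(path), str(target))
        except Exception as exc:  # assumption: failures may raise
            results[path.name] = {"status": "error", "error": str(exc)}
    return results
```

Calling `convert_folder("reports")` would then return one result per converted file, keyed by file name.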
Supported Formats
| Format | Extensions | Conversion Method |
|---|---|---|
| PDF | .pdf | PyMuPDF (native) |
| Word | .docx, .doc | LibreOffice |
| PowerPoint | .pptx, .ppt | LibreOffice |
| Excel | .xlsx, .xls | LibreOffice |
| OpenDocument | .odt, .odp, .ods | LibreOffice |
| Rich Text | .rtf | LibreOffice |
| Other | Any | markitdown fallback |
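The table above amounts to a lookup from file extension to conversion method. The sketch below restates it in code so you can predict which path a given file would take. It illustrates the table only and is not libmumd's actual dispatch logic: `conversion_method` and the returned labels are hypothetical.

```python
from pathlib import Path

# Extensions the table lists as converted via LibreOffice.
LIBREOFFICE_EXTENSIONS = {
    ".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls",
    ".odt", ".odp", ".ods", ".rtf",
}


def conversion_method(filename):
    """Return the conversion method the table lists for this file's extension."""
    ext = Path(filename).suffix.lower()
    if ext == ".pdf":
        return "pymupdf"
    if ext in LIBREOFFICE_EXTENSIONS:
        return "libreoffice"
    return "markitdown"  # the "Other" row: fallback converter
```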
Requirements
Python Packages (Auto-installed)
- pymupdf4llm: PDF extraction engine
- markitdown: Fallback converter
LibreOffice (Required for Office Files)
LibreOffice is needed to convert Word, PowerPoint, and Excel files.
| OS | Installation |
|---|---|
| Windows | winget install --id TheDocumentFoundation.LibreOffice |
| macOS | brew install --cask libreoffice |
| Linux | sudo apt-get install libreoffice |
Or download from libreoffice.org.
Note: PDF conversion works without LibreOffice. Only Office document conversion requires it.
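Before converting Office files, it can help to check whether a LibreOffice executable is reachable. The sketch below searches PATH for the usual command names, `soffice` and `libreoffice`. It is an assumption that this matches how libmumd locates LibreOffice, and `find_libreoffice` is a hypothetical helper. A missing PATH entry also does not prove LibreOffice is absent, since some installers do not add it to PATH.

```python
import shutil


def find_libreoffice(candidates=("soffice", "libreoffice")):
    """Return the path of the first LibreOffice executable on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    location = find_libreoffice()
    if location is None:
        print("LibreOffice not found on PATH; Office conversion may fail.")
    else:
        print(f"LibreOffice found at {location}")
```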
Output Quality Comparison
| Aspect | markitdown only | libmumd |
|---|---|---|
| Table formatting | Inconsistent | Clean Markdown tables |
| Multi-language | Basic | Full Unicode support |
| Layout preservation | None | Reading order preserved |
| Image handling | Limited | Extracted and referenced |
| Header detection | None | Automatic heading levels |
Persian (Farsi) & Arabic Support
libmumd is built with Persian and Arabic documents in mind:
- RTL text handling — Correctly processes right-to-left text
- Persian typography — Preserves proper character connections and diacritics
- Mixed content — Handles documents with both Persian/Arabic and English text
- PDF extraction — Extracts Persian text without garbling or losing characters
- Font support — Works with Persian fonts like IRANSans, Vazirmatn, and more
from libmumd import convert_file
# Convert a Persian PDF document
result = convert_file("persian-document.pdf")
# Output preserves RTL text and Persian characters correctly
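One rough way to check that Persian text survived conversion is to measure how much of the generated Markdown falls in the main Arabic Unicode block (U+0600 to U+06FF), which also contains the Persian-specific letters, and to look for the replacement character U+FFFD. This is a heuristic sketch and not a libmumd feature; both helper names are hypothetical.

```python
def arabic_script_ratio(text):
    """Fraction of letters that fall in the Arabic block (U+0600 to U+06FF)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06ff")
    return arabic / len(letters)


def has_replacement_chars(text):
    """True if the text contains U+FFFD, a common sign of decoding damage."""
    return "\ufffd" in text
```

For a mostly Persian document, a ratio near zero or any replacement characters in the output file (the `output` key of the result) would suggest the text was garbled.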
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License — see the LICENSE file for details.
Dependency Licenses
libmumd depends on pymupdf4llm, which is licensed under AGPL-3.0. Users of this package must comply with that license.
Acknowledgments
- PyMuPDF4LLM — PDF extraction engine
- markitdown — Fallback converter
- LibreOffice — Office document handling