Convert Office documents, PDFs, and images to Markdown
Project description
Office to Markdown Converter (o2md)
A tool that auto-detects Excel, Word, PowerPoint, PDF, Ichitaro, and image files and converts them to roughly adequate Markdown.
This tool was built using Devin.
For production-grade conversion, consider markitdown or docling.
For a feature comparison with docling and markitdown, see o2md_comparison.md.
Japanese documentation (README_JP.md)
Overview
o2md is a Python tool that converts Microsoft Office documents (Excel, Word, PowerPoint), PDFs, Ichitaro documents (.jtd/.jtt), and image files (JPEG/PNG/GIF/BMP/TIFF/WebP) to roughly adequate Markdown. It auto-detects file types and selects the appropriate conversion engine. o2md does not use trendy machine learning-based conversion; instead, it performs good old-fashioned logic-based conversion. Image files and image-based PDFs (scanned documents, etc.) are processed with OCR (Tesseract/manga-ocr/sarashina2.2-ocr) for text extraction. Legacy formats (.xls, .doc, .ppt) and shape/image rendering depend on LibreOffice. Text-only conversion works without LibreOffice.
Key Features
- Unified interface: Convert all Office documents and PDFs with a single command
- Auto file detection: Automatically selects the appropriate conversion method based on file extension
- Legacy format support: Supports
.xlsx/.xls,.docx/.doc,.pptx/.ppt - SVG/PNG output: Export shapes and charts as SVG (default) or PNG
- Excel (
x2md): Convert worksheets including tables, charts, and shapes - Word (
d2md): Convert documents with headings, tables, images, and lists - PowerPoint (
p2md): Convert slides with shapes, tables, and text - PDF (
pdf2md): Convert PDFs with text extraction and OCR (Tesseract/manga-ocr/sarashina2.2-ocr) - Ichitaro (
jtd2md): Parse OLE2 binary format to extract text, tables, and bold text - Image OCR (
img2md): Extract text from images via OCR and convert to Markdown (Tesseract/manga-ocr/sarashina) - Image extraction: Automatically extract and embed shapes and charts as images
- Complex element handling: Slides with mixed tables and shapes are rendered as full-slide images
sarashina2.2-ocr is amazing. Highly recommended if you have a GPU!
Installation
Prerequisites
- Python 3.10+
- uv (recommended) or pip
- LibreOffice (optional, required for shape rendering and legacy format conversion)
Install from PyPI
# Basic install
pip install o2md
# With manga-ocr
pip install o2md[manga-ocr]
# With sarashina2.2-ocr
pip install o2md[sarashina]
# With docling (AI table detection)
pip install o2md[docling]
# All optional dependencies
pip install o2md[manga-ocr,sarashina,docling]
Using uv:
# Basic install
uv pip install o2md
# With manga-ocr
uv pip install o2md[manga-ocr]
# With sarashina2.2-ocr
uv pip install o2md[sarashina]
# All optional dependencies
uv pip install o2md[manga-ocr,sarashina,docling]
Local Development Setup
# Install uv (if not installed)
# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
# Clone the repository
git clone https://github.com/matsu582/o2md.git
cd o2md
# Install dependencies
uv sync
# Include manga-ocr
uv sync --extra manga-ocr
# Include sarashina2.2-ocr
uv sync --extra sarashina
# Include docling (AI table detection)
uv sync --extra docling
# Include all optional dependencies
uv sync --all-extras
Install Tesseract OCR (Optional)
Used by default for OCR text extraction from PDFs and images. Without Tesseract, OCR is unavailable (manga-ocr or sarashina2.2-ocr can be used as alternatives).
# macOS
brew install tesseract tesseract-lang
# Ubuntu/Debian
sudo apt-get install tesseract-ocr tesseract-ocr-jpn tesseract-ocr-eng
# Windows
# Download from https://github.com/UB-Mannheim/tesseract/wiki
Install LibreOffice (Optional)
Required for legacy format (.xls, .doc, .ppt) conversion and shape/image rendering. Text-only conversion works without LibreOffice.
# macOS
brew install libreoffice
# Ubuntu/Debian
sudo apt-get install libreoffice
# Windows
# Download from https://www.libreoffice.org/download/download/
Behavior Without LibreOffice
Without LibreOffice, a warning is displayed at startup and the following degraded behavior applies:
| Feature | With LibreOffice | Without LibreOffice |
|---|---|---|
| .docx / .xlsx / .pptx text conversion | OK | OK |
| .doc / .xls / .ppt conversion | OK | Error |
| Shape/vector image conversion | OK | Skipped |
| Slide image rendering | OK | Skipped |
| Chart image conversion | OK | Skipped |
Tip: Legacy files (.doc, .xls, .ppt) can be pre-converted to newer formats (.docx, .xlsx, .pptx) for text conversion without LibreOffice.
Usage
Basic Usage
After installing from PyPI:
# Convert Excel file
o2md data.xlsx
# Convert Word file
o2md document.docx
# Convert PowerPoint file
o2md presentation.pptx
# Convert PDF file
o2md document.pdf
# Convert Ichitaro file
o2md document.jtd
# Convert image file (OCR text extraction)
o2md photo.jpg
When running from a local clone (development):
uv run o2md input_files/data.xlsx
uv run o2md input_files/document.docx
uv run o2md input_files/presentation.pptx
uv run o2md input_files/document.pdf
uv run o2md input_files/document.jtd
uv run o2md input_files/photo.jpg
Individual Commands
Each conversion engine can also be used as a standalone command:
# PyPI install
d2md document.docx
x2md data.xlsx
p2md presentation.pptx
pdf2md document.pdf
jtd2md document.jtd
img2md photo.jpg
# Local clone
uv run d2md input_files/document.docx
uv run x2md input_files/data.xlsx
uv run p2md input_files/presentation.pptx
uv run pdf2md input_files/document.pdf
uv run jtd2md input_files/document.jtd
uv run img2md input_files/photo.jpg
Options
# Specify output directory
o2md data.xlsx -o custom_output
# Use heading text for links in Word documents
o2md document.docx --use-heading-text
# Output images in PNG format (default: SVG)
o2md data.xlsx --format png
# Specify OCR engine for PDF/image conversion (default: tesseract)
o2md document.pdf --ocr-engine tesseract
o2md document.pdf --ocr-engine manga-ocr
o2md document.pdf --ocr-engine sarashina
# Use tessdata_best for high-accuracy OCR
o2md document.pdf --tessdata-dir ~/tessdata_best
# Text extraction mode (output .txt only)
o2md data.xlsx --text
o2md document.docx --text
o2md presentation.pptx --text
o2md document.pdf --text
o2md photo.jpg --text
Note: When running from a local clone, prefix commands with
uv run(e.g.,uv run o2md data.xlsx).
Batch Folder Conversion
# Convert all supported files in a folder (top-level only)
o2md ./input_files/
# Recursively process subfolders
o2md ./input_files/ -r
# Specify output directory
o2md ./input_files/ -r -o output_all
Output structure for folder input:
input_files/
pdfs/a.pdf
b.xlsx
->
output/
pdfs/a.md + images/
b.md
Legacy Format Files
# Legacy formats are automatically converted to newer formats before processing
o2md old_file.xls
o2md old_doc.doc
o2md old_presentation.ppt
Command-Line Options
| Option | Description |
|---|---|
file |
Office file or folder to convert (required) |
-o, --output-dir |
Output directory (default: ./output) |
-r, --recursive |
[Folder mode] Recursively process subfolders |
--format |
Image output format (svg or png, default: svg) |
--use-heading-text |
[Word only] Use heading text instead of chapter numbers for links |
--shape-metadata |
[Word/Excel only] Output shape metadata |
--ocr-engine |
[PDF/Image] OCR engine (tesseract/manga-ocr/sarashina, default: tesseract) |
--tessdata-dir |
[PDF only] Specify tessdata directory (for tessdata_best) |
--docling |
[PDF only] Enable docling table detection (detects borderless tables) |
--text |
Text extraction mode (output .txt only) |
-v, --verbose |
Show detailed debug output |
-h, --help |
Show help message |
Supported File Formats
| File Type | Extensions | Command | Key Features |
|---|---|---|---|
| Excel | .xlsx, .xls |
x2md |
Tables, charts, shapes, formulas |
| Word | .docx, .doc |
d2md |
Headings, tables, images, lists |
| PowerPoint | .pptx, .ppt |
p2md |
Slides, shapes, tables, text |
.pdf |
pdf2md |
Image conversion, text extraction, OCR | |
| Ichitaro | .jtd, .jtt |
jtd2md |
Text, tables, bold, headings |
| Image | .jpg, .jpeg, .png, .gif, .bmp, .tiff, .tif, .webp |
img2md |
OCR text extraction, image embedding |
Output Format
After conversion, the following files are generated:
output/
+-- [filename].md # Markdown file
+-- images/ # Image folder
+-- [filename]_image_001.svg # SVG by default
+-- [filename]_image_002.svg
+-- ...
SVG is a vector format that maintains quality at any zoom level. Use --format png if PNG is needed.
Conversion Details
Excel (x2md)
- Convert worksheet tables to Markdown tables
- Extract charts as individual images (bar, line, pie, scatter)
- Output chart data as Markdown tables below chart images
- Convert shapes (autoshapes, images, etc.) to images
- Output formula values
- Multi-sheet processing
- Shape clustering with isolated rendering
- Border-based table detection
Word (d2md)
- Preserve heading levels (
#,##,###, etc.) - Paragraph and text decoration (bold, italic, underline, strikethrough, superscript/subscript)
- Bullet and numbered lists
- Convert tables to Markdown tables
- Extract embedded images
- Preserve hyperlinks
- Auto-generate table of contents
- Chapter reference link conversion
- Equation conversion (OMML -> LaTeX)
- Shape and canvas image rendering
- Charts rendered as images at their original position (bar, line, pie, scatter)
- Chart data output as Markdown tables below chart images
PowerPoint (p2md)
- Per-slide heading generation
- Text box paragraphs and lists
- Convert tables to Markdown tables
- Render shape groups as single images
- Extract speaker notes
- Complex slide handling: Slides with mixed tables and shapes are rendered as full-slide images with text alongside
- .ppt file support: Auto-conversion via LibreOffice
Complex Slide Handling
Slides are rendered as full images when they contain mixed elements:
- Text boxes + shapes
- Tables + shapes
- Complex layouts (shapes with visual decorations)
Visual decoration criteria:
- Has solid fill
- Has border width > 0
This ensures callouts, colored rectangles, and other decorative elements are properly rendered as images.
Ichitaro (jtd2md)
- Parse OLE2 Compound Document binary format
- Extract UTF-16BE text from DocumentText stream
- Detect table structure from border information and output as Markdown tables
- Estimate headings from font size (larger than body text ->
##) - Detect bold text via TAG 0020 (character formatting) (
**bold**) - Add trailing spaces for Markdown viewer compatibility
- Extract footnote text
- Output OLE2 metadata (creation date, etc.)
Image OCR (img2md)
- Extract text from images (JPEG/PNG/GIF/BMP/TIFF/WebP) via OCR
- Copy original image to images folder and embed image link in Markdown
- OCR engine selection: Tesseract (default) / manga-ocr / sarashina2.2-ocr
- Japanese path support (via cv2.imdecode)
--textoption support (output .txt only)
PDF (pdf2md)
- Extract embedded text and tables
- Scan pages (image-based PDFs) are rendered as images with OCR text extraction
- Tesseract OCR (default): Document OCR, Japanese + English
- manga-ocr + comic-text-detector: Manga/comic OCR
- sarashina2.2-ocr: End-to-End VLM high-accuracy OCR (GPU recommended)
- tessdata_best support: Specify high-accuracy models via
--tessdata-dir - docling table detection:
--doclingoption detects borderless tables- High-accuracy table detection using TableFormer model
- Supports slide PDFs and tables drawn as shapes
- Detected tables are wrapped in
<details>tags
- Output: Per-page images + Markdown file
Limitations
Excel (x2md)
- Macros are not converted
- Complex conditional formatting is not preserved
- Pivot tables are output as static tables
- Shape rendering may differ due to LibreOffice limitations
Word (d2md)
- Complex layouts (columns, text boxes) are simplified
- Footnotes and endnotes are processed as regular text
- Shape rendering may differ due to LibreOffice limitations
- Comments are not converted
PowerPoint (p2md)
- Animations are not converted
- Embedded videos are not converted (still images only)
- Slide master design elements are not reflected
- Shape rendering may differ due to LibreOffice limitations
PDF (pdf2md)
- Encrypted PDFs cannot be processed
- Text extraction accuracy may decrease for complex PDF layouts
- tessdata_best requires separate download (
--tessdata-diroption) - docling table detection requires additional installation
- Install with
pip install o2md[docling]oruv sync --extra docling - Processing time: ~7-15 seconds per page (CPU only)
- Install with
- manga-ocr requires additional installation
- Install with
pip install o2md[manga-ocr]oruv sync --extra manga-ocr
- Install with
- sarashina2.2-ocr requires additional installation
- Install with
pip install o2md[sarashina]oruv sync --extra sarashina - GPU recommended (CUDA 8GB+ / Apple Silicon MPS 16GB+ unified memory)
- Model (~7.8GB) is auto-downloaded on first run
- Direct image-to-structured-Markdown output (no text detection needed)
- Supports Japanese vertical text, tables, and formulas
- Install with
Ichitaro (jtd2md)
- Only OLE2 format (Ichitaro ver8+) is supported (ver5-7 legacy format is not supported)
- Image/shape extraction is not supported
- Bold detection is per-paragraph (inline character decoration is not supported)
- Complex tables with merged cells may have layout issues
Image OCR (img2md)
- OCR accuracy depends on image quality and resolution
- Handwritten text recognition accuracy may be low
- Tesseract OCR requires prior system installation
- manga-ocr requires
pip install o2md[manga-ocr]oruv sync --extra manga-ocr - sarashina2.2-ocr requires
pip install o2md[sarashina]oruv sync --extra sarashina(GPU recommended)
Project Structure
o2md/
+-- pyproject.toml # Project configuration and dependencies
+-- o2md/ # Main package
| +-- __init__.py
| +-- o2md.py # Unified CLI (auto file detection and conversion)
| +-- x2md.py # Excel conversion engine
| +-- d2md.py # Word conversion engine
| +-- p2md.py # PowerPoint conversion engine
| +-- pdf2md.py # PDF conversion engine
| +-- jtd2md.py # Ichitaro conversion engine
| +-- img2md.py # Image OCR conversion engine
| +-- utils.py # Shared utilities
| +-- omml_converter/ # OMML to LaTeX conversion
+-- tests/ # Test suite
+-- input_files/ # Test input files
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file o2md-0.1.0.tar.gz.
File metadata
- Download URL: o2md-0.1.0.tar.gz
- Upload date:
- Size: 334.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.6 {"installer":{"name":"uv","version":"0.10.6","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9da36d97a7cdf4ab8a5d105607ba0707e30888b88578c081f56c17a53cceb569
|
|
| MD5 |
34dd680c16e88b59a6150732eccd2966
|
|
| BLAKE2b-256 |
d8139c91f69d76932f90fe8e675f99a5bb2eb179874bee9ecec05b0b82cd39a4
|
File details
Details for the file o2md-0.1.0-py3-none-any.whl.
File metadata
- Download URL: o2md-0.1.0-py3-none-any.whl
- Upload date:
- Size: 351.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.6 {"installer":{"name":"uv","version":"0.10.6","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
bb7c5a1d23691ca555e7c94170a2a42ecd9d38d9c467d35a811bd8ab852f96f0
|
|
| MD5 |
a1fe307e93ecec93516a0dc75b4941eb
|
|
| BLAKE2b-256 |
1d300ee77563a897cf31d4538ef0c0c35124bd66e30e09e20d188c8345cefa36
|