Convert PDF, Office, images, text/JSON/XML, ZIP archives, and web URLs to Markdown.
mdengine
A single Python distribution for converting PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx/.xlsm), images (OCR), plain text / JSON / XML, ZIP archives, and web URLs into Markdown (and related assets). Install only the extras you need; everything imports under the md_generator package.
- PyPI name: mdengine (import package: md_generator)
- Source: github.com/vishal7090/md-generator
- Python: 3.10+
- License: MIT
Quick links: On a new computer · Command-line execution · Python library · HTTP API · MCP · Development · Code of Conduct
On a new computer
Use this checklist the first time you run the tools on a machine that does not have the project yet.
- Install Python 3.10 or newer from python.org (Windows: enable Add python.exe to PATH in the installer). Confirm in a new terminal:
  python --version
- (Recommended) Create an isolated environment so dependencies do not clash with other projects:
  python -m venv .venv
  Then activate it:
  Windows (PowerShell): .\.venv\Scripts\Activate.ps1
  Windows (CMD): .venv\Scripts\activate.bat
  macOS / Linux: source .venv/bin/activate
- Install this package with the extras you need (see Optional dependency extras for what each extra does):
  pip install "mdengine[pdf,word]"
  If the package is not on PyPI yet, clone the repository, cd into the repo root, then:
  pip install -e ".[pdf,word]"
- Confirm the CLI is on your PATH:
  md-pdf --help (or md-word --help, etc.)
  If you see “command not found”, the folder where pip puts scripts (often .venv\Scripts on Windows or .venv/bin on Unix) must be on your PATH, or you must run commands from an activated virtual environment.
- Run one conversion with a real file path, for example:
  md-pdf path\to\report.pdf out.md
Full flags and every md-* command are in Command-line execution.
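The PATH check from the checklist can also be scripted. The following is a minimal sketch that is not part of mdengine: it relies only on the standard library's shutil.which, and the helper name find_cli_tools is hypothetical. The command names are the ones listed in this README.

```python
import shutil

# Console scripts named in this README; extend the tuple if your install adds more.
MD_COMMANDS = ("md-pdf", "md-word", "md-ppt", "md-xlsx",
               "md-image", "md-text", "md-zip", "md-url")


def find_cli_tools(commands=MD_COMMANDS):
    """Map each command name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in commands}


if __name__ == "__main__":
    for name, path in find_cli_tools().items():
        print(f"{name:10s} {path or 'not found (check PATH or activate the venv)'}")
```

A command reported as not found usually means either the matching extra was not installed or the virtual environment is not active in the current shell.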
Installation
From the repository root (editable install for development):
pip install -e .
With format-specific and HTTP extras:
pip install -e ".[pdf,word,api]"
pip install -e ".[ppt,xlsx,image,archive,api,mcp]"
From PyPI (once published):
pip install "mdengine[pdf,word]"
pip install "mdengine[all]"
Optional dependency extras
| Extra | Purpose |
|---|---|
| pdf | PDF extraction (PyMuPDF, pdfplumber) |
| word | DOCX → Markdown (mammoth, markdownify) |
| ppt | PPTX and embedded content (python-pptx, Pillow, lxml, mammoth, PyMuPDF, …) |
| xlsx | Excel → Markdown (openpyxl) |
| image | Image I/O for OCR pipelines (Pillow) |
| image-ocr | Heavy OCR backends (pytesseract, paddle, easyocr, …) |
| text | TXT / JSON / XML converter (stdlib-oriented; marker extra) |
| archive | ZIP → Markdown layout (Pillow; optional tesseract for inline image OCR) |
| url | HTTP(S) HTML → Markdown (httpx, readability-lxml, markdownify, BeautifulSoup, lxml) |
| url-full | url plus PDF/Word/PPTX/XLSX/archive stack for post-converting downloaded linked files to Markdown |
| api | FastAPI, uvicorn, httpx, pydantic-settings |
| mcp | MCP servers (mcp, fastmcp where used) |
| dev | pytest + API/MCP test helpers |
| all | Large superset of dependencies (use only if you need everything) |
Nested ZIP and office files inside archives require the corresponding extras (e.g. archive plus pdf for PDFs inside a ZIP).
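Before converting an archive with mixed contents, it can help to check which optional libraries are actually importable. The sketch below is an illustration, not an mdengine API: the EXTRA_MODULES mapping is an assumption built from the libraries named in the table and covers only a few extras, and the helper names are hypothetical.

```python
from importlib.util import find_spec

# Assumed top-level import names for libraries named in the extras table.
# PyMuPDF imports as "fitz", python-pptx as "pptx", Pillow as "PIL".
EXTRA_MODULES = {
    "pdf": ("fitz", "pdfplumber"),
    "word": ("mammoth", "markdownify"),
    "ppt": ("pptx", "PIL"),
    "xlsx": ("openpyxl",),
    "archive": ("PIL",),
}


def missing_modules(modules):
    """Return the top-level module names that cannot be located."""
    return [name for name in modules if find_spec(name) is None]


def missing_for_extras(extras):
    """Map each requested extra to the modules that appear to be missing."""
    return {extra: missing_modules(EXTRA_MODULES[extra]) for extra in extras}
```

For example, missing_for_extras(["archive", "pdf"]) returning a non-empty list under "pdf" suggests that PDFs inside a ZIP would not be converted until the pdf extra is installed.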
Command-line execution
All converters can be run from a terminal after you install the package (with the right extras for that format). Each tool is a normal executable on your PATH (no need to open Python yourself unless you choose the shim workflow below).
1. Install (once)
pip install "mdengine[pdf,word]" # adjust extras: ppt, xlsx, image, archive, text, …
# or from a clone:
pip install -e ".[pdf,word,archive]"
2. Check that the command is available
md-pdf --help
md-zip --help
If the shell reports “command not found”, ensure the Python Scripts directory is on your PATH (same place pip installs console scripts).
3. Commands (command-line entry points)
| Command | Implements | One-line example |
|---|---|---|
| md-pdf | md_generator.pdf.converter:main | md-pdf report.pdf out.md |
| md-word | md_generator.word.converter:main | md-word notes.docx body.md |
| md-ppt | md_generator.ppt.converter:main | md-ppt deck.pptx ./ppt-out |
| md-xlsx | md_generator.xlsx.converter:main | md-xlsx -i data.xlsx -o ./excel-out (also .csv) |
| md-image | md_generator.image.converter:main | md-image ./scans page.md |
| md-text | md_generator.text.converter:main | md-text config.xml out.md |
| md-zip | md_generator.archive.converter:main | md-zip bundle.zip ./zip-out |
| md-url | md_generator.url.converter:main | md-url https://example.com/doc ./web-out --artifact-layout |
Every command accepts -h / --help for full flags (artifact layout, OCR, ZIP options, etc.).
4. Copy-paste examples (terminal)
bash / macOS / Linux
md-pdf manual.pdf ./artifact --artifact-layout
md-word letter.docx letter.md --images-dir ./letter-images
md-ppt slides.pptx ./ppt-artifact --artifact-layout
md-xlsx -i sales.xlsx -o ./md-sheets --split
md-xlsx -i export.csv -o ./csv-out
md-image ./photos ocr.md --engines tess --strategy best
md-text data.json data.md
md-zip archive.zip ./unzipped-md
md-url https://example.com/page ./page-bundle --artifact-layout
Windows PowerShell (same commands; use backslashes for paths if you prefer)
md-pdf .\manual.pdf .\out\doc.md
md-zip .\archive.zip .\zip-out
md-url https://example.com/page .\page-bundle --artifact-layout
Windows CMD
md-pdf manual.pdf out\doc.md
md-zip archive.zip zip-out
md-url https://example.com/page page-bundle --artifact-layout
5. Run without pip install (repo clone + PYTHONPATH)
The folders pdf-to-md/, word-to-md/, url-to-md/, … contain a thin converter.py that calls the same code as md-pdf, md-word, etc. From the repository root, point Python at src so md_generator imports, then run the shim:
PowerShell
$env:PYTHONPATH = "$PWD\src"
python pdf-to-md\converter.py input.pdf out.md
CMD
set PYTHONPATH=src
python pdf-to-md\converter.py input.pdf out.md
bash
PYTHONPATH=src python pdf-to-md/converter.py input.pdf out.md
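The same shim workflow can be driven from Python when a shell one-liner is inconvenient. This is a minimal sketch under the assumptions stated in this section (a src folder and a converter.py shim under the repository root); the helper name shim_invocation is hypothetical.

```python
import os
import sys
from pathlib import Path


def shim_invocation(repo_root, shim, args):
    """Build (argv, env) for running a converter shim with src on PYTHONPATH."""
    root = Path(repo_root)
    env = dict(os.environ)
    src = str(root / "src")
    # Prepend src so md_generator resolves from the clone, keeping existing entries.
    existing = env.get("PYTHONPATH")
    env["PYTHONPATH"] = src if not existing else src + os.pathsep + existing
    # Use the current interpreter so the shim runs in the same environment.
    argv = [sys.executable, str(root / shim), *args]
    return argv, env
```

To run the PDF shim, pass the result to subprocess.run, for example subprocess.run(argv, env=env, check=True) with argv and env taken from shim_invocation(".", "pdf-to-md/converter.py", ["input.pdf", "out.md"]).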
6. Convert every file in docs/ (strictly command-line)
To process all supported files under the docs/ folder using only the installed md-* tools (no Python snippets), use the batch driver:
| Platform | Command (run from repository root unless noted) |
|---|---|
| Windows | powershell -ExecutionPolicy Bypass -File scripts/run-docs-cli.ps1 |
| Windows | Or double-click / run docs/run-all-cli.cmd (changes to repo root, then runs the script on docs\) |
| macOS / Linux | bash scripts/run-docs-cli.sh |
Optional environment variables for the shell script: DOCS_DIR, OUT_DIR, IMAGE_ENGINES (default tess). PowerShell script parameters: -DocsDir, -OutDir, -ImageEngines.
Outputs are written to docs/cli-output/<basename>/ (one subfolder per input file). .csv files are converted with md-xlsx (same engine as Excel). .md files are skipped.
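For readers who want to adapt the batch behavior, the routing rules described above can be sketched in Python. This is not the logic of the shipped scripts, only an approximation under stated assumptions: the image extensions are guesses, only the top level of the folder is scanned, and <basename> is taken to mean the file name without its extension. The names TOOL_BY_SUFFIX and plan_batch are hypothetical.

```python
from pathlib import Path

# Extension-to-command mapping taken from this README; image extensions are assumed.
TOOL_BY_SUFFIX = {
    ".pdf": "md-pdf",
    ".docx": "md-word",
    ".pptx": "md-ppt",
    ".xlsx": "md-xlsx", ".xlsm": "md-xlsx", ".csv": "md-xlsx",
    ".txt": "md-text", ".json": "md-text", ".xml": "md-text",
    ".zip": "md-zip",
    ".png": "md-image", ".jpg": "md-image", ".jpeg": "md-image",
}


def plan_batch(docs_dir, out_dir=None):
    """Return (tool, input_path, output_dir) tuples for supported files in docs_dir."""
    docs = Path(docs_dir)
    out_root = Path(out_dir) if out_dir else docs / "cli-output"
    plan = []
    for path in sorted(docs.iterdir()):
        if not path.is_file():
            continue
        tool = TOOL_BY_SUFFIX.get(path.suffix.lower())
        if tool is None:  # unsupported types, including .md, are skipped
            continue
        plan.append((tool, path, out_root / path.stem))
    return plan
```

The plan deliberately stops short of building full command lines, because argument shapes differ per tool (md-xlsx takes -i and -o, for instance); consult each command's --help before turning a plan entry into a subprocess call.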
Python library
Import from md_generator.<format> after installing the matching extras.
PDF
from pathlib import Path
from md_generator.pdf.pdf_extract import ConvertOptions, convert_pdf
from md_generator.pdf.utils import resolve_output
pdf = Path("input.pdf")
out = resolve_output(Path("out-dir"), artifact_layout=True, images_dir=None)
convert_pdf(pdf, out, ConvertOptions(verbose=True))
Word (DOCX)
from pathlib import Path
from md_generator.word.converter import convert_docx_to_markdown
convert_docx_to_markdown(
    Path("input.docx"),
    Path("out/body.md"),
    images_dir=Path("out/images"),
    verbose=False,
)
PowerPoint
from pathlib import Path
from md_generator.ppt.convert_impl import convert_pptx
from md_generator.ppt.options import ConvertOptions
convert_pptx(
    Path("slides.pptx"),
    Path("artifact-dir"),
    ConvertOptions(artifact_layout=True, extract_embedded_deep=False),
)
Excel
from pathlib import Path
from md_generator.xlsx.convert_config import ConvertConfig
from md_generator.xlsx.converter_core import convert_excel_to_markdown
result = convert_excel_to_markdown(
    Path("book.xlsx"),
    Path("out-dir"),
    config=ConvertConfig(),
)
print(result.paths_written)
Images (OCR)
from pathlib import Path
from md_generator.image.convert_impl import ConvertOptions, convert_images
convert_images(
    Path("scan.png"),
    Path("out.md"),
    ConvertOptions(
        engines=("tess",),
        strategy="best",
        title="OCR",
        tess_lang="eng",
        tesseract_cmd=None,
        paddle_lang="en",
        paddle_use_angle_cls=True,
        easy_langs=("en",),
        verbose=False,
    ),
)
Text / JSON / XML
from pathlib import Path
from md_generator.text.convert_impl import convert_text_file
from md_generator.text.options import ConvertOptions
convert_text_file(
    Path("data.json"),
    Path("out.md"),
    ConvertOptions(artifact_layout=False, verbose=False),
)
ZIP archive
from pathlib import Path
from md_generator.archive.convert_impl import convert_zip
from md_generator.archive.options import ConvertOptions
convert_zip(
    Path("upload.zip"),
    Path("artifact-out"),
    ConvertOptions(
        enable_office=True,
        use_image_to_md=True,
        verbose=False,
    ),
)
repo_root on ConvertOptions is deprecated and ignored; converters are loaded in-process from md_generator.
HTTP API (FastAPI)
All format APIs follow a similar pattern:
- POST /convert/sync: upload a file (most converters) or send JSON (url-to-md); the response is often a ZIP (artifact bundle) for larger formats.
- POST /convert/jobs: async job; returns job_id.
- GET /convert/jobs/{job_id}: status.
- GET /convert/jobs/{job_id}/download: download result when ready.
Upload field name is file (multipart form) for file-based converters. Use httpx or curl -F "file=@path/to/file". URL conversion uses a JSON body (url or urls); see url-to-md/README.md.
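As an illustration of the upload convention, the sketch below builds a multipart request for /convert/sync using only the standard library, so it runs without httpx. It assumes nothing about the server beyond what this section states (the file form field and the endpoint paths); the helper names build_sync_upload and job_urls are hypothetical, and the response format is not modeled.

```python
import uuid
from pathlib import Path
from urllib.request import Request


def build_sync_upload(base_url, file_path, data=None):
    """Build a multipart POST to /convert/sync with the upload in the "file" field."""
    path = Path(file_path)
    # Read from disk unless the caller supplies the bytes directly.
    payload = path.read_bytes() if data is None else data
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{path.name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return Request(
        base_url.rstrip("/") + "/convert/sync",
        data=head + payload + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def job_urls(base_url, job_id):
    """Status and download URLs for an async job created via POST /convert/jobs."""
    root = f"{base_url.rstrip('/')}/convert/jobs/{job_id}"
    return {"status": root, "download": root + "/download"}
```

Sending the request is then a matter of calling urllib.request.urlopen(req) against a running service and writing the response bytes to disk; since the response is often a ZIP, inspect the Content-Type header before choosing a file name.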
Run with Uvicorn
Install mdengine[api] plus the format extra(s), then run the app object from the table below.
| Service | Uvicorn target | Required extras (typical) |
|---|---|---|
| PDF | md_generator.pdf.api.main:app | pdf, api |
| Word | md_generator.word.api.main:app | word, api, mcp (Word mounts FastMCP) |
| PPTX | md_generator.ppt.api.main:app | ppt, api, mcp |
| XLSX | md_generator.xlsx.api.app:app | xlsx, api |
| Image | md_generator.image.api.main:app | image, api, mcp |
| Text/JSON/XML | md_generator.text.api.main:app | text, api, mcp |
| ZIP | md_generator.archive.api.main:app | archive, api, mcp (+ extras for nested office/PDF) |
| URL / HTML | md_generator.url.api.main:app | url, api, mcp |
Examples:
uvicorn md_generator.pdf.api.main:app --host 127.0.0.1 --port 8001
uvicorn md_generator.word.api.main:app --host 127.0.0.1 --port 8002
uvicorn md_generator.archive.api.main:app --host 127.0.0.1 --port 8010
uvicorn md_generator.url.api.main:app --host 127.0.0.1 --port 8011
MCP over HTTP on the same server
These apps mount an MCP HTTP app at /mcp (Streamable HTTP / framework-specific). Start the API as above, then point an MCP client at http://<host>:<port>/mcp where supported.
Environment variables (limits & CORS)
Prefixes differ per service (often read from a .env file next to the process):
| Service | Prefix | Examples |
|---|---|---|
| PDF | PDF_TO_MD_ | PDF_TO_MD_MAX_UPLOAD_MB, PDF_TO_MD_MAX_SYNC_UPLOAD_MB, PDF_TO_MD_TEMP_DIR, PDF_TO_MD_CORS_ORIGINS |
| Word | WORD_TO_MD_ | WORD_TO_MD_MAX_UPLOAD_MB, WORD_TO_MD_MAX_SYNC_UPLOAD_MB, WORD_TO_MD_JOB_TTL_SECONDS, WORD_TO_MD_TEMP_DIR, WORD_TO_MD_CORS_ORIGINS |
| ZIP | ZIP_TO_MD_ | ZIP_TO_MD_MAX_UPLOAD_MB, ZIP_TO_MD_MAX_SYNC_UPLOAD_MB, ZIP_TO_MD_JOB_TTL_SECONDS, ZIP_TO_MD_TEMP_DIR, ZIP_TO_MD_CORS_ORIGINS, optional image post-pass defaults |
| PPTX | PPT_TO_MD_ | PPT_TO_MD_MAX_UPLOAD_MB, … |
| Text | TXT_JSON_XML_TO_MD_ | same pattern |
| XLSX | XLSX_TO_MD_ | XLSX_TO_MD_TEMP_DIR, XLSX_TO_MD_CORS_ORIGINS, etc. (see md_generator.xlsx.api.app) |
| URL | URL_TO_MD_ | URL_TO_MD_MAX_SYNC_URLS, URL_TO_MD_MAX_SYNC_CRAWL_PAGES, URL_TO_MD_MAX_JOB_URLS, URL_TO_MD_JOB_TTL_SECONDS, URL_TO_MD_TEMP_DIR, URL_TO_MD_CORS_ORIGINS |
Exact variable names match the ApiSettings / helper functions in each api/settings or api/app module.
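The prefix convention lends itself to a small helper when services are launched from a script. The following sketch only illustrates the naming scheme: service_env is a hypothetical helper, and the values shown (50 and /tmp/pdf-to-md) are placeholders, not documented defaults. Only the variable names come from the table.

```python
import os


def service_env(prefix, **settings):
    """Turn keyword settings into prefixed environment variables (values as strings)."""
    return {f"{prefix}{key.upper()}": str(value) for key, value in settings.items()}


# Hypothetical values; only the variable names come from the table above.
pdf_env = service_env("PDF_TO_MD_", max_upload_mb=50, temp_dir="/tmp/pdf-to-md")

# Merge with the current environment before launching the service process.
launch_env = {**os.environ, **pdf_env}
```

The merged mapping can be passed as env= to subprocess.run when starting Uvicorn. Accepted value formats (for example, how multiple CORS origins are separated) are defined by each service's settings module and are not assumed here.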
MCP (Model Context Protocol)
Two usage patterns:
- Bundled with FastAPI: run Uvicorn as in the previous section; use path /mcp on the same host/port.
- Standalone process: run a small __main__ module (stdio, SSE, or streamable-http) for use with Cursor, Claude Desktop, or other MCP hosts.
Standalone MCP processes
| Converter | Command (examples) |
|---|---|
| ZIP | python -m md_generator.archive.api.mcp_server / --transport sse / --transport streamable-http |
| Text/JSON/XML | python -m md_generator.text.api.mcp_server |
| Word (FastMCP) | python -m md_generator.word.api.mcp_server / --transport stdio (default) or streamable-http, plus --host / --port when needed |
| PDF (FastMCP) | python -m md_generator.pdf.api.mcp_server / --transport stdio / sse / streamable-http |
| PPTX | python -m md_generator.ppt.api.mcp_server (see module docstring for flags) |
| Image | python -m md_generator.image.api.mcp_server (see module for CLI) |
| URL / HTML | python -m md_generator.url.api.mcp_server / --transport sse / --transport streamable-http |
Word and XLSX also ship a small runner script in the repo:
python word-to-md/run.py api --host 127.0.0.1 --port 8002
python word-to-md/run.py mcp --transport stdio
python xlsx-to-md/run.py api --port 8003
python xlsx-to-md/run.py mcp --transport stdio
The XLSX MCP server is built in code (build_mcp_server() in md_generator.xlsx.mcp_server) and is mounted on the XLSX FastAPI app when MCP dependencies are installed.
Install mdengine[mcp] (and usually [api] when using HTTP) for MCP-related imports to resolve.
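Many MCP hosts register stdio servers through a JSON file containing an mcpServers map of command and args entries. Assuming your host follows that convention (check its documentation, since the file location and schema are host-specific), entries for the standalone modules above could be generated as in this sketch. The server labels and the helper name mcp_server_entry are hypothetical.

```python
import json


def mcp_server_entry(module, transport=None):
    """Build a stdio-style server entry that launches `python -m <module>`."""
    args = ["-m", module]
    if transport:
        args += ["--transport", transport]
    return {"command": "python", "args": args}


config = {
    "mcpServers": {
        # Server labels are arbitrary; module paths come from the table above.
        "md-pdf": mcp_server_entry("md_generator.pdf.api.mcp_server", "stdio"),
        "md-zip": mcp_server_entry("md_generator.archive.api.mcp_server"),
    }
}

print(json.dumps(config, indent=2))
```

If the host runs outside your virtual environment, replace "python" with the absolute path to the interpreter where mdengine[mcp] is installed.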
Development
pip install -e ".[dev,all]" # or a smaller subset of extras
python -m pytest
Tests live under each legacy folder’s tests/ directory (e.g. pdf-to-md/tests/); pyproject.toml configures pythonpath = ["src"] so md_generator resolves without a separate PYTHONPATH.
Repository layout
| Path | Role |
|---|---|
| LICENSE | MIT license text |
| CODE_OF_CONDUCT.md | Contributor Covenant 2.1 |
| src/md_generator/ | Library source (all formats + api subpackages) |
| pyproject.toml | Packaging, extras, CLI entry points, pytest |
| *-to-md/ | Docs, tests, fixtures, thin converter.py shims, some run.py helpers |
| README.md | This document |
For deeper behavior per format, see the original README files under each *-to-md/ folder where they still exist.
Legal
This project is released under the MIT License. A copy of the license text is included in the repository root.