Universal document parser — converts PDF, DOCX, XLSX, CSV to Markdown strings for AI pipelines
Project description
mdextract
Universal document → Markdown parser for AI pipelines.
Converts PDF, DOCX, XLSX, and CSV files into clean Markdown strings with a single function call. Designed to be the extraction layer in RAG systems, LLM pipelines, and document processing workflows.
import mdextract
text = mdextract.parse_file("quarterly_report.pdf")
response = llm.chat(f"Summarise this:\n\n{text}")
Features
| Format | Output |
|---|---|
.pdf |
Markdown with headings (detected by font size) and GFM tables |
.docx |
Markdown preserving Word heading styles (Heading 1–6, Title) and tables |
.xlsx |
One # Sheet Name section + GFM table per worksheet |
.csv |
Single GFM Markdown table |
- Zero configuration — just point it at a file
- Returns a string — no temp files, no disk I/O required
- Layout-aware for PDFs — tables are detected and rendered separately from body text; headings are inferred from font size
- Scanned PDF / OCR support — image-only pages are automatically processed with Tesseract; supports any language and
tessdata_besthigh-accuracy models - AI-pipeline friendly — output is plain UTF-8 Markdown, ready for chunking, embedding, or prompt injection
Installation
pip install mdextract
Or with uv:
uv add mdextract
Quickstart
Functional API (recommended)
import mdextract
# Any supported format — auto-detected from extension
text: str = mdextract.parse_file("report.pdf")
text: str = mdextract.parse_file("data.xlsx")
text: str = mdextract.parse_file("table.csv")
text: str = mdextract.parse_file("document.docx")
# Scanned / image-only PDF — French
text: str = mdextract.parse_file("rapport.pdf", ocr_lang="fra")
# Scanned PDF — French + English mixed, using high-accuracy models
BEST = r"C:\Program Files\Tesseract-OCR\tessdata_best"
text: str = mdextract.parse_file("rapport.pdf", ocr_lang="fra", tessdata_dir=BEST)
Per-format helpers
from mdextract import parse_pdf, parse_docx, parse_csv, parse_xlsx
text = parse_pdf("report.pdf")
text = parse_docx("contract.docx")
text = parse_csv("users.csv")
text = parse_xlsx("financials.xlsx")
Class API
Useful when you want to reuse an instance or save output to disk:
from mdextract import mdextract
parser = mdextract()
# Returns Markdown string
text = parser.parse_file("report.pdf")
# Also write to disk
text = parser.parse_file("report.pdf", output="report.md")
# Inspect supported formats
print(parser.supported_extensions)
# ['.csv', '.docx', '.pdf', '.xlsx']
AI Pipeline Examples
RAG (Retrieval-Augmented Generation)
import mdextract
from your_vectorstore import embed_and_store
for file in Path("docs/").glob("**/*"):
try:
markdown = mdextract.parse_file(str(file))
embed_and_store(source=str(file), content=markdown)
except ValueError:
pass # unsupported format, skip
LLM document Q&A
import mdextract
import openai
context = mdextract.parse_file("annual_report.pdf")
response = openai.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a financial analyst."},
{"role": "user", "content": f"Answer based on this document:\n\n{context}\n\nQuestion: What was the net revenue?"},
],
)
Batch processing
import mdextract
from pathlib import Path
results = {}
for path in Path("uploads/").iterdir():
try:
results[path.name] = mdextract.parse_file(str(path))
except (ValueError, FileNotFoundError) as e:
results[path.name] = f"Error: {e}"
Format Notes
- Character-level extraction via pdfplumber
- Tables detected automatically using ruling lines; table cells excluded from body text stream
- Headings detected by font size relative to the dominant body font size
- Page separators inserted as
---with<!-- Page N -->comments - Scanned pages (no embedded text) are automatically OCR'd via Tesseract — no extra code needed
OCR parameters
| Parameter | Default | Description |
|---|---|---|
ocr_lang |
"eng" |
Tesseract language code(s). Use "fra" for French, "eng+fra" for mixed. |
tessdata_dir |
None |
Path to a custom tessdata folder. Point this at tessdata_best for higher accuracy. |
import mdextract
# Standard English OCR (default)
text = mdextract.parse_file("scan.pdf")
# French OCR — standard models
text = mdextract.parse_file("rapport.pdf", ocr_lang="fra")
# French OCR — high-accuracy models (tessdata_best)
BEST = r"C:\Program Files\Tesseract-OCR\tessdata_best"
text = mdextract.parse_file("rapport.pdf", ocr_lang="fra", tessdata_dir=BEST)
See OCR Setup below for installation instructions.
DOCX
- Heading levels mapped from Word's built-in styles (
Heading 1→#,Title→#, etc.) - List items detected via
w:numPrXML nodes and rendered as- item - Merged table cells are handled; content is joined with a space
XLSX
- Each worksheet becomes a top-level section:
# Sheet Name - Fully empty rows at the end of a sheet are stripped
- Cell values are coerced to strings;
Nonecells become empty strings - Multi-sheet workbooks produce multiple sections separated by
---
CSV
- First row treated as the header
- UTF-8 BOM handled automatically (
utf-8-sigencoding) - Short rows padded to match the column count of the widest row
Error Handling
import mdextract
try:
text = mdextract.parse_file("report.pdf")
except FileNotFoundError:
print("File does not exist")
except ValueError as e:
print(e) # "Unsupported file type '.xyz'. Supported: .csv, .docx, .pdf, .xlsx"
OCR Setup (Scanned PDFs)
For PDFs that contain scanned images instead of embedded text, mdextract automatically falls back to Tesseract OCR. Two extra packages and a Tesseract installation are required.
1 — Install Tesseract
Windows: Download and run the UB Mannheim installer: https://github.com/UB-Mannheim/tesseract/wiki
Default install path: C:\Program Files\Tesseract-OCR\tesseract.exe
macOS:
brew install tesseract
Linux (Debian/Ubuntu):
sudo apt install tesseract-ocr
2 — Install Python OCR dependencies
pip install pymupdf pytesseract
Or with extras (if published with OCR extras):
pip install "mdextract[ocr]"
3 — Download language models
Standard models (faster, smaller — ~5 MB each):
# Windows — save to Tesseract's tessdata folder
$dir = "C:\Program Files\Tesseract-OCR\tessdata"
Invoke-WebRequest -Uri "https://github.com/tesseract-ocr/tessdata/raw/main/fra.traineddata" -OutFile "$dir\fra.traineddata"
Invoke-WebRequest -Uri "https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata" -OutFile "$dir\eng.traineddata"
Best models (higher accuracy, larger — ~20 MB each):
# Windows — save to a separate tessdata_best folder
$best = "C:\Program Files\Tesseract-OCR\tessdata_best"
New-Item -ItemType Directory -Path $best -Force
Invoke-WebRequest -Uri "https://github.com/tesseract-ocr/tessdata_best/raw/main/fra.traineddata" -OutFile "$best\fra.traineddata"
Invoke-WebRequest -Uri "https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata" -OutFile "$best\eng.traineddata"
All available languages: https://github.com/tesseract-ocr/tessdata_best
4 — Verify
& "C:\Program Files\Tesseract-OCR\tesseract.exe" --list-langs
# Should include: eng, fra (and any others you downloaded)
Using tessdata_best in code
import mdextract
BEST = r"C:\Program Files\Tesseract-OCR\tessdata_best"
text = mdextract.parse_file(
"rapport.pdf",
ocr_lang="fra", # language code
tessdata_dir=BEST, # point at tessdata_best folder
)
Note: Every language you pass in
ocr_langmust have a matching.traineddatafile in thetessdata_dirfolder. Forocr_lang="fra+eng"you need bothfra.traineddataandeng.traineddata.
Requirements
- Python ≥ 3.11
- pdfplumber — PDF extraction
- python-docx — DOCX parsing
- openpyxl — XLSX parsing
CSV parsing uses the Python standard library only.
Optional (scanned PDF / OCR support):
- pymupdf — PDF page rendering to image
- pytesseract — Tesseract OCR wrapper
- Tesseract OCR ≥ 5 installed on the system
License
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file mdextract-1.0.0.tar.gz.
File metadata
- Download URL: mdextract-1.0.0.tar.gz
- Upload date:
- Size: 11.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.4 {"installer":{"name":"uv","version":"0.10.4","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":null,"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a4e315dc2dd897d1636cde0683b651dfe98b0c1a1fd054a8a15ffb5818469874
|
|
| MD5 |
4fe4819467e44874e8311fb00e00cf83
|
|
| BLAKE2b-256 |
68ad7de22cf9403c01104e3be3cc946482c19b03681899b07cf5cb8f4447a492
|
File details
Details for the file mdextract-1.0.0-py3-none-any.whl.
File metadata
- Download URL: mdextract-1.0.0-py3-none-any.whl
- Upload date:
- Size: 13.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.4 {"installer":{"name":"uv","version":"0.10.4","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":null,"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e52bb6b72d6d27abeaf94dda21de3253bd9fcca7ef4c50029e7bd9d6c75b6698
|
|
| MD5 |
f1209ff30f27ed7131f571faa3707a25
|
|
| BLAKE2b-256 |
cef4688530aa560abf8ff36e74d4cd2a44004e012c3e6c1f0393a2e945dd8964
|