Universal document parser — converts PDF, DOCX, XLSX, CSV to Markdown strings for AI pipelines

mdextract

Universal document → Markdown parser for AI pipelines.

Converts PDF, DOCX, XLSX, and CSV files into clean Markdown strings with a single function call. Designed to be the extraction layer in RAG systems, LLM pipelines, and document processing workflows.

import mdextract

text = mdextract.parse_file("quarterly_report.pdf")
response = llm.chat(f"Summarise this:\n\n{text}")

Features

Format	Output
`.pdf`	Markdown with headings (detected by font size) and GFM tables
`.docx`	Markdown preserving Word heading styles (`Heading 1–6`, `Title`) and tables
`.xlsx`	One `# Sheet Name` section + GFM table per worksheet
`.csv`	Single GFM Markdown table

Zero configuration — just point it at a file
Returns a string — no temp files, no disk I/O required
Layout-aware for PDFs — tables are detected and rendered separately from body text; headings are inferred from font size
Scanned PDF / OCR support — image-only pages are automatically processed with Tesseract; supports any language and tessdata_best high-accuracy models
AI-pipeline friendly — output is plain UTF-8 Markdown, ready for chunking, embedding, or prompt injection

Installation

pip install mdextract

Or with uv:

uv add mdextract

Quickstart

Functional API (recommended)

import mdextract

# Any supported format — auto-detected from extension
text: str = mdextract.parse_file("report.pdf")
text: str = mdextract.parse_file("data.xlsx")
text: str = mdextract.parse_file("table.csv")
text: str = mdextract.parse_file("document.docx")

# Scanned / image-only PDF — French
text: str = mdextract.parse_file("rapport.pdf", ocr_lang="fra")

# Scanned PDF — French + English mixed, using high-accuracy models
BEST = r"C:\Program Files\Tesseract-OCR\tessdata_best"
text: str = mdextract.parse_file("rapport.pdf", ocr_lang="fra+eng", tessdata_dir=BEST)

Per-format helpers

from mdextract import parse_pdf, parse_docx, parse_csv, parse_xlsx

text = parse_pdf("report.pdf")
text = parse_docx("contract.docx")
text = parse_csv("users.csv")
text = parse_xlsx("financials.xlsx")

Class API

Useful when you want to reuse an instance or save output to disk:

from mdextract import mdextract

parser = mdextract()

# Returns Markdown string
text = parser.parse_file("report.pdf")

# Also write to disk
text = parser.parse_file("report.pdf", output="report.md")

# Inspect supported formats
print(parser.supported_extensions)
# ['.csv', '.docx', '.pdf', '.xlsx']

AI Pipeline Examples

RAG (Retrieval-Augmented Generation)

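from pathlib import Path

import mdextract
from your_vectorstore import embed_and_store  # placeholder for your own storage layer

for file in Path("docs/").glob("**/*"):
    if not file.is_file():
        continue  # glob("**/*") also yields directories
    try:
        markdown = mdextract.parse_file(str(file))
        embed_and_store(source=str(file), content=markdown)
    except ValueError:
        pass  # unsupported format, skip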
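
Most vector stores work better with smaller pieces than with a whole document. The sketch below shows one way to split the returned Markdown at headings before embedding. It is not part of mdextract: `split_markdown_by_heading` is a hypothetical helper, and it assumes the output uses ATX headings (`#` to `######`).

```python
import re


def split_markdown_by_heading(markdown: str) -> list[str]:
    """Split Markdown into chunks, starting a new chunk at each ATX heading."""
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # An ATX heading is 1-6 '#' characters followed by a space.
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that ended up empty (for example, leading blank lines).
    return [chunk for chunk in chunks if chunk]


sample = "# Intro\n\nHello.\n\n## Details\n\n| a | b |\n| --- | --- |\n| 1 | 2 |\n"
chunks = split_markdown_by_heading(sample)
```

Each chunk keeps its heading and any table that follows it, so a table is never cut in half. Lines inside code blocks that happen to start with `# ` would also trigger a split, so a production splitter would need to track fences.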

LLM document Q&A

import mdextract
import openai

context = mdextract.parse_file("annual_report.pdf")

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a financial analyst."},
        {"role": "user", "content": f"Answer based on this document:\n\n{context}\n\nQuestion: What was the net revenue?"},
    ],
)

Batch processing

import mdextract
from pathlib import Path

results = {}
for path in Path("uploads/").iterdir():
    try:
        results[path.name] = mdextract.parse_file(str(path))
    except (ValueError, FileNotFoundError) as e:
        results[path.name] = f"Error: {e}"

Format Notes

PDF

Character-level extraction via pdfplumber
Tables detected automatically using ruling lines; table cells excluded from body text stream
Headings detected by font size relative to the dominant body font size
Page separators inserted as --- with comments
Scanned pages (no embedded text) are automatically OCR'd via Tesseract — no extra code needed
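
To illustrate the font-size heuristic described above, here is a minimal sketch. It is not mdextract's actual implementation: `classify_lines`, the `(text, font_size)` input shape, and the 1.2 ratio threshold are all assumptions made for the example.

```python
from collections import Counter


def classify_lines(lines: list[tuple[str, float]], ratio: float = 1.2) -> list[str]:
    """Render (text, font_size) pairs as Markdown lines.

    Lines whose font size is clearly larger than the dominant (body) size
    are promoted to headings; the largest size becomes level 1.
    """
    if not lines:
        return []
    # The most frequent font size is treated as the body size.
    body_size = Counter(size for _, size in lines).most_common(1)[0][0]
    # Distinct sizes above the threshold, largest first, map to levels 1, 2, 3...
    heading_sizes = sorted(
        {size for _, size in lines if size >= body_size * ratio}, reverse=True
    )
    rendered = []
    for text, size in lines:
        if size in heading_sizes:
            level = min(heading_sizes.index(size) + 1, 6)  # Markdown stops at h6
            rendered.append(f"{'#' * level} {text}")
        else:
            rendered.append(text)
    return rendered


page = [
    ("Annual Report", 24.0),
    ("Overview", 16.0),
    ("Revenue grew.", 11.0),
    ("Costs fell.", 11.0),
    ("Footnote", 9.0),
]
markdown_lines = classify_lines(page)
```

Text smaller than the body size (such as the footnote) is left as plain text, since only sizes above the body size are considered for headings.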

OCR parameters

Parameter	Default	Description
`ocr_lang`	`"eng"`	Tesseract language code(s). Use `"fra"` for French, `"eng+fra"` for mixed.
`tessdata_dir`	`None`	Path to a custom `tessdata` folder. Point this at `tessdata_best` for higher accuracy.

import mdextract

# Standard English OCR (default)
text = mdextract.parse_file("scan.pdf")

# French OCR — standard models
text = mdextract.parse_file("rapport.pdf", ocr_lang="fra")

# French OCR — high-accuracy models (tessdata_best)
BEST = r"C:\Program Files\Tesseract-OCR\tessdata_best"
text = mdextract.parse_file("rapport.pdf", ocr_lang="fra", tessdata_dir=BEST)

See OCR Setup below for installation instructions.

DOCX

Heading levels mapped from Word's built-in styles (Heading 1 → #, Title → #, etc.)
List items detected via w:numPr XML nodes and rendered as - item
Merged table cells are handled; content is joined with a space
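
The style-to-heading mapping can be pictured roughly as follows. This is an illustrative sketch only: `heading_prefix` and `render_paragraph` are hypothetical names, and the `is_list_item` flag stands in for the `w:numPr` check, which is not reproduced here.

```python
import re


def heading_prefix(style_name: str) -> str:
    """Return the Markdown heading prefix for a Word style name, or '' for body text."""
    if style_name == "Title":
        return "# "
    match = re.fullmatch(r"Heading ([1-6])", style_name)
    if match:
        return "#" * int(match.group(1)) + " "
    return ""


def render_paragraph(text: str, style_name: str, is_list_item: bool = False) -> str:
    """Render one paragraph as a Markdown line."""
    if is_list_item:
        # List items are rendered as '- item' regardless of their style.
        return f"- {text}"
    return heading_prefix(style_name) + text


lines = [
    render_paragraph("Service Agreement", "Title"),
    render_paragraph("Scope", "Heading 2"),
    render_paragraph("The supplier shall deliver.", "Normal"),
    render_paragraph("First obligation", "List Paragraph", is_list_item=True),
]
```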

XLSX

Each worksheet becomes a top-level section: # Sheet Name
Fully empty rows at the end of a sheet are stripped
Cell values are coerced to strings; None cells become empty strings
Multi-sheet workbooks produce multiple sections separated by ---
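
The rules above can be sketched in plain Python, with worksheets represented as lists of rows. This is not the library's code: `rows_to_gfm` and `workbook_to_markdown` are hypothetical helpers, and treating the first row of each sheet as the table header is an assumption.

```python
def rows_to_gfm(rows: list[list[object]]) -> str:
    """Render rows of cell values as a GFM table (first row assumed to be the header)."""
    # Coerce every value to a string; None becomes an empty string.
    cells = [["" if value is None else str(value) for value in row] for row in rows]
    # Strip fully empty rows at the end of the sheet.
    while cells and all(cell == "" for cell in cells[-1]):
        cells.pop()
    if not cells:
        return ""
    header, *body = cells
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def workbook_to_markdown(sheets: dict[str, list[list[object]]]) -> str:
    """Render each sheet as a '# Sheet Name' section, separated by '---'."""
    sections = [f"# {name}\n\n{rows_to_gfm(rows)}" for name, rows in sheets.items()]
    return "\n\n---\n\n".join(sections)


workbook = {
    "Q1": [["Region", "Sales"], ["North", 120], ["South", None], [None, None]],
    "Q2": [["Region", "Sales"], ["North", 135]],
}
markdown = workbook_to_markdown(workbook)
```

Cells that contain a `|` character would need escaping before being placed in a GFM table; the sketch leaves that out for brevity.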

CSV

First row treated as the header
UTF-8 BOM handled automatically (utf-8-sig encoding)
Short rows padded to match the column count of the widest row
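
For readers who want to see what these three rules amount to, here is a standard-library sketch. It is an approximation written for this note, not mdextract's source, and `csv_bytes_to_gfm` is a hypothetical name.

```python
import csv
import io


def csv_bytes_to_gfm(data: bytes) -> str:
    """Convert raw CSV bytes into a single GFM table."""
    # 'utf-8-sig' decodes UTF-8 and silently drops a leading BOM if present.
    text = data.decode("utf-8-sig")
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        return ""
    # Pad short rows so every row has as many cells as the widest row.
    width = max(len(row) for row in rows)
    padded = [row + [""] * (width - len(row)) for row in rows]
    header, *body = padded  # the first row is the header
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


raw = b"\xef\xbb\xbfname,age,city\nAda,36\nLinus,54,Helsinki\n"
table = csv_bytes_to_gfm(raw)
```

Without the `utf-8-sig` codec, the BOM would end up glued to the first header cell, which is a common source of mismatched column names downstream.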

Error Handling

import mdextract

try:
    text = mdextract.parse_file("report.pdf")
except FileNotFoundError:
    print("File does not exist")
except ValueError as e:
    print(e)  # "Unsupported file type '.xyz'. Supported: .csv, .docx, .pdf, .xlsx"
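
If you would rather avoid the `ValueError` than catch it, you can filter paths by extension before parsing. The sketch below hard-codes the four extensions listed in this README; `partition_by_support` is a hypothetical helper, and the case-insensitive comparison is an assumption about how you want to treat names like `REPORT.PDF`, not a statement about mdextract's own matching.

```python
from pathlib import Path

# Extensions copied from this README; with the class API you could read
# parser.supported_extensions instead of hard-coding them.
SUPPORTED = {".csv", ".docx", ".pdf", ".xlsx"}


def partition_by_support(paths: list[str]) -> tuple[list[Path], list[Path]]:
    """Split paths into (supported, skipped) based on the file extension."""
    supported: list[Path] = []
    skipped: list[Path] = []
    for path in map(Path, paths):
        if path.suffix.lower() in SUPPORTED:
            supported.append(path)
        else:
            skipped.append(path)
    return supported, skipped


ok, skipped = partition_by_support(
    ["a.pdf", "b.PDF", "notes.txt", "archive.tar.gz", "data.csv"]
)
```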

OCR Setup (Scanned PDFs)

For PDFs that contain scanned images instead of embedded text, mdextract automatically falls back to Tesseract OCR. Two extra packages and a Tesseract installation are required.

1 — Install Tesseract

Windows: Download and run the UB Mannheim installer: https://github.com/UB-Mannheim/tesseract/wiki

Default install path: C:\Program Files\Tesseract-OCR\tesseract.exe

macOS:

brew install tesseract

Linux (Debian/Ubuntu):

sudo apt install tesseract-ocr

2 — Install Python OCR dependencies

pip install pymupdf pytesseract

Or, if the package is published with an OCR extra:

pip install "mdextract[ocr]"

3 — Download language models

Standard models (faster, smaller — ~5 MB each):

# Windows — save to Tesseract's tessdata folder
$dir = "C:\Program Files\Tesseract-OCR\tessdata"
Invoke-WebRequest -Uri "https://github.com/tesseract-ocr/tessdata/raw/main/fra.traineddata" -OutFile "$dir\fra.traineddata"
Invoke-WebRequest -Uri "https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata" -OutFile "$dir\eng.traineddata"

Best models (higher accuracy, larger — ~20 MB each):

# Windows — save to a separate tessdata_best folder
$best = "C:\Program Files\Tesseract-OCR\tessdata_best"
New-Item -ItemType Directory -Path $best -Force
Invoke-WebRequest -Uri "https://github.com/tesseract-ocr/tessdata_best/raw/main/fra.traineddata" -OutFile "$best\fra.traineddata"
Invoke-WebRequest -Uri "https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata" -OutFile "$best\eng.traineddata"

All available languages: https://github.com/tesseract-ocr/tessdata_best

4 — Verify

& "C:\Program Files\Tesseract-OCR\tesseract.exe" --list-langs
# Should include: eng, fra (and any others you downloaded)

Using tessdata_best in code

import mdextract

BEST = r"C:\Program Files\Tesseract-OCR\tessdata_best"

text = mdextract.parse_file(
    "rapport.pdf",
    ocr_lang="fra",          # language code
    tessdata_dir=BEST,       # point at tessdata_best folder
)

Note: Every language you pass in ocr_lang must have a matching .traineddata file in the tessdata_dir folder. For ocr_lang="fra+eng" you need both fra.traineddata and eng.traineddata.
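
A quick pre-flight check can catch a missing model before OCR starts. The helper below is a sketch for this purpose and is not part of mdextract; `missing_traineddata` is a hypothetical name. It only looks for files named `<lang>.traineddata` in the given folder.

```python
import tempfile
from pathlib import Path


def missing_traineddata(ocr_lang: str, tessdata_dir: Path) -> list[str]:
    """Return the language codes in ocr_lang that have no .traineddata file."""
    return [
        lang
        for lang in ocr_lang.split("+")  # "fra+eng" -> ["fra", "eng"]
        if not (tessdata_dir / f"{lang}.traineddata").is_file()
    ]


# Demo against a temporary folder that only contains the French model.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "fra.traineddata").touch()
    missing = missing_traineddata("fra+eng", folder)
    none_missing = missing_traineddata("fra", folder)
```

In real use you would pass the same folder you give to `tessdata_dir` and stop early if the returned list is not empty.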

Requirements

Python ≥ 3.11
pdfplumber — PDF extraction
python-docx — DOCX parsing
openpyxl — XLSX parsing

CSV parsing uses the Python standard library only.

Optional (scanned PDF / OCR support):

pymupdf — PDF page rendering to image
pytesseract — Tesseract OCR wrapper
Tesseract OCR ≥ 5 installed on the system

License

MIT

Release history

1.0.0 (Mar 31, 2026, this version)
0.1.0 (Mar 30, 2026)

