Universal document parser — converts PDF, DOCX, XLSX, CSV to Markdown strings for AI pipelines

mdextract

Universal document → Markdown parser for AI pipelines.

Converts PDF, DOCX, XLSX, and CSV files into clean Markdown strings with a single function call. Designed to be the extraction layer in RAG systems, LLM pipelines, and document processing workflows.

import mdextract

text = mdextract.parse_file("quarterly_report.pdf")
response = llm.chat(f"Summarise this:\n\n{text}")

Features

| Format | Output |
| --- | --- |
| `.pdf` | Markdown with headings (detected by font size) and GFM tables |
| `.docx` | Markdown preserving Word heading styles (`Heading 1–6`, `Title`) and tables |
| `.xlsx` | One `# Sheet Name` section + GFM table per worksheet |
| `.csv` | Single GFM Markdown table |

Zero configuration: just point it at a file
Returns a string: no temp files, and nothing is written to disk unless you ask for it
Layout-aware for PDFs: tables are detected and rendered separately from body text; headings are inferred from font size
AI-pipeline friendly: output is plain UTF-8 Markdown, ready for chunking, embedding, or insertion into prompts

Installation

pip install mdextract

Or with uv:

uv add mdextract

Quickstart

Functional API (recommended)

import mdextract

# Any supported format — auto-detected from extension
text: str = mdextract.parse_file("report.pdf")
text: str = mdextract.parse_file("data.xlsx")
text: str = mdextract.parse_file("table.csv")
text: str = mdextract.parse_file("document.docx")

Per-format helpers

from mdextract import parse_pdf, parse_docx, parse_csv, parse_xlsx

text = parse_pdf("report.pdf")
text = parse_docx("contract.docx")
text = parse_csv("users.csv")
text = parse_xlsx("financials.xlsx")

Class API

Useful when you want to reuse an instance or save output to disk:

from mdextract import mdextract

parser = mdextract()

# Returns Markdown string
text = parser.parse_file("report.pdf")

# Also write to disk
text = parser.parse_file("report.pdf", output="report.md")

# Inspect supported formats
print(parser.supported_extensions)
# ['.csv', '.docx', '.pdf', '.xlsx']

AI Pipeline Examples

RAG (Retrieval-Augmented Generation)

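from pathlib import Path

import mdextract
from your_vectorstore import embed_and_store  # placeholder for your own storage layer

for file in Path("docs/").glob("**/*"):
    if not file.is_file():
        continue  # glob("**/*") also yields directories
    try:
        markdown = mdextract.parse_file(str(file))
        embed_and_store(source=str(file), content=markdown)
    except ValueError:
        pass  # unsupported format, skip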
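
Because the output is Markdown with headings, one simple way to prepare it for embedding is to split it at heading lines. The sketch below is not part of mdextract: `split_by_headings` is a hypothetical helper, and it assumes headings are ATX-style lines that start with one to six `#` characters.

```python
import re

# Assumption: headings are ATX-style ("# Title", "## Section", ...).
HEADING = re.compile(r"^#{1,6} ")


def split_by_headings(markdown: str) -> list[str]:
    """Split Markdown into chunks, starting a new chunk at every heading line."""
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # A heading closes the chunk collected so far and opens a new one.
        if HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that turned out to be empty after stripping.
    return [chunk for chunk in chunks if chunk]
```

Each chunk can then be passed to the embedding step in place of the whole document. A line starting with `#` inside a code block would also be treated as a heading, so real pipelines may need a stricter splitter.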

LLM document Q&A

import mdextract
import openai

context = mdextract.parse_file("annual_report.pdf")

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a financial analyst."},
        {"role": "user", "content": f"Answer based on this document:\n\n{context}\n\nQuestion: What was the net revenue?"},
    ],
)

Batch processing

import mdextract
from pathlib import Path

results = {}
for path in Path("uploads/").iterdir():
    try:
        results[path.name] = mdextract.parse_file(str(path))
    except (ValueError, FileNotFoundError) as e:
        results[path.name] = f"Error: {e}"
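
For large upload folders it can be cheaper to filter paths before parsing instead of relying on `ValueError` alone. The following is a minimal sketch under stated assumptions: `iter_supported` and `SUPPORTED` are hypothetical names, the extension set is copied from the list shown in the Class API section, and the case-insensitive match is a choice made here, not documented library behaviour.

```python
from collections.abc import Iterator
from pathlib import Path

# Assumption: mirrors the extensions listed under "Class API".
SUPPORTED = {".csv", ".docx", ".pdf", ".xlsx"}


def iter_supported(root: Path) -> Iterator[Path]:
    """Yield regular files under root whose suffix is in SUPPORTED."""
    for path in sorted(root.rglob("*")):
        # Skip directories and anything with an unknown extension.
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            yield path
```

The batch loop can then iterate over `iter_supported(Path("uploads/"))` and keep its `try`/`except` for files that fail to parse.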

Format Notes

PDF

Character-level extraction via pdfplumber
Tables detected automatically using ruling lines; table cells excluded from body text stream
Headings detected by font size relative to the dominant body font size
Page separators inserted as --- with HTML comments
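
To illustrate the font-size heuristic described above, here is a small sketch that works on already-extracted `(text, font_size)` pairs. It is not mdextract's implementation: `classify_lines` is a hypothetical function, and the rules for picking the body size (most characters) and assigning levels (largest size becomes `#`) are assumptions chosen for the example.

```python
from collections import Counter


def classify_lines(lines: list[tuple[str, float]]) -> list[str]:
    """Render (text, font_size) pairs as Markdown lines, promoting large text to headings."""
    if not lines:
        return []

    # Dominant body size: the font size that covers the most characters.
    weight = Counter()
    for text, size in lines:
        weight[size] += len(text)
    body_size = weight.most_common(1)[0][0]

    # Distinct sizes above the body size, largest first: "#", "##", ... capped at level 6.
    larger = sorted({size for _, size in lines if size > body_size}, reverse=True)
    level = {size: min(i + 1, 6) for i, size in enumerate(larger)}

    out = []
    for text, size in lines:
        if size in level:
            out.append(f"{'#' * level[size]} {text}")
        else:
            out.append(text)
    return out
```

Text at or below the body size is left as plain text, so footnotes and captions are never promoted.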

DOCX

Heading levels mapped from Word's built-in styles (Heading 1 → #, Title → #, etc.)
List items detected via w:numPr XML nodes and rendered as - item
Merged table cells are handled; content is joined with a space
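
The style-to-heading mapping can be pictured with a short sketch. `heading_prefix` is a hypothetical helper; the source only confirms that `Heading 1` and `Title` both map to `#`, so mapping `Heading N` to N `#` characters is an assumption.

```python
import re
from typing import Optional


def heading_prefix(style_name: str) -> Optional[str]:
    """Return the Markdown heading prefix for a Word style name, or None for body text."""
    if style_name == "Title":
        return "#"
    # Assumption: "Heading N" maps to N '#' characters for N in 1..6.
    match = re.fullmatch(r"Heading ([1-6])", style_name)
    if match:
        return "#" * int(match.group(1))
    return None
```

For example, `heading_prefix("Heading 3")` gives `###`, while `heading_prefix("Normal")` gives `None` and the paragraph stays body text.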

XLSX

Each worksheet becomes a top-level section: # Sheet Name
Fully empty rows at the end of a sheet are stripped
Cell values are coerced to strings; None cells become empty strings
Multi-sheet workbooks produce multiple sections separated by ---
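
The rules above can be sketched on plain Python data, without opening a workbook. This is an illustration, not the library's code: `workbook_to_markdown` and its helpers are hypothetical, the input is assumed to be a mapping of sheet name to rows, and treating the first row as the table header is an assumption.

```python
def _cell(value: object) -> str:
    """Coerce a cell to text; None becomes an empty string."""
    return "" if value is None else str(value)


def _table(rows: list[list[object]]) -> str:
    """Render rows as a GFM table (assumption: the first row is the header)."""
    width = max(len(row) for row in rows)
    text = [[_cell(v) for v in row] + [""] * (width - len(row)) for row in rows]
    lines = [
        "| " + " | ".join(text[0]) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in text[1:]]
    return "\n".join(lines)


def workbook_to_markdown(sheets: dict[str, list[list[object]]]) -> str:
    """Render each sheet as a '# Sheet Name' section, separated by '---'."""
    sections = []
    for name, rows in sheets.items():
        rows = list(rows)
        # Strip fully empty rows at the end of the sheet.
        while rows and all(v is None or v == "" for v in rows[-1]):
            rows.pop()
        body = _table(rows) if rows else ""
        sections.append(f"# {name}\n\n{body}".rstrip())
    return "\n\n---\n\n".join(sections)
```

A sheet with no data still produces its `# Sheet Name` heading, which keeps the section count equal to the worksheet count.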

CSV

First row treated as the header
UTF-8 BOM handled automatically (utf-8-sig encoding)
Short rows padded to match the column count of the widest row
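
A standard-library sketch of these three rules might look as follows. `csv_to_gfm` is a hypothetical function and not mdextract's code; escaping `|` inside cells is an extra step assumed here so the table stays valid.

```python
import csv
import io


def csv_to_gfm(data: bytes) -> str:
    """Decode CSV bytes and render them as a GFM table with the first row as header."""
    # "utf-8-sig" strips a leading UTF-8 BOM if one is present.
    text = data.decode("utf-8-sig")
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        return ""

    # Pad short rows to the width of the widest row.
    width = max(len(row) for row in rows)
    rows = [
        # Assumption: pipes are escaped so they do not split a cell.
        [cell.replace("|", "\\|") for cell in row] + [""] * (width - len(row))
        for row in rows
    ]

    lines = [
        "| " + " | ".join(rows[0]) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows[1:]]
    return "\n".join(lines)
```

The function takes bytes so that the BOM handling is visible; reading a file with `open(path, encoding="utf-8-sig", newline="")` has the same effect.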

Error Handling

import mdextract

try:
    text = mdextract.parse_file("report.pdf")
except FileNotFoundError:
    print("File does not exist")
except ValueError as e:
    print(e)  # "Unsupported file type '.xyz'. Supported: .csv, .docx, .pdf, .xlsx"
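
The two exceptions above fit a simple extension-based dispatch. The sketch below shows one way such a dispatcher could be written; `pick_parser` is hypothetical, and the order of the checks (existence first, then extension) is an assumption, since the source does not say which error wins for a missing file with an unsupported extension.

```python
from pathlib import Path
from typing import Callable

Parser = Callable[[str], str]


def pick_parser(path: str, parsers: dict[str, Parser]) -> Parser:
    """Return the parser registered for the file's extension."""
    p = Path(path)
    # Assumption: a missing file is reported before the extension is checked.
    if not p.exists():
        raise FileNotFoundError(path)
    suffix = p.suffix.lower()
    if suffix not in parsers:
        supported = ", ".join(sorted(parsers))
        raise ValueError(f"Unsupported file type '{suffix}'. Supported: {supported}")
    return parsers[suffix]
```

With parsers registered for the four supported extensions, a `.xyz` file produces the same message as the one shown in the comment above.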

Requirements

Python ≥ 3.11
pdfplumber — PDF extraction
python-docx — DOCX parsing
openpyxl — XLSX parsing

CSV parsing uses the Python standard library only.

License

MIT

Release history

1.0.0 (Mar 31, 2026)
0.1.0 (Mar 30, 2026, this version)
