Universal document parser — converts PDF, DOCX, XLSX, CSV to Markdown strings for AI pipelines

mdextract

Universal document → Markdown parser for AI pipelines.

Converts PDF, DOCX, XLSX, and CSV files into clean Markdown strings with a single function call. Designed to be the extraction layer in RAG systems, LLM pipelines, and document processing workflows.

import mdextract

text = mdextract.parse_file("quarterly_report.pdf")
response = llm.chat(f"Summarise this:\n\n{text}")

Features

Format  Output
.pdf    Markdown with headings (detected by font size) and GFM tables
.docx   Markdown preserving Word heading styles (Heading 1–6, Title) and tables
.xlsx   One # Sheet Name section + GFM table per worksheet
.csv    Single GFM Markdown table

  • Zero configuration: just point it at a file
  • Returns a string: no temp files, and nothing is written to disk unless you ask for it
  • Layout-aware for PDFs: tables are detected and rendered separately from body text; headings are inferred from font size
  • AI-pipeline friendly: output is plain UTF-8 Markdown, ready for chunking, embedding, or insertion into prompts

Installation

pip install mdextract

Or with uv:

uv add mdextract

Quickstart

Functional API (recommended)

import mdextract

# Any supported format — auto-detected from extension
text: str = mdextract.parse_file("report.pdf")
text: str = mdextract.parse_file("data.xlsx")
text: str = mdextract.parse_file("table.csv")
text: str = mdextract.parse_file("document.docx")

Per-format helpers

from mdextract import parse_pdf, parse_docx, parse_csv, parse_xlsx

text = parse_pdf("report.pdf")
text = parse_docx("contract.docx")
text = parse_csv("users.csv")
text = parse_xlsx("financials.xlsx")

Class API

Useful when you want to reuse an instance or save output to disk:

from mdextract import mdextract

parser = mdextract()

# Returns Markdown string
text = parser.parse_file("report.pdf")

# Also write to disk
text = parser.parse_file("report.pdf", output="report.md")

# Inspect supported formats
print(parser.supported_extensions)
# ['.csv', '.docx', '.pdf', '.xlsx']

AI Pipeline Examples

RAG (Retrieval-Augmented Generation)

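import mdextract
from pathlib import Path
from your_vectorstore import embed_and_store  # placeholder for your own vector-store module

for file in Path("docs/").glob("**/*"):
    if not file.is_file():
        continue  # glob("**/*") also yields directories
    try:
        markdown = mdextract.parse_file(str(file))
        embed_and_store(source=str(file), content=markdown)
    except ValueError:
        pass  # unsupported format, skip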
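
Because the output is Markdown with # headings, one simple way to chunk it before embedding is to split at heading lines. The following is a minimal sketch, assuming ATX-style headings; split_by_headings is a hypothetical helper and not part of mdextract.

```python
import re

# Hypothetical helper (not part of mdextract): split Markdown into
# chunks that each start at an ATX heading line ("# ", "## ", ...).
HEADING_RE = re.compile(r"^#{1,6} ")


def split_by_headings(markdown: str) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # A heading closes the chunk collected so far and starts a new one.
        if HEADING_RE.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that contain only whitespace.
    return [c for c in chunks if c]


sample = "intro text\n# Revenue\nUp 4%.\n## Detail\nBy region.\n"
chunks = split_by_headings(sample)
```

Each chunk keeps its heading as the first line, which tends to give the embedding some context. This sketch does not special-case lines starting with # inside fenced code, so it may over-split such content.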

LLM document Q&A

import mdextract
import openai

context = mdextract.parse_file("annual_report.pdf")

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a financial analyst."},
        {"role": "user", "content": f"Answer based on this document:\n\n{context}\n\nQuestion: What was the net revenue?"},
    ],
)

Batch processing

import mdextract
from pathlib import Path

results = {}
for path in Path("uploads/").iterdir():
    try:
        results[path.name] = mdextract.parse_file(str(path))
    except (ValueError, FileNotFoundError) as e:
        results[path.name] = f"Error: {e}"

Format Notes

PDF

  • Character-level extraction via pdfplumber
  • Tables detected automatically using ruling lines; table cells excluded from body text stream
  • Headings detected by font size relative to the dominant body font size
  • Page separators inserted as --- with <!-- Page N --> comments

DOCX

  • Heading levels mapped from Word's built-in styles (Heading 1 → #, Title → #, etc.)
  • List items detected via w:numPr XML nodes and rendered as - item
  • Merged table cells are handled; content is joined with a space
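
The style-to-heading mapping described above can be sketched in plain Python. The helper below is hypothetical: it assumes Title and Heading 1 both map to #, that Heading N maps to N hashes capped at 6, and that the caller has already determined whether a paragraph carries a w:numPr node.

```python
import re

# Assumed mapping (hypothetical helper, not mdextract's code):
# "Title" and "Heading 1" both become "#", "Heading N" becomes N hashes.
_HEADING_STYLE = re.compile(r"^Heading ([1-9])$")


def render_paragraph(style_name: str, text: str, is_list_item: bool = False) -> str:
    if style_name == "Title":
        return f"# {text}"
    match = _HEADING_STYLE.match(style_name)
    if match:
        level = min(int(match.group(1)), 6)  # Markdown has only six levels
        return f"{'#' * level} {text}"
    if is_list_item:  # in a real DOCX this flag would come from a w:numPr node
        return f"- {text}"
    return text


paragraphs = [
    ("Title", "Service Agreement", False),
    ("Heading 2", "Scope", False),
    ("List Paragraph", "Hosting", True),
    ("Normal", "Fees are billed monthly.", False),
]
rendered = [render_paragraph(*p) for p in paragraphs]
```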

XLSX

  • Each worksheet becomes a top-level section: # Sheet Name
  • Fully empty rows at the end of a sheet are stripped
  • Cell values are coerced to strings; None cells become empty strings
  • Multi-sheet workbooks produce multiple sections separated by ---
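
The XLSX rules above can be sketched on plain Python data. The workbook here is a dict of sheet name to rows rather than a real file, and treating the first row as the table header is an assumption, since the notes do not say how the header is chosen. render_sheet and render_workbook are hypothetical names.

```python
def render_sheet(name: str, rows: list[list[object]]) -> str:
    # Strip fully empty rows from the end of the sheet.
    while rows and all(cell is None or cell == "" for cell in rows[-1]):
        rows = rows[:-1]
    lines = [f"# {name}", ""]
    if rows:
        # Coerce every cell to a string; None becomes an empty string.
        cells = [["" if c is None else str(c) for c in row] for row in rows]
        header, *body = cells  # assumption: first row is the header
        lines.append("| " + " | ".join(header) + " |")
        lines.append("| " + " | ".join("---" for _ in header) + " |")
        lines.extend("| " + " | ".join(row) + " |" for row in body)
    return "\n".join(lines)


def render_workbook(sheets: dict[str, list[list[object]]]) -> str:
    # One section per worksheet, separated by a horizontal rule.
    return "\n\n---\n\n".join(render_sheet(n, r) for n, r in sheets.items())


workbook = {
    "Q1": [["Region", "Sales"], ["EMEA", 120], ["APAC", None], [None, None]],
    "Notes": [["Item"], ["Audited"]],
}
md = render_workbook(workbook)
```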

CSV

  • First row treated as the header
  • UTF-8 BOM handled automatically (utf-8-sig encoding)
  • Short rows padded to match the column count of the widest row
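
These three CSV rules map directly onto the standard library. This is a minimal sketch, not mdextract's own code; csv_bytes_to_markdown is a hypothetical name, and the sketch does not escape pipe characters inside cell values.

```python
import csv
import io


def csv_bytes_to_markdown(data: bytes) -> str:
    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
    text = data.decode("utf-8-sig")
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows so every row has as many cells as the widest row.
    rows = [row + [""] * (width - len(row)) for row in rows]
    header, *body = rows  # first row is treated as the header
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


raw = b"\xef\xbb\xbfname,role,team\nAda,engineer\nGrace,admiral,navy\n"
table = csv_bytes_to_markdown(raw)
```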

Error Handling

import mdextract

try:
    text = mdextract.parse_file("report.pdf")
except FileNotFoundError:
    print("File does not exist")
except ValueError as e:
    print(e)  # "Unsupported file type '.xyz'. Supported: .csv, .docx, .pdf, .xlsx"

Requirements

CSV parsing uses the Python standard library only.


License

MIT

Download files

Source Distribution

mdextract-0.1.0.tar.gz (8.4 kB)

Built Distribution

mdextract-0.1.0-py3-none-any.whl (10.6 kB)
