High-performance Microsoft Office document extraction to Markdown
undoc
High-performance Microsoft Office document extraction to Markdown.
Installation
pip install undoc
Usage
Basic Usage
from undoc import parse_file
# Parse a document
doc = parse_file("document.docx")
# Convert to Markdown
markdown = doc.to_markdown()
print(markdown)
# Convert to plain text
text = doc.to_text()
# Convert to JSON
json_data = doc.to_json()
With Context Manager
from undoc import parse_file
with parse_file("document.xlsx") as doc:
    print(doc.to_markdown(frontmatter=True))
    print(f"Sections: {doc.section_count}")
    print(f"Resources: {doc.resource_count}")
Parse from Bytes
from undoc import parse_bytes
with open("document.pptx", "rb") as f:
    data = f.read()
doc = parse_bytes(data)
markdown = doc.to_markdown()
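DOCX, XLSX, and PPTX files are ZIP-based Open XML packages, so a byte buffer can be sanity-checked before it is handed to parse_bytes. The sketch below uses only the standard library. The helper name looks_like_ooxml is hypothetical and not part of undoc, and it inspects the container only, so it does not say whether undoc can parse the contents.

```python
import io
import zipfile


def looks_like_ooxml(data: bytes) -> bool:
    """Return True if data is a ZIP archive that contains [Content_Types].xml.

    Hypothetical helper, not part of undoc. It checks the container only.
    """
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return False
    try:
        with zipfile.ZipFile(buffer) as archive:
            # Every Open XML package carries this part at the archive root.
            return "[Content_Types].xml" in archive.namelist()
    except zipfile.BadZipFile:
        return False
```

A caller could run this check first and call parse_bytes(data) only when it returns True.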
Extract Resources (Images)
from undoc import parse_file
doc = parse_file("document.docx")
# Get all resource IDs
resource_ids = doc.get_resource_ids()
for rid in resource_ids:
    # Get resource metadata
    info = doc.get_resource_info(rid)
    print(f"Resource: {info['filename']} ({info['mime_type']})")
    # Get resource binary data
    data = doc.get_resource_data(rid)
    # Save to file
    with open(info['filename'], 'wb') as f:
        f.write(data)
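The loop above writes each resource into the current directory under whatever filename the document reports. A slightly more defensive variant, sketched below, writes into a chosen output directory and drops any directory components from the name. save_resources is a hypothetical helper, not part of undoc. It assumes only the three resource methods shown above, and that get_resource_info returns a mapping with a 'filename' key as in the example.

```python
from pathlib import Path


def save_resources(doc, out_dir):
    """Write every resource of doc into out_dir and return the written paths.

    Hypothetical helper: doc is any object exposing get_resource_ids(),
    get_resource_info(id) and get_resource_data(id) as in the example above.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for rid in doc.get_resource_ids():
        info = doc.get_resource_info(rid)
        # Keep only the final path component, so "../a.png" is written
        # as "a.png" inside out_dir.
        name = Path(info["filename"]).name
        if name in ("", ".", ".."):
            # Assumption: the resource ID is usable as a fallback file name.
            name = str(rid)
        path = target / name
        path.write_bytes(doc.get_resource_data(rid))
        written.append(path)
    return written
```

Resources that report the same filename would overwrite each other here; prefixing the name with the resource ID is one way around that.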
Document Metadata
from undoc import parse_file
doc = parse_file("document.docx")
print(f"Title: {doc.title}")
print(f"Author: {doc.author}")
print(f"Sections: {doc.section_count}")
print(f"Resources: {doc.resource_count}")
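For logging or indexing, the four properties shown above can be gathered into one JSON string. This is a minimal sketch: metadata_json is a hypothetical helper, not part of undoc, and it assumes the property values are JSON-serializable (strings, integers, or None).

```python
import json


def metadata_json(doc):
    """Return the documented metadata properties of doc as a JSON string.

    Hypothetical helper: doc is any object with title, author,
    section_count and resource_count attributes.
    """
    summary = {
        "title": doc.title,
        "author": doc.author,
        "sections": doc.section_count,
        "resources": doc.resource_count,
    }
    # ensure_ascii=False keeps non-ASCII titles and author names readable.
    return json.dumps(summary, indent=2, ensure_ascii=False)
```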
Supported Formats
- DOCX - Microsoft Word documents
- XLSX - Microsoft Excel spreadsheets
- PPTX - Microsoft PowerPoint presentations
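When processing a folder of mixed files, it can help to filter for these three formats before parsing. The sketch below does this by file extension with the standard library. SUPPORTED_SUFFIXES, is_supported, and find_documents are hypothetical names, not part of undoc, and the extension check is an assumption: undoc may detect formats differently.

```python
from pathlib import Path

# Extensions taken from the list of supported formats above.
SUPPORTED_SUFFIXES = {".docx", ".xlsx", ".pptx"}


def is_supported(path):
    """Return True if the file extension is one of the supported formats."""
    return Path(path).suffix.lower() in SUPPORTED_SUFFIXES


def find_documents(root):
    """Return a sorted list of supported files found anywhere under root."""
    return sorted(
        p for p in Path(root).rglob("*") if p.is_file() and is_supported(p)
    )
```

Each path returned by find_documents could then be passed to parse_file.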
Features
- RAG-Ready Output: Structured Markdown optimized for RAG/LLM applications
- High Performance: Native Rust implementation via FFI
- Asset Extraction: Images and embedded resources
- Metadata Preservation: Document properties, styles, formatting
- Cross-Platform: Windows, Linux, macOS (Intel & ARM)
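To illustrate how structured Markdown can feed a RAG pipeline, the sketch below splits a Markdown string into chunks at heading lines. It assumes the output of to_markdown() uses ATX headings (lines starting with '#'), which this page does not confirm. split_by_heading is a hypothetical helper, not part of undoc, and it does not treat '#' lines inside code fences specially.

```python
import re

# An ATX heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown):
    """Split Markdown into (heading, body) tuples at ATX heading lines.

    Text before the first heading is returned with heading set to None.
    """
    chunks = []
    heading, lines = None, []

    def flush():
        # Emit the current chunk unless it is an empty preamble.
        if heading is not None or any(line.strip() for line in lines):
            chunks.append((heading, "\n".join(lines).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Each (heading, body) pair could then be embedded or indexed as one retrieval unit.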
API Reference
Functions
- parse_file(path) - Parse document from file path
- parse_bytes(data) - Parse document from bytes
- version() - Get library version
Undoc Class
Conversion Methods
- to_markdown(frontmatter=False, escape_special=False, paragraph_spacing=False) - Convert to Markdown
- to_text() - Convert to plain text
- to_json(compact=False) - Convert to JSON
- plain_text() - Get plain text (fast extraction)
Properties
- title - Document title
- author - Document author
- section_count - Number of sections
- resource_count - Number of resources
Resource Methods
- get_resource_ids() - List of resource IDs
- get_resource_info(id) - Resource metadata
- get_resource_data(id) - Resource binary data
License
MIT License - see LICENSE for details.
Download files
Source Distributions
No source distribution files are available for this release.
Built Distribution
undoc-0.2.2-py3-none-any.whl (4.0 MB)
File details
Details for the file undoc-0.2.2-py3-none-any.whl.
File metadata
- Download URL: undoc-0.2.2-py3-none-any.whl
- Upload date:
- Size: 4.0 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a079562f197a1d8617b2a47da92259d2efc2dcf875a53b2e239f530a69ab4b33 |
| MD5 | 69f9e24ddb9482a3e069eaf79eacaf48 |
| BLAKE2b-256 | ebd46cf7cbac3fafa41daa3438b029abaea0404734d45029cbbe7deeccb36bf4 |
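A downloaded wheel can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: sha256_of and matches_expected are hypothetical helper names, and the local file path is whatever location the wheel was saved to.

```python
import hashlib

# SHA256 digest published for undoc-0.2.2-py3-none-any.whl (from the table above).
EXPECTED_SHA256 = "a079562f197a1d8617b2a47da92259d2efc2dcf875a53b2e239f530a69ab4b33"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected value."""
    return sha256_of(path) == expected.lower()
```

For example, matches_expected("undoc-0.2.2-py3-none-any.whl") returns True only if the local file hashes to the published digest.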