undoc

High-performance Microsoft Office document extraction to Markdown.

Installation

pip install undoc

Usage

Basic Usage

from undoc import parse_file

# Parse a document
doc = parse_file("document.docx")

# Convert to Markdown
markdown = doc.to_markdown()
print(markdown)

# Convert to plain text
text = doc.to_text()

# Convert to JSON
json_data = doc.to_json()

With Context Manager

from undoc import parse_file

with parse_file("document.xlsx") as doc:
    print(doc.to_markdown(frontmatter=True))
    print(f"Sections: {doc.section_count}")
    print(f"Resources: {doc.resource_count}")

Parse from Bytes

from undoc import parse_bytes

with open("document.pptx", "rb") as f:
    data = f.read()

doc = parse_bytes(data)
markdown = doc.to_markdown()

Extract Resources (Images)

from undoc import parse_file

doc = parse_file("document.docx")

# Get all resource IDs
resource_ids = doc.get_resource_ids()

for rid in resource_ids:
    # Get resource metadata
    info = doc.get_resource_info(rid)
    print(f"Resource: {info['filename']} ({info['mime_type']})")

    # Get resource binary data
    data = doc.get_resource_data(rid)

    # Save to file
    with open(info['filename'], 'wb') as f:
        f.write(data)
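
The loop above writes each resource into the current directory under whatever filename the document reports. A slightly more defensive variant is sketched below. It assumes only the three resource methods shown above, that `info['filename']` is a string, and that `get_resource_data` returns bytes; the helper name `save_resources` is hypothetical and not part of undoc.

```python
from pathlib import Path


def save_resources(doc, out_dir):
    """Write every resource of `doc` into `out_dir` and return the paths.

    `doc` is any object exposing get_resource_ids(), get_resource_info(rid)
    and get_resource_data(rid), as in the example above.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for rid in doc.get_resource_ids():
        info = doc.get_resource_info(rid)
        # Keep only the final path component so a name such as
        # "../../evil.png" cannot escape out_dir.
        name = Path(info["filename"]).name or f"resource_{rid}"
        target = out / name
        target.write_bytes(doc.get_resource_data(rid))
        written.append(target)
    return written
```

With a parsed document this would be called as `save_resources(doc, "extracted")`. Resources that share a filename overwrite each other in this sketch.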

Document Metadata

from undoc import parse_file

doc = parse_file("document.docx")

print(f"Title: {doc.title}")
print(f"Author: {doc.author}")
print(f"Sections: {doc.section_count}")
print(f"Resources: {doc.resource_count}")

Supported Formats

DOCX - Microsoft Word documents
XLSX - Microsoft Excel spreadsheets
PPTX - Microsoft PowerPoint presentations
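
When converting a whole folder, it can help to filter for these three formats before calling the parser. The sketch below does that by file extension. The names `SUPPORTED_EXTENSIONS`, `is_supported` and `find_supported` are hypothetical helpers, and matching on the extension is an assumption: this README does not say how undoc itself detects the format.

```python
from pathlib import Path

# Extensions for the three formats listed above (assumed mapping).
SUPPORTED_EXTENSIONS = {".docx", ".xlsx", ".pptx"}


def is_supported(path):
    """Return True if the file extension is one of the listed formats."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def find_supported(root):
    """Recursively collect supported files under `root`, sorted by path."""
    return sorted(
        p for p in Path(root).rglob("*") if p.is_file() and is_supported(p)
    )
```

Each path returned by `find_supported("docs")` could then be passed to `parse_file`.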

Features

RAG-Ready Output: Structured Markdown optimized for RAG/LLM applications
High Performance: Native Rust implementation via FFI
Asset Extraction: Images and embedded resources
Metadata Preservation: Document properties, styles, formatting
Cross-Platform: Windows, Linux, macOS (Intel & ARM)
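
For retrieval pipelines, the Markdown output is commonly split into chunks before indexing. The following is a minimal sketch of one way to do that, splitting at headings. It assumes the Markdown uses ATX-style headings (`#` to `######`), which this README does not confirm, and `split_by_headings` is a hypothetical helper rather than an undoc function.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
_HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown):
    """Split Markdown text into (heading, body) chunks.

    Text before the first heading is returned with an empty heading.
    Lines inside ``` fences are never treated as headings.
    """
    chunks = []
    heading, lines = "", []
    in_fence = False

    def flush():
        # Emit the current chunk unless it is completely empty.
        if heading or any(l.strip() for l in lines):
            chunks.append((heading, "\n".join(lines).strip()))

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else _HEADING.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Calling `split_by_headings(doc.to_markdown())` would yield one chunk per heading. Splitting by size or by heading level are equally reasonable choices, depending on the downstream index.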

API Reference

Functions

parse_file(path) - Parse document from file path
parse_bytes(data) - Parse document from bytes
version() - Get library version

Undoc Class

Conversion Methods

to_markdown(frontmatter=False, escape_special=False, paragraph_spacing=False) - Convert to Markdown
to_text() - Convert to plain text
to_json(compact=False) - Convert to JSON
plain_text() - Get plain text (fast extraction)

Properties

title - Document title
author - Document author
section_count - Number of sections
resource_count - Number of resources

Resource Methods

get_resource_ids() - List of resource IDs
get_resource_info(id) - Resource metadata
get_resource_data(id) - Resource binary data

License

MIT License - see LICENSE for details.

Release history

0.3.1 - May 21, 2026
0.3.0 - May 12, 2026
0.2.2 - May 9, 2026 (this version)
0.2.1 - Apr 27, 2026
0.2.0 - Apr 19, 2026
0.1.20 - Apr 15, 2026
0.1.19 - Apr 14, 2026
0.1.18 - Mar 19, 2026
0.1.17 - Mar 9, 2026
0.1.16 - Feb 21, 2026
0.1.15 - Feb 21, 2026
0.1.13 - Jan 31, 2026
0.1.11 - Jan 27, 2026

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

undoc-0.2.2-py3-none-any.whl (4.0 MB)

Uploaded May 9, 2026 Python 3

File details

Details for the file undoc-0.2.2-py3-none-any.whl.

File metadata

Download URL: undoc-0.2.2-py3-none-any.whl
Upload date: May 9, 2026
Size: 4.0 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for undoc-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a079562f197a1d8617b2a47da92259d2efc2dcf875a53b2e239f530a69ab4b33`
MD5	`69f9e24ddb9482a3e069eaf79eacaf48`
BLAKE2b-256	`ebd46cf7cbac3fafa41daa3438b029abaea0404734d45029cbbe7deeccb36bf4`
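
A downloaded wheel can be checked against the SHA256 digest listed above using Python's standard `hashlib` module. This is a minimal sketch; the helper names `sha256_of` and `matches_published_hash` are illustrative, and the local path to the wheel is whatever location you downloaded it to.

```python
import hashlib

# SHA256 digest published above for undoc-0.2.2-py3-none-any.whl.
EXPECTED_SHA256 = (
    "a079562f197a1d8617b2a47da92259d2efc2dcf875a53b2e239f530a69ab4b33"
)


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Compare a file's SHA256 digest with the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_published_hash("undoc-0.2.2-py3-none-any.whl")` should return True only for an unmodified copy of the file described above.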
