undoc

High-performance Microsoft Office document extraction to Markdown.

Installation

pip install undoc

Usage

Basic Usage

from undoc import parse_file

# Parse a document
doc = parse_file("document.docx")

# Convert to Markdown
markdown = doc.to_markdown()
print(markdown)

# Convert to plain text
text = doc.to_text()

# Convert to JSON
json_data = doc.to_json()
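
The three conversions above can be written to disk in one pass. The following is a minimal sketch, not part of undoc: `export_all` and its `stem` parameter are hypothetical names, and the code assumes `to_json()` returns a string (it falls back to `json.dumps` if it does not).

```python
import json
from pathlib import Path


def export_all(doc, out_dir, stem="document"):
    """Write the Markdown, plain-text, and JSON renderings of `doc` to out_dir.

    `doc` is any object exposing to_markdown(), to_text(), and to_json(),
    such as the value returned by undoc.parse_file().
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    json_data = doc.to_json()
    # Assumption: to_json() returns a str; serialize anything else defensively.
    if not isinstance(json_data, str):
        json_data = json.dumps(json_data, ensure_ascii=False)

    written = {}
    for suffix, content in (
        (".md", doc.to_markdown()),
        (".txt", doc.to_text()),
        (".json", json_data),
    ):
        target = out / f"{stem}{suffix}"
        target.write_text(content, encoding="utf-8")
        written[suffix] = target  # map each suffix to the file it produced
    return written
```

With undoc installed, a call would look like `export_all(parse_file("document.docx"), "out", stem="document")`.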

With Context Manager

from undoc import parse_file

with parse_file("document.xlsx") as doc:
    print(doc.to_markdown(frontmatter=True))
    print(f"Sections: {doc.section_count}")
    print(f"Resources: {doc.resource_count}")
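
The context-manager form is convenient when converting many files, since each document is released before the next one is opened. A rough sketch under stated assumptions: `convert_tree` is a hypothetical helper, and the parser is passed in as `parse` so the function works with `undoc.parse_file` or any callable that returns a context manager.

```python
from pathlib import Path

# Extensions taken from the Supported Formats list in this README.
SUPPORTED = {".docx", ".xlsx", ".pptx"}


def convert_tree(src_dir, out_dir, parse):
    """Convert every supported file under src_dir to Markdown in out_dir.

    `parse` is a callable such as undoc.parse_file whose return value can be
    used in a `with` statement.
    """
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    converted = []
    for path in sorted(src.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED:
            continue
        # The document is only needed while rendering, so keep the block short.
        with parse(str(path)) as doc:
            markdown = doc.to_markdown(frontmatter=True)
        # Note: files sharing a stem (a.docx, a.xlsx) overwrite each other here.
        target = out / (path.stem + ".md")
        target.write_text(markdown, encoding="utf-8")
        converted.append(target)
    return converted
```

With undoc installed, `convert_tree("docs", "markdown", parse_file)` would be the expected call.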

Parse from Bytes

from undoc import parse_bytes

with open("document.pptx", "rb") as f:
    data = f.read()

doc = parse_bytes(data)
markdown = doc.to_markdown()
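
When bytes arrive from an upload or a network response, a cheap sanity check before parsing can give clearer errors. DOCX, XLSX, and PPTX files are ZIP containers that carry a `[Content_Types].xml` entry, so the sketch below tests for that using only the standard library. `looks_like_ooxml` is a hypothetical helper, not an undoc function, and it does not replace the validation undoc performs itself.

```python
import io
import zipfile


def looks_like_ooxml(data: bytes) -> bool:
    """Return True if `data` is a ZIP archive containing [Content_Types].xml."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return False
    buffer.seek(0)
    try:
        with zipfile.ZipFile(buffer) as archive:
            return "[Content_Types].xml" in archive.namelist()
    except zipfile.BadZipFile:
        # The end-of-archive record was present but the archive is corrupt.
        return False
```

A caller might run `looks_like_ooxml(data)` first and only then hand the bytes to `parse_bytes(data)`.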

Extract Resources (Images)

from undoc import parse_file

doc = parse_file("document.docx")

# Get all resource IDs
resource_ids = doc.get_resource_ids()

for rid in resource_ids:
    # Get resource metadata
    info = doc.get_resource_info(rid)
    print(f"Resource: {info['filename']} ({info['mime_type']})")

    # Get resource binary data
    data = doc.get_resource_data(rid)

    # Save to file
    with open(info['filename'], 'wb') as f:
        f.write(data)
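
The loop above writes each resource into the current directory under the name stored in the document. For untrusted input it may be safer to write into a dedicated directory and keep only the final path component of that name. A minimal sketch, assuming `get_resource_info()` returns a dict with a `filename` key as shown above; `save_resources` is a hypothetical helper.

```python
from pathlib import Path


def save_resources(doc, out_dir):
    """Save every resource of `doc` into out_dir and return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    saved, used = [], set()
    for index, rid in enumerate(doc.get_resource_ids()):
        info = doc.get_resource_info(rid)
        raw = str(info.get("filename") or "").replace("\\", "/")
        # Keep only the last path component so "../x.png" cannot leave out_dir.
        name = Path(raw).name
        if name in ("", ".", ".."):
            name = f"resource_{index}"
        # Avoid overwriting an earlier resource that has the same filename.
        if name in used:
            name = f"{index}_{name}"
        used.add(name)

        target = out / name
        target.write_bytes(doc.get_resource_data(rid))
        saved.append(target)
    return saved
```

With undoc installed, `save_resources(parse_file("document.docx"), "assets")` would be the expected call.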

Document Metadata

from undoc import parse_file

doc = parse_file("document.docx")

print(f"Title: {doc.title}")
print(f"Author: {doc.author}")
print(f"Sections: {doc.section_count}")
print(f"Resources: {doc.resource_count}")
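
For indexing pipelines it can help to collect these properties into a plain dictionary. A small sketch, assuming the four properties shown above and assuming that `title` or `author` may be empty for some documents (the README does not say); `summarize` is a hypothetical helper.

```python
def summarize(doc):
    """Collect the documented metadata properties of `doc` into a dict."""
    return {
        # Assumption: title/author may be missing or empty; normalize to None.
        "title": getattr(doc, "title", None) or None,
        "author": getattr(doc, "author", None) or None,
        "section_count": doc.section_count,
        "resource_count": doc.resource_count,
    }
```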

Supported Formats

  • DOCX - Microsoft Word documents
  • XLSX - Microsoft Excel spreadsheets
  • PPTX - Microsoft PowerPoint presentations
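
A caller can screen paths by extension before handing them to the parser. The sketch below is illustrative only: `detect_format` is a hypothetical helper, and because the list above names only the three Open XML formats, it treats everything else (including legacy `.doc`, `.xls`, and `.ppt` files) as unsupported.

```python
from pathlib import Path

# Extension-to-format mapping derived from the list above.
FORMATS = {".docx": "DOCX", ".xlsx": "XLSX", ".pptx": "PPTX"}


def detect_format(path):
    """Return "DOCX", "XLSX", or "PPTX" for a supported path, else None."""
    return FORMATS.get(Path(path).suffix.lower())
```

The lookup is case-insensitive, so `Report.DOCX` and `report.docx` map to the same format.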

Features

  • RAG-Ready Output: Structured Markdown optimized for RAG/LLM applications
  • High Performance: Native Rust implementation via FFI
  • Asset Extraction: Images and embedded resources
  • Metadata Preservation: Document properties, styles, formatting
  • Cross-Platform: Windows, Linux, macOS (Intel & ARM)

API Reference

Functions

  • parse_file(path) - Parse document from file path
  • parse_bytes(data) - Parse document from bytes
  • version() - Get library version

Undoc Class

Conversion Methods

  • to_markdown(frontmatter=False, escape_special=False, paragraph_spacing=False) - Convert to Markdown
  • to_text() - Convert to plain text
  • to_json(compact=False) - Convert to JSON
  • plain_text() - Get plain text (fast extraction)

Properties

  • title - Document title
  • author - Document author
  • section_count - Number of sections
  • resource_count - Number of resources

Resource Methods

  • get_resource_ids() - List of resource IDs
  • get_resource_info(id) - Resource metadata
  • get_resource_data(id) - Resource binary data

License

MIT License - see LICENSE for details.

Source Distributions

No source distribution files available for this release.

Built Distribution

undoc-0.1.11-py3-none-any.whl (3.1 MB)

Uploaded Python 3

File details

Details for the file undoc-0.1.11-py3-none-any.whl.

File metadata

  • Download URL: undoc-0.1.11-py3-none-any.whl
  • Upload date:
  • Size: 3.1 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for undoc-0.1.11-py3-none-any.whl
Algorithm Hash digest
SHA256 eac5a20f7697b6b6af4d0df9a3a94695c7c413c71b239c0ec4b2f0d75be9297f
MD5 cd2ae3bc7bf5dc99573acb27f2d30c20
BLAKE2b-256 9fb1aa70d50e185aea2d712b7e4a1dae80ffc72b23a80b23972cad2f7c00d266
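
A downloaded wheel can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: `sha256_of` and `verify_wheel` are hypothetical helper names, and the expected digest is copied from the table above.

```python
import hashlib

# SHA256 digest listed above for undoc-0.1.11-py3-none-any.whl.
EXPECTED_SHA256 = "eac5a20f7697b6b6af4d0df9a3a94695c7c413c71b239c0ec4b2f0d75be9297f"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_wheel(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest matches `expected`."""
    return sha256_of(path) == expected.lower()
```

For example, `verify_wheel("undoc-0.1.11-py3-none-any.whl")` should return True for an intact download of that file.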
