Native Python bindings for OfficeMD document extraction

officemd

Fast Office document extraction for LLMs and agents. Converts DOCX, XLSX, CSV, PPTX, and PDF into clean Markdown and structured JSON IR, and additionally converts DOCX, XLSX, and PPTX into Docling output.

Install

uv add officemd
# or
pip install officemd

To run the CLI without adding it to a project:

uvx officemd markdown report.docx

CLI

officemd markdown report.docx
officemd markdown budget.xlsx --sheets "Summary,Q1"
officemd render report.docx
officemd diff old.docx new.docx

SDK

from pathlib import Path
from officemd import extract_ir_json, markdown_from_bytes, docling_from_bytes

content = Path("report.docx").read_bytes()

# Markdown
print(markdown_from_bytes(content, format="docx"))

# Structured JSON IR
print(extract_ir_json(content, format="docx"))

# Docling JSON
print(docling_from_bytes(content, format="docx"))

Typed OOXML patching with reports

import officemd
from pathlib import Path

content = Path("report.docx").read_bytes()
patch = officemd.DocxPatch(
    scoped_replacements=[
        officemd.ScopedDocxReplace(
            officemd.DocxTextScope.ALL_TEXT,
            officemd.TextReplace("word", "term"),
        )
    ]
)
# ALL_TEXT includes document content plus free-text metadata/app/custom fields.

single = officemd.patch_docx_with_report(content, patch)
print(single.report.replacements_applied)

batch = officemd.patch_docx_batch_with_report([content, content], patch, workers=4)
for item in batch:
    print(item.report.parts_scanned, item.report.parts_modified, item.report.replacements_applied)
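
The per-item reports from a batch can be rolled up into one summary. The following is a minimal sketch, not part of officemd: `PatchTotals` and `summarize_reports` are hypothetical names, and it assumes each batch item exposes a `report` whose `parts_scanned`, `parts_modified`, and `replacements_applied` attributes are integers, as the loop above suggests. Stand-in objects are used so the sketch runs without officemd installed.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Iterable


@dataclass
class PatchTotals:
    """Hypothetical roll-up of per-document patch reports."""
    documents: int = 0
    documents_changed: int = 0
    parts_scanned: int = 0
    parts_modified: int = 0
    replacements_applied: int = 0


def summarize_reports(batch: Iterable[Any]) -> PatchTotals:
    """Sum the three report counters over every item in a batch.

    Assumes each item has a `.report` with integer counters.
    """
    totals = PatchTotals()
    for item in batch:
        report = item.report
        totals.documents += 1
        if report.replacements_applied > 0:
            totals.documents_changed += 1
        totals.parts_scanned += report.parts_scanned
        totals.parts_modified += report.parts_modified
        totals.replacements_applied += report.replacements_applied
    return totals


# Stand-ins for batch items; with officemd installed, the result of
# patch_docx_batch_with_report(...) would be passed instead.
fake_batch = [
    SimpleNamespace(report=SimpleNamespace(
        parts_scanned=5, parts_modified=2, replacements_applied=7)),
    SimpleNamespace(report=SimpleNamespace(
        parts_scanned=5, parts_modified=0, replacements_applied=0)),
]
totals = summarize_reports(fake_batch)
print(totals)
```

A summary like this makes it easy to spot documents in which the patch matched nothing.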

Additional patch scopes are available for free-text metadata/comment fields:

  • DocxTextScope.METADATA_CORE, METADATA_APP, METADATA_CUSTOM, METADATA_ALL
  • PptxTextScope.COMMENT_AUTHORS, METADATA_CORE, METADATA_APP, METADATA_CUSTOM, METADATA_ALL
  • XlsxTextScope.COMMENTS, COMMENT_AUTHORS, METADATA_CORE, METADATA_APP, METADATA_CUSTOM, METADATA_ALL

ALL_TEXT now means all free-text fields, i.e. document content plus metadata/comment-author text where applicable.

Formatting-preserving replacement is available for OOXML content text:

patch = officemd.DocxPatch(
    scoped_replacements=[
        officemd.ScopedDocxReplace(
            officemd.DocxTextScope.BODY,
            officemd.TextReplace("Confidential", "", preserve_formatting=True),
        )
    ]
)

Semantics:

  • a match may span multiple runs
  • the first matched run's formatting wins
  • later consumed runs are left empty in v1
  • metadata/comment-author fields still use simple text replacement
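
The semantics above can be illustrated with a small pure-Python model. This is a sketch under stated assumptions, not officemd's implementation: `Run`, its `fmt` field, and `replace_preserving_formatting` are hypothetical names, and a run is reduced to a text string plus a formatting label.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Run:
    """Hypothetical stand-in for an OOXML text run."""
    text: str
    fmt: str  # formatting label, e.g. "bold"


def replace_preserving_formatting(runs: List[Run], old: str, new: str) -> List[Run]:
    """Replace `old` with `new` even when a match spans several runs.

    The replacement text lands in the first matched run (so that run's
    formatting wins); later runs lose only the characters the match
    consumed and stay in the list, possibly empty.
    """
    if not old:
        raise ValueError("old must be non-empty")
    texts = [r.text for r in runs]
    search_from = 0
    while True:
        joined = "".join(texts)
        start = joined.find(old, search_from)
        if start == -1:
            break
        end = start + len(old)
        offset = 0
        inserted = False
        for i, t in enumerate(texts):
            run_start, run_end = offset, offset + len(t)
            offset = run_end
            lo, hi = max(start, run_start), min(end, run_end)
            if lo >= hi:
                continue  # this run does not overlap the match
            before = t[: lo - run_start]
            after = t[hi - run_start:]
            if not inserted:
                texts[i] = before + new + after
                inserted = True
            else:
                texts[i] = before + after
        # Resume after the inserted text so `new` is never re-matched.
        search_from = start + len(new)
    return [Run(t, r.fmt) for t, r in zip(texts, runs)]


runs = [Run("This is Confi", "bold"), Run("dent", "italic"), Run("ial text", "plain")]
out = replace_preserving_formatting(runs, "Confidential", "Secret")
print([(r.text, r.fmt) for r in out])
```

In this model the middle run is fully consumed and remains as an empty run, while the last run keeps its own formatting for the text after the match.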

Supported Formats

Format      Extension  Markdown  JSON IR  Docling
Word        .docx      yes       yes      yes
Excel       .xlsx      yes       yes      yes
CSV         .csv       yes       yes      -
PowerPoint  .pptx      yes       yes      yes
PDF         .pdf       yes       yes      -
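
The support matrix above can be encoded as a lookup so that callers check an output type before attempting a conversion. This is a hypothetical helper, not part of officemd: the `SUPPORT` mapping, the output labels `"markdown"`, `"json_ir"`, and `"docling"`, and the `supports` function are names chosen for illustration.

```python
from pathlib import PurePath

# Transcription of the Supported Formats table; labels are illustrative.
ALL_OUTPUTS = frozenset({"markdown", "json_ir", "docling"})
NO_DOCLING = frozenset({"markdown", "json_ir"})

SUPPORT = {
    ".docx": ALL_OUTPUTS,
    ".xlsx": ALL_OUTPUTS,
    ".csv": NO_DOCLING,
    ".pptx": ALL_OUTPUTS,
    ".pdf": NO_DOCLING,
}


def supports(filename: str, output: str) -> bool:
    """Return True if the table lists `output` for this file's extension."""
    ext = PurePath(filename).suffix.lower()
    return output in SUPPORT.get(ext, frozenset())


print(supports("report.docx", "docling"))
print(supports("data.csv", "docling"))
```

Unknown extensions simply report no supported outputs, which keeps the check safe to call on arbitrary filenames.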

License

MIT

Download files

Source Distribution

  • officemd-0.1.6.tar.gz (1.4 MB): Source

Built Distributions

  • officemd-0.1.6-cp312-abi3-win_amd64.whl (2.8 MB): CPython 3.12+, Windows x86-64
  • officemd-0.1.6-cp312-abi3-manylinux_2_34_x86_64.whl (3.0 MB): CPython 3.12+, manylinux: glibc 2.34+ x86-64
  • officemd-0.1.6-cp312-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (2.8 MB): CPython 3.12+, manylinux: glibc 2.17+ ARM64
  • officemd-0.1.6-cp312-abi3-macosx_11_0_arm64.whl (2.6 MB): CPython 3.12+, macOS 11.0+ ARM64
  • officemd-0.1.6-cp312-abi3-macosx_10_12_x86_64.whl (2.9 MB): CPython 3.12+, macOS 10.12+ x86-64
