Extract text, images, footnotes, comments, headers/footers, and tracked changes from .docx files as JSON — wraps the docx-extractor Rust binary.

docx-extractor-cli

Python wrapper around the docx-extractor Rust CLI. Extracts text, headings, lists, tables, footnotes, comments (with anchors), tracked changes, headers/footers, and embedded images from a .docx file and returns structured JSON.

The wheel bundles the prebuilt Rust binary for your platform, so no Rust toolchain is needed and nothing beyond the wheel itself is downloaded at install time. This makes it usable inside restricted-egress environments like Claude Desktop's analysis sandbox, where downloading directly from GitHub Releases is blocked.

The PyPI distribution name is docx-extractor-cli, because the shorter name docx-extractor is taken on PyPI by an unrelated project. The Python import name is docx_extractor, and the console script is docx-extractor.
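
Because the distribution, import, and script names all differ, a quick check of which pieces are visible in the current environment can help. The sketch below uses only the standard library; the helper name installation_report is hypothetical and is not part of the package.

```python
import importlib.util
import shutil


def installation_report(import_name="docx_extractor", script_name="docx-extractor"):
    """Report whether the import name and the console script are visible.

    Hypothetical helper for illustration; the default names come from the
    README text above.
    """
    return {
        # find_spec returns None when a top-level module cannot be found.
        "module_importable": importlib.util.find_spec(import_name) is not None,
        # shutil.which returns the script's full path, or None if it is not on PATH.
        "script_on_path": shutil.which(script_name),
    }


if __name__ == "__main__":
    print(installation_report())
```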

Install

pip install docx-extractor-cli

Prebuilt wheels (tagged for CPython 3.11 in the 0.4.0 release) are published for:

  • Linux x86-64 (manylinux 2.28+ — Debian 11+, RHEL 8+, Ubuntu 20.04+)
  • macOS x86-64 (Intel)
  • macOS aarch64 (Apple Silicon)
  • Windows x86-64

For other targets, build the Rust binary from source.
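
A rough way to tell in advance whether one of the prebuilt wheels applies is to compare the running platform against the list above. This is a minimal sketch: the helper name has_prebuilt_wheel and the lookup table are illustrative assumptions, and the check ignores the glibc and Python-version requirements that pip also enforces.

```python
import platform

# (system, machine) pairs corresponding to the four targets listed above.
PREBUILT_TARGETS = {
    ("Linux", "x86_64"),
    ("Darwin", "x86_64"),
    ("Darwin", "arm64"),
    ("Windows", "x86_64"),
}


def has_prebuilt_wheel(system=None, machine=None):
    """Return True if the platform matches one of the listed wheel targets."""
    system = system or platform.system()
    machine = (machine or platform.machine()).lower()
    # Windows reports x86-64 as "AMD64"; normalise it to "x86_64".
    if machine == "amd64":
        machine = "x86_64"
    return (system, machine) in PREBUILT_TARGETS


if __name__ == "__main__":
    print("prebuilt wheel likely available:", has_prebuilt_wheel())
```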

Use from Python

import docx_extractor

doc = docx_extractor.extract("/path/to/file.docx")
print(doc["metadata"]["title"])
for section in doc["sections"]:
    print(section)

extract() signature:

docx_extractor.extract(
    path: str,
    *,
    pretty: bool = False,
    output: str | None = None,
    no_images: bool = False,
    max_image_bytes: int | None = None,
    timeout: float | None = None,
) -> dict | None
  • Returns the parsed JSON document as a dict, or None when output is given (the JSON is written to that path instead).
  • Raises docx_extractor.DocxExtractorError when the binary exits with a non-zero status; the exception carries the binary's stderr text.
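
The return and error behaviour described above might be handled as in the following sketch. The wrapper name extract_text_only is hypothetical. It assumes that timeout is expressed in seconds and that printing the exception surfaces the stderr text; neither detail is confirmed here.

```python
def extract_text_only(path, timeout=60.0):
    """Return the parsed document dict, or None if the binary reports a failure.

    Hypothetical convenience wrapper, not part of the package.
    """
    # Imported lazily so this helper can be defined even where the wheel is absent.
    import docx_extractor

    try:
        # no_images and timeout are keyword-only parameters of extract().
        return docx_extractor.extract(path, no_images=True, timeout=timeout)
    except docx_extractor.DocxExtractorError as exc:
        # The exception carries the binary's stderr text (assumed to appear in str(exc)).
        print(f"extraction failed: {exc}")
        return None
```

Calling extract_text_only("/path/to/file.docx") then yields either the dict or None, so callers can branch without their own try/except.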

Use from the shell

The wheel installs a docx-extractor console script that is a thin pass-through to the bundled binary, so it exposes the same CLI as the Rust release:

docx-extractor /path/to/file.docx --pretty
docx-extractor /path/to/file.docx --output document.json --no-images
docx-extractor /path/to/file.docx --max-image-bytes 5242880

Flags:

Flag                           Description
--pretty, -p                   Pretty-print JSON.
--output <path>, -o <path>     Write to a file instead of stdout.
--no-images                    Skip base64 image bytes (per-section images references are preserved).
--max-image-bytes <n>          Per-image size cap (default: 10 MiB).
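
To make the relationship between these flags and the extract() keyword arguments concrete, here is a sketch of how a wrapper could translate one into the other. build_argv is a hypothetical helper written for illustration; it is not the package's actual implementation, and only the flag names are taken from the table above.

```python
def build_argv(path, *, pretty=False, output=None, no_images=False, max_image_bytes=None):
    """Translate extract()-style keyword arguments into a CLI argument list."""
    argv = ["docx-extractor", path]
    if pretty:
        argv.append("--pretty")
    if output is not None:
        argv += ["--output", output]
    if no_images:
        argv.append("--no-images")
    if max_image_bytes is not None:
        # The cap is passed as a plain integer number of bytes.
        argv += ["--max-image-bytes", str(max_image_bytes)]
    return argv


if __name__ == "__main__":
    # 5 MiB expressed in bytes, matching the 5242880 used in the shell example.
    print(build_argv("/path/to/file.docx", max_image_bytes=5 * 1024 * 1024))
```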

Use inside Claude Desktop's analysis sandbox

This is the primary motivation for the wheel. When a user uploads a .docx to a Claude Desktop chat, the file lands at /mnt/user-data/uploads/... inside a Linux sandbox where GitHub Release downloads are blocked but PyPI is allowlisted.

pip install docx-extractor-cli
docx-extractor /mnt/user-data/uploads/foo.docx --no-images --output /tmp/doc.json

Then parse /tmp/doc.json in Python. --no-images is strongly recommended for chat workflows — base64 image bytes dominate token cost. Opt back in only when the user explicitly asks about images.
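
Parsing the written file needs only the standard library. The sketch below relies on the two top-level keys used in the Python example earlier ("metadata" with a "title", and "sections"); the helper name summarize_document is hypothetical, and any other part of the schema is deliberately left untouched.

```python
import json
from pathlib import Path


def summarize_document(json_path):
    """Load the extractor's JSON output and return a small summary dict."""
    doc = json.loads(Path(json_path).read_text(encoding="utf-8"))
    # Guard against a missing or null metadata object.
    metadata = doc.get("metadata") or {}
    sections = doc.get("sections") or []
    return {"title": metadata.get("title"), "section_count": len(sections)}


if __name__ == "__main__":
    print(summarize_document("/tmp/doc.json"))
```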

JSON schema

See the main project README for the full schema.

macOS Gatekeeper

Unsigned binaries delivered via pip install run fine as subprocess invocations (no GUI launch, no quarantine prompt). If you ever hit a Gatekeeper warning when invoking the binary directly, clear the quarantine attribute:

xattr -dr com.apple.quarantine "$(python -c 'import docx_extractor._binary as b; print(b.path())')"

Versioning

The PyPI package version mirrors the Rust binary version exactly. Installing docx-extractor-cli==0.4.0 ships the v0.4.0 Rust binary.
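
Since the package version doubles as the binary version, checking the installed distribution is enough to know which binary is in use. A minimal sketch using importlib.metadata from the standard library follows; the helper names installed_version and is_pinned are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="docx-extractor-cli"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def is_pinned(expected, dist_name="docx-extractor-cli"):
    """Return True only if the installed version equals the expected pin."""
    return installed_version(dist_name) == expected


if __name__ == "__main__":
    print("installed:", installed_version(), "| pinned to 0.4.0:", is_pinned("0.4.0"))
```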

License

MIT — see the main repo.

Download files

Source Distribution

  • docx_extractor_cli-0.4.0.tar.gz (9.0 kB), source

Built Distributions

  • docx_extractor_cli-0.4.0-cp311-cp311-win_amd64.whl (534.5 kB), CPython 3.11, Windows x86-64
  • docx_extractor_cli-0.4.0-cp311-cp311-manylinux_2_28_x86_64.whl (638.8 kB), CPython 3.11, manylinux: glibc 2.28+ x86-64
  • docx_extractor_cli-0.4.0-cp311-cp311-macosx_11_0_x86_64.whl (567.6 kB), CPython 3.11, macOS 11.0+ x86-64
  • docx_extractor_cli-0.4.0-cp311-cp311-macosx_11_0_arm64.whl (512.6 kB), CPython 3.11, macOS 11.0+ ARM64

File details

Details for the file docx_extractor_cli-0.4.0.tar.gz.

File metadata

  • Size: 9.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for docx_extractor_cli-0.4.0.tar.gz
Algorithm Hash digest
SHA256 8c25c611665ccfef8e887e53a56aaca6ad4c8af4a7b80004335e364364dc6a62
MD5 1007027cd7baded091b219714736253f
BLAKE2b-256 c57f30f59491eebb5b3de4c0f5819534fed6b60e02c28f200d6c84a0ce89cd23
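
A downloaded file can be checked against the published SHA256 digest with hashlib. The sketch below is generic; the helper names sha256_of and matches_published are hypothetical, and the expected digest is whatever value is listed for the file being verified.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare a file's digest with a published hex digest, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, expected_hex would be the SHA256 value shown in the table above.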

Provenance

The following attestation bundles were made for docx_extractor_cli-0.4.0.tar.gz:

Publisher: release.yml on Maks417/docx-extractor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file docx_extractor_cli-0.4.0-cp311-cp311-win_amd64.whl.

File hashes

Hashes for docx_extractor_cli-0.4.0-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 caa777a6cac03ef37c60c38dda9563d31dfb9a7a08bff3dca14a533b49cde34c
MD5 f0194fdfda3411fe7588c4ec1f829ba6
BLAKE2b-256 148b60a8e74e7edbb9129a403ebfcf5b402623fedbb241f4eda73b12b5ed4619

Provenance

The following attestation bundles were made for docx_extractor_cli-0.4.0-cp311-cp311-win_amd64.whl:

Publisher: release.yml on Maks417/docx-extractor

File details

Details for the file docx_extractor_cli-0.4.0-cp311-cp311-manylinux_2_28_x86_64.whl.

File hashes

Hashes for docx_extractor_cli-0.4.0-cp311-cp311-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 d83a60f40628379109485ca09b1ade29e6399c8fb77047f91f6d47585145d7c0
MD5 d5994d62aab627335c1170d6fa262092
BLAKE2b-256 cc3c938bda4f2fcb855f1787f7bd5b26617242b5717fe7267892e9afce924304

Provenance

The following attestation bundles were made for docx_extractor_cli-0.4.0-cp311-cp311-manylinux_2_28_x86_64.whl:

Publisher: release.yml on Maks417/docx-extractor

File details

Details for the file docx_extractor_cli-0.4.0-cp311-cp311-macosx_11_0_x86_64.whl.

File hashes

Hashes for docx_extractor_cli-0.4.0-cp311-cp311-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 ddb33fc5bdeaea55e71e35560ead86a205a4d442f2efa1e3bf191edafa127294
MD5 c838b12d7b633c06c1106523586bd72f
BLAKE2b-256 ce75063054a21b55e8aee950066537fa67bc55c697883770d391e9e124a7b49d

Provenance

The following attestation bundles were made for docx_extractor_cli-0.4.0-cp311-cp311-macosx_11_0_x86_64.whl:

Publisher: release.yml on Maks417/docx-extractor

File details

Details for the file docx_extractor_cli-0.4.0-cp311-cp311-macosx_11_0_arm64.whl.

File hashes

Hashes for docx_extractor_cli-0.4.0-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 ab51cf3f14e37c5edeba22c32bd8e2fcd08790f2e69fac2e6bb019391cb834c8
MD5 68ada2aa1eb84f6b3f3525da0f855be3
BLAKE2b-256 04420d489f107c6529085518d650a38c45cf89e220a7dacc12e15790dd6f49bc

Provenance

The following attestation bundles were made for docx_extractor_cli-0.4.0-cp311-cp311-macosx_11_0_arm64.whl:

Publisher: release.yml on Maks417/docx-extractor
