Extract text, images, footnotes, comments, headers/footers, and tracked changes from .docx files as JSON — wraps the docx-extractor Rust binary.

docx-extractor-cli

Python wrapper around the docx-extractor Rust CLI. Extracts text, headings, lists, tables, footnotes, comments (with anchors), tracked changes, headers/footers, and embedded images from a .docx file and returns structured JSON.

The wheel bundles the prebuilt Rust binary for your platform — no Rust toolchain needed, no network access required at install time. This makes it usable inside restricted-egress environments like Claude Desktop's analysis sandbox where downloading directly from GitHub Releases is blocked.

The PyPI distribution name is docx-extractor-cli (the shorter name docx-extractor is taken on PyPI by an unrelated project). The Python import name is docx_extractor, and the console script is docx-extractor.

Install

pip install docx-extractor-cli

Prebuilt wheels are published for:

  • Linux x86-64 (manylinux 2.28+ — Debian 11+, RHEL 8+, Ubuntu 20.04+)
  • macOS x86-64 (Intel)
  • macOS aarch64 (Apple Silicon)
  • Windows x86-64

For other targets, build the Rust binary from source.

Use from Python

import docx_extractor

doc = docx_extractor.extract("/path/to/file.docx")
print(doc["metadata"]["title"])
for section in doc["sections"]:
    print(section)

extract() signature:

docx_extractor.extract(
    path: str,
    *,
    pretty: bool = False,
    output: str | None = None,
    no_images: bool = False,
    max_image_bytes: int | None = None,
    timeout: float | None = None,
) -> dict | None

  • Returns the parsed JSON document as a dict, or None when output is given (the JSON is written to that path instead).
  • Raises docx_extractor.DocxExtractorError when the binary exits with a non-zero status; the exception carries the binary's stderr text.
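
A minimal sketch of handling a failed extraction, relying only on the behaviour described above: extract() returns a dict and raises docx_extractor.DocxExtractorError on a non-zero exit. The helper name extract_or_none is hypothetical and not part of the package.

```python
import sys


def extract_or_none(path, **kwargs):
    """Return the parsed document, or None if the binary reports an error.

    Hypothetical helper for illustration; not part of docx_extractor.
    """
    # Imported lazily so this module can be loaded before the wheel is installed.
    import docx_extractor

    try:
        return docx_extractor.extract(path, **kwargs)
    except docx_extractor.DocxExtractorError as err:
        # Per the description above, the exception carries the binary's stderr text.
        print(f"extraction failed for {path}: {err}", file=sys.stderr)
        return None
```

A call such as extract_or_none("/path/to/file.docx", no_images=True) then yields either the document dict or None, which keeps batch loops over many files from aborting on one bad input.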

Use from the shell

The wheel installs a docx-extractor console script that is a thin pass-through to the bundled binary, so it exposes the same CLI as the Rust release:

docx-extractor /path/to/file.docx --pretty
docx-extractor /path/to/file.docx --output document.json --no-images
docx-extractor /path/to/file.docx --max-image-bytes 5242880

Flags:

Flag                          Description
--pretty / -p                 Pretty-print JSON.
--output <path> / -o <path>   Write to a file instead of stdout.
--no-images                   Skip base64 image bytes (per-section images references are preserved).
--max-image-bytes <n>         Per-image size cap (default: 10 MiB).
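
The flags line up with the keyword arguments of extract(). The sketch below shows one plausible mapping from those keyword arguments to a command line; build_argv is a hypothetical helper, and the one-to-one mapping is an assumption about how such a wrapper could work, not a description of the package's internals.

```python
def build_argv(path, *, pretty=False, output=None, no_images=False, max_image_bytes=None):
    """Build a docx-extractor command line from extract()-style keyword arguments.

    Hypothetical helper; uses only the flags listed in the table above.
    """
    argv = ["docx-extractor", path]
    if pretty:
        argv.append("--pretty")
    if output is not None:
        argv += ["--output", output]
    if no_images:
        argv.append("--no-images")
    if max_image_bytes is not None:
        # The cap is given in bytes, so 5 MiB is 5 * 1024 * 1024.
        argv += ["--max-image-bytes", str(max_image_bytes)]
    return argv


FIVE_MIB = 5 * 1024 * 1024  # 5242880, the value used in the shell example
```

The resulting list can be passed to subprocess.run(argv, check=True), which avoids shell quoting problems with paths that contain spaces.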

Use inside Claude Desktop's analysis sandbox

This is the primary motivation for the wheel. When a user uploads a .docx to a Claude Desktop chat, the file lands at /mnt/user-data/uploads/... inside a Linux sandbox where GitHub Release downloads are blocked but PyPI is allowlisted.

pip install docx-extractor-cli
docx-extractor /mnt/user-data/uploads/foo.docx --no-images --output /tmp/doc.json

Then parse /tmp/doc.json in Python. --no-images is strongly recommended for chat workflows — base64 image bytes dominate token cost. Opt back in only when the user explicitly asks about images.
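
Parsing the output file could look like the sketch below. It touches only the metadata.title and sections keys that appear in the Python example earlier, and it assumes sections is a JSON array; summarize and load_summary are hypothetical names, and the full schema is documented in the main project README.

```python
import json


def summarize(doc):
    """Return the title and section count of a parsed document dict."""
    # .get() keeps the sketch tolerant of missing keys; the real schema may
    # guarantee these fields.
    title = doc.get("metadata", {}).get("title")
    sections = doc.get("sections", [])
    return {"title": title, "section_count": len(sections)}


def load_summary(json_path="/tmp/doc.json"):
    """Load the JSON written by --output and summarize it."""
    with open(json_path, encoding="utf-8") as fh:
        return summarize(json.load(fh))
```

Summarizing first and reading individual sections on demand keeps the amount of text pulled into a chat context small.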

JSON schema

See the main project README for the full schema.

macOS Gatekeeper

Unsigned binaries delivered via pip install run fine as subprocess invocations (no GUI launch, no quarantine prompt). If you ever hit a Gatekeeper warning when invoking the binary directly:

xattr -dr com.apple.quarantine "$(python -c 'import docx_extractor._binary as b; print(b.path())')"

Versioning

The PyPI package version mirrors the Rust binary version exactly. Installing docx-extractor-cli==0.4.0 ships the v0.4.0 Rust binary.
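
Because the package version mirrors the binary version, checking the installed distribution is enough to know which binary is bundled. A small sketch using the standard library; installed_version is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="docx-extractor-cli"):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None
```

Note that the lookup uses the PyPI distribution name (docx-extractor-cli), not the import name (docx_extractor).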

License

MIT — see the main repo.

Download files

Source Distribution

docx_extractor_cli-0.4.1.tar.gz (9.5 kB)

Uploaded: Source

Built Distributions

docx_extractor_cli-0.4.1-py3-none-win_amd64.whl (534.6 kB)

Uploaded: Python 3, Windows x86-64

docx_extractor_cli-0.4.1-py3-none-manylinux_2_28_x86_64.whl (638.7 kB)

Uploaded: Python 3, manylinux: glibc 2.28+ x86-64

docx_extractor_cli-0.4.1-py3-none-macosx_11_0_x86_64.whl (567.8 kB)

Uploaded: Python 3, macOS 11.0+ x86-64

docx_extractor_cli-0.4.1-py3-none-macosx_11_0_arm64.whl (512.6 kB)

Uploaded: Python 3, macOS 11.0+ ARM64

File details

Details for the file docx_extractor_cli-0.4.1.tar.gz.

File metadata

  • Download URL: docx_extractor_cli-0.4.1.tar.gz
  • Size: 9.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for docx_extractor_cli-0.4.1.tar.gz
Algorithm Hash digest
SHA256 5b7500dd57baac17127209f5ca21df16df3b591d265c51f3b25af1a6ca682e2c
MD5 3672165075be2704368c0df0f4dd1dfa
BLAKE2b-256 60bb9c12d86df9db19eed7ae7116c3ee56252bc8fc2c241a270d680dcda5129f
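
A downloaded file can be checked against the published SHA256 digest before installing it. The sketch below uses the standard hashlib module; the helper names are hypothetical, and the expected value is the SHA256 digest listed above for the source distribution.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for docx_extractor_cli-0.4.1.tar.gz
EXPECTED_SDIST_SHA256 = "5b7500dd57baac17127209f5ca21df16df3b591d265c51f3b25af1a6ca682e2c"


def sha256_hex(data: bytes) -> str:
    """Return the SHA256 digest of data as lowercase hex."""
    return hashlib.sha256(data).hexdigest()


def file_matches(path, expected_sha256: str) -> bool:
    """Check a file on disk against an expected SHA256 hex digest."""
    # The distributions here are well under 1 MB, so reading the file whole is fine.
    return sha256_hex(Path(path).read_bytes()) == expected_sha256.strip().lower()
```

For example, file_matches("docx_extractor_cli-0.4.1.tar.gz", EXPECTED_SDIST_SHA256) should return True for an intact download. pip can enforce the same check through a requirements file with --require-hashes.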

Provenance

The following attestation bundles were made for docx_extractor_cli-0.4.1.tar.gz:

Publisher: release.yml on Maks417/docx-extractor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file docx_extractor_cli-0.4.1-py3-none-win_amd64.whl.

File hashes

Hashes for docx_extractor_cli-0.4.1-py3-none-win_amd64.whl
Algorithm Hash digest
SHA256 bfc431d340e052151eab2ea9884e77f4945a77bdb576231efffd288156998784
MD5 cd9c2093c3321ad6927cc49ac16c1b90
BLAKE2b-256 a3d87b135d17618ab7caa40940303f83da07693c2a603bf16248284ab7564eb0

Provenance

The following attestation bundles were made for docx_extractor_cli-0.4.1-py3-none-win_amd64.whl:

Publisher: release.yml on Maks417/docx-extractor

File details

Details for the file docx_extractor_cli-0.4.1-py3-none-manylinux_2_28_x86_64.whl.

File hashes

Hashes for docx_extractor_cli-0.4.1-py3-none-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 42b6adcd30d13da59111dc89355d5dbcf5c80fc2702f6377fe32ca124ce2ea56
MD5 1dec2f6b04ff28942ac295160c8d3814
BLAKE2b-256 9c59d84fed1f94334dc119a047d1f144f3bb46294d1724016707a132ce47334c

Provenance

The following attestation bundles were made for docx_extractor_cli-0.4.1-py3-none-manylinux_2_28_x86_64.whl:

Publisher: release.yml on Maks417/docx-extractor

File details

Details for the file docx_extractor_cli-0.4.1-py3-none-macosx_11_0_x86_64.whl.

File hashes

Hashes for docx_extractor_cli-0.4.1-py3-none-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 c01b6aa498fd858738fab23daaa2f677b3c22482c2c98c9bc927bcc3504e58d4
MD5 5bdd8f4696a466724cf747407e526ce1
BLAKE2b-256 70b0206769dabe8864270e8d2f95c29a273fd82872b88e62e35d1e0d70ae8f60

Provenance

The following attestation bundles were made for docx_extractor_cli-0.4.1-py3-none-macosx_11_0_x86_64.whl:

Publisher: release.yml on Maks417/docx-extractor

File details

Details for the file docx_extractor_cli-0.4.1-py3-none-macosx_11_0_arm64.whl.

File hashes

Hashes for docx_extractor_cli-0.4.1-py3-none-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 6f756baf25d44dbc17f6a2879967834de12c9cf9dd5f68cc491e9eff940a9af4
MD5 81ba93cf8d62dae74619750b3ea393c2
BLAKE2b-256 1ea2dbe5b8b2ff3b1d1141d28a42856eb4eaabf2636fb5aad3b644db95f30bd1

Provenance

The following attestation bundles were made for docx_extractor_cli-0.4.1-py3-none-macosx_11_0_arm64.whl:

Publisher: release.yml on Maks417/docx-extractor
