Extract text, images, footnotes, comments, headers/footers, and tracked changes from .docx files as JSON — wraps the docx-extractor Rust binary.
docx-extractor-cli
Python wrapper around the docx-extractor Rust CLI. Extracts text, headings, lists, tables, footnotes, comments (with anchors), tracked changes, headers/footers, and embedded images from a .docx file and returns structured JSON.
The wheel bundles the prebuilt Rust binary for your platform — no Rust toolchain needed, no network access required at install time. This makes it usable inside restricted-egress environments like Claude Desktop's analysis sandbox where downloading directly from GitHub Releases is blocked.
The PyPI distribution name is docx-extractor-cli (the shorter name docx-extractor is taken on PyPI by an unrelated project). The Python import name is docx_extractor, and the console script is docx-extractor.
Install
pip install docx-extractor-cli
Prebuilt wheels are published for:
- Linux x86-64 (manylinux 2.28+ — Debian 11+, RHEL 8+, Ubuntu 20.04+)
- macOS x86-64 (Intel)
- macOS aarch64 (Apple Silicon)
- Windows x86-64
For other targets, build the Rust binary from source.
Use from Python
import docx_extractor
doc = docx_extractor.extract("/path/to/file.docx")
print(doc["metadata"]["title"])
for section in doc["sections"]:
    print(section)
extract() signature:
docx_extractor.extract(
    path: str,
    *,
    pretty: bool = False,
    output: str | None = None,
    no_images: bool = False,
    max_image_bytes: int | None = None,
    timeout: float | None = None,
) -> dict | None
- Returns the parsed JSON document as a dict (or None when output is given; the JSON is written to that path instead).
- Raises docx_extractor.DocxExtractorError on non-zero exit, carrying the binary's stderr text.
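The behaviour described above (run the binary, parse its stdout as JSON, raise with the stderr text on a non-zero exit) follows a common subprocess pattern. The sketch below illustrates that pattern only; it is not the package's actual implementation. `CommandError` and `run_json_command` are hypothetical names, and the Python interpreter stands in for the extractor binary so the example runs without the wheel installed. In this sketch a timeout surfaces as `subprocess.TimeoutExpired`; how the real wrapper reports timeouts is not stated here.

```python
import json
import subprocess
import sys


class CommandError(RuntimeError):
    """Hypothetical stand-in for docx_extractor.DocxExtractorError."""


def run_json_command(argv, timeout=None):
    """Run a command and return its stdout parsed as JSON, or raise CommandError."""
    # capture_output collects stdout and stderr; text=True decodes them to str.
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    if proc.returncode != 0:
        # Carry the child's stderr so the caller can see why it failed.
        raise CommandError(proc.stderr.strip() or f"exit code {proc.returncode}")
    return json.loads(proc.stdout)


if __name__ == "__main__":
    # The interpreter plays the role of the extractor binary in this demo.
    print(run_json_command([sys.executable, "-c", "print('{\"ok\": true}')"]))
```

With the real package, the equivalent calling code would catch docx_extractor.DocxExtractorError around docx_extractor.extract(...).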
Use from the shell
The wheel installs a docx-extractor console script that's a thin pass-through to the bundled binary. Same CLI as the Rust release:
docx-extractor /path/to/file.docx --pretty
docx-extractor /path/to/file.docx --output document.json --no-images
docx-extractor /path/to/file.docx --max-image-bytes 5242880
Flags:
| Flag | Description |
|---|---|
| --pretty / -p | Pretty-print JSON. |
| --output <path> / -o <path> | Write to a file instead of stdout. |
| --no-images | Skip base64 image bytes (per-section images references are preserved). |
| --max-image-bytes <n> | Per-image size cap (default: 10 MiB). |
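The keyword arguments of extract() line up with these flags. As a rough illustration of that correspondence, the sketch below builds a command line from extract()-style options. `build_argv` is a hypothetical helper, not part of the package, and it assumes the wrapper passes each option through as the flag of the same name.

```python
def build_argv(path, *, pretty=False, output=None, no_images=False, max_image_bytes=None):
    """Translate extract()-style keyword arguments into a docx-extractor command line."""
    argv = ["docx-extractor", path]
    if pretty:
        argv.append("--pretty")
    if output is not None:
        argv += ["--output", output]
    if no_images:
        argv.append("--no-images")
    if max_image_bytes is not None:
        # The CLI takes a byte count, e.g. 5 * 1024 * 1024 for a 5 MiB cap.
        argv += ["--max-image-bytes", str(max_image_bytes)]
    return argv


if __name__ == "__main__":
    print(build_argv("/path/to/file.docx", output="document.json", no_images=True))
```

Keeping the arguments as a list (rather than a single shell string) means paths containing spaces need no quoting when the list is handed to subprocess.run.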
Use inside Claude Desktop's analysis sandbox
This is the primary motivation for the wheel. When a user uploads a .docx to a Claude Desktop chat, the file lands at /mnt/user-data/uploads/... inside a Linux sandbox where GitHub Release downloads are blocked but PyPI is allowlisted.
pip install docx-extractor-cli
docx-extractor /mnt/user-data/uploads/foo.docx --no-images --output /tmp/doc.json
Then parse /tmp/doc.json in Python. --no-images is strongly recommended for chat workflows — base64 image bytes dominate token cost. Opt back in only when the user explicitly asks about images.
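A minimal sketch of the parsing step, assuming only the two keys used in the Python example earlier ("metadata" with a "title", and a "sections" list). `summarize` is a hypothetical helper name; see the full schema for the other fields.

```python
import json
from pathlib import Path


def summarize(json_path):
    """Return (title, section_count) for a document written with --output."""
    doc = json.loads(Path(json_path).read_text(encoding="utf-8"))
    # "metadata"/"title" and "sections" are the keys shown in the README example;
    # .get() keeps the sketch tolerant of documents that lack them.
    title = doc.get("metadata", {}).get("title")
    sections = doc.get("sections", [])
    return title, len(sections)
```

For the command above, the call would be summarize("/tmp/doc.json").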
JSON schema
See the main project README for the full schema.
macOS Gatekeeper
Unsigned binaries delivered via pip install run fine as subprocess invocations (no GUI launch, no quarantine prompt). If you ever hit a Gatekeeper warning when invoking the binary directly:
xattr -dr com.apple.quarantine "$(python -c 'import docx_extractor._binary as b; print(b.path())')"
Versioning
The PyPI package version mirrors the Rust binary version exactly. Installing docx-extractor-cli==0.4.0 ships the v0.4.0 Rust binary.
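Because the two versions move together, checking the installed distribution version is one way to learn which binary version is bundled. The sketch below uses the standard library's importlib.metadata; `installed_version` is a hypothetical helper name, and it returns None when the distribution is not installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="docx-extractor-cli"):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_version())
```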
License
MIT — see the main repo.