Shrink and sanitize Word (.docx) documents by converting Visio embeddings, compressing images, and stripping metadata
docx-shrinker
Shrink and sanitize Word (.docx) documents. Converts embedded Visio diagrams to raster images, compresses oversized media, deduplicates files, and strips metadata, comments, tracked changes, macros, and other cruft.
What it does
- Convert Visio embeddings: `.vsdx` → PDF (via Visio COM) → JPG/PNG (via PyMuPDF). Falls back to keeping the EMF preview when Visio is unavailable.
- Convert OLE objects: Replaces legacy VML `<w:object>` blocks with modern DrawingML `<w:drawing>` inline pictures.
- Compress images: Resizes raster images exceeding a pixel width threshold and re-compresses JPGs.
- Deduplicate media: Identifies identical files by hash and rewrites relationships to point to a single copy.
- Strip personal info: Removes author, last modified by, company, manager, keywords, and other document properties.
- Remove comments and tracked changes: Deletes comment files and accepts all revisions inline.
- Strip bookmarks: Removes auto-generated bookmarks (`_GoBack`, empty).
- Remove garbage parts: Thumbnail, VBA macros, printer settings, ActiveX controls, custom XML data.
- Clean up: Updates `[Content_Types].xml` and `.rels` files to reflect removed parts.
- Validate output: Checks ZIP integrity and presence of `[Content_Types].xml` before finalizing.
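The deduplication step can be illustrated with a minimal sketch. This is not the package's actual code: the function names `find_duplicates` and `rewrite_targets`, the choice of SHA-256, the `media/` target prefix, and the plain string replacement are all assumptions made for illustration. A real implementation would more likely parse the `.rels` XML.

```python
import hashlib
import tempfile
from pathlib import Path


def find_duplicates(media_dir: Path) -> dict[str, str]:
    """Map each duplicate file name to the first file with identical bytes."""
    first_seen: dict[str, str] = {}   # digest -> canonical file name
    duplicates: dict[str, str] = {}   # duplicate name -> canonical name
    for path in sorted(media_dir.iterdir()):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in first_seen:
            duplicates[path.name] = first_seen[digest]
        else:
            first_seen[digest] = path.name
    return duplicates


def rewrite_targets(rels_xml: str, duplicates: dict[str, str]) -> str:
    """Point relationship targets at the canonical copy (illustrative string replace)."""
    for dup, keep in duplicates.items():
        rels_xml = rels_xml.replace(f'Target="media/{dup}"', f'Target="media/{keep}"')
    return rels_xml


# Demo on throwaway files: image2.png has the same bytes as image1.png.
with tempfile.TemporaryDirectory() as tmp:
    media = Path(tmp)
    (media / "image1.png").write_bytes(b"same bytes")
    (media / "image2.png").write_bytes(b"same bytes")
    (media / "image3.png").write_bytes(b"other bytes")
    duplicates = find_duplicates(media)

new_rels = rewrite_targets('<Relationship Id="rId5" Target="media/image2.png"/>', duplicates)
print(duplicates)
print(new_rels)
```

Once every relationship points at the canonical copy, the duplicate files themselves can be deleted from the extracted tree.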
Requirements
- Python 3.10+
- PyMuPDF (`pymupdf`): image compression and PDF-to-image rendering
- pywin32: Visio COM automation (Windows only; Visio conversion is skipped if unavailable)
- Microsoft Visio (optional): required only for converting embedded `.vsdx` to high-quality images
Installation
pip install docx-shrinker
Or with uv:
uv tool install docx-shrinker
Usage
Command line
docx-shrinker report.docx
This produces `report (shrunk).docx` in the same directory.
Specify an output path:
docx-shrinker report.docx output.docx
Options
| Flag | Default | Description |
|---|---|---|
| `--format {jpg,png}` | `jpg` | Image format for converted Visio figures |
| `--dpi N` | `300` | Rasterization DPI for Visio conversion |
| `--quality N` | `95` | JPG quality (1–100). Ignored for PNG. |
| `--max-width N` | `2000` | Max pixel width for raster images. 0 to disable. |
| `-i`, `--interactive` | off | After conversion, show top 5 largest images and offer to re-convert at different quality |
| `--version` | | Show version and exit |
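To make the `--max-width` behaviour concrete, here is a small sketch of the arithmetic a width threshold implies. The helper name `scaled_size` is hypothetical, and preserving the aspect ratio is an assumption; the tool's actual resizing code may differ.

```python
def scaled_size(width: int, height: int, max_width: int) -> tuple[int, int]:
    """Return the target (width, height) for an image under a max-width limit."""
    if max_width <= 0 or width <= max_width:
        # 0 disables resizing; images already within the limit are left alone.
        return width, height
    scale = max_width / width
    # Scale the height by the same factor so the aspect ratio is kept (assumed).
    return max_width, max(1, round(height * scale))


print(scaled_size(4000, 3000, 2000))  # (2000, 1500)
print(scaled_size(1600, 900, 2000))   # (1600, 900)
print(scaled_size(4000, 3000, 0))     # (4000, 3000)
```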
Examples
Convert Visio figures to PNG at 150 DPI:
docx-shrinker report.docx --format png --dpi 150
Aggressive compression (lower quality, smaller max width):
docx-shrinker report.docx --quality 80 --max-width 1200
Interactive mode to fine-tune large images:
docx-shrinker report.docx -i
Python API
from docx_shrinker import shrink_docx
result = shrink_docx("input.docx", "output.docx", fmt="jpg", dpi=300, quality=95)
print(f"{result['original_size_mb']} MB -> {result['new_size_mb']} MB")
print(f"Reduction: {result['reduction_percent']}%")
The result dict contains:
| Key | Type | Description |
|---|---|---|
| `original_size_mb` | float | Original file size |
| `new_size_mb` | float | Output file size |
| `reduction_mb` | float | Size saved |
| `reduction_percent` | float | Percentage reduction |
| `output_path` | str | Path to the output file |
| `visio_converted` | list | `(name, size_kb)` tuples for each converted Visio diagram |
| `visio_removed` | int | Number of `.vsdx` embeddings removed |
| `images_compressed` | list | `(filename, old_kb, new_kb)` tuples |
| `duplicates_removed` | int | Number of duplicate media files removed |
| `comments_removed` | int | Number of comment files removed |
| `bookmarks_removed` | int | Number of bookmarks removed |
| `garbage_removed` | list | Names of removed garbage parts |
| `warnings` | list | Warning messages (e.g., Visio unavailable) |
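As an illustration of consuming that dict, the sketch below turns a result into a short text report. The `summarize` helper is hypothetical and not part of the package, and the `sample` values are made up to match the documented keys; in real use the dict would come from `shrink_docx(...)`.

```python
def summarize(result: dict) -> str:
    """Build a short text report from a result dict with the documented keys."""
    lines = [
        f"{result['original_size_mb']} MB -> {result['new_size_mb']} MB "
        f"({result['reduction_percent']}% smaller)",
        f"Visio diagrams converted: {len(result['visio_converted'])}",
        f"Images compressed: {len(result['images_compressed'])}",
    ]
    # Each entry is a (filename, old_kb, new_kb) tuple.
    for name, old_kb, new_kb in result["images_compressed"]:
        lines.append(f"  {name}: {old_kb} KB -> {new_kb} KB")
    lines.extend(f"warning: {message}" for message in result["warnings"])
    return "\n".join(lines)


# Made-up sample values, only shaped like the documented result dict.
sample = {
    "original_size_mb": 12.4,
    "new_size_mb": 3.1,
    "reduction_mb": 9.3,
    "reduction_percent": 75.0,
    "output_path": "output.docx",
    "visio_converted": [("image1.jpg", 210)],
    "visio_removed": 1,
    "images_compressed": [("image2.png", 1800, 420)],
    "duplicates_removed": 0,
    "comments_removed": 2,
    "bookmarks_removed": 3,
    "garbage_removed": ["docProps/thumbnail.jpeg"],
    "warnings": ["Visio unavailable"],
}

report = summarize(sample)
print(report)
```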
How it works
A .docx file is a ZIP archive containing XML and media files. docx-shrinker extracts the archive into a temp directory, applies all transformations in-place, then repacks it into a new ZIP. The original file is never modified.
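The extract, transform, repack cycle described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's code: `repack` and `is_valid_docx` are hypothetical names, and the demo archive is a minimal stand-in rather than a document Word would open.

```python
import tempfile
import zipfile
from pathlib import Path


def repack(src: Path, dst: Path, transform=lambda root: None) -> None:
    """Extract src, let transform edit the tree in place, write a new archive to dst."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        with zipfile.ZipFile(src) as zin:
            zin.extractall(root)
        transform(root)  # src itself is never touched
        with zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
            for path in sorted(root.rglob("*")):
                if path.is_file():
                    zout.write(path, path.relative_to(root).as_posix())


def is_valid_docx(path: Path) -> bool:
    """Check ZIP integrity and the presence of [Content_Types].xml."""
    try:
        with zipfile.ZipFile(path) as zf:
            return zf.testzip() is None and "[Content_Types].xml" in zf.namelist()
    except zipfile.BadZipFile:
        return False


# Demo: build a tiny stand-in archive, drop its thumbnail, and validate the result.
with tempfile.TemporaryDirectory() as work:
    src, dst = Path(work) / "in.docx", Path(work) / "out.docx"
    with zipfile.ZipFile(src, "w") as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<document/>")
        zf.writestr("docProps/thumbnail.jpeg", b"\xff\xd8")
    repack(src, dst, transform=lambda root: (root / "docProps" / "thumbnail.jpeg").unlink())
    ok = is_valid_docx(dst)
    with zipfile.ZipFile(dst) as zf:
        names = zf.namelist()

print(ok, names)
```

Writing to a separate output path is what keeps the original file untouched even if a transformation fails partway through.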
Visio diagrams embedded as OLE objects include both the full .vsdx source and a low-resolution EMF preview image. docx-shrinker replaces these with a high-quality raster render and strips the heavy .vsdx originals — often the single biggest source of bloat.
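To check whether embedded Visio sources dominate a particular document, one option is to list the largest parts of the archive before shrinking it. The sketch below uses only `zipfile`; the helper name `largest_parts` is hypothetical, and the part names in the demo (`word/embeddings/...`, `word/media/...`) are assumptions modelled on typical Word output.

```python
import io
import zipfile


def largest_parts(docx, top: int = 5) -> list[tuple[str, int]]:
    """Return (part name, uncompressed size in bytes) for the biggest parts."""
    with zipfile.ZipFile(docx) as zf:
        sizes = [(info.filename, info.file_size) for info in zf.infolist()]
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]


# Demo on an in-memory stand-in archive with made-up part sizes.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", "<document/>")
    zf.writestr("word/media/image1.emf", b"\0" * 2000)
    zf.writestr("word/embeddings/Microsoft_Visio_Drawing1.vsdx", b"\0" * 50000)

biggest = largest_parts(buf, top=2)
for name, size in biggest:
    print(f"{size:>8} bytes  {name}")
```

`largest_parts` also accepts a file path, so it can be pointed at a real `.docx` directly.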
License
MIT