Shrink and sanitize Word (.docx) documents by converting Visio embeddings, compressing images, and stripping metadata
docx-shrinker
Shrink and sanitize Word (.docx) documents. Converts embedded Visio diagrams to raster images, compresses oversized media, deduplicates files, and strips metadata, comments, tracked changes, macros, and other cruft.
What it does
- Convert Visio embeddings: `.vsdx` → PDF (via Visio COM) → JPG/PNG (via PyMuPDF). Falls back to keeping the EMF preview when Visio is unavailable.
- Convert OLE objects: Replaces legacy VML `<w:object>` blocks with modern DrawingML `<w:drawing>` inline pictures.
- Compress images: Resizes raster images exceeding a pixel width threshold and re-compresses JPGs.
- Deduplicate media: Identifies identical files by hash and rewrites relationships to point to a single copy.
- Strip personal info: Removes author, last modified by, company, manager, keywords, and other document properties.
- Remove comments and tracked changes: Deletes comment files and accepts all revisions inline.
- Strip bookmarks: Removes auto-generated bookmarks (`_GoBack`, empty-name).
- Remove garbage parts: Thumbnail, VBA macros, printer settings, ActiveX controls, custom XML data.
- Clean up: Updates `[Content_Types].xml` and `.rels` files to reflect removed parts.
- Validate output: Checks ZIP integrity and presence of `[Content_Types].xml` before finalizing.
Requirements
- Python 3.10+
- PyMuPDF (`pymupdf`): image compression and PDF-to-image rendering
- pywin32: Visio COM automation (Windows only; Visio conversion is skipped if unavailable)
- Microsoft Visio (optional): required only for converting embedded `.vsdx` to high-quality images
Installation
```
pip install docx-shrinker
```
Or with uv:
```
uv tool install docx-shrinker
```
Usage
Command line
```
docx-shrinker report.docx
```
This produces report (shrunk).docx in the same directory.
Specify an output path:
```
docx-shrinker report.docx output.docx
```
Options
| Flag | Default | Description |
|---|---|---|
| `--format {jpg,png}` | `jpg` | Image format for converted Visio figures |
| `--dpi N` | `300` | Effective rasterization DPI. Every figure renders at this DPI unless the result would exceed `--max-megapixels`. |
| `--quality N` | `95` | JPG quality (1–100). Ignored for PNG. |
| `--max-megapixels N` | `100` | Cap on output pixel count per image, in megapixels. Images exceeding the cap are downscaled, preserving aspect ratio. Set to `0` to disable. |
| `-i`, `--interactive` | off | After conversion, show the top 5 largest images and offer to re-convert them at a different quality |
| `--version` | | Show version and exit |
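The way `--dpi` and `--max-megapixels` interact can be sketched in a few lines. This is an illustration only, not the package's code: the helper name `effective_render_size` and the choice of page dimensions in inches are assumptions made for the example.

```python
import math


def effective_render_size(width_in: float, height_in: float,
                          dpi: int = 300,
                          max_megapixels: float = 100.0) -> tuple[int, int]:
    """Return (width_px, height_px) for a page rendered at `dpi`.

    Hypothetical helper. If the render would exceed `max_megapixels`,
    both sides are scaled by the same factor so the aspect ratio is kept.
    A cap of 0 disables the limit, mirroring the CLI description.
    """
    width_px = width_in * dpi
    height_px = height_in * dpi
    cap = max_megapixels * 1_000_000
    if max_megapixels > 0 and width_px * height_px > cap:
        scale = math.sqrt(cap / (width_px * height_px))
        width_px *= scale
        height_px *= scale
    return max(1, int(width_px)), max(1, int(height_px))


# A letter-size page at 300 DPI is about 8.4 megapixels, well under the cap.
print(effective_render_size(8.5, 11))
# A 40 x 40 inch page at 300 DPI would be 144 megapixels, so it is scaled down.
print(effective_render_size(40, 40))
```

Under these assumptions, typical pages are unaffected by the cap and only unusually large pages are downscaled.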
Examples
Convert Visio figures to PNG at 150 DPI:
```
docx-shrinker report.docx --format png --dpi 150
```
Aggressive compression (lower quality, tighter megapixel cap):
```
docx-shrinker report.docx --quality 80 --max-megapixels 25
```
Interactive mode to fine-tune large images:
```
docx-shrinker report.docx -i
```
Python API
```python
from docx_shrinker import shrink_docx

result = shrink_docx("input.docx", "output.docx", fmt="jpg", dpi=300, quality=95)
print(f"{result['original_size_mb']} MB -> {result['new_size_mb']} MB")
print(f"Reduction: {result['reduction_percent']}%")
```
The result dict contains:
| Key | Type | Description |
|---|---|---|
| `original_size_mb` | float | Original file size |
| `new_size_mb` | float | Output file size |
| `reduction_mb` | float | Size saved |
| `reduction_percent` | float | Percentage reduction |
| `output_path` | str | Path to the output file |
| `visio_converted` | list | `(name, size_kb)` tuples for each converted Visio diagram |
| `visio_removed` | int | Number of `.vsdx` embeddings removed |
| `images_compressed` | list | `(filename, old_kb, new_kb)` tuples |
| `duplicates_removed` | int | Number of duplicate media files removed |
| `comments_removed` | int | Number of comment files removed |
| `bookmarks_removed` | int | Number of bookmarks removed |
| `garbage_removed` | list | Names of removed garbage parts |
| `warnings` | list | Warning messages (e.g., Visio unavailable) |
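One way to consume the result dict is to turn it into a short report. The sketch below assumes only the keys listed in the table; `summarize` is a hypothetical helper, and `example` is made-up data shaped like a result, not real output from the tool.

```python
def summarize(result: dict) -> str:
    """Build a short human-readable report from a shrink_docx-style result dict."""
    lines = [
        f"{result['original_size_mb']} MB -> {result['new_size_mb']} MB "
        f"({result['reduction_percent']}% smaller)",
        f"Visio diagrams converted: {len(result['visio_converted'])}",
        f"Images compressed: {len(result['images_compressed'])}",
        f"Duplicates removed: {result['duplicates_removed']}",
    ]
    # Surface warnings last so they are easy to spot.
    lines += [f"warning: {message}" for message in result["warnings"]]
    return "\n".join(lines)


# Illustrative values only.
example = {
    "original_size_mb": 48.2,
    "new_size_mb": 6.1,
    "reduction_percent": 87.3,
    "visio_converted": [("Drawing1.vsdx", 412)],
    "images_compressed": [("image3.png", 900, 210)],
    "duplicates_removed": 2,
    "warnings": [],
}
print(summarize(example))
```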
How it works
A .docx file is a ZIP archive containing XML and media files. docx-shrinker extracts the archive into a temp directory, applies all transformations in-place, then repacks it into a new ZIP. The original file is never modified.
Visio diagrams embedded as OLE objects include both the full .vsdx source and a low-resolution EMF preview image. docx-shrinker replaces these with a high-quality raster render and strips the heavy .vsdx originals — often the single biggest source of bloat.
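The extract, transform, repack cycle described above can be sketched with the standard library alone. This is a simplified illustration, not the package's implementation; `repack_docx` and its `transform` callback are hypothetical names.

```python
import tempfile
import zipfile
from pathlib import Path


def repack_docx(src: str, dst: str, transform=lambda root: None) -> None:
    """Extract `src`, run `transform(root_dir)` on the extracted tree, write `dst`.

    The source file is only read, never written, so the original stays intact.
    """
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        with zipfile.ZipFile(src) as zin:
            zin.extractall(root)
        transform(root)  # in-place edits on the extracted files
        with zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
            for path in sorted(root.rglob("*")):
                if path.is_file():
                    # Archive member names always use forward slashes.
                    zout.write(path, path.relative_to(root).as_posix())
```

A caller would pass a function that deletes or rewrites files under the extracted root, for example one that removes `docProps/thumbnail.jpeg`.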
Technical reference
Processing pipeline
shrink_docx() unpacks the .docx ZIP into a temp directory, applies all transformations in-place, then repacks into a new ZIP. The original file is never modified.
1. Visio conversion (.vsdx → PDF → image)
Embedded Visio diagrams are OLE objects containing both the full .vsdx source and a low-resolution EMF preview. The conversion pipeline:
- `_export_vsdx_to_pdf`: Opens each `.vsdx` via Visio COM (`ExportAsFixedFormat`) and exports to PDF. This two-step path exists because Visio's direct raster export (`Page.Export` to PNG/BMP) produces extremely low-quality output. The PDF intermediate preserves full vector fidelity.
- `_restore_pdf_images`: Visio's PDF export downscales and JPEG-compresses any raster images embedded in `.vsdx` files, even if the originals are lossless PNGs (e.g., a 1590x633 PNG becomes a 668x266 JPEG with visible chroma artifacts). There is no COM setting to control this. The fix: extract the original images from the `.vsdx` ZIP, match them to degraded PDF images by aspect ratio, and replace them with `page.replace_image()`. The corrected PDF is saved and reopened before rasterization. A critical detail: the PDF image transform matrix often has a negative Y scale (`matrix.d < 0`, since PDF origin is bottom-left), meaning the image data is stored vertically flipped. When replacing, the original must be flipped to match. Not all images need flipping; some transforms have positive Y scale.
- `_border_clip_rect`: Visio always draws a 0.75pt black stroked rectangle at the page edges of every exported PDF. This border is not centered on the page boundary: left/top are nearly flush while bottom/right overshoot outward by 0.02–0.12pt. The code detects this rectangle via `page.get_drawings()` (typically drawing #0), computes per-side inset from the actual stroke overshoot, and clips the render rect inward to exclude both the stroke and its anti-alias fringe.

  Earlier approaches that failed:
  - Fixed uniform inset (0.5pt or 1pt): either clipped content or left borders on some sides due to the asymmetric overshoot.
  - Pixel-level detection after rasterization: fragile; anti-aliased gray pixels don't pass a simple threshold, and results varied per image.
  - White rectangle overlay: the overlay's own edges get anti-aliased, replacing one faint border with another.
- `_render_pdf_to_image`: Rasterizes the first page of the corrected PDF to PNG/JPG via PyMuPDF, using the computed clip rect. Renders at the requested effective DPI and downscales only if the result would exceed `max_megapixels`, so small and large Visio pages get the same quality.

  Alternatives that failed:
  - SVG embedding via `<asvg:svgBlip>`: The DrawingML spec supports embedding SVG alongside a raster fallback, and modern Word preserves SVG as vector through PDF export. But Visio COM's `Page.Export("*.svg")` produces SVG files that render as all-black shapes (stroke definitions survive, but fills, text styles, and images are lost or mis-referenced). There is no COM option to fix this. Since the SVG is unusable, the approach reduces to the raster pipeline anyway.
  - Post-hoc PDF figure replacement: Let Word export docx → PDF as usual, then replace the Visio-sourced raster images in the output PDF with vector content from the intermediate Visio PDFs. Matching the raster placements in Word's PDF back to their source figures is unreliable: Word re-encodes JPGs (defeating hash match), changes pixel dimensions (defeating dim match), and can reorder or reflow figures (defeating document-order match). Even when matching works, the "vector" Visio PDFs are mostly raster anyway because Visio's PDF export rasterizes complex shapes; see `_restore_pdf_images` for the partial recovery we already do.

  Higher DPI is the practical answer. 300 DPI effective (the default) is sharp at normal reading zoom; 600 DPI is crisp up to 400% zoom. The megapixel cap (`max_megapixels`, default 100) keeps file sizes bounded when individual Visio pages are unusually large.
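The aspect-ratio matching and flip check described for `_restore_pdf_images` can be illustrated without any PDF library. This is a sketch under stated assumptions: `match_by_aspect_ratio` and `needs_vertical_flip` are hypothetical names, the 2% tolerance is an arbitrary choice, and the real code works on PyMuPDF objects rather than plain tuples.

```python
def match_by_aspect_ratio(pdf_images: dict, originals: dict,
                          tolerance: float = 0.02) -> dict:
    """Pair each degraded PDF image with the original of closest aspect ratio.

    Both arguments map a name to (width_px, height_px).
    Returns {pdf_name: original_name}; PDF images with no original within
    `tolerance` (relative error) are left out of the result.
    """
    matches = {}
    for pdf_name, (pdf_w, pdf_h) in pdf_images.items():
        pdf_ratio = pdf_w / pdf_h
        best_name, best_err = None, tolerance
        for orig_name, (orig_w, orig_h) in originals.items():
            err = abs(orig_w / orig_h - pdf_ratio) / pdf_ratio
            if err <= best_err:
                best_name, best_err = orig_name, err
        if best_name is not None:
            matches[pdf_name] = best_name
    return matches


def needs_vertical_flip(matrix_d: float) -> bool:
    """A negative Y scale in the image transform means the data is stored flipped."""
    return matrix_d < 0


# The downscaled 668x266 JPEG from the example above still matches its
# 1590x633 source, because downscaling preserves the aspect ratio.
pdf_images = {"Im0": (668, 266)}
originals = {"image1.png": (1590, 633), "image2.png": (800, 800)}
print(match_by_aspect_ratio(pdf_images, originals))
```

Matching on aspect ratio alone is ambiguous when two source images share the same proportions, so a real implementation would need a tie-breaking rule.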
2. OLE/VML to DrawingML conversion
- `extract_vml_dimensions`: Parses width/height in EMU from `<w:object>` style attributes (handles both `pt` and `in` units).
- `object_to_drawing`: Rewrites legacy VML `<w:object>` blocks as modern DrawingML `<w:drawing>` inline pictures, preserving the image relationship ID and dimensions.
- `next_doc_pr_id`: Scans the document XML for the highest existing `docPr`/`cNvPr` id to generate unique IDs for new drawing elements.
- `ensure_namespaces`: Adds `wp:` and `r:` namespace declarations to the root `<w:document>` if missing, which is required for DrawingML elements.
3. Image compression
- `compress_media_images`: Re-encodes oversized raster images in `word/media/`. PNGs with high estimated compression potential are converted to JPEG. Existing JPEGs are re-saved at the target quality. Images whose pixel count exceeds `max_megapixels` are downscaled. Skips images that would grow larger after re-encoding.
4. Media deduplication
- `dedup_media`: Identifies identical files in `word/media/` by MD5 hash. Keeps one canonical copy and rewrites all `.rels` references to point to it, removing the duplicates.
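Hash-based deduplication can be sketched as follows. The function names are hypothetical, and the `Target="media/<name>"` form assumed in `rewrite_rels` is a simplification; real relationship files may reference media differently.

```python
import hashlib
from pathlib import Path


def find_duplicates(media_dir: str) -> dict[str, str]:
    """Map each duplicate file name to the canonical file it is identical to."""
    canonical: dict[str, str] = {}   # digest -> first file name seen
    duplicates: dict[str, str] = {}  # duplicate name -> canonical name
    for path in sorted(Path(media_dir).iterdir()):
        if not path.is_file():
            continue
        # MD5 is used here for content identity only, not for security.
        digest = hashlib.md5(path.read_bytes(), usedforsecurity=False).hexdigest()
        if digest in canonical:
            duplicates[path.name] = canonical[digest]
        else:
            canonical[digest] = path.name
    return duplicates


def rewrite_rels(rels_xml: str, duplicates: dict[str, str]) -> str:
    """Point relationship targets at the canonical copy (assumed Target format)."""
    for dup, keep in duplicates.items():
        rels_xml = rels_xml.replace(f'Target="media/{dup}"', f'Target="media/{keep}"')
    return rels_xml
```

After rewriting, the duplicate files themselves can be deleted from the media directory, since nothing references them any longer.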
5. Metadata and markup stripping
- `sanitize_core_props` / `sanitize_app_props`: Strip personal info (author, last modified by, company, manager, keywords, etc.) from `docProps/core.xml` and `docProps/app.xml`.
- `remove_comment_files`: Deletes `comments.xml`, `commentsExtended.xml`, and `commentsIds.xml`.
- `strip_comment_refs`: Removes comment range/reference XML tags (`commentRangeStart`, `commentRangeEnd`, `commentReference`) from document XML.
- `strip_revisions`: Accepts all tracked changes inline by unwrapping `<w:ins>` content, removing `<w:del>` blocks and their content, and stripping revision property tags (`rPrChange`, `pPrChange`, `sectPrChange`, `tblPrChange`).
- `strip_bookmarks`: Removes auto-generated bookmarks (`_GoBack`, empty-name).
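What "accepting all revisions" means at the XML level can be shown with a regex-based sketch. This is a simplified illustration with a hypothetical function name, `accept_revisions`; it does not handle nested revision elements, which the package addresses with its own helpers.

```python
import re


def accept_revisions(xml: str) -> str:
    """Accept tracked changes in a WordprocessingML string (simplified sketch)."""
    # Self-closing <w:del/> markers (e.g. deleted paragraph marks) go first,
    # so they cannot be mistaken for the opening tag of a deletion block.
    xml = re.sub(r"<w:del\b[^>]*/>", "", xml)
    # Deleted blocks disappear entirely, including their content.
    xml = re.sub(r"<w:del\b[^>]*>.*?</w:del>", "", xml, flags=re.DOTALL)
    # Inserted content is kept: only the <w:ins> wrapper tags are dropped.
    xml = re.sub(r"</?w:ins\b[^>]*>", "", xml)
    # Revision property records hold the old formatting; remove them whole.
    for tag in ("rPrChange", "pPrChange", "sectPrChange", "tblPrChange"):
        xml = re.sub(rf"<w:{tag}\b[^>]*/>", "", xml)
        xml = re.sub(rf"<w:{tag}\b[^>]*>.*?</w:{tag}>", "", xml, flags=re.DOTALL)
    return xml
```

The `\b` after each tag name matters: without it, the `<w:del` pattern would also match `<w:delText>`.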
6. Garbage part removal
- `remove_garbage_parts`: Deletes thumbnail, VBA macros (`vbaProject.bin`, `vbaData.xml`), printer settings, ActiveX controls, custom XML data, and the `.vsdx` embeddings themselves (after conversion).
7. Cleanup and validation
- `clean_content_types`: Removes `[Content_Types].xml` entries referencing deleted parts.
- `clean_relationships`: Removes `.rels` entries referencing deleted parts across all relationship files.
- `_strip_xml_tags` / `_strip_nested_tag`: Low-level helpers for removing XML tags, handling arbitrarily nested structures by processing innermost matches first.
- Output ZIP integrity is validated (well-formed ZIP, `[Content_Types].xml` present) before finalizing.
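The two output checks named above can be expressed with the standard `zipfile` module. This is a minimal sketch with a hypothetical function name, `validate_docx`; the package may perform additional checks.

```python
import zipfile


def validate_docx(path: str) -> list[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    if not zipfile.is_zipfile(path):
        return ["not a ZIP archive"]
    problems = []
    with zipfile.ZipFile(path) as zf:
        bad = zf.testzip()  # name of the first member with a bad CRC, or None
        if bad is not None:
            problems.append(f"corrupt member: {bad}")
        if "[Content_Types].xml" not in zf.namelist():
            problems.append("missing [Content_Types].xml")
    return problems
```

Returning a list rather than raising lets a caller report every problem at once before deciding whether to discard the output.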
8. Interactive mode
- `_interactive_reconvert`: When `-i` is passed, presents the top 5 largest converted images and offers to re-convert selected ones at a different quality/DPI setting.
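Selecting the largest images is a small ranking step. The sketch below assumes the `(name, size_kb)` tuples documented for `visio_converted`; `largest_images` is a hypothetical helper and the sizes are made-up values.

```python
import heapq


def largest_images(converted: list[tuple[str, float]], n: int = 5) -> list[tuple[str, float]]:
    """Return the n largest (name, size_kb) entries, biggest first."""
    return heapq.nlargest(n, converted, key=lambda item: item[1])


# Illustrative values only.
converted = [("fig1.jpg", 120), ("fig2.jpg", 950), ("fig3.jpg", 40),
             ("fig4.jpg", 610), ("fig5.jpg", 75), ("fig6.jpg", 300)]
for name, size_kb in largest_images(converted):
    print(f"{name}: {size_kb} KB")
```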
License
MIT