Shrink and sanitize Word (.docx) documents by converting Visio embeddings, compressing images, and stripping metadata

docx-shrinker

Shrink and sanitize Word (.docx) documents. Converts embedded Visio diagrams to raster images, compresses oversized media, deduplicates files, and strips metadata, comments, tracked changes, macros, and other cruft.

What it does

  1. Convert Visio embeddings: .vsdx → PDF (via Visio COM) → JPG/PNG (via PyMuPDF). Falls back to keeping the EMF preview when Visio is unavailable.
  2. Convert OLE objects — Replaces legacy VML <w:object> blocks with modern DrawingML <w:drawing> inline pictures.
  3. Compress images — Resizes raster images exceeding a pixel width threshold and re-compresses JPGs.
  4. Deduplicate media — Identifies identical files by hash and rewrites relationships to point to a single copy.
  5. Strip personal info — Removes author, last modified by, company, manager, keywords, and other document properties.
  6. Remove comments and tracked changes — Deletes comment files and accepts all revisions inline.
  7. Strip bookmarks — Removes auto-generated bookmarks (_GoBack, empty).
  8. Remove garbage parts — Thumbnail, VBA macros, printer settings, ActiveX controls, custom XML data.
  9. Clean up — Updates [Content_Types].xml and .rels files to reflect removed parts.
  10. Validate output — Checks ZIP integrity and presence of [Content_Types].xml before finalizing.
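Step 4 (deduplicate media) can be sketched with the standard library alone. This is a minimal illustration, not docx-shrinker's actual code: the function names `find_duplicate_media` and `rewrite_targets`, the choice of SHA-256, and the plain string replacement on the .rels content are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def find_duplicate_media(media_dir: Path) -> dict[str, str]:
    """Map each duplicate file name to the name of the copy that is kept."""
    first_seen: dict[str, str] = {}  # content hash -> first file name seen
    redirects: dict[str, str] = {}   # duplicate name -> kept name
    for path in sorted(p for p in media_dir.iterdir() if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in first_seen:
            redirects[path.name] = first_seen[digest]
        else:
            first_seen[digest] = path.name
    return redirects


def rewrite_targets(rels_xml: str, redirects: dict[str, str]) -> str:
    """Point relationship targets at the kept copy (assumed Target layout)."""
    for duplicate, kept in redirects.items():
        rels_xml = rels_xml.replace(
            f'Target="media/{duplicate}"', f'Target="media/{kept}"'
        )
    return rels_xml
```

Once every relationship points at the kept copy, the duplicate files can be deleted from the extracted tree before repacking.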

Requirements

  • Python 3.10+
  • PyMuPDF (pymupdf) — image compression and PDF-to-image rendering
  • pywin32 — Visio COM automation (Windows only; Visio conversion is skipped if unavailable)
  • Microsoft Visio (optional) — required only for converting embedded .vsdx to high-quality images

Installation

pip install docx-shrinker

Or with uv:

uv tool install docx-shrinker

Usage

Command line

docx-shrinker report.docx

This produces report (shrunk).docx in the same directory.
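The default naming shown above can be reproduced with pathlib. This is a small sketch that assumes the rule is "same directory, stem plus ` (shrunk)`, same extension"; `default_output_path` is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def default_output_path(input_path: str) -> Path:
    """Derive 'name (shrunk).docx' next to the input file (assumed rule)."""
    src = Path(input_path)
    return src.with_name(f"{src.stem} (shrunk){src.suffix}")
```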

Specify an output path:

docx-shrinker report.docx output.docx

Options

Flag                 Default  Description
--format {jpg,png}   jpg      Image format for converted Visio figures
--dpi N              300      Rasterization DPI for Visio conversion
--quality N          95       JPG quality (1–100). Ignored for PNG.
--max-width N        2000     Max pixel width for raster images. 0 to disable.
-i, --interactive    off      After conversion, show the 5 largest images and offer to re-convert them at a different quality
--version                     Show version and exit
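To make the effect of --max-width concrete, here is a rough sketch of the size calculation it implies. It assumes the aspect ratio is preserved and that images at or below the threshold are left alone. `target_size` is a hypothetical helper for illustration, not a function exported by docx-shrinker.

```python
def target_size(width: int, height: int, max_width: int = 2000) -> tuple[int, int]:
    """Return the size an image would be scaled to under a max-width cap."""
    if max_width <= 0 or width <= max_width:
        return width, height  # cap disabled (0), or image already small enough
    scale = max_width / width
    # Assumption: height shrinks by the same factor to keep the aspect ratio.
    return max_width, max(1, round(height * scale))
```

Under these assumptions a 4000×3000 image becomes 2000×1500 with the default, and passing 0 leaves it untouched.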

Examples

Convert Visio figures to PNG at 150 DPI:

docx-shrinker report.docx --format png --dpi 150

Aggressive compression (lower quality, smaller max width):

docx-shrinker report.docx --quality 80 --max-width 1200

Interactive mode to fine-tune large images:

docx-shrinker report.docx -i

Python API

from docx_shrinker import shrink_docx

result = shrink_docx("input.docx", "output.docx", fmt="jpg", dpi=300, quality=95)

print(f"{result['original_size_mb']} MB -> {result['new_size_mb']} MB")
print(f"Reduction: {result['reduction_percent']}%")

The result dict contains:

Key                 Type   Description
original_size_mb    float  Original file size (MB)
new_size_mb         float  Output file size (MB)
reduction_mb        float  Size saved (MB)
reduction_percent   float  Percentage reduction
output_path         str    Path to the output file
visio_converted     list   (name, size_kb) tuples for each converted Visio diagram
visio_removed       int    Number of .vsdx embeddings removed
images_compressed   list   (filename, old_kb, new_kb) tuples for each compressed image
duplicates_removed  int    Number of duplicate media files removed
comments_removed    int    Number of comment files removed
bookmarks_removed   int    Number of bookmarks removed
garbage_removed     list   Names of removed garbage parts
warnings            list   Warning messages (e.g., Visio unavailable)
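As an illustration of consuming the result dict, the sketch below turns it into a short text report. It relies only on the keys listed above. `summarize` is a hypothetical helper written for this example, and it assumes the tuple shapes are exactly as documented.

```python
def summarize(result: dict) -> str:
    """Build a short, human-readable report from a shrink_docx-style result."""
    lines = [
        f"{result['original_size_mb']} MB -> {result['new_size_mb']} MB "
        f"({result['reduction_percent']}% smaller)",
        f"Visio diagrams converted: {len(result['visio_converted'])}",
        f"Images compressed: {len(result['images_compressed'])}",
        f"Duplicates removed: {result['duplicates_removed']}",
    ]
    # One indented line per compressed image, using the documented tuple shape.
    for filename, old_kb, new_kb in result["images_compressed"]:
        lines.append(f"  {filename}: {old_kb} KB -> {new_kb} KB")
    lines.extend(f"warning: {message}" for message in result["warnings"])
    return "\n".join(lines)
```

Calling `print(summarize(result))` after `shrink_docx(...)` would give a quick overview, with warnings (such as Visio being unavailable) listed last.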

How it works

A .docx file is a ZIP archive containing XML and media files. docx-shrinker extracts the archive into a temporary directory, applies all transformations in place on the extracted files, then repacks them into a new ZIP. The original file is never modified.
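The extract, transform, repack cycle can be sketched with the standard library. This is an assumption-laden illustration rather than the package's implementation: `repack_docx`, `validate_docx`, and the `transform` callback are hypothetical names, and the validation mirrors only the two checks named in the feature list (ZIP integrity and the presence of [Content_Types].xml).

```python
import shutil
import tempfile
import zipfile
from pathlib import Path
from typing import Callable


def validate_docx(path: str) -> None:
    """Raise ValueError if the archive is corrupt or lacks [Content_Types].xml."""
    with zipfile.ZipFile(path) as archive:
        bad = archive.testzip()  # name of the first corrupt member, or None
        if bad is not None:
            raise ValueError(f"corrupt member: {bad}")
        if "[Content_Types].xml" not in archive.namelist():
            raise ValueError("missing [Content_Types].xml")


def repack_docx(src: str, dst: str, transform: Callable[[Path], None]) -> None:
    """Extract src, let transform edit the tree in place, then write dst."""
    workdir = Path(tempfile.mkdtemp())
    try:
        with zipfile.ZipFile(src) as archive:
            archive.extractall(workdir)
        transform(workdir)  # hypothetical hook standing in for the real steps
        with zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as out:
            for path in sorted(workdir.rglob("*")):
                if path.is_file():
                    # Archive names always use forward slashes.
                    out.write(path, path.relative_to(workdir).as_posix())
        validate_docx(dst)
    finally:
        shutil.rmtree(workdir, ignore_errors=True)
```

Because all edits happen on the extracted copy and the result is written to a separate path, the source file is only ever opened for reading.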

Visio diagrams embedded as OLE objects include both the full .vsdx source and a low-resolution EMF preview image. docx-shrinker replaces these with a high-quality raster render and strips the heavy .vsdx originals — often the single biggest source of bloat.

License

MIT
