Reusable conversion utilities for images, documents, text, and related data types.

mmqc-utils

Reusable conversion utilities for MMQC projects.

Included utilities

  • Convert and downscale common image formats (TIFF, JPEG, PNG, GIF, WebP, PDF) to bounded JPEG previews, with optional byte-size budget enforcement
  • Convert DOCX, RTF, ODT, TeX, and PDF documents to HTML
  • Normalize HTML to plain text

System requirements

  • ImageMagick — required for image conversion (convert_to_bounded_jpeg). Install via your system package manager:
    # macOS
    brew install imagemagick
    
    # Debian / Ubuntu
    apt-get install imagemagick
    
  • Pandoc — bundled automatically via pypandoc-binary; no separate installation needed.
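
Before calling the image functions, it can help to confirm that ImageMagick is actually on the PATH. The following is a minimal sketch, not part of mmqc-utils: find_imagemagick is a hypothetical helper, and which executable name the library itself invokes is an assumption. ImageMagick 7 installs a magick binary, while ImageMagick 6 installs convert, so the sketch looks for both.

```python
import shutil
from typing import Optional


def find_imagemagick() -> Optional[str]:
    """Return the path of the first ImageMagick executable on PATH, or None.

    Hypothetical helper for illustration; mmqc-utils may locate the
    binary differently.
    """
    # ImageMagick 7 ships `magick`; ImageMagick 6 ships `convert`.
    for name in ("magick", "convert"):
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    location = find_imagemagick()
    print(location or "ImageMagick not found; install it before converting images")
```

On Windows, convert can also resolve to an unrelated system tool, so a hit on that name alone is not conclusive there.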

Installation

pip install mmqc-utils
# or
uv add mmqc-utils

Usage

The conversion functions (document_to_html, convert_to_bounded_jpeg, compress_to_bounded_jpeg) accept a file path (str or Path), raw bytes or bytearray, or a BinaryIO object.
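
To illustrate what accepting all of these input kinds can look like, here is a minimal sketch of one way to reduce them to raw bytes. It is not the library's implementation: read_source and the Source alias are hypothetical names, and the rule that a str is always a path (never file content) is an assumption drawn from the description above.

```python
import io
from pathlib import Path
from typing import BinaryIO, Union

# Hypothetical alias covering the input kinds listed above.
Source = Union[str, Path, bytes, bytearray, BinaryIO]


def read_source(source: Source) -> bytes:
    """Return the raw bytes behind any of the accepted input kinds."""
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    if isinstance(source, (str, Path)):
        # Assumption: a str is a file path, not file content.
        return Path(source).read_bytes()
    # Anything else is treated as a binary file-like object.
    return source.read()


if __name__ == "__main__":
    print(read_source(io.BytesIO(b"abc")))
```

A consequence of this design is that bytes and file objects carry no file name, which is why a format hint becomes necessary for them.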

Document conversion

from pathlib import Path

from mmqc_utils import document_to_html

# From a file path
html = document_to_html("paper.docx")

# From bytes: input_format is required because there is no file extension to infer it from
raw_bytes = Path("notes.rtf").read_bytes()  # placeholder file name; any RTF bytes work
html = document_to_html(raw_bytes, input_format="rtf")

Supported formats: docx, rtf, odt, tex, pdf.

For PDFs, pandoc is tried first; if it cannot convert the file, text is extracted page-by-page via pypdf and wrapped in <div class='page'> elements.
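
The try-first-then-fall-back behaviour for PDFs can be sketched as below. This is an illustration under stated assumptions, not the package's code: convert_with_fallback and pages_to_html are hypothetical names, the converters are passed in as plain callables so the sketch runs without pandoc or pypdf, and HTML-escaping the page text is an assumption. In a real fallback the page strings could come from pypdf, for example from page.extract_text() for each page in PdfReader(path).pages.

```python
from html import escape
from typing import Callable, Iterable


def pages_to_html(pages: Iterable[str]) -> str:
    """Wrap each page's extracted text in its own <div class='page'> element."""
    # Assumption: page text is escaped so that '<' or '&' cannot break the markup.
    return "\n".join(f"<div class='page'>{escape(text)}</div>" for text in pages)


def convert_with_fallback(
    primary: Callable[[], str],
    fallback: Callable[[], str],
) -> str:
    """Return primary() unless it raises, in which case return fallback()."""
    try:
        return primary()
    except Exception:
        return fallback()


def failing_primary() -> str:
    # Stand-in for a pandoc call that cannot convert the file.
    raise RuntimeError("primary converter failed")


if __name__ == "__main__":
    html = convert_with_fallback(
        failing_primary,
        lambda: pages_to_html(["Page one", "a < b"]),
    )
    print(html)
```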

Image conversion

from mmqc_utils import convert_to_bounded_jpeg, compress_to_bounded_jpeg

# Downscale to pixel dimensions
jpeg_bytes = convert_to_bounded_jpeg(
    "figure.tiff",
    rasterization_dpi=150,   # DPI for vector/PDF rasterization
    max_dimension=2000,       # downscale if width or height exceeds this
    compression_quality=80,   # JPEG quality 1–100
    background="white",       # background when removing transparency
)

# Compress until the result fits within a byte-size budget
jpeg_bytes = compress_to_bounded_jpeg(
    "figure.tiff",
    max_bytes=5 * 1024 * 1024,  # 5 MB
    max_dimension=2000,
    compression_quality=80,
)

Both functions accept a Path, str, bytes, bytearray, or BinaryIO as input. For multi-page or multi-layer TIFFs, only the first page or layer is rendered.
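
As a worked example of what max_dimension implies, the sketch below computes the target size for an image whose width or height exceeds the bound. It assumes the aspect ratio is preserved and that smaller images are left untouched, which the description suggests but does not state outright. bounded_size is a hypothetical helper, not a function exported by mmqc-utils.

```python
def bounded_size(width: int, height: int, max_dimension: int) -> tuple[int, int]:
    """Return (width, height) scaled down so neither side exceeds max_dimension.

    Images already within the bound are returned unchanged; the aspect
    ratio is preserved (assumption) using integer arithmetic.
    """
    longest = max(width, height)
    if longest <= max_dimension:
        return width, height
    # Scale both sides by max_dimension / longest, never dropping below 1 pixel.
    new_width = max(1, width * max_dimension // longest)
    new_height = max(1, height * max_dimension // longest)
    return new_width, new_height


if __name__ == "__main__":
    print(bounded_size(4000, 3000, 2000))  # (2000, 1500)
    print(bounded_size(800, 600, 2000))    # (800, 600)
```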

compress_to_bounded_jpeg steps down JPEG quality first (80 → 70 → … → 30), then halves max_dimension and repeats, until the result fits within max_bytes. If the budget cannot be met even at minimum quality and dimension, it returns the smallest result achieved.

Text normalization

from mmqc_utils import html_to_text, compute_plain_text

text = html_to_text("<p>Hello <b>world</b></p>")
# → "Hello world"

compute_plain_text is an alias for html_to_text. Block-level tags (<p>, <div>, <br>, headings, list items, …) are replaced by a space; inline tags are stripped; whitespace is collapsed.
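
The normalization rules above can be approximated with the standard library alone. The sketch below is an illustration, not the package's implementation: html_to_text_sketch and BLOCK_TAGS are hypothetical names, and the tag set covers only the block-level tags named above, since the full list is not given.

```python
import re
from html.parser import HTMLParser

# Assumed subset of block-level tags; the real list may be longer.
BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "h4", "h5", "h6", "li"}


class _TextExtractor(HTMLParser):
    """Collect text, emitting a space wherever a block-level tag appears."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        # Inline tags contribute nothing, so their text joins directly.
        self.parts.append(data)


def html_to_text_sketch(html: str) -> str:
    """Replace block tags with spaces, strip inline tags, collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


if __name__ == "__main__":
    print(html_to_text_sketch("<p>Hello <b>world</b></p>"))  # Hello world
```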

Development

Releasing a new version

To release a new version of mmqc-utils to PyPI:

  1. Update the version number:
    uv run --group publish bump2version [major|minor|patch]
    
  2. Update the uv.lock file:
    uv lock
    
  3. Update the changelog in CHANGELOG.md.
  4. Build the distribution:
    just clean
    uv run --group publish python -m build
    
  5. Check the distribution:
    uv run --group publish twine check dist/*
    
  6. Upload to TestPyPI (optional but recommended):
    uv run --group publish twine upload --repository testpypi dist/*
    
  7. Upload to PyPI:
    uv run --group publish twine upload dist/*
    
