Reusable conversion utilities for images, documents, text, and related data types.

mmqc-utils

Reusable conversion utilities for MMQC projects.

Included utilities

  • Convert and downscale common image formats (TIFF, JPEG, PNG, GIF, WebP, PDF) to bounded JPEG previews, with optional byte-size budget enforcement
  • Convert DOCX, RTF, ODT, TeX, and PDF documents to HTML
  • Normalize HTML to plain text

System requirements

  • ImageMagick — required for image conversion (convert_to_bounded_jpeg). Install via your system package manager:
    # macOS
    brew install imagemagick
    
    # Debian / Ubuntu
    apt-get install imagemagick
    
  • Pandoc — bundled automatically via pypandoc-binary; no separate installation needed.

Installation

pip install mmqc-utils
# or
uv add mmqc-utils

Usage

The document and image conversion functions accept a file path (str or Path), raw bytes/bytearray, or a BinaryIO object.
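As a rough illustration of what accepting these input kinds involves, the sketch below normalizes each one to raw bytes plus an optional file extension. `read_source` is a hypothetical helper written for this README, not part of the mmqc-utils API, and the library's internal handling may differ.

```python
import io
from pathlib import Path
from typing import BinaryIO, Optional, Tuple, Union

Source = Union[str, Path, bytes, bytearray, BinaryIO]


def read_source(source: Source) -> Tuple[bytes, Optional[str]]:
    """Return (data, extension) for any of the accepted input kinds.

    Hypothetical helper, not part of mmqc-utils. The extension is only
    known for file paths; for bytes and streams it is None.
    """
    if isinstance(source, (bytes, bytearray)):
        return bytes(source), None
    if isinstance(source, (str, Path)):
        path = Path(source)
        ext = path.suffix.lower().lstrip(".") or None
        return path.read_bytes(), ext
    # Anything else is treated as a binary file-like object.
    return source.read(), None


# A stream carries no file name, so no extension can be inferred.
data, ext = read_source(io.BytesIO(b"{\\rtf1 hello}"))
```

Only a path carries an extension, which is why a format hint such as input_format has to be passed explicitly for bytes and streams.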

Document conversion

from mmqc_utils import document_to_html

# From a file path
html = document_to_html("paper.docx")

# From bytes: input_format is required because there is no file extension to infer it from
with open("paper.rtf", "rb") as f:
    raw_bytes = f.read()
html = document_to_html(raw_bytes, input_format="rtf")

Supported formats: docx, rtf, odt, tex, pdf.

For PDFs, pandoc is tried first; if it cannot convert the file, text is extracted page-by-page via pypdf and wrapped in <div class='page'> elements.
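The fallback described above can be pictured roughly as follows. This is a minimal sketch, not the library's implementation: `pdf_to_html`, `pages_to_html`, and the two injected callables are hypothetical names, and both the HTML-escaping of page text and the rule that any exception triggers the fallback are assumptions.

```python
from html import escape
from typing import Callable, Iterable


def pages_to_html(page_texts: Iterable[str]) -> str:
    """Wrap each page's extracted text in its own <div class='page'>."""
    # Assumption: page text is HTML-escaped before being wrapped.
    return "\n".join(
        f"<div class='page'>{escape(text)}</div>" for text in page_texts
    )


def pdf_to_html(
    data: bytes,
    convert_with_pandoc: Callable[[bytes], str],
    extract_page_texts: Callable[[bytes], Iterable[str]],
) -> str:
    """Try the primary converter; fall back to page-wise text on failure."""
    try:
        return convert_with_pandoc(data)
    except Exception:
        # Assumption: any converter error triggers the fallback.
        return pages_to_html(extract_page_texts(data))
```

Passing the converter and the page extractor in as callables keeps the sketch free of third-party imports; in practice they would wrap pandoc and pypdf respectively.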

Image conversion

from mmqc_utils import convert_to_bounded_jpeg, compress_to_bounded_jpeg

# Downscale to pixel dimensions
jpeg_bytes = convert_to_bounded_jpeg(
    "figure.tiff",
    rasterization_dpi=150,   # DPI for vector/PDF rasterization
    max_dimension=2000,       # downscale if width or height exceeds this
    compression_quality=80,   # JPEG quality 1–100
    background="white",       # background when removing transparency
)

# Compress until the result fits within a byte-size budget
jpeg_bytes = compress_to_bounded_jpeg(
    "figure.tiff",
    max_bytes=5 * 1024 * 1024,  # 5 MB
    max_dimension=2000,
    compression_quality=80,
)

Both functions accept Path, str, bytes, bytearray, or BinaryIO as input. For multi-page or multi-layer TIFFs, only the first page/layer is rendered.

compress_to_bounded_jpeg steps down JPEG quality first (80 → 70 → … → 30), then halves max_dimension and repeats, until the result fits within max_bytes. If the budget cannot be met even at minimum quality and dimension, it returns the smallest result achieved.
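The search strategy described above can be sketched as a small loop. This is an illustration only: `fit_to_budget` and the `encode` callable are hypothetical, the quality ladder is assumed to restart at the top after each halving, and the `min_dimension` floor of 250 is an assumed value, since the actual minimum is not stated here.

```python
from typing import Callable, Optional

# encode(max_dimension, quality) -> JPEG bytes.
# Hypothetical stand-in for the real ImageMagick-backed encoder.
Encoder = Callable[[int, int], bytes]


def fit_to_budget(
    encode: Encoder,
    max_bytes: int,
    max_dimension: int = 2000,
    start_quality: int = 80,
    min_quality: int = 30,     # bottom of the 80 → … → 30 ladder
    quality_step: int = 10,
    min_dimension: int = 250,  # assumed floor, not documented
) -> bytes:
    """Lower quality first, then halve the dimension, until the result fits."""
    smallest: Optional[bytes] = None
    dimension = max_dimension
    while True:
        for quality in range(start_quality, min_quality - 1, -quality_step):
            candidate = encode(dimension, quality)
            if smallest is None or len(candidate) < len(smallest):
                smallest = candidate
            if len(candidate) <= max_bytes:
                return candidate
        if dimension // 2 < min_dimension:
            # Budget cannot be met: return the smallest result achieved.
            return smallest
        dimension //= 2
```

Trying every quality step before shrinking favors keeping the pixel dimensions, at the cost of more encode calls when the budget is tight.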

Text normalization

from mmqc_utils import html_to_text, compute_plain_text

text = html_to_text("<p>Hello <b>world</b></p>")
# → "Hello world"

compute_plain_text is an alias for html_to_text. Block-level tags (<p>, <div>, <br>, headings, list items, …) are replaced by a space; inline tags are stripped; whitespace is collapsed.
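To make the three rules concrete, here is a minimal standard-library sketch that behaves the same way on simple input. It is not the library's implementation: `sketch_html_to_text` is a hypothetical name, and `BLOCK_TAGS` is an assumed subset of the block-level tags the library handles.

```python
import re
from html.parser import HTMLParser

# Assumed subset of block-level tags; the library's actual list may differ.
BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "h4", "h5", "h6", "li"}


class _TextExtractor(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Block-level tags become a space; inline tags contribute nothing.
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def sketch_html_to_text(html: str) -> str:
    """Strip tags, separate blocks with spaces, and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()
```

Replacing block-level tags with a space keeps adjacent paragraphs or list items from running together once the markup is removed. The sketch does not treat script or style content specially.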

Development

Releasing a New Version

To release a new version of mmqc-utils to PyPI:

  1. Update the version number in pyproject.toml.
  2. Update the uv.lock file:
    uv lock
    
  3. Update the changelog in CHANGELOG.md.
  4. Build the distribution:
    just build
    
  5. Check the distribution:
    uv run --group publish twine check dist/*
    
  6. Upload to TestPyPI (optional but recommended):
    uv run --group publish twine upload --repository testpypi dist/*
    
  7. Upload to PyPI:
    uv run --group publish twine upload dist/*
    

