Reusable conversion utilities for images, documents, text, and related data types.

mmqc-utils

Reusable conversion utilities for MMQC projects.

Included utilities

  • Convert and downscale common image formats (TIFF, JPEG, PNG, GIF, WebP, PDF) to bounded JPEG previews, with optional byte-size budget enforcement
  • Convert DOCX, RTF, ODT, TeX, and PDF documents to HTML
  • Normalize HTML to plain text

System requirements

  • ImageMagick — required for image conversion (convert_to_bounded_jpeg). Install via your system package manager:
    # macOS
    brew install imagemagick
    
    # Debian / Ubuntu
    apt-get install imagemagick
    
  • Pandoc — bundled automatically via pypandoc-binary; no separate installation needed.

Installation

pip install mmqc-utils
# or
uv add mmqc-utils

Usage

The conversion functions accept a file path (str or Path), raw bytes or bytearray, or a BinaryIO object. html_to_text takes an HTML string instead.
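To illustrate what accepting these input kinds involves, here is a minimal sketch of one way to normalize them to raw bytes. The helper name read_source_bytes and the Source alias are hypothetical and not part of mmqc-utils; the library's own handling may differ.

```python
import io
from pathlib import Path
from typing import BinaryIO, Union

# Hypothetical alias covering the input kinds listed above.
Source = Union[str, Path, bytes, bytearray, BinaryIO]


def read_source_bytes(source: Source) -> bytes:
    """Return the raw bytes behind any of the accepted input kinds."""
    if isinstance(source, (bytes, bytearray)):
        # Already in memory: copy into an immutable bytes object.
        return bytes(source)
    if isinstance(source, (str, Path)):
        # A str is treated as a file path, never as file content.
        return Path(source).read_bytes()
    # Anything else is assumed to be a binary file-like object.
    return source.read()


# The same payload reaches the caller whichever form it arrives in.
payload = read_source_bytes(io.BytesIO(b"%PDF-1.7"))
```

With raw bytes or a stream there is no file name to inspect, which is why the format has to be passed explicitly in those cases.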

Document conversion

from mmqc_utils import document_to_html

# From a file path
html = document_to_html("paper.docx")

# From bytes — input_format is required when there is no file extension to infer from
html = document_to_html(raw_bytes, input_format="rtf")

Supported formats: docx, rtf, odt, tex, pdf.

For PDFs, pandoc is tried first; if it cannot convert the file, text is extracted page-by-page via pypdf and wrapped in <div class='page'> elements.
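The fallback described above can be sketched roughly as follows. This is an illustration only: pages_to_html and pdf_to_html are hypothetical names, the two callables stand in for the pandoc and pypdf steps, and escaping the page text is an assumption about the fallback rather than documented behavior.

```python
from html import escape
from typing import Callable, Iterable, List


def pages_to_html(page_texts: Iterable[str]) -> str:
    """Wrap each page's extracted text in its own <div class='page'>."""
    return "\n".join(
        f"<div class='page'>{escape(text)}</div>" for text in page_texts
    )


def pdf_to_html(
    pdf_bytes: bytes,
    convert_with_pandoc: Callable[[bytes], str],      # hypothetical stand-in
    extract_page_texts: Callable[[bytes], List[str]],  # hypothetical stand-in
) -> str:
    """Try the primary converter first; fall back to per-page text."""
    try:
        return convert_with_pandoc(pdf_bytes)
    except Exception:
        # Primary conversion failed: emit one <div> per extracted page.
        return pages_to_html(extract_page_texts(pdf_bytes))
```

Passing the converters in as arguments keeps the sketch runnable without pandoc or pypdf installed.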

Image conversion

from mmqc_utils import convert_to_bounded_jpeg, compress_to_bounded_jpeg

# Downscale to pixel dimensions
jpeg_bytes = convert_to_bounded_jpeg(
    "figure.tiff",
    rasterization_dpi=150,   # DPI for vector/PDF rasterization
    max_dimension=2000,       # downscale if width or height exceeds this
    compression_quality=80,   # JPEG quality 1–100
    background="white",       # background when removing transparency
)

# Compress until the result fits within a byte-size budget
jpeg_bytes = compress_to_bounded_jpeg(
    "figure.tiff",
    max_bytes=5 * 1024 * 1024,  # 5 MB
    max_dimension=2000,
    compression_quality=80,
)

Both functions accept Path, str, bytes, bytearray, or BinaryIO as input. For multi-page TIFFs, only the first page/layer is rendered.
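As a worked example of the max_dimension bound, the sketch below computes the output size for a given input size. It assumes the aspect ratio is preserved and that images already within the bound are left untouched; fit_within is a hypothetical helper, not a function exported by mmqc-utils.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_dimension: int) -> Tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_dimension."""
    longest = max(width, height)
    if longest <= max_dimension:
        # Already within bounds: no upscaling, no change.
        return width, height
    scale = max_dimension / longest
    # Keep at least one pixel on the short side of extreme aspect ratios.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4000, 3000, 2000))  # (2000, 1500)
print(fit_within(800, 600, 2000))    # (800, 600)
```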

compress_to_bounded_jpeg steps down JPEG quality first (80 → 70 → … → 30), then halves max_dimension and repeats, until the result fits within max_bytes. If the budget cannot be met even at minimum quality and dimension, it returns the smallest result achieved.
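The search strategy above can be sketched as a pair of nested loops. This is not the library's implementation: compress_to_budget is a hypothetical name, encode is a caller-supplied function standing in for the actual JPEG encoding, and the min_dimension floor of 250 is an assumed value, since the real minimum is not stated here.

```python
from typing import Callable, Optional


def compress_to_budget(
    encode: Callable[[int, int], bytes],  # (max_dimension, quality) -> JPEG bytes
    max_bytes: int,
    max_dimension: int = 2000,
    compression_quality: int = 80,
    min_quality: int = 30,
    min_dimension: int = 250,  # assumed floor, for illustration only
) -> bytes:
    """Lower quality first, then halve the dimension, until the result fits."""
    smallest: Optional[bytes] = None
    dimension = max_dimension
    start_quality = max(compression_quality, min_quality)
    while True:
        # Step quality down in tens: 80, 70, ..., 30.
        for quality in range(start_quality, min_quality - 1, -10):
            candidate = encode(dimension, quality)
            if len(candidate) <= max_bytes:
                return candidate
            if smallest is None or len(candidate) < len(smallest):
                smallest = candidate
        if dimension // 2 < min_dimension:
            # Budget cannot be met: return the smallest result achieved.
            return smallest
        dimension //= 2
```

Trying every quality level before shrinking favors keeping resolution over keeping compression fidelity.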

Text normalization

from mmqc_utils import html_to_text, compute_plain_text

text = html_to_text("<p>Hello <b>world</b></p>")
# → "Hello world"

compute_plain_text is an alias for html_to_text. Block-level tags (<p>, <div>, <br>, headings, list items, …) are replaced by a space; inline tags are stripped; whitespace is collapsed.
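The three rules above can be approximated with the standard library's HTMLParser. The sketch below is an illustration under assumptions: html_to_text_sketch and BLOCK_TAGS are hypothetical names, the tag set is a guess at what counts as block-level, and script or style contents are not filtered out here.

```python
import re
from html.parser import HTMLParser

# Assumed set of block-level tags; the library's actual list may differ.
BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Block-level boundaries become a space; inline tags add nothing.
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text_sketch(html: str) -> str:
    """Strip tags, separate blocks with spaces, and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


print(html_to_text_sketch("<h1>Title</h1><p>a<br>b</p>"))  # Title a b
```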

Development

Releasing a New Version

To release a new version of mmqc-utils to PyPI:

  1. Update the version number:
    uv run --group publish bump2version [major|minor|patch]
    
  2. Update the uv.lock file:
    uv lock
    
  3. Update the changelog in CHANGELOG.md.
  4. Build the distribution:
    just clean
    uv run --group publish python -m build
    
  5. Check the distribution:
    uv run --group publish twine check dist/*
    
  6. Upload to TestPyPI (optional but recommended):
    uv run --group publish twine upload --repository testpypi dist/*
    
  7. Upload to PyPI:
    uv run --group publish twine upload dist/*
    
