Reusable conversion utilities for images, documents, text, and related data types.
mmqc-utils
Reusable conversion utilities for MMQC projects.
Included utilities
- Convert and downscale common image formats (TIFF, JPEG, PNG, GIF, WebP, PDF) to bounded JPEG previews, with optional byte-size budget enforcement
- Convert DOCX, RTF, ODT, TeX, and PDF documents to HTML
- Normalize HTML to plain text
System requirements
- ImageMagick: required for image conversion (convert_to_bounded_jpeg). Install via your system package manager:
  # macOS
  brew install imagemagick
  # Debian / Ubuntu
  apt-get install imagemagick
- Pandoc: bundled automatically via pypandoc-binary; no separate installation needed.
Installation
pip install mmqc-utils
# or
uv add mmqc-utils
Usage
All functions accept a file path (str or Path), raw bytes/bytearray, or a BinaryIO object.
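As an illustration of how a single entry point can accept all of these input kinds, here is a minimal sketch that normalizes each one to raw bytes. The helper name read_source and the Source alias are hypothetical and not part of mmqc-utils; the library's own handling may differ.

```python
from io import BytesIO
from pathlib import Path
from typing import BinaryIO, Union

# Hypothetical alias covering the input kinds listed above.
Source = Union[str, Path, bytes, bytearray, BinaryIO]


def read_source(source: Source) -> bytes:
    """Return the raw bytes behind any of the accepted input kinds."""
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    if isinstance(source, (str, Path)):
        # Strings are treated as file paths, not as file content.
        return Path(source).read_bytes()
    # Anything else is assumed to be a binary file-like object.
    return source.read()


# An in-memory stream behaves like an opened binary file.
data = read_source(BytesIO(b"hello"))
```

Note that under this scheme a str is always interpreted as a path, so raw content has to be passed as bytes.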
Document conversion
from mmqc_utils import document_to_html
# From a file path
html = document_to_html("paper.docx")
# From bytes — input_format is required when there is no file extension to infer from
html = document_to_html(raw_bytes, input_format="rtf")
Supported formats: docx, rtf, odt, tex, pdf.
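The rule that input_format is only optional when a file extension is available can be sketched as follows. The function resolve_input_format and the SUPPORTED_FORMATS constant are hypothetical names used for illustration; how mmqc-utils resolves the format internally, and which exception it raises, is an assumption here.

```python
from pathlib import Path
from typing import Optional

# The formats listed above; the constant name is an assumption.
SUPPORTED_FORMATS = {"docx", "rtf", "odt", "tex", "pdf"}


def resolve_input_format(source: object, input_format: Optional[str] = None) -> str:
    """Use the explicit format if given, else infer it from a path's extension."""
    if input_format is None:
        if not isinstance(source, (str, Path)):
            # Bytes and file objects carry no extension to infer from.
            raise ValueError("input_format is required for bytes or file objects")
        input_format = Path(source).suffix.lstrip(".")
    fmt = input_format.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return fmt
```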
For PDFs, pandoc is tried first; if it cannot convert the file, text is extracted page-by-page via pypdf and wrapped in <div class='page'> elements.
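The page-wrapping half of that fallback might look roughly like the sketch below. It starts from page texts that have already been extracted (with pypdf, these would typically come from calling extract_text() on each page of a PdfReader). The function name pages_to_html is hypothetical, and escaping the text with html.escape is an assumption about what a safe implementation would do, not a documented behavior of mmqc-utils.

```python
from html import escape
from typing import Iterable


def pages_to_html(page_texts: Iterable[str]) -> str:
    """Wrap each page's extracted text in its own <div class='page'> element."""
    return "\n".join(
        # Escape the text so characters such as < and & stay literal in the HTML.
        f"<div class='page'>{escape(text)}</div>"
        for text in page_texts
    )


html = pages_to_html(["a < b", "second page"])
```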
Image conversion
from mmqc_utils import convert_to_bounded_jpeg, compress_to_bounded_jpeg

# Downscale to pixel dimensions
jpeg_bytes = convert_to_bounded_jpeg(
    "figure.tiff",
    rasterization_dpi=150,   # DPI for vector/PDF rasterization
    max_dimension=2000,      # downscale if width or height exceeds this
    compression_quality=80,  # JPEG quality 1–100
    background="white",      # background when removing transparency
)

# Compress until the result fits within a byte-size budget
jpeg_bytes = compress_to_bounded_jpeg(
    "figure.tiff",
    max_bytes=5 * 1024 * 1024,  # 5 MB
    max_dimension=2000,
    compression_quality=80,
)
Both functions accept Path, str, bytes, bytearray, or BinaryIO input. For multi-page TIFFs, only the first page/layer is rendered.
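To make the effect of max_dimension concrete, the sketch below computes the target size for a given image. It assumes the aspect ratio is preserved and that images already within the bound are left untouched; the README states neither explicitly, and bounded_size is a hypothetical helper, not a function exported by mmqc-utils.

```python
def bounded_size(width: int, height: int, max_dimension: int) -> tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_dimension."""
    longest = max(width, height)
    if longest <= max_dimension:
        # Already within bounds: no downscaling needed.
        return width, height
    scale = max_dimension / longest
    # Keep at least one pixel on the shorter side of extreme aspect ratios.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 4000 × 3000 image with max_dimension=2000 would come out as 2000 × 1500 under these assumptions.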
compress_to_bounded_jpeg steps down JPEG quality first (80 → 70 → … → 30), then halves max_dimension and repeats, until the result fits within max_bytes. If the budget cannot be met even at minimum quality and dimension, it returns the smallest result achieved.
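The search order described above can be sketched as a loop over quality levels nested inside a loop over dimensions. This is an illustration only: the encode callable stands in for the real ImageMagick-backed conversion, and the min_dimension floor of 250 is a made-up value, since the README does not state where the halving stops.

```python
from typing import Callable

# Hypothetical encoder: (max_dimension, quality) -> JPEG bytes.
Encoder = Callable[[int, int], bytes]


def compress_to_budget(
    encode: Encoder,
    max_bytes: int,
    max_dimension: int = 2000,
    compression_quality: int = 80,
    min_quality: int = 30,
    min_dimension: int = 250,  # assumed floor, not taken from the library
) -> bytes:
    """Lower quality in steps of 10, then halve the dimension and repeat."""
    smallest = None
    dimension = max_dimension
    start_quality = max(compression_quality, min_quality)
    while True:
        for quality in range(start_quality, min_quality - 1, -10):
            candidate = encode(dimension, quality)
            if len(candidate) <= max_bytes:
                return candidate  # first result that fits the budget
            if smallest is None or len(candidate) < len(smallest):
                smallest = candidate
        if dimension // 2 < min_dimension:
            # Budget cannot be met: fall back to the smallest result seen.
            return smallest
        dimension //= 2


# Toy encoder whose output size grows with both dimension and quality.
def fake_encode(dimension: int, quality: int) -> bytes:
    return b"x" * (dimension * quality)


result = compress_to_budget(fake_encode, max_bytes=50_000)
```

With the toy encoder, no quality level fits at dimension 2000, so the loop halves to 1000 and stops at quality 50, which yields exactly 50,000 bytes.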
Text normalization
from mmqc_utils import html_to_text, compute_plain_text
text = html_to_text("<p>Hello <b>world</b></p>")
# → "Hello world"
compute_plain_text is an alias for html_to_text. Block-level tags (<p>, <div>, <br>, headings, list items, …) are replaced by a space; inline tags are stripped; whitespace is collapsed.
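The three normalization rules can be approximated with the standard library alone. The sketch below is not the mmqc-utils implementation: html_to_text_sketch is a hypothetical name, and the BLOCK_TAGS set is an assumed subset of the block-level tags the library actually recognizes.

```python
import re
from html.parser import HTMLParser

# Assumed subset of block-level tags; the library's real list may be longer.
BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    """Collect text, emitting a space wherever a block-level tag appears."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        # Inline tags never reach this method, so they are dropped implicitly.
        self.parts.append(data)


def html_to_text_sketch(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()
```

Replacing block tags with a space rather than nothing is what keeps adjacent paragraphs from fusing into a single word.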
Development
Releasing a New Version
To release a new version of mmqc-utils to PyPI:
- Update the version number in pyproject.toml.
- Update the uv.lock file:
  uv lock
- Update the changelog in CHANGELOG.md.
- Build the distribution:
  just build
- Check the distribution:
  uv run --group publish twine check dist/*
- Upload to TestPyPI (optional but recommended):
  uv run --group publish twine upload --repository testpypi dist/*
- Upload to PyPI:
  uv run --group publish twine upload dist/*