Miller: generate small representative mzML subsets for testing

Miller

Miller generates small, representative mzML files from full-sized proteomics mzML datasets. Production mzML files are often hundreds of megabytes or several gigabytes — too large to bundle in repositories, share casually, or iterate on quickly. Miller solves this by extracting a configurable subset of spectra into a new, fully valid mzML file that preserves the structure and metadata of the original.

Miller works with both DDA and DIA data and is useful in a variety of scenarios:

  • Smoke-testing data analysis pipelines — generate tiny mzML files to verify that a workflow runs end-to-end before committing to a full-scale run.
  • CI and integration tests — ship realistic test fixtures without multi-GB raw data.
  • Filtering step in a larger workflow — use Miller as a pre-processing stage, for example to trim mzML files in a cascade search or to focus on a retention-time window of interest.

Highlights

  • Include or exclude scans based on scan number or retention-time range.
  • Operate on specific MS levels (e.g. MS1, MS2).
  • Precursor inclusion (default on) — if an MSn scan references a precursor via spectrumRef, the full precursor chain is included automatically.
  • Preserves run-level sections and metadata; updates spectrumList/@count.
  • Recalculates TIC (MS:1000235) and BPC (MS:1000628) from retained spectra when present.
  • Indexed or non-indexed mzML output; by default the output matches the source file's format.
  • Binary array compression control: source, zlib, or none.
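
The chromatogram recalculation mentioned above can be pictured with a small Python sketch. This is an illustration only, not Miller's code: it assumes each retained spectrum is available as a retention time plus a list of intensities, and the names Spectrum and recompute_chromatograms are hypothetical.

```python
from typing import NamedTuple


class Spectrum(NamedTuple):
    """Hypothetical in-memory view of one retained spectrum."""
    rt: float                 # retention time
    intensities: list[float]  # peak intensities


def recompute_chromatograms(spectra: list[Spectrum]):
    """Return (times, tic, bpc) computed only from the retained spectra."""
    ordered = sorted(spectra, key=lambda s: s.rt)
    times = [s.rt for s in ordered]
    # TIC point: sum of all intensities in the spectrum.
    tic = [sum(s.intensities) for s in ordered]
    # BPC point: intensity of the most intense peak (0.0 for an empty spectrum).
    bpc = [max(s.intensities, default=0.0) for s in ordered]
    return times, tic, bpc


retained = [
    Spectrum(rt=35.4, intensities=[10.0, 250.0, 40.0]),
    Spectrum(rt=35.2, intensities=[5.0, 5.0]),
]
times, tic, bpc = recompute_chromatograms(retained)
print(times, tic, bpc)
```

Which spectra contribute to each chromatogram in the real tool (for example, MS1 only) is not stated here, so treat that as an open detail.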

Installation

pip (recommended)

pip install miller-mzml-filterer

Verify:

miller --help

Docker

docker pull ghcr.io/mriffle/miller-mzml-filterer:latest

Verify:

docker run --rm ghcr.io/mriffle/miller-mzml-filterer:latest --help

Quick Start

Using pip

Randomly select 50 scans:

miller --scan-count 50 input.mzML output.mzML

Randomly select 5% of scans:

miller --scan-percent 5 input.mzML output.mzML

Select 10 random MS2 scans (precursor MS1 scans are included automatically):

miller --ms-level 2 --scan-count 10 input.mzML output.mzML

Keep scans in a retention-time window:

miller --rt-range-start 35.2 --rt-range-end 35.8 input.mzML output.mzML

Select specific scans from an include file (one scan ID per line):

miller --scan-include-file scans.txt input.mzML output.mzML

Using Docker

All Docker examples below mount the current directory into the container and run as your current user/group so output files have the correct ownership:

docker run --rm \
  --user "$(id -u):$(id -g)" \
  -v "$PWD:/work" \
  -w /work \
  ghcr.io/mriffle/miller-mzml-filterer:latest \
  --scan-count 50 input.mzML output.mzML

Select 5% of scans:

docker run --rm \
  --user "$(id -u):$(id -g)" \
  -v "$PWD:/work" \
  -w /work \
  ghcr.io/mriffle/miller-mzml-filterer:latest \
  --scan-percent 5 input.mzML output.mzML

Select 10 random MS2 scans:

docker run --rm \
  --user "$(id -u):$(id -g)" \
  -v "$PWD:/work" \
  -w /work \
  ghcr.io/mriffle/miller-mzml-filterer:latest \
  --ms-level 2 --scan-count 10 input.mzML output.mzML

More Examples

Retention-time filtering

Combine an RT window with random selection:

miller --rt-range-start 35.2 --rt-range-end 35.8 --scan-count 50 input.mzML output.mzML

Pick a random contiguous window covering 10% of the RT span, then select 25 scans from it:

miller --rt-window-percent 10 --scan-count 25 input.mzML output.mzML
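
One way to think about the random RT window is sketched below in Python. It is a guess at the general idea, not Miller's implementation: the function name pick_rt_window, the uniform draw of the start point, and the (0, 100] range check are all assumptions.

```python
import random


def pick_rt_window(rt_min: float, rt_max: float, percent: float, seed: int = 42):
    """Pick a random contiguous window covering `percent` of [rt_min, rt_max]."""
    if not 0 < percent <= 100:  # assumed valid range
        raise ValueError("percent must be in (0, 100]")
    width = (rt_max - rt_min) * percent / 100.0
    rng = random.Random(seed)
    # Draw the start so that the whole window stays inside the RT span.
    start = rng.uniform(rt_min, rt_max - width)
    return start, start + width


start, end = pick_rt_window(0.0, 60.0, 10.0)
print(f"window: {start:.2f} to {end:.2f}")
```

With a fixed seed the same window comes back on every run, which matches the seeded behaviour described for --rt-window-percent.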

Excluding scans

Randomly select 50 scans while excluding specific scan IDs (one ID per line in the file):

miller --scan-count 50 --scan-exclude-file exclude.txt input.mzML output.mzML

Keep all scans except the excluded ones:

miller --scan-exclude-file exclude.txt input.mzML output.mzML

Output format and compression

Force indexed output with zlib compression:

miller --indexed --compression zlib --scan-count 10 input.mzML output.mzML

Non-indexed, uncompressed:

miller --no-index --compression none --scan-count 10 input.mzML output.mzML

Precursor inclusion

By default, Miller follows spectrumRef links to include precursor scans (e.g. MS1 parents of selected MS2 scans). Disable this with:

miller --no-include-precursors --scan-count 10 input.mzML output.mzML
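
The idea of following precursor links can be sketched in Python as below. This is a simplified model, not Miller's code: it assumes the spectrumRef links have already been read into a plain dictionary, and the names with_precursors and links are hypothetical.

```python
def with_precursors(selected: set[str], precursor_of: dict[str, str]) -> set[str]:
    """Return `selected` plus every ancestor reachable through precursor links."""
    result = set(selected)
    for scan_id in selected:
        current = precursor_of.get(scan_id)
        # Walk up the chain; stop at a scan with no precursor or one already included.
        while current is not None and current not in result:
            result.add(current)
            current = precursor_of.get(current)
    return result


# Hypothetical links: an MS3 scan points to an MS2 scan, which points to an MS1 scan.
links = {"scan=3": "scan=2", "scan=2": "scan=1", "scan=5": "scan=4"}
print(sorted(with_precursors({"scan=3"}, links)))
```

Because the walk stops at scans that are already included, shared precursors are added once and a malformed cyclic reference cannot loop forever.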

Determinism

Random selection is seeded (default 42). Vary the seed for different subsets of the same file:

miller --scan-count 50 --seed 1 input.mzML output_seed1.mzML
miller --scan-count 50 --seed 2 input.mzML output_seed2.mzML
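
Seeded selection of this kind can be illustrated with Python's standard random module. The sketch below is an assumption about the general technique rather than Miller's internals, and select_scans is a hypothetical name.

```python
import random


def select_scans(eligible: list[int], count: int, seed: int = 42) -> list[int]:
    """Pick `count` scan numbers from `eligible`, reproducibly for a given seed."""
    if count > len(eligible):
        raise ValueError("requested more scans than the eligible pool holds")
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return sorted(rng.sample(eligible, count))


pool = list(range(1, 1001))
first = select_scans(pool, 50, seed=1)
again = select_scans(pool, 50, seed=1)
print(first == again)
```

Using a private random.Random instance keeps the result dependent only on the seed and the pool, which is what makes test fixtures reproducible.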

CLI Reference

miller [OPTIONS] INPUT OUTPUT

Positional arguments

  • INPUT — path to the source mzML file (indexed or non-indexed).
  • OUTPUT — path for the output mzML file.

Selection mode (mutually exclusive)

  • --scan-count INTEGER — randomly select N scans from the eligible pool. Fails if N exceeds pool size.
  • --scan-percent FLOAT — randomly select a percentage (> 0, ≤ 100) of eligible scans.
  • --scan-include-file PATH — file with one scan ID per line. Accepts bare numbers (1001) or prefixed IDs (scan=1001).
  • If none of the above are given and --scan-exclude-file is set, all scans minus exclusions are kept.
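
The two accepted scan ID forms can be handled along these lines. This Python sketch is illustrative only: parse_scan_ids is a hypothetical helper, and skipping blank lines is an assumption about how such a file might reasonably be read.

```python
def parse_scan_ids(lines):
    """Parse scan IDs given as bare numbers ("1001") or prefixed IDs ("scan=1001")."""
    ids = []
    for raw in lines:
        text = raw.strip()
        if not text:
            continue  # assumption: blank lines are ignored
        if text.startswith("scan="):
            text = text[len("scan="):]
        ids.append(int(text))  # raises ValueError on anything that is not a number
    return ids


print(parse_scan_ids(["1001\n", "scan=1002\n", "\n"]))
```

The same helper works on an open file object, since iterating over a text file yields one line at a time.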

Filtering

  • --rt-range-start FLOAT / --rt-range-end FLOAT — inclusive RT bounds applied before selection. Either or both may be supplied.
  • --rt-window-percent FLOAT — random contiguous RT window (percentage of eligible RT span), applied after fixed RT bounds.
  • --scan-exclude-file PATH — one scan ID per line to exclude from selection and final output.
  • --ms-level TEXT — comma-separated list of MS levels (e.g. 1 or 2 for a single level, 1,2 for both). Valid only with --scan-count or --scan-percent.
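
How the filters above narrow the eligible pool can be sketched in Python. This is a simplified model under stated assumptions: the Scan type and eligible_pool function are hypothetical, and the order in which exclusions and MS levels are checked is a guess (it does not change the result here).

```python
from typing import NamedTuple


class Scan(NamedTuple):
    """Hypothetical minimal scan record."""
    number: int
    rt: float
    ms_level: int


def eligible_pool(scans, rt_start=None, rt_end=None, excluded=frozenset(), ms_levels=None):
    """Apply inclusive RT bounds, exclusions, and MS-level filtering before selection."""
    pool = []
    for scan in scans:
        if rt_start is not None and scan.rt < rt_start:
            continue  # below the inclusive lower bound
        if rt_end is not None and scan.rt > rt_end:
            continue  # above the inclusive upper bound
        if scan.number in excluded:
            continue
        if ms_levels is not None and scan.ms_level not in ms_levels:
            continue
        pool.append(scan)
    return pool


scans = [Scan(1, 35.0, 1), Scan(2, 35.2, 2), Scan(3, 35.5, 2), Scan(4, 35.8, 1), Scan(5, 36.0, 2)]
pool = eligible_pool(scans, rt_start=35.2, rt_end=35.8, excluded={3}, ms_levels={1, 2})
print([s.number for s in pool])
```

Scans sitting exactly on a bound (35.2 and 35.8 here) are kept, which is what "inclusive" means for the RT range options.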

Precursor inclusion

  • --include-precursors / --no-include-precursors (default: include) — walk spectrumRef chains to include ancestor scans.

Output format

  • --indexed / --no-index — force indexed or non-indexed output. Default follows the source file.
  • --compression [source|zlib|none] (default: source) — binary array compression mode.
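
For background on what the compression mode affects: mzML stores each binary array as base64 text, optionally zlib-compressed before encoding. The Python sketch below shows that round trip for 64-bit little-endian floats. It is a generic illustration, not Miller's code, and encode_array and decode_array are hypothetical names.

```python
import base64
import struct
import zlib


def encode_array(values, compress):
    """Pack floats as little-endian 64-bit values, optionally zlib-compress, then base64."""
    raw = struct.pack(f"<{len(values)}d", *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text, compressed):
    """Reverse encode_array: base64-decode, optionally decompress, unpack floats."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


mz = [100.0, 200.5, 300.25]
for compress in (False, True):
    encoded = encode_array(mz, compress)
    print(compress, decode_array(encoded, compress) == mz)
```

Real files may also use 32-bit floats; the sketch fixes the precision at 64 bits to stay short.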

Other

  • --seed INTEGER (default: 42) — random seed for --scan-count, --scan-percent, and --rt-window-percent.
  • --help / -h — show usage and exit.
  • --version / -v — show version and exit.

Exit codes

Code  Meaning
1     Invalid or unreadable input file
2     CLI usage / argument error
3     One or more explicit scan IDs not found
4     Selection produced zero eligible scans
5     Output path / write error
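
When calling Miller from a script, the exit codes above can be turned into readable messages. The Python sketch below is one possible wrapper: run_miller and describe_exit are hypothetical helpers, and treating 0 as success is an assumption based on the usual convention, since 0 is not listed in the table.

```python
import subprocess

# Meanings copied from the exit code table.
EXIT_MEANINGS = {
    1: "Invalid or unreadable input file",
    2: "CLI usage / argument error",
    3: "One or more explicit scan IDs not found",
    4: "Selection produced zero eligible scans",
    5: "Output path / write error",
}


def describe_exit(code: int) -> str:
    """Translate a Miller exit code into a short message."""
    if code == 0:
        return "Success"  # assumption: 0 follows the usual success convention
    return EXIT_MEANINGS.get(code, f"Unknown exit code {code}")


def run_miller(args: list[str]) -> tuple[int, str]:
    """Run the miller CLI with the given arguments; return (exit code, message)."""
    completed = subprocess.run(["miller", *args], check=False)
    return completed.returncode, describe_exit(completed.returncode)


print(describe_exit(3))
```

A call such as run_miller(["--scan-count", "50", "input.mzML", "output.mzML"]) requires the miller command to be on the PATH.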

Development

Local setup

python3 -m venv .venv
. .venv/bin/activate
pip install -e ".[dev]"

Running tests

.venv/bin/pytest --cov=miller --cov-report=term-missing tests/
.venv/bin/ruff check src/ tests/
.venv/bin/mypy src/

Smoke tests use test_data/test_data.mzML and run automatically with the full suite. To run only smoke tests:

.venv/bin/pytest tests/test_smoke_real_data.py

Building the Docker image locally

docker build -t miller .
docker run --rm miller --help

Running tests inside Docker

docker run --rm --entrypoint pytest ghcr.io/mriffle/miller-mzml-filterer:latest \
  --cov=miller --cov-report=term-missing tests/
