Miller: generate small representative mzML subsets for testing

Miller

Miller creates small, representative mzML files from full-sized proteomics mzML datasets. The goal is realistic test fixtures for CI, integration tests, and local development without shipping multi-GB raw conversions.

Key Properties

Fidelity: preserves mzML structure and metadata; only the spectrum set is reduced.
Determinism: random selection is reproducible via --seed (default 42).
Correctness-first: explicit validation and stable exit codes for automation.

What It Does (High Level)

Selects spectra by:
- Random count: --scan-count N
- Random percent: --scan-percent PCT
- Include file: --scan-include-file path/to/include.txt
Optional retention-time filtering: --rt-range-start MIN_RT, --rt-range-end MAX_RT
Optional random retention-time window: --rt-window-percent PCT
Optional exclusion file: --scan-exclude-file path/to/exclude.txt
Optional MS-level pre-filtering for random mode: --ms-level 1, --ms-level 2, --ms-level 1,2.
Precursor inclusion (default on): if an MSn scan references a precursor via spectrumRef, the full precursor chain is included.
Preserves run-level sections and metadata, and updates spectrumList/@count to match the retained spectra.
Chromatograms:
- Recalculates TIC (MS:1000235) and BPC (MS:1000628) from retained spectra when present.
- Passes through all other chromatograms unmodified.
Output format:
- Indexed or non-indexed mzML output, defaulting to the source unless overridden.
- Binary array compression control: source, zlib, or none.
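The TIC/BPC recalculation described above can be illustrated with a minimal sketch. It assumes the usual definitions (TIC is the summed intensity of a spectrum, BPC is its most intense peak) and treats every retained spectrum as a chromatogram point; whether Miller restricts the recalculation to particular MS levels is not stated here. The function name `recompute_tic_bpc` and the input shape are hypothetical, not Miller's API.

```python
from typing import List, Sequence, Tuple


def recompute_tic_bpc(
    spectra: Sequence[Tuple[float, Sequence[float]]],
) -> Tuple[List[float], List[float], List[float]]:
    """Rebuild TIC and BPC traces from retained spectra.

    `spectra` is a hypothetical list of (retention_time, intensities)
    pairs in file order. Returns (times, tic, bpc).
    """
    times: List[float] = []
    tic: List[float] = []
    bpc: List[float] = []
    for rt, intensities in spectra:
        times.append(rt)
        # TIC point: sum of all intensities in the spectrum.
        tic.append(float(sum(intensities)))
        # BPC point: the single most intense peak (0.0 for an empty spectrum).
        bpc.append(float(max(intensities, default=0.0)))
    return times, tic, bpc
```

Because the traces are derived only from the spectra that survive selection, the chromatograms in a subset file will be much sparser than those in the source file.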

How To Run

Basic usage:

miller [OPTIONS] INPUT OUTPUT

Local day-to-day usage

A typical workflow: keep large source mzML files somewhere on disk, generate small subsets into a separate folder, then point your CI, tests, or tools at the subset files.

Example directory layout:

project/
  data/
    input.mzML
  subsets/

Create a subset (random selection):

mkdir -p subsets
miller --scan-count 50 data/input.mzML subsets/input.subset_50.mzML

Create a subset from only MS2 scans (still includes precursor MS1 scans when referenced):

miller --ms-level 2 --scan-count 10 data/input.mzML subsets/input.ms2_10_plus_precursors.mzML

Create a subset with exact scan IDs using an include file (one scan ID per line, no header):

cat > subsets/include_scans.txt <<'EOF'
1001
1002
1050
EOF
miller --scan-include-file subsets/include_scans.txt data/input.mzML subsets/input.scans_1001_1002_1050.mzML

Create a random subset by percent:

miller --scan-percent 5 data/input.mzML subsets/input.subset_5pct.mzML

Create a subset from a chromatographic time window:

miller --rt-range-start 35.2 --rt-range-end 35.8 data/input.mzML subsets/input.rt_35p2_35p8.mzML

Use a retention-time filter before random selection:

miller --rt-range-start 35.2 --rt-range-end 35.8 --scan-count 50 data/input.mzML subsets/input.rt_window_random_50.mzML

Keep a random contiguous 10% retention-time window, then select 25 scans from within it:

miller --rt-window-percent 10 --scan-count 25 data/input.mzML subsets/input.rt_segment_10pct_count25.mzML

Exclude specific scans from the random candidate pool (and from the final output):

cat > subsets/exclude_scans.txt <<'EOF'
1001
1002
EOF
miller --scan-count 50 --scan-exclude-file subsets/exclude_scans.txt data/input.mzML subsets/input.subset_50_excl.mzML

Exclude-only mode (all scans except excluded):

miller --scan-exclude-file subsets/exclude_scans.txt data/input.mzML subsets/input.all_minus_excluded.mzML

Disable precursor inclusion (output contains exactly the selected scans):

miller --no-include-precursors --scan-count 10 data/input.mzML subsets/input.subset_10_no_precursors.mzML

Force indexed/non-indexed output and compression:

miller --indexed --compression zlib --scan-count 10 data/input.mzML subsets/input.indexed.zlib.mzML
miller --no-index --compression none --scan-count 10 data/input.mzML subsets/input.noindex.none.mzML

Notes on determinism

Random selection uses --seed (default 42). If you want different subsets from the same file, vary the seed:

miller --scan-count 50 --seed 1 data/input.mzML subsets/input.subset_seed1.mzML
miller --scan-count 50 --seed 2 data/input.mzML subsets/input.subset_seed2.mzML
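To make the determinism property concrete, here is a minimal sketch of seeded selection that returns scans in original file order. It uses Python's `random.Random`; the RNG Miller actually uses is not documented here, so treat `select_scans` and its behaviour as an illustration only.

```python
import random
from typing import List, Sequence


def select_scans(eligible_ids: Sequence[int], count: int, seed: int = 42) -> List[int]:
    """Pick `count` scans reproducibly and return them in file order.

    `eligible_ids` is assumed to be listed in original file order.
    """
    if count > len(eligible_ids):
        # Mirrors the documented rule that N may not exceed the eligible pool.
        raise ValueError("requested count exceeds eligible pool size")
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    chosen = set(rng.sample(range(len(eligible_ids)), count))
    # Emit in source order rather than draw order.
    return [scan_id for i, scan_id in enumerate(eligible_ids) if i in chosen]
```

Calling the function twice with the same seed and the same eligible pool yields the same list, which is the property that makes subset files stable across CI runs.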

Quick examples (minimal)

Randomly select 50 scans:

miller --scan-count 50 input.mzML output.mzML

Select specific scans via include file:

miller --scan-include-file include_scans.txt input.mzML output.mzML

Randomly select by percent:

miller --scan-percent 10 input.mzML output.mzML

Only draw from MS2 scans, but still include MS1 precursors if referenced:

miller --ms-level 2 --scan-count 10 input.mzML output.mzML

Disable precursor chain inclusion:

miller --no-include-precursors --scan-count 10 input.mzML output.mzML

Force output format and compression:

miller --indexed --compression zlib --scan-count 10 input.mzML output.mzML
miller --no-index --compression none --scan-count 10 input.mzML output.mzML

CLI Parameters

Positional arguments:

INPUT (required): path to the source mzML file (indexed or non-indexed).
OUTPUT (required): path for the output mzML file.

Selection mode:

--scan-count INTEGER: randomly select N scans uniformly from the eligible pool.
- Output order is the original file order, not the random draw order.
- If N exceeds the eligible pool size, the program exits non-zero (see Exit Codes).
--scan-percent FLOAT: randomly select a percentage of eligible scans.
- Must be > 0 and <= 100.
- Selection count is computed from the eligible pool after any exclusions.
--scan-include-file PATH: file with one scan ID per line to include.
- Accepts either bare numbers (1001) or prefixed IDs (scan=1001).
- Output order follows source file order.
- Incompatible with --scan-count and --scan-percent.
--scan-exclude-file PATH can also be used alone (no include/count/percent), which means:
- Start from all scans in input.
- Apply any retention-time bounds.
- Exclude listed scans.
- Then apply precursor inclusion behavior and final exclusion.
--rt-range-start FLOAT and --rt-range-end FLOAT:
- Optional inclusive retention-time bounds applied before selection.
- If only one bound is provided, the other side is left open.
- Can be combined with random selection, include-file selection, or used by themselves to keep all scans within a time window.
- Scans with missing retention time are treated as ineligible when any RT filter is present.
- Precursor inclusion can still add scans outside the requested RT window.
--rt-window-percent FLOAT:
- Chooses a random contiguous retention-time window whose width is the given percentage of the eligible RT span.
- Applied after fixed RT bounds and before non-RT filters or primary selection.
- Can be combined with random selection, include-file selection, or used by itself.
- The percentage refers to retention-time span, not percentage of scans.
- Precursor inclusion can still add scans outside the chosen RT window.
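The two retention-time filters can be sketched as follows. The inclusive bounds, the open-ended side, and the handling of missing retention times follow the rules listed above; how the window start is drawn (uniformly over the valid range) is an assumption, and the function names are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

Scan = Tuple[int, Optional[float]]  # (scan_id, retention_time or None)


def filter_by_rt(
    scans: Sequence[Scan],
    start: Optional[float] = None,
    end: Optional[float] = None,
) -> List[Scan]:
    """Keep scans whose RT lies within the inclusive [start, end] bounds."""
    if start is None and end is None:
        return list(scans)  # no RT filter requested, nothing is dropped
    kept: List[Scan] = []
    for scan_id, rt in scans:
        if rt is None:
            continue  # missing RT is ineligible once any RT filter is present
        if start is not None and rt < start:
            continue
        if end is not None and rt > end:
            continue
        kept.append((scan_id, rt))
    return kept


def random_rt_window(
    scans: Sequence[Tuple[int, float]], percent: float, seed: int = 42
) -> Tuple[float, float]:
    """Choose a contiguous window covering `percent` of the eligible RT span.

    Expects a non-empty list of scans that all have a retention time.
    """
    rts = [rt for _, rt in scans]
    lo, hi = min(rts), max(rts)
    width = (hi - lo) * percent / 100.0
    # Assumption: the window start is uniform over positions that keep the
    # whole window inside [lo, hi].
    window_start = random.Random(seed).uniform(lo, hi - width)
    return window_start, window_start + width
```

Chaining the two, as in `filter_by_rt(scans, *random_rt_window(filter_by_rt(scans, 10.0, 60.0), 10))`, reflects the documented order: fixed bounds first, then the random window.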

Exclusion file:

--scan-exclude-file PATH: file with one scan ID per line to exclude.
- Excluded scans are removed from random candidate pools and from final output.
- Can be combined with random selection or include-file selection.
- Can be used by itself to produce "all scans except excluded scans" output.
- If the same scan appears in both the include and exclude files, the program exits with a usage error (exit code 2).
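A short sketch of how scan ID files might be read and cross-checked. The include file is documented as accepting bare numbers or `scan=`-prefixed IDs; applying the same parsing to the exclude file, and skipping blank lines, are assumptions. `parse_scan_ids` and `find_conflicts` are hypothetical helper names.

```python
from typing import List, Set


def parse_scan_ids(text: str) -> Set[int]:
    """Parse one scan ID per line, accepting '1001' or 'scan=1001'."""
    ids: Set[int] = set()
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith("scan="):
            line = line[len("scan="):]
        ids.add(int(line))  # raises ValueError on a malformed line
    return ids


def find_conflicts(include: Set[int], exclude: Set[int]) -> List[int]:
    """Return scan IDs listed in both files, sorted for stable messages."""
    return sorted(include & exclude)
```

A non-empty result from `find_conflicts` corresponds to the case where the same scan appears in both files, which the tool reports as a usage error.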

MS-level filtering:

--ms-level TEXT: comma-separated MS levels (e.g. 1, 2, 1,2).
- Valid only with random selection (--scan-count or --scan-percent).
- Applies only to the initial random selection pool. Precursor inclusion can add MS levels not listed here.
- Using --ms-level with --scan-include-file or exclude-only mode is a usage error.
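The flag compatibility rules stated so far can be collected into a small validation sketch. Only rules written in this section are encoded; `validate_flags` is a hypothetical helper, not Miller's actual argument parser, and the error strings are illustrative.

```python
from typing import List, Optional


def validate_flags(
    scan_count: Optional[int] = None,
    scan_percent: Optional[float] = None,
    include_file: Optional[str] = None,
    ms_level: Optional[str] = None,
) -> List[str]:
    """Return a list of usage errors; an empty list means the flags are compatible."""
    errors: List[str] = []
    random_mode = scan_count is not None or scan_percent is not None
    if include_file is not None and random_mode:
        errors.append(
            "--scan-include-file cannot be combined with --scan-count or --scan-percent"
        )
    if ms_level is not None and not random_mode:
        # Covers both include-file selection and exclude-only mode.
        errors.append("--ms-level requires --scan-count or --scan-percent")
    if scan_percent is not None and not (0 < scan_percent <= 100):
        errors.append("--scan-percent must be > 0 and <= 100")
    return errors
```

Collecting every violation before exiting lets a caller report all problems in a single run instead of one at a time.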

Precursor inclusion:

--include-precursors / --no-include-precursors (default: include)
- When enabled, walks precursor/@spectrumRef chains and includes all referenced ancestors.
- A broken spectrumRef value (one that points to a spectrum not present in the file) emits a warning to stderr, and processing continues.
- If no spectrumRef attributes exist in the file, this option has no effect.
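Precursor chain inclusion amounts to walking parent references until a scan has no precursor. The sketch below assumes each scan references at most one precursor and that the references have already been extracted into a dictionary; the data structures and the name `add_precursors` are hypothetical.

```python
import sys
from typing import Dict, Iterable, Set


def add_precursors(
    selected: Iterable[int],
    precursor_of: Dict[int, int],
    known_ids: Set[int],
) -> Set[int]:
    """Return `selected` plus every ancestor reachable via precursor references.

    `precursor_of` maps a scan ID to the scan ID named by its spectrumRef.
    `known_ids` is the set of scan IDs that actually exist in the file.
    """
    result = set(selected)
    for scan_id in list(result):
        current = scan_id
        while current in precursor_of:
            parent = precursor_of[current]
            if parent not in known_ids:
                # Broken reference: warn and keep going with the other scans.
                print(f"warning: scan {current} references missing scan {parent}",
                      file=sys.stderr)
                break
            if parent in result:
                break  # this ancestor chain is already covered (also stops cycles)
            result.add(parent)
            current = parent
    return result
```

With `precursor_of = {3: 2, 2: 1}`, selecting scan 3 alone yields scans 1, 2, and 3, which is why a subset drawn with `--ms-level 2` can still contain MS1 scans.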

Output format:

--indexed / --no-index:
- When omitted, the output format follows the source file.
- --indexed adds an index (indexList and indexListOffset) to the end of the file.
- --no-index omits those elements entirely.

Binary array compression:

--compression [source|zlib|none] (default: source)
- source: copies each spectrum's binary arrays without re-encoding.
- zlib: decodes and re-encodes all spectrum arrays with zlib compression and updates CV terms.
- none: decodes and re-encodes all spectrum arrays uncompressed and updates CV terms.
- Recalculated TIC/BPC use this setting. Pass-through chromatograms retain their original encoding.
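For readers unfamiliar with mzML binary arrays, the re-encoding step can be sketched with the standard library. mzML stores arrays as base64 text over little-endian numeric bytes, optionally zlib-compressed, and marks the choice with a CV term (MS:1000574 for zlib compression, MS:1000576 for no compression). The helper names below are hypothetical, and the sketch assumes 64-bit floats unless another `struct` format code is passed.

```python
import base64
import struct
import zlib
from typing import List, Sequence

CV_ZLIB = "MS:1000574"            # zlib compression
CV_NO_COMPRESSION = "MS:1000576"  # no compression


def decode_array(b64_text: str, compressed: bool, fmt: str = "d") -> List[float]:
    """Decode a binary array payload ('d' = 64-bit float, 'f' = 32-bit float)."""
    raw = base64.b64decode(b64_text)
    if compressed:
        raw = zlib.decompress(raw)
    n = len(raw) // struct.calcsize("<" + fmt)
    return list(struct.unpack(f"<{n}{fmt}", raw))


def encode_array(values: Sequence[float], compress: bool, fmt: str = "d") -> str:
    """Encode values as little-endian bytes, optionally zlib-compress, then base64."""
    raw = struct.pack(f"<{len(values)}{fmt}", *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def compression_cv(compress: bool) -> str:
    """CV accession that must accompany the re-encoded array."""
    return CV_ZLIB if compress else CV_NO_COMPRESSION
```

Switching compression means decoding with the source setting and encoding with the target setting, then replacing the compression CV term on the array so the metadata matches the payload.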

Reproducibility:

--seed INTEGER (default: 42): random seed used for --scan-count and --scan-percent.
- Also used for --rt-window-percent.

Help:

--help / -h: show usage and exit.

Exit Codes

1: invalid/unreadable input file.
2: CLI usage/argument error (bad flag combinations).
3: one or more explicit scans were not found.
4: random selection request exceeds the eligible pool, or no eligible scans remain after filtering/exclusion.
- Also used when any other filter/selection combination leaves zero scans selected.
5: output path/write error.
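Automation that wraps the CLI may find it convenient to name these codes. The sketch below mirrors the table above in an `IntEnum`; the member names are invented for illustration, and treating 0 as success follows the usual process convention rather than anything stated in this section.

```python
from enum import IntEnum


class MillerExit(IntEnum):
    """Hypothetical names for the documented exit codes."""
    INVALID_INPUT = 1
    USAGE_ERROR = 2
    SCAN_NOT_FOUND = 3
    SELECTION_UNSATISFIABLE = 4
    OUTPUT_ERROR = 5


def describe_exit(code: int) -> str:
    """Map a process return code to a short label for logs."""
    if code == 0:
        return "success"  # assumption: conventional zero-on-success
    try:
        return MillerExit(code).name.lower()
    except ValueError:
        return f"unknown exit code {code}"
```

In a test harness, `describe_exit(subprocess.run(cmd).returncode)` gives a readable label, so a fixture-generation failure can be told apart from a bad flag combination at a glance.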

Installation (Local Dev)

python3 -m venv .venv
. .venv/bin/activate
pip install -e ".[dev]"

Testing

.venv/bin/pytest --cov=miller --cov-report=term-missing tests/
.venv/bin/ruff check src/ tests/
.venv/bin/mypy src/

Smoke tests:

tests/test_smoke_real_data.py uses test_data/test_data.mzML.
These smoke tests run automatically with the rest of the suite in GitHub Actions because they live under tests/.
Run only smoke tests locally:

.venv/bin/pytest tests/test_smoke_real_data.py

Docker

Build:

docker build -t miller .

Run help:

docker run --rm miller --help

Docker day-to-day usage (with mounts)

When running in Docker, you almost always want to mount a host directory containing mzML files into the container, and mount an output directory to receive the subset file.

Example host layout:

/path/to/project/
  data/
    input.mzML
  subsets/

Run the tool against a mounted input file and write to a mounted output directory:

mkdir -p subsets
docker run --rm \
  -v "$PWD/data:/data:ro" \
  -v "$PWD/subsets:/out" \
  miller \
  --scan-count 50 \
  /data/input.mzML /out/input.subset_50.mzML

If you want the output file to be owned by your host user (instead of root), run the container with your own user and group IDs:

docker run --rm \
  --user "$(id -u):$(id -g)" \
  -v "$PWD/data:/data:ro" \
  -v "$PWD/subsets:/out" \
  miller \
  --ms-level 2 --scan-count 10 \
  /data/input.mzML /out/input.ms2_10_plus_precursors.mzML

Run tests inside the container:

docker run --rm --entrypoint pytest miller \
  --cov=miller --cov-report=term-missing tests/

Release history

1.0.4: Mar 30, 2026
1.0.3: Mar 17, 2026
1.0.2: Mar 16, 2026
0.1.0 (this version): Mar 16, 2026
