Miller: generate small representative mzML subsets for testing
Miller
miller creates small, representative mzML files from full-sized proteomics mzML datasets. The goal is realistic test fixtures for CI, integration tests, and local development without shipping multi-GB raw conversions.
Key Properties
- Fidelity: preserves mzML structure and metadata; only the spectrum set is reduced.
- Determinism: random selection is reproducible via --seed (default 42).
- Correctness-first: explicit validation and stable exit codes for automation.
What It Does (High Level)
- Selects spectra by:
  - Random count: --scan-count N
  - Random percent: --scan-percent PCT
  - Include file: --scan-include-file path/to/include.txt
- Optional retention-time filtering: --rt-range-start MIN_RT, --rt-range-end MAX_RT
- Optional random retention-time window: --rt-window-percent PCT
- Optional exclusion file: --scan-exclude-file path/to/exclude.txt
- Optional MS-level pre-filtering for random mode: --ms-level 1, --ms-level 2, --ms-level 1,2.
- Precursor inclusion (default on): if an MSn scan references a precursor via spectrumRef, the full precursor chain is included.
- Preserves run-level sections and metadata, updates spectrumList/@count.
- Chromatograms:
  - Recalculates TIC (MS:1000235) and BPC (MS:1000628) from retained spectra when present.
  - Passes through all other chromatograms unmodified.
- Output format:
  - Indexed or non-indexed mzML output, defaulting to the source unless overridden.
  - Binary array compression control: source, zlib, or none.
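As a rough illustration of the chromatogram recalculation listed above, the sketch below derives TIC and BPC points from a set of retained spectra. It assumes TIC is the per-spectrum sum of intensities and BPC is the per-spectrum maximum intensity. The Spectrum type and the function name are hypothetical, and miller's real implementation may differ (for example in which MS levels contribute).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Spectrum:
    """Hypothetical minimal spectrum record (not miller's internal type)."""
    retention_time: float
    intensities: List[float]


def recalc_tic_bpc(spectra: List[Spectrum]) -> Tuple[List[float], List[float], List[float]]:
    """Return (times, tic, bpc) with one point per retained spectrum."""
    times, tic, bpc = [], [], []
    for s in spectra:
        times.append(s.retention_time)
        tic.append(sum(s.intensities))               # total ion current
        bpc.append(max(s.intensities, default=0.0))  # base peak intensity
    return times, tic, bpc
```

Because both traces are computed only from the spectra that survive subsetting, the chromatograms in a subset stay consistent with its spectrum list.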
How To Run
Basic usage:
miller [OPTIONS] INPUT OUTPUT
Local day-to-day usage
A typical workflow: keep large source mzML files somewhere on disk, generate small subsets into a separate folder, then point your CI, tests, or tools at the subset files.
Example directory layout:
project/
  data/
    input.mzML
  subsets/
Create a subset (random selection):
mkdir -p subsets
miller --scan-count 50 data/input.mzML subsets/input.subset_50.mzML
Create a subset from only MS2 scans (still includes precursor MS1 scans when referenced):
miller --ms-level 2 --scan-count 10 data/input.mzML subsets/input.ms2_10_plus_precursors.mzML
Create a subset with exact scan IDs using an include file (one scan ID per line, no header):
cat > subsets/include_scans.txt <<'EOF'
1001
1002
1050
EOF
miller --scan-include-file subsets/include_scans.txt data/input.mzML subsets/input.scans_1001_1002_1050.mzML
Create a random subset by percent:
miller --scan-percent 5 data/input.mzML subsets/input.subset_5pct.mzML
Create a subset from a chromatographic time window:
miller --rt-range-start 35.2 --rt-range-end 35.8 data/input.mzML subsets/input.rt_35p2_35p8.mzML
Use a retention-time filter before random selection:
miller --rt-range-start 35.2 --rt-range-end 35.8 --scan-count 50 data/input.mzML subsets/input.rt_window_random_50.mzML
Keep a random contiguous 10% retention-time window, then select 25 scans from within it:
miller --rt-window-percent 10 --scan-count 25 data/input.mzML subsets/input.rt_segment_10pct_count25.mzML
Exclude specific scans from the random candidate pool (and from the final output):
cat > subsets/exclude_scans.txt <<'EOF'
1001
1002
EOF
miller --scan-count 50 --scan-exclude-file subsets/exclude_scans.txt data/input.mzML subsets/input.subset_50_excl.mzML
Exclude-only mode (all scans except excluded):
miller --scan-exclude-file subsets/exclude_scans.txt data/input.mzML subsets/input.all_minus_excluded.mzML
Disable precursor inclusion (output contains exactly the selected scans):
miller --no-include-precursors --scan-count 10 data/input.mzML subsets/input.subset_10_no_precursors.mzML
Force indexed/non-indexed output and compression:
miller --indexed --compression zlib --scan-count 10 data/input.mzML subsets/input.indexed.zlib.mzML
miller --no-index --compression none --scan-count 10 data/input.mzML subsets/input.noindex.none.mzML
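Once a subset exists, a quick sanity check is to confirm that spectrumList/@count matches the number of spectrum elements actually present. The following is a minimal sketch using only the Python standard library; the function name is hypothetical and it is not part of miller. It assumes the standard mzML namespace http://psi.hupo.org/ms/mzml.

```python
import xml.etree.ElementTree as ET

MZML_NS = "{http://psi.hupo.org/ms/mzml}"


def spectrum_count_matches(path: str) -> bool:
    """Compare spectrumList/@count with the number of <spectrum> elements."""
    declared = None
    actual = 0
    for event, elem in ET.iterparse(path, events=("start", "end")):
        if event == "start" and elem.tag == MZML_NS + "spectrumList":
            declared = int(elem.get("count"))
        elif event == "end" and elem.tag == MZML_NS + "spectrum":
            actual += 1
            elem.clear()  # drop parsed children to keep memory use low
    return declared is not None and declared == actual
```

For example, spectrum_count_matches("subsets/input.subset_50.mzML") should return True for a well-formed subset.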
Notes on determinism
Random selection uses --seed (default 42). If you want different subsets from the same file, vary the seed:
miller --scan-count 50 --seed 1 data/input.mzML subsets/input.subset_seed1.mzML
miller --scan-count 50 --seed 2 data/input.mzML subsets/input.subset_seed2.mzML
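The idea behind seeded selection can be sketched in a few lines of Python. This is an illustration only: select_scans is a hypothetical helper, and miller's actual random number generator and sampling method are not documented here, so the same seed in this sketch will not reproduce miller's picks.

```python
import random
from typing import List, Sequence


def select_scans(eligible: Sequence[int], count: int, seed: int = 42) -> List[int]:
    """Pick `count` scans reproducibly and return them in original order."""
    if count > len(eligible):
        raise ValueError("requested more scans than are eligible")
    rng = random.Random(seed)  # private RNG; global random state is untouched
    chosen = set(rng.sample(range(len(eligible)), count))
    # Iterate in source order so output order is file order, not draw order.
    return [scan for i, scan in enumerate(eligible) if i in chosen]
```

Calling the function twice with the same pool, count, and seed returns the same list, which is the property that makes subsets usable as stable test fixtures.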
Quick examples (minimal)
Randomly select 50 scans:
miller --scan-count 50 input.mzML output.mzML
Select specific scans via include file:
miller --scan-include-file include_scans.txt input.mzML output.mzML
Randomly select by percent:
miller --scan-percent 10 input.mzML output.mzML
Only draw from MS2 scans, but still include MS1 precursors if referenced:
miller --ms-level 2 --scan-count 10 input.mzML output.mzML
Disable precursor chain inclusion:
miller --no-include-precursors --scan-count 10 input.mzML output.mzML
Force output format and compression:
miller --indexed --compression zlib --scan-count 10 input.mzML output.mzML
miller --no-index --compression none --scan-count 10 input.mzML output.mzML
CLI Parameters
Positional arguments:
- INPUT (required): path to the source mzML file (indexed or non-indexed).
- OUTPUT (required): path for the output mzML file.
Selection mode:
- --scan-count INTEGER: randomly select N scans uniformly from the eligible pool.
  - Output order is the original file order, not the random draw order.
  - If N exceeds the eligible pool size, the program exits non-zero (see Exit Codes).
- --scan-percent FLOAT: randomly select a percentage of eligible scans.
  - Must be > 0 and <= 100.
  - Selection count is computed from the eligible pool after any exclusions.
- --scan-include-file PATH: file with one scan ID per line to include.
  - Accepts either bare numbers (1001) or prefixed IDs (scan=1001).
  - Output order follows source file order.
  - Incompatible with --scan-count and --scan-percent.
- --scan-exclude-file PATH can also be used alone (no include/count/percent), which means:
  - Start from all scans in input.
  - Apply any retention-time bounds.
  - Exclude listed scans.
  - Then apply precursor inclusion behavior and final exclusion.
- --rt-range-start FLOAT and --rt-range-end FLOAT:
  - Optional inclusive retention-time bounds applied before selection.
  - If only one bound is provided, the other side is left open.
  - Can be combined with random selection, include-file selection, or used by themselves to keep all scans within a time window.
  - Scans with missing retention time are treated as ineligible when any RT filter is present.
  - Precursor inclusion can still add scans outside the requested RT window.
- --rt-window-percent FLOAT:
  - Chooses a random contiguous retention-time window whose width is the given percentage of the eligible RT span.
  - Applied after fixed RT bounds and before non-RT filters or primary selection.
  - Can be combined with random selection, include-file selection, or used by itself.
  - The percentage refers to retention-time span, not percentage of scans.
  - Precursor inclusion can still add scans outside the chosen RT window.
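The retention-time rules above can be sketched as two small helpers. Both function names are hypothetical. The way the random window is positioned (a uniformly drawn start such that the window stays inside the span) is an assumption, since the exact placement rule is not specified here.

```python
import random
from typing import Dict, List, Optional, Tuple


def filter_by_rt(rts: Dict[int, Optional[float]],
                 start: Optional[float] = None,
                 end: Optional[float] = None) -> List[int]:
    """Keep scan IDs whose RT lies in [start, end]; a missing bound is open."""
    kept = []
    for scan_id, rt in rts.items():
        if rt is None:  # missing RT: ineligible whenever an RT filter is active
            continue
        if start is not None and rt < start:
            continue
        if end is not None and rt > end:
            continue
        kept.append(scan_id)
    return kept


def random_rt_window(rt_min: float, rt_max: float,
                     percent: float, seed: int = 42) -> Tuple[float, float]:
    """Choose a contiguous window covering `percent` of the span [rt_min, rt_max]."""
    width = (rt_max - rt_min) * percent / 100.0
    rng = random.Random(seed)
    lo = rng.uniform(rt_min, rt_max - width)  # assumed: window stays inside the span
    return lo, lo + width
```

Note that the window width is a fraction of the time span, so a 10% window can hold far more or far fewer than 10% of the scans, depending on how densely scans were acquired in that region.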
Exclusion file:
- --scan-exclude-file PATH: file with one scan ID per line to exclude.
  - Excluded scans are removed from random candidate pools and from final output.
  - Can be combined with random selection or include-file selection.
  - Can be used by itself to produce "all scans except excluded scans" output.
  - If the same scan appears in both include and exclude files, the program exits with a usage error.
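A minimal sketch of reading scan ID files and detecting the include/exclude conflict might look like the following. The helper names are hypothetical. Skipping blank lines and accepting the scan= prefix in exclude files as well as include files are assumptions; only the include-file format is described above.

```python
from typing import Iterable, Set


def parse_scan_ids(lines: Iterable[str]) -> Set[int]:
    """Parse one scan ID per line; accepts '1001' or 'scan=1001'."""
    ids = set()
    for raw in lines:
        text = raw.strip()
        if not text:
            continue  # assumption: blank lines are skipped
        if text.startswith("scan="):
            text = text[len("scan="):]
        ids.add(int(text))  # raises ValueError on malformed entries
    return ids


def check_no_overlap(include: Set[int], exclude: Set[int]) -> None:
    """Raise if a scan is both included and excluded (a usage error in miller)."""
    both = include & exclude
    if both:
        raise ValueError(f"scans in both include and exclude files: {sorted(both)}")
```

A pre-flight check like this in a CI script can catch a conflicting pair of files before miller is invoked.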
MS-level filtering:
- --ms-level TEXT: comma-separated MS levels (e.g. 1, 2, or 1,2).
  - Valid only with random selection (--scan-count or --scan-percent).
  - Applies only to the initial random selection pool. Precursor inclusion can add MS levels not listed here.
  - Using --ms-level with --scan-include-file or exclude-only mode is a usage error.
Precursor inclusion:
- --include-precursors / --no-include-precursors (default: include)
  - When enabled, walks precursor/@spectrumRef chains and includes all referenced ancestors.
  - A broken spectrumRef value produces a warning on stderr, and processing continues.
  - If no spectrumRef attributes exist in the file, this option has no effect.
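The chain walk described above can be sketched as follows. This is an illustration under simplifying assumptions: with_precursors is a hypothetical helper, and each spectrum is assumed to carry at most one spectrumRef, whereas real mzML files may list several precursors per spectrum.

```python
import sys
from typing import Dict, Iterable, Optional, Set


def with_precursors(selected: Iterable[str],
                    precursor_of: Dict[str, Optional[str]]) -> Set[str]:
    """Expand `selected` spectrum IDs with every ancestor reachable via spectrumRef.

    `precursor_of` maps each spectrum ID in the file to its spectrumRef (or None).
    """
    result = set(selected)
    for spectrum_id in list(result):
        ref = precursor_of.get(spectrum_id)
        # Stop at the top of the chain or at an ancestor that is already included.
        while ref is not None and ref not in result:
            if ref not in precursor_of:  # broken reference: warn and move on
                print(f"warning: spectrumRef {ref!r} not found", file=sys.stderr)
                break
            result.add(ref)
            ref = precursor_of[ref]
    return result
```

Selecting a single MS3 scan therefore pulls in its MS2 parent and that parent's MS1 scan, which is why a subset can contain more spectra than --scan-count requested.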
Output format:
- --indexed / --no-index:
  - When omitted, the output format follows the source file.
  - --indexed adds an index (indexList and indexListOffset) to the end of the file.
  - --no-index omits those elements entirely.
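To check which format a file uses, one option is to look for the indexedmzML wrapper element that indexed files place around the mzML element. The helper below is a hypothetical heuristic, not part of miller, and it only inspects the first few kilobytes of the file.

```python
def is_indexed_mzml(path: str, probe_bytes: int = 4096) -> bool:
    """Heuristic: indexed files wrap <mzML> in an <indexedmzML> root element."""
    with open(path, "rb") as fh:
        head = fh.read(probe_bytes)
    return b"<indexedmzML" in head
```

This can be handy in tests that assert --indexed or --no-index produced the expected output format.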
Binary array compression:
- --compression [source|zlib|none] (default: source)
  - source: copies each spectrum's binary arrays without re-encoding.
  - zlib: decodes and re-encodes all spectrum arrays with zlib compression and updates CV terms.
  - none: decodes and re-encodes all spectrum arrays uncompressed and updates CV terms.
  - Recalculated TIC/BPC use this setting. Pass-through chromatograms retain their original encoding.
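For readers unfamiliar with mzML binary arrays, the decode and re-encode step can be pictured roughly as below. The sketch assumes 64-bit little-endian floats; real files may also use 32-bit floats, and the function names are hypothetical rather than miller's API.

```python
import base64
import struct
import zlib
from typing import List


def encode_array(values: List[float], compress: bool) -> str:
    """Pack as little-endian float64, optionally zlib-compress, then base64."""
    raw = struct.pack(f"<{len(values)}d", *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text: str, compressed: bool) -> List[float]:
    """Inverse of encode_array."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))
```

In mzML the compression state of each array is recorded with a CV term, MS:1000574 (zlib compression) or MS:1000576 (no compression), which is why re-encoding also requires updating those terms.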
Reproducibility:
- --seed INTEGER (default: 42): random seed used for --scan-count and --scan-percent.
  - Also used for --rt-window-percent.
Help:
- --help / -h: show usage and exit.
- --version / -v: show the installed release version, or a git-derived development version when available.
Exit Codes
- 1: invalid/unreadable input file.
- 2: CLI usage/argument error (bad flag combinations).
- 3: one or more explicit scans were not found.
- 4: random selection request exceeds or has no eligible scans after filtering/exclusion.
  - Also used when any other filter/selection combination leaves zero scans selected.
- 5: output path/write error.
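In automation, the exit codes above can be mapped to readable messages. The wrapper below is a hypothetical sketch: it assumes miller is on PATH and that exit code 0 means success, which the list above does not state explicitly.

```python
import subprocess
from typing import List

# Meanings taken from the exit-code list above; 0 is assumed to mean success.
EXIT_MEANINGS = {
    0: "success (assumed)",
    1: "invalid or unreadable input file",
    2: "CLI usage/argument error",
    3: "one or more explicit scans were not found",
    4: "no eligible scans, or request exceeds eligible pool",
    5: "output path/write error",
}


def describe_exit(code: int) -> str:
    """Translate a miller exit code into a short description."""
    return EXIT_MEANINGS.get(code, f"unknown exit code {code}")


def run_miller(args: List[str]) -> int:
    """Run miller (must be on PATH) and report the outcome."""
    proc = subprocess.run(["miller", *args])
    print(f"miller exited with {proc.returncode}: {describe_exit(proc.returncode)}")
    return proc.returncode
```

For example, run_miller(["--scan-count", "50", "input.mzML", "output.mzML"]) runs the same command as the quick example above and prints a one-line summary of the result.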
Installation
Install from PyPI:
python3 -m pip install miller-mzml-filterer
Verify the CLI is available:
miller --help
Example run after installing with pip:
miller --scan-count 50 input.mzML output.subset_50.mzML
Installation (Local Dev)
python3 -m venv .venv
. .venv/bin/activate
pip install -e ".[dev]"
Testing
.venv/bin/pytest --cov=miller --cov-report=term-missing tests/
.venv/bin/ruff check src/ tests/
.venv/bin/mypy src/
Smoke tests:
- tests/test_smoke_real_data.py uses test_data/test_data.mzML.
- These smoke tests run automatically with the rest of the suite in GitHub Actions because they live under tests/.
- Run only smoke tests locally:
.venv/bin/pytest tests/test_smoke_real_data.py
Docker
Pull the published image for this GitHub project:
docker pull ghcr.io/mriffle/miller-mzml-filterer:latest
Run help:
docker run --rm ghcr.io/mriffle/miller-mzml-filterer:latest --help
Run the tool in the current directory, as your current user and group, with the current directory mounted at /work:
docker run --rm \
--user "$(id -u):$(id -g)" \
-v "$PWD:/work" \
-w /work \
ghcr.io/mriffle/miller-mzml-filterer:latest \
--scan-count 50 input.mzML output.subset_50.mzML
Docker day-to-day usage (with mounts)
When running in Docker, you almost always want to mount a host directory containing mzML files into the container, and mount an output directory to receive the subset file.
Example host layout:
/path/to/project/
  data/
    input.mzML
  subsets/
Run the tool against a mounted input file and write to a mounted output directory:
mkdir -p subsets
docker run --rm \
-v "$PWD/data:/data:ro" \
-v "$PWD/subsets:/out" \
ghcr.io/mriffle/miller-mzml-filterer:latest \
--scan-count 50 \
/data/input.mzML /out/input.subset_50.mzML
If you want the output file to be owned by your host user (instead of root), run the container as your own user and group:
docker run --rm \
--user "$(id -u):$(id -g)" \
-v "$PWD/data:/data:ro" \
-v "$PWD/subsets:/out" \
ghcr.io/mriffle/miller-mzml-filterer:latest \
--ms-level 2 --scan-count 10 \
/data/input.mzML /out/input.ms2_10_plus_precursors.mzML
Run tests inside the container:
docker run --rm --entrypoint pytest ghcr.io/mriffle/miller-mzml-filterer:latest \
--cov=miller --cov-report=term-missing tests/
If you want to build the image locally during development instead of pulling it from GHCR:
docker build -t miller .
docker run --rm miller --help