A comprehensive toolkit for preprocessing MALDI-TOF mass spectrometry data for antimicrobial resistance (AMR) prediction

MaldiAMRKit

Installation

pip install maldiamrkit

Install the full MaldiSuite

To install MaldiAMRKit together with MaldiBatchKit and MaldiDeepKit at compatible versions, install the maldisuite meta-package:

pip install maldisuite

Visit the MaldiSuite landing page at https://ettorerocchi.github.io/MaldiSuite/.

Optional: Batch Correction & UMAP

pip install "maldiamrkit[batch]"

Installs combatlearn for ComBat-based batch effect correction and umap-learn for UMAP exploratory plots.

Development Installation

git clone https://github.com/EttoreRocchi/MaldiAMRKit.git
cd MaldiAMRKit
pip install -e ".[dev]"

Features

Preprocessing

  • Composable Pipeline: Build a custom PreprocessingPipeline from individual transformers (smoothing, baseline correction, normalization, trimming) and serialize it to JSON/YAML
  • Smoothing: Savitzky-Golay and moving-average filters
  • Baseline Correction: SNIP, morphological top-hat, lower-convex-hull, and iterative rolling-median baselines
  • Multiple Binning Strategies: Uniform, proportional, adaptive, and custom bin edges
  • Quality Metrics: SNR estimation, MAD-based noise estimation, comprehensive quality reports, and alignment assessment
  • Replicate Merging: Mean/median/weighted merging with correlation-based outlier detection
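
To make the preprocessing steps above concrete, here is a minimal standard-library sketch of moving-average smoothing, normalization, and uniform binning. It is not MaldiAMRKit's implementation: the function names are hypothetical, total-ion-current scaling is assumed for normalization, and summing intensities within a bin is an assumed aggregation rule.

```python
from typing import List, Sequence


def moving_average(values: Sequence[float], window: int = 3) -> List[float]:
    """Centered moving average; the window shrinks at the spectrum edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def tic_normalize(values: Sequence[float]) -> List[float]:
    """Scale intensities so they sum to 1 (assumed total-ion-current normalization)."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else list(values)


def bin_uniform(mz: Sequence[float], intensity: Sequence[float],
                start: float, stop: float, width: float) -> List[float]:
    """Sum intensities into fixed-width bins covering [start, stop)."""
    n_bins = int((stop - start) // width)
    bins = [0.0] * n_bins
    for m, v in zip(mz, intensity):
        idx = int((m - start) // width)
        if 0 <= idx < n_bins:  # drop points outside the binned range
            bins[idx] += v
    return bins


# Toy spectrum: five (m/z, intensity) points binned into two 3 Da bins
mz = [2000.0, 2001.0, 2002.5, 2003.0, 2005.9]
intensity = [1.0, 2.0, 3.0, 4.0, 5.0]
binned = bin_uniform(mz, tic_normalize(moving_average(intensity)), 2000.0, 2006.0, 3.0)
```

Running smoothing and normalization before binning mirrors the order of the Quick Start example, where preprocess() is called before bin().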

Alignment & Detection

  • Spectral Alignment: Shift, linear, piecewise, DTW, quadratic, cubic, and LOWESS warping for both binned and raw full-resolution spectra
  • Peak Detection: Local maxima and persistent homology methods
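
As an illustration of the two simplest options listed above, the sketch below finds local maxima and estimates a global shift against a reference spectrum. It assumes binned spectra stored as plain lists and uses hypothetical helper names. The toolkit's Warping and MaldiPeakDetector classes may work differently.

```python
from typing import List, Sequence


def local_maxima(intensity: Sequence[float], min_height: float = 0.0) -> List[int]:
    """Indices that rise above the left neighbour and are not below the right one."""
    return [
        i for i in range(1, len(intensity) - 1)
        if intensity[i] > intensity[i - 1]
        and intensity[i] >= intensity[i + 1]
        and intensity[i] >= min_height
    ]


def best_shift(reference: Sequence[float], spectrum: Sequence[float],
               max_shift: int = 5) -> int:
    """Integer shift (in bins) that maximizes the dot product with the reference."""
    def score(shift: int) -> float:
        total = 0.0
        for i, v in enumerate(spectrum):
            j = i + shift
            if 0 <= j < len(reference):
                total += v * reference[j]
        return total

    return max(range(-max_shift, max_shift + 1), key=score)


def apply_shift(spectrum: Sequence[float], shift: int) -> List[float]:
    """Move every bin by `shift` positions, padding with zeros."""
    out = [0.0] * len(spectrum)
    for i, v in enumerate(spectrum):
        j = i + shift
        if 0 <= j < len(out):
            out[j] = v
    return out


reference = [0.0, 0.0, 1.0, 5.0, 1.0, 0.0, 0.0, 0.0]
spectrum = [0.0, 0.0, 0.0, 0.0, 1.0, 5.0, 1.0, 0.0]  # same peak, two bins to the right
shift = best_shift(reference, spectrum)
aligned = apply_shift(spectrum, shift)
```

A single global shift is the cheapest correction. The linear, piecewise, and DTW options listed above allow the correction to vary along the m/z axis.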

Evaluation

  • AMR Metrics: VME, ME, sensitivity, specificity, categorical agreement, and amr_classification_report following EUCAST conventions
  • mic_regression_report: RMSE in log2 dilutions, essential agreement (±1 dilution), and categorical agreement after re-binning to S/I/R - the regression counterpart to amr_classification_report
  • Stratified Splitting: Species-drug stratified and case-based (patient-grouped) splitting to prevent data leakage
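
The error rates named above can be sketched in a few lines. This example assumes binary R/S labels, takes a very major error (VME) to be a resistant isolate predicted susceptible and a major error (ME) to be a susceptible isolate predicted resistant, and counts essential agreement on unrounded log2(MIC) values. The function names are hypothetical and the code is not the package's amr_classification_report or mic_regression_report.

```python
import math
from typing import Dict, Sequence


def amr_error_rates(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """VME, ME, sensitivity, specificity, and categorical agreement for R/S labels."""
    pairs = list(zip(y_true, y_pred))
    n_r = sum(1 for t, _ in pairs if t == "R")
    n_s = sum(1 for t, _ in pairs if t == "S")
    vme = sum(1 for t, p in pairs if t == "R" and p == "S")  # missed resistance
    me = sum(1 for t, p in pairs if t == "S" and p == "R")   # false resistance
    agree = sum(1 for t, p in pairs if t == p)
    nan = float("nan")
    return {
        "VME": vme / n_r if n_r else nan,
        "ME": me / n_s if n_s else nan,
        "sensitivity": (n_r - vme) / n_r if n_r else nan,
        "specificity": (n_s - me) / n_s if n_s else nan,
        "categorical_agreement": agree / len(pairs) if pairs else nan,
    }


def essential_agreement(log2_true: Sequence[float], log2_pred: Sequence[float]) -> float:
    """Fraction of predictions within one log2 dilution of the reference MIC."""
    hits = sum(1 for t, p in zip(log2_true, log2_pred) if abs(p - t) <= 1.0)
    return hits / len(log2_true)


def rmse_log2(log2_true: Sequence[float], log2_pred: Sequence[float]) -> float:
    """Root-mean-square error in log2 dilutions."""
    squared = [(p - t) ** 2 for t, p in zip(log2_true, log2_pred)]
    return math.sqrt(sum(squared) / len(squared))


y_true = ["R", "R", "R", "R", "S", "S", "S", "S", "S", "S"]
y_pred = ["R", "R", "R", "S", "S", "S", "S", "S", "R", "R"]
rates = amr_error_rates(y_true, y_pred)
```

VME and ME use different denominators (resistant and susceptible isolates respectively), so they do not sum to the overall error rate.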

Susceptibility (MIC encoding & breakpoints)

  • MICEncoder: Convert raw MIC strings into log2(MIC) regression targets plus S/I/R categories and ATU flags in a single pass
  • BreakpointTable: Clinical breakpoint tables (EUCAST v1.0-v16.0 bundled) loaded by version, year, latest, or custom YAML
  • LabelEncoder: Map R/I/S to binary with configurable intermediate handling (moved from maldiamrkit.evaluation in v0.15)
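
The sketch below shows one way a raw MIC string could become a log2(MIC) target and an S/I/R category. It is an illustration only: parse_mic and categorize are hypothetical helpers rather than MICEncoder or BreakpointTable, the breakpoint values are invented for the example, and the rule applied is the EUCAST-style "S if MIC <= S breakpoint, R if MIC > R breakpoint, otherwise I".

```python
import math
import re
from typing import Tuple

# Optional qualifier, then a number with either "." or "," as decimal separator
_MIC_RE = re.compile(r"^\s*(<=|>=|<|>|=|≤|≥)?\s*(\d+(?:[.,]\d+)?)\s*$")


def parse_mic(text: str) -> Tuple[str, float]:
    """Split a MIC string such as '<=0,25' into (qualifier, value in mg/L)."""
    match = _MIC_RE.match(text)
    if match is None:
        raise ValueError(f"Unrecognised MIC string: {text!r}")
    qualifier = match.group(1) or "="
    value = float(match.group(2).replace(",", "."))
    return qualifier, value


def categorize(mic: float, s_breakpoint: float, r_breakpoint: float) -> str:
    """S if MIC <= S breakpoint, R if MIC > R breakpoint, otherwise I."""
    if mic <= s_breakpoint:
        return "S"
    if mic > r_breakpoint:
        return "R"
    return "I"


# Hypothetical breakpoints (mg/L), chosen only for this example
S_BP, R_BP = 0.25, 0.5

qualifier, mic = parse_mic("<=0,25")
target = math.log2(mic)              # regression target in log2 dilutions
category = categorize(mic, S_BP, R_BP)
```

Keeping the qualifier alongside the value matters for censored results such as ">32", where the true MIC is only known to exceed the tested range.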

Differential Analysis

  • DifferentialAnalysis: Per-bin statistical testing (Mann-Whitney U, Welch's t-test) between resistant and susceptible groups, with multiple-testing correction, log2 fold change, and Cohen's d effect size
  • Peak Selection: top_peaks() by adjusted p-value, significant_peaks() with fold-change and p-value thresholds, compare_drugs() for multi-drug boolean significance matrices
  • AMR-Aware Plots: plot_volcano(), plot_manhattan() along the m/z axis, and plot_drug_comparison() with binary heatmap or UpSet-style intersection view
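
For orientation, the per-bin statistics mentioned above can be written out with the standard library. This is a sketch under stated assumptions, not the DifferentialAnalysis implementation. Benjamini-Hochberg is assumed as the multiple-testing correction, a small epsilon is added to avoid division by zero in the fold change, and the function names are hypothetical.

```python
import math
from statistics import mean, variance
from typing import List, Sequence


def log2_fold_change(resistant: Sequence[float], susceptible: Sequence[float],
                     eps: float = 1e-12) -> float:
    """log2 ratio of mean intensities (resistant over susceptible)."""
    return math.log2((mean(resistant) + eps) / (mean(susceptible) + eps))


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Standardized mean difference using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled)


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Adjusted p-values controlling the false discovery rate."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# One bin, three resistant and three susceptible spectra
resistant = [2.0, 4.0, 6.0]
susceptible = [1.0, 2.0, 3.0]
lfc = log2_fold_change(resistant, susceptible)
effect = cohens_d(resistant, susceptible)
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

A volcano plot then places the log2 fold change on the x-axis and the negative log10 of the adjusted p-value on the y-axis, one point per bin.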

Drift Monitoring

  • DriftMonitor: Anchor a baseline on early timestamps (default: first 20%) and track temporal drift via three complementary views - reference similarity of per-window median spectra, PCA centroid trajectory in a baseline-fitted PCA space, and Jaccard stability of top-k differential peaks over time
  • Trajectory Plots: plot_reference_drift, plot_pca_drift, plot_peak_stability, plot_effect_size_drift
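
The third view, Jaccard stability of top-k peaks, is simple enough to sketch directly. The example assumes each time window has already been reduced to one score per bin (for instance an absolute effect size). The helper names are hypothetical and do not reflect DriftMonitor's API.

```python
from typing import List, Sequence, Set


def top_k(scores: Sequence[float], k: int) -> Set[int]:
    """Indices of the k highest-scoring bins."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def peak_stability(baseline: Sequence[float],
                   windows: Sequence[Sequence[float]], k: int) -> List[float]:
    """Jaccard similarity between baseline top-k bins and each later window."""
    reference = top_k(baseline, k)
    return [jaccard(reference, top_k(window, k)) for window in windows]


baseline_scores = [0.9, 0.1, 0.8, 0.2, 0.7]
later_windows = [
    [0.9, 0.1, 0.8, 0.2, 0.7],  # unchanged ranking
    [0.9, 0.8, 0.1, 0.2, 0.7],  # bin 2 replaced by bin 1 in the top 3
]
stability = peak_stability(baseline_scores, later_windows, k=3)
```

Values close to 1 mean that the same bins keep discriminating over time, while a falling curve suggests that the informative peaks are drifting.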

Data Management

  • Dataset Building & Loading: DatasetBuilder and DatasetLoader with pluggable layout adapters (FlatLayout, BrukerTreeLayout, DRIAMSLayout, MARISMaLayout)
  • Bruker Format Support: Read Bruker flexAnalysis binary data (fid/1r + acqus) natively via read_spectrum() on directories
  • MIC Parsing: parse_mic_column() for parsing MIC strings with qualifiers and European decimals
  • Composable Filters: SpeciesFilter, DrugFilter, QualityFilter, MetadataFilter combinable with &/|/~ operators
  • Spectrum Export: Save spectra to CSV or TXT via MaldiSpectrum.save() and MaldiSet.save_spectra()
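
Filters that combine with &, | and ~ are usually built on Python's operator overloading. The sketch below shows the general pattern on plain dictionaries. The Filter class and the field_equals and field_at_least helpers are hypothetical and are not the package's SpeciesFilter, DrugFilter, or QualityFilter.

```python
from typing import Any, Callable, Dict

Row = Dict[str, Any]


class Filter:
    """Wrap a row predicate so that filters compose with &, | and ~."""

    def __init__(self, predicate: Callable[[Row], bool]) -> None:
        self.predicate = predicate

    def __call__(self, row: Row) -> bool:
        return bool(self.predicate(row))

    def __and__(self, other: "Filter") -> "Filter":
        return Filter(lambda row: self(row) and other(row))

    def __or__(self, other: "Filter") -> "Filter":
        return Filter(lambda row: self(row) or other(row))

    def __invert__(self) -> "Filter":
        return Filter(lambda row: not self(row))


def field_equals(name: str, value: Any) -> Filter:
    """Keep rows whose metadata field equals the given value."""
    return Filter(lambda row: row.get(name) == value)


def field_at_least(name: str, threshold: float) -> Filter:
    """Keep rows whose numeric field is at or above the threshold."""
    return Filter(lambda row: row.get(name, float("-inf")) >= threshold)


rows = [
    {"species": "Escherichia coli", "drug": "Ceftriaxone", "snr": 12.0},
    {"species": "Escherichia coli", "drug": "Ceftriaxone", "snr": 2.0},
    {"species": "Klebsiella pneumoniae", "drug": "Meropenem", "snr": 15.0},
]

# E. coli spectra with acceptable signal, excluding anything tested on Meropenem
selection = (
    field_equals("species", "Escherichia coli")
    & field_at_least("snr", 5.0)
    & ~field_equals("drug", "Meropenem")
)
kept = [row for row in rows if selection(row)]
```

Because each combination returns a new Filter, arbitrarily nested expressions remain ordinary callables that can be applied row by row.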

Visualization & Tools

  • Exploratory Plots: PCA, t-SNE, and UMAP scatter plots colored by species, resistance phenotype, or any metadata column
  • Batch Effect Correction: Multi-site/multi-instrument correction via combatlearn (pip install maldiamrkit[batch])
  • CLI: maldiamrkit preprocess, maldiamrkit quality, and maldiamrkit build for batch processing
  • Parallel Processing: Multi-core support via n_jobs parameter
  • ML-Ready: Direct integration with scikit-learn pipelines

Documentation

Full documentation is available at maldiamrkit.readthedocs.io.

Quick Start

Load and Preprocess a Single Spectrum

from maldiamrkit import MaldiSpectrum
from maldiamrkit.visualization import plot_spectrum

# Load spectrum from file
spec = MaldiSpectrum("data/spectrum.txt")

# Preprocess: smoothing, baseline removal, normalization
spec.preprocess()

# Optional: bin to reduce dimensions
spec.bin(bin_width=3)  # 3 Da bins

# Visualize the binned spectrum
plot_spectrum(spec, stage="binned")

Build a Dataset from Multiple Spectra

from maldiamrkit import MaldiSet

# Load multiple spectra with metadata
data = MaldiSet.from_directory(
    spectra_dir="data/spectra/",
    meta_file="data/metadata.csv",
    aggregate_by=dict(antibiotics="Drug", species="Escherichia coli"),
    bin_width=3
)

# Access features and labels
X = data.X  # Feature matrix
y = data.get_y_single("Drug")  # Target labels

Machine Learning Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from maldiamrkit.alignment import Warping
from maldiamrkit.detection import MaldiPeakDetector

# Create ML pipeline
pipe = Pipeline([
    ("peaks", MaldiPeakDetector(binary=False, prominence=0.05)),
    ("warp", Warping(method="shift")),
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42))
])

# Cross-validation (X and y come from the MaldiSet example above)
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(f"CV Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

For further examples covering alignment, filtering, evaluation, and CLI usage, see the Quickstart Guide and API Reference.

Tutorials

For more detailed examples, see the notebooks:

  • Quick Start - Loading, preprocessing, binning, and quality assessment
  • Peak Detection - Local maxima and persistent homology methods
  • Alignment - Warping methods and alignment quality
  • Evaluation - AMR metrics, label encoding, and stratified splitting
  • Exploration - PCA, t-SNE, UMAP visualizations and batch correction
  • Differential Analysis - R vs. S peak testing, volcano/Manhattan plots, and multi-drug comparison
  • Drift Monitoring - Baseline-anchored drift detection: reference similarity, PCA trajectory, peak stability, and effect-size drift
  • Susceptibility - MICEncoder + BreakpointTable for log2(MIC) regression targets, S/I/R categorisation with ATU, and mic_regression_report evaluation

Notebooks 01-03 and 08 run on the small example dataset bundled under data/ or are fully self-contained. Notebooks 04-07 need more samples and pull the real MALDI-Kleb-AI archive (Rocchi et al., 2026; Zenodo DOI 10.5281/zenodo.17405072) via the demo helper. By default the helper restricts the dataset to the Rome sub-cohort (~470 spectra, single acquisition centre, no batch correction required); the 370 MB tarball is cached under ~/.cache/maldiamrkit/ on first use.

MaldiSuite Ecosystem

MaldiAMRKit is the data-model and preprocessing package of the MaldiSuite ecosystem:

  • MaldiAMRKit (this package) - data model (MaldiSpectrum, MaldiSet), preprocessing, alignment, peak detection, differential analysis, and AMR-aware evaluation.
  • MaldiBatchKit - batch-effect correction and harmonisation for multi-centre / multi-instrument MALDI-TOF spectra.
  • MaldiDeepKit - sklearn-compatible deep learning classifiers (MLP, CNN, ResNet, Transformer).

The three packages share the MaldiSet / MaldiSpectrum data model and are designed to compose in a single end-to-end pipeline. Install the full suite with pip install maldisuite. Landing page: https://ettorerocchi.github.io/MaldiSuite/.

Contributing

Pull requests, bug reports, and feature ideas are welcome. See the Contributing Guide for how to get started.

Citing

If you use MaldiAMRKit in your research, please cite:

Rocchi, E., Nicitra, E., Calvo, M. et al. Combining mass spectrometry and machine learning models for predicting Klebsiella pneumoniae antimicrobial resistance: a multicenter experience from clinical isolates in Italy. BMC Microbiol (2026). doi:10.1186/s12866-025-04657-2

See the full publications list for more papers using MaldiAMRKit.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgements

This toolkit is inspired by:

Weis, C., Cuénod, A., Rieck, B., et al. (2022). Direct antimicrobial resistance prediction from clinical MALDI-TOF mass spectra using machine learning. Nature Medicine, 28, 164-174. https://doi.org/10.1038/s41591-021-01619-9
