MaldiAMRKit
A comprehensive toolkit for preprocessing MALDI-TOF mass spectrometry data for antimicrobial resistance (AMR) prediction
Installation
pip install maldiamrkit
Development Installation
git clone https://github.com/EttoreRocchi/MaldiAMRKit.git
cd MaldiAMRKit
pip install -e .[dev]
Features
- Spectrum Processing: Load, smooth, baseline correct, and normalize MALDI-TOF spectra
- Dataset Management: Process multiple spectra with metadata integration
- Peak Detection: Local maxima and persistent homology methods
- Spectral Alignment (Warping): Multiple alignment methods (shift, linear, piecewise, DTW)
- Raw Spectra Warping: Full m/z resolution alignment before binning
- Quality Metrics: SNR estimation, comprehensive quality reports, and alignment assessment
- Replicate Merging: Mean/median/weighted merging of spectral replicates with correlation-based outlier detection
- Composable Preprocessing Pipeline: Build custom PreprocessingPipeline from individual transformers, serializable to JSON/YAML
- Composable Filter System: SpeciesFilter, DrugFilter, QualityFilter, MetadataFilter with &, |, ~ operators for flexible dataset filtering
- Evaluation Metrics: VME, ME, sensitivity, specificity, categorical agreement, and amr_classification_report
- Stratified Splitting: Species-drug stratified and case-based (patient-grouped) splitting to prevent data leakage
- Label Encoding: LabelEncoder for mapping R/I/S to binary with configurable intermediate handling
- Spectrum Export: Save individual spectra (raw, preprocessed, or binned) to CSV or TXT via MaldiSet.save_spectra()
- CLI: maldiamrkit preprocess and maldiamrkit quality commands for batch processing
- Parallel Processing: Multi-core support via n_jobs parameter for faster processing
- ML-Ready: Direct integration with scikit-learn pipelines
Quick Start
Load and Preprocess a Single Spectrum
from maldiamrkit import MaldiSpectrum
# Load spectrum from file
spec = MaldiSpectrum("data/spectrum.txt")
# Preprocess: smoothing, baseline removal, normalization
spec.preprocess()
# Optional: bin to reduce dimensions
spec.bin(bin_width=3) # 3 Da bins
# Visualize
spec.plot(binned=True)
Build a Dataset from Multiple Spectra
from maldiamrkit import MaldiSet
# Load multiple spectra with metadata
data = MaldiSet.from_directory(
    spectra_dir="data/spectra/",
    meta_file="data/metadata.csv",
    aggregate_by=dict(antibiotics="Drug", species="Escherichia coli"),
    bin_width=3
)
# Access features and labels
X = data.X # Feature matrix
y = data.get_y_single("Drug") # Target labels
Binning Methods
MaldiAMRKit supports multiple binning strategies:
from maldiamrkit import MaldiSpectrum
spec = MaldiSpectrum("data/spectrum.txt").preprocess()
# Uniform binning (default)
spec.bin(bin_width=3)
# Logarithmic binning (width scales with m/z)
spec.bin(bin_width=3, method="logarithmic")
# Adaptive binning (smaller bins in peak-dense regions)
spec.bin(method="adaptive", adaptive_min_width=1.0, adaptive_max_width=10.0)
# Custom binning (user-defined edges)
spec.bin(method="custom", custom_edges=[2000, 5000, 10000, 15000, 20000])
# Access bin metadata
print(spec.bin_metadata.head())
# bin_index bin_start bin_end bin_width
# 0 0 2000.0 2003.0 3.0
# 1 1 2003.0 2006.0 3.0
Binning Methods:
- uniform: Fixed width bins (default)
- logarithmic: Bin width scales with m/z (matches instrument resolution)
- adaptive: Smaller bins where peaks are dense, larger bins elsewhere
- custom: User-defined bin edges for domain-specific analysis
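To make the difference between the uniform and logarithmic strategies concrete, here is a minimal pure-Python sketch of how the two sets of bin edges could be computed. It is an illustration only, not MaldiAMRKit's implementation: the helper names uniform_edges and log_edges are hypothetical, and the rule that the logarithmic bin width grows proportionally to m/z (starting at bin_width at the low end) is an assumption.

```python
def uniform_edges(mz_min, mz_max, bin_width):
    """Fixed-width edges from mz_min up to (at least) mz_max."""
    edges = [float(mz_min)]
    while edges[-1] < mz_max:
        edges.append(edges[-1] + bin_width)
    return edges


def log_edges(mz_min, mz_max, bin_width):
    """Edges whose width grows proportionally to m/z (assumed rule).

    The first bin is bin_width wide; every later bin is wider by the
    constant ratio (mz_min + bin_width) / mz_min.
    """
    ratio = (mz_min + bin_width) / mz_min
    edges = [float(mz_min)]
    while edges[-1] < mz_max:
        edges.append(edges[-1] * ratio)
    return edges


uniform = uniform_edges(2000, 20000, 3)
logarithmic = log_edges(2000, 20000, 3)

print(len(uniform) - 1)      # number of uniform bins
print(len(logarithmic) - 1)  # fewer bins, because they widen with m/z
print(uniform[:3])           # first uniform edges
```

Under these assumptions both schemes start with a 3 Da bin at 2000 Da, but the logarithmic scheme needs far fewer bins to cover the same range because each bin is slightly wider than the one before.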
Machine Learning Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from maldiamrkit.alignment import Warping
from maldiamrkit.detection import MaldiPeakDetector
# Create ML pipeline
pipe = Pipeline([
    ("peaks", MaldiPeakDetector(binary=False, prominence=0.05)),
    ("warp", Warping(method="shift")),
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42))
])
# Cross-validation (recommended over train accuracy)
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(f"CV Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
Spectral Alignment
Align spectra to correct for mass calibration drift:
from maldiamrkit.alignment import Warping
# Create warping transformer
warper = Warping(
    method='piecewise',  # or 'shift', 'linear', 'dtw'
    reference='median',
    n_segments=5
)
# Fit on training data, then transform the test data
warper.fit(X_train)
X_aligned = warper.transform(X_test)
# Check alignment quality
quality = warper.get_alignment_quality(X_test, X_aligned)
print(f"Mean improvement: {quality['improvement'].mean():.4f}")
# Visualize
warper.plot_alignment(X_test, X_aligned, indices=[0], show_peaks=True)
Raw Spectra Warping
For higher precision, use RawWarping, which operates at full m/z resolution:
from maldiamrkit.alignment import RawWarping, create_raw_input
# Create input DataFrame from spectrum files
X_raw = create_raw_input("data/spectra/")
# RawWarping loads the original spectrum files and warps them before binning
warper = RawWarping(
    method="piecewise",
    bin_width=3,
    max_shift_da=10.0,
    n_jobs=-1  # Parallel processing
)
# Outputs binned data for pipeline compatibility
warper.fit(X_raw)
X_aligned = warper.transform(X_raw)
Alignment Methods:
- shift: Global median shift (fast, simple)
- linear: Least-squares linear transformation
- piecewise: Local shifts across spectrum segments (most flexible)
- dtw: Dynamic Time Warping (best for non-linear drift)
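The idea behind the shift method (a single global median shift) can be sketched in a few lines of plain Python. This is not the Warping implementation: the function estimate_global_shift, the nearest-peak matching rule, and the max_shift tolerance are assumptions made for illustration.

```python
from statistics import median


def estimate_global_shift(peaks, reference_peaks, max_shift=10.0):
    """Median offset between each peak and its nearest reference peak."""
    offsets = []
    for mz in peaks:
        nearest = min(reference_peaks, key=lambda ref: abs(ref - mz))
        # Ignore peaks with no reference peak within the tolerance
        if abs(nearest - mz) <= max_shift:
            offsets.append(mz - nearest)
    return median(offsets) if offsets else 0.0


reference = [3000.0, 5000.0, 7500.0, 9000.0]
# Same peaks drifted by roughly +2 Da, plus one unmatched peak at 12000
sample = [3002.0, 5002.5, 7501.5, 9002.0, 12000.0]

shift = estimate_global_shift(sample, reference)
aligned = [mz - shift for mz in sample]
print(shift)
print(aligned)
```

Using the median rather than the mean keeps a few badly matched peaks from dominating the estimate. A single offset cannot correct drift that varies along the m/z axis, which is the case the piecewise and dtw methods are described as handling.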
Quality Assessment
from maldiamrkit import MaldiSpectrum
from maldiamrkit.preprocessing import estimate_snr, SpectrumQuality
# Estimate signal-to-noise ratio
spec = MaldiSpectrum("spectrum.txt").preprocess()
snr = estimate_snr(spec)
print(f"SNR: {snr:.1f}")
# Comprehensive quality report
qc = SpectrumQuality() # Uses high m/z region (19500-20000) by default
report = qc.assess(spec)
print(f"SNR: {report.snr:.1f}")
print(f"Peak count: {report.peak_count}")
print(f"Dynamic range: {report.dynamic_range:.2f}")
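As a rough illustration of what a signal-to-noise estimate based on a high m/z noise region might look like, the sketch below divides the tallest peak by a robust noise estimate taken from the 19500-20000 window. The formula (maximum intensity over a scaled median absolute deviation) and the function name snr_sketch are assumptions; estimate_snr and SpectrumQuality may compute SNR differently.

```python
from statistics import median


def snr_sketch(mz, intensity, noise_region=(19500.0, 20000.0)):
    """Peak height divided by a robust noise estimate (assumed formula)."""
    noise = [i for m, i in zip(mz, intensity)
             if noise_region[0] <= m <= noise_region[1]]
    center = median(noise)
    # Median absolute deviation, scaled to approximate a standard deviation
    mad = median([abs(i - center) for i in noise]) * 1.4826
    return max(intensity) / mad if mad > 0 else float("inf")


mz = [5000.0, 5001.0, 19500.0, 19600.0, 19700.0, 19800.0, 19900.0]
intensity = [100.0, 40.0, 1.0, 3.0, 2.0, 1.0, 3.0]
print(round(snr_sketch(mz, intensity), 1))
```

The high m/z window is a convenient noise reference only if it rarely contains real peaks in the spectra being assessed.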
Replicate Merging
Merge multiple spectral replicates per isolate into a single consensus spectrum:
from maldiamrkit import MaldiSpectrum
from maldiamrkit.preprocessing import merge_replicates, detect_outlier_replicates
# Load replicates as MaldiSpectrum objects
spectra = [MaldiSpectrum(f"data/isolate_rep{i}.txt") for i in range(1, 4)]
# Detect and remove outlier replicates
keep = detect_outlier_replicates(spectra)
clean = [s for s, k in zip(spectra, keep) if k]
# Merge into a single consensus spectrum
merged = merge_replicates(clean, method="mean")
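The two steps above (correlation-based outlier detection, then mean merging) can be sketched on plain intensity lists. This is a simplified stand-in, not the library's algorithm: comparing each replicate against the pointwise median spectrum, the 0.9 threshold, and the helper names pearson, keep_mask, and merge_mean are all assumptions.

```python
from statistics import mean, median


def pearson(a, b):
    """Pearson correlation of two equal-length intensity vectors."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def keep_mask(replicates, min_corr=0.9):
    """Keep replicates that correlate well with the pointwise median."""
    consensus = [median(col) for col in zip(*replicates)]
    return [pearson(rep, consensus) >= min_corr for rep in replicates]


def merge_mean(replicates):
    """Pointwise mean across replicates on a shared m/z grid."""
    return [mean(col) for col in zip(*replicates)]


replicates = [
    [1.0, 5.0, 2.0, 8.0, 1.0],
    [1.2, 4.8, 2.1, 7.9, 0.9],
    [6.0, 1.0, 7.0, 1.0, 5.0],  # inverted pattern: an obvious outlier
]
keep = keep_mask(replicates)
clean = [r for r, k in zip(replicates, keep) if k]
merged = merge_mean(clean)
print(keep)
print(merged)
```

The sketch assumes all replicates already share the same m/z grid (for example after binning); merging raw spectra with different m/z axes would need resampling first.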
Composable Preprocessing Pipeline
Build a composable, serializable preprocessing pipeline:
from maldiamrkit import MaldiSpectrum
from maldiamrkit.preprocessing import (
    PreprocessingPipeline,
    ClipNegatives, SqrtTransform, SavitzkyGolaySmooth,
    SNIPBaseline, MzTrimmer, TICNormalizer,
)
# Use the default pipeline
pipe = PreprocessingPipeline.default()
# Or build a custom pipeline
pipe = PreprocessingPipeline([
    ("clip", ClipNegatives()),
    ("sqrt", SqrtTransform()),
    ("smooth", SavitzkyGolaySmooth(window_length=15, polyorder=2)),
    ("baseline", SNIPBaseline(half_window=30)),
    ("trim", MzTrimmer(mz_min=2000, mz_max=20000)),
    ("norm", TICNormalizer()),
])
# Serialize to JSON/YAML for reproducibility
pipe.to_json("my_pipeline.json")
pipe = PreprocessingPipeline.from_json("my_pipeline.json")
# Apply to a spectrum
spec = MaldiSpectrum("data/spectrum.txt", pipeline=pipe)
spec.preprocess().bin(3)
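To show what a chain of named steps does to the intensities, here is a small standalone sketch of three of the simpler transforms (clipping, square-root transform, and total ion current normalization) applied in order. The functions are hypothetical plain-Python stand-ins for ClipNegatives, SqrtTransform, and TICNormalizer, and they make no claim about how those classes are implemented.

```python
import math


def clip_negatives(intensity):
    """Set negative intensities to zero."""
    return [max(0.0, v) for v in intensity]


def sqrt_transform(intensity):
    """Variance-stabilizing square root (input must be non-negative)."""
    return [math.sqrt(v) for v in intensity]


def tic_normalize(intensity):
    """Divide by the total ion current so intensities sum to 1."""
    total = sum(intensity)
    return [v / total for v in intensity] if total > 0 else list(intensity)


# Ordered (name, step) pairs, mirroring the pipeline layout above
steps = [("clip", clip_negatives), ("sqrt", sqrt_transform), ("norm", tic_normalize)]


def run_pipeline(intensity, steps):
    """Apply each step to the output of the previous one."""
    for _name, step in steps:
        intensity = step(intensity)
    return intensity


processed = run_pipeline([-1.0, 4.0, 9.0, 16.0], steps)
print(processed)
```

Step order matters here: clipping has to come before the square root, and normalization comes last so the final intensities sum to 1.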
Dataset Filtering
Use composable filters to select subsets of a MaldiSet:
from maldiamrkit import MaldiSet
from maldiamrkit.filters import SpeciesFilter, DrugFilter, QualityFilter, MetadataFilter
data = MaldiSet.from_directory("spectra/", "metadata.csv",
                               aggregate_by=dict(antibiotics="Drug"))
# Filter by species
ecoli = data.filter(SpeciesFilter("Escherichia coli"))
# Combine filters with & (and), | (or), ~ (not)
f = SpeciesFilter("Escherichia coli") & QualityFilter(min_snr=5.0)
high_quality_ecoli = data.filter(f)
# Filter by antibiotic resistance status
f = SpeciesFilter("Escherichia coli") & DrugFilter("Ceftriaxone", status="R")
resistant_ecoli = data.filter(f)
# Custom metadata filter
f = MetadataFilter("batch_id", lambda v: v == "batch_1")
batch1 = data.filter(f)
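For readers curious how filters can be combined with &, | and ~, the following sketch shows the usual Python operator-overloading pattern on plain dictionaries. The classes Filter, _Combined, and FieldEquals are hypothetical and exist only for this illustration; they are not the classes in maldiamrkit.filters.

```python
class Filter:
    """Base class: subclasses implement matches(row) -> bool."""

    def matches(self, row):
        raise NotImplementedError

    def __and__(self, other):
        return _Combined(lambda row: self.matches(row) and other.matches(row))

    def __or__(self, other):
        return _Combined(lambda row: self.matches(row) or other.matches(row))

    def __invert__(self):
        return _Combined(lambda row: not self.matches(row))


class _Combined(Filter):
    """Filter built from an arbitrary predicate."""

    def __init__(self, predicate):
        self._predicate = predicate

    def matches(self, row):
        return self._predicate(row)


class FieldEquals(Filter):
    """Hypothetical filter: keep rows whose field equals a value."""

    def __init__(self, field, value):
        self.field, self.value = field, value

    def matches(self, row):
        return row.get(self.field) == self.value


rows = [
    {"species": "Escherichia coli", "batch_id": "batch_1"},
    {"species": "Escherichia coli", "batch_id": "batch_2"},
    {"species": "Klebsiella pneumoniae", "batch_id": "batch_1"},
]
f = FieldEquals("species", "Escherichia coli") & ~FieldEquals("batch_id", "batch_2")
selected = [row for row in rows if f.matches(row)]
print(selected)
```

Because ~ binds more tightly than &, and & more tightly than |, mixed expressions are safest with explicit parentheses.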
Evaluation Metrics
AMR-specific evaluation following EUCAST/CLSI conventions:
from maldiamrkit.evaluation import (
    very_major_error_rate, major_error_rate,
    amr_classification_report, vme_scorer, me_scorer,
    LabelEncoder,
)
# Encode R/I/S labels to binary
enc = LabelEncoder(intermediate="susceptible")
y_binary = enc.fit_transform(y_raw)
# Compute individual metrics
vme = very_major_error_rate(y_true, y_pred)
me = major_error_rate(y_true, y_pred)
# Full classification report
report = amr_classification_report(y_true, y_pred)
# {'vme': 0.1, 'me': 0.05, 'sensitivity': 0.9, 'specificity': 0.95, ...}
# Use as scikit-learn scorers in cross-validation (pipe is a scikit-learn Pipeline, y holds binary labels)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipe, X, y, cv=5, scoring=vme_scorer)
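As a reminder of what the two headline error rates measure, here is a small sketch using the common definitions: a very major error (VME) is a resistant isolate predicted susceptible, and a major error (ME) is a susceptible isolate predicted resistant. The function names and the choice of denominators (all truly resistant isolates for VME, all truly susceptible isolates for ME) are assumptions; check the library documentation for the exact conventions used by very_major_error_rate and major_error_rate.

```python
def encode_ris(labels, intermediate="susceptible"):
    """Map R/I/S strings to 1 (resistant) or 0 (susceptible)."""
    mapping = {"R": 1, "S": 0, "I": 1 if intermediate == "resistant" else 0}
    return [mapping[label] for label in labels]


def vme_rate(y_true, y_pred):
    """Fraction of resistant isolates predicted susceptible."""
    preds_for_resistant = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(p == 0 for p in preds_for_resistant) / len(preds_for_resistant)


def me_rate(y_true, y_pred):
    """Fraction of susceptible isolates predicted resistant."""
    preds_for_susceptible = [p for t, p in zip(y_true, y_pred) if t == 0]
    return sum(p == 1 for p in preds_for_susceptible) / len(preds_for_susceptible)


# Assumes at least one resistant and one susceptible isolate are present
y_true = encode_ris(["R", "R", "R", "R", "S", "S", "I", "S"])
y_pred = [1, 1, 1, 0, 0, 1, 0, 0]

print(vme_rate(y_true, y_pred))
print(me_rate(y_true, y_pred))
```

With these definitions, sensitivity is 1 minus the VME rate and specificity is 1 minus the ME rate, which is why a missed resistant isolate is counted as the more serious error.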
Stratified Splitting
Prevent data leakage with species-aware and patient-grouped splits:
from maldiamrkit.evaluation import (
    stratified_species_drug_split,
    case_based_split,
    SpeciesDrugStratifiedKFold,
    CaseGroupedKFold,
)
# Single split stratified by species + drug label
X_train, X_test, y_train, y_test = stratified_species_drug_split(
    X, y, species=species_labels, test_size=0.2, random_state=42
)
# Patient-grouped split (no patient in both train and test)
X_train, X_test, y_train, y_test = case_based_split(
    X, y, case_ids=patient_ids, test_size=0.2
)
# Cross-validation splitters (sklearn-compatible)
cv = SpeciesDrugStratifiedKFold(n_splits=5)
for train_idx, test_idx in cv.split(X, y, species=species_labels):
    ...
cv = CaseGroupedKFold(n_splits=5)
for train_idx, test_idx in cv.split(X, y, groups=patient_ids):
    ...
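The leakage that patient-grouped splitting prevents is easiest to see in a toy example. The sketch below assigns whole cases to either side of the split, so no patient contributes spectra to both train and test. It is a simplified illustration with a hypothetical helper grouped_split, and it applies test_size to the number of cases rather than the number of spectra, which may differ from what case_based_split does.

```python
import random


def grouped_split(case_ids, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each case on one side only."""
    cases = sorted(set(case_ids))
    random.Random(seed).shuffle(cases)
    # At least one case goes to the test side
    n_test = max(1, round(len(cases) * test_size))
    held_out = set(cases[:n_test])
    train_idx = [i for i, c in enumerate(case_ids) if c not in held_out]
    test_idx = [i for i, c in enumerate(case_ids) if c in held_out]
    return train_idx, test_idx


patient_ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5", "p5", "p6"]
train_idx, test_idx = grouped_split(patient_ids, test_size=0.33)

train_cases = {patient_ids[i] for i in train_idx}
test_cases = {patient_ids[i] for i in test_idx}
print(train_cases & test_cases)  # empty set: no patient on both sides
```

A plain random split over spectra would often place two isolates from the same patient on opposite sides, which tends to inflate test performance.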
Command-Line Interface
Batch preprocess spectra or generate quality reports from the terminal:
# Preprocess and bin to a CSV feature matrix
maldiamrkit preprocess --input-dir data/ --output processed.csv --bin-width 3
# Also save individual preprocessed spectra as TXT files
maldiamrkit preprocess --input-dir data/ --output processed.csv --save-spectra-dir processed/
# Use a custom pipeline config
maldiamrkit preprocess --input-dir data/ --output processed.csv --pipeline config.yaml
# Generate quality report
maldiamrkit quality --input-dir data/ --output report.csv
Parallel Processing
Use the n_jobs parameter for multi-core processing:
from maldiamrkit import MaldiSet
from maldiamrkit.alignment import Warping
from maldiamrkit.detection import MaldiPeakDetector
# Parallel dataset loading
data = MaldiSet.from_directory("spectra/", "meta.csv", n_jobs=-1)
# Parallel peak detection
detector = MaldiPeakDetector(prominence=0.01, n_jobs=-1)
peaks = detector.fit_transform(X)
# Parallel alignment
warper = Warping(method="piecewise", n_jobs=-1)
X_aligned = warper.fit_transform(X)
Tutorials
For more detailed examples, see the notebooks:
- Quick Start - Loading, preprocessing, binning, and quality assessment
- Peak Detection - Local maxima and persistent homology methods
- Alignment - Warping methods and alignment quality
- Evaluation - AMR metrics, label encoding, and stratified splitting
Contributing
Pull requests, bug reports, and feature ideas are welcome: feel free to open a PR!
License
This project is licensed under the MIT License. See the LICENSE file for details.
Acknowledgements
This toolkit is inspired by and builds upon the methodology described in:
Weis, C., Cuénod, A., Rieck, B., et al. (2022). Direct antimicrobial resistance prediction from clinical MALDI-TOF mass spectra using machine learning. Nature Medicine, 28, 164–174. https://doi.org/10.1038/s41591-021-01619-9
Please consider citing this work if you find MaldiAMRKit useful.