The maldi_tof_classifier package offers a CLI and a Python 3 API for machine-learning-based classification of MALDI-TOF spectra measured by a Shimadzu 8030 MALDI-TOF mass spectrometer.

maldi-tof-classifier

Version: 0.3.0

The maldi-tof-classifier package provides functionality for:

  • Reading MALDI-TOF spectra
  • Preprocessing spectral data
  • Machine learning based classification

It is designed for spectra generated by a Shimadzu 8030 MALDI-TOF mass spectrometer.

Source code: https://github.com/ofmk94/maldi-tof-classifier

License: MIT


Installation

Python 3.10 or later is required.

Install the package from PyPI:

pip install maldi-tof-classifier

Additionally, the R package MALDIquant must be installed on the system. It can usually be installed from within R with:

install.packages("MALDIquant")

Overview

This README consists of two parts:

  1. CLI tool usage
  2. Python API and typical workflows

The package can be used as a Python library, but it is primarily a CLI tool.

It is strongly recommended to work with peak data and the default PeakExtractor.

Example data for download is available under: https://github.com/ofmk94/maldi-tof-classifier-data


Part 1 - CLI Tool Usage

1.1 Required directory structure

The CLI tool requires the following directory structure:

data_train/
    A/
        sample1.csv
        sample2.csv
    B/
        sample3.csv

data_predict/
    unknown1.csv
    unknown2.csv

cli_files/
    config.yaml

  • data_train: Contains subdirectories for the classes to be learned, for example A, B, C. The subdirectories may contain either .txt files with full spectra or .csv files with peak data as produced by a Shimadzu 8030 MALDI-TOF mass spectrometer. This directory is for training only.

  • data_predict: Contains the files to be classified, of the same type as the training data (either .txt full spectra or .csv peak data). This directory is for prediction only.

  • cli_files: Contains the files needed for the CLI setup and the result files. It must contain config.yaml.

Training and prediction must use the same file type and the same extractor.
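
The layout rules above can be checked before running the CLI. The following is a minimal sketch and not part of the package: check_layout is a hypothetical helper that relies only on the directory and file names described in this section.

```python
from pathlib import Path

SPECTRUM_SUFFIXES = {".csv", ".txt"}


def check_layout(root: Path) -> list[str]:
    """Return a list of problems found in the expected CLI directory layout."""
    problems = []
    train_dir = root / "data_train"
    predict_dir = root / "data_predict"

    if not (root / "cli_files" / "config.yaml").is_file():
        problems.append("cli_files/config.yaml is missing")

    # Each class is a subdirectory of data_train holding .csv or .txt files.
    class_dirs = []
    if train_dir.is_dir():
        class_dirs = sorted(d for d in train_dir.iterdir() if d.is_dir())
    if not class_dirs:
        problems.append("data_train has no class subdirectories")

    suffixes = set()
    for class_dir in class_dirs:
        files = [f for f in class_dir.iterdir() if f.suffix in SPECTRUM_SUFFIXES]
        if not files:
            problems.append(f"data_train/{class_dir.name} has no .csv or .txt files")
        suffixes.update(f.suffix for f in files)

    if predict_dir.is_dir():
        suffixes.update(
            f.suffix for f in predict_dir.iterdir() if f.suffix in SPECTRUM_SUFFIXES
        )
    else:
        problems.append("data_predict is missing")

    # Training and prediction must share one file type.
    if len(suffixes) > 1:
        problems.append("mixed .csv and .txt files; use a single file type")

    return problems


for problem in check_layout(Path(".")):
    print(problem)
```

An empty result means the directory names and file types match the structure described above; it does not validate the file contents.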


1.2 Output files

The following files are created inside cli_files during usage:

cli_files/pipeline.joblib
    created once the model with the classification pipeline is trained.

cli_files/training_performance.csv
    classification performance of the pipeline on the held-out test set,
    including accuracy, precision, recall, F1 score, and confusion matrix.

cli_files/predictions.csv
    predictions on data_predict data.

1.3 CLI commands

There are two commands available:

Train the model:

mtc train

Predict on new data:

mtc predict

Run both commands from a working directory that contains the directories described above.


1.4 Configuration via config.yaml

The training setup is configured through cli_files/config.yaml.

All parameters are optional; every setting has a default value.


1.5 Extractor settings

1.5.1 extractor_cls

Type of extractor.

Options:

  • "PeakExtractor" for working with .csv files containing peak data
  • "FullSpectraExtractor" for full spectra .txt files

Default:

extractor_cls: "PeakExtractor"

The same extractor must be used for mtc train and mtc predict.

1.5.2 extractor_params

Additional optional parameters for PeakExtractor or FullSpectraExtractor.

Default:

extractor_params: null

These parameters are passed directly to the selected extractor constructor.

For PeakExtractor, the main parameters are:

  • snr_thresh default 3.0
  • rel_shift_tolerance default 0.002
  • min_peak_freq default 0.25

Example:

extractor_params:
    snr_thresh: 3.0
    rel_shift_tolerance: 0.002

For FullSpectraExtractor, the main parameters are:

  • use_mz_cutoff default false
  • mz_cutoff_mass default 20000.0

Example:

extractor_params:
    use_mz_cutoff: true
    mz_cutoff_mass: 20000.0

The dataclasses for file location and file parsing are advanced options and should generally not be set by the CLI user.


1.6 Scaling and dimensionality reduction

1.6.1 scaler_cls

Scaling object.

Available options from sklearn.preprocessing:

  • "StandardScaler"
  • "MinMaxScaler"

Optional.

Default:

scaler_cls: null

1.6.2 dim_reducer_cls

Optional dimensionality reduction.

Options:

  • "PCA" from sklearn.decomposition
  • "SVD" using sklearn.decomposition.TruncatedSVD

Default:

dim_reducer_cls: null

1.6.3 n_components

Number of components to use in optional dimensionality reduction.

Type:

  • int

Default:

n_components: 20

1.7 Classifier settings

1.7.1 classifier_cls

Classification model.

Available options:

  • sklearn models: LogisticRegression, LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis, PLSRegression as PLS-DA, SVC, RandomForestClassifier
  • xgboost model: XGBClassifier
  • special option: OPLS-DA, implemented via LogisticRegression

Default:

classifier_cls: "RandomForestClassifier"

1.7.2 classifier_params

Optional parameters for the classifier.

Default:

classifier_params: null

These parameters are passed directly to the selected classifier constructor. Only parameters valid for the selected classifier should be used here.

Examples:

classifier_params:
    n_estimators: 200
    max_depth: 10

or:

classifier_params:
    C: 1.0
    max_iter: 1000

Refer to the documentation of scikit-learn or xgboost for the full list of supported constructor arguments.


1.8 Train/test split and balancing

1.8.1 test_size

Fraction of the training data held out as the test set.

Type:

  • float

Default:

test_size: 0.2

1.8.2 oversample

Whether simple oversampling should be performed for balancing training classes.

Type:

  • bool

Default:

oversample: true

1.9 Notes

  • It is strongly recommended to work with peak data and the default PeakExtractor.
  • Training and prediction must use the same extractor and the same data format.
  • The advanced file parsing options are usually not needed for standard CLI usage.

Part 2 - Python API and typical workflows

2.1 Overview

Docstrings contain more detailed information on parameters and behavior. This section illustrates typical usage.

The directory structure is the same as in Part 1.


2.2 Step 1 - Loading and preprocessing data

Recommended: peak data using PeakExtractor

from pathlib import Path

from sklearn.model_selection import train_test_split

from maldi_tof_classifier.extractors import PeakExtractor

TRAIN_DIR = Path(".") / "data_train"

extractor = PeakExtractor(snr_thresh=3.0)

# Read the peak data and the class labels from the class subdirectories
peaks_dfs, class_labels = extractor.extract_train_data(TRAIN_DIR)

# Hold out 20 % of the samples as the test set (same as the CLI default)
X_train, X_test, y_train, y_test = train_test_split(
    peaks_dfs, class_labels, test_size=0.2
)

# Use the train variant on the training split and the predict variant on the test split
X_train = extractor.transform_train_data(X_train)
X_test = extractor.transform_predict_data(X_test)

Alternative: full spectra using FullSpectraExtractor

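from pathlib import Path

from sklearn.model_selection import train_test_split

from maldi_tof_classifier.extractors import FullSpectraExtractor

TRAIN_DIR = Path(".") / "data_train"

extractor = FullSpectraExtractor(use_mz_cutoff=True, mz_cutoff_mass=20000.0)

spectra, class_labels, spots = extractor.extract_train_data(TRAIN_DIR)

# Hold out 20 % of the samples as the test set (same as the CLI default)
X_train, X_test, y_train, y_test = train_test_split(
    spectra, class_labels, test_size=0.2
)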

2.3 Step 2 - Label encoding

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)
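
After encoding, the classifier predicts integers, so predictions have to be mapped back to the class names. A small sketch with made-up labels, using only the scikit-learn LabelEncoder shown above:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Classes are sorted alphabetically and numbered from 0.
y_encoded = le.fit_transform(["B", "A", "B", "C"])

print(le.classes_.tolist())                   # ['A', 'B', 'C']
print(y_encoded.tolist())                     # [1, 0, 1, 2]

# Map integer predictions back to the original class names.
print(le.inverse_transform([2, 0]).tolist())  # ['C', 'A']
```

The encoder is fitted on the training labels only; a class that appears only in the test labels makes le.transform raise an error.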

2.4 Step 3 - Handle class imbalance (optional)

from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler()
X_train, y_train = ros.fit_resample(X_train, y_train)
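
Random oversampling duplicates randomly chosen samples of the smaller classes until every class is as large as the largest one. The sketch below illustrates that idea in plain Python; it is not the imblearn implementation, and random_oversample is a hypothetical helper.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())

    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        indices = [i for i, lab in enumerate(y) if lab == label]
        # Draw with replacement from the existing samples of this class.
        for _ in range(target - count):
            i = rng.choice(indices)
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out


X = [[1.0], [2.0], [3.0], [4.0]]
y = ["A", "A", "A", "B"]
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))  # Counter({'A': 3, 'B': 3})
```

Oversampling is applied to the training split only, as in the step above, so that duplicated samples cannot end up in both the training and the test set.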

2.5 Step 4 - Scaling (optional)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

2.6 Step 5 - Dimensionality reduction (optional)

from sklearn.decomposition import PCA

pca = PCA(n_components=20)

X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

2.7 Step 6 - Classification

RandomForestClassifier

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier()

classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

XGBClassifier

from xgboost import XGBClassifier

classifier = XGBClassifier()

classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
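
The predictions can then be compared with y_test. Section 1.2 lists accuracy, precision, recall, F1 score, and the confusion matrix as the reported metrics. The following plain-Python sketch shows how these are derived from true and predicted labels; classification_summary is a hypothetical helper, and in practice sklearn.metrics provides accuracy_score, classification_report, and confusion_matrix for the same purpose.

```python
from collections import Counter


def classification_summary(y_true, y_pred):
    """Compute accuracy, per-class precision/recall/F1 and a confusion matrix."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))

    # Rows are true classes, columns are predicted classes.
    matrix = [[pairs[(t, p)] for p in labels] for t in labels]
    accuracy = sum(pairs[(label, label)] for label in labels) / len(y_true)

    per_class = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum
        actual = sum(matrix[i])                    # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}

    return accuracy, per_class, matrix


accuracy, per_class, matrix = classification_summary([0, 0, 1, 1], [0, 1, 1, 1])
print(accuracy)  # 0.75
print(matrix)    # [[1, 1], [0, 2]]
```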

2.8 Neural network models

Available in maldi_tof_classifier.nn:

  • CNN1DClassifier
  • LSTMClassifier

For neural networks, a train/validation/test split and one-hot encoding are typically used.

Split

from sklearn.model_selection import train_test_split

X_train, X_val_test, y_train, y_val_test = train_test_split(
    spectra, class_labels, test_size=0.3
)

X_val, X_test, y_val, y_test = train_test_split(
    X_val_test, y_val_test, test_size=0.333
)
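
The two test_size values above combine into an approximate 70/20/10 split. A short arithmetic check, assuming the values 0.3 and 0.333 from the snippet:

```python
first_test_size = 0.3     # fraction held out by the first split
second_test_size = 0.333  # fraction of the held-out part used as the test set

train_frac = 1.0 - first_test_size
test_frac = first_test_size * second_test_size
val_frac = first_test_size - test_frac

print(f"train={train_frac:.3f} val={val_frac:.3f} test={test_frac:.3f}")
# train=0.700 val=0.200 test=0.100
```

The actual sample counts can differ slightly from these fractions because train_test_split rounds to whole samples.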

One-hot encoding

from tensorflow.keras.utils import to_categorical

# The labels must already be integer-encoded NumPy arrays (see 2.3 Label encoding)
n_classes = y_train.max() + 1

y_train = to_categorical(y_train, n_classes)
y_val = to_categorical(y_val, n_classes)
y_test = to_categorical(y_test, n_classes)

Example

from maldi_tof_classifier.nn import CNN1DClassifier

model = CNN1DClassifier(X_train, y_train)

model.fit(
    X_train,
    y_train,
    epochs=20,
    validation_data=(X_val, y_val)
)

y_pred = model.predict(X_test)

2.9 Pipeline API

Steps 4–6 (sections 2.5–2.7) can be combined into a single pipeline with generate_pipeline, which is imported as shown in the example in this section.

Components

  • Scaler (optional)
  • Dimensionality Reduction (optional)
  • Classifier (required)

Parameters

  • classifier_cls: The classifier class itself (not an instance), for example RandomForestClassifier.

  • classifier_params: Parameters passed to the classifier constructor.

  • scaler_cls: Optional scaler class.

  • dim_reducer_cls: Optional dimensionality reduction class.

  • n_components: Number of components for dimensionality reduction.

Example

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

from maldi_tof_classifier.core import generate_pipeline

pipeline = generate_pipeline(
    classifier_cls=RandomForestClassifier,
    classifier_params={"n_estimators": 100},
    scaler_cls=StandardScaler,
    dim_reducer_cls=PCA,
    n_components=20
)

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
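
For comparison, the same three components can be chained directly with scikit-learn. This is a sketch of an equivalent setup on synthetic data, not the internals of generate_pipeline, which are not described here; the step names and the random feature matrix are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an extracted feature matrix (40 samples, 30 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 30))
y = np.array([0, 1] * 20)

# Scaler, dimensionality reduction, and classifier run in this order.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("dim_reducer", PCA(n_components=20)),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=0)),
])

pipeline.fit(X, y)
y_pred = pipeline.predict(X)
print(y_pred.shape)  # (40,)
```

Fitting the scaler and the dimensionality reduction inside the pipeline ensures that both are learned from the training data only and then reused unchanged at prediction time.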

Author

Oliver Klein
oliver.klein@stud.hcw.ac.at
oliverfmklein@gmail.com


License

This project is licensed under the MIT License.

Copyright (c) 2026 Oliver Felix Matthias Klein (GitHub username: ofmk94)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Disclaimer

This README was written based on the original draft and revised into English Markdown format with assistance from ChatGPT (Version 5.3).

No liability is assumed for the provided software or for the contents of this README.


Last edited: April 16th, 2026
