MIPMLP

(Microbiome Preprocessing Machine Learning Pipeline)

MIPMLP is a modular pipeline for preprocessing 16S microbiome feature data (ASVs/OTUs) prior to classification tasks using machine learning.

It is based on the paper:
"Microbiome Preprocessing Machine Learning Pipeline", Frontiers in Immunology, 2021

Background

Raw microbiome data obtained from 16S sequencing (ASVs/OTUs) often requires careful preprocessing before it is suitable for machine learning (ML) classification tasks. MIPMLP (Microbiome Preprocessing Machine Learning Pipeline) was designed to improve ML performance by addressing issues such as sparsity, taxonomic redundancy, and skewed feature distributions.

MIPMLP consists of the following four modular steps:

1. Taxonomy Grouping
   Merge features according to a specified taxonomy level: Order, Family, or Genus.
   Grouping method options:
   - sum: total abundance
   - mean: average abundance
   - sub-PCA: PCA on each taxonomic group, retaining components explaining ≥50% of the variance
2. Normalization
   Normalize feature counts using:
   - log: log10(x + epsilon) (recommended)
   - relative: divide by total sample counts
3. Standardization (Z-scoring)
   Standardize across:
   - Samples (row-wise)
   - Features (column-wise)
   - Both
   - Or skip standardization altogether
4. Dimensionality Reduction (optional)
   Apply PCA or ICA to reduce the number of features.
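The first three steps can be illustrated with a small pandas sketch. This is not MIPMLP's implementation: the helper names (group_by_level, log_normalize, z_score), the toy lineage strings, and the use of the population standard deviation (ddof=0) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy count table: rows are samples, columns are features named by lineage.
counts = pd.DataFrame(
    {
        "k__Bacteria;f__Lachnospiraceae;g__Blautia;s__a": [10.0, 0.0, 5.0],
        "k__Bacteria;f__Lachnospiraceae;g__Blautia;s__b": [30.0, 20.0, 15.0],
        "k__Bacteria;f__Bacteroidaceae;g__Bacteroides;s__c": [0.0, 100.0, 50.0],
    },
    index=["s1", "s2", "s3"],
)


def group_by_level(df, level, how="mean"):
    """Merge columns that share their first `level` taxonomy ranks."""
    keys = pd.Index([";".join(col.split(";")[:level]) for col in df.columns])
    grouped = df.T.groupby(keys)
    merged = grouped.sum() if how == "sum" else grouped.mean()
    return merged.T


def log_normalize(df, epsilon=1e-5):
    """Apply log10(x + epsilon) to every entry."""
    return np.log10(df + epsilon)


def z_score(df, axis):
    """Standardize per feature (axis='col') or per sample (axis='row')."""
    if axis == "col":
        return (df - df.mean(axis=0)) / df.std(axis=0, ddof=0)
    return df.sub(df.mean(axis=1), axis=0).div(df.std(axis=1, ddof=0), axis=0)


grouped = group_by_level(counts, level=3, how="mean")  # two genus-level columns
logged = log_normalize(grouped)
standardized = z_score(logged, axis="col")
print(standardized)
```

The two Blautia columns collapse into one genus-level feature before the log transform, so the zero counts are smoothed by epsilon only after merging.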

These steps can be customized via a parameter dictionary as shown below.

How to Use

Installation & Setup

# (optional) create a virtual environment:
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# install dependencies:
pip install -r requirements.txt

Run the pipeline

import MIPMLP

# basic usage:
df_processed = MIPMLP.preprocess(df_train)

# full usage:
df_train_processed, df_test_processed = MIPMLP.preprocess(
    df_train,
    tag=tag_df,  # optional
    taxonomy_level=7,   # default: 7, options: 4-8
    taxnomy_group='mean',  # default: "mean", options: "sub PCA", "mean", "sum"
    epsilon=0.00001,   # default: 0.00001, range: 0-1
    normalization='log',  # default: "log", options: "log", "relative"
    z_scoring='No',  # default: "No", options: "row", "col", "both", "No"
    norm_after_rel='No',  # default: "No", options: "No", "z_after_relative" (only used with 'relative')
    pca=(0, 'PCA'),  # default: (0, "PCA"), use (n, "PCA") for dimensionality reduction, -1 for auto
    rare_bacteria_threshold=0.01,   # default: 0.01 (1%), removes bacteria that appear in fewer than this fraction of samples
    plot=False,   # default: False, options: True, False
    df_test=df_test_df,  # optional: test set to be preprocessed with same parameters
    external_sub_pca=sub_pca_model,  # optional: use pre-fitted SubPCA model instead of fitting
    external_pca=pca_model,  # optional: use pre-fitted PCA model instead of fitting
    drop_tax_prefix=True   # default: True, options: True, False
)

Behavior:

If df_test is provided, the pipeline returns both the train and the test DataFrames.
If not, it returns only the processed train DataFrame.
You may pass a pre-fitted PCA or SubPCA model; otherwise, the pipeline will fit one for you.
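The reason for preprocessing the test set with a model fitted on the training set can be sketched with a tiny PCA written in NumPy. SimplePCA is a hypothetical stand-in, not the class MIPMLP uses or expects for external_pca; it only shows the fit-on-train, transform-both pattern.

```python
import numpy as np


class SimplePCA:
    """Minimal PCA stand-in (hypothetical; for illustration only)."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        # Rows of vt are the principal directions of the centered data.
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        return self

    def transform(self, X):
        # Reuse the training mean and directions; nothing is refitted here.
        return (np.asarray(X, dtype=float) - self.mean_) @ self.components_.T


rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 5))
X_test = rng.normal(size=(8, 5))

pca = SimplePCA(n_components=2).fit(X_train)  # fitted on train only
train_reduced = pca.transform(X_train)
test_reduced = pca.transform(X_test)  # same projection applied to test
print(train_reduced.shape, test_reduced.shape)
```

Because the test samples are projected with the training mean and components, both sets end up in the same feature space and no information from the test set leaks into the fit.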

Input Format:

You can provide:

Option 1: A .biom file with raw OTU/ASV counts + a taxonomy .tsv file
Option 2: A merged .csv file that includes both features and taxonomy:
- First column: "ID" (sample IDs)
- Rows: individual samples
- Columns: ASVs/features
- Last row: taxonomy info, labeled "taxonomy"
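To make the Option 2 layout concrete, here is a minimal sketch that builds a tiny merged table in memory and separates the counts from the taxonomy row. The sample IDs, ASV names, and lineage strings are invented for illustration.

```python
import io

import pandas as pd

# A merged CSV: first column "ID", one row per sample, last row "taxonomy".
csv_text = """ID,ASV_1,ASV_2,ASV_3
sample_1,12,0,7
sample_2,3,5,0
taxonomy,k__Bacteria;p__Firmicutes,k__Bacteria;p__Bacteroidetes,k__Bacteria;p__Firmicutes
"""

merged = pd.read_csv(io.StringIO(csv_text))

# The taxonomy row maps each ASV column to its lineage string.
is_taxonomy = merged["ID"] == "taxonomy"
taxonomy = merged.loc[is_taxonomy].drop(columns="ID").iloc[0]

# The remaining rows are samples; convert their counts back to numbers.
counts = merged.loc[~is_taxonomy].set_index("ID").apply(pd.to_numeric)

print(counts)
print(taxonomy)
```

Because the taxonomy row holds strings, pandas reads every feature column as text, which is why the counts are converted explicitly after the row is split off.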

Optional: Tag File
You may also provide a tag file (as a DataFrame) containing class labels for each sample.
This is not required for preprocessing, but if present, MIPMLP will generate additional summary statistics relating features to classes.

Output

The returned value is a preprocessed DataFrame, ready for ML pipelines.
If both train and test are provided, both are returned.

(⚠️ If drop_tax_prefix=True (the default), taxonomy prefixes such as k__, p__, and g__ are removed from the feature names. Set it to False to retain the full taxonomy format in the output table.)
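The effect of dropping taxonomy prefixes can be sketched with a small regular expression. The function strip_prefixes is hypothetical, and the exact feature-name format produced by MIPMLP may differ.

```python
import re


def strip_prefixes(name):
    """Remove rank prefixes such as k__, p__, g__ from a lineage string."""
    parts = [part.strip() for part in name.split(";")]
    return ";".join(re.sub(r"^[a-z]__", "", part) for part in parts)


print(strip_prefixes("k__Bacteria;p__Firmicutes;g__Blautia"))
```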

If plot=True, the pipeline also produces a histogram showing the percentage of samples in which each bacterium appears.
(⚠️ If pca is enabled, plot=True is not recommended, because the visualization will not reflect the original features after dimensionality reduction.)
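The quantity behind that histogram, and behind rare_bacteria_threshold, is the fraction of samples in which each feature is present. A minimal sketch follows; the toy table, the 0.3 threshold, and the choice of a non-strict comparison are assumptions, not the package's exact rule.

```python
import pandas as pd

counts = pd.DataFrame(
    {
        "taxon_a": [5, 0, 3, 1],
        "taxon_b": [0, 0, 0, 2],
        "taxon_c": [0, 0, 0, 0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Fraction of samples with a nonzero count, per feature.
prevalence = (counts > 0).mean(axis=0)

# Keep only features present in at least `threshold` of the samples.
threshold = 0.3
kept = counts.loc[:, prevalence >= threshold]

print((prevalence * 100).round(1))  # percentage of samples per feature
print(list(kept.columns))
```

Passing prevalence * 100 to a histogram routine gives a plot of the kind described above.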

iMic

iMic is a method that combines information from different taxa and uses microbial taxonomy to improve data representation for machine learning. iMic translates the microbiome into images, and convolutional neural networks are then applied to those images.

micro2matrix

Translates the microbiome values and the cladogram into an image. micro2matrix also saves the created images in a given folder.
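The idea of turning one sample plus its taxonomy into a 2D array can be illustrated as follows. This is a simplified sketch, not the micro2matrix implementation: lineage_image is a hypothetical helper, each ancestor's value is taken as the mean of the leaves beneath it, and alphabetical sorting stands in for the cladogram-based ordering.

```python
import numpy as np
import pandas as pd

# One sample: leaf taxa (full lineage) mapped to their preprocessed values.
sample = pd.Series(
    {
        "Bacteria;Firmicutes;Blautia": 2.0,
        "Bacteria;Firmicutes;Roseburia": 4.0,
        "Bacteria;Bacteroidetes;Bacteroides": 6.0,
    }
)


def lineage_image(values):
    """Rows are taxonomy levels, columns are leaves, siblings are adjacent."""
    leaves = sorted(values.index)
    depth = max(name.count(";") for name in leaves) + 1
    image = np.zeros((depth, len(leaves)))
    for row in range(depth):
        # Ancestor of every leaf at this taxonomy level.
        ancestors = [";".join(name.split(";")[: row + 1]) for name in leaves]
        level_mean = values[leaves].groupby(np.asarray(ancestors)).mean()
        image[row] = [level_mean[a] for a in ancestors]
    return image, leaves


image, leaves = lineage_image(sample)
print(image)
```

Leaves that share an ancestor receive the same value in that ancestor's row, so related taxa form contiguous blocks that a convolution kernel can pick up.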

Input

- df: A pandas DataFrame in the same format as the input to the MIPMLP preprocessing (above).
- folder: A folder in which to save the new images.

Parameters

You can also set any of the MIPMLP preprocessing parameters; otherwise, the preprocessing runs with its default parameters (as explained above).

How to use

import pandas as pd
import MIPMLP

df = pd.read_csv("address/ASVS_file.csv")
folder = "save_img_folder"
MIPMLP.micro2matrix(df, folder)

CNN2 class - optional

A model of two convolutional layers followed by two fully connected layers.

CNN model parameters

- l1_loss: the coefficient of the L1 loss
- weight_decay: L2 regularization
- lr: learning rate
- batch_size: the batch size
- activation: activation function, one of "elu", "relu", "tanh"
- dropout: the dropout rate (common to all the layers)
- kernel_size_a: the size of the kernel of the first CNN layer (rows)
- kernel_size_b: the size of the kernel of the first CNN layer (columns)
- stride: the stride size of the first CNN layer
- padding: the padding size of the first CNN layer
- padding_2: the padding size of the second CNN layer
- kernel_size_a_2: the size of the kernel of the second CNN layer (rows)
- kernel_size_b_2: the size of the kernel of the second CNN layer (columns)
- stride_2: the stride size of the second CNN layer
- channels: number of channels of the first CNN layer
- channels_2: number of channels of the second CNN layer
- linear_dim_divider_1: the number by which the original input size is divided to get the number of neurons in the first FCN layer
- linear_dim_divider_2: the number by which the original input size is divided to get the number of neurons in the second FCN layer
- input_dim: the dimensions of the input image (rows, columns)

How to use

params = {
    "l1_loss": 0.1,
    "weight_decay": 0.01,
    "lr": 0.001,
    "batch_size": 128,
    "activation": "elu",
    "dropout": 0.1,
    "kernel_size_a": 4,
    "kernel_size_b": 4,
    "stride": 2,
    "padding": 3,
    "padding_2": 0,
    "kernel_size_a_2": 2,
    "kernel_size_b_2": 7,
    "stride_2": 3,
    "channels": 3,
    "channels_2": 14,
    "linear_dim_divider_1": 10,
    "linear_dim_divider_2": 6,
    "input_dim": (8, 100)
}
model = MIPMLP.CNN(params)
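When choosing kernel, stride, and padding values, it helps to check that the image does not shrink to nothing. The sketch below applies the standard convolution output-size formula to the example parameters above. It assumes plain convolutions with no dilation or pooling, and that the "original input size" is rows times columns with integer division; the actual model may differ on these points.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a standard convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


# Subset of the example hyperparameters that determines the shapes.
params = {
    "kernel_size_a": 4, "kernel_size_b": 4, "stride": 2, "padding": 3,
    "kernel_size_a_2": 2, "kernel_size_b_2": 7, "stride_2": 3, "padding_2": 0,
    "channels_2": 14,
    "linear_dim_divider_1": 10, "linear_dim_divider_2": 6,
    "input_dim": (8, 100),
}

rows, cols = params["input_dim"]

# First convolutional layer.
rows1 = conv_out(rows, params["kernel_size_a"], params["stride"], params["padding"])
cols1 = conv_out(cols, params["kernel_size_b"], params["stride"], params["padding"])

# Second convolutional layer.
rows2 = conv_out(rows1, params["kernel_size_a_2"], params["stride_2"], params["padding_2"])
cols2 = conv_out(cols1, params["kernel_size_b_2"], params["stride_2"], params["padding_2"])

flattened = params["channels_2"] * rows2 * cols2
fc1 = (rows * cols) // params["linear_dim_divider_1"]
fc2 = (rows * cols) // params["linear_dim_divider_2"]

print((rows1, cols1), (rows2, cols2), flattened, fc1, fc2)
```

Under these assumptions the 8 x 100 input becomes 6 x 52 after the first layer and 2 x 16 after the second, so a larger kernel_size_a_2 or stride_2 would leave no rows to convolve.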

The user should apply a trainer to the model after choosing the best hyperparameters using an NNI platform.

apply_iMic (a basic example run of iMic function)

A basic way to run iMic: it loads the images, divides them into a training set and a test set, and returns the real labels (train and test) and the predicted labels (train and test).

Input

- tag: A tag pandas DataFrame whose samples match those in the raw ASVs file.
- folder: The folder of images saved by the micro2matrix step.
- test_size: Fraction of the whole cohort used as the test set (default is 0.2).
- params: The iMic model's hyperparameters. They should be selected for each dataset separately, by grid search or NNI on an appropriate validation set. The default params are:

{
    "l1_loss": 0.1,
    "weight_decay": 0.01,
    "lr": 0.001,
    "batch_size": 128,
    "activation": "elu",
    "dropout": 0.1,
    "kernel_size_a": 4,
    "kernel_size_b": 4,
    "stride": 2,
    "padding": 3,
    "padding_2": 0,
    "kernel_size_a_2": 2,
    "kernel_size_b_2": 7,
    "stride_2": 3,
    "channels": 3,
    "channels_2": 14,
    "linear_dim_divider_1": 10,
    "linear_dim_divider_2": 6,
    "input_dim": (8, 235)
}

Note that the input_dim is also updated automatically during the run.

Output

A dictionary of {"pred_train": pred_train,"pred_test": pred_test,"y_train": y_train,"y_test": y_test}

How to use

import pandas as pd
import MIPMLP

# Load tag
tag = pd.read_csv("data/ibd_tag.csv", index_col=0)

# Prepare iMic images
otu = pd.read_csv("data/ibd_for_process.csv")
MIPMLP.micro2matrix(otu, folder="data/2D_images")

# Run a toy iMic model. Hyperparameters should be optimized beforehand.
dct = MIPMLP.apply_iMic(tag, folder="data/2D_images")
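The returned dictionary can then be scored with any metric. The sketch below computes accuracy and AUC with NumPy on a mock dictionary that has the same four keys. It assumes binary labels coded as 0/1 and predictions given as scores where higher means class 1; the real output format may differ, and the numbers are made up.

```python
import numpy as np

# Mock output with the same keys as the dictionary described above.
dct = {
    "pred_train": [0.2, 0.9, 0.3, 0.7],
    "y_train": [0, 1, 0, 1],
    "pred_test": [0.1, 0.4, 0.35, 0.8],
    "y_test": [0, 0, 1, 1],
}


def accuracy(y_true, scores, threshold=0.5):
    """Share of samples whose thresholded score equals the label."""
    y_true = np.asarray(y_true).ravel()
    predicted = (np.asarray(scores, dtype=float).ravel() >= threshold).astype(int)
    return float((predicted == y_true).mean())


def auc_score(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true).ravel()
    scores = np.asarray(scores, dtype=float).ravel()
    diff = scores[y_true == 1][:, None] - scores[y_true == 0][None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


for split in ("train", "test"):
    y, pred = dct[f"y_{split}"], dct[f"pred_{split}"]
    print(split, "accuracy:", accuracy(y, pred), "AUC:", auc_score(y, pred))
```

Comparing the train and test scores side by side is a quick check for overfitting before investing in hyperparameter optimization.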

Citation

Shtossel, Oshrit, et al. "Ordering taxa in image convolution networks improves microbiome-based machine learning accuracy." Gut Microbes 15.1 (2023): 2224474.
Jasner, Yoel, et al. "Microbiome preprocessing machine learning pipeline." Frontiers in Immunology 12 (2021): 677870.
