
MGM (Microbial General Model): a large-scale pretrained language model for interpretable microbiome data analysis.

Project description

MGM

Microbial General Model (MGM) is a large-scale pretrained language model designed for interpretable microbiome data analysis. MGM allows for fine-tuning and evaluation across various microbiome data analysis tasks.

MGM Pipeline

Installation

By pip

pip install microformer-mgm

By source

Install the MGM package using setup.py:

python setup.py install
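Alternatively, if you have a local checkout of the source tree, installing through pip (assumed to work here, since the package ships a setup.py) avoids invoking setup.py directly, which is deprecated in modern Python packaging:

```shell
# Run from the root of the source tree (the directory containing setup.py)
pip install .
```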

Usage

MGM can be utilized via the command line interface (CLI) with different modes. The general syntax is:

mgm <mode> [options]

Available Modes

construct

Converts input abundance data to a Genus-level count matrix, normalizes it using phylogeny, and constructs a microbiome corpus. The corpus represents each sample as a sentence of genera ordered from highest to lowest abundance rank.

Input: Data in hdf5, csv, or tsv format (features in rows, samples in columns)
Output: A pkl file containing the microbiome corpus

Example:

mgm construct -i infant_data/abundance.csv -o infant_corpus.pkl

For hdf5 files, specify the key using -k (default key is genus).
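For instance, a construct run over an HDF5 abundance table (the .h5 file name here is illustrative) might look like:

```shell
# Build the corpus from an HDF5 abundance table;
# -k names the HDF5 key (defaults to "genus" if omitted)
mgm construct -i infant_data/abundance.h5 -k genus -o infant_corpus.pkl
```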

pretrain

Pretrains the MGM model on the microbiome corpus in a GPT-style manner. Optionally, you can train the generator by providing a label file. If a label file is provided, the tokenized label is inserted after the <bos> token; the tokenizer is updated and the model's embedding layer expanded accordingly.

Input: Corpus from construct mode
Output: Pretrained MGM model

Examples:

mgm pretrain -i infant_corpus.pkl -o infant_model
mgm pretrain -i infant_corpus.pkl -l infant_data/meta_withbirth.csv -o infant_model_gen --with-label

Use --from-scratch to train the model from scratch instead of loading pretrained weights.
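For instance, to pretrain with randomly initialized weights rather than loading a pretrained checkpoint (the output name infant_model_scratch is illustrative):

```shell
# Same corpus as above, but skip loading pretrained weights
mgm pretrain -i infant_corpus.pkl -o infant_model_scratch --from-scratch
```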

train

Trains a supervised MGM model from scratch, without loading masked-pretraining weights; requires labeled data.

Input: Corpus from construct mode, label file (csv)
Output: Supervised MGM model

Example:

mgm train -i infant_corpus.pkl -l infant_data/meta_withbirth.csv -o infant_model_clf

finetune

Finetunes the MGM model for a new task, using labeled data and, optionally, a customized MGM model.

Input: Corpus from construct mode, label file (csv), pretrained model (optional)
Output: Finetuned MGM model

Example:

mgm finetune -i infant_corpus.pkl -l infant_data/meta_withbirth.csv -m infant_model -o infant_model_clf_finetune

predict

Predicts labels for input data using a trained supervised (expert) model. If a label file is provided, predictions are compared with the ground truth using various metrics.

Input: Corpus from construct mode, label file (optional), supervised MGM model
Output: Prediction results in csv format

Example:

mgm predict -E -i infant_corpus.pkl -l infant_data/meta_withbirth.csv -m infant_model_clf -o infant_prediction.csv

generate

Generates synthetic microbiome data using the pretrained MGM model. A prompt file is required for generating samples with specific labels.

Input: Pretrained MGM model
Output: Synthetic genus tensors in pickle format

Example:

mgm generate -m infant_model_gen -p infant_data/prompt.txt -n 100 -o infant_synthetic.pkl

reconstruct

Reconstructs abundance data from a ranked corpus.

Input: An abundance file (to train the reconstructor) or a trained reconstructor checkpoint (ckpt); the ranked corpus to reconstruct; the generator model, to obtain the label tokenizer if labels are present; a prompt file if the corpus contains labels

Output: Reconstructed abundance data; reconstructor model; decoded labels

Examples:

mgm reconstruct -a infant_data/abundance.csv -i infant_synthetic.pkl -g infant_model_generate -w True -o reconstructor_file 
mgm reconstruct -r reconstructor_file/reconstructor_model.ckpt -i infant_synthetic.pkl -g infant_model_generate -w True -o reconstructor_file 

For detailed usage of each mode, refer to the help message:

mgm <mode> --help
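Putting the modes together, a typical classification workflow chains the commands shown above (output names follow the earlier examples):

```shell
# 1. Build the corpus from raw abundance data
mgm construct -i infant_data/abundance.csv -o infant_corpus.pkl

# 2. Pretrain on the corpus
mgm pretrain -i infant_corpus.pkl -o infant_model

# 3. Finetune the pretrained model with labels
mgm finetune -i infant_corpus.pkl -l infant_data/meta_withbirth.csv -m infant_model -o infant_model_clf_finetune

# 4. Predict and evaluate against ground truth
mgm predict -E -i infant_corpus.pkl -l infant_data/meta_withbirth.csv -m infant_model_clf_finetune -o infant_prediction.csv
```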

Maintainers

  • Haohong Zhang (haohongzh@gmail.com): PhD Student, School of Life Science and Technology, Huazhong University of Science & Technology
  • Zixin Kang (29590kang@gmail.com): Undergraduate, School of Life Science and Technology, Huazhong University of Science & Technology
  • Kang Ning (ningkang@hust.edu.cn): Professor, School of Life Science and Technology, Huazhong University of Science & Technology

Download files

Download the file for your platform.

Source Distribution

microformer-mgm-0.5.8.tar.gz (33.5 MB)

Uploaded Source

Built Distribution


microformer_mgm-0.5.8-py3-none-any.whl (33.5 MB)

Uploaded Python 3

File details

Details for the file microformer-mgm-0.5.8.tar.gz.

File metadata

  • Download URL: microformer-mgm-0.5.8.tar.gz
  • Upload date:
  • Size: 33.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.9.7

File hashes

Hashes for microformer-mgm-0.5.8.tar.gz

  • SHA256: 391789d677695d2d98415b18fb039ad22773df5d8ccc6072d371659832d50106
  • MD5: da5073555ee42b827a16d15ff07897a2
  • BLAKE2b-256: 44feb6485b80e6f4fe38621ba8a8905b6ee0cbad6722de30b92836c4caea3a25


File details

Details for the file microformer_mgm-0.5.8-py3-none-any.whl.

File metadata

File hashes

Hashes for microformer_mgm-0.5.8-py3-none-any.whl

  • SHA256: 210891685565022ea869e88a7452769dc6fbe3f699d2a133e15338d1e30eb92b
  • MD5: aae142c78b9ea82c32984ba045e9c94a
  • BLAKE2b-256: 4b9c829a1e59d5e618756ce8e57553e15d13bb16c58d37896f03233d8697515a

