MGM (Microbial General Model): a large-scale pretrained language model for interpretable microbiome data analysis.
MGM
Microbial General Model (MGM) is a large-scale pretrained language model designed for interpretable microbiome data analysis. MGM allows for fine-tuning and evaluation across various microbiome data analysis tasks.
Installation
Via pip
pip install microformer-mgm
From source
Install the MGM package using setup.py:
python setup.py install
Usage
MGM can be utilized via the command line interface (CLI) with different modes. The general syntax is:
mgm <mode> [options]
Available Modes
construct
Converts input abundance data to a Genus-level count matrix, normalizes it using phylogeny, and constructs a microbiome corpus. The corpus represents each sample as a sentence, listing genera from highest to lowest abundance rank.
Input: Data in hdf5, csv, or tsv format (features in rows, samples in columns)
Output: A pkl file containing the microbiome corpus
Example:
mgm construct -i infant_data/abundance.csv -o infant_corpus.pkl
For hdf5 files, specify the key using -k (the default key is genus).
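Conceptually, each sample's abundance profile becomes a rank-ordered "sentence". A minimal illustrative sketch of that idea (the genus names and abundances below are made-up example data, not MGM's actual implementation):

```python
# Illustrative sketch: turn one sample's genus abundances into a
# rank-ordered "sentence", as the construct mode does conceptually.
abundances = {
    "Bacteroides": 0.42,
    "Bifidobacterium": 0.31,
    "Escherichia": 0.18,
    "Clostridium": 0.09,
}

# Sort genera from highest to lowest abundance to form the sentence.
sentence = sorted(abundances, key=abundances.get, reverse=True)
print(sentence)
# ['Bacteroides', 'Bifidobacterium', 'Escherichia', 'Clostridium']
```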
pretrain
Pretrains the MGM model on the microbiome corpus in a GPT-style (next-token prediction) manner. Optionally, you can train the generator by providing a label file. If a label file is provided, the tokenized label is inserted after the <bos> token; the tokenizer is updated accordingly and the model's embedding layer is expanded.
Input: Corpus from construct mode
Output: Pretrained MGM model
Examples:
mgm pretrain -i infant_corpus.pkl -o infant_model
mgm pretrain -i infant_corpus.pkl -l infant_data/meta_withbirth.csv -o infant_model_gen --with-label
Use --from-scratch to train the model from scratch instead of loading pretrained weights.
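As a rough illustration of the label behavior described above, the tokenized label follows the <bos> token in each training sequence. A minimal sketch (the token strings and the helper below are hypothetical, not MGM's actual vocabulary or API):

```python
# Sketch of splicing a label token in after <bos> for generator
# pretraining. Token strings here are hypothetical examples.
def add_label(tokens, label):
    # tokens: a rank-ordered sentence, e.g. ["<bos>", "Bacteroides", ..., "<eos>"]
    assert tokens[0] == "<bos>"
    return [tokens[0], f"<label:{label}>"] + tokens[1:]

seq = add_label(["<bos>", "Bacteroides", "Bifidobacterium", "<eos>"], "preterm")
print(seq)
# ['<bos>', '<label:preterm>', 'Bacteroides', 'Bifidobacterium', '<eos>']
```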
train
Trains a supervised MGM model from scratch, without loading pretrained weights; labeled data is required.
Input: Corpus from construct mode, label file (csv)
Output: Supervised MGM model
Example:
mgm train -i infant_corpus.pkl -l infant_data/meta_withbirth.csv -o infant_model_clf
finetune
Finetunes the MGM model to fit a new task, using labeled data and optionally a customized MGM model.
Input: Corpus from construct mode, label file (csv), pretrained model (optional)
Output: Finetuned MGM model
Example:
mgm finetune -i infant_corpus.pkl -l infant_data/meta_withbirth.csv -m infant_model -o infant_model_clf_finetune
predict
Predicts labels for input data using a supervised (expert) MGM model. If a label file is provided, the predictions are compared with the ground truth using various metrics.
Input: Corpus from construct mode, label file (optional), supervised MGM model
Output: Prediction results in csv format
Example:
mgm predict -E -i infant_corpus.pkl -l infant_data/meta_withbirth.csv -m infant_model_clf -o infant_prediction.csv
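The prediction CSV can be inspected with standard tools. A hedged sketch using only the standard library (the column names below are assumptions for illustration; check the actual file produced by mgm predict):

```python
import csv
import io

# Hypothetical prediction output; real column names may differ.
csv_text = """sample,prediction,label
S1,preterm,preterm
S2,term,term
S3,term,preterm
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
accuracy = sum(r["prediction"] == r["label"] for r in rows) / len(rows)
print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.67
```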
generate
Generates synthetic microbiome data using the pretrained MGM model. A prompt file is required for generating samples with specific labels.
Input: Pretrained MGM model
Output: Synthetic genus tensors in pickle format
Example:
mgm generate -m infant_model_gen -p infant_data/prompt.txt -n 100 -o infant_synthetic.pkl
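Since the synthetic output is a pickle file, it can be loaded with Python's standard pickle module. The structure of the loaded object is not documented here, so inspect it after loading; this round-trip sketch uses a dummy object in place of the real file:

```python
import pickle

# Round-trip sketch: mgm's synthetic output is a pickle file, so the
# standard pickle module can read it. A dummy object stands in for the
# real (undocumented) structure - print type(loaded) and inspect.
dummy = {"samples": [["Bacteroides", "Bifidobacterium"]]}
with open("infant_synthetic_demo.pkl", "wb") as f:
    pickle.dump(dummy, f)

with open("infant_synthetic_demo.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded == dummy)  # True
```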
reconstruct
Reconstructs abundance profiles from a ranked corpus.
Input: An abundance file (to train the reconstructor) or a trained reconstructor checkpoint (ckpt); the ranked corpus to reconstruct; the generator model, whose tokenizer decodes labels if present; a prompt file if the corpus contains labels
Output: Reconstructed abundance data; reconstructor model; decoded labels
Examples:
mgm reconstruct -a infant_data/abundance.csv -i infant_synthetic.pkl -g infant_model_generate -w True -o reconstructor_file
mgm reconstruct -r reconstructor_file/reconstructor_model.ckpt -i infant_synthetic.pkl -g infant_model_generate -w True -o reconstructor_file
For detailed usage of each mode, refer to the help message:
mgm <mode> --help
Maintainers
| Name | Email | Role & Organization |
|---|---|---|
| Haohong Zhang | haohongzh@gmail.com | PhD Student, School of Life Science and Technology, Huazhong University of Science & Technology |
| Zixin Kang | 29590kang@gmail.com | Undergraduate, School of Life Science and Technology, Huazhong University of Science & Technology |
| Kang Ning | ningkang@hust.edu.cn | Professor, School of Life Science and Technology, Huazhong University of Science & Technology |
File details
Details for the file microformer-mgm-0.5.8.tar.gz.
File metadata
- Download URL: microformer-mgm-0.5.8.tar.gz
- Upload date:
- Size: 33.5 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 391789d677695d2d98415b18fb039ad22773df5d8ccc6072d371659832d50106 |
| MD5 | da5073555ee42b827a16d15ff07897a2 |
| BLAKE2b-256 | 44feb6485b80e6f4fe38621ba8a8905b6ee0cbad6722de30b92836c4caea3a25 |
File details
Details for the file microformer_mgm-0.5.8-py3-none-any.whl.
File metadata
- Download URL: microformer_mgm-0.5.8-py3-none-any.whl
- Upload date:
- Size: 33.5 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 210891685565022ea869e88a7452769dc6fbe3f699d2a133e15338d1e30eb92b |
| MD5 | aae142c78b9ea82c32984ba045e9c94a |
| BLAKE2b-256 | 4b9c829a1e59d5e618756ce8e57553e15d13bb16c58d37896f03233d8697515a |