Exact and pervasive expert model for source tracking based on transfer learning
EXPERT
Microbial communities show habitat-specific patterns and interact in complex ways with their environments and hosts, which makes microbial source tracking difficult. Diverse, context-dependent applications require quantifying the contributions of different niches (biomes), a demand that has overwhelmed existing methods. Moreover, existing source tracking methods do not extend well to samples from understudied biomes or from longitudinal studies.
Built upon biome ontology information and transfer learning techniques, EXPERT is context-aware and can easily expand the supervised model's search scope to include context-dependent community samples and understudied biomes. At the same time, it outperforms current approaches in source tracking accuracy and speed. EXPERT's performance has been demonstrated on multiple source tracking tasks, including samples collected at different disease stages and longitudinal samples. For example, on 635 samples from a recent study of colorectal cancer, EXPERT achieved an AUROC of 0.977 when predicting the host's phenotypic status. In summary, EXPERT extends model-based source tracking to versatile context-dependent settings, enabling pervasive and in-depth knowledge discovery from microbiome data.
If you use EXPERT in a scientific publication, we would appreciate citations to the following paper:
Enabling technology for microbial source tracking based on transfer learning: From ontology-aware general knowledge to context-aware expert systems
Hui Chong, Qingyang Yu, Yuguo Zha, Guangzhou Xiong, Nan Wang, Chuqing Sun, Sicheng Wu, Weihua Chen, Kang Ning
bioRxiv 2021.01.29.428751; doi: https://doi.org/10.1101/2021.01.29.428751
If you have any questions about our work, feel free to contact us.
Thank you for using EXPERT.
Current features
- Context-aware ability to adapt to microbiome studies via transfer learning
- Fast, accurate and interpretable source tracking via ontology-aware forward propagation
- Selective learning from training data
- Ultra-fast data converting & cleaning via in-memory NCBI taxonomy database
- Parallelized feature encoding via tensorflow.keras
Installation
python setup.py install
Usage
Ontology construction
Construct a biome ontology using microbiomes.txt.
expert construct -i microbiomes.txt -o ontology.pkl
# Also equivalent to
expert construct --input microbiomes.txt --output ontology.pkl
- Input: microbiomes.txt file, containing the path from the "root" node to each leaf node of the biome ontology.
root:Environmental:Terrestrial:Soil
root:Environmental:Terrestrial:Soil:Agricultural
root:Environmental:Terrestrial:Soil:Boreal_forest
root:Environmental:Terrestrial:Soil:Contaminated
root:Environmental:Terrestrial:Soil:Crop
root:Environmental:Terrestrial:Soil:Crop:Agricultural_land
root:Environmental:Terrestrial:Soil:Desert
root:Environmental:Terrestrial:Soil:Forest_soil
root:Environmental:Terrestrial:Soil:Grasslands
root:Environmental:Terrestrial:Soil:Loam:Agricultural
root:Environmental:Terrestrial:Soil:Permafrost
root:Environmental:Terrestrial:Soil:Sand
root:Environmental:Terrestrial:Soil:Tropical_rainforest
root:Environmental:Terrestrial:Soil:Uranium_contaminated
root:Environmental:Terrestrial:Soil:Wetlands
root:Host-associated:Plants:Rhizosphere:Soil
- Output: constructed biome ontology (pickle format, non-human-readable).
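To make the input format concrete, the following is a minimal sketch of how colon-separated paths like those above can be merged into a tree. It is not EXPERT's internal implementation; the function names `build_ontology` and `depth` are hypothetical, and the nested-dict representation is an assumption chosen for illustration.

```python
from typing import Dict, Iterable


def build_ontology(paths: Iterable[str], sep: str = ":") -> Dict[str, dict]:
    """Merge 'root:...:leaf' paths into a nested dict keyed by biome name."""
    tree: Dict[str, dict] = {}
    for line in paths:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        node = tree
        for name in line.split(sep):
            # Reuse an existing child or create a new empty one.
            node = node.setdefault(name, {})
    return tree


def depth(tree: Dict[str, dict]) -> int:
    """Number of layers on the longest path, counting the root."""
    return 0 if not tree else 1 + max(depth(child) for child in tree.values())


# A few paths in the same format as microbiomes.txt.
example_paths = [
    "root:Environmental:Terrestrial:Soil",
    "root:Environmental:Terrestrial:Soil:Crop",
    "root:Environmental:Terrestrial:Soil:Crop:Agricultural_land",
    "root:Host-associated:Plants:Rhizosphere:Soil",
]
ontology = build_ontology(example_paths)
print(sorted(ontology["root"]))  # ['Environmental', 'Host-associated']
print(depth(ontology))           # 6
```

Shared prefixes are stored once, so each distinct prefix of a path becomes one node at the corresponding layer.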
Source mapping
Map the source environments of samples to the biome ontology.
expert map --to-otlg -t ontology.pkl -i mapper.csv -o labels.h5
# Also equivalent to
expert map --to-otlg --otlg ontology.pkl --input mapper.csv --output labels.h5
- Input: the mapper file, contains biome source information for samples.
 | Env | SampleID |
---|---|---|
0 | root:Engineered:Wastewater | ERR2260442 |
1 | root:Engineered:Wastewater | SRR980322 |
2 | root:Engineered:Wastewater | ERR2985272 |
3 | root:Engineered:Wastewater | ERR2814648 |
4 | root:Engineered:Wastewater | ERR2985275 |
- Output: the labels for samples in each layer of the biome ontology (HDF format, non-human-readable).
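As an illustration of what "labels in each layer" can mean, the sketch below expands one biome path into one label per ontology layer. This is only an assumed reading of the output, not EXPERT's code: `layer_labels` is a hypothetical helper, layer n is taken to be the first n path components (matching names such as root:Engineered in layer 2 of the search results), and padding with "Unknown" below the annotated depth is an assumption.

```python
from typing import Dict


def layer_labels(biome_path: str, n_layers: int, sep: str = ":") -> Dict[int, str]:
    """Return {layer: label} for layers 2..n_layers of a biome path.

    Layer n is labelled with the first n components of the path.
    Layers deeper than the path itself get "Unknown" (an assumption).
    """
    parts = biome_path.split(sep)
    labels: Dict[int, str] = {}
    for layer in range(2, n_layers + 1):
        if layer <= len(parts):
            labels[layer] = sep.join(parts[:layer])
        else:
            labels[layer] = "Unknown"
    return labels


labels = layer_labels("root:Engineered:Wastewater", n_layers=4)
for layer, label in labels.items():
    print(layer, label)
```

For the mapper rows shown above, every sample would receive root:Engineered at layer 2 and root:Engineered:Wastewater at layer 3 under these assumptions.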
Data converting & cleaning
Convert input data to a count matrix at the genus level.
expert convert -i countMatrices.txt -o countMatrix.h5 --in-cm
# Also equivalent to
expert convert --input countMatrices.txt --output countMatrix.h5 --in-cm
- Input: a text file listing the paths to the input count matrix files / OTU tables, one per line.
datasets/soil_dataset/root:Host-associated:Plants:Rhizosphere:Soil/MGYS00005146-ERR1690680.tsv
datasets/soil_dataset/root:Host-associated:Plants:Rhizosphere:Soil/MGYS00005146-ERR1689675.tsv
datasets/soil_dataset/root:Host-associated:Plants:Rhizosphere:Soil/MGYS00000513-ERR986792.tsv
datasets/soil_dataset/root:Host-associated:Plants:Rhizosphere:Soil/MGYS00005146-ERR1691198.tsv
datasets/soil_dataset/root:Host-associated:Plants:Rhizosphere:Soil/MGYS00001704-ERR1905845.tsv
datasets/soil_dataset/root:Host-associated:Plants:Rhizosphere:Soil/MGYS00005146-ERR1689214.tsv
datasets/soil_dataset/root:Host-associated:Plants:Rhizosphere:Soil/MGYS00005146-ERR1689910.tsv
- Output: converted count matrix file at the genus level (HDF format, non-human-readable).
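The idea of a genus-level count matrix can be sketched as follows: each lineage string is truncated at the g__ rank and counts sharing the same genus are summed. This is a simplified illustration, not EXPERT's conversion code, which per the feature list also cleans names against the NCBI taxonomy database. The helpers `genus_key` and `collapse_to_genus`, the "Unassigned" bucket, and the toy counts are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Optional

# Rank prefixes as they appear in the lineage strings, down to genus.
RANKS = ("sk", "k", "p", "c", "o", "f", "g")


def genus_key(lineage: str) -> Optional[str]:
    """Return the lineage truncated at genus, or None if genus is unassigned."""
    fields = [field.strip() for field in lineage.split(";")]
    if len(fields) < len(RANKS):
        return None  # lineage stops above genus level
    genus = fields[len(RANKS) - 1]
    if not genus.startswith("g__") or genus == "g__":
        return None  # genus rank present but empty
    return ";".join(fields[: len(RANKS)])


def collapse_to_genus(counts: Dict[str, float]) -> Dict[str, float]:
    """Sum counts per genus; lineages without a genus go to 'Unassigned'."""
    collapsed: Dict[str, float] = defaultdict(float)
    for lineage, count in counts.items():
        key = genus_key(lineage)
        collapsed[key if key is not None else "Unassigned"] += count
    return dict(collapsed)


# Toy counts for illustration only.
lacto = "sk__Bacteria;k__;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus"
toy_counts = {
    lacto + ";s__Lactobacillus_reuteri": 3.0,
    lacto: 2.0,
    "sk__Bacteria": 5.0,
    "sk__Bacteria;k__;p__;c__;o__;f__;g__;s__agricultural_soil_bacterium_SC-I-11": 9.0,
}
genus_counts = collapse_to_genus(toy_counts)
print(genus_counts[lacto])          # 5.0
print(genus_counts["Unassigned"])   # 14.0
```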
Ab initio training
Build an EXPERT model from scratch and train it.
expert train -i countMatrix.h5 -l labels.h5 -t ontology.pkl -o model
# Also equivalent to
expert train --input countMatrix.h5 --labels labels.h5 --otlg ontology.pkl --output model
- Input: biome ontology and converted count matrix at the genus level (along with labels for the samples in the count matrix).
- Output: trained model.
Fast adaptation
expert transfer -i countMatrix.h5 -l labels.h5 -t ontology.pkl -o model
# Also equivalent to
expert transfer --input countMatrix.h5 --labels labels.h5 --otlg ontology.pkl --output model
- Input: biome ontology and converted count matrix at the genus level (along with labels for the samples in the count matrix).
- Output: trained model.
Source tracking
expert search -i countMatrix.h5 -o searchResult -m model
# Also equivalent to
expert search --input countMatrix.h5 --output searchResult --model model
- Input: converted count matrix at the genus level.
- Output: search result (multi-layer).
searchResult
├── layer-2.csv
├── layer-3.csv
├── layer-4.csv
├── layer-5.csv
└── layer-6.csv
Take layer-2.csv as an example.
 | root:Engineered | root:Environmental | root:Host-associated | root:Mixed | Unknown |
---|---|---|---|---|---|
ERR2278752 | 0.0041427016 | 0.26372418 | 0.68632126 | 0.00040003657 | 0.045411825 |
ERR2278753 | 0.002841179 | 0.07928896 | 0.91735524 | 0.00051463145 | 0.0 |
ERR2666855 | 0.0006751048 | 0.0021803565 | 0.9970531 | 9.1493675e-05 | 0.0 |
ERR2666860 | 0.0005227786 | 0.013902989 | 0.98542625 | 0.00014803928 | 0.0 |
ERR2666881 | 0.0009569057 | 0.0023957777 | 0.9965403 | 0.00010694566 | 0.0 |
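A per-layer result such as layer-2.csv can be post-processed with a few lines of Python. The sketch below picks the source with the largest contribution for each sample. It assumes the file is comma-separated with sample IDs in the first column and one column per source, which may not match the exact on-disk layout; `top_sources` is a hypothetical helper, and the inline data repeats two rows from the table above.

```python
import csv
import io
from typing import Dict, TextIO, Tuple

# Two rows copied from the example table, in an assumed CSV layout.
layer2_csv = """\
,root:Engineered,root:Environmental,root:Host-associated,root:Mixed,Unknown
ERR2278752,0.0041427016,0.26372418,0.68632126,0.00040003657,0.045411825
ERR2666855,0.0006751048,0.0021803565,0.9970531,9.1493675e-05,0.0
"""


def top_sources(handle: TextIO) -> Dict[str, Tuple[str, float]]:
    """Map each sample ID to (dominant source, contribution)."""
    reader = csv.reader(handle)
    sources = next(reader)[1:]  # drop the empty index header cell
    result: Dict[str, Tuple[str, float]] = {}
    for row in reader:
        if not row:
            continue  # skip empty lines
        values = [float(value) for value in row[1:]]
        best = max(range(len(values)), key=values.__getitem__)
        result[row[0]] = (sources[best], values[best])
    return result


dominant = top_sources(io.StringIO(layer2_csv))
for sample, (source, contribution) in dominant.items():
    print(sample, source, contribution)
```

For a real result file, passing `open("searchResult/layer-2.csv", newline="")` in place of the `io.StringIO` object should work under the same layout assumption.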
Evaluation
expert evaluate -i searchResultFolder -l labels.h5 -o EvaluationReport -p NUMProcesses
# Also equivalent to
expert evaluate --input searchResultFolder --labels labels.h5 --output EvaluationReport --processors NUMProcesses
- Input: multi-layer labels and search result (source contribution) for samples.
- Output: label-based evaluation report.
EvaluationReport
├── layer-2
│ └── root:Host-associated.csv
├── layer-2.csv
├── layer-3
│ └── root:Host-associated:Human.csv
├── layer-3.csv
├── layer-4
│ ├── root:Host-associated:Human:Circulatory_system.csv
│ ├── root:Host-associated:Human:Digestive_system.csv
│ ├── root:Host-associated:Human:Lympathic_system.csv
│ ├── root:Host-associated:Human:Reproductive_system.csv
│ ├── root:Host-associated:Human:Respiratory_system.csv
│ └── root:Host-associated:Human:Skin.csv
├── layer-4.csv
├── layer-5
│ ├── root:Host-associated:Human:Circulatory_system:Blood.csv
│ ├── ...
│ └── root:Host-associated:Human:Respiratory_system:Pulmonary_system.csv
├── layer-5.csv
├── layer-6
│ ├── root:Host-associated:Human:Digestive_system:Large_intestine:Fecal.csv
│ ├── ...
│ └── root:Host-associated:Human:Respiratory_system:Pulmonary_system:Sputum.csv
└── layer-6.csv
Take layer-4/root:Host-associated:Human:Skin.csv as an example.
t | TN | FP | FN | TP | Acc | Sn | Sp | TPR | FPR | Rc | Pr | F1 | ROC-AUC | F-max |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.0 | 0 | 47688 | 0 | 4847 | 0.0923 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0923 | 0.1689 | 0.9951 | 0.9374 |
0.01 | 44794 | 2893 | 30 | 4816 | 0.9444 | 0.9938 | 0.9393 | 0.9938 | 0.0607 | 0.9938 | 0.6247 | 0.7672 | 0.9951 | 0.9374 |
0.02 | 45545 | 2142 | 44 | 4802 | 0.9584 | 0.9909 | 0.9551 | 0.9909 | 0.0449 | 0.9909 | 0.6915 | 0.8146 | 0.9951 | 0.9374 |
0.03 | 45934 | 1753 | 59 | 4787 | 0.9655 | 0.9878 | 0.9632 | 0.9878 | 0.0368 | 0.9878 | 0.732 | 0.8409 | 0.9951 | 0.9374 |
0.04 | 46228 | 1459 | 73 | 4773 | 0.9708 | 0.9849 | 0.9694 | 0.9849 | 0.0306 | 0.9849 | 0.7659 | 0.8617 | 0.9951 | 0.9374 |
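The threshold-dependent columns of the report can be recomputed from the four confusion counts. The sketch below uses the standard definitions, assuming Sn, TPR and Rc all denote sensitivity (recall), Sp denotes specificity and Pr denotes precision; `confusion_metrics` is a hypothetical helper, not part of EXPERT. With the counts from the t = 0.01 row, these definitions reproduce the reported values to four decimals.

```python
from typing import Dict


def confusion_metrics(tn: int, fp: int, fn: int, tp: int) -> Dict[str, float]:
    """Standard binary classification metrics from confusion counts."""
    total = tn + fp + fn + tp
    sn = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity = TPR = recall
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # specificity
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # false positive rate
    pr = tp / (tp + fp) if (tp + fp) else 0.0   # precision
    f1 = 2 * pr * sn / (pr + sn) if (pr + sn) else 0.0
    acc = (tp + tn) / total if total else 0.0
    return {"Acc": acc, "Sn": sn, "Sp": sp, "FPR": fpr, "Pr": pr, "F1": f1}


# Counts from the t = 0.01 row of the example report.
row = confusion_metrics(tn=44794, fp=2893, fn=30, tp=4816)
for name, value in row.items():
    print(f"{name}: {value:.4f}")
```

ROC-AUC and F-max are constant across rows because they summarize all thresholds at once, so they cannot be derived from a single row.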
Run the program with the -h option to see a detailed description of work modes & options.
Input abundance data
EXPERT takes two kinds of **abundance data** as inputs.
Taxonomic assignment results for a single sample (OTU table)
Note that the first line is a header: "# Constructed from biom file".
# Constructed from biom file | | |
---|---|---|
# OTU ID | ERR1754760 | taxonomy |
207119 | 19.0 | sk__Archaea |
118090 | 45.0 | sk__Archaea;k__;p__Thaumarchaeota;c__;o__Nitrosopumilales;f__Nitro... |
153156 | 38.0 | sk__Archaea;k__;p__Thaumarchaeota;c__;o__Nitrosopumilales;f__Nitro... |
131704 | 1.0 | sk__Archaea;k__;p__Thaumarchaeota;c__Nitrososphaeria;o__Nitrososp... |
103181 | 5174.0 | sk__Bacteria |
157361 | 9.0 | sk__Bacteria;k__;p__;c__;o__;f__;g__;s__agricultural_soil_bacterium_SC-I-11 |
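For readers preparing their own inputs, the sketch below reads a single-sample OTU table of this shape. It assumes the file is tab-separated (the example input paths end in .tsv) and that the columns are OTU ID, one sample column, and taxonomy; `read_otu_table` is a hypothetical helper and not EXPERT's parser. The inline data repeats two rows from the table above.

```python
import csv
import io
from typing import Dict, TextIO, Tuple

# Two rows from the example table, in an assumed tab-separated layout.
otu_tsv = (
    "# Constructed from biom file\n"
    "# OTU ID\tERR1754760\ttaxonomy\n"
    "207119\t19.0\tsk__Archaea\n"
    "103181\t5174.0\tsk__Bacteria\n"
)


def read_otu_table(handle: TextIO) -> Tuple[str, Dict[str, float]]:
    """Return (sample ID, {taxonomy: summed count}) for one OTU table."""
    first_line = handle.readline()
    if not first_line.startswith("# Constructed from biom file"):
        raise ValueError("missing '# Constructed from biom file' header line")
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)  # '# OTU ID', sample ID, 'taxonomy'
    sample_id = header[1]
    counts: Dict[str, float] = {}
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or truncated lines
        taxonomy = row[2]
        # Several OTUs can share one lineage, so accumulate their counts.
        counts[taxonomy] = counts.get(taxonomy, 0.0) + float(row[1])
    return sample_id, counts


sample_id, counts = read_otu_table(io.StringIO(otu_tsv))
print(sample_id)
print(counts)
```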
Taxonomic assignment results for multiple samples (count matrix)
#SampleID | ERR1844510 | ERR1844449 | ERR1844450 | ERR1844451 |
---|---|---|---|---|
sk__Archaea | 1.0 | 17.0 | 8.0 | 16.0 |
sk__Archaea;k__;p__Crenarchaeota | 0 | 0 | 0 | 0 |
sk__Archaea;k__;p__Euryarchaeota | 8.0 | 2.0 | 3.0 | 1.0 |
sk__Archaea;k__;p__Eury...;c__... | 0 | 0 | 0 | 0 |
sk__Archaea;k__;p__Eury...;c__...;o__... | 0 | 0 | 0 | 0 |
License
Maintainer
Name | Email | Organization |
---|---|---|
Hui Chong | huichong.me@gmail.com | Research Assistant, School of Life Science and Technology, Huazhong University of Science & Technology |
Kang Ning | ningkang@hust.edu.cn | Professor, School of Life Science and Technology, Huazhong University of Science & Technology |