Exact and pervasive expert model for source tracking based on transfer learning

EXPERT

Habitat-specific patterns in microbial communities, together with complex interactions between communities and their environments or hosts, make microbial source tracking difficult. Diverse, context-dependent applications require quantifying the contributions of different niches (biomes), a demand that overwhelms existing methods. Moreover, existing source tracking methods do not extend well to samples from understudied biomes or from longitudinal studies.

Built on biome ontology information and transfer learning, EXPERT is context-aware: it can readily expand the supervised model's search scope to include context-dependent community samples and understudied biomes. At the same time, it is superior to current approaches in source tracking accuracy and speed. This has been demonstrated on multiple source tracking tasks, including samples collected at different disease stages and longitudinal samples. For example, on 635 samples from a recent study of colorectal cancer, EXPERT achieved an AUROC of 0.977 when predicting the host's phenotypic status. In summary, EXPERT enables model-based source tracking in versatile, context-dependent settings and supports pervasive, in-depth knowledge discovery from microbiome data.

If you use EXPERT in a scientific publication, we would appreciate citations to the following paper:

Enabling technology for microbial source tracking based on transfer learning: From ontology-aware general knowledge to context-aware expert systems
Hui Chong, Qingyang Yu, Yuguo Zha, Guangzhou Xiong, Nan Wang, Chuqing Sun, Sicheng Wu, Weihua Chen, Kang Ning
bioRxiv 2021.01.29.428751; doi: https://doi.org/10.1101/2021.01.29.428751

If you have any questions about our work, feel free to contact us.

Thank you for using EXPERT.

Current features

  • Context-aware ability to adapt to microbiome studies via transfer learning
  • Fast, accurate and interpretable source tracking via ontology-aware forward propagation
  • Selective learning from training data
  • Ultra-fast data converting & cleaning via in-memory NCBI taxonomy database
  • Parallelized feature encoding via tensorflow.keras

Installation

python setup.py install

Usage

Ontology construction

Construct a biome ontology using microbiomes.txt.

expert construct -i microbiomes.txt -o ontology.pkl
# Also equivalent to
expert construct --input microbiomes.txt --output ontology.pkl
  • Input: the microbiomes.txt file, which contains the path from the "root" node to each leaf node of the biome ontology.
root:Environmental:Terrestrial:Soil
root:Environmental:Terrestrial:Soil:Agricultural
root:Environmental:Terrestrial:Soil:Boreal_forest
root:Environmental:Terrestrial:Soil:Contaminated
root:Environmental:Terrestrial:Soil:Crop
root:Environmental:Terrestrial:Soil:Crop:Agricultural_land
root:Environmental:Terrestrial:Soil:Desert
root:Environmental:Terrestrial:Soil:Forest_soil
root:Environmental:Terrestrial:Soil:Grasslands
root:Environmental:Terrestrial:Soil:Loam:Agricultural
root:Environmental:Terrestrial:Soil:Permafrost
root:Environmental:Terrestrial:Soil:Sand
root:Environmental:Terrestrial:Soil:Tropical_rainforest
root:Environmental:Terrestrial:Soil:Uranium_contaminated
root:Environmental:Terrestrial:Soil:Wetlands
root:Host-associated:Plants:Rhizosphere:Soil
  • Output: constructed biome ontology (pickle format, non-human-readable).
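
To make the input format concrete, here is a minimal sketch of how colon-separated biome paths nest into a tree. The `build_tree` helper and the dict-of-dicts representation are assumptions made for illustration; they are not EXPERT's internal ontology format, which is stored in the pickle file.

```python
from typing import Dict, Iterable


def build_tree(paths: Iterable[str]) -> Dict[str, dict]:
    """Nest colon-separated biome paths into a dict-of-dicts tree.

    Hypothetical helper for illustration only; not part of EXPERT.
    """
    tree: Dict[str, dict] = {}
    for line in paths:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        node = tree
        for name in line.split(":"):
            # Create the child node on first sight, then descend into it.
            node = node.setdefault(name, {})
    return tree


paths = [
    "root:Environmental:Terrestrial:Soil",
    "root:Environmental:Terrestrial:Soil:Crop",
    "root:Environmental:Terrestrial:Soil:Crop:Agricultural_land",
    "root:Host-associated:Plants:Rhizosphere:Soil",
]
tree = build_tree(paths)
print(sorted(tree["root"]))  # ['Environmental', 'Host-associated']
```

Each line of microbiomes.txt contributes one root-to-node path, so shared prefixes such as root:Environmental:Terrestrial:Soil collapse into a single branch.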

Source mapping

Map the samples' source environments to the biome ontology.

expert map --to-otlg -t ontology.pkl -i mapper.csv -o labels.h5
# Also equivalent to
expert map --to-otlg --otlg ontology.pkl --input mapper.csv --output labels.h5
  • Input: the mapper file, which contains biome source information for each sample.
   Env                         SampleID
0  root:Engineered:Wastewater  ERR2260442
1  root:Engineered:Wastewater  SRR980322
2  root:Engineered:Wastewater  ERR2985272
3  root:Engineered:Wastewater  ERR2814648
4  root:Engineered:Wastewater  ERR2985275
  • Output: the labels for samples in each layer of the biome ontology (HDF format, non-human-readable).
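
As a rough illustration of what "labels in each layer" means, the sketch below expands one biome path from the mapper file into one label per ontology layer. The layer numbering follows the result files shown in this document (layer 2 is root:Engineered, root:Environmental, and so on). The `layer_labels` helper is hypothetical, and how EXPERT labels layers deeper than a sample's own path is not covered here.

```python
from typing import Dict


def layer_labels(biome_path: str) -> Dict[int, str]:
    """Return {layer: label} for a colon-separated biome path.

    Layer n is taken to be the first n path components, with "root" as
    layer 1 (an assumption based on the layer-N.csv naming in this page).
    """
    parts = biome_path.split(":")
    return {depth: ":".join(parts[:depth]) for depth in range(2, len(parts) + 1)}


labels = layer_labels("root:Engineered:Wastewater")
print(labels)  # {2: 'root:Engineered', 3: 'root:Engineered:Wastewater'}
```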

Data converting & cleaning

Convert input data to a genus-level count matrix.

expert convert -i countMatrices.txt -o countMatrix.h5 --in-cm
# Also equivalent to
expert convert --input countMatrices.txt --output countMatrix.h5 --in-cm 
  • Input: a text file containing the paths to the input count matrix files / OTU tables, one per line.
datasets/soil_dataset/root:Host-associated:Plants:Rhizosphere:Soil/MGYS00005146-ERR1690680.tsv
datasets/soil_dataset/root:Host-associated:Plants:Rhizosphere:Soil/MGYS00005146-ERR1689675.tsv
datasets/soil_dataset/root:Host-associated:Plants:Rhizosphere:Soil/MGYS00000513-ERR986792.tsv
datasets/soil_dataset/root:Host-associated:Plants:Rhizosphere:Soil/MGYS00005146-ERR1691198.tsv
datasets/soil_dataset/root:Host-associated:Plants:Rhizosphere:Soil/MGYS00001704-ERR1905845.tsv
datasets/soil_dataset/root:Host-associated:Plants:Rhizosphere:Soil/MGYS00005146-ERR1689214.tsv
datasets/soil_dataset/root:Host-associated:Plants:Rhizosphere:Soil/MGYS00005146-ERR1689910.tsv
  • Output: converted genus-level count matrix file (HDF format, non-human-readable).
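
The path list can be written by hand, but for larger datasets it may be easier to generate it. The following is a minimal sketch, assuming the input tables are .tsv files stored somewhere under one dataset directory; `write_path_list` is a hypothetical helper, not an EXPERT function.

```python
from pathlib import Path
from typing import List


def write_path_list(dataset_dir: str, list_file: str, pattern: str = "*.tsv") -> List[str]:
    """Collect files matching `pattern` under `dataset_dir` (recursively)
    and write their paths to `list_file`, one per line."""
    paths = sorted(str(p) for p in Path(dataset_dir).rglob(pattern))
    Path(list_file).write_text("\n".join(paths) + ("\n" if paths else ""))
    return paths


# Example call (directory name taken from the listing above):
# write_path_list("datasets/soil_dataset", "countMatrices.txt")
```

The resulting file can then be passed to expert convert via -i.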

Ab initio training

Build an EXPERT model from scratch and train it.

expert train -i countMatrix.h5 -l labels.h5 -t ontology.pkl -o model
# Also equivalent to
expert train --input countMatrix.h5 --labels labels.h5 --otlg ontology.pkl --output model
  • Input: the biome ontology, the converted genus-level count matrix, and the labels for the samples in the count matrix.
  • Output: trained model.

Fast adaptation

expert transfer -i countMatrix.h5 -l labels.h5 -t ontology.pkl -o model
# Also equivalent to
expert transfer --input countMatrix.h5 --labels labels.h5 --otlg ontology.pkl --output model
  • Input: the biome ontology, the converted genus-level count matrix, and the labels for the samples in the count matrix.
  • Output: trained model.

Source tracking

expert search -i countMatrix.h5 -o searchResult -m model
# Also equivalent to
expert search --input countMatrix.h5 --output searchResult --model model
  • Input: converted genus-level count matrix.
  • Output: search result (multi-layer).
searchResult
├── layer-2.csv
├── layer-3.csv
├── layer-4.csv
├── layer-5.csv
└── layer-6.csv

Take layer-2.csv as an example.

            root:Engineered  root:Environmental  root:Host-associated  root:Mixed     Unknown
ERR2278752  0.0041427016     0.26372418          0.68632126            0.00040003657  0.045411825
ERR2278753  0.002841179      0.07928896          0.91735524            0.00051463145  0.0
ERR2666855  0.0006751048     0.0021803565        0.9970531             9.1493675e-05  0.0
ERR2666860  0.0005227786     0.013902989         0.98542625            0.00014803928  0.0
ERR2666881  0.0009569057     0.0023957777        0.9965403             0.00010694566  0.0
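
Each row gives one sample's estimated contribution from every biome at that layer. A minimal sketch for picking the dominant source per sample is shown below. It assumes the layer file is comma-separated with sample IDs in the first, unnamed column; adjust the parsing if your files differ. `top_sources` is a hypothetical helper, not part of EXPERT.

```python
import csv
import io
from typing import Dict, Tuple


def top_sources(csv_text: str) -> Dict[str, Tuple[str, float]]:
    """Map each sample ID to its highest-contributing biome and that contribution."""
    reader = csv.reader(io.StringIO(csv_text))
    biomes = next(reader)[1:]  # header row; first cell is the sample-ID column
    result: Dict[str, Tuple[str, float]] = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        values = [float(v) for v in row[1:]]
        best = max(range(len(values)), key=values.__getitem__)
        result[row[0]] = (biomes[best], values[best])
    return result


# Two rows from the example above, written as CSV (assumed layout).
example = """,root:Engineered,root:Environmental,root:Host-associated,root:Mixed,Unknown
ERR2278752,0.0041427016,0.26372418,0.68632126,0.00040003657,0.045411825
ERR2666855,0.0006751048,0.0021803565,0.9970531,9.1493675e-05,0.0
"""
print(top_sources(example)["ERR2278752"])  # ('root:Host-associated', 0.68632126)
```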

Evaluation

expert evaluate -i searchResultFolder -l labels.h5 -o EvaluationReport -p NUMProcesses
# Also equivalent to
expert evaluate --input searchResultFolder --labels labels.h5 --output EvaluationReport --processors NUMProcesses
  • Input: multi-layer labels and search result (source contribution) for samples.
  • Output: label-based evaluation report.
EvaluationReport
├── layer-2
│   └── root:Host-associated.csv
├── layer-2.csv
├── layer-3
│   └── root:Host-associated:Human.csv
├── layer-3.csv
├── layer-4
│   ├── root:Host-associated:Human:Circulatory_system.csv 
│   ├── root:Host-associated:Human:Digestive_system.csv
│   ├── root:Host-associated:Human:Lympathic_system.csv
│   ├── root:Host-associated:Human:Reproductive_system.csv
│   ├── root:Host-associated:Human:Respiratory_system.csv
│   └── root:Host-associated:Human:Skin.csv
├── layer-4.csv
├── layer-5
│   ├── root:Host-associated:Human:Circulatory_system:Blood.csv
│   ├── ...
│   └── root:Host-associated:Human:Respiratory_system:Pulmonary_system.csv
├── layer-5.csv
├── layer-6
│   ├── root:Host-associated:Human:Digestive_system:Large_intestine:Fecal.csv
│   ├── ...
│   └── root:Host-associated:Human:Respiratory_system:Pulmonary_system:Sputum.csv
└── layer-6.csv

Take layer-4/root:Host-associated:Human:Skin.csv as an example.

t TN FP FN TP Acc Sn Sp TPR FPR Rc Pr F1 ROC-AUC F-max
0.0 0 47688 0 4847 0.0923 1.0 0.0 1.0 1.0 1.0 0.0923 0.1689 0.9951 0.9374
0.01 44794 2893 30 4816 0.9444 0.9938 0.9393 0.9938 0.0607 0.9938 0.6247 0.7672 0.9951 0.9374
0.02 45545 2142 44 4802 0.9584 0.9909 0.9551 0.9909 0.0449 0.9909 0.6915 0.8146 0.9951 0.9374
0.03 45934 1753 59 4787 0.9655 0.9878 0.9632 0.9878 0.0368 0.9878 0.732 0.8409 0.9951 0.9374
0.04 46228 1459 73 4773 0.9708 0.9849 0.9694 0.9849 0.0306 0.9849 0.7659 0.8617 0.9951 0.9374
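
The per-threshold columns can be cross-checked from the confusion counts. The sketch below uses the standard textbook definitions of accuracy, sensitivity, specificity, false positive rate, precision and F1. These definitions reproduce the t = 0.01 row above, but the code is an illustration, not EXPERT's own implementation, and it does not guard against zero denominators.

```python
from typing import Dict


def confusion_metrics(tn: int, fp: int, fn: int, tp: int) -> Dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts."""
    sn = tp / (tp + fn)  # sensitivity (= TPR = recall)
    sp = tn / (tn + fp)  # specificity
    pr = tp / (tp + fp)  # precision
    return {
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "Sn": sn,
        "Sp": sp,
        "FPR": fp / (fp + tn),
        "Pr": pr,
        "F1": 2 * pr * sn / (pr + sn),
    }


# Counts from the t = 0.01 row of the example report.
row = confusion_metrics(tn=44794, fp=2893, fn=30, tp=4816)
print({name: round(value, 4) for name, value in row.items()})
# {'Acc': 0.9444, 'Sn': 0.9938, 'Sp': 0.9393, 'FPR': 0.0607, 'Pr': 0.6247, 'F1': 0.7672}
```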

Run the program with the -h option to see a detailed description of the work modes and options.
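
For scripted runs, the steps above can be chained from Python. The sketch below only assembles the command lines, using the flags and file names shown in this section, and prints them as a dry run. `pipeline_commands` is a hypothetical helper, and reusing the same count matrix for training and searching is purely for illustration.

```python
import shlex
from typing import List


def pipeline_commands(
    biomes: str = "microbiomes.txt",
    mapper: str = "mapper.csv",
    paths_file: str = "countMatrices.txt",
    processes: int = 1,
) -> List[List[str]]:
    """Return the expert command lines in the order used in this section."""
    return [
        ["expert", "construct", "-i", biomes, "-o", "ontology.pkl"],
        ["expert", "map", "--to-otlg", "-t", "ontology.pkl", "-i", mapper, "-o", "labels.h5"],
        ["expert", "convert", "-i", paths_file, "-o", "countMatrix.h5", "--in-cm"],
        ["expert", "train", "-i", "countMatrix.h5", "-l", "labels.h5",
         "-t", "ontology.pkl", "-o", "model"],
        ["expert", "search", "-i", "countMatrix.h5", "-o", "searchResult", "-m", "model"],
        ["expert", "evaluate", "-i", "searchResult", "-l", "labels.h5",
         "-o", "EvaluationReport", "-p", str(processes)],
    ]


# Dry run: print each command instead of executing it.
for cmd in pipeline_commands():
    print(shlex.join(cmd))
```

To execute the steps for real, each list can be passed to subprocess.run(cmd, check=True) once EXPERT is installed.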

Input abundance data

EXPERT takes two kinds of abundance data as inputs.

Taxonomic assignments result for a single sample (OTU table)

Note that the first line is a header, "# Constructed from biom file".

# Constructed from biom file
# OTU ID  ERR1754760  taxonomy
207119    19.0        sk__Archaea
118090    45.0        sk__Archaea;k__;p__Thaumarchaeota;c__;o__Nitrosopumilales;f__Nitro...
153156    38.0        sk__Archaea;k__;p__Thaumarchaeota;c__;o__Nitrosopumilales;f__Nitro...
131704    1.0         sk__Archaea;k__;p__Thaumarchaeota;c__Nitrososphaeria;o__Nitrososp...
103181    5174.0      sk__Bacteria
157361    9.0         sk__Bacteria;k__;p__;c__;o__;f__;g__;s__agricultural_soil_bacterium_SC-I-11
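
The taxonomy column is a semicolon-separated lineage in which each field carries a rank prefix (sk__, k__, p__, c__, o__, f__, g__, s__) followed by a name that may be empty. A minimal sketch for splitting such a string into ranks is shown below. `parse_lineage` is a hypothetical helper for illustration and says nothing about how EXPERT itself cleans or collapses lineages.

```python
from typing import Dict


def parse_lineage(lineage: str) -> Dict[str, str]:
    """Split a 'sk__...;k__...;p__...' lineage into {rank_prefix: name}.

    Ranks with no assigned name map to an empty string.
    """
    ranks: Dict[str, str] = {}
    for field in lineage.split(";"):
        prefix, sep, name = field.partition("__")
        if sep:  # ignore fields without a rank prefix
            ranks[prefix] = name
    return ranks


lineage = "sk__Bacteria;k__;p__;c__;o__;f__;g__;s__agricultural_soil_bacterium_SC-I-11"
ranks = parse_lineage(lineage)
print(ranks["sk"], repr(ranks["g"]))  # Bacteria ''
```

In this example the genus field (g__) is empty even though a species-level name is present, which is the kind of incomplete lineage that has to be handled when building a genus-level matrix.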

Taxonomic assignments result for multiple samples (count matrix)

#SampleID                                     ERR1844510  ERR1844449  ERR1844450  ERR1844451
sk__Archaea                                   1.0         17.0        8.0         16.0
sk__Archaea;k__;p__Crenarchaeota              0           0           0           0
sk__Archaea;k__;p__Euryarchaeota              8.0         2.0         3.0         1.0
sk__Archaea;k__;p__Eury...;c__...             0           0           0           0
sk__Archaea;k__;p__Eury...;c__...;o__...      0           0           0           0

License

Maintainer

Name Email Organization
Hui Chong huichong.me@gmail.com Research Assistant, School of Life Science and Technology, Huazhong University of Science & Technology
Kang Ning ningkang@hust.edu.cn Professor, School of Life Science and Technology, Huazhong University of Science & Technology

