Classification and prediction of the origin of metagenomic samples
Sourcepredict is a Python package, distributed through Conda, that classifies and predicts the origin of metagenomic samples given a reference dataset of known origins, a problem also known as source tracking. It solves this problem by applying machine learning classification to dimensionally reduced datasets.
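To give a feel for distance-based source prediction, here is a minimal sketch using only the Python standard library. It is not Sourcepredict's implementation: it skips normalization and the dimension reduction step entirely, and the toy counts, labels, and function names are invented for illustration. It computes the Bray-Curtis dissimilarity between a sink and each labelled source, then reports the label shares among the k nearest sources.

```python
from collections import Counter


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two equal-length count vectors."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0


def knn_proportions(sink, sources, labels, k=3):
    """Share of each label among the k sources closest to the sink."""
    k = min(k, len(sources))
    ranked = sorted(range(len(sources)),
                    key=lambda i: bray_curtis(sink, sources[i]))
    votes = Counter(labels[i] for i in ranked[:k])
    return {label: count / k for label, count in votes.items()}


# Hypothetical toy data: rows are samples, columns are taxa counts.
sources = [
    [90, 5, 5], [80, 10, 10], [85, 10, 5],   # dog-like profiles
    [10, 80, 10], [5, 90, 5], [10, 85, 5],   # human-like profiles
]
labels = ["Canis_familiaris"] * 3 + ["Homo_sapiens"] * 3
sink = [88, 7, 5]

proportions = knn_proportions(sink, sources, labels, k=3)
print(proportions)
```

In this toy case the three nearest sources are all dog-like, so the sink is attributed entirely to Canis_familiaris. The real tool classifies in an embedded, lower dimensional space rather than on raw distances.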
Installation
With conda (recommended)
$ conda install -c conda-forge -c maxibor sourcepredict
With pip
$ pip install sourcepredict
Example
Input
- Sink taxonomic count file (see example file and documentation)
- Source taxonomic count file (see example file and documentation)
- Source label file (see example file and documentation)
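Before running Sourcepredict, it can help to check that every source sample has a label. The sketch below assumes a particular layout that should be verified against the example files: the source count file has taxon IDs in the first column and one column per sample, and the label file has sample IDs in the first column and labels in the second. The helper names and the in-memory data are hypothetical.

```python
import csv
import io


def sample_ids_from_counts(handle):
    """Sample IDs, assumed to be the header columns after the taxon ID column."""
    header = next(csv.reader(handle))
    return header[1:]


def labels_from_file(handle):
    """Map sample ID -> label, assuming columns are (sample ID, label)."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def unlabelled_samples(source_handle, label_handle):
    """Source samples that have no entry in the label file."""
    labels = labels_from_file(label_handle)
    return [s for s in sample_ids_from_counts(source_handle) if s not in labels]


# In-memory stand-ins for the real files (hypothetical contents).
sources_csv = io.StringIO("TAXID,sampleA,sampleB,sampleC\n562,10,0,3\n1280,0,7,1\n")
labels_csv = io.StringIO(",labels\nsampleA,Homo_sapiens\nsampleB,Soil\n")

missing = unlabelled_samples(sources_csv, labels_csv)
print(missing)  # source samples lacking a label
```

With real data, replace the `io.StringIO` objects with files opened via `open(path, newline="")`.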
Usage
$ wget https://raw.githubusercontent.com/maxibor/sourcepredict/master/data/test/dog_test_sink_sample.csv -O dog_example.csv
$ wget https://raw.githubusercontent.com/maxibor/sourcepredict/master/data/modern_gut_microbiomes_labels.csv -O sp_labels.csv
$ wget https://raw.githubusercontent.com/maxibor/sourcepredict/master/data/modern_gut_microbiomes_sources.csv -O sp_sources.csv
$ sourcepredict -s sp_sources.csv -l sp_labels.csv dog_example.csv
Step 1: Checking for unknown proportion
== Sample: ERR1915662 ==
Adding unknown
Normalizing (GMPR)
Computing Bray-Curtis distance
Performing MDS embedding in 2 dimensions
KNN machine learning
Training KNN classifier on 2 cores...
-> Testing Accuracy: 1.0
----------------------
- Sample: ERR1915662
known:98.61%
unknown:1.39%
Step 2: Checking for source proportion
Computing weighted_unifrac distance on species rank
TSNE embedding in 2 dimensions
KNN machine learning
Performing 5 fold cross validation on 2 cores...
Trained KNN classifier with 10 neighbors
-> Testing Accuracy: 0.99
----------------------
- Sample: ERR1915662
Canis_familiaris:96.1%
Homo_sapiens:2.47%
Soil:1.43%
Sourcepredict result written to dog_test_sample.sourcepredict.csv
Output
Sourcepredict outputs the predicted source contribution to each sink sample, and the embedding of all samples in the lower dimensional space. See documentation for details.
Runtime
Depending on the normalization method (-n), the embedding (-me) method, the CPUs available for parallel processing (-t), and the data, the runtime should be between a few seconds and a few minutes per sink sample.
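When scripting runs over many sink samples, the command line can be assembled programmatically. The sketch below only builds and prints the command; it uses the -s, -l, and -t flags shown in this README, while the `build_command` helper and its parameter names are hypothetical. The -n and -me flags are left out because their accepted values are not listed here; see the documentation.

```python
import shlex


def build_command(sink, sources, labels, threads=None):
    """Assemble a sourcepredict command line as a list of arguments."""
    cmd = ["sourcepredict", "-s", sources, "-l", labels]
    if threads is not None:
        cmd += ["-t", str(threads)]  # CPUs for parallel processing
    cmd.append(sink)
    return cmd


cmd = build_command("dog_example.csv", "sp_sources.csv", "sp_labels.csv", threads=2)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once Sourcepredict is installed.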
Documentation
The Sourcepredict documentation is available at sourcepredict.readthedocs.io
Sourcepredict example files
- The sources were obtained with a simple Nextflow pipeline, with Kraken2 using the MiniKraken2_v2_8GB database. See the documentation for more information on how to build a custom source file.
- The example source file is modern_gut_microbiomes_sources.csv
- The example label file is modern_gut_microbiomes_labels.csv
Environments included in the example source file
- Homo sapiens gut microbiome (1, 2, 3, 4, 5, 6)
- Canis familiaris gut microbiome (1)
- Soil microbiome (1, 2, 3)
Contributing Code, Documentation, or Feedback
If you wish to contribute to Sourcepredict, you are welcome and encouraged to do so by opening an issue or creating a pull request. All contributions will be made under the GPLv3 license. More information can be found on the contributing page.
How to cite
Sourcepredict has been published in JOSS.
@article{Borry2019Sourcepredict,
journal = {Journal of Open Source Software},
doi = {10.21105/joss.01540},
issn = {2475-9066},
number = {41},
publisher = {The Open Journal},
title = {Sourcepredict: Prediction of metagenomic sample sources using dimension reduction followed by machine learning classification},
url = {http://dx.doi.org/10.21105/joss.01540},
volume = {4},
author = {Borry, Maxime},
pages = {1540},
date = {2019-09-04},
year = {2019},
month = {9},
day = {4}
}