Uses a randomForest model to predict which OTUs are present in a microbiome
OTU_predictor
OTU_predictor uses a trained RandomForestClassifier ML model to predict 'real' OTU presence from ancient metagenomic samples (although its use is not limited to ancient samples). The training dataset consists of 200 simulated populations generated with InSilicoSeq and deaminated using gargammel. Each population contains between 5 and 20 microbial species with known, variable abundance. OTU_predictor takes input files generated by Centrifuge, specifically centrifugeReport.txt files.
Install package
OTU_predictor currently runs on Python 3.11. It may also run on earlier Python versions, but this has not been extensively tested. The easiest way to install OTU_predictor is with pip. Either of the following commands will do this:
pip install OTU-predictor
or
pip install git+https://github.com/DrATedder/OTU_predictor.git
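To confirm that the installation succeeded, the installed version can be queried with Python's standard library. The following is a minimal sketch: the helper name `installed_version` is hypothetical, and the distribution name `OTU-predictor` is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("OTU-predictor"))
```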
Basic Usage
1. Converting your data
OTU_predictor works with centrifugeReport.txt files. Before you can run the model prediction step, some minor format tweaks are required (see the example output below). This can be done as follows:
import OTU_predictor
centrifugeReport = "/path/to/your/file_centrifugeReport.txt"
OTU_predictor.convert_file(centrifugeReport)
If this step is successful, you will see a message similar to the one below:
'Data file /path/to/your/file_centrifugeReport_data.txt created'
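Judging by the message above, the converted file appears to be written next to the input with `_data` appended to the file stem. A small helper can derive that path so it can be passed on to the prediction step. This is a sketch under that assumption: `expected_data_path` is a hypothetical name and is not part of OTU_predictor.

```python
from pathlib import Path


def expected_data_path(report_path):
    """Guess the converted file's path from the original report path.

    Assumption: convert_file() writes "<stem>_data<suffix>" beside the input,
    as suggested by the success message. This is not a documented guarantee.
    """
    report = Path(report_path)
    return report.with_name(f"{report.stem}_data{report.suffix}")


print(expected_data_path("/path/to/your/file_centrifugeReport.txt"))
```

Checking that the derived path exists before moving on is a cheap way to catch a failed conversion.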
The output data format should look something like this:
| name | taxID | taxRank | genomeSize | numReads | numUniqueReads | abundance | genus | presence | sim_abundance |
|---|---|---|---|---|---|---|---|---|---|
| Bacteria | 2 | superkingdom | 0 | 127 | 103 | 0.00026298841815572643 | NA | 0 | 0 |
| Azorhizobium | 6 | genus | 5369772 | 1 | 0 | 2.070774946108082e-06 | Azorhizobium | 0 | 0 |
| Azorhizobium caulinodans | 7 | species | 5369772 | 3 | 0 | 6.212324838324246e-06 | Azorhizobium | 0 | 0 |
| Buchnera aphidicola | 9 | species | 602805 | 3 | 1 | 6.212324838324246e-06 | Buchnera | 0 | 0 |
| Cellulomonas gilvus | 11 | species | 3526441 | 15 | 0 | 3.106162419162123e-05 | Cellulomonas | 0 | 0 |
| Phenylobacterium | 20 | genus | 4379231 | 1 | 0 | 2.070774946108082e-06 | Phenylobacterium | 0 | 0 |
| Shewanella | 22 | genus | 5140018 | 10 | 1 | 2.0707749461080822e-05 | Shewanella | 0 | 0 |
| Shewanella putrefaciens | 24 | species | 4749735 | 2 | 1 | 4.141549892216164e-06 | Shewanella | 0 | 0 |
| Myxococcales | 29 | order | 9638245 | 171 | 0 | 0.00035410251578448204 | NA | 0 | 0 |
| Myxococcaceae | 31 | family | 9636120 | 9 | 0 | 1.863697451497274e-05 | NA | 0 | 0 |
| Myxococcus | 32 | genus | 9487953 | 10 | 0 | 2.0707749461080822e-05 | Myxococcus | 0 | 0 |
| Myxococcus xanthus | 34 | species | 9139763 | 47 | 10 | 9.732642246707986e-05 | Myxococcus | 0 | 0 |
| Myxococcus macrosporus | 35 | species | 8973512 | 20 | 8 | 4.1415498922161644e-05 | Myxococcus | 0 | 0 |
| Archangiaceae | 39 | family | 10085598 | 11 | 0 | 2.2778524407188902e-05 | NA | 0 | 0 |
| Stigmatella | 40 | genus | 10260756 | 2 | 0 | 4.141549892216164e-06 | Stigmatella | 0 | 0 |
| Stigmatella aurantiaca | 41 | species | 10260756 | 1 | 0 | 2.070774946108082e-06 | Stigmatella | 0 | 0 |
| Cystobacter | 42 | genus | 0 | 1 | 0 | 2.070774946108082e-06 | Cystobacter | 0 | 0 |
Note. You will notice the addition of three new columns (genus, presence and sim_abundance). These are variables used by the model during training; they must be included for the file to be valid, but they are not interpreted during the prediction step. Note also that the converted file is comma-delimited rather than tab-delimited.
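Because the converted file is comma-delimited, it can be inspected with Python's `csv` module before making predictions. The sketch below assumes the file starts with a header row using the column names shown in the table above. It reads from an in-memory sample (rows copied from that table) instead of a real file, and `rows_at_rank` is a hypothetical helper.

```python
import csv
import io

# Assumption: a header row with these column names, followed by comma-delimited rows.
SAMPLE = """name,taxID,taxRank,genomeSize,numReads,numUniqueReads,abundance,genus,presence,sim_abundance
Bacteria,2,superkingdom,0,127,103,0.00026298841815572643,NA,0,0
Azorhizobium,6,genus,5369772,1,0,2.070774946108082e-06,Azorhizobium,0,0
Azorhizobium caulinodans,7,species,5369772,3,0,6.212324838324246e-06,Azorhizobium,0,0
Myxococcus xanthus,34,species,9139763,47,10,9.732642246707986e-05,Myxococcus,0,0
"""


def rows_at_rank(handle, rank):
    """Return all rows (as dictionaries) whose taxRank equals `rank`."""
    reader = csv.DictReader(handle)
    return [row for row in reader if row["taxRank"] == rank]


species_rows = rows_at_rank(io.StringIO(SAMPLE), "species")
for row in species_rows:
    # csv yields strings, so convert numeric fields explicitly.
    print(row["name"], int(row["numReads"]))
```

For a real file, replace `io.StringIO(SAMPLE)` with an open file handle, for example `open(path, newline="")`.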
2. Making predictions
The make_predictions() function uses your data file and the trained model packaged with this distribution. You can also create your own model if you prefer. The basic steps to make predictions are as follows:
converted_data = "/path/to/your/file_centrifugeReport_data.txt"
OTU_predictor.make_predictions(converted_data)
The output will be a list (of dictionaries) similar to the one shown below:
[{'Species': 'Neisseria mucosa', 'TaxID': 488, 'Prediction': 1, 'Certainty': 0.68},
{'Species': 'Streptococcus sanguinis', 'TaxID': 1305, 'Prediction': 1, 'Certainty': 0.72},
{'Species': 'Actinomyces sp. oral taxon 414', 'TaxID': 712122, 'Prediction': 1, 'Certainty': 0.97},
{'Species': 'Olsenella sp. oral taxon 807', 'TaxID': 712411, 'Prediction': 1, 'Certainty': 0.88},
{'Species': 'Anaerolineaceae bacterium oral taxon 439', 'TaxID': 1889813, 'Prediction': 1, 'Certainty': 0.87},
{'Species': 'Desulfobulbus oralis', 'TaxID': 1986146, 'Prediction': 1, 'Certainty': 0.84}]
Note. Each entry in the output list gives the OTU name (under 'Species', although the OTU can be at any taxonomic level assigned by Centrifuge) and its taxID, along with a certainty score. Scores range from 0 to 1, with higher scores indicating greater certainty. Prediction: 1 indicates that the OTU is predicted to be present in the sample. The model also predicts OTU absence (Prediction: 0), but these entries are not shown in the output.
Users should choose a certainty threshold that fits their experimental purpose.
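As an illustration of applying a certainty cut-off, the sketch below filters a prediction list of the form shown above. The threshold of 0.85 is an arbitrary example value and not a recommendation, and `filter_by_certainty` is a hypothetical helper that is not part of OTU_predictor.

```python
# Example output copied from the list above.
predictions = [
    {'Species': 'Neisseria mucosa', 'TaxID': 488, 'Prediction': 1, 'Certainty': 0.68},
    {'Species': 'Streptococcus sanguinis', 'TaxID': 1305, 'Prediction': 1, 'Certainty': 0.72},
    {'Species': 'Actinomyces sp. oral taxon 414', 'TaxID': 712122, 'Prediction': 1, 'Certainty': 0.97},
    {'Species': 'Olsenella sp. oral taxon 807', 'TaxID': 712411, 'Prediction': 1, 'Certainty': 0.88},
    {'Species': 'Anaerolineaceae bacterium oral taxon 439', 'TaxID': 1889813, 'Prediction': 1, 'Certainty': 0.87},
    {'Species': 'Desulfobulbus oralis', 'TaxID': 1986146, 'Prediction': 1, 'Certainty': 0.84},
]


def filter_by_certainty(predictions, threshold):
    """Keep predicted-present OTUs whose certainty is at least `threshold`."""
    return [
        p for p in predictions
        if p['Prediction'] == 1 and p['Certainty'] >= threshold
    ]


confident = filter_by_certainty(predictions, 0.85)  # 0.85 is illustrative only
for p in sorted(confident, key=lambda p: p['Certainty'], reverse=True):
    print(f"{p['Species']} (taxID {p['TaxID']}): {p['Certainty']:.2f}")
```

A stricter threshold trades sensitivity for fewer false positives, so the right value depends on how costly each kind of error is for the analysis at hand.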
File details
Details for the file otu_predictor-1.1.0.tar.gz.
File metadata
- Download URL: otu_predictor-1.1.0.tar.gz
- Upload date:
- Size: 948.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 79934df220a0beb3f23de52542bc638f1d5767efd630c57e527083db57573dd5 |
| MD5 | cde2f73d4d8e0b7bd47bf1643779a67c |
| BLAKE2b-256 | 7a380920a15e24ddcd1a2279f76e1140f28834002eada58a1a9ecb01650696d8 |
File details
Details for the file otu_predictor-1.1.0-py3-none-any.whl.
File metadata
- Download URL: otu_predictor-1.1.0-py3-none-any.whl
- Upload date:
- Size: 5.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e1f2d89d3c2f5857a1f9920fe81c464a6cccbfe5eef7a12e12cb5b35cc51e86c |
| MD5 | 2fa4208fe4cce79aa7c0598478fc94cc |
| BLAKE2b-256 | dd2e39fd1069e79d9af4daad37498f16703dbf568238afb876a8576ef8d67d52 |