Machine learning platform for taxonomic classification
ePlacer
ePlacer is a taxonomic classification tool that uses deep-learning approaches to incorporate both sequence information and biogeographic information into taxonomic assignment of DNA sequences.
Why use ePlacer
The machine learning architecture of ePlacer extends prediction beyond sequence-only classification tools (e.g. sequence alignment with BLAST or naive Bayes classifiers) by directly incorporating additional data into the probabilistic estimate of taxonomy. It was developed specifically for metabarcoding data, where two reference species can have 100% sequence overlap but distinct geographic ranges. ePlacer discriminates between these cases and provides additional data for downstream taxonomic curation. As a result, ePlacer improves interoperability between metabarcoding datasets.
Currently, ePlacer offers pre-trained models for two popular metabarcoding regions: the MiFish and the ecoPrimer, or Riaz, marker gene regions. For these two regions, ePlacer offers the following benefits:
- Interoperability. ePlacer is trained on global datasets, allowing for direct comparison between metabarcoding datasets, regardless of geographic region.
- Portability. ePlacer has pre-trained models for both the MiFish and Riaz marker gene regions, containerized and available for out-of-the-box use.
- Interactive Visualization. ePlacer provides an interactive GUI and curation tool that allows
- Increased Accuracy. The ePlacer model architecture provides increased accuracy, precision, and recall compared to BLAST, naive Bayes, or least common ancestor approaches.
- Trainability. In addition to the two provided barcodes, this code repository provides tools for training new models.
Other barcode regions can gain significant advantages from training new models. If you are interested in training a new model for ePlacer, please do not hesitate to reach out!
Installation
Users can install the current version of ePlacer with conda.
conda install bioconda::eplacer
Using ePlacer for classification
The ePlacer taxonomic assignment tool can be run in two ways: natively (through the ePlacer CLI or API) or through a QIIME2 plugin. This documentation covers native usage. Details on usage of the QIIME2 plugin can be found in the linked git repository.
ePlacer taxonomically classifies ASV sequences using two distinct types of information:
- Sequence information (inferred from ASVs)
- Biogeography (inferred from sample metadata and count tables)
Although not strictly required for assignment, BLAST results are also used in an automated curation step that checks "solvable" taxonomic assignments and resolves them more accurately.
Using this information, ePlacer generates a raw confidence of presence across all possible taxonomic labels.
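As a rough intuition for why biogeography helps when sequences are identical, the sketch below combines a per-label sequence score with a distance-based geographic weight. This is not ePlacer's model, which is a deep-learning architecture. The distance-decay rule, the `scale_km` parameter, the species labels, and the occurrence coordinates are all assumptions made up for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))

def combine_evidence(sequence_scores, occurrences, sample_latlon, scale_km=500.0):
    """Weight each label's sequence score by distance to its nearest occurrence.

    Toy rule (an assumption, not ePlacer's method):
    weight = score * exp(-distance / scale_km), normalised to sum to 1.
    """
    weights = {}
    for label, score in sequence_scores.items():
        nearest = min(
            haversine_km(sample_latlon[0], sample_latlon[1], lat, lon)
            for lat, lon in occurrences[label]
        )
        weights[label] = score * math.exp(-nearest / scale_km)
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}

# Two hypothetical species with identical barcodes, hence equal sequence scores.
sequence_scores = {"SpeciesLabelA": 1.0, "SpeciesLabelB": 1.0}
occurrences = {
    "SpeciesLabelA": [(40.0, -72.0), (41.5, -70.5)],  # made-up Atlantic records
    "SpeciesLabelB": [(35.0, 139.8), (34.2, 136.9)],  # made-up Pacific records
}
sample = (39.645946, -71.746641)  # coordinates of a sample in the Atlantic

confidence = combine_evidence(sequence_scores, occurrences, sample)
print(confidence)
```

With equal sequence scores, the label whose known occurrences lie near the sample receives almost all of the weight, which is the kind of case a sequence-only classifier cannot resolve.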
To run classification with ePlacer, four data files are required. Properly formatted examples are shown here:
- A fasta file of ASVs
>ASV1
CCGTAAACTTAGATAAATTAGTACAACAAATATCGGCCCGGGAACT
>ASV2
CGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACT
>ASV3
CGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACT
- A geography metadata file
#SampleID Latitude Longitude
Sample1 39.645946 -71.746641
Sample2 39.645946 -71.746641
- A count table
#OTU ID Sample1 Sample2
ASV1 15 0
ASV2 5 22
ASV3 0 10
- BLAST output (generated with -outfmt "6 qseqid sseqid pident evalue length qlen slen qstart qend sstart send sseq")
ASV1 SubjectRef_A 100.00 1.45e-45 98 98 98 1 98 1 98 GCCGTAAACTTAGATAAATTAGTACAACAAATATCGGCCCGGGAACTACGAGCGCCAGCTTATAACCCAAAGGACTTGGCGCTGCTTCAGACCCCCCT
ASV2 SubjectRef_B 99.00 2.12e-42 98 98 98 1 98 1 98 GCGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACTACGAGCGCCTGCTTAAAACCCAAAGGTCTTGGCGGTGCTTCAGACCCCCCT
ASV3 SubjectRef_C 100.00 1.45e-45 98 98 98 1 98 1 98 GCGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACTACGAGCGCCTGCTTAAAACCCAAAGGTCTTGGCGGTGCTTCAGACCCCCCT
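Before running classification, it can be useful to confirm that the four files agree on ASV and sample identifiers. The following is a minimal sketch of such a check, not part of ePlacer. It assumes the tables are tab-delimited (the examples above do not show the delimiter), and the helper names `read_fasta_ids`, `read_table`, and `check_inputs` are hypothetical. The sequences in the toy data are shortened.

```python
import csv
import io

BLAST_COLUMNS = ("qseqid sseqid pident evalue length qlen slen "
                 "qstart qend sstart send sseq").split()

def read_fasta_ids(handle):
    """Return record IDs: the text after '>' up to the first whitespace."""
    return [line[1:].split()[0] for line in handle if line.startswith(">")]

def read_table(handle):
    """Read a tab-delimited table whose header row starts with '#'."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    header = [rows[0][0].lstrip("#")] + rows[0][1:]
    return header, rows[1:]

def check_inputs(fasta, geo, counts, blast):
    """Return a list of problems; an empty list means the IDs are consistent."""
    problems = []
    asv_ids = set(read_fasta_ids(fasta))
    _, geo_rows = read_table(geo)
    geo_samples = {row[0] for row in geo_rows}
    count_header, count_rows = read_table(counts)
    count_asvs = {row[0] for row in count_rows}

    for sample in sorted(set(count_header[1:]) - geo_samples):
        problems.append(f"sample {sample} has counts but no coordinates")
    for asv in sorted(asv_ids ^ count_asvs):
        problems.append(f"{asv} is not present in both the fasta and the count table")
    for row in csv.reader(blast, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != len(BLAST_COLUMNS):
            problems.append(f"BLAST row has {len(row)} fields, expected {len(BLAST_COLUMNS)}")
        elif row[0] not in asv_ids:
            problems.append(f"BLAST query {row[0]} is not in the fasta")
    return problems

# Toy inputs mirroring the examples above (sequences shortened).
FASTA = ">ASV1\nCCGTAAACTTAG\n>ASV2\nCGGTAAACTTAG\n>ASV3\nCGGTAAACTTAG\n"
GEO = ("#SampleID\tLatitude\tLongitude\n"
       "Sample1\t39.645946\t-71.746641\n"
       "Sample2\t39.645946\t-71.746641\n")
COUNTS = "#OTU ID\tSample1\tSample2\nASV1\t15\t0\nASV2\t5\t22\nASV3\t0\t10\n"
BLAST = "ASV1\tSubjectRef_A\t100.00\t1.45e-45\t98\t98\t98\t1\t98\t1\t98\tGCCGTAAACTTAG\n"

problems = check_inputs(io.StringIO(FASTA), io.StringIO(GEO),
                        io.StringIO(COUNTS), io.StringIO(BLAST))
print(problems or "inputs look consistent")
```

For real data, replace the `io.StringIO` objects with open file handles.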
Acquiring pre-trained models
Pre-trained models can be acquired from Zenodo (doi:10.5281/zenodo.20820029). Currently, only models for the 12S-V5 ecoPrimer (Riaz) and MiFish primers are available, but others will be created and stored in the future. If you develop your own model, please don't hesitate to reach out.
Models for native use are distributed as archived directories and can be downloaded and unpacked as follows:
wget https://zenodo.org/records/20820029/files/mifish.tar.gz
tar -xzf mifish.tar.gz
wget https://zenodo.org/records/20820029/files/riaz.tar.gz
tar -xzf riaz.tar.gz
Note that we also provide pre-compiled *.qza models for use with QIIME2. These can be found in the same Zenodo repository.
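The wget and tar steps above can also be scripted in Python when the download is part of a larger pipeline. This is a sketch using only the standard library. The function names `extract_model` and `fetch_model` are hypothetical, and the URL pattern is copied from the wget commands above.

```python
import tarfile
import urllib.request
from pathlib import Path

ZENODO_FILES = "https://zenodo.org/records/20820029/files"

def extract_model(archive, dest="."):
    """Unpack a .tar.gz model archive into dest; return its top-level entries."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        top_level = sorted({
            Path(member.name).parts[0]
            for member in tar.getmembers()
            if Path(member.name).parts
        })
        # The "data" extraction filter rejects unsafe members (absolute paths,
        # links pointing outside dest). Older Python releases do not have it.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return top_level

def fetch_model(name, dest="."):
    """Download <name>.tar.gz from the Zenodo record, then unpack it."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / f"{name}.tar.gz"
    urllib.request.urlretrieve(f"{ZENODO_FILES}/{name}.tar.gz", archive)
    return extract_model(archive, dest)
```

Calling `fetch_model("mifish", "models")` or `fetch_model("riaz", "models")` would download and unpack the corresponding archive into a `models` directory. Nothing is downloaded until one of these calls is made.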
Running Classification with Pre-trained models
To classify with a downloaded pre-trained model, or with a model you have generated yourself, use the following command:
eplacer run-model --fasta <fasta path> --counts <count matrix> --geoData <geoData path> --confidence <threshold> --model <model path> --maskrate 0
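When classification is launched from a script, the command above can be assembled programmatically. The sketch below only wraps the flags shown in the command above. The function names, file names, and the confidence value 0.8 are placeholders, not recommended settings.

```python
import shlex
import subprocess

def build_run_command(fasta, counts, geo_data, model, confidence, maskrate=0):
    """Assemble the argument list for `eplacer run-model`."""
    return [
        "eplacer", "run-model",
        "--fasta", str(fasta),
        "--counts", str(counts),
        "--geoData", str(geo_data),
        "--confidence", str(confidence),
        "--model", str(model),
        "--maskrate", str(maskrate),
    ]

def run_classification(**kwargs):
    """Run the command and raise CalledProcessError if eplacer exits non-zero."""
    cmd = build_run_command(**kwargs)
    print("Running:", shlex.join(cmd))
    return subprocess.run(cmd, check=True)

# Placeholder paths and threshold; this only prints the command.
cmd = build_run_command(
    fasta="asvs.fasta",
    counts="counts.tsv",
    geo_data="metadata.tsv",
    model="mifish",
    confidence=0.8,
)
print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.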
Training new ePlacer models
Training a new ePlacer model requires three inputs: an aligned fasta file for the barcode of interest (containing all available references of interest), a flat taxonomy file, and a reference file for biogeography (currently, ePlacer supports the OBIS csv download).
ePlacer also supports custom references for biogeography, formatted as follows:
#Species Latitude Longitude
SpeciesLabelA 39.645946 -71.746641
SpeciesLabelB 39.645946 -71.746641
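A custom reference in this layout can be written from occurrence records with a few lines of Python. This sketch assumes the file is tab-delimited, which the example above does not confirm. The function name `write_geo_reference` is hypothetical, and the coordinate range check is an added safeguard rather than a documented ePlacer requirement.

```python
import csv
import io

def write_geo_reference(records, handle):
    """Write (species, latitude, longitude) records as a tab-delimited table."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["#Species", "Latitude", "Longitude"])
    for species, lat, lon in records:
        lat, lon = float(lat), float(lon)
        # Reject coordinates outside the valid latitude/longitude ranges.
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            raise ValueError(f"{species}: coordinates out of range ({lat}, {lon})")
        writer.writerow([species, f"{lat:.6f}", f"{lon:.6f}"])

records = [
    ("SpeciesLabelA", 39.645946, -71.746641),
    ("SpeciesLabelB", 39.645946, -71.746641),
]
buffer = io.StringIO()
write_geo_reference(records, buffer)
print(buffer.getvalue(), end="")
```

To write to disk, pass a file opened with `open(path, "w", newline="")` in place of the `io.StringIO` buffer.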
To run the training, use the following:
eplacer train-model --fasta <alignment file> --taxa <taxonomy file> \
--out <output directory> --taxlevel SPECIES \
--geoData <obis data> --augments <augment setting; test several values> \
--maskrate <mask rate; test several values> --threads 1
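Because several --augments and --maskrate settings should be tested, it can help to generate one training command per combination. The sketch below does this with `itertools.product`. The specific values (1 and 5 for augments, 0.0 and 0.1 for mask rate), the file names, and the output directory naming are placeholders chosen for illustration. Valid ranges for these parameters are not given here.

```python
import itertools
import shlex

def training_commands(alignment, taxonomy, geo_data, out_root,
                      augments, maskrates, threads=1):
    """Yield one `eplacer train-model` argument list per parameter combination."""
    for augment, maskrate in itertools.product(augments, maskrates):
        out_dir = f"{out_root}/aug{augment}_mask{maskrate}"
        yield [
            "eplacer", "train-model",
            "--fasta", alignment,
            "--taxa", taxonomy,
            "--out", out_dir,
            "--taxlevel", "SPECIES",
            "--geoData", geo_data,
            "--augments", str(augment),
            "--maskrate", str(maskrate),
            "--threads", str(threads),
        ]

# Placeholder inputs and parameter values; this only prints the commands.
commands = list(training_commands(
    alignment="references.aln.fasta",
    taxonomy="taxonomy.tsv",
    geo_data="obis.csv",
    out_root="models",
    augments=[1, 5],
    maskrates=[0.0, 0.1],
))
for command in commands:
    print(shlex.join(command))
```

Each printed line can be run in a shell or passed to `subprocess.run`. Writing every run to its own output directory keeps the trained models separate for comparison.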
==============================================================
This repository is a scientific product and is not official communication of the National Oceanic and Atmospheric Administration, or the United States Department of Commerce. All NOAA GitHub project code is provided on an ‘as is’ basis and the user assumes responsibility for its use. Any claims against the Department of Commerce or Department of Commerce bureaus stemming from the use of this GitHub project will be governed by all applicable Federal law. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by the Department of Commerce. The Department of Commerce seal and logo, or the seal and logo of a DOC bureau, shall not be used in any manner to imply endorsement of any commercial product or activity by DOC or the United States Government.