Machine learning platform for taxonomic classification

ePlacer

ePlacer is a taxonomic classification tool that uses deep-learning approaches to incorporate both sequence information and biogeographic information into taxonomic assignment of DNA sequences.

Why use ePlacer

ePlacer's machine learning architecture enables prediction beyond sequence-only classification tools (e.g. sequence alignment with BLAST or naive Bayes classifiers) by directly incorporating additional data into the probabilistic estimate of taxonomy. It was developed specifically for metabarcoding data, where two reference species can have 100% sequence overlap but distinct geographic ranges. ePlacer discriminates between these cases and provides additional data for downstream taxonomic curation. As a result, ePlacer provides enhanced interoperability between metabarcoding datasets.

Currently, ePlacer offers pre-trained models for two popular metabarcoding regions: the MiFish and the ecoPrimer, or Riaz, marker gene regions. For these two regions, ePlacer offers the following benefits:

  • Interoperability. ePlacer is trained on global datasets, allowing for direct comparison between metabarcoding datasets, regardless of geographic region.
  • Portability. ePlacer has pre-trained models for both the MiFish and Riaz marker gene regions, containerized and available for out-of-the-box use.
  • Interactive Visualization. ePlacer provides an interactive GUI and curation tool that allows
  • Increased Accuracy. The ePlacer model architecture provides increased accuracy, precision, and recall compared to BLAST, naive Bayes, or least common ancestor approaches.
  • Trainability. In addition to the two provided barcodes, this code repository provides tools for training new models.

For other barcode regions, training a new model is expected to offer significant advantages. If you are interested in training a new model for ePlacer, please do not hesitate to reach out!

Installation

Users can install the current version of ePlacer with conda.

conda install bioconda::eplacer

Using ePlacer for classification

The ePlacer taxonomic assignment tool can be run in two ways: natively (through the ePlacer CLI or API) or with a QIIME2 plugin. This documentation covers native usage. Details on the QIIME2 plugin can be found in the linked git repository.

ePlacer taxonomically classifies ASV sequences using two distinct types of information:

  • Sequence information (inferred from ASVs)
  • Biogeography (inferred from sample metadata and count tables)

Although not strictly required for assignment, BLAST results are also used in an automated curation step: ePlacer checks for "solvable" taxonomic assignments and resolves them more accurately.

Using this information, ePlacer generates a raw confidence of presence across all possible taxonomic labels.

To run classification with ePlacer, four data files are required. Properly formatted examples are shown below:

  • A fasta file of ASVs
>ASV1
CCGTAAACTTAGATAAATTAGTACAACAAATATCGGCCCGGGAACT
>ASV2
CGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACT
>ASV3
CGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACT
  • A geography metadata file
#SampleID	Latitude	Longitude
Sample1	39.645946	-71.746641
Sample2	39.645946	-71.746641
  • A count table
#OTU ID	Sample1	Sample2
ASV1	15	0
ASV2	5	22
ASV3	0	10
  • BLAST output (generated with -outfmt "6 qseqid sseqid pident evalue length qlen slen qstart qend sstart send sseq")
ASV1	SubjectRef_A	100.00	1.45e-45	98	98	98	1	98	1	98	GCCGTAAACTTAGATAAATTAGTACAACAAATATCGGCCCGGGAACTACGAGCGCCAGCTTATAACCCAAAGGACTTGGCGCTGCTTCAGACCCCCCT
ASV2	SubjectRef_B	99.00	2.12e-42	98	98	98	1	98	1	98	GCGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACTACGAGCGCCTGCTTAAAACCCAAAGGTCTTGGCGGTGCTTCAGACCCCCCT
ASV3	SubjectRef_C	100.00	1.45e-45	98	98	98	1	98	1	98	GCGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACTACGAGCGCCTGCTTAAAACCCAAAGGTCTTGGCGGTGCTTCAGACCCCCCT
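
Before running classification, it can help to confirm that the identifiers agree across the fasta file, the count table, and the geography metadata. The following is a minimal sketch and is not part of ePlacer: the function names (`read_fasta_ids`, `read_count_table`, `read_metadata_samples`, `check_inputs`) are hypothetical, and it assumes the tab-separated layouts shown in the examples above.

```python
import csv
import io


def read_fasta_ids(handle):
    """Return the sequence IDs (text after '>' up to the first whitespace)."""
    ids = []
    for line in handle:
        if line.startswith(">"):
            header = line[1:].split()
            ids.append(header[0] if header else "")
    return ids


def read_count_table(handle):
    """Return (sample_ids, asv_ids) from a tab-separated count table."""
    rows = [r for r in csv.reader(handle, delimiter="\t") if r]
    header, body = rows[0], rows[1:]
    return header[1:], [r[0] for r in body]


def read_metadata_samples(handle):
    """Return sample IDs from the geography metadata, checking coordinate ranges."""
    samples = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the '#SampleID' header
        sample, lat, lon = row[0], float(row[1]), float(row[2])
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            raise ValueError(f"coordinates out of range for {sample}")
        samples.append(sample)
    return samples


def check_inputs(fasta, counts, metadata):
    """Return a list of human-readable problems; an empty list means the IDs agree."""
    problems = []
    fasta_ids = set(read_fasta_ids(fasta))
    count_samples, count_asvs = read_count_table(counts)
    meta_samples = set(read_metadata_samples(metadata))
    for asv in sorted(set(count_asvs) - fasta_ids):
        problems.append(f"{asv} is in the count table but not in the fasta file")
    for asv in sorted(fasta_ids - set(count_asvs)):
        problems.append(f"{asv} is in the fasta file but not in the count table")
    for sample in sorted(set(count_samples) - meta_samples):
        problems.append(f"{sample} has counts but no coordinates")
    return problems


# In-memory stand-ins for the example files (sequences shortened,
# ASV3 deliberately left out of the fasta to trigger a report).
example_problems = check_inputs(
    io.StringIO(">ASV1\nCCGT\n>ASV2\nCGGT\n"),
    io.StringIO("#OTU ID\tSample1\tSample2\nASV1\t15\t0\nASV2\t5\t22\nASV3\t0\t10\n"),
    io.StringIO(
        "#SampleID\tLatitude\tLongitude\n"
        "Sample1\t39.645946\t-71.746641\n"
        "Sample2\t39.645946\t-71.746641\n"
    ),
)
print(example_problems)
```

With real data, pass open file handles instead of `io.StringIO` objects. Whether ePlacer itself tolerates mismatched identifiers is not stated here, so treat this only as a pre-flight check.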

Acquiring pre-trained models

Pre-trained models can be acquired from Zenodo (doi:10.5281/zenodo.20820029). Currently, only models for the 12S-V5 ecoPrimer (Riaz) and MiFish primers are available, but others will be created and stored in the future. If you develop your own model, please don't hesitate to reach out.

Models for native use are distributed as archives that unpack into model directories. They can be downloaded and extracted as follows:

wget https://zenodo.org/records/20820029/files/mifish.tar.gz
tar -xzf mifish.tar.gz
wget https://zenodo.org/records/20820029/files/riaz.tar.gz
tar -xzf riaz.tar.gz

Note that we also provide pre-compiled *.qza models for use with QIIME2. These can be found in the same Zenodo repository.

Running Classification with Pre-trained models

To classify with a downloaded pre-trained model, or with a model you have trained yourself, use the following command:

eplacer run-model --fasta <fasta path> --counts <count matrix> --geoData <geoData path> --confidence <threshold> --model <model path> --maskrate 0

Training new ePlacer models

Training new ePlacer models is simple. All that is required is an aligned fasta file for the barcode of interest (containing all available references of interest), a flat taxonomy file, and a reference file for biogeography (currently, ePlacer supports the OBIS csv download).

ePlacer also supports custom references for biogeography, formatted as follows:

#Species	Latitude	Longitude
SpeciesLabelA	39.645946	-71.746641
SpeciesLabelB	39.645946	-71.746641
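
If your occurrence records come from a source other than the OBIS download, a table in this layout can be produced with a short script. The following is a sketch under stated assumptions, not an ePlacer utility: the function name `occurrences_to_reference` is hypothetical, and the default column names (`scientificName`, `decimalLatitude`, `decimalLongitude`) are Darwin Core terms that you should check against your own file.

```python
import csv
import io


def occurrences_to_reference(src, dst, species_col="scientificName",
                             lat_col="decimalLatitude", lon_col="decimalLongitude"):
    """Write species and coordinates from a CSV of occurrences as a tab-separated table.

    Rows with a missing species label or a missing/non-numeric coordinate
    are skipped. Returns the number of data rows written.
    """
    reader = csv.DictReader(src)
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    writer.writerow(["#Species", "Latitude", "Longitude"])
    written = 0
    for row in reader:
        species = (row.get(species_col) or "").strip()
        lat_text = (row.get(lat_col) or "").strip()
        lon_text = (row.get(lon_col) or "").strip()
        try:
            float(lat_text)  # validation only; the original text is written out
            float(lon_text)
        except ValueError:
            continue
        if not species:
            continue
        writer.writerow([species, lat_text, lon_text])
        written += 1
    return written


# In-memory example; the second record has no latitude and is skipped.
source = io.StringIO(
    "scientificName,decimalLatitude,decimalLongitude\n"
    "SpeciesLabelA,39.645946,-71.746641\n"
    "SpeciesLabelB,,-71.746641\n"
)
target = io.StringIO()
rows_written = occurrences_to_reference(source, target)
print(target.getvalue())
```

The coordinates are written exactly as they appear in the input, so no precision is lost in the conversion.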

To run the training, use the following:

eplacer train-model --fasta <alignment file> --taxa <taxonomy file> \
            --out <output directory> --taxlevel SPECIES \
            --geoData <obis data> --augments <several values should be tested here> \
            --maskrate <several values should be tested here> --threads 1
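
Because several values of `--augments` and `--maskrate` should be tried, it can be convenient to generate one training command per combination. The sketch below only builds and prints the commands; it does not run them. The flag names are taken from the command above, while the helper name `build_training_commands`, the file names, the output directory naming, and the candidate values are all hypothetical placeholders. The accepted value types for each flag are not stated here, so values are passed through as given.

```python
import itertools
import shlex


def build_training_commands(fasta, taxa, geo_data, out_root,
                            augments, maskrates, threads=1):
    """Return one `eplacer train-model` argument list per (augments, maskrate) pair."""
    commands = []
    for aug, rate in itertools.product(augments, maskrates):
        # One output directory per combination so results do not overwrite each other.
        out_dir = f"{out_root}/aug{aug}_mask{rate}"
        commands.append([
            "eplacer", "train-model",
            "--fasta", fasta, "--taxa", taxa,
            "--out", out_dir, "--taxlevel", "SPECIES",
            "--geoData", geo_data,
            "--augments", str(aug), "--maskrate", str(rate),
            "--threads", str(threads),
        ])
    return commands


# Placeholder inputs and candidate values; substitute your own.
for cmd in build_training_commands("refs.aln.fasta", "taxonomy.tsv", "obis.csv",
                                   "models", augments=[1, 2], maskrates=[0, 0.1]):
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell or passed to a job scheduler; the argument lists can also be handed to `subprocess.run` directly.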

==============================================================

This repository is a scientific product and is not official communication of the National Oceanic and Atmospheric Administration, or the United States Department of Commerce. All NOAA GitHub project code is provided on an ‘as is’ basis and the user assumes responsibility for its use. Any claims against the Department of Commerce or Department of Commerce bureaus stemming from the use of this GitHub project will be governed by all applicable Federal law. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by the Department of Commerce. The Department of Commerce seal and logo, or the seal and logo of a DOC bureau, shall not be used in any manner to imply endorsement of any commercial product or activity by DOC or the United States Government.
