Machine learning platform for taxonomic classification

ePlacer

ePlacer is a taxonomic classification tool that uses deep-learning approaches to incorporate both sequence information and biogeographic information into taxonomic assignment of DNA sequences.

Why use ePlacer

The machine learning architecture of ePlacer, developed specifically for metabarcoding data, enables prediction beyond sequence-only classification tools (e.g. sequence alignment with BLAST or naive Bayes classifiers) by directly incorporating additional data into the probabilistic estimate of taxonomy. This novel application of deep learning is useful because metabarcoding data often contain cases where two reference species have 100% sequence overlap but distinct geographic ranges. ePlacer discriminates between these cases and provides additional data for downstream taxonomic curation. As a result, ePlacer improves interoperability between metabarcoding datasets.

Currently, ePlacer offers pre-trained models for two popular metabarcoding regions: the MiFish and the ecoPrimer, or Riaz, marker gene regions. For these two regions, ePlacer offers the following benefits:

  • Interoperability. ePlacer is trained on global datasets, allowing for direct comparison between metabarcoding datasets, regardless of geographic region.
  • Portability. ePlacer has pre-trained models for both the MiFish and Riaz marker gene regions, containerized and available for out-of-the-box use.
  • Interactive Visualization. ePlacer provides an interactive GUI and curation tool that allows
  • Increased Accuracy. The ePlacer model architecture provides increased accuracy, precision, and recall compared to BLAST, naive Bayes, or least common ancestor approaches.
  • Trainability. In addition to the models for the two provided barcodes, this code repository provides tools for training new models.

Other barcode regions can gain the same advantages once a new model is trained for them. If you are interested in training a new model for ePlacer, please do not hesitate to reach out!

Installation

Users can install the current version of ePlacer with conda.

conda install bioconda::eplacer

Using ePlacer for classification

The ePlacer taxonomic assignment tool can be run in two ways: natively (through the ePlacer CLI or API) or with a QIIME2 plugin. This documentation covers native usage. Details on the QIIME2 plugin can be found in the linked git repository.

ePlacer taxonomically classifies ASV sequences using two distinct types of information:

  • Sequence information (inferred from ASVs)
  • Biogeography (inferred from sample metadata and count tables)

Although not strictly required for assignment, BLAST results are also used in an automated curation step: ePlacer checks for "solvable" taxonomic assignments and resolves them more accurately.

Using this information, ePlacer generates a raw confidence of presence across all possible taxonomic labels.

To run classification with ePlacer, four data files are required. Properly formatted examples are shown below:

  • A fasta file of ASVs
>ASV1
CCGTAAACTTAGATAAATTAGTACAACAAATATCGGCCCGGGAACT
>ASV2
CGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACT
>ASV3
CGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACT
  • A geography metadata file
#SampleID	Latitude	Longitude
Sample1	39.645946	-71.746641
Sample2	39.645946	-71.746641
  • A count table
#OTU ID	Sample1	Sample2
ASV1	15	0
ASV2	5	22
ASV3	0	10
  • BLAST output (generated with -outfmt "6 qseqid sseqid pident evalue length qlen slen qstart qend sstart send sseq")
ASV1	SubjectRef_A	100.00	1.45e-45	98	98	98	1	98	1	98	GCCGTAAACTTAGATAAATTAGTACAACAAATATCGGCCCGGGAACTACGAGCGCCAGCTTATAACCCAAAGGACTTGGCGCTGCTTCAGACCCCCCT
ASV2	SubjectRef_B	99.00	2.12e-42	98	98	98	1	98	1	98	GCGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACTACGAGCGCCTGCTTAAAACCCAAAGGTCTTGGCGGTGCTTCAGACCCCCCT
ASV3	SubjectRef_C	100.00	1.45e-45	98	98	98	1	98	1	98	GCGGTAAACTTAGATATATTAGTACAACAAATATCGGCCCGGGAACTACGAGCGCCTGCTTAAAACCCAAAGGTCTTGGCGGTGCTTCAGACCCCCCT
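
Because the four files refer to each other by ASV and sample identifiers, it can help to confirm that they agree before starting a run. The following is a minimal sketch, not part of ePlacer: the function names are hypothetical, and it assumes tab-separated count and metadata files laid out as in the examples above.

```python
import csv
import io


def read_fasta_ids(handle):
    """Return the sequence IDs (first word after '>') in file order."""
    ids = []
    for line in handle:
        if line.startswith(">") and line[1:].strip():
            ids.append(line[1:].split()[0])
    return ids


def read_tsv(handle):
    """Return (header, rows) from a tab-separated file, dropping a leading '#'."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    header = rows[0]
    header[0] = header[0].lstrip("#")
    return header, rows[1:]


def check_inputs(fasta, counts, geo):
    """Return a list of problems; an empty list means the files agree."""
    problems = []
    asv_ids = set(read_fasta_ids(fasta))
    count_header, count_rows = read_tsv(counts)
    _, geo_rows = read_tsv(geo)

    count_asvs = {row[0] for row in count_rows}
    count_samples = set(count_header[1:])
    geo_samples = {row[0] for row in geo_rows}

    for asv in sorted(count_asvs - asv_ids):
        problems.append(f"{asv} is in the count table but not in the fasta file")
    for asv in sorted(asv_ids - count_asvs):
        problems.append(f"{asv} is in the fasta file but not in the count table")
    for sample in sorted(count_samples - geo_samples):
        problems.append(f"{sample} has counts but no latitude/longitude")
    for row in geo_rows:
        lat, lon = float(row[1]), float(row[2])
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            problems.append(f"{row[0]} has out-of-range coordinates")
    return problems


# Toy data modeled on the examples above; Sample2 is deliberately
# missing from the geography metadata to trigger a report.
example_problems = check_inputs(
    io.StringIO(">ASV1\nACGT\n>ASV2\nACGT\n"),
    io.StringIO("#OTU ID\tSample1\tSample2\nASV1\t15\t0\nASV2\t5\t22\n"),
    io.StringIO("#SampleID\tLatitude\tLongitude\nSample1\t39.645946\t-71.746641\n"),
)
print(example_problems)
```

With real data, pass open file handles instead of the in-memory strings used here.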

Acquiring pre-trained models

Pre-trained models can be acquired from Zenodo (doi:10.5281/zenodo.20820029). Currently, models are available only for the 12S-V5 ecoPrimer (Riaz) and MiFish primers, but others will be created and stored in the future. If you develop your own model, please don't hesitate to reach out.

Models for native use are packaged as directory archives and can be obtained as follows:

wget https://zenodo.org/records/20820029/files/mifish.tar.gz
tar -xzf mifish.tar.gz
wget https://zenodo.org/records/20820029/files/riaz.tar.gz
tar -xzf riaz.tar.gz
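
For scripted setups, the same download-and-unpack steps can be sketched in Python using only the standard library. This is an illustrative alternative to the wget and tar commands above, not an ePlacer API: the function names are hypothetical, and because the internal layout of the archives is not documented here, the sketch simply reports which top-level entries were created.

```python
import tarfile
import urllib.request
from pathlib import Path

# Record URL taken from the wget commands above.
ZENODO_FILES = "https://zenodo.org/records/20820029/files"


def download_model(name, dest="."):
    """Download <name>.tar.gz (e.g. "mifish" or "riaz") into dest; return its path."""
    archive = Path(dest) / f"{name}.tar.gz"
    urllib.request.urlretrieve(f"{ZENODO_FILES}/{name}.tar.gz", str(archive))
    return archive


def extract_model(archive, dest="."):
    """Extract a gzip-compressed tar archive; return the top-level names it contains."""
    dest = Path(dest).resolve()
    with tarfile.open(archive, "r:gz") as tar:
        members = tar.getmembers()
        for member in members:
            target = (dest / member.name).resolve()
            # Refuse entries that would be written outside the destination.
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest)
    top_level = set()
    for member in members:
        parts = Path(member.name).parts
        if parts:
            top_level.add(parts[0])
    return sorted(top_level)
```

A typical call would be `extract_model(download_model("mifish"))`, which requires network access.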

Note that we also provide pre-compiled *.qza models for use with QIIME2. These can be found in the same Zenodo repository.

Running classification with pre-trained models

To classify with a downloaded pre-trained model, or with a model you have trained yourself, use the following command:

eplacer run-model --fasta <fasta path> --counts <count matrix> --geoData <geoData path> --confidence <threshold> --model <model path> --maskrate 0
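
When classification is run from a pipeline rather than an interactive shell, the command can be assembled programmatically. The sketch below only builds the argument list from the flags shown in the command above. The helper names, the file names, the model path, and the confidence value of 0.8 are placeholders chosen for illustration, not recommended settings.

```python
import shlex
import subprocess


def run_model_command(fasta, counts, geo_data, model, confidence, maskrate=0):
    """Build the argument list for `eplacer run-model` from the documented flags."""
    return [
        "eplacer", "run-model",
        "--fasta", str(fasta),
        "--counts", str(counts),
        "--geoData", str(geo_data),
        "--confidence", str(confidence),
        "--model", str(model),
        "--maskrate", str(maskrate),
    ]


def run_model(*args, **kwargs):
    """Run the command; requires ePlacer to be installed and on PATH."""
    return subprocess.run(run_model_command(*args, **kwargs), check=True)


# Placeholder paths and threshold, for illustration only.
cmd = run_model_command("asvs.fasta", "counts.tsv", "geo.tsv", "mifish", confidence=0.8)
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.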

Training new ePlacer models

Training new ePlacer models is very simple! All that is required is an aligned fasta file for the barcode of interest (containing all available references of interest), a flat taxonomy file, and a reference file for biogeography (currently, ePlacer supports the OBIS CSV download).

ePlacer also supports custom references for biogeography, formatted as follows:

#Species	Latitude	Longitude
SpeciesLabelA	39.645946	-71.746641
SpeciesLabelB	39.645946	-71.746641
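
If occurrence records are held in another form, a file in this layout can be written with a few lines of Python. This is a minimal sketch under two assumptions: that the columns are tab-separated, as in the example above, and that each occurrence is one row. The function name is hypothetical.

```python
import csv
import io


def write_biogeography_reference(occurrences, handle):
    """Write (species, latitude, longitude) records as a tab-separated reference."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["#Species", "Latitude", "Longitude"])
    for species, lat, lon in occurrences:
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            raise ValueError(f"coordinates out of range for {species}: {lat}, {lon}")
        # Six decimal places, matching the precision of the example rows.
        writer.writerow([species, f"{lat:.6f}", f"{lon:.6f}"])


# Reproduce the example reference shown above in memory.
buffer = io.StringIO()
write_biogeography_reference(
    [
        ("SpeciesLabelA", 39.645946, -71.746641),
        ("SpeciesLabelB", 39.645946, -71.746641),
    ],
    buffer,
)
reference_text = buffer.getvalue()
print(reference_text, end="")
```

To write to disk, pass a handle opened with `open(path, "w", newline="")` instead of the in-memory buffer.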

To run the training, use the following:

eplacer train-model --fasta <alignment file> --taxa <taxonomy file> \
            --out <output directory> --taxlevel SPECIES \
            --geoData <obis data> --augments <several values should be tested here> \
            --maskrate <several values should be tested here> --threads 1
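
Since several values of --augments and --maskrate should be tried, one way to organize the runs is to generate one command per combination, each with its own output directory. The sketch below does only that. The helper name, file names, output directory naming scheme, and the example values are all hypothetical, and the valid ranges and types for both parameters are not documented here.

```python
import itertools
import shlex


def train_model_commands(fasta, taxa, geo_data, out_root, augments, maskrates, threads=1):
    """Yield one `eplacer train-model` argument list per (augments, maskrate) pair."""
    for augment, maskrate in itertools.product(augments, maskrates):
        # Hypothetical naming scheme so that runs do not overwrite each other.
        out_dir = f"{out_root}/aug{augment}_mask{maskrate}"
        yield [
            "eplacer", "train-model",
            "--fasta", str(fasta),
            "--taxa", str(taxa),
            "--out", out_dir,
            "--taxlevel", "SPECIES",
            "--geoData", str(geo_data),
            "--augments", str(augment),
            "--maskrate", str(maskrate),
            "--threads", str(threads),
        ]


# Placeholder inputs and parameter values, for illustration only.
commands = list(
    train_model_commands(
        "refs_aligned.fasta", "taxonomy.tsv", "obis.csv", "models",
        augments=[1, 5], maskrates=[0.0, 0.1],
    )
)
for command in commands:
    print(shlex.join(command))
```

Each printed line can be run in a shell or passed to a job scheduler; the resulting models can then be compared on held-out data.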

==============================================================

This repository is a scientific product and is not official communication of the National Oceanic and Atmospheric Administration, or the United States Department of Commerce. All NOAA GitHub project code is provided on an ‘as is’ basis and the user assumes responsibility for its use. Any claims against the Department of Commerce or Department of Commerce bureaus stemming from the use of this GitHub project will be governed by all applicable Federal law. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by the Department of Commerce. The Department of Commerce seal and logo, or the seal and logo of a DOC bureau, shall not be used in any manner to imply endorsement of any commercial product or activity by DOC or the United States Government.
