dbDNA - A phylogeny- and expert identifier-driven grading system for reliable taxonomic annotation of (meta)barcoding data

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Introduction

Text

Installation

SeqRanker pipeline

Individual dbDNA databases can be created using the SeqRanker pipeline, which can be installed on all common operating systems (Windows, Linux, macOS). SeqRanker requires Python 3.7 or higher and can be installed via pip from any command line:

pip3 install seqranker

To update SeqRanker run:

pip3 install --upgrade seqranker

Alternatively, standalone versions of the SeqRanker pipeline for Windows 11 and macOS (tested on Ventura 13.5) are available under the latest release.

Further Dependencies

Besides the main script, several other programs are required for the database creation. Please follow the installation instructions for your operating system for each software.

mafft

MAFFT is a program for calculating multiple sequence alignments and is required for the phylogenetic approach. More information about the installation of mafft can be found here.

IQ-TREE

IQ-TREE is phylogenomic software that calculates maximum likelihood trees. IQ-TREE is required for the phylogenetic approach. More information about the installation of IQ-TREE can be found here.

mPTP

mPTP is software for species delimitation using multi-rate Poisson Tree Processes. More information about the installation of mPTP can be found here.

BLAST+

BLAST+ is a software to create BLAST databases and perform BLAST searches on custom (local) databases. More information about the installation of BLAST+ can be found here.

APSCALE blast

APSCALE is a software to process (e)DNA metabarcoding datasets. The blastn module is used to perform BLAST searches on custom (local) databases. More information about the installation of APSCALE blast can be found here.
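
Before running the pipeline, it can help to confirm that the external programs are reachable. The following is a minimal sketch, not part of SeqRanker: the function name `find_missing` is hypothetical, and the executable names are taken from the example settings table (your installation may use different names or full paths).

```python
import shutil

# Executable names as they appear in the example settings file.
# Assumption: the tools are on the PATH under these names.
DEFAULT_EXECUTABLES = ["mafft", "iqtree2", "mptp", "makeblastdb"]


def find_missing(executables=DEFAULT_EXECUTABLES):
    """Return the names (or paths) that cannot be resolved to an executable."""
    return [name for name in executables if shutil.which(name) is None]


missing = find_missing()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All external programs were found.")
```

If a program is reported as missing, enter its full path in the settings file instead of the bare name.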

Settings file

The SeqRanker pipeline collects the required information from an Excel file. All specifications must be entered into this file.

Sheet 1 contains the run parameters. Here, only the "Run" column is to be modified.

Task	Run	Comment
source	BOLD	define source
download	yes	download BOLD/NCBI data
extract	yes	extract BOLD/NCBI data
phylogeny	yes	calculate phylogenetic trees
rating	yes	create table and rate records
create database	yes	create blast database

Sheet 2 contains the database information and source files. Here, only the "User input" column is to be modified.

Variable	User input	Comment	Options
project name	Invertebrate_example_database	Name of the database	string
taxa list	/PATH/invertebrates.xlsx	Excel file containing taxa to download	PATH
identifier whitelist	/PATH/identifier_white_list.xlsx	Enter path to identifier whitelist	PATH
location whitelist	/PATH/country_white_list.xlsx	Enter path to location whitelist	PATH
output folder	/PATH/example	Enter path to output directory	PATH
marker	COI-5P	Marker to download	string
rating minimum	5	Keep only sequences with a rating >= X	integer
download overwrite	yes	Overwrite existing files?	yes / no
alignment overwrite	yes	Overwrite existing files?	yes / no
tree overwrite	yes	Overwrite existing files?	yes / no
mafft executable	/PATH/mafft	Either "mafft" or "PATH/TO/mafft"	PATH
iqtree executable	/PATH/iqtree2	Either "iqtree" or "PATH/TO/iqtree"	PATH
mptp executable	/PATH/mptp	Either "mptp" or "PATH/TO/mptp"	PATH
makeblastdb executable	/PATH/makeblastdb	Either "makeblastdb" or "PATH/TO/makeblastdb"	PATH
MIDORI2 fasta		Enter path to MIDORI2 file	PATH
outgroup_fasta	/PATH/outgroup.fasta	Enter path to outgroup sequence	PATH
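
A quick sanity check of the Sheet 2 values can catch typos before a long run. The sketch below is illustrative only: it assumes the values have already been read into a plain dictionary keyed by the "Variable" column, and the function `check_settings` and the choice of which fields are mandatory are assumptions, not SeqRanker behaviour.

```python
# Variables whose value must be "yes" or "no" according to the Options column.
YES_NO_KEYS = ["download overwrite", "alignment overwrite", "tree overwrite"]

# Assumption: these variables must not be left empty.
REQUIRED_KEYS = ["project name", "taxa list", "output folder", "marker"]


def check_settings(settings):
    """Return a list of human-readable problems found in the settings dict."""
    problems = []
    for key in REQUIRED_KEYS:
        if not str(settings.get(key, "")).strip():
            problems.append(f"missing value for '{key}'")
    for key in YES_NO_KEYS:
        if settings.get(key) not in ("yes", "no"):
            problems.append(f"'{key}' must be 'yes' or 'no'")
    try:
        int(settings.get("rating minimum"))
    except (TypeError, ValueError):
        problems.append("'rating minimum' must be a whole number")
    return problems


example = {
    "project name": "Invertebrate_example_database",
    "taxa list": "/PATH/invertebrates.xlsx",
    "output folder": "/PATH/example",
    "marker": "COI-5P",
    "rating minimum": 5,
    "download overwrite": "yes",
    "alignment overwrite": "yes",
    "tree overwrite": "yes",
}
print(check_settings(example))
```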

Run SeqRanker

First, prepare the settings file according to your needs. Then, the SeqRanker pipeline can easily be initiated via the following command(s):

PyPI version

Open a new terminal
Execute: seqranker ./PATH/TO/FOLDER/settings.xlsx

standalone version

Double-click the seqranker_v0.1-macosx-ventura or seqranker_v0.1-W11 executable.
Provide the settings.xlsx file.

Example data

Example data that was used for the creation of a database for European freshwater invertebrates can be found here:

SeqRanker pipeline: a short overview

Overview slides

A more detailed overview of the pipeline can be found in this presentation.

Step 1: Data acquisition

Records for all taxa provided in the taxa list are downloaded (a taxon can be of any taxonomic level). For example, if a genus is provided, all species records for this genus will be fetched.
Sequence records can be obtained from BOLDsystems and MIDORI2 (GenBank).
For each record, all available metadata is downloaded (from BOLDsystems or GenBank, depending on the source).
All records and their respective metadata are stored in a raw sequence table.

Step 2: Species delineation

The sequences of all records of each family in the dataset are combined in a separate .fasta file.
A multiple sequence alignment for each family is calculated, using mafft.
A maximum likelihood tree for each family is calculated, using IQ-TREE (fast option).
Species are delimited for each family, using mPTP.
The species delimitation results are used to evaluate if a species record is mono- or paraphyletic.
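
The last step can be illustrated with a small sketch. It assumes that each record already carries a species name and the ID of the group that the species delimitation placed it in. The function `classify_records`, the tuple layout, and the label strings are hypothetical. The criterion follows the rating table in Step 3: a delimited group counts as monophyletic if it contains only one species, and as a singleton if it contains only one sequence.

```python
from collections import defaultdict


def classify_records(records):
    """Classify records given as (record_id, species, group_id) tuples.

    Returns a dict mapping record_id to one of:
    "monophyletic", "monophyletic (singleton)", "paraphyletic".
    """
    species_per_group = defaultdict(set)
    size_per_group = defaultdict(int)
    for _, species, group in records:
        species_per_group[group].add(species)
        size_per_group[group] += 1

    result = {}
    for record_id, _, group in records:
        if len(species_per_group[group]) > 1:
            # The delimited group mixes several species names.
            result[record_id] = "paraphyletic"
        elif size_per_group[group] == 1:
            result[record_id] = "monophyletic (singleton)"
        else:
            result[record_id] = "monophyletic"
    return result


records = [
    ("r1", "Species A", 1),
    ("r2", "Species A", 1),
    ("r3", "Species B", 2),
    ("r4", "Species C", 3),
    ("r5", "Species D", 3),
]
print(classify_records(records))
```

Note that this simplification does not flag a species whose records are split across several otherwise pure groups.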

Step 3: Rating system

Each individual record is scored, based on the following criteria.
If a criterion is not met, no points are gained.

Category	Points gained	Explanation
monophyletic OR	15	Delimited species group only contains one species
monophyletic (singleton)	5	Delimited species group only contains one species, but only a single sequence
good sequence quality	3	Only the four bases "AGCT" are present
bad sequence quality	-10	More than 2% of the sequence is not "AGCT"
longer than 500 bp	2	The recommended minimum barcode length is >= 500 bp
identifier on whitelist	15	The specimen was identified by an identifier on the white list
main country OR	9	The specimen was collected in the main country
neighbour country OR	6	The specimen was collected in a neighbouring country
continent	3	The specimen was collected on the same continent
distance <= d1 OR	9	The collection site is within distance d1
distance <= d2 OR	6	The collection site is within distance d2
distance <= d3	3	The collection site is within distance d3
image	1	An image is available
province	1	The metadata is available
region	1	The metadata is available
exactsite	1	The metadata is available
lifestage	1	The metadata is available
sex	1	The metadata is available
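
The table above can be expressed as a small scoring function. This is a sketch under stated assumptions, not the SeqRanker implementation: the function `score_record` and its parameters are hypothetical, the country and distance criteria are treated as alternative ways of reaching the same three location tiers, and the length criterion is read as >= 500 bp.

```python
METADATA_FIELDS = ("image", "province", "region", "exactsite", "lifestage", "sex")
PHYLOGENY_POINTS = {"monophyletic": 15, "monophyletic (singleton)": 5}
# Tier 1: main country or <= d1, tier 2: neighbour or <= d2, tier 3: continent or <= d3.
LOCATION_POINTS = {1: 9, 2: 6, 3: 3}


def score_record(phylogeny, sequence, on_whitelist, location_tier, metadata):
    """Return the points for one record according to the rating table."""
    points = PHYLOGENY_POINTS.get(phylogeny, 0)

    seq = sequence.upper()
    if seq:
        ambiguous = sum(1 for base in seq if base not in "AGCT")
        if ambiguous == 0:
            points += 3            # good sequence quality
        elif ambiguous / len(seq) > 0.02:
            points -= 10           # bad sequence quality
    if len(seq) >= 500:
        points += 2                # recommended minimum barcode length

    if on_whitelist:
        points += 15               # identifier on whitelist
    points += LOCATION_POINTS.get(location_tier, 0)
    points += sum(1 for name in METADATA_FIELDS if name in metadata)
    return points


best = score_record("monophyletic", "ACGT" * 150, True, 1, set(METADATA_FIELDS))
worst = score_record("paraphyletic", "N" * 100, False, None, set())
print(best, worst)
```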

Each record can gain between 50 (excellent) and -10 (highly unreliable) points.
All records are categorized according to their points.

Border	Gold	Silver	Bronze	Unreliable
Upper	50	39	24	9
Lower	40	25	10	-10
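
The category borders translate directly into a lookup. A minimal sketch (the function name `grade` is hypothetical):

```python
def grade(points):
    """Map a record's points to its category using the borders above."""
    if points >= 40:
        return "Gold"
    if points >= 25:
        return "Silver"
    if points >= 10:
        return "Bronze"
    return "Unreliable"


for value in (50, 39, 24, 9, -10):
    print(value, grade(value))
```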

Step 4: Database creation

The function makeblastdb is used to create a BLAST+ compatible database.
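
For reference, a nucleotide database can be built from a FASTA file with a call along the following lines. This is a sketch: the exact arguments SeqRanker passes are not documented here, and the helper names and file names are placeholders. Only the standard makeblastdb options `-in`, `-dbtype` and `-out` are used.

```python
import subprocess


def makeblastdb_command(fasta, db_name, executable="makeblastdb"):
    """Build the argument list for creating a nucleotide BLAST database."""
    return [executable, "-in", fasta, "-dbtype", "nucl", "-out", db_name]


def create_database(fasta, db_name, executable="makeblastdb"):
    """Run makeblastdb and raise CalledProcessError if it fails."""
    subprocess.run(makeblastdb_command(fasta, db_name, executable), check=True)


# Placeholder file names, shown without executing anything.
print(" ".join(makeblastdb_command("database.fasta", "example_db")))
```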

Step 5: Local BLASTn

The APSCALE BLASTn tool can be used for the taxonomic assignment of DNA metabarcoding datasets against the newly created database.
APSCALE will automatically filter the hits and include the ratings of the record in the filtering process.
The filtering algorithm works as follows, for each OTU individually:

Obtain the top 20 BLASTn hits for the OTU.
Filter by similarity: all hits with the highest similarity are kept.
Trim hits according to similarity: Species >=98%, Genus >=95%, Family >=90%, Order >= 85%.
Filter remaining hits by rating: A) keep all Gold hits OR B) keep all Silver hits OR C) keep all Bronze hits OR D) keep all unreliable hits.
Trim taxonomy of remaining hits to their most recent common ancestor (MRCA filtering): Phylum, Class, Order, Family, Genus, Species.

All ambiguous taxonomic assignments and metadata are kept in the final table as "traits" for each OTU.
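
The filtering steps can be sketched as follows. This is an illustration of the described logic, not the APSCALE code: the hit layout (a dict with `similarity`, `grade` and `taxonomy`), the function names, and the fallback to Class level below 85% similarity are assumptions.

```python
RANKS = ["Phylum", "Class", "Order", "Family", "Genus", "Species"]
THRESHOLDS = [(98.0, "Species"), (95.0, "Genus"), (90.0, "Family"), (85.0, "Order")]
GRADES = ["Gold", "Silver", "Bronze", "Unreliable"]


def deepest_rank(similarity):
    """Lowest taxonomic rank that is kept at a given similarity."""
    for threshold, rank in THRESHOLDS:
        if similarity >= threshold:
            return rank
    return "Class"  # assumption: the text does not define behaviour below 85%


def assign_otu(hits):
    """Return the consensus taxonomy (rank -> name) for one OTU."""
    if not hits:
        return {}
    # 1) Top 20 hits, 2) keep only hits with the highest similarity.
    top = sorted(hits, key=lambda h: h["similarity"], reverse=True)[:20]
    best = top[0]["similarity"]
    kept = [h for h in top if h["similarity"] == best]
    # 3) Trim ranks according to similarity.
    ranks = RANKS[: RANKS.index(deepest_rank(best)) + 1]
    # 4) Keep the hits of the best rating category that is present.
    for grade in GRADES:
        graded = [h for h in kept if h["grade"] == grade]
        if graded:
            kept = graded
            break
    # 5) MRCA: walk down the ranks until the remaining hits disagree.
    consensus = {}
    for rank in ranks:
        names = {h["taxonomy"].get(rank) for h in kept}
        if len(names) != 1 or None in names:
            break
        consensus[rank] = names.pop()
    return consensus
```

With two equally similar hits of the same rating that name different species of one genus, the result stops at Genus.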

Available databases

European freshwater invertebrates (COI)

All species of all genera classified as European freshwater invertebrates (according to freshwaterecology.info).
A filtered and an unfiltered version are available here.

European freshwater fish and lamprey (12S)

All species of all genera classified as European freshwater fish and lamprey (according to freshwaterecology.info).
A filtered and an unfiltered version are available here.

Benchmark

The SeqRanker database creation is optimized for parallelization.
Increasing the number of available cores will significantly reduce runtimes.
However, even large databases can be curated on average hardware.

Example

All genera of all European freshwater macroinvertebrates, available on freshwaterecology.info.
In total, 500,521 records were downloaded from BOLDsystems.
Executed on a MacBook M1 Pro 2021 (16GB RAM, 8 cores).

Runtime (min)	Step
124	Sequence download
2	Record extraction
20	Alignments
120	ML tree
10	Species delimitation
8	Barcode ranking
6	Database creation

Citation

SeqRanker

Coming soon...

mafft

Katoh, K., Misawa, K., Kuma, K., & Miyata, T. (2002). MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research, 30(14), 3059–3066. https://doi.org/10.1093/nar/gkf436

IQ-TREE

Nguyen, L.-T., Schmidt, H. A., von Haeseler, A., & Minh, B. Q. (2015). IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies. Molecular Biology and Evolution, 32(1), 268–274. https://doi.org/10.1093/molbev/msu300

mPTP

Kapli, P., Lutteropp, S., Zhang, J., Kobert, K., Pavlidis, P., Stamatakis, A., & Flouri, T. (2017). Multi-rate Poisson tree processes for single-locus species delimitation under maximum likelihood and Markov chain Monte Carlo. Bioinformatics, 33(11), 1630–1638. https://doi.org/10.1093/bioinformatics/btx025
