A Python package to query the different databases of boldsystems.org v5!

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

BOLDigger3

A Python program to query .fasta files against the databases of www.boldsystems.org v5!

Introduction

DNA metabarcoding datasets often comprise hundreds of Operational Taxonomic Units (OTUs) that must be queried against reference databases for taxonomic assignment. The Barcode of Life Data system (BOLD) is a database widely used by biologists for this purpose. However, BOLD's online platform limits users to identifying batches of only 1,000, 200, or 100 sequences at a time, depending on the operating mode.

BOLDigger3, the successor to BOLDigger2 and BOLDigger, aims to overcome these limitations. As a pure Python program, BOLDigger3 offers:

Automated access to BOLD's identification engine
Downloading of BOLD's latest data package release to access all metadata
Selection of the best-fitting hit from the returned results

Overview

BOLDigger3 is an automated tool designed for DNA sequence identification through BOLDSystems v5, supporting integration into bioinformatics pipelines with enhanced functionality and performance. With BOLDigger3, users can identify up to 10,000 sequences per hour using an optimized data storage and queuing system that improves speed and process safety.

Key Differences Between BOLDigger3 and BOLDigger2

Unified Function: BOLDigger3 consolidates all actions into a single function, identify, which automatically performs identification, additional data downloading, and top-hit selection, making it easier to integrate into pipelines.
Enhanced Database Accessibility: Users have access to all databases offered by BOLDSystems v5 and can select from three different operating modes.
Improved Speeds: Depending on the operating mode, BOLDigger3 can identify up to 10,000 sequences per hour, significantly faster than BOLDigger2.
BOLD Credentials Required for Database Download: Users need a BOLD account to download the public database package. Sequence identifications via the ID engine do not require credentials.
Streamlined Data Storage: Data is stored in a DuckDB database for faster processing, with final outputs available in .xlsx and .parquet formats.
Process Safety: BOLDigger3 can resume interrupted executions, continuing exactly where it left off.
Dynamic Queuing: The tool automatically manages request queuing based on the selected operating mode.

Features

Identify Sequences Automatically: Run DNA sequence identifications with a single command.
Flexible Database Options: Access to all BOLDSystems v5 databases with user-selected operating modes.
High-Performance Processing: Up to 10,000 identifications per hour, depending on settings.
Robust Storage: Data stored in DuckDB format for efficient processing; results in .xlsx and .parquet.
User-Friendly: Credentials are only required once to download the public database.

Installation and Usage

BOLDigger3 requires Python version 3.11 or higher and can be easily installed using pip in any command line:

pip install boldigger3

This command will install BOLDigger3 along with all its dependencies.

Step 1: Download the public database

Before running identifications, download the latest BOLD public database package. This step requires a valid BOLD account (register at www.boldsystems.org):

boldigger3 download_db PATH_TO_OUTPUT_DIR

BOLDigger3 will prompt for your BOLD username and password, check whether a local database already exists, and download and convert the latest release to a DuckDB file (.ddb) if needed.

Step 2: Run the identification

To run the identify function, use the following command:

boldigger3 identify PATH_TO_FASTA PATH_TO_DATABASE --db DATABASE_NR --mode OPERATING_MODE

PATH_TO_DATABASE is the path to the .ddb file downloaded in Step 1.
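
For pipeline integration, the command above can also be launched from Python using only the standard library. The following is a minimal sketch, not part of BOLDigger3 itself: the helper names `build_identify_command` and `run_identify` and the example file names are hypothetical, and only the command-line arguments documented in this README are used.

```python
import subprocess


def build_identify_command(fasta, database, db, mode, thresholds=None):
    """Assemble the `boldigger3 identify` call as an argument list."""
    command = [
        "boldigger3", "identify", str(fasta), str(database),
        "--db", str(db), "--mode", str(mode),
    ]
    # Optional custom thresholds are appended in the order they were given.
    if thresholds:
        command += ["--thresholds", *[str(value) for value in thresholds]]
    return command


def run_identify(fasta, database, db, mode, thresholds=None):
    """Run the identification; raises CalledProcessError on a non-zero exit."""
    command = build_identify_command(fasta, database, db, mode, thresholds)
    return subprocess.run(command, check=True)


# Hypothetical file names, shown only to illustrate the resulting command.
print(build_identify_command("otus.fasta", "bold.ddb", db=1, mode=2))
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.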

Databases

The --db argument is a number between 1 and 8 that selects one of the eight databases BOLD v5 currently offers:

1: ANIMAL LIBRARY (PUBLIC)
2: ANIMAL SPECIES-LEVEL LIBRARY (PUBLIC + PRIVATE)
3: ANIMAL LIBRARY (PUBLIC+PRIVATE)
4: VALIDATED CANADIAN ARTHROPOD LIBRARY
5: PLANT LIBRARY (PUBLIC)
6: FUNGI LIBRARY (PUBLIC)
7: ANIMAL SECONDARY MARKERS (PUBLIC)
8: VALIDATED ANIMAL RED LIST LIBRARY

Operating modes

The --mode argument is a number between 1 and 3 that selects one of the three operating modes BOLD v5 currently offers:

1: Rapid Species Search
2: Genus and Species Search
3: Exhaustive Search
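
When the tool is called from a script, it can help to validate the two numbers before starting a long run. The sketch below simply mirrors the two lists above as lookup tables; the names `DATABASES`, `OPERATING_MODES`, and `describe_run` are hypothetical and do not come from the BOLDigger3 code base.

```python
# Lookup tables copied from the lists above.
DATABASES = {
    1: "ANIMAL LIBRARY (PUBLIC)",
    2: "ANIMAL SPECIES-LEVEL LIBRARY (PUBLIC + PRIVATE)",
    3: "ANIMAL LIBRARY (PUBLIC+PRIVATE)",
    4: "VALIDATED CANADIAN ARTHROPOD LIBRARY",
    5: "PLANT LIBRARY (PUBLIC)",
    6: "FUNGI LIBRARY (PUBLIC)",
    7: "ANIMAL SECONDARY MARKERS (PUBLIC)",
    8: "VALIDATED ANIMAL RED LIST LIBRARY",
}

OPERATING_MODES = {
    1: "Rapid Species Search",
    2: "Genus and Species Search",
    3: "Exhaustive Search",
}


def describe_run(db, mode):
    """Return a readable label for a --db / --mode pair, or raise ValueError."""
    if db not in DATABASES:
        raise ValueError(f"--db must be between 1 and 8, got {db!r}")
    if mode not in OPERATING_MODES:
        raise ValueError(f"--mode must be between 1 and 3, got {mode!r}")
    return f"{DATABASES[db]} / {OPERATING_MODES[mode]}"


print(describe_run(5, 3))
```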

To adapt the implemented thresholds to user-specific needs, pass them as an additional, ordered argument. Up to five thresholds can be passed, one per taxonomic level (Species, Genus, Family, Order, Class). Any threshold that is not passed falls back to its default value, and BOLDigger3 reports the thresholds in use:

boldigger3 identify PATH_TO_FASTA PATH_TO_DATABASE --db DATABASE_NR --mode OPERATING_MODE --thresholds 99 97

Output:

19:16:16: Default thresholds changed!
19:16:16: Species: 99, Genus: 97, Family: 90, Order: 85
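
The fallback behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not BOLDigger3's own code: the function name `resolve_thresholds` is hypothetical, and the default values are taken from the "Top hit selection" section of this README.

```python
# Defaults as listed in the "Top hit selection" section, in argument order.
DEFAULT_THRESHOLDS = {
    "Species": 97,
    "Genus": 95,
    "Family": 90,
    "Order": 85,
    "Class": 75,
}


def resolve_thresholds(user_values=()):
    """Overlay ordered user thresholds on the defaults."""
    if len(user_values) > len(DEFAULT_THRESHOLDS):
        raise ValueError("at most five thresholds can be passed")
    resolved = dict(DEFAULT_THRESHOLDS)
    # zip stops at the shorter input, so missing values keep their default.
    for level, value in zip(DEFAULT_THRESHOLDS, user_values):
        resolved[level] = value
    return resolved


print(resolve_thresholds([99, 97]))
```

With `[99, 97]` only the species and genus thresholds change; family, order, and class keep their defaults.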

When a new version is released, you can update BOLDigger3 by typing:

pip install --upgrade boldigger3

How to cite

Buchner D, Leese F (2020) BOLDigger – a Python package to identify and organise sequences with the Barcode of Life Data systems. Metabarcoding and Metagenomics 4: e53535. https://doi.org/10.3897/mbmg.4.53535

The BOLDigger3 Algorithm

Database download (`download_db`)

Login: BOLDigger3 prompts for BOLD credentials and establishes an authenticated session.
Check database status: BOLDigger3 checks whether a local database file already matches the latest BOLD data package release.
Download and compile: If the local database is missing or outdated, BOLDigger3 downloads the latest Parquet release from BOLD and converts it into a DuckDB file (.ddb) for fast lookups.

Identification (`identify`)

Split the FASTA: The input FASTA file is divided into chunks that fit the limits of the selected operating mode of the identification engine.
Queue the Chunks: These chunks are then queued in the identification engine for processing.
Check for Results: The algorithm periodically checks if any results can be downloaded.
Data Download: Once results are available, the data is downloaded.
Data Validation: The algorithm ensures that all data has been correctly downloaded.
Retrieve Additional Data: Additional metadata (collection site, coordinates, collector, etc.) is joined from the local DuckDB database.
Select Top Hit: Finally, the algorithm selects the top hit backed by the most database entries for the final output.
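
The first step of the list above, splitting the FASTA file into chunks, can be sketched with the standard library alone. The names below are hypothetical, and the mapping of operating modes 1, 2, and 3 to batch sizes of 1,000, 200, and 100 is an assumption: the introduction gives these three limits but does not state which mode each belongs to.

```python
from itertools import islice

# ASSUMPTION: the batch limits are matched to the modes in the order given.
BATCH_SIZE_BY_MODE = {1: 1000, 2: 200, 3: 100}


def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def chunk_records(records, size):
    """Yield consecutive lists of at most `size` records."""
    iterator = iter(records)
    while chunk := list(islice(iterator, size)):
        yield chunk


# Usage with 250 synthetic records in the assumed mode 3 (batches of 100).
lines = []
for index in range(250):
    lines += [f">otu_{index}", "ACGT"]
chunks = list(chunk_records(read_fasta(lines), BATCH_SIZE_BY_MODE[3]))
print([len(chunk) for chunk in chunks])
```

Because both functions are generators, the input file never has to be held in memory as a whole.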

Top hit selection

Different thresholds for the taxonomic levels are used to find the best fitting hit. The default thresholds are: 97%: species, 95%: genus, 90%: family, 85%: order, 75%: class, and 50%: phylum. After determining the threshold for all hits, the most common hit above the threshold will be selected. Note that for all hits below the threshold, the taxonomic resolution will be adjusted accordingly (e.g. for a 96% hit the species-level information will be discarded, and genus-level information will be used as the lowest taxonomic level).

Identify Maximum Similarity: Find the maximum similarity value among the top 100 hits currently under consideration.
Set Threshold: Set the threshold to the level that corresponds to this maximum similarity and temporarily remove all hits with a similarity below it. For example, if the highest hit has a similarity of 100%, the threshold is set to 97% (species level), and all hits below 97% are removed temporarily.
Classification and Sorting: Count all individual classifications and sort them by abundance.
Filter Missing Data: Drop all classifications that contain missing data, for instance a most common hit of "Arthropoda --> Insecta" with a similarity of 100% but missing values for Order, Family, Genus, and Species.
Identify Common Hit: Look for the most common hit that has no missing values.
Return Hit: If a hit with no missing values is found, return that hit.
Threshold Adjustment: If no hit without missing values is found, move the threshold to the next higher taxonomic level (for example, from species at 97% to genus at 95%) and repeat the process until a hit is found.
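
The selection loop described above can be sketched in a few lines of Python. This is a simplified reading of the steps, not the code BOLDigger3 ships: the hit structure (a dict with a `similarity` value and one key per taxonomic level), the helper `make_hit`, and the returned dictionary are all hypothetical, and moving to the "next higher level" is interpreted here as moving from species to genus, then family, and so on.

```python
from collections import Counter

LEVELS = ["phylum", "class", "order", "family", "genus", "species"]
# (threshold in percent, lowest taxonomic level kept), strictest first.
THRESHOLDS = [(97, "species"), (95, "genus"), (90, "family"),
              (85, "order"), (75, "class"), (50, "phylum")]


def make_hit(similarity, *taxa):
    """Build a hit dict; levels that are not given are stored as None."""
    padded = list(taxa) + [None] * (len(LEVELS) - len(taxa))
    return {"similarity": similarity, **dict(zip(LEVELS, padded))}


def select_top_hit(hits):
    """Return the most common complete classification, or None."""
    if not hits:
        return None
    max_similarity = max(hit["similarity"] for hit in hits)
    for threshold, level in THRESHOLDS:
        if max_similarity < threshold:
            continue  # the best hit does not reach this level
        kept_levels = LEVELS[: LEVELS.index(level) + 1]
        # Temporarily drop hits below the threshold.
        candidates = [h for h in hits if h["similarity"] >= threshold]
        # Count classifications truncated to the current level.
        counts = Counter(tuple(h[name] for name in kept_levels) for h in candidates)
        # Keep only classifications without missing values, most common first.
        complete = [(tax, n) for tax, n in counts.most_common() if all(tax)]
        if complete:
            taxonomy, supporting = complete[0]
            return {"taxonomy": dict(zip(kept_levels, taxonomy)),
                    "level": level, "threshold": threshold,
                    "supporting_hits": supporting}
    return None


full = ("Arthropoda", "Insecta", "Diptera", "Chironomidae",
        "Chironomus", "Chironomus riparius")
hits = ([make_hit(100, "Arthropoda", "Insecta")] * 3
        + [make_hit(98.5, *full)] * 2)
result = select_top_hit(hits)
print(result["level"], result["supporting_hits"])
```

In the usage example the three 100% hits are the most common classification, but they are dropped because everything below class is missing, so the complete species-level classification is returned instead.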

BOLDigger3 Flagging System

BOLDigger3 employs a flagging system to highlight certain conditions, indicating a degree of uncertainty in the selected hit. Currently, there are five flags implemented, which may be updated as needed:

1. Reverse BIN Taxonomy: This flag is raised if all of the top 100 hits representing the selected match utilize reverse BIN taxonomy. Reverse BIN taxonomy assigns species names to deposited sequences on BOLD that lack species information, potentially introducing uncertainty.
2. Differing Taxonomic Information: If the percentage of hits represented by the selected top hit is smaller than 90%, flag 2 will be raised, indicating a potential taxonomic conflict. If your top hit is represented by 99 hits and there is 1 hit with differing taxonomy, this flag will not be raised.
3. Private Data: If all of the top 100 hits representing the top hit are private hits, this flag is raised, indicating limited accessibility to data.
4. Unique Hit: This flag indicates that the top hit result represents a unique hit among the top 100 hits, potentially requiring further scrutiny.
5. Multiple BINs: If the selected species-level hit is composed of more than one BIN, this flag is raised, suggesting potential complexities in taxonomic assignment.

Given the presence of these flags, it is advisable to conduct a closer examination of all flagged hits to better understand and address any uncertainties in the selected hit.
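
The five conditions can be expressed as simple checks on the hits that support the selected top hit. The sketch below is an illustration only: the function name `compute_flags` and the hit fields `reverse_bin`, `private`, and `bin` are hypothetical, and the flag numbers follow the order in which the flags are listed above (the text itself only names flag 2 explicitly).

```python
def compute_flags(selected_hits, n_considered, species_level):
    """Return the list of raised flag numbers for a selected top hit.

    selected_hits: hits that carry the selected classification
    n_considered:  number of hits the top hit was chosen from
    species_level: True if the selected hit has species-level resolution
    """
    if not selected_hits:
        raise ValueError("at least one supporting hit is required")
    flags = []
    # Flag 1: every supporting hit relies on reverse BIN taxonomy.
    if all(hit["reverse_bin"] for hit in selected_hits):
        flags.append(1)
    # Flag 2: the top hit represents less than 90% of the considered hits.
    if len(selected_hits) / n_considered < 0.9:
        flags.append(2)
    # Flag 3: every supporting hit is private.
    if all(hit["private"] for hit in selected_hits):
        flags.append(3)
    # Flag 4: the top hit is backed by a single hit only.
    if len(selected_hits) == 1:
        flags.append(4)
    # Flag 5: a species-level hit spans more than one BIN.
    if species_level and len({hit["bin"] for hit in selected_hits}) > 1:
        flags.append(5)
    return flags


# Hypothetical BIN identifier, used only for illustration.
public_hit = {"reverse_bin": False, "private": False, "bin": "BOLD:AAA0001"}
print(compute_flags([public_hit] * 99, n_considered=100, species_level=True))
print(compute_flags([public_hit] * 80, n_considered=100, species_level=True))
```

The first call mirrors the example from the text (99 supporting hits and 1 differing hit), which raises no flag; the second call falls below the 90% share and raises flag 2.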

Release history

This version

3.0.0

May 13, 2026

2.3.0

Apr 16, 2026

2.2.0

Nov 18, 2025

2.1.4

Aug 14, 2025

2.1.3

Aug 13, 2025

2.1.2

Jul 24, 2025

2.1.1

Jul 22, 2025

2.1.0

Jul 22, 2025

2.0.2

Jul 22, 2025

2.0.1

Jul 20, 2025

2.0.0

Jul 18, 2025

1.6.3

Jul 11, 2025

1.6.2

Jun 20, 2025

1.6.1

Jun 20, 2025

1.6.0

May 23, 2025

1.5.3

May 15, 2025

1.5.2

May 15, 2025

1.5.1

Apr 19, 2025

1.5.0

Apr 19, 2025

1.4.7

Apr 8, 2025

1.4.6

Apr 8, 2025

1.4.5

Mar 19, 2025

1.4.4

Mar 19, 2025

1.4.3

Mar 19, 2025

1.4.2

Mar 7, 2025

1.4.1

Mar 6, 2025

1.4.0

Jan 27, 2025

1.3.1

Jan 24, 2025

1.3.0

Jan 23, 2025

1.2.8

Jan 23, 2025

1.2.7

Jan 23, 2025

1.2.6

Jan 19, 2025

1.2.5

Nov 21, 2024

1.2.4

Nov 21, 2024

1.2.3

Nov 21, 2024

1.2.2

Nov 14, 2024

1.2.1

Nov 10, 2024

1.2.0

Nov 9, 2024

1.1.14

Nov 8, 2024

1.1.13

Nov 8, 2024

1.1.12

Nov 6, 2024

1.1.11

Nov 5, 2024

1.1.10

Nov 5, 2024

1.1.9

Nov 4, 2024

1.1.8

Nov 3, 2024

1.1.7

Nov 1, 2024

1.1.6

Nov 1, 2024

1.1.5

Nov 1, 2024

1.1.4

Oct 31, 2024

1.1.3

Oct 31, 2024

1.1.2

Oct 30, 2024

1.1.1

Oct 29, 2024

1.1.0

Oct 29, 2024

1.0.2

Oct 28, 2024

1.0.1

Oct 28, 2024

1.0.0

Oct 28, 2024

Download files

Source Distribution

boldigger3-3.0.0.tar.gz (26.9 kB view details)

Uploaded May 13, 2026 Source

Built Distribution

boldigger3-3.0.0-py3-none-any.whl (23.7 kB view details)

Uploaded May 13, 2026 Python 3

File details

Details for the file boldigger3-3.0.0.tar.gz.

File metadata

Download URL: boldigger3-3.0.0.tar.gz
Upload date: May 13, 2026
Size: 26.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.2

File hashes

Hashes for boldigger3-3.0.0.tar.gz
Algorithm	Hash digest
SHA256	`a907443b432253acdd78b14f3d3d27de1f38154fbf7c85aaad61abc250125c3d`
MD5	`10608d56c7764626db681751f1d7e555`
BLAKE2b-256	`7cdeab9b4e4caa7a12820be3131f901a6de9bca84304e744c0c6f150fec15144`

File details

Details for the file boldigger3-3.0.0-py3-none-any.whl.

File metadata

Download URL: boldigger3-3.0.0-py3-none-any.whl
Upload date: May 13, 2026
Size: 23.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.2

File hashes

Hashes for boldigger3-3.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`799be4ddac6d0ecd8789b3f16bf14c7918c36929d11f3485ef1a98bea2ce67e9`
MD5	`6c72342d2e0f2834ca8874608bc6c196`
BLAKE2b-256	`6b2a3f72361658d7c6f2c4fd5e1861492a63e458899b5a54dcf253e45fca4210`
