A Python package for structural and taxonomic validation of DNA barcode data.
DNA Barcode Validator
A Python-based toolkit for structural and taxonomic validation of DNA barcode sequences. It helps ensure sequence quality and taxonomic accuracy for submissions to the Barcode of Life Data System (BOLD) and to Naturalis's DNA domain within the BioCloud.
Features
- Structural validation of DNA barcodes:
  - Sequence length requirements
  - Ambiguous base detection
  - Stop codon analysis for protein-coding markers
  - HMM-based alignment for codon phase detection
- Taxonomic validation:
  - Validation against ID service reference databases (BOLD, Galaxy BLAST)
  - Flexible taxonomy mapping (NSR, NCBI or BOLD)
  - Support for multiple taxonomic ranks
- Triaging and filtering:
  - Automatic selection of best valid sequence per specimen or sample group
  - Support for assembly attempt grouping
  - Customizable triage criteria
- Input/Output:
  - Support for FASTA and tabular input formats
  - BOLD Excel spreadsheet integration
  - Detailed validation reports in TSV format
  - Filtered FASTA output for valid sequences
  - Integration with Galaxy workflow platform
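To make the structural checks listed above more concrete, here is a minimal Python sketch of length, ambiguity, and stop codon tests. It is not the package's implementation: the function names, the default thresholds (`min_length=500`, `max_ambiguous=6`), and the use of the NCBI translation table 5 stop codons (TAA and TAG, invertebrate mitochondrial) are assumptions for illustration. The codon phase is passed in by hand here, whereas the package determines it through HMM-based alignment.

```python
# Illustrative sketch only; names and thresholds are assumptions,
# not part of the barcode_validator API.
UNAMBIGUOUS = set("ACGT")
STOP_CODONS = {"TAA", "TAG"}  # assumed: NCBI translation table 5


def count_ambiguous(seq: str) -> int:
    """Count residues other than A, C, G, T (gap characters are ignored)."""
    return sum(1 for base in seq.upper() if base != "-" and base not in UNAMBIGUOUS)


def count_stop_codons(seq: str, phase: int = 0) -> int:
    """Count in-frame stop codons when reading from codon phase 0, 1, or 2."""
    seq = seq.upper().replace("-", "")
    codons = (seq[i:i + 3] for i in range(phase, len(seq) - 2, 3))
    return sum(1 for codon in codons if codon in STOP_CODONS)


def is_structurally_valid(seq: str, min_length: int = 500,
                          max_ambiguous: int = 6, phase: int = 0) -> bool:
    """Apply the three checks in turn; all must pass."""
    ungapped_length = len(seq.replace("-", ""))
    return (
        ungapped_length >= min_length
        and count_ambiguous(seq) <= max_ambiguous
        and count_stop_codons(seq, phase) == 0
    )
```

A sequence that is long enough but contains an in-frame TAA would fail only the last check, which is the kind of detail the TSV report is meant to expose.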
Installation
Using pip
Install the barcode validator from PyPI using pip:
pip install barcode-validator
Note: Additional dependencies (BLAST and HMMER) may need to be installed separately depending on your use case.
Using bioconda
The recommended way to install a complete environment with all dependencies is using bioconda:
conda create -n barcode-validator
conda activate barcode-validator
conda install -c bioconda barcode-validator blast hmmer
This will install the barcode validator along with BLAST and HMMER, which are required for taxonomic and structural validation respectively.
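After installation, it can be useful to confirm that the external tools are reachable. The following sketch uses only the Python standard library. The executable names `blastn` and `hmmalign` are assumptions about which BLAST and HMMER programs are needed; adjust them to your setup.

```python
import shutil

# Assumed executable names; the package may call other BLAST/HMMER programs.
REQUIRED_TOOLS = ("blastn", "hmmalign")


def find_tools(names=REQUIRED_TOOLS):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```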
Usage
Command Line Interface
The barcode validator can be run as a Python module:
python -m barcode_validator [options]
Below are detailed examples for common use cases, particularly for the BGE (Biodiversity Genomics Europe) and ARISE projects.
BGE Use Case: Structural and Taxonomic Validation with Assembly Triage
The BGE use case involves validating sequences from multiple genome skimming assembly attempts per specimen, selecting the best valid sequence per specimen, and performing taxonomic validation using the Galaxy BLAST web service.
Input requirements:
- FASTA file with sequences where IDs are formatted as processID_assemblyAttemptID
- CSV file with assembly metrics (optional but recommended)
- BOLD Excel spreadsheet with 'Lab Sheet' and 'Taxonomy' tabs
- Galaxy API credentials for taxonomic validation
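The ID convention in the input requirements above can be sketched as follows. This is an illustration, not the package's parser: it assumes that the process ID never contains the separator, so the ID is split at the first occurrence, and the example IDs in the usage note are made up.

```python
from collections import defaultdict


def split_sequence_id(seq_id: str, separator: str = "_"):
    """Split 'processID_assemblyAttemptID' at the first separator.

    Assumption: the process ID itself never contains the separator.
    """
    process_id, _, attempt_id = seq_id.partition(separator)
    return process_id, attempt_id


def group_by_process_id(seq_ids, separator: str = "_"):
    """Collect all sequence IDs that share a process ID."""
    groups = defaultdict(list)
    for seq_id in seq_ids:
        process_id, _ = split_sequence_id(seq_id, separator)
        groups[process_id].append(seq_id)
    return dict(groups)
```

For instance, the hypothetical IDs `BGENL001-24_r1_s50` and `BGENL001-24_r2_s100` would both land in the group `BGENL001-24`.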
Example: Two-stage validation
First, perform structural validation with early triage:
# Set up input and output files
INPUT_FASTA=data/sequences.fasta
INPUT_CSV=data/metrics.csv
BOLD_EXCEL=data/bold_spreadsheet.xlsx
STRUCTVAL_FASTA=data/structval_out.fasta
STRUCTVAL_TSV=data/structval_out.tsv
# Run structural validation
python -m barcode_validator \
--input-file $INPUT_FASTA \
--csv-file $INPUT_CSV \
--mode structural \
--marker COI-5P \
--input-resolver format=bold \
--input-resolver file=$BOLD_EXCEL \
--output-fasta $STRUCTVAL_FASTA \
--output-tsv $STRUCTVAL_TSV \
--triage-config group_id_separator=_ \
--triage-config group_by_sample=true \
--log-level INFO 2> structval.log
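Conceptually, triage keeps at most one valid sequence per group. The sketch below shows one way such a selection could work. The ranking rule (fewest ambiguous bases first, then greatest length) and the `Candidate` fields are assumptions for illustration; the package's actual triage criteria are configurable and may differ.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical summary of one validated sequence."""
    seq_id: str
    group_id: str
    valid: bool
    ambiguous: int
    length: int


def select_best(candidates):
    """Return {group_id: best valid candidate}; groups with no valid candidate are omitted."""
    best = {}
    for cand in candidates:
        if not cand.valid:
            continue
        current = best.get(cand.group_id)
        # Lower tuple wins: fewer ambiguous bases, then longer sequence.
        key = (cand.ambiguous, -cand.length)
        if current is None or key < (current.ambiguous, -current.length):
            best[cand.group_id] = cand
    return best
```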
Then, perform taxonomic validation on the triaged results:
# Set Galaxy credentials
export GALAXY_API_KEY=your_galaxy_api_key
export GALAXY_DOMAIN=galaxy.naturalis.nl
# Set output files
TAXVAL_FASTA=data/taxonval_out.fasta
TAXVAL_TSV=data/taxonval_out.tsv
# Run taxonomic validation
python -m barcode_validator \
--input-file $STRUCTVAL_FASTA \
--csv-file $INPUT_CSV \
--mode taxonomic \
--marker COI-5P \
--input-resolver format=bold \
--input-resolver file=$BOLD_EXCEL \
--output-fasta $TAXVAL_FASTA \
--output-tsv $TAXVAL_TSV \
--taxon-validation method=galaxy \
--taxon-validation rank=family \
--taxon-validation min_identity=0.8 \
--taxon-validation max_target_seqs=100 \
--log-level INFO 2> taxonval.log
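The effect of `rank`, `min_identity`, and `max_target_seqs` can be illustrated with a small sketch. It assumes that a sequence passes when the expected family appears among the families of the retained hits; the source does not spell out the comparison the package performs, so treat this rule, the function name, and the hit format as hypothetical.

```python
def family_is_confirmed(expected_family, hits, min_identity=0.8, max_target_seqs=100):
    """Check an expected family against a ranked list of hits.

    hits: iterable of (family, identity) pairs, best hit first,
    with identity expressed as a fraction between 0 and 1.
    Returns (confirmed, observed_families).
    """
    # Keep only the top-ranked hits, then drop those below the identity threshold.
    top_hits = list(hits)[:max_target_seqs]
    retained = [(family, identity) for family, identity in top_hits
                if identity >= min_identity]
    observed = {family for family, _ in retained}
    return expected_family in observed, observed
```

With made-up hits such as `[("Apidae", 0.98), ("Megachilidae", 0.85), ("Halictidae", 0.70)]`, an expected family of Apidae is confirmed, while Halictidae is not because its only hit falls below the 0.8 threshold.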
Example: Combined validation in one step
For more thorough validation where all structurally valid sequences are checked taxonomically before triaging:
python -m barcode_validator \
--input-file $INPUT_FASTA \
--csv-file $INPUT_CSV \
--mode both \
--marker COI-5P \
--input-resolver format=bold \
--input-resolver file=$BOLD_EXCEL \
--output-fasta data/validated_out.fasta \
--output-tsv data/validated_out.tsv \
--taxon-validation method=galaxy \
--taxon-validation rank=family \
--taxon-validation min_identity=0.8 \
--taxon-validation max_target_seqs=100 \
--triage-config group_id_separator=_ \
--triage-config group_by_sample=true \
--log-level INFO 2> validation.log
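When the validator is driven from a Python script rather than a shell, the same command line can be assembled as an argument list. The sketch below only uses the options shown in this README; the helper names `build_command` and `run_validation` are hypothetical and not part of the package.

```python
import subprocess
import sys


def build_command(input_fasta, bold_excel, output_fasta, output_tsv, *,
                  csv_file=None, mode="both", marker="COI-5P",
                  taxon_validation=None, triage_config=None, log_level="INFO"):
    """Assemble the argument list for `python -m barcode_validator`."""
    cmd = [
        sys.executable, "-m", "barcode_validator",
        "--input-file", str(input_fasta),
        "--mode", mode,
        "--marker", marker,
        "--input-resolver", "format=bold",
        "--input-resolver", f"file={bold_excel}",
        "--output-fasta", str(output_fasta),
        "--output-tsv", str(output_tsv),
    ]
    if csv_file is not None:
        cmd += ["--csv-file", str(csv_file)]
    # Repeatable key=value options, one flag per pair.
    for key, value in (taxon_validation or {}).items():
        cmd += ["--taxon-validation", f"{key}={value}"]
    for key, value in (triage_config or {}).items():
        cmd += ["--triage-config", f"{key}={value}"]
    cmd += ["--log-level", log_level]
    return cmd


def run_validation(cmd, log_path):
    """Run the command, sending stderr to a log file; return the exit code."""
    with open(log_path, "w") as log:
        return subprocess.run(cmd, stderr=log, check=False).returncode
```

Passing a list to `subprocess.run` avoids shell quoting problems with file paths that contain spaces.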
ARISE Use Case: Simple Validation Without Assembly Grouping
The ARISE use case involves validating individual sequences (one per specimen) with both structural and taxonomic validation using the BOLD web service. It targets freshly collected specimens sequenced by ONT or Sanger, so it does not need the multiple assembly attempts per specimen that the BGE use case relies on.
Input requirements:
- FASTA file with sequences where the first word of the definition line is the process ID
- BOLD Excel spreadsheet with 'Lab Sheet' and 'Taxonomy' tabs
Example:
python -m barcode_validator \
--input-file input_sequences.fasta \
--mode both \
--marker COI-5P \
--input-resolver format=bold \
--input-resolver file=bold_spreadsheet.xlsx \
--output-fasta validated_sequences.fasta \
--output-tsv validation_report.tsv \
--taxon-validation method=bold \
--taxon-validation rank=family \
--taxon-validation min_identity=0.8 \
--taxon-validation max_target_seqs=100 \
--triage-config group_by_sample=false \
--log-level ERROR 2> validation.log
This produces a FASTA file with valid sequences and a TSV file with detailed validation results.
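A quick way to inspect the two output files is sketched below using only the standard library. It assumes the TSV report starts with a header row. Because the report's column names are not listed here, the code discovers them from the file instead of hard-coding any.

```python
import csv


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting definition lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def read_report(path):
    """Return (column_names, rows) from a tab-separated report with a header row."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        rows = list(reader)
        return reader.fieldnames, rows


if __name__ == "__main__":
    columns, rows = read_report("validation_report.tsv")
    print(f"{count_fasta_records('validated_sequences.fasta')} valid sequences")
    print(f"{len(rows)} report rows with columns: {columns}")
```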
Common Options
- --input-file: Input FASTA file with sequences to validate
- --csv-file: Optional CSV file with additional metrics
- --mode: Validation mode (structural, taxonomic, or both)
- --marker: Marker gene (e.g., COI-5P)
- --input-resolver format=bold: Use BOLD spreadsheet for taxonomy
- --input-resolver file=<path>: Path to BOLD Excel spreadsheet
- --output-fasta: Output FASTA file with valid sequences
- --output-tsv: Output TSV file with validation results
- --taxon-validation method=<bold|galaxy>: Taxonomic validation service
- --taxon-validation rank=<rank>: Taxonomic rank to validate at
- --taxon-validation min_identity=<float>: Minimum identity threshold (0-1)
- --taxon-validation max_target_seqs=<int>: Maximum number of BLAST hits to consider
- --triage-config group_id_separator=<char>: Separator for parsing group IDs
- --triage-config group_by_sample=<true|false>: Enable/disable assembly attempt grouping
- --log-level: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
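Several of these options are repeatable and take `key=value` pairs. For readers unfamiliar with that pattern, here is a minimal `argparse` sketch of how such options can be collected. It is not the package's own parser: the program name `example` and the helper `parse_key_value` are hypothetical, and only a subset of the options is modelled.

```python
import argparse


def parse_key_value(text):
    """Turn 'key=value' into a (key, value) tuple, rejecting malformed input."""
    key, sep, value = text.partition("=")
    if not sep or not key:
        raise argparse.ArgumentTypeError(f"expected key=value, got {text!r}")
    return key, value


def build_parser():
    parser = argparse.ArgumentParser(prog="example")
    parser.add_argument("--mode", choices=["structural", "taxonomic", "both"],
                        default="both")
    # action="append" lets the same flag be given several times.
    parser.add_argument("--taxon-validation", type=parse_key_value,
                        action="append", default=[])
    parser.add_argument("--triage-config", type=parse_key_value,
                        action="append", default=[])
    return parser


args = build_parser().parse_args([
    "--taxon-validation", "method=galaxy",
    "--taxon-validation", "rank=family",
    "--triage-config", "group_id_separator=_",
])
taxon_options = dict(args.taxon_validation)
triage_options = dict(args.triage_config)
```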
Galaxy Integration
The structural validation functionality is also available through the Galaxy platform:
- Galaxy Toolshed: The barcode validator structural validation tool is available in the Galaxy Toolshed, enabling easy installation into any Galaxy instance.
- Web-based interface: Users can upload sequence files, configure validation parameters through the GUI, run validations, and download results.
- Workflow integration: The tool can be incorporated into Galaxy workflows for automated processing pipelines.
To use the tool in Galaxy:
- Install the tool from the Galaxy Toolshed (search for "barcode validator")
- Upload your sequence files to your Galaxy history
- Configure validation parameters through the GUI
- Run the validation
- View results and download validation reports and filtered sequences
For taxonomic validation through Galaxy, the tool can connect to Galaxy's BLAST web service using API credentials.
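Since the Galaxy credentials are supplied through the `GALAXY_API_KEY` and `GALAXY_DOMAIN` environment variables, a wrapper script can check them before starting a long run. The helper below is a hypothetical convenience, not part of the package.

```python
import os


def galaxy_credentials(env=os.environ):
    """Return (domain, api_key), or raise if either variable is unset or empty."""
    required = ("GALAXY_DOMAIN", "GALAXY_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variable(s): " + ", ".join(missing))
    return env["GALAXY_DOMAIN"], env["GALAXY_API_KEY"]
```

Failing early with a clear message is cheaper than discovering a missing key after structural validation has already completed.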
Architecture
For detailed information about the software architecture, including class hierarchies and process flow diagrams, see Architecture Documentation.
Contributing
We welcome contributions! When contributing, please ensure:
- Code follows PEP 8 style guidelines
- All functions and classes include docstrings
- New features include unit tests
- All tests pass before submitting a pull request
Testing
Run the test suite to verify your installation:
# Run all tests
pytest
# Run with coverage report
pytest --cov=barcode_validator
# Run specific test suites
pytest tests/bge/
pytest tests/arise/
The test suite includes comprehensive examples for BGE and ARISE use cases.
License
This project is licensed under the Apache License 2.0 - see the LICENSE.md file for details.
Citation
If you use this software in your research, please cite this repository. An application note describing the software is in preparation for submission to JOSS.
Contact
For questions, issues, or contributions:
- GitHub Issues: https://github.com/naturalis/barcode_validator/issues
- Email: rutger.vos@naturalis.nl
Acknowledgments
This tool was developed to support the Biodiversity Genomics Europe (BGE) and ARISE projects, as well as general DNA barcoding initiatives at Naturalis Biodiversity Center.