A CLI tool for dereplicating and filtering viral contigs
votuderep
A Python CLI tool for dereplicating and filtering viral contigs (vOTUs - viral Operational Taxonomic Units) using the CheckV method.
Features
- Dereplicate vOTUs: Remove redundant viral sequences using BLAST-based ANI clustering
- Filter by CheckV metrics: Filter viral contigs based on quality, completeness, and other metrics
- ...
Requirements
- Python >= 3.10
- BLAST+ toolkit (specifically blastn and makeblastdb)
Installation
From source
# Clone the repository
git clone https://github.com/quadram-institute-bioscience/votuderep.git
cd votuderep
# Install in development mode
pip install -e .
# Or install normally
pip install .
Installing BLAST+
votuderep requires BLAST+ to be installed and available in your PATH:
# Using conda (recommended)
conda install -c bioconda blast
# On Ubuntu/Debian
sudo apt-get install ncbi-blast+
# On macOS
brew install blast
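To confirm that both executables are visible before running votuderep, a small check along these lines can help. This is a minimal sketch using only the Python standard library; the helper name missing_tools is hypothetical and is not part of votuderep.

```python
import shutil

# Executables named in the Requirements section.
REQUIRED_TOOLS = ("blastn", "makeblastdb")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("BLAST+ executables found.")
```

An empty list means both blastn and makeblastdb were found on PATH.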
Usage
votuderep provides two main commands: derep and filter.
Dereplicate vOTUs
Remove redundant sequences using BLAST and ANI clustering:
votuderep derep -i input.fasta -o dereplicated.fasta
Options:
- -i, --input: Input FASTA file [required]
- -o, --output: Output FASTA file [default: dereplicated_vOTUs.fasta]
- -t, --threads: Number of threads for BLAST [default: 2]
- --tmp: Temporary directory [default: $TEMP or /tmp]
- --min-ani: Minimum ANI threshold (0-100) [default: 95]
- --min-tcov: Minimum target coverage (0-100) [default: 85]
- --keep: Keep temporary directory with intermediate files
Examples:
# Basic dereplication
votuderep derep -i viral_contigs.fasta -o dereplicated.fasta
# With custom parameters
votuderep derep -i viral_contigs.fasta -o dereplicated.fasta \
--min-ani 97 --min-tcov 90 -t 8
# Keep intermediate files for inspection
votuderep derep -i viral_contigs.fasta -o dereplicated.fasta \
--keep --tmp ./temp_dir
How it works:
- Creates a BLAST database from input sequences
- Performs all-vs-all BLASTN comparison
- Calculates ANI (Average Nucleotide Identity) and coverage
- Clusters sequences using a greedy centroid-based algorithm
- Outputs the longest sequence from each cluster as its representative
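The clustering step can be illustrated with a short Python sketch. It is an assumption-laden illustration of greedy centroid-based clustering in general, not votuderep's actual code: the function name greedy_cluster, the (query, target, ani, tcov) tuple layout, and the tie-breaking by sequence ID are all hypothetical. ANI and target coverage are assumed to have been computed already from the BLASTN hits.

```python
def greedy_cluster(lengths, hits, min_ani=95.0, min_tcov=85.0):
    """Greedy centroid-based clustering (illustrative sketch).

    lengths: dict mapping sequence ID -> sequence length
    hits:    iterable of (query, target, ani, tcov) tuples (hypothetical layout)
    Returns a dict mapping each centroid ID -> list of member IDs,
    with the centroid itself listed first.
    """
    # For each query, collect the targets that pass both thresholds.
    passing = {}
    for query, target, ani, tcov in hits:
        if query != target and ani >= min_ani and tcov >= min_tcov:
            passing.setdefault(query, set()).add(target)

    clusters = {}
    assigned = set()
    # Visit sequences from longest to shortest, so every centroid is the
    # longest member of its cluster; ties are broken by ID for determinism.
    for seq_id in sorted(lengths, key=lambda s: (-lengths[s], s)):
        if seq_id in assigned:
            continue  # already absorbed by a longer centroid
        assigned.add(seq_id)
        members = [seq_id]
        for target in sorted(passing.get(seq_id, ())):
            if target not in assigned:
                assigned.add(target)
                members.append(target)
        clusters[seq_id] = members
    return clusters


example_lengths = {"a": 1000, "b": 900, "c": 500}
example_hits = [("a", "b", 98.0, 95.0), ("a", "c", 90.0, 99.0)]
example_clusters = greedy_cluster(example_lengths, example_hits)
# "b" joins the cluster of "a"; "c" fails the ANI threshold and stays alone.
```

Because sequences are visited in order of decreasing length, writing out the dictionary keys yields the longest sequence of each cluster.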
Filter by CheckV
Filter viral contigs based on CheckV quality metrics:
votuderep filter input.fasta checkv_output.tsv -o filtered.fasta
Required Arguments:
- FASTA: Input FASTA file with viral contigs
- CHECKV_OUT: TSV output file from CheckV
Options:
Length filters:
- -m, --min-len: Minimum contig length [default: 0]
- --max-len: Maximum contig length, 0 = unlimited [default: 0]
Quality filters:
- --min-quality: Minimum quality level: low, medium, or high [default: low]
- --complete: Only keep complete genomes
- --exclude-undetermined: Exclude contigs where quality is "Not-determined"
Metrics filters:
- -c, --min-completeness: Minimum completeness percentage (0-100)
- --max-contam: Maximum contamination percentage (0-100)
- --no-warnings: Only keep contigs with no warnings
Other filters:
- --provirus: Only select proviruses (provirus == "Yes")
- -o, --output: Output FASTA file [default: STDOUT]
Examples:
# Basic filtering - minimum quality
votuderep filter viral_contigs.fasta checkv_output.tsv -o filtered.fasta
# High-quality sequences only
votuderep filter viral_contigs.fasta checkv_output.tsv \
--min-quality high -o high_quality.fasta
# Complete genomes with minimum length
votuderep filter viral_contigs.fasta checkv_output.tsv \
--complete --min-len 5000 -o complete_genomes.fasta
# Complex filtering
votuderep filter viral_contigs.fasta checkv_output.tsv \
--min-quality medium \
--min-completeness 80 \
--max-contam 5 \
--no-warnings \
--min-len 3000 \
-o high_confidence.fasta
# Output to stdout (for piping)
votuderep filter viral_contigs.fasta checkv_output.tsv > filtered.fasta
Quality Levels:
CheckV assigns quality levels to viral contigs:
- Complete: Complete genomes (highest quality)
- High-quality: High confidence viral sequences
- Medium-quality: Moderate confidence sequences
- Low-quality: Lower confidence but valid sequences
- Not-determined: Quality could not be determined
The --min-quality option filters inclusively:
- low: Includes Low, Medium, High, and Complete (default)
- medium: Includes Medium, High, and Complete
- high: Includes High and Complete only
Note: "Not-determined" sequences are included by default unless --exclude-undetermined is used.
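The inclusive behaviour of --min-quality can be expressed as a rank comparison. The sketch below is illustrative only and is not taken from votuderep's source: the function name passes_quality and the two lookup tables are hypothetical, and it assumes the quality labels appear exactly as listed above.

```python
# Rank of each CheckV quality label listed above (higher is better).
QUALITY_RANK = {
    "Low-quality": 1,
    "Medium-quality": 2,
    "High-quality": 3,
    "Complete": 4,
}

# Lowest rank accepted for each --min-quality value.
MIN_QUALITY_RANK = {"low": 1, "medium": 2, "high": 3}


def passes_quality(checkv_quality, min_quality="low", exclude_undetermined=False):
    """Return True if a contig with this quality label would be kept."""
    if checkv_quality == "Not-determined":
        # Kept by default; dropped only when exclusion is requested.
        return not exclude_undetermined
    rank = QUALITY_RANK.get(checkv_quality)
    if rank is None:
        raise ValueError(f"Unknown quality label: {checkv_quality!r}")
    return rank >= MIN_QUALITY_RANK[min_quality]
```

For example, passes_quality("Medium-quality", "high") is False, while passes_quality("Complete", "high") is True. A "Not-determined" contig passes at any level unless exclude_undetermined=True is given.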
Global Options
- -v, --verbose: Enable verbose logging
- --version: Show version and exit
- --help: Show help message
License
MIT License - See LICENSE file for details
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Authors
Andrea Telatin & QIB Core Bioinformatics
©️ Quadram Institute Bioscience 2025