Fast and accurate tool for calculating Average Nucleotide Identity (ANI) and clustering virus genomes and metagenomic contigs

Vclust

Vclust is an alignment-based tool for fast and accurate calculation of Average Nucleotide Identity (ANI) between complete or metagenome-assembled viral genomes. The tool also performs ANI-based clustering of genomes according to standards recommended by international virus consortia, including the International Committee on Taxonomy of Viruses (ICTV) and Minimum Information about an Uncultivated Virus Genome (MIUViG).

Features

:gem: Accurate ANI calculations

Vclust uses a Lempel-Ziv-based pairwise sequence aligner (LZ-ANI) for ANI calculation. LZ-ANI achieves high sensitivity in detecting matched and mismatched nucleotides, ensuring accurate ANI determination. Its efficiency comes from a simplified indel handling model, which makes LZ-ANI orders of magnitude faster than other alignment-based tools (e.g., BLASTn, MegaBLAST) while maintaining accuracy comparable to the most sensitive BLASTn searches.

:triangular_ruler: Multiple similarity measures

Vclust offers multiple similarity measures between two genome sequences:

  • ANI: The number of identical nucleotides across local alignments divided by the total length of the alignments.
  • Global ANI (gANI): The number of identical nucleotides across local alignments divided by the length of the query/reference genome.
  • Total ANI (tANI): The number of identical nucleotides in the query-to-reference and reference-to-query alignments divided by the summed length of both genomes. tANI is equivalent to VIRIDIC's intergenomic similarity.
  • Coverage (alignment fraction): The proportion of the query/reference sequence aligned with the reference/query sequence.
  • Number of local alignments: The number of local alignments between the two genome sequences.
  • Ratio between genome lengths: The length of the shorter genome divided by the longer one.
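
The definitions above can be made concrete with a small Python sketch. The `LocalAlignment` record and all function names are hypothetical and are not part of Vclust's interface or output format. The sketch also assumes ungapped, non-overlapping local alignments, so that alignment length equals the number of covered genome positions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LocalAlignment:
    """Hypothetical record for one local alignment (not Vclust's format)."""
    length: int      # alignment length in nucleotides (assumed ungapped)
    identical: int   # identical nucleotides within the alignment


def ani(alns: List[LocalAlignment]) -> float:
    """Identical nucleotides divided by the total length of the alignments."""
    total = sum(a.length for a in alns)
    return sum(a.identical for a in alns) / total if total else 0.0


def gani(alns: List[LocalAlignment], genome_len: int) -> float:
    """Identical nucleotides divided by the length of one genome."""
    return sum(a.identical for a in alns) / genome_len


def coverage(alns: List[LocalAlignment], genome_len: int) -> float:
    """Fraction of one genome that is aligned (ungapped assumption)."""
    return sum(a.length for a in alns) / genome_len


def tani(q_to_r: List[LocalAlignment], r_to_q: List[LocalAlignment],
         q_len: int, r_len: int) -> float:
    """Identical nucleotides in both directions over the summed genome length."""
    identical = sum(a.identical for a in q_to_r) + sum(a.identical for a in r_to_q)
    return identical / (q_len + r_len)


def length_ratio(q_len: int, r_len: int) -> float:
    """Length of the shorter genome divided by the longer one."""
    return min(q_len, r_len) / max(q_len, r_len)


# Made-up numbers for illustration only.
q_len, r_len = 1000, 800
q_to_r = [LocalAlignment(400, 380), LocalAlignment(200, 190)]
r_to_q = [LocalAlignment(600, 570)]

print("ANI      ", ani(q_to_r))
print("gANI     ", gani(q_to_r, q_len))
print("coverage ", coverage(q_to_r, q_len))
print("tANI     ", tani(q_to_r, r_to_q, q_len, r_len))
print("len ratio", length_ratio(q_len, r_len))
```

Under these assumptions gANI equals ANI multiplied by coverage (0.95 × 0.6 = 0.57 for the made-up numbers), which is why a high ANI over a short aligned fraction still yields a low gANI.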

:star2: Multiple clustering algorithms

Vclust provides six clustering algorithms tailored to various scenarios, including taxonomic classification and dereplication of viral genomes.

  • Single-linkage
  • Complete-linkage
  • UCLUST
  • CD-HIT (Greedy incremental)
  • Greedy set cover (adopted from MMseqs2)
  • Leiden algorithm [optional]
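
To illustrate what threshold-based clustering means for the simplest of these algorithms, here is a minimal single-linkage sketch in Python using a union-find structure. It is an illustration only and not Clusty's implementation; the function name, the `(id_a, id_b, similarity)` tuple format, and the example values are assumptions.

```python
from typing import Dict, Iterable, List, Tuple


def single_linkage(ids: List[str],
                   pairs: Iterable[Tuple[str, str, float]],
                   threshold: float) -> List[List[str]]:
    """Group genomes connected by any chain of pairs with similarity >= threshold."""
    parent: Dict[str, str] = {g: g for g in ids}

    def find(x: str) -> str:
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, sim in pairs:
        if sim >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters

    clusters: Dict[str, List[str]] = {}
    for g in ids:
        clusters.setdefault(find(g), []).append(g)
    # Largest clusters first, ties broken by member names.
    return sorted(clusters.values(), key=lambda c: (-len(c), c))


# Made-up pairwise similarities for four genomes.
genome_ids = ["A", "B", "C", "D"]
similarities = [("A", "B", 0.97), ("B", "C", 0.96), ("A", "C", 0.90), ("C", "D", 0.80)]
print(single_linkage(genome_ids, similarities, threshold=0.95))
```

In this example A and C end up in the same cluster through B even though their own similarity (0.90) is below the threshold. Complete-linkage, by contrast, requires every pair within a cluster to meet the threshold, so it would not place A, B, and C together.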

:fire: Speed and efficiency

Vclust uses three efficient C++ tools (Kmer-db, LZ-ANI, and Clusty) for prefiltering, aligning, calculating ANI, and clustering viral genomes. This combination enables the processing of millions of virus genomes within a few hours on a mid-range workstation.
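
The role of the prefiltering step can be illustrated with a toy Python sketch: only genome pairs that share enough k-mers are passed on to the more expensive alignment. This is not Kmer-db's algorithm, and the function names, the values of `k` and `min_shared`, and the sequences are all assumptions chosen for illustration. The sketch also ignores reverse complements.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def prefilter_pairs(genomes: Dict[str, str], k: int = 5,
                    min_shared: int = 3) -> List[Tuple[str, str]]:
    """Keep only genome pairs that share at least `min_shared` distinct k-mers."""
    kmer_sets = {name: kmers(seq, k) for name, seq in genomes.items()}
    kept = []
    # Toy all-versus-all comparison; it shows the criterion, not an efficient index.
    for a, b in combinations(sorted(kmer_sets), 2):
        if len(kmer_sets[a] & kmer_sets[b]) >= min_shared:
            kept.append((a, b))
    return kept


# Made-up sequences: g1 and g2 differ in one base, g3 is unrelated.
toy_genomes = {
    "g1": "ACGTACGTTAGC",
    "g2": "ACGTACGTTAGG",
    "g3": "TTTTTTTTTTTT",
}
print(prefilter_pairs(toy_genomes))
```

Only the pair (g1, g2) survives the filter here, so the alignment step would skip the two comparisons involving g3.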

:earth_americas: Web service

For datasets containing up to 1000 viral genomes, Vclust is available at http://www.vclust.org.

Quick start

# Install Vclust (requires Python >= 3.7)
pip install vclust

# Prefilter similar genome sequence pairs before conducting pairwise alignments.
vclust prefilter -i example/multifasta.fna -o fltr.txt

# Align similar genome sequence pairs and calculate pairwise ANI measures.
vclust align -i example/multifasta.fna -o ani.tsv --filter fltr.txt

# Cluster genome sequences based on a given ANI measure and minimum threshold.
vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --metric ani --ani 0.95
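
The three steps above can also be driven from a Python script. The following is a minimal sketch that only reuses the commands and options shown in the Quick start; the helper names `build_pipeline` and `run_pipeline` are hypothetical, and `ani.ids.tsv` is assumed to be written by the align step alongside `ani.tsv`.

```python
import shlex
import subprocess
from typing import List


def build_pipeline(fasta: str, ani_threshold: float = 0.95) -> List[List[str]]:
    """Return the prefilter, align, and cluster commands from the Quick start."""
    return [
        ["vclust", "prefilter", "-i", fasta, "-o", "fltr.txt"],
        ["vclust", "align", "-i", fasta, "-o", "ani.tsv", "--filter", "fltr.txt"],
        # ani.ids.tsv is assumed to be produced by the align step.
        ["vclust", "cluster", "-i", "ani.tsv", "-o", "clusters.tsv",
         "--ids", "ani.ids.tsv", "--metric", "ani", "--ani", str(ani_threshold)],
    ]


def run_pipeline(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it in order."""
    for cmd in commands:
        print(" ".join(shlex.quote(part) for part in cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


commands = build_pipeline("example/multifasta.fna")
run_pipeline(commands, dry_run=True)  # set dry_run=False to actually run vclust
```

Passing each command as a list of arguments avoids shell quoting problems with file names, and `check=True` prevents the cluster step from running on stale output if an earlier step fails.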

Documentation

The Vclust documentation is available on the GitHub Wiki and includes the following sections:

  1. Features
  2. Installation
  3. Quick Start
  4. Usage
    1. Input data
    2. Prefilter
    3. Align
    4. Cluster
    5. Deduplicate
  5. Optimizing sensitivity and resource usage
  6. Use cases
    1. Classify viruses into species and genera following ICTV standards
    2. Assign viral contigs into vOTUs following MIUViG standards
    3. Dereplicate viral contigs into representative genomes
    4. Process large dataset of diverse virus genomes (IMG/VR)
    5. Deduplicate (remove duplicate sequences) between and within multiple datasets
    6. Process large dataset of highly redundant virus genomes
    7. Cluster plasmid genomes into pOTUs
    8. Calculate pairwise similarities between all-versus-all genomes
  7. FAQ: Frequently Asked Questions

Citation

Zielezinski A, Gudyś A, Barylski J, Siminski K, Rozwalak P, Dutilh BE, Deorowicz S. Ultrafast and accurate sequence alignment and clustering of viral genomes. bioRxiv [doi:10.1101/2024.06.27.601020].

License

GNU General Public License, version 3

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • vclust-1.3.0-py3-none-manylinux2014_x86_64.whl (5.4 MB), Python 3
      SHA256: 4fcbdd35fcb8592753db5c8e0aa8ce412fae612c1caff04b8036277b79bf3fce
      MD5: 9cd6a02294d961028cee43bf8b37d8ff
      BLAKE2b-256: 5fa2a7c5b9983a4dd0feda2e3bb627e952313c6f69e8618b96479ca0ebad0eec

  • vclust-1.3.0-py3-none-manylinux2014_aarch64.whl (4.8 MB), Python 3
      SHA256: fc1e325ab55662f332df938d476da7c5515da23fbd94e4bfdeba8c2f58baeff3
      MD5: cc573ac574574ef39915cc9f02b916a0
      BLAKE2b-256: 2eea7bcea90f64ceebd93218cff335053effa1da28f46a414a6545b5738c920e

  • vclust-1.3.0-py3-none-macosx_11_0_arm64.whl (3.8 MB), Python 3, macOS 11.0+ ARM64
      SHA256: 2dc47232c7d3e4d8aad2708af4bf0984878a54ef16a377232d80f00222e9d14d
      MD5: 8098b71f8896649b80022c7ae8b2c23e
      BLAKE2b-256: e96eabe7bf8ebf74f92a732e559ca0c6a7f5dd6f787b4b5aed005ed0b6657caa

  • vclust-1.3.0-py3-none-macosx_10_9_x86_64.whl (4.0 MB), Python 3, macOS 10.9+ x86-64
      SHA256: e14a542d8601a5babfc240db8aa67cd49ae4ecda52c756eeadd585944c761737
      MD5: cde9b7c40b1ad9c3a15ce70b9692c4d6
      BLAKE2b-256: 38370bd9e53cc9d2c2a3cc69e4813b7a202c6240df320f4e1cd564707eee3ee0
